Sr. Big Data Engineer Resume
O’Fallon, MO
SUMMARY
- 8+ years of experience in Data Engineering and in Data Pipeline Design, Development and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
- Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification and Testing, in both Waterfall and Agile methodologies.
- Strong experience in writing scripts using Python API, PySpark API and Spark API for analyzing the data.
- Extensively used Python libraries including PySpark, Pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy and Beautiful Soup.
- Experienced in big data analysis and developing data models using Hive, Pig, MapReduce and SQL, with strong data architecture skills for designing data-centric solutions.
- Experience working with data modeling tools like Erwin and ER/Studio.
- Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Experience in developing MapReduce programs using Apache Hadoop for analyzing big data as per requirements.
- Experience in designing star schema and snowflake schema for Data Warehouse and ODS architecture.
- Expertise in the Amazon Web Services (AWS) Cloud Platform, including services such as EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, SES.
- Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
- Adept at multitasking and at working both independently and as part of a team. Very flexible in adapting to changing client needs and deadlines. Strong problem-solving and communication skills.
- Solid experience in and understanding of designing and operationalizing large-scale data and analytics solutions on Snowflake Data Warehouse.
- Developed ETL pipelines into and out of the data warehouse using a combination of Python and SnowSQL.
- Experience in extracting files from MongoDB through Sqoop, placing them in HDFS and processing them.
- Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.
- Implemented a cluster for the NoSQL tool HBase as part of a POC to address HBase limitations.
- Strong knowledge of the architecture and components of Spark, and efficient in working with Spark Core and Spark SQL.
- Good knowledge of Spark Scala's functional style programming techniques like Anonymous Functions (Closures), Higher Order Functions and Pattern Matching.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
- Experience in using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
- Developed custom Kafka producers and consumers for publishing and subscribing to Kafka topics.
- Good working experience with Spark (Spark Streaming, Spark SQL) with Scala and Kafka. Worked on reading multiple data formats on HDFS using Scala.
- Worked on Spark SQL: created DataFrames by loading data from Hive tables, prepared the data and stored it in AWS S3.
- Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
- Experienced in building automated regression scripts in Python for validating ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
- Excellent in performing data transfer activities between SAS and various databases and data file formats like XLS, CSV, etc.
- Expertise in Python and Scala, including user-defined functions (UDFs) for Hive and Pig written in Python.
- Experienced in developing and supporting Oracle, SQL, PL/SQL and T-SQL queries.
- Experience in designing and implementing data structures and in using common business intelligence tools for data analysis.
- Expert in building Enterprise Data Warehouses or Data Warehouse appliances from scratch using both the Kimball and Inmon approaches.
- Expertise in SQL Server Analysis Services (SSAS) and SQL Server Reporting Services (SSRS).
TECHNICAL SKILLS
Hadoop/Big-Data: HDFS, Hive, Sqoop, Flume & ZooKeeper (Cloudera Platform) 2.0, DataFrame & Spark SQL, Impala, Airflow, Stonebranch, NiFi.
Python/Scala: Python, Pandas & Java, PCA, Dimension Reduction, t-SNE, CDF, Regression & Classification, Naive Bayes, KNN
Cloud: AWS/GCP Cloud, S3 Bucket & EC2, Dataproc, BigQuery, RDS, EMR, AWS CI/CD, Redshift, Kinesis API, Pub/Sub API, Looker, Dataflow.
Automation Tool: RPA Automation (UiPath & WinAutomation), Maven, Jenkins, Git & AWS Cloud CI/CD.
Servers: Tomcat 5.0, 6.0, WebLogic, WebSphere 6.1, 7.0.
Database: SQL, MySQL, DB2, SqlDbx, Oracle 9i, 10g
OS: DOS, Windows 98, 2000/NT, UNIX.
Tool: PuTTY, SSH, FileZilla, WinSCP, Manage Now, IT2B, VSS & RCC, Build Forge, Informatica versioning tool, Tivoli, ICD Tool, ServiceNow, Splunk, Jira
PROFESSIONAL EXPERIENCE
Confidential, O’Fallon, MO
Sr. Big Data Engineer
Responsibilities:
- Performed data analysis and developed analytic solutions. Investigated data to discover correlations and trends, and explained them.
- Worked with Data Engineers and Data Architects to define back-end requirements for data products (aggregations, materialized views, tables, visualization).
- Developed frameworks and processes to analyze unstructured information. Assisted in Azure Power BI architecture design.
- Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression and k-means.
- Implemented statistical and deep learning models (Logistic Regression, XGBoost, Random Forest, SVM, RNN, and CNN).
- Designed and developed Oracle PL/SQL and Shell Scripts, Data Import/Export, Data Conversions and Data Cleansing.
- Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics).
- Performed data analysis and statistical analysis, and generated reports, listings and graphs using SAS tools: SAS/GRAPH, SAS/SQL, SAS/CONNECT and SAS/ACCESS.
- Developed Spark applications using Scala and Spark SQL for data extraction, transformation, and aggregation from multiple file formats. Used Kafka and integrated it with Spark Streaming. Developed data analysis tools using SQL and Python code.
- Authored Python (PySpark) scripts for custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling and all cleaning and conforming tasks. Migrated data from on-premises to AWS storage buckets.
- Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer
- Developed Mappings using Transformations like Expression, Filter, Joiner and Lookups for better data massaging and to migrate clean and consistent data.
- Designed and implemented an incremental Sqoop job to read data from DB2 and load it into Hive tables, and connected to Tableau through HiveServer2 for generating interactive reports.
- Used Sqoop to channel data between HDFS and different RDBMS sources.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation from multiple file formats.
- Used Spark Streaming with Python to receive real-time data from Kafka and store the stream data in HDFS and in NoSQL databases such as HBase and Cassandra.
- Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Used Apache NiFi to copy data from local file system to HDP.
- Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP systems, and Conceptual, Logical and Physical data modeling using Erwin.
- Automated data processing with Oozie to load data into the Hadoop Distributed File System.
- Built machine learning models, including SVM, random forest and XGBoost, with Python scikit-learn to score and identify potential new business cases.
- Experience in converting existing AWS infrastructure to serverless architecture (AWS Lambda, Kinesis), deploying via Terraform and AWS CloudFormation templates.
- Worked on Docker container snapshots, attaching to a running container, removing images, managing directory structures and managing containers.
- Experienced in day-to-day DBA activities including schema management, user management (creating users, synonyms, privileges, roles, quotas, tables, indexes, sequences), space management (tablespaces, rollback segments), monitoring (alert log, memory, disk I/O, CPU, database connectivity), scheduling jobs, and UNIX Shell Scripting.
- Expertise in using Docker to run and deploy applications in multiple containers with tools like Docker Swarm and Docker Wave.
- Developed complex Talend ETL jobs to migrate data from flat files to the database. Pulled files from the mainframe into the Talend execution server using multiple FTP components.
- Developed Talend ESB services and deployed them on ESB servers on different instances.
- Architected and designed serverless application CI/CD using the AWS Serverless Application Model (Lambda).
- Developed stored procedures/views in Snowflake and used them in Talend for loading Dimensions and Facts.
Environment: Hadoop, MapReduce, HDFS, Hive, Spring Boot, Cassandra, Swamp, Data Lake, Sqoop, Oozie, SQL, Kafka, Spark, Scala, Java, AWS, GitHub, Talend, Big Data Integration, Solr, Impala.
Confidential, Vernon Hills, IL
Big Data Engineer
Responsibilities:
- Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
- Designed several DAGs (Directed Acyclic Graphs) for automating ETL pipelines
- Performed data extraction, transformation, loading, and integration in data warehouse, operational data stores and master data management.
- Led and organized training sessions and gave presentations to onboard new team members on the processes used in real-time projects.
- Guided and led a group of new team members through the application process.
- Strong understanding of AWS components such as EC2 and S3
- Performed Data Migration to GCP
- Responsible for data services and data movement infrastructures
- Experienced in ETL concepts, building ETL solutions and Data modeling
- Worked on architecting the ETL transformation layers and writing spark jobs to do the processing.
- Aggregated daily sales team updates to send reports to executives and to organize jobs running on Spark clusters
- Loaded application analytics data into the data warehouse at regular intervals
- Designed and built infrastructure for the Google Cloud environment from scratch
- Experienced in fact and dimensional modeling (star schema, snowflake schema), transactional modeling and SCD (slowly changing dimensions)
- Leveraged cloud and GPU computing technologies for automated machine learning and analytics pipelines, such as AWS, GCP.
- Worked with Confluence and Jira
- Designed and implemented a configurable data delivery pipeline, built with Python, for scheduled updates to customer-facing data stores
- Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling
- Compiled data from various sources to perform complex analysis for actionable results
- Experience in working with different join patterns and implemented both Map and Reduce Side Joins.
- Wrote Flume configuration files for importing streaming log data into HBase.
- Imported several transactional logs from web servers with Flume to ingest the data into HDFS.
- Used Flume with a spooling directory to load data from the local file system (LFS) to HDFS.
- Installed and configured Pig and wrote Pig Latin scripts to convert data from text files to Avro format.
- Created partitioned Hive tables and worked on them using HiveQL.
- Loaded data into HBase using bulk load and non-bulk load.
- Worked with the continuous integration tool Jenkins and automated end-of-day JAR builds.
- Worked with Tableau: integrated Hive with Tableau Desktop reports and published them to Tableau Server.
- Developed MapReduce programs in Java for parsing the raw data and populating staging Tables.
- Experience in setting up the whole app stack, and in setting up and debugging Logstash to send Apache logs to AWS Elasticsearch.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Analyzed the SQL scripts and designed the solution to implement using Scala.
- Used Spark SQL to load JSON data, create a SchemaRDD and load it into Hive tables, and handled structured data using Spark SQL.
- Implemented Spark scripts using Scala and Spark SQL to read Hive tables into Spark for faster processing of data.
- Extracted, transformed and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Tested Apache Tez for building high performance batch and interactive data processing applications on Pig and Hive jobs.
- Measured the efficiency of the Hadoop/Hive environment to ensure SLAs were met
- Optimized the TensorFlow model for efficiency
- Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes
- Implemented a Continuous Delivery pipeline with Docker, GitHub and AWS
- Built performant, scalable ETL processes to load, cleanse and validate data
- Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and other Agile methodologies
- Collaborated with team members and stakeholders in the design and development of the data environment
- Prepared associated documentation for specifications, requirements, and testing
Environment: AWS, GCP, BigQuery, GCS Bucket, Google Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, Cloud SQL, MySQL, PostgreSQL, SQL Server, Tableau, Python, Scala, Spark, Hive, Spark SQL, Git.
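The SCD (slowly changing dimension) modeling mentioned in the responsibilities above can be illustrated with a minimal Type 2 sketch. This is only an illustration in plain Python, not the project's implementation: the DimRow layout, the customer_id and city columns, and the apply_scd2 helper are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class DimRow:
    """One version of a dimension record (hypothetical layout)."""
    customer_id: int           # natural (business) key
    city: str                  # attribute whose history is tracked
    valid_from: date
    valid_to: Optional[date]   # None marks the current version


def apply_scd2(rows: List[DimRow], customer_id: int, city: str,
               as_of: date) -> List[DimRow]:
    """Return a new row list with the change applied as SCD Type 2."""
    out: List[DimRow] = []
    needs_new_version = True  # stays True for unseen keys or changed values
    for row in rows:
        if row.customer_id == customer_id and row.valid_to is None:
            if row.city == city:
                needs_new_version = False   # unchanged: keep current row
                out.append(row)
            else:
                # Close the old version instead of overwriting it.
                out.append(replace(row, valid_to=as_of))
        else:
            out.append(row)
    if needs_new_version:
        out.append(DimRow(customer_id, city, as_of, None))
    return out
```

For example, applying a city change to a customer closes the existing row by setting valid_to and appends a new current row, so earlier facts can still be joined to the value that was valid at the time.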
Confidential, Denver, CO
Big Data Engineer
Responsibilities:
- Migrated data from FS to Snowflake within the organization
- Imported Legacy data from SQL Server and Teradata into Amazon S3.
- Created consumption views on top of metrics to reduce the running time for complex queries.
- Exported Data into Snowflake by creating Staging Tables to load Data of different files from Amazon S3.
- Compared data at the leaf level across various databases whenever data transformation or data loading took place, and analyzed data quality after such loads to check for data loss or data corruption.
- As part of the data migration, wrote many SQL scripts to detect data mismatches and worked on loading historical data from Teradata into Snowflake.
- Developed SQL scripts to upload, retrieve, manipulate and handle sensitive data (National Provider Identifier data, i.e. Name, Address, SSN, Phone No.) in Teradata, SQL Server Management Studio and Snowflake databases for the project
- Worked on moving data from FS to S3 using Spark commands
- Built S3 buckets and managed policies for S3 buckets and used S3 bucket and Glacier for storage and backup on AWS
- Created metric tables and end-user views in Snowflake to feed data for Tableau refresh.
- Generated custom SQL to verify the dependencies for the daily, weekly and monthly jobs.
- Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
- Experienced in working with the Spark ecosystem using Spark SQL and Scala queries on different formats like text files and CSV files.
- Closely involved in scheduling daily and monthly jobs with preconditions/postconditions based on the requirements.
- Monitored the daily, weekly and monthly jobs and provided support in case of failures/issues.
- Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Worked on analyzing Hadoop Cluster and different big data analytic tools including Pig, Hive.
- Working experience with data streaming process with Kafka, Apache Spark, Hive.
- Worked with various HDFS file formats like Avro, SequenceFile, NiFi, JSON and various compression formats like Snappy, bzip2.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the data received from Kafka.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
Environment: Snowflake, AWS S3, GitHub, EMR, Nebula, Impala, Jira, Confluence, Shell/Perl Scripting, Python, Kafka, Hive, Scala, Teradata, SQL Server, Apache Spark, Sqoop
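The mismatch checks described in the migration bullets above can be sketched as a set-difference query between a source table and a target table. The sketch below uses Python's built-in sqlite3 module as a stand-in for Teradata, SQL Server or Snowflake; the find_mismatches helper and its table and column arguments are hypothetical and assume trusted identifiers, since they are interpolated directly into the SQL text.

```python
import sqlite3
from typing import List, Sequence


def find_mismatches(conn: sqlite3.Connection, source: str, target: str,
                    key: str, columns: Sequence[str]) -> List:
    """Return keys of source rows that are missing or different in target.

    Identifiers are interpolated into the SQL, so they must come from
    trusted configuration, never from user input.
    """
    cols = ", ".join(columns)
    # EXCEPT keeps the source rows that have no identical row in the target.
    query = (
        f"SELECT {key} FROM ("
        f"SELECT {key}, {cols} FROM {source} "
        f"EXCEPT "
        f"SELECT {key}, {cols} FROM {target}"
        f") ORDER BY {key}"
    )
    return [row[0] for row in conn.execute(query)]


def row_counts_match(conn: sqlite3.Connection, source: str, target: str) -> bool:
    """Cheap first check: compare row counts before comparing contents."""
    src = conn.execute(f"SELECT COUNT(*) FROM {source}").fetchone()[0]
    tgt = conn.execute(f"SELECT COUNT(*) FROM {target}").fetchone()[0]
    return src == tgt
```

Running the count check first is a cheap way to catch dropped rows; the EXCEPT query then pinpoints which keys were lost or corrupted in the load.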
Confidential
Data Engineer
Responsibilities:
- Worked on analyzing Hadoop cluster and different big data analytic tools including Hive and Sqoop.
- Developed data pipelines using Sqoop and MapReduce to ingest current and historical data into the data staging area.
- Responsible for defining data flow in Hadoop ecosystem to different teams.
- Wrote Pig scripts for data cleansing and data transformation as ETL tool before loading in HDFS.
- Worked on importing normalized data from the staging area to HDFS using Sqoop and performed analysis using Hive Query Language (HQL).
- Created managed tables and external tables in Hive and loaded data from HDFS.
- Performed query optimization for HiveQL and denormalized Hive tables to increase the speed of data retrieval.
- Transferred analyzed data from HDFS to the BI team for visualization and to the data science team for predictive modelling.
- Experience in scheduling workflows using Autosys.
- Experience in running Hive queries on Spark execution engine.
- Designed the whole SDLC of the project, including the high-level and detailed design plans.
- Created different SAS reports such as bar charts, tabular reports and cross-tab reports using SAS Web Report Studio, and created pages and portlets in SAS Information Delivery Portal.
- Published the reports in SAS Information Delivery Portal and gave access to different groups of users.
- Improved project quality in terms of performance and the related documentation.
- Performed Impact Analysis of the changes done to the existing mappings and provided the feedback
- Participated in providing project estimates of development effort for both the offshore and on-site teams.
- Coordinated and monitored the project progress to ensure the timely flow and complete delivery of the project
- Demonstrable experience designing and implementing complex applications and distributed systems on public cloud infrastructure (AWS, GCP, Azure, etc.)
- Extracted, transformed and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Developed workflows in Oozie to manage and schedule jobs on the Hadoop cluster to trigger daily, weekly and monthly batch cycles.
- Configured Hadoop tools like Hive, Pig, Zookeeper, Flume, Impala and Sqoop.
- Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
- Queried both Managed and External tables created by Hive using Impala.
- Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database and SQL Data Warehouse environment.
Environment: Linux, MapReduce, Azure, YARN, Spark, Workflows, AWS, S3, EMR, Cloudera, HBase, Sqoop, Oozie, Scala, Kafka, Python, Maven, SAS, SQL, DataStage, Pig, Hive, Oracle.
Confidential
Data Engineer
Responsibilities:
- Collaborated with Business Analysts and SMEs across departments to gather business requirements and identify workable items for further development.
- Partnered with ETL developers, using Pig, to ensure that data is well cleaned and the data warehouse is up to date for reporting purposes.
- Selected and generated data into csv files and stored them into AWS S3 by using AWS EC2 and then structured and stored in AWS Redshift.
- Performed simple statistical data profiling, such as cancel rate, variance, skewness and kurtosis of trades, and runs of each stock per day, grouped by 1-minute, 5-minute and 15-minute intervals.
- Used PySpark and Pandas to calculate the moving average and RSI score of the stocks and loaded the results into the data warehouse.
- Explored Spark to improve the performance and optimization of the existing algorithms in Hadoop using SparkContext, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend and pair RDDs.
- Involved in the integration of the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
- Developed complex SQL statements to extract data, and packaged and encrypted data for delivery to customers.
- Provided business intelligence analysis to decision-makers using an interactive OLAP tool
- Created T-SQL statements (select, insert, update, delete) and stored procedures.
- Defined Data requirements and elements used in XML transactions.
- Created Informatica mappings using various Transformations like Joiner, Aggregate, Expression, Filter and Update Strategy.
- Performed Tableau administration using Tableau admin commands.
- Involved in defining the source to target Data mappings, business rules and Data definitions.
- Worked on SQL Server Integration Services (SSIS) to integrate and analyze data from multiple heterogeneous information sources
- Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
- Developed and validated machine learning models including Ridge and Lasso regression for predicting total amount of trade.
- Boosted the performance of regression models by applying polynomial transformation and feature selection and used those methods to select stocks.
- Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
- Used Git for version control with colleagues.
Environment: Spark, Python, GitHub, AWS, SQL, PL/SQL, T-SQL, XML, Informatica, Tableau, OLAP, SSIS, SSRS, Excel, OLTP.
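The moving average and RSI calculations mentioned in the responsibilities above can be sketched in a few lines. The project used PySpark and Pandas; the version below is a plain-Python illustration only, and it assumes the simple-average RSI variant rather than Wilder's smoothed one, since the resume does not say which was used. The function names and the default period of 14 are hypothetical choices for the example.

```python
from typing import List, Optional, Sequence


def moving_average(prices: Sequence[float], window: int) -> List[Optional[float]]:
    """Simple moving average; None until a full window is available."""
    out: List[Optional[float]] = []
    running = 0.0
    for i, price in enumerate(prices):
        running += price
        if i >= window:
            running -= prices[i - window]   # drop the value leaving the window
        out.append(running / window if i >= window - 1 else None)
    return out


def rsi(prices: Sequence[float], period: int = 14) -> Optional[float]:
    """RSI over the last `period` price changes, using simple averages."""
    if len(prices) < period + 1:
        return None                          # not enough history
    recent = prices[-(period + 1):]
    changes = [b - a for a, b in zip(recent[:-1], recent[1:])]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    if losses == 0:
        return 100.0                         # convention: no losses in window
    rs = gains / losses                      # same divisor, so averages cancel
    return 100.0 - 100.0 / (1.0 + rs)
```

In Pandas the same moving average is typically a rolling mean over the price column; the loop above just makes the windowing explicit.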
