
Senior Data Engineer Resume

CA

PROFESSIONAL SUMMARY

  • Data Engineering professional with solid foundational skills and a proven track record of implementation across a variety of data platforms. Self-motivated, with strong personal accountability in both individual and team settings.
  • 8+ years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
  • Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing across the full cycle in both Waterfall and Agile methodologies.
  • Strong experience in writing scripts with the Python, PySpark, and Spark APIs for analyzing data.
  • Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, pyexcel, Boto3, Psycopg, embedPy, NumPy, and Beautiful Soup.
  • Experienced in developing Python code to retrieve and manipulate data from AWS Redshift, Oracle, MongoDB, MS SQL Server (T-SQL), Excel, and flat files.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, and Spark SQL; ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL) and processed it in Azure Databricks.
  • Experience with Google Cloud components, Google container builders, GCP client libraries, and Cloud SDKs.
  • Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and Spark SQL to manipulate Data Frames in Scala.
  • Expertise in Python and Scala; wrote user-defined functions (UDFs) for Hive and Pig in Python (a minimal UDF sketch appears after this list).
  • Experience in developing MapReduce programs using Apache Hadoop to analyze big data as per requirements.
  • Worked on SnowSQL and Snowpipe.
  • Converted Talend Joblets to support Snowflake functionality.
  • Created Snowpipe for continuous data loads.
  • Used COPY INTO to bulk load data.
  • Set up data sharing between two Snowflake accounts.
  • Created internal and external stages and transformed data during load.
  • Redesigned views in Snowflake to increase performance.
  • Unit tested data between Redshift and Snowflake.
  • Developed a data warehouse model in Snowflake for over 100 datasets using WhereScape.
  • Created reports in Looker based on Snowflake connections.
  • Used Hadoop, S3 buckets, and AWS services for Redshift.
  • Validated data from SQL Server to Snowflake to ensure an apples-to-apples match.
  • Consulted on Snowflake data platform solution architecture, design, development, and deployment, focused on bringing a data-driven culture across the enterprise.
  • Built solutions once, for the long term, rather than taking a band-aid approach.
  • Implemented Change Data Capture technology in Talend to load deltas to the data warehouse.
  • Developed stored procedures and views in Snowflake and used them in Talend for loading dimensions and facts.
  • Designed, developed, tested, implemented, and supported data warehousing ETL using Talend.
  • Very good knowledge of RDBMS concepts, with the ability to write complex SQL and PL/SQL.
  • Hands-on with Spark MLlib utilities such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Experience in working with NoSQL databases like HBase and Cassandra.
  • Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto the HDFS.
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Worked with Cloudera and Hortonworks distributions.
  • Expert in developing SSIS/DTS Packages to extract, transform and load (ETL) data into data warehouse/ data marts from heterogeneous sources.
  • Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.
  • Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
  • Experienced in building automated regression scripts in Python for validating ETL processes across multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
  • Proficient in SQL across several dialects, commonly MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
  • Applied normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
  • Good knowledge of Data Marts, OLAP, and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema and Snowflake Schema modeling for fact and dimension tables) using Analysis Services.
  • Experience in designing, developing, scheduling reports/dashboards using Tableau and OBIEE.
  • Strong analytical and problem-solving skills and the ability to follow through with projects from inception to completion.
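
As an illustration of the Python UDFs for Hive mentioned above, the following is a minimal sketch of a streaming script usable through Hive's TRANSFORM clause; the script name, columns, and table are hypothetical examples, not taken from any specific project.

```python
#!/usr/bin/env python
# round_price_udf.py -- minimal streaming "UDF" for Hive's TRANSFORM clause.
# Hive pipes rows to stdin as tab-separated text; transformed rows go to stdout.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue                      # skip malformed rows
    symbol, price = fields[0], fields[1]
    try:
        rounded = round(float(price), 2)
    except ValueError:
        continue                      # skip rows with non-numeric prices
    print("%s\t%.2f" % (symbol, rounded))
```

In HiveQL this would be registered with ADD FILE round_price_udf.py; and invoked as SELECT TRANSFORM(symbol, price) USING 'python round_price_udf.py' AS (symbol, rounded_price) FROM trades; where trades is a hypothetical table.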

PROFESSIONAL EXPERIENCE

Senior Data Engineer

Confidential, CA

Responsibilities:

  • Collaborated with Business Analysts and SMEs across departments to gather business requirements and identify workable items for further development.
  • Partnered with ETL developers to ensure that data was well cleaned and the data warehouse was up to date for reporting purposes using Pig.
  • Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
  • Performed statistical data profiling such as cancel rate, variance, skew, and kurtosis of trades, plus runs of each stock daily, grouped by 1-, 5-, and 15-minute intervals.
  • Used PySpark and Pandas to calculate the moving average and RSI score of stocks and loaded the results into the data warehouse (see the PySpark sketch after this list).
  • Explored Spark to improve performance and optimize existing Hadoop algorithms using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
  • Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
  • Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
  • Developed and validated machine learning models including Ridge and Lasso regression for predicting total amount of trade.
  • Boosted the performance of regression models by applying polynomial transformation and feature selection and used those methods to select stocks.
  • Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
  • Utilized Agile and Scrum methodology for team and project management.
  • Used Git for version control with colleagues.
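
A minimal PySpark sketch of the moving-average and RSI calculation referenced above; the input path and the column names (symbol, trade_ts, close_price) are illustrative assumptions, and the RSI here uses a simple 14-row average rather than any project-specific variant.

```python
# Illustrative sketch only: column names, paths, and the 14-row window are assumptions.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("moving-average-rsi-sketch").getOrCreate()

trades = spark.read.parquet("s3://example-bucket/trades/")  # hypothetical S3 location

order_w = Window.partitionBy("symbol").orderBy("trade_ts")
trail_w = order_w.rowsBetween(-13, 0)   # trailing 14-row window per symbol

# Price change versus the previous row, then a simple 14-row moving average.
enriched = (
    trades
    .withColumn("price_change", F.col("close_price") - F.lag("close_price", 1).over(order_w))
    .withColumn("moving_avg_14", F.avg("close_price").over(trail_w))
)

# RSI over the same trailing window: RSI = 100 - 100 / (1 + avg_gain / avg_loss).
gains = F.when(F.col("price_change") > 0, F.col("price_change")).otherwise(F.lit(0.0))
losses = F.when(F.col("price_change") < 0, -F.col("price_change")).otherwise(F.lit(0.0))
enriched = (
    enriched
    .withColumn("avg_gain", F.avg(gains).over(trail_w))
    .withColumn("avg_loss", F.avg(losses).over(trail_w))
    .withColumn("rsi_14", F.lit(100.0) - F.lit(100.0) / (F.lit(1.0) + F.col("avg_gain") / F.col("avg_loss")))
)

enriched.write.mode("overwrite").parquet("s3://example-bucket/trades_enriched/")
```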

Environment: Spark (PySpark, Spark SQL, Spark MLlib), Python 3.x (scikit-learn, NumPy, Pandas), Tableau 10.1, GitHub, AWS EMR/EC2/S3/Redshift, and Pig.

Senior Data Engineer

Confidential, Costa Mesa, CA

Responsibilities:

  • Hands-on experience using Azure Data Factory (ADF) to perform data ingestion into Azure Data Lake Storage (ADLS).
  • Worked on SnowSQL and Snowpipe.
  • Converted Talend Joblets to support Snowflake functionality.
  • Created Snowpipe for continuous data loads.
  • Used COPY INTO to bulk load data (a minimal sketch appears after this list).
  • Set up data sharing between two Snowflake accounts.
  • Created internal and external stages and transformed data during load.
  • Redesigned views in Snowflake to increase performance.
  • Unit tested data between Redshift and Snowflake.
  • Developed a data warehouse model in Snowflake for over 100 datasets using WhereScape.
  • Created reports in Looker based on Snowflake connections.
  • Experience working with AWS, Azure, and Google data services.
  • Validated Looker reports against the Redshift database.
  • Good working knowledge of ETL tools such as Informatica and SSIS.
  • Created Talend Mappings to populate the data into dimensions and fact tables.
  • Wrote ETL jobs to read from web APIs using REST and HTTP calls and loaded the data into HDFS using Java and Talend.
  • Used Talend big data components such as Hadoop and S3 buckets, and AWS services for Redshift.
  • Validated data from SQL Server to Snowflake to ensure an apples-to-apples match.
  • Consulted on Snowflake data platform solution architecture, design, development, and deployment, focused on bringing a data-driven culture across the enterprise.
  • Built solutions once, for the long term, rather than taking a band-aid approach.
  • Defined virtual warehouse sizing in Snowflake for different types of workloads.
  • Implemented Change Data Capture technology in Talend to load deltas to the data warehouse.
  • Developed stored procedures and views in Snowflake and used them in Talend for loading dimensions and facts.
  • Designed, developed, tested, implemented, and supported data warehousing ETL using Talend.
  • Very good knowledge of RDBMS concepts, with the ability to write complex SQL and PL/SQL.
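
The bulk load and Snowpipe setup referenced above could look roughly like the sketch below, using the snowflake-connector-python package; the account, stage, table, and pipe names are placeholders, and a real external stage would also need a STORAGE_INTEGRATION or CREDENTIALS clause.

```python
# Hedged sketch: all identifiers below are placeholders, not values from the project.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345",       # placeholder account identifier
    user="LOAD_USER",
    password="***",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="STAGING",
)
cur = conn.cursor()

# External stage over an S3 location (a STORAGE_INTEGRATION or CREDENTIALS clause
# would be required in practice for S3 access).
cur.execute("""
    CREATE STAGE IF NOT EXISTS trades_stage
      URL = 's3://example-bucket/trades/'
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# One-off bulk load with COPY INTO.
cur.execute("COPY INTO trades FROM @trades_stage")

# Snowpipe for continuous loading of newly arriving files.
cur.execute("""
    CREATE PIPE IF NOT EXISTS trades_pipe AUTO_INGEST = TRUE AS
      COPY INTO trades FROM @trades_stage
""")

cur.close()
conn.close()
```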

Environment: Python 3.6, Snowflake, Redshift, SQL Server, AWS, Azure, Talend, Jenkins, and SQL.

Data Engineer

Confidential, Bentonville, AR

Responsibilities:

  • Created and executed Hadoop Ecosystem installation and document configuration scripts on Google Cloud Platform.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premise databases to Azure Data Lake Store using Azure Data Factory.
  • Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the PySpark sketch after this list).
  • Good understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.
  • Good understanding of Big Data Hadoop and YARN architecture, along with the various Hadoop daemons such as JobTracker, TaskTracker, NameNode, and DataNode.
  • Worked on data cleaning and reshaping, generating segmented subsets using NumPy and Pandas in Python.
  • Developed the architecture to move the project from Ab Initio to PySpark and Scala Spark.
  • Implemented an enterprise-grade platform (MarkLogic) for ETL from mainframe to NoSQL (Cassandra).
  • Built scalable distributed data solutions using Hadoop.
  • Used Sqoop to load data from HDFS, Hive, MySQL, and many other sources on a daily basis.
  • Created MapReduce programs for transformation, extraction, and aggregation of data in multiple formats such as Avro, Parquet, XML, JSON, CSV, and other compressed file formats.
  • Used Python and Scala programming daily to perform transformations that apply business logic.
  • Wrote Hive queries in Spark SQL for analyzing and processing data.
  • Set up an HBase column-based storage repository for archiving data on a daily basis.
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and the write-back tool, in both directions.
  • Used the enterprise data lake to support various use cases including analytics, storage, and reporting of voluminous, rapidly changing structured and unstructured data.
  • Exported the analyzed data into relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Converted data load pipeline algorithms written in Python and SQL to Scala Spark and PySpark.
  • Mentored and supported other members of the team (both onshore and offshore) to assist in completing tasks and meeting objectives.
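
A rough PySpark sketch of the multi-format extraction and aggregation described above; the paths, column names (customer_id, amount), and the choice of Parquet/JSON/CSV sources are illustrative assumptions (Avro would be read similarly via the spark-avro package).

```python
# Illustrative sketch only: paths and column names are assumptions, not project values.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("multi-format-aggregation-sketch").getOrCreate()

# Source data arrives in several formats; normalize each into a common set of columns.
orders_parquet = spark.read.parquet("/data/raw/orders_parquet/")
orders_json = spark.read.json("/data/raw/orders_json/")
orders_csv = spark.read.option("header", True).csv("/data/raw/orders_csv/")

common_cols = ["customer_id", "amount"]
orders = (
    orders_parquet.select(common_cols)
    .unionByName(orders_json.select(common_cols))
    .unionByName(
        orders_csv.select(common_cols).withColumn("amount", F.col("amount").cast("double"))
    )
)

# Aggregate total amount per customer and write the result to the curated zone.
totals = orders.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
totals.write.mode("overwrite").parquet("/data/curated/customer_totals/")
```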

Environment: Hadoop, Spark, Hive, HBase, Ab Initio, Scala, Python, ETL, NoSQL (Cassandra), Azure Databricks, HDFS, MapReduce, Azure Data Lake Analytics, Spark SQL, T-SQL, U-SQL, Azure SQL, Sqoop, Apache Airflow.

Data Engineer

Confidential

Responsibilities:

  • Worked on the development of data ingestion pipelines using Talend (ETL) and Bash scripting with big data technologies including, but not limited to, Hive, Impala, Spark, and Kafka.
  • Experience in developing scalable & secure data pipelines for large datasets.
  • Gathered requirements for ingestion of new data sources including life cycle, data quality check, transformations, and metadata enrichment.
  • Supported data quality management by implementing proper data quality checks in data pipelines.
  • Delivered data engineering services such as data exploration, ad-hoc ingestion, and subject-matter expertise to data scientists working with big data technologies.
  • Built machine learning models to showcase big data capabilities using PySpark and MLlib.
  • Enhanced the data ingestion framework by creating more robust and secure data pipelines.
  • Implemented data streaming capability using Kafka and Talend for multiple data sources (a streaming sketch follows this list).
  • Conducted statistical analysis to validate data and interpretations using Python and R, presented research findings and status reports, and assisted in collecting user feedback to improve processes and tools.
  • Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
  • Managed the S3 data lake; responsible for maintaining and handling inbound and outbound data requests through the big data platform.
  • Working knowledge of cluster security components like Kerberos, Sentry, SSL/TLS etc.
  • Involved in the development of agile, iterative, and proven data modeling patterns that provide flexibility.
  • Knowledge of implementing JILs to automate jobs in the production cluster.
  • Troubleshot users' analysis bugs (JIRA and IRIS tickets).
  • Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
  • Worked on analyzing and resolving the production job failures in several scenarios.
  • Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.
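
A minimal sketch of what a Kafka-based streaming ingestion might look like with Spark Structured Streaming; the project itself used Kafka with Talend, so this is an adjacent illustration rather than the actual implementation. Broker addresses, the topic name, and output paths are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

```python
# Hedged sketch: brokers, topic, and paths are placeholders, not project values.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-ingestion-sketch").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "trade-events")          # hypothetical topic name
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the payload to string for downstream parsing.
parsed = events.select(
    F.col("key").cast("string").alias("key"),
    F.col("value").cast("string").alias("payload"),
    "timestamp",
)

query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streaming/trade_events/")
    .option("checkpointLocation", "hdfs:///checkpoints/trade_events/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```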

Environment: Spark, Redshift, Python, HDFS, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile Methodology.

Data Analyst

Confidential

Responsibilities

  • Imported Legacy data from SQL Server and Teradata into Amazon S3.
  • Upgraded Informatica, DAC and OBIEE from lower version to updated versions and resolved various issues associated with the upgrade. Modified SQL queries in Informatica mappings post upgrade.
  • Analysis of functional and non-functional categorical data elements for data profiling and mapping from source to target data using SSRS.
  • Involved with data profiling/validation for multiple sources using Aginity Pro, AWS Redshift, MySQL.
  • Worked with data investigation, discovery, and mapping tools to validate the data across various environments.
  • Extensively used ETL to support data extraction, transformation, and loading in a complex EDW using Talend/DataStage.
  • Performed metrics reporting, data mining, and trend analysis in a helpdesk environment using Access.
  • Wrote complex SQL queries for validating data against different kinds of reports generated by Business Objects XI R2.
  • Extensively used MS Access to pull data from various databases and integrate the data.
  • Worked on SAS for data analytics and data quality checks.
  • Worked with the business in gathering requirements from the existing reporting tool.
  • Worked closely with the business to create dashboards based on data from various data sources.
  • Worked on Tableau to migrate Excel and Business Objects reports to Tableau dashboards.
  • Led a team of 3 Tableau developers, assigning and coordinating work to meet quality standards and deadlines.
  • Utilized Tableau Server to publish and share reports with business users.
  • Developed Tableau data visualizations using donut charts, waterfalls, cross maps, scatter plots, geographic maps, pie charts, bar charts, dual and triple axes, scorecards, and dashboards with stacked bars, Gantt charts, and various other charts as the business required.
  • Involved in extensive data validation by writing several complex SQL queries, performing back-end testing, and working through data quality issues (a validation sketch follows this list).
  • Developed regression test scripts for the application, was involved in metrics gathering, analysis, and reporting to the concerned teams, and tested the test programs.
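
A hedged sketch of the kind of cross-database validation query referenced above, comparing a row count and a simple aggregate between a MySQL source (one of the profiled source types) and an AWS Redshift target via pandas; connection details, the orders table, and the order_amount column are placeholders.

```python
# Hedged sketch: connection details, table, and column names are placeholders.
import pandas as pd
import pymysql
import psycopg2

CHECK_SQL = "SELECT COUNT(*) AS row_cnt, SUM(order_amount) AS total_amount FROM orders"

src_conn = pymysql.connect(host="mysql-host", user="report_user",
                           password="***", database="sales")
tgt_conn = psycopg2.connect(host="redshift-host", port=5439, dbname="analytics",
                            user="report_user", password="***")

source = pd.read_sql(CHECK_SQL, src_conn)
target = pd.read_sql(CHECK_SQL, tgt_conn)

# Flag any mismatch between source and target for follow-up.
for metric in ("row_cnt", "total_amount"):
    src_val, tgt_val = source[metric].iloc[0], target[metric].iloc[0]
    if src_val != tgt_val:
        print(f"MISMATCH on {metric}: source={src_val} target={tgt_val}")
    else:
        print(f"OK {metric}: {src_val}")

src_conn.close()
tgt_conn.close()
```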

Environment: Oracle 11G, Teradata SQL Assistant 12.0, AWS Redshift, Tableau, MS-Excel.

TECHNICAL SKILLS

Big Data Tools: HBase 1.2, Hive 2.3, Pig 0.17, HDFS, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0, Spark

Methodologies: JAD, System Development Life Cycle (SDLC), Agile, Waterfall Model.

ETL Tools: Informatica 9.6/9.1 and Tableau.

Data Modeling Tools: Erwin Data Modeler 9.8, ER Studio v17, and Power Designer 16.6.

Databases: Oracle 12c, Teradata R15, MS SQL Server 2016, DB2.

Cloud Platform: AWS, Azure, Google Cloud, CloudStack/OpenStack

Programming Languages: SQL, PL/SQL, Python, UNIX shell Scripting

Operating System: Windows, UNIX
