
Data Engineer Resume


Plano, Texas

SUMMARY

  • Over 5 years of experience as a Big Data Engineer with expertise in designing data-intensive applications using the Hadoop ecosystem, big data analytics, cloud data engineering, data warehouse/data mart, data visualization, reporting, and data quality solutions.
  • In-depth knowledge of Hadoop architecture and its components such as YARN, HDFS, NameNode, DataNode, JobTracker, ApplicationMaster, ResourceManager, TaskTracker, and the MapReduce programming paradigm.
  • Extensive experience developing enterprise-level solutions utilizing Hadoop ecosystem components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, and YARN.
  • Hands-on experience with AWS EC2, S3, RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SQS, Lambda, EMR and other services.
  • Profound experience in performing data ingestion and data processing (transformations, enrichment, and aggregations).
  • Strong knowledge of distributed systems architecture and parallel processing; in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
  • Designed and developed ETL processes in AWS Glue to migrate data from external sources such as S3 and text files into AWS Redshift.
  • Experienced in maintaining Hadoop clusters on AWS EMR.
  • Experienced with Spark in improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and pair RDDs; worked extensively in PySpark and Scala (see the DataFrame/Spark SQL sketch after this list).
  • Handled ingestion of data from different sources into HDFS using Sqoop and Flume and performed transformations using Hive and MapReduce. Managed Sqoop jobs with incremental loads to populate Hive external tables.
  • Good working experience using Apache Hadoop ecosystem components such as MapReduce, HDFS, Hive, Sqoop, Pig, Oozie, Flume, HBase, and Zookeeper.
  • Extensive experience with Spark, performing ETL using Spark Core and Spark SQL and real-time data processing using Spark Streaming.
  • Extensively worked with Kafka as middleware for real-time data pipelines.
  • Wrote UDFs in Java and integrated them with Hive and Pig.
  • Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL server, and PostgreSQL.
  • Created Java apps to handle data in MongoDB and HBase.
  • Experience in designing and creating RDBMS tables, views, user-defined data types, indexes, stored procedures, cursors, triggers, and transactions.
  • Expert in designing ETL data flows, creating mappings/workflows to extract data from SQL Server, and performing data migration and transformation from Oracle, Access, and Excel sheets using SQL Server SSIS.
  • Experienced in dimensional modelling (star schema, snowflake schema), transactional modelling, and slowly changing dimensions (SCD).
  • Built and productionized predictive models on large datasets utilizing advanced statistical modelling, machine learning, and other data mining techniques.
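
The following is a minimal PySpark sketch of the kind of DataFrame API and Spark SQL work described above (for example, swapping a shuffle-heavy join for a broadcast join); the paths, table, and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-sql-sketch").getOrCreate()

# Hypothetical inputs: a large fact table and a small dimension table on HDFS
txns = spark.read.parquet("hdfs:///data/transactions")
customers = spark.read.parquet("hdfs:///data/customers")

# DataFrame API: broadcast the small dimension table to avoid a shuffle-heavy join
enriched = txns.join(F.broadcast(customers), "customer_id")

# Spark SQL over the same data
enriched.createOrReplaceTempView("enriched_txns")
daily = spark.sql("""
    SELECT txn_date, segment, SUM(amount) AS total_amount
    FROM enriched_txns
    GROUP BY txn_date, segment
""")

daily.write.mode("overwrite").parquet("hdfs:///data/daily_totals")
```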

TECHNICAL SKILLS

Big Data Technologies: Hadoop, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, YARN, Apache Spark, Mahout, Spark MLlib

Databases: Oracle, MySQL, SQL Server, MongoDB, Cassandra, DynamoDB, PostgreSQL, Teradata, Cosmos DB

Programming: Python, PySpark, Scala, Java, C, C++, Shell Script, Perl Script, SQL

Cloud Technologies: AWS (Lambda, EC2, EMR, S3, Kinesis, SageMaker)

Frameworks: Django REST Framework, MVC, Hortonworks

Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistant, Postman

Versioning Tools: SVN, Git, GitHub (Version Control)

Network Security: Kerberos

Database Modelling: Dimension Modelling, ER Modelling, Star Schema Modelling, Snowflake Modelling

Workflow/Monitoring: Apache Airflow

Project Management: Agile, Jira, Rally

Visualization/Reporting: Tableau, ggplot2, Matplotlib, SSRS and Power BI

Machine Learning Techniques: Linear & Logistic Regression, Classification and Regression Trees, Random Forest, Association Rules, NLP and Clustering

Machine Learning Tools: Scikit-learn, Pandas, TensorFlow, SparkML, SAS, R, Keras

PROFESSIONAL EXPERIENCE

Confidential, Plano, Texas

Data Engineer

Responsibilities:

  • Analysed and cleansed raw data using HiveQL.
  • Performed data transformations using MapReduce and Hive across different file formats.
  • Analysed user needs, interacted with various systems of record (SORs) to understand their incoming data structures, and ran POCs with the best-suited processing frameworks on the big data platform.
  • Documented the results along with the tools and technologies that could be implemented for each business use case.
  • Provided direction for data engineering and architecture, collaborating on requirements and determining the right tools for each job.
  • Created and maintained analytics data pipelines that generate data and insights to power business decision-making.
  • Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data sets.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift (see the Glue sketch after this list). Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Played a key role in finalizing the tech stack for the project and ran rigorous end-to-end testing to qualify both user needs and technical requirements.
  • Generated data visualizations using tools such as Tableau, Python Matplotlib, Python Seaborn, R.
  • Monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver across all environments.
  • Installed and configured Apache Airflow to work against AWS S3 buckets and created DAGs to run Airflow workflows (see the Airflow sketch after this list).
  • Worked with Google Cloud components, Google Container Builder, GCP client libraries, and the Cloud SDK.
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
  • Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
  • Automated routine AWS tasks such as snapshot creation using Python scripts.
  • Architected legacy data migration projects such as Teradata to AWS Redshift and on-premises to AWS Cloud migrations.
  • Designed, built, and deployed a multitude of applications utilizing much of the AWS stack (including EC2, Route 53, S3, RDS, HSM, DynamoDB, SQS, IAM, and EMR), focusing on high availability, fault tolerance, and auto-scaling.
  • Performed analysis, feature selection, and feature extraction on Kafka streams using Apache Spark machine learning and streaming libraries in Python.
  • Worked on the end-to-end machine learning workflow: implemented Python code for gathering data from Snowflake on AWS, data pre-processing, feature extraction, feature engineering, modelling, model evaluation, and deployment.
  • Used Pandas, Numpy, SciPy, Scikit-learn, NLTK in Python for scientific computing and data analysis.
  • Developed Python code using version control tools such as GitHub and SVN on Vagrant machines.
  • Collaborated with internal application teams to fit our business models onto the existing on-premises platform setup.
  • Created, dropped, and altered tables at runtime without blocking updates and queries, using HBase and Hive.
  • Imported and exported data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.
  • Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark (see the from_json/to_json sketch after this list).
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDD's.
  • Wrote Scala applications that run on an Amazon EMR cluster, fetch data from Amazon S3, and queue it in Amazon SQS (Simple Queue Service).
  • Created an AWS Lambda function and configured it to receive events from an S3 bucket (see the handler sketch after this list).
  • Developed an ETL data pipeline in Spark to load data from the centralized data lake on AWS S3 into Postgres (RDBMS).
  • Used CloudWatch to monitor logs and log metrics generated by applications.
  • Integrated with RESTful APIs to create ServiceNow incidents whenever a process failed within a batch job.
  • Analysed the SQL scripts and designed the solution to implement using PySpark.
  • Developed a capability to implement audit logging at required stages while applying business logic.
  • Re-formatted the end results into the formats requested by the SORs.
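
A hedged sketch of an AWS Glue job of the kind described above (S3 source, Spark SQL transformation, Redshift target); the bucket paths, Glue connection name, and table names are hypothetical.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read campaign files landed in S3 (Parquet in this sketch)
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-campaign-bucket/raw/"]},
    format="parquet",
)

# Aggregate with Spark SQL before loading
df = source.toDF()
df.createOrReplaceTempView("campaigns")
summary = spark.sql(
    "SELECT campaign_id, COUNT(*) AS impressions FROM campaigns GROUP BY campaign_id"
)

# Write the result to Redshift through a pre-defined Glue connection
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=DynamicFrame.fromDF(summary, glueContext, "summary"),
    catalog_connection="redshift-connection",            # hypothetical connection name
    connection_options={"dbtable": "analytics.campaign_summary", "database": "dev"},
    redshift_tmp_dir="s3://example-campaign-bucket/tmp/",
)

job.commit()
```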
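
A minimal Airflow DAG sketch for the S3-driven scheduling mentioned above; the DAG id, bucket, key pattern, and downstream command are hypothetical, and import paths vary by Airflow version (the ones below assume Airflow 2.x with the Amazon provider installed).

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="s3_ingest_example",          # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    # Wait for the day's landing file to appear in the bucket
    wait_for_file = S3KeySensor(
        task_id="wait_for_landing_file",
        bucket_name="example-landing-bucket",            # hypothetical bucket
        bucket_key="incoming/{{ ds }}/data.csv",
        aws_conn_id="aws_default",
        poke_interval=300,
        timeout=60 * 60,
    )

    # Kick off the downstream Spark load once the file is present
    run_load = BashOperator(
        task_id="run_load_job",
        bash_command="spark-submit /opt/jobs/load_to_redshift.py {{ ds }}",
    )

    wait_for_file >> run_load
```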
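
A small sketch of encoding and decoding JSON columns in PySpark with from_json/to_json; the event schema and field names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("json-columns-sketch").getOrCreate()

# Hypothetical schema of a JSON payload stored as a string column
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("ts", LongType()),
    StructField("action", StringType()),
])

raw = spark.createDataFrame(
    [('{"user_id": "u1", "ts": 1, "action": "click"}',)], ["payload"]
)

# Decode: JSON string column -> typed columns
decoded = raw.withColumn("event", F.from_json("payload", event_schema)).select("event.*")

# Encode: typed columns -> JSON string column
encoded = decoded.withColumn("payload", F.to_json(F.struct("user_id", "ts", "action")))
encoded.show(truncate=False)
```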
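
And a bare-bones sketch of an S3-triggered Lambda handler like the one referenced above; the bucket and any downstream processing are assumptions.

```python
import json
import urllib.parse

def lambda_handler(event, context):
    """Handle S3 object-created notifications delivered to the function."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        print(json.dumps({"bucket": bucket, "key": key}))
        # Downstream processing (e.g., queueing the key for a batch job) would go here
    return {"status": "ok"}
```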

Confidential

Data Engineer

Responsibilities:

  • Worked on requirement gathering, analysis, and designing of the systems
  • Actively involved in designing Hadoop ecosystem pipeline
  • Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data
  • Involved in migrating MapReduce jobs into Spark jobs and used Spark SQL and the DataFrames API to load structured data into Spark clusters
  • Worked with the data science team to build statistical models with Spark MLlib and PySpark
  • Performed SQL Joins among Hive tables to get input for Spark batch process
  • Used the Spark API over Hadoop YARN as the execution engine for data analytics with Hive and, after processing and analyzing the data in Spark SQL, submitted it to the BI team for report generation
  • Worked on creating data models for Cassandra from existing Oracle data model
  • Involved in importing data from various sources to the Cassandra cluster using Sqoop
  • Responsible for importing real-time data from source systems into Kafka clusters
  • Implemented Spark RDD transformations to map business analysis logic and applied actions on top of the transformations
  • Involved in designing a multi-data-center Kafka cluster and monitoring it
  • Worked with Spark performance-tuning techniques such as refreshing tables, handling parallelism, and modifying Spark defaults
  • Used Sqoop import functionality to load historical data from RDBMS into HDFS
  • Designed column families in Cassandra, ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra as per the business requirement (see the Cassandra export sketch after this list)
  • Designed workflows and coordinators in Oozie to automate and parallelize Hive jobs on an Apache Hadoop environment from Hortonworks (HDP 2.2)
  • Configured Hive bolts and wrote data to Hive in Hortonworks as part of a POC
  • Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze the logs produced by the Spark cluster
  • Developed Python script for starting a job and ending a job smoothly for a UC4 workflow
  • Developed Oozie workflow for scheduling & orchestrating the ETL process
  • Created Data pipelines as per the business requirements and scheduled it using Oozie Coordinators
  • Worked extensively with Apache NiFi to build flows for the existing Oozie jobs to handle incremental loads, full loads, and semi-structured data, to pull data from REST APIs into Hadoop, and to automate all NiFi flows to run incrementally
  • Created NiFi flows to trigger Spark jobs and used PutEmail processors to get notified of any failures
  • Developed shell scripts to periodically perform incremental import of data from third party API to Amazon AWS
  • Worked extensively with importing metadata into Hive using Scala and migrated existing tables and applications to work on Hive and AWS cloud
  • Developed the batch scripts to fetch the data from AWS S3 storage and do required transformations in Scala using spark framework
  • Developed Spark programs using Scala to compare the performance of Spark with Hive and SparkSQL
  • Developed a Spark Streaming application to consume JSON messages from Kafka and perform transformations (see the streaming sketch after this list)
  • Implemented Spark using Scala and SparkSQL for faster testing and processing of data
  • Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive
  • Involved in developing a MapReduce framework that filters unnecessary records
  • Involved in converting Hive/SQL queries into Spark transformation using Spark RDDs with Scala
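
A hedged PySpark sketch of the RDBMS-to-Cassandra flow described above (JDBC ingest, transformation, Cassandra export); the JDBC URL, keyspace, table, and column names are hypothetical, and the job assumes the JDBC driver and spark-cassandra-connector are on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("rdbms-to-cassandra-sketch")
         .config("spark.cassandra.connection.host", "cassandra-host")   # hypothetical host
         .getOrCreate())

# Ingest from the source RDBMS over JDBC (connection details are illustrative)
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
          .option("dbtable", "SALES.ORDERS")
          .option("user", "etl_user")
          .option("password", "********")
          .load())

# Transform: shape the data to match the Cassandra column family
daily = (orders
         .withColumn("order_date", F.to_date("order_ts"))
         .groupBy("order_date", "region")
         .agg(F.sum("amount").alias("total_amount")))

# Export the transformed data to Cassandra
(daily.write
 .format("org.apache.spark.sql.cassandra")
 .options(table="orders_by_day", keyspace="sales")    # hypothetical keyspace/table
 .mode("append")
 .save())
```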
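
The Kafka streaming work above was done in Scala; as a hedged illustration, here is an equivalent PySpark Structured Streaming sketch that consumes JSON messages from Kafka and applies a transformation. The broker, topic, schema, and output paths are hypothetical, and the spark-sql-kafka package is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Hypothetical schema of the JSON messages on the topic
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("status", StringType()),
])

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
          .option("subscribe", "orders")                       # hypothetical topic
          .load())

# Parse the JSON value and keep only completed orders
parsed = (stream
          .select(F.from_json(F.col("value").cast("string"), schema).alias("msg"))
          .select("msg.*")
          .filter(F.col("status") == "COMPLETED"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/orders_completed")
         .option("checkpointLocation", "hdfs:///checkpoints/orders")
         .start())

query.awaitTermination()
```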

Confidential

Data Analyst

Responsibilities:

  • Worked as a Data Analyst on a data migration project to establish an enterprise data warehouse enabling efficient marketing algorithms, and developed sales-based KPI reports for further consumption by the data science teams for AI-supported marketing
  • Created documentation of the business requirements and solution charters provided by the engineering team and coordinated accordingly with business partners
  • Contributed to developing a work plan for the data warehouse project and resolved design issues
  • Extracted enterprise data from third-party sources such as Equifax and CMHC, and transformed and loaded it as per the business requirements for consumption by the BUs
  • Coordinated with application developers and DBAs to diagnose and resolve SQL query performance problems as well as the migration of enterprise data between different ETL tools
  • Created complex workbooks and dashboards in Tableau by connecting to multiple data sources using data blending
  • Used SQL queries at the custom SQL level to pull the data into Tableau desktop and validated the results in Tableau by running SQL queries
  • Designed and Developed ETL jobs to extract data from Salesforce replica and load it in data mart in Redshift
  • Responsible for gathering requirements from end clients and SMEs to understand system functionalities for extraction and transformation and loading of legacy data in preparation for conversion to an SQL server system platform
  • Documented ETL specifications and source-to-target mapping documents
  • Utilized MSSQL Server views, stored procedures, and functions, for ETL of legacy data
  • Collaborated with other cross functional team members for corporate wide data conversion project related to Business Partners, Contracts, Claims and Accounting
  • Developed documentation of all related data flows and coding enhancements
  • Partnered with team members in forming a dedicated Production support unit to monitor and troubleshoot daytime and overnight systems activity
  • Involved in performing end to end manual testing and validation of the loaded data in stage environment to ensure data integrity is maintained across downstream systems
  • Exported the analyzed data to the relational databases using Sqoop to further visualize and generate reports for the BI team
  • Worked with the Spark ecosystem using Scala and Hive queries on different data formats such as text files and Parquet
  • Worked on migrating HiveQL to Impala to minimize query response time
  • Responsible for migrating the code base to Amazon EMR and evaluated Amazon eco-systems components like Redshift
  • Collected log data from web servers and integrated it into HDFS using Flume
  • Developed Python scripts to clean raw data
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs (see the RDD sketch after this list)
  • Used AWS services like EC2 and S3 for small data sets processing and storage
  • Implemented NiFi flow topologies to perform cleansing operations before moving data into HDFS
  • Worked on different file formats (ORC, Parquet, Avro) and different compression codecs (GZIP, SNAPPY, LZO)
  • Created Kafka applications that monitor consumer lag within Apache Kafka clusters (see the lag-monitoring sketch after this list)
  • Worked on importing and exporting data into HDFS and Hive using Sqoop, and built analytics on Hive tables using HiveContext in Spark jobs
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS
  • Worked in Agile environment using Scrum methodology
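
A small PySpark RDD sketch of the S3 import with transformations and actions mentioned above; the bucket path and log layout are hypothetical, and the s3a connector (hadoop-aws) is assumed to be configured.

```python
from pyspark import SparkContext

sc = SparkContext(appName="s3-rdd-sketch")

# Read raw, tab-separated log lines from S3
lines = sc.textFile("s3a://example-bucket/logs/2020-01-01/*.log")

# Transformations: split, drop malformed rows, key by status code
parsed = (lines.map(lambda line: line.split("\t"))
               .filter(lambda fields: len(fields) >= 3)
               .map(lambda fields: (fields[2], 1)))

# Actions: materialize counts per status code on the driver
for status, count in parsed.reduceByKey(lambda a, b: a + b).collect():
    print(status, count)
```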
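
And a hedged sketch of consumer-lag monitoring with the kafka-python client, comparing each partition's committed offset against the broker's end offset; the broker address and consumer group are hypothetical.

```python
from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "broker1:9092"        # hypothetical broker
GROUP_ID = "reporting-consumers"  # hypothetical consumer group

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP, enable_auto_commit=False)

# Committed offsets for every partition the group has consumed
committed = admin.list_consumer_group_offsets(GROUP_ID)

for tp, meta in committed.items():
    if meta.offset < 0:
        continue  # the group has never committed an offset for this partition
    end_offset = consumer.end_offsets([tp])[tp]   # latest offset on the broker
    lag = end_offset - meta.offset                # messages not yet consumed
    print(f"{tp.topic}[{tp.partition}] lag={lag}")
```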
