Senior Data Engineer Resume

Chicago, IL

SUMMARY

  • 8+ years of experience in analysis, design, development and implementation as a Data Engineer.
  • Hands-on experience across the Hadoop ecosystem, including extensive experience in Big Data technologies like HDFS, Spark, Apache Cassandra and HBase.
  • Proficient with the Apache Spark ecosystem, including Spark and Spark Streaming, using Java.
  • Developed highly optimized Spark applications to perform various data cleansing, validation, transformation and summarization activities according to the requirement.
  • Worked with data pipelines consisting of Spark, Hive, Sqoop and custom-built Input Adapters to ingest, transform and analyze operational data.
  • Developed Spark jobs and Hive Jobs to summarize and transform data.
  • Experience in working with structured data using HiveQL, join operations, Hive UDFs, partitions, bucketing and internal/external tables.
  • Good knowledge of cloud integration with AWS using Elastic MapReduce (EMR), Simple Storage Service (S3), EC2 and Redshift.
  • Expertise in writing MapReduce jobs in Java to process large structured, semi-structured and unstructured data sets and store them in HDFS.
  • Experience working with Java, UNIX and shell scripting.
  • Experience in Extraction, Transformation and Loading (ETL) of data from multiple sources like Flat files and Databases.
  • Experience with complete Software Development Life Cycle (SDLC) process which includes Requirement Gathering, Analysis, Designing, Developing, Testing, Implementing and Documenting.
  • Hands-on experience with Spark architecture and its integrations such as the Spark SQL, DataFrame and Dataset APIs.
  • Used Spark to improve the performance of existing Hadoop processing, utilizing SparkContext, Spark SQL, DataFrames and RDDs.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL and Java.
  • Hands-on experience accessing Hive tables from Spark, performing transformations and creating DataFrames on Hive tables.
  • Used Spark Structured Streaming to perform necessary transformations.
  • Expertise in converting MapReduce programs into Spark transformations using Spark RDDs.
  • Strong understanding of AWS components such as EC2 and S3.
  • Experience implementing Continuous Delivery pipelines with Maven, Ant, Jenkins and AWS.
  • Exposure to CI/CD tools: Jenkins for continuous integration and Ansible for continuous deployment.
  • Worked with Waterfall and Agile methodologies.

TECHNICAL SKILLS

Big Data Technologies: HDFS, MapReduce, Pig, Hive, Sqoop, Oozie, Spark, Kafka, NiFi, Airflow, Flume, Snowflake

Hadoop Frameworks: Cloudera CDH, Hortonworks HDP, MapR

Languages: C, C++, Scala, Python, Java

AWS Components: IAM, S3, EMR, EC2, Lambda, Route 53, CloudWatch, SNS

Methodologies: Agile, Waterfall

Build Tools: Maven, Gradle, Jenkins

Databases: NoSQL, HBase, Cassandra, MongoDB, PostgreSQL, MySQL

IDE Tools: Eclipse, NetBeans, IntelliJ

Modelling Tools: Rational Rose, StarUML, Visual Paradigm for UML

BI Tools: Tableau

Operating Systems: Windows 7/8/10, Vista, UNIX, Linux, Ubuntu, Mac OS X

PROFESSIONAL EXPERIENCE

Confidential, Chicago, IL

Senior Data Engineer

Environment: Hadoop, Hive, Kafka, Flume, HBase, Java, Hortonworks, AWS, ETL, Oracle 10g/11g/12c, Teradata, Cassandra, HDFS, Data Lake, Spark, MapReduce, PostgreSQL, Ambari, Tableau, NoSQL

Responsibilities:

  • Analyzed large data sets to determine the optimal way to aggregate and report on them.
  • Designed and implemented a Big Data analytics architecture and pipeline.
  • Wrote Spark applications in Java.
  • Built multiple notebooks and chained the jobs together into an ETL pipeline that runs the wear model's predictions and writes the outputs to Azure Cosmos DB.
  • Analyzed SQL scripts and designed solutions to implement them using Spark.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying.
  • Implemented ADB job scheduler for authoring, scheduling and monitoring data pipelines.
  • Developed Java code to gather data from APIs and designed the solution to implement using Spark.
  • Implemented a Kafka model that pulls the latest records into Hive external tables.
  • Performed performance audits of the PostgreSQL RDBMS on a regular basis, including profiling of SQL queries and collection of logs and database metrics.
  • Imported data into ADLS from various APIs, SQL databases, files and streaming systems using ETL.
  • Loaded all data sets into Hive from source CSV files using Spark.
  • Exported the analyzed data to Teradata using Sqoop for visualization and to generate reports for the BI team.
  • Used Spark Streaming to collect data from Kafka in near real time, perform the necessary transformations and aggregations on the fly to build the common learner data model, and persist the data in a NoSQL store (HBase).
  • Migrated the computational code from HQL to PySpark.
  • Completed data extraction, aggregation and analysis in HDFS using PySpark and stored the required data in Hive.
  • Experienced in fact/dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly Changing Dimension).
  • Worked with the data science team to integrate the data engineering pipeline with the Tire Wear model so that it produces correct output predictions, and fixed other issues.
  • Experienced in importing and exporting data to and from HDFS and Hive using Sqoop.
  • Populated HDFS and HBase with huge amounts of data using Apache Kafka.
  • Developed MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the EDW.
  • Deployed Spark jobs to Amazon EMR and ran them on AWS clusters.
  • Strong experience in implementing data warehouse solutions in Amazon Web Services (AWS) Redshift; worked on various projects to migrate data from on-premises databases to AWS Redshift, RDS and S3.
  • Leveraged cloud and GPU computing technologies, such as AWS and GCP, for automated machine learning and analytics pipelines.

Confidential, Stamford, CT

Sr. Data Engineer

Environment: MapReduce, HDFS, Sqoop, Spark, Linux, Hadoop, AWS, Spark Streaming, Kafka, Storm, Akka, MongoDB, Hadoop Cluster, PostgreSQL, Talend.

Responsibilities:

  • Designed a stream processing job run by Spark Streaming and coded in Java.
  • Developed real-time data processing applications in Java and implemented Apache Spark Streaming ingestion from various streaming sources like Kafka and JMS.
  • Knowledge of PySpark; used Hive to analyze sensor data and cluster users based on their behavior in the events.
  • Wrote live real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
  • Developed Spark programs, created DataFrames and worked on transformations.
  • Involved in loading data from Linux file systems, servers and Java web services using Kafka producers and partitions.
  • Applied Kafka custom encoders for custom input format to load data into Kafka Partitions.
  • Implemented a POC with Hadoop, extracting data with Spark into HDFS.
  • Deployed new hardware and software environments required for PostgreSQL/Hadoop alongside the existing environment.
  • Developed code to read the data stream from Kafka and send it to the respective bolts through the respective streams.
  • Created Databricks notebooks using SQL and Java, and automated the notebooks using jobs.
  • Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
  • Implemented applications with Scala along with Akka and the Play framework.
  • Optimized the code using Spark for better performance.
  • Worked on Spark Streaming using Apache Kafka for real-time data processing.
  • Developed MapReduce jobs using Java MapReduce and HiveQL.
  • Developed UDF, UDAF and UDTF functions and used them in Hive queries.
  • Experienced in optimizing Hive queries and joins to handle different data sets.
  • Involved in ETL, data integration and migration by writing Pig scripts.
  • Developed MapReduce programs to cleanse the data in HDFS obtained from heterogeneous data sources.
  • Integrated Hadoop with Solr and implemented search algorithms.
  • Experience with Storm for real-time processing.
  • Worked hands-on with NoSQL databases like MongoDB for a POC storing images and URIs.
  • Designed and implemented a MongoDB data store and its associated RESTful web service.
  • Proficient in using AWS services such as EMR, S3 and CloudWatch to run and monitor Hadoop/Spark jobs on AWS.
  • Experience in processing large volumes of data, with skills in parallel execution of processes using Talend functionality.
  • Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
  • Developed Sqoop scripts to extract data from MySQL and load it into HDFS.
  • Used the Talend tool to create workflows for processing data from multiple source systems.

Confidential

Sr. Data Engineer

Environment: Hadoop, MapReduce, Hive, Pig, Sqoop, Spark, Spark Streaming, Spark SQL, MapR, Python, Oozie, Flume, HBase, Nagios, Ganglia, Hue, Hortonworks, Cloudera Manager, ZooKeeper, Cloudera, Oracle, Kerberos and Red Hat 6.

Responsibilities:

  • Worked directly with the Big Data Architecture Team which created the foundation of this Enterprise Analytics initiative in a Hadoop-based Data Lake.
  • Installed, configured and maintained Apache Hadoop clusters for application development, along with Hadoop tools like Hive, Pig, HBase, ZooKeeper and Sqoop.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames and saved it in Parquet format in HDFS.
  • Developed a data pipeline using Flume, Sqoop, Pig and Python MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
  • Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
  • Developed Spark scripts to import large files from Amazon S3 buckets and imported data from different sources like HDFS and HBase into Spark RDDs.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
  • Involved in migrating ETL processes from Oracle to Hive to test ease of data manipulation, and worked on importing and exporting data from Oracle and DB2 into HDFS and Hive using Sqoop.
  • Monitored the cluster for performance, networking and data integrity issues, and troubleshot issues in the execution of MapReduce jobs by inspecting and reviewing log files.
  • Created 25+ Linux Bash scripts for users, groups, data distribution, capacity planning, and system monitoring.
  • Installed the OS and administered the Hadoop stack on the Cloudera CDH5 distribution (with YARN), including configuration management, monitoring, debugging and performance tuning.
  • Developed and analyzed SQL scripts and designed the solution to implement using Spark.
  • Developed a data pipeline using Kafka, Spark and Hive to ingest, transform and analyze data.
  • Supported MapReduce programs and distributed applications running on the Hadoop cluster, and scripted Hadoop package installation and configuration to support fully automated deployments.
  • Migrated an existing on-premises application to AWS, used AWS services like EC2 and S3 for large data set processing and storage, worked with Elastic MapReduce and set up a Hadoop environment on AWS EC2 instances.
  • Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
  • Performed maintenance, monitoring, deployments and upgrades across the infrastructure that supports all our Hadoop clusters.
  • Created Hive external tables, loaded data into them and queried the data using HQL; worked with application teams to install operating system and Hadoop updates, patches and version upgrades as required.
  • Worked with and gained substantial experience in AWS cloud services like EC2, S3, EBS, RDS and VPC.
  • Worked on Hive for exposing data for further analysis and for transforming files from different analytical formats to text files.

Confidential

Big Data Developer

Environment: Apache Hadoop, Cloudera, Pig, Hive, Talend, MapReduce, Sqoop, UNIX, Cassandra, Java, Linux, Oracle 11gR2, UNIX Shell Scripting, Kerberos.

Responsibilities:

  • Developed multiple MapReduce jobs in Python for data cleaning and preprocessing and assisted with data capacity planning and node forecasting.
  • Involved in the design and ongoing operation of several Hadoop clusters; configured and deployed the Hive Metastore using MySQL and the Thrift server.
  • Implemented and operated on-premises Hadoop clusters from the hardware to the application layer including compute and storage.
  • Uploaded and processed more than 30 terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop and Flume.
  • Designed custom deployment and configuration automation systems to allow for hands-off management of clusters via Cobbler, FUNC, and Puppet.
  • Prepared complete documentation, based on the knowledge transfer, of the Phase-II Talend job design and goals, and documented the support and maintenance work to be followed in Talend.
  • Deployed the company's first Hadoop cluster, running Cloudera's CDH2 on 44 nodes storing 160 TB and connected via 1 Gb Ethernet.
  • Debugged and resolved major issues with Cloudera Manager by working with the Cloudera team.
  • Modified reports and Talend ETL jobs based on feedback from QA testers and users in development and staging environments.
  • Imported enterprise data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and then loaded the data into HBase tables.
  • Involved in Cluster Maintenance and removal of nodes using Cloudera Manager.
  • Collaborated with application development teams to provide operational support, platform expansion, and upgrades for Hadoop Infrastructure including upgrades to CDH3.
  • Participated in the Hadoop development Scrum, and installed and configured Cognos 8.4/10 and Talend ETL on single- and multi-server environments.
