
Data Engineer Resume


TX

SUMMARY

  • Dynamic and motivated IT professional with around 10 years of experience as a Big Data Engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem, big data analytics, cloud data engineering, data warehouse/data mart, data visualization, reporting, and data quality solutions.
  • In-depth knowledge of Hadoop architecture and its components such as YARN, HDFS, NameNode, DataNode, JobTracker, ApplicationMaster, ResourceManager, TaskTracker, and the MapReduce programming paradigm.
  • Experienced in using Spark to improve the performance and optimization of existing algorithms in Hadoop, using SparkContext, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and pair RDDs; worked extensively with PySpark and Scala.
  • Handled ingestion of data from different sources into HDFS using Sqoop and Flume, and performed transformations using Hive and MapReduce before loading the results back into HDFS.
  • Implemented security requirements for Hadoop and integrated it with the Kerberos authentication infrastructure: KDC server setup, realm/domain creation, and administration.
  • Experience with partitioning and bucketing in Hive, and designed both managed and external tables to optimize performance (a minimal sketch follows this summary).
  • Experience with different file formats such as Avro, Parquet, ORC, JSON, and XML.
  • Well versed in installing, configuring, supporting, and managing Hadoop clusters and the underlying big data infrastructure, including CDH3 and CDH4 clusters.
  • Processed and analyzed log data stored in HBase and imported it into the Hive warehouse, enabling business analysts to write HQL queries.
  • Hands-on experience with big data technologies for the storage, querying, processing, and analysis of data.
  • Experience in analyzing data using HiveQL, Pig Latin, HBase, and custom MapReduce programs in Java.
  • Experienced in optimization techniques for the sorting and shuffling phases of MapReduce programs, and implemented optimized joins that combine data from different data sources.
  • Hands-on experience creating Apache Spark RDD transformations on datasets in the Hadoop data lake.
  • Experience in designing and creating RDBMS Tables, Views, User Created Data Types, Indexes, Stored Procedures, Cursors, Triggers and Transactions.
  • Expert in designing parallel jobs using various stages such as Join, Merge, Lookup, Remove Duplicates, Filter, Dataset, Lookup File Set, Complex Flat File, Modify, Aggregator, and XML.
  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR and other services of the AWS family.
  • Created and maintained CI/CD (continuous integration and deployment) pipelines and applied automation to environments and applications; worked with automation tools such as Git, Terraform, and Ansible.
  • Experienced in dimensional modeling (star schema, snowflake schema), transactional modeling, and SCD (slowly changing dimension) design.
  • Experience using various packages in Python and R, such as ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, plyr, pandas, NumPy, seaborn, SciPy, matplotlib, scikit-learn, Beautiful Soup, Rpy2, SQLAlchemy, PyQt, and pytest.
  • Building and productionizing predictive models on large datasets by utilizing advanced statistical modeling, machine learning, or other data mining techniques.
  • Developed intricate algorithms based on deep-dive statistical analysis and predictive data modeling that were used to deepen relationships, strengthen longevity and personalize interactions with customers.
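
A minimal PySpark sketch of the partitioned and bucketed Hive table design mentioned above; Spark 2.x with Hive support is assumed, and the paths, database, table, and column names are hypothetical.

from pyspark.sql import SparkSession

# Spark session with Hive support so the table lands in the Hive metastore
spark = (SparkSession.builder
         .appName("partitioned-hive-table-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical raw input already staged in the data lake
orders = spark.read.parquet("hdfs:///data/raw/orders")

# Hive table partitioned by date and bucketed by customer_id for faster filters and joins
(orders.write
       .mode("overwrite")
       .partitionBy("order_date")
       .bucketBy(16, "customer_id")
       .sortBy("customer_id")
       .saveAsTable("analytics.orders_by_date"))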

TECHNICAL SKILLS

Big Data Technologies: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, ZooKeeper, Kafka, MongoDB, Apache Spark, Spark Streaming, HBase, Impala

Hadoop Distribution: Cloudera, Hortonworks, Apache, AWS

Languages: SQL, PL/SQL, Pig Latin, HiveQL, Scala, Regular Expressions

Operating Systems: Windows (XP/7/8/10), UNIX, Linux, Ubuntu, CentOS.

Build Automation tools: SBT, Ant, Maven

Databases: Oracle, SQL Server, MySQL, MS Access, Teradata.

Cloud Technologies: AWS, Microsoft Azure

PROFESSIONAL EXPERIENCE

Confidential, TX

Data Engineer

Responsibilities:

  • Performed end-to-end architecture and implementation of various AWS services such as Amazon EMR, Redshift, and S3.
  • Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item, so that suggestions can be made automatically, using Kinesis Data Firehose and an S3 data lake.
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
  • Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes into schema RDDs (a PySpark sketch of this pattern follows this list).
  • Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
  • Imported data from different sources such as HDFS and HBase into Spark RDDs and performed computations using PySpark to generate the output response. Worked on analyzing the Hadoop stack and different big data analytic tools, including Pig, Hive, HBase, and Sqoop.
  • Wrote multiple MapReduce programs for the extraction, transformation, and aggregation of data from more than 20 sources with multiple file formats, including XML, JSON, CSV, and other compressed formats.
  • Implemented Spark Core in Scala to process data in memory.
  • Performed job functions using Spark APIs in Scala for real-time analysis and fast querying.
  • Involved in creating Spark applications in Scala using functions such as cache, map, and reduceByKey to process data.
  • Created Oozie workflows for Hadoop-based jobs, including Sqoop, Hive, and Pig.
  • Created Hive external tables, loaded data into the tables, and queried the data using HQL.
  • Performed data validation on the ingested data using MapReduce by building a custom model to filter out and cleanse invalid data.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
  • Wrote HiveQL queries, configuring the number of reducers and mappers needed for the output.
  • Transferred data between Pig scripts and Hive using HCatalog, and transferred relational database data using Sqoop.
  • Configured and maintained different topologies in the Storm cluster and deployed them on a regular basis.
  • Responsible for building scalable distributed data solutions using Hadoop; installed and configured Hive, Pig, Oozie, and Sqoop on the Hadoop cluster.
  • Developed simple to complex MapReduce jobs in Java that were implemented using Hive and Pig.
  • Ran many performance tests using the cassandra-stress tool in order to measure and improve the read and write performance of the cluster.
  • Configured Kafka, Storm, and Hive to receive and load real-time messages.
  • Supported MapReduce programs running on the cluster; performed cluster monitoring, maintenance, and troubleshooting.
  • Provided cluster coordination services through ZooKeeper; installed and configured Hive and wrote Hive UDFs.
  • Worked in collaboration with stakeholders to support the financial development of the company.
  • Worked on the Analytics Infrastructure team to develop a stream filtering system on top of Apache Kafka and Storm.
  • Worked on a POC for Spark and Scala parallel processing, streaming data in real time using Spark with Kafka; worked extensively with PySpark to build big data flows.
  • Designed and set up an enterprise data lake to support various use cases, including analytics, processing, storage, and reporting of voluminous, rapidly changing data.
  • Responsible for maintaining quality reference data in the source by performing operations such as cleaning and transformation, and for ensuring integrity in a relational environment, working closely with the stakeholders and solution architect.
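
A minimal PySpark sketch of the S3/RDD-to-DataFrame pattern referenced above; the resume describes the Scala case-class route, so this Python equivalent, with hypothetical bucket, file, and field names, is an illustrative assumption rather than the project's actual code.

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("s3-rdd-to-dataframe-sketch").getOrCreate()
sc = spark.sparkContext

# Parse raw CSV lines from S3 into Row objects (the Python analogue of case classes)
lines = sc.textFile("s3://example-bucket/input/orders.csv")
rows = (lines.map(lambda line: line.split(","))
             .map(lambda f: Row(user_id=f[0], item=f[1], qty=int(f[2]))))

# The schema is taken from the Row fields, giving a queryable DataFrame
orders = spark.createDataFrame(rows)
orders.createOrReplaceTempView("orders")

# Spark SQL aggregation, persisted back to S3 for downstream AWS services
top_items = spark.sql("SELECT item, SUM(qty) AS total_qty FROM orders GROUP BY item")
top_items.write.mode("overwrite").parquet("s3://example-bucket/output/top_items/")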

Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Hadoop, Spark, HDFS, Hive, Pig, HBase, Big Data, Apache Storm, Oozie, Sqoop, Kafka, Flume, ZooKeeper, MapReduce, Cassandra, Scala, Linux, NoSQL, MySQL Workbench, Java, Eclipse, Oracle 10g, SQL.

Confidential, TX

Data Engineer

Responsibilities:

  • Handled importing of data from various data sources and performed data transformations using HAWQ and MapReduce.
  • Exported the analyzed data to relational databases using Sqoop and generated reports for the BI team.
  • Managed and scheduled jobs on a Hadoop cluster using Oozie.
  • Along with the infrastructure team, involved in designing and developing a Kafka- and Storm-based data pipeline.
  • Used Test driven approach for developing the application and Implemented the unit tests using Python Unit test framework.
  • Developed a Storm monitoring bolt for validating pump tag values against high-low and high-high/low-low values from preloaded metadata.
  • Designed and configured a Kafka cluster to accommodate heavy throughput of 1 million messages per second; used the Kafka producer 0.8.3 API to produce messages (a minimal producer sketch in Python follows this list).
  • Installed, Configured Talend ETL on single and multi-server environments.
  • Troubleshot, debugged, and fixed Talend-specific issues while maintaining the performance of the ETL environment.
  • Developed merge jobs in Python to extract and load data into a MySQL database.
  • Created and modified several UNIX shell scripts according to the changing needs of the project and client requirements; developed UNIX shell scripts to call Oracle PL/SQL packages and contributed to the standard framework.
  • Developed simple to complex MapReduce jobs using Hive.
  • Implemented partitioning and bucketing in Hive; mentored the analyst and test teams on writing Hive queries.
  • Involved in setting up HBase to use HDFS.
  • Installed patches and packages using RPM and YUM on Red Hat and SUSE Linux, and using patchadd and pkgadd on Solaris 10.
  • Along with the infrastructure team, involved in designing and developing a Kafka and Storm based data pipeline; this pipeline also involved Amazon Web Services EMR, S3, and RDS.
  • Knowledgeable of Spark and Scala, mainly in framework exploration for the transition from Hadoop/MapReduce to Spark.
  • Implemented AWS Step Functions to automate and orchestrate the Amazon SageMaker related tasks such as publishing data to S3, training ML model and deploying it for prediction.
  • Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.
  • Worked on Azure Data Factory to integrate data from both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations to load the data back into Azure Synapse.
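
A minimal Python sketch of a Kafka producer like the one described above; the project used the Kafka 0.8.3 producer API, so the kafka-python client, broker addresses, topic name, and message fields shown here are illustrative assumptions.

import json
from kafka import KafkaProducer

# Producer tuned for high throughput: small batching window, full acknowledgements
producer = KafkaProducer(
    bootstrap_servers=["broker1:9092", "broker2:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",
    linger_ms=5,
)

# Publish a pump-tag reading for the Storm validation bolt downstream
producer.send("pump-metrics", {"tag": "PT-101", "value": 87.4, "ts": 1700000000})
producer.flush()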

Environment: Azure SQL DW, Databricks, UNIX shell scripting, Python, Oracle 11g, DB2, HDFS, Kafka, Storm, Spark, ETL, Java (JDK 1.7), Pig, Linux, HiveQL, AWS EMR, Cassandra, MapReduce, MS Access, Toad, SQL, Scala, MySQL Workbench, XML, NoSQL, SOLR, HBase, Hive, Sqoop.

Confidential, Estero, FL

Data Engineer

Responsibilities:

  • Involved in the complete SDLC of a big data project, including requirement analysis, design, coding, testing, and production.
  • Extensively used Sqoop to import/export data between RDBMS and Hive tables, performed incremental imports, and created Sqoop jobs based on the last saved value.
  • Involved in implementing the solution for data preparation, responsible for data transformation as well as handling user stories.
  • Developed and tested data ingestion/preparation/dispatch jobs.
  • Worked on migrating existing mainframe data and reporting feeds to Hadoop.
  • Involved in the setup of IBM CDC to capture changes on the mainframe.
  • Developed Pig scripts to read CDC files and ingest them into HBase.
  • Worked on HBase table setup and shell scripts to automate the ingestion process.
  • Created Hive external tables on top of HBase that were used for feed generation.
  • Scheduled automated runs in Talend.
  • Worked on migration of an existing feed from Hive to Spark; to reduce feed latency, the existing HQL was transformed to run using Spark SQL and HiveContext (see the sketch after this list).
  • Worked on logs monitoring using Splunk. Performed setup of Splunk forwarders and built dashboards on Splunk.
  • Prepared Pig scripts that were used to build denormalized JSON documents, which were then loaded into Elasticsearch.
  • Experience working with Spark SQL for processing data in Hive tables.
  • Wrote Pig Latin scripts to perform transformations (ETL) per the use case requirements.
  • Created dispatcher jobs using Sqoop export to dispatch data into Teradata target tables.
  • Involved in indexing files using Solr to remove duplicates in type 1 insert jobs.
  • Implemented a new Pig approach for SCD type 1 jobs using Pig Latin scripts.
  • Created Hive target tables, using HQL, to hold the data after all the Pig ETL operations.
  • Created HQL scripts to perform data validation once transformations were done, per the use case.
  • Implemented Snappy compression on HBase tables to reclaim space in the cluster.
  • Hands-on experience accessing and performing CRUD operations against HBase data.
  • Integrated a SQL layer on top of HBase to get the best read and write performance, using the salting feature.
  • Wrote shell scripts to automate the process by scheduling and calling the scripts from a scheduler.
  • Created Hive scripts to load the historical data and partition the data.
  • Integrated Hadoop with Tableau and SAS analytics to provide end users with analytical reports.
  • Wrote rules in Hive to predict members with various ailments and their primary care providers; reports are pushed to Elasticsearch.
  • Worked on provisioning and configuring Elasticsearch nodes and creating indexes; prepared Kibana dashboards for business users.
  • Closely collaborated with both the onsite and offshore teams.
  • Worked closely with the app support team to deploy the developed jobs into production.
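
A minimal PySpark sketch of running an existing HQL feed through Spark SQL with HiveContext, as described in the Hive-to-Spark migration bullet above; the Spark 1.x-style API usage and the database, table, and column names are illustrative assumptions.

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-feed-on-spark-sketch")
hive_ctx = HiveContext(sc)

# The original HQL runs unchanged against the Hive metastore
feed = hive_ctx.sql("""
    SELECT member_id, provider_id, COUNT(*) AS claim_count
    FROM claims_db.member_claims
    GROUP BY member_id, provider_id
""")

# Persist the result as the Hive table the downstream feed already consumes
feed.write.mode("overwrite").saveAsTable("feeds_db.member_claim_summary")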

Environment: Hadoop, HDFS, Map Reduce, Hive, Flume, Sqoop, PIG, Java (JDK 1.6), Eclipse, MySQL and Ubuntu, Zookeeper, Oracle, Shell Scripting, Elastic Search, Kibana.

Confidential, Lowell, MA

Data Engineer

Responsibilities:

  • Worked on the implementation of a log producer in Scala that watches for application logs, transforms incremental logs, and sends them to a Kafka and ZooKeeper based log collection platform.
  • Used Hive to analyze data ingested into HBase by using Hive-HBase integration and compute various metrics for reporting on the dashboard.
  • Transformed data using AWS Glue dynamic frames with PySpark; cataloged the transformed data using crawlers and scheduled the job and crawlers using the workflow feature (a minimal Glue job sketch follows this list).
  • Worked on installing the cluster, commissioning and decommissioning of data nodes, NameNode recovery, capacity planning, and slots configuration.
  • Developed data pipeline programs with the Spark Scala APIs, data aggregations with Hive, and data formatting (JSON) for visualization and reporting.
  • Implemented End to End solution for hosting the web application on AWS cloud with integration to S3 buckets.
  • Worked on creating and updating Auto Scaling and CloudWatch monitoring via the AWS CLI.
  • Allotted permissions, policies, and roles to users and groups using AWS Identity and Access Management (IAM). Processed data into HDFS by developing solutions, analyzed the data using MapReduce, Pig, and Hive, and produced summary results from Hadoop for downstream systems.
  • Used Kettle extensively to import data from various systems/sources, such as MySQL, into HDFS.
  • Performed various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
  • Involved in creating Hive tables, and then applied HiveQL on those tables for data validation.
  • Moved the data from Hive tables into Mongo collections.
  • Used Zookeeper for various types of centralized configurations.
  • Involved in loading and transforming large sets of structured, semi-structured, and unstructured data, and analyzed them by running Hive queries and Pig scripts.
  • Managed and reviewed Hadoop log files.
  • Tested raw data and executed performance scripts.
  • Shared responsibility for administration of Hadoop, Hive and Pig.
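
A minimal AWS Glue job sketch of the dynamic-frame transformation and crawler-cataloged tables mentioned above; the catalog database, table, column mappings, and S3 output path are hypothetical.

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the crawler-cataloged source table as a dynamic frame
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders_raw")

# Rename/cast columns, then write Parquet back to S3 for querying
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "string", "amount", "double")])

glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet")

job.commit()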

Environment: AWS, Hadoop, Pig, Hive, Sqoop, Flume, MapReduce, HDFS, LINUX, Oozie, MongoDB

Confidential

Data Analyst

Responsibilities:

  • Analyze data by performing data cleaning, data mapping, data manipulation, data visualization, using Python, R, Tableau and MATLAB.
  • Demonstrate understanding of existing machine learning tools or develop customized algorithms to solve analytical problems with incomplete data sets and deploy automated processes into systems.
  • Set up MySQL Server and transfer data into those databases from local machines.
  • Utilize machine learning techniques for predictions & forecasting based on the data.
  • Evaluate, test, compare, and validate different models before selecting the best model for prediction by using R/SQL/Spark Scala.
  • Perform budget analysis, cost performance analysis, schedule performance analysis, cost variance analysis, and risk analysis using SQL, R, Python, and Tableau.
  • Tune cluster performance; monitor the cluster using Cloudera Manager and resolve issues by reviewing logs.
  • Implement Capacity scheduler to securely share the available resources among multiple queues.
  • Secure data by encrypting it at the application level and in HDFS, covering both data at rest and data in transit.
  • Automatically rotate Hadoop log files by creating and managing cron jobs.
  • Migrate data from an RDBMS to a NoSQL database.
  • Wrote and optimized complex SQL queries involving multiple joins and advanced analytical functions to perform data extraction and merging from large volumes of historical data.
  • Work with various HDFS file formats like Avro, ORC, and Parquet for Hive and Spark (see the sketch after this list).
  • Participated in Business meetings to understand the business needs & requirements.
  • Developed triggers, stored procedures, functions, and packages using cursor and ref cursor concepts associated with the project, using PL/SQL.
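
A minimal PySpark sketch of moving data between the HDFS file formats listed above; the paths are hypothetical, and reading Avro assumes the external spark-avro package is available on the cluster.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-format-sketch").getOrCreate()

# Avro source landed by an upstream ingestion job (requires spark-avro)
events = spark.read.format("avro").load("hdfs:///data/landing/events_avro/")

# Persist the same data as Parquet and ORC for Hive and Spark consumers
events.write.mode("overwrite").parquet("hdfs:///data/curated/events_parquet/")
events.write.mode("overwrite").orc("hdfs:///data/curated/events_orc/")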
