
Data Engineer Resume


MD

SUMMARY

  • 8 years of overall software development experience in Big Data technologies, the Hadoop ecosystem, and Java/J2EE technologies, with experience programming in Java, Scala, and Python.
  • Hands-on experience in data structures, design, and analysis using machine learning techniques and modules in Python and R.
  • 4+ years of strong hands-on experience with the Hadoop ecosystem, including Spark, MapReduce, Hive, Pig, HDFS, YARN, HBase, Oozie, Kafka, and Sqoop.
  • Experience in architecting, designing, and building distributed data pipelines.
  • Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
  • Strong experience in migrating other databases to Snowflake.
  • Experience in building Snowpipe.
  • Experience in using Snowflake Clone and Time Travel.
  • Worked with real-time data processing and streaming techniques using Spark Streaming and Kafka.
  • Experience in moving data into and out of the HDFS and Relational Database Systems (RDBMS) using Apache Sqoop.
  • Experience developing Kafka producers and consumers for streaming millions of events per second.
  • Significant experience writing custom UDFs in Hive and custom input formats in MapReduce.
  • Experience in designing Terraform templates and deploying them via Cloud Deployment Manager to spin up resources such as cloud virtual networks and Compute Engine instances in public and private subnets, along with autoscalers, in Google Cloud Platform.
  • Experience in designing, architecting, and implementing scalable cloud-based web applications using AWS and GCP.
  • Strong experience productionizing end-to-end data pipelines on the Hadoop platform.
  • Experience using various Hadoop Distributions (Cloudera, Hortonworks, Amazon AWS EMR) to fully implement and utilize various Hadoop services.
  • Experience working with NoSQL databases like MongoDB, Cassandra, and HBase.
  • Used Hive extensively for performing various data analytics required by business teams.
  • Solid experience working with various data formats such as Parquet, ORC, Avro, and JSON.
  • Imported customer data into Python using the Pandas library and performed various data analyses, finding patterns in the data that helped drive key decisions.
  • Good experience in designing and implementing end-to-end data security and governance within the Hadoop platform using Kerberos.
  • Application development with Oracle Forms and Reports, OBIEE, Discoverer, Report Builder, and ETL development.
  • Hands-on experience in architecting ETL transformation layers and writing Spark jobs to do the processing.
  • Hands-on experience in developing end-to-end Spark applications using Spark APIs such as RDDs, DataFrames, MLlib, Spark Streaming, and Spark SQL.
  • Good experience working with various data analytics and big data services in the AWS cloud, such as EMR, Redshift, S3, Athena, and Glue.
  • Good understanding of Spark ML algorithms such as Classification, Clustering, and Regression.
  • Experienced in migrating data warehousing workloads into Hadoop based data lakes using MR, Hive, Pig and Sqoop.
  • Set up build and deployment automation for Java-based projects using Jenkins.
  • Extensive experience in developing standalone multithreaded applications.
  • Experience in software design, development, and implementation of client/server and web-based applications.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables (a minimal sketch of this pattern follows this summary).
  • Experience in maintaining an Apache Tomcat, MySQL, LDAP, and web service environment.
  • Designed ETL workflows in Tableau and deployed data from various sources to HDFS.
  • Good experience with use-case development and software methodologies like Agile and Waterfall.
  • Active team player with excellent interpersonal skills; a keen learner with self-commitment and innovation.
  • Proven ability to manage all stages of project development; strong problem-solving and analytical skills and the ability to make balanced, independent decisions.
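
A minimal PySpark sketch of the CSV-to-Hive-ORC loading pattern mentioned above, assuming a metastore-backed Hive warehouse; the paths, column list, and target table name are hypothetical placeholders rather than the original project's objects.

    from functools import reduce
    from pyspark.sql import SparkSession, DataFrame, functions as F

    # Hive support lets saveAsTable register the ORC table in the metastore.
    spark = (SparkSession.builder
             .appName("csv-to-hive-orc")
             .enableHiveSupport()
             .getOrCreate())

    TARGET_COLS = ["customer_id", "event_date", "amount"]  # hypothetical target layout

    def load_one(path):
        # Each feed may carry a different header/column order, so the schema is
        # inferred per source and then aligned to the common target layout.
        df = spark.read.option("header", True).option("inferSchema", True).csv(path)
        for col in TARGET_COLS:
            if col not in df.columns:
                df = df.withColumn(col, F.lit(None).cast("string"))
        return df.select(TARGET_COLS)

    frames = [load_one(p) for p in ["s3://landing/feed_a/", "s3://landing/feed_b/"]]
    combined = reduce(DataFrame.unionByName, frames)

    # Append into a Hive table stored as ORC.
    combined.write.mode("append").format("orc").saveAsTable("analytics.customer_events")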

TECHNICAL SKILLS

Hadoop Technologies and Distributions: HDP, Cloudera

Hadoop Ecosystem: HDFS, Hive, Pig, Spark, Zookeeper, Map-Reduce, Spark-SQL, Spark Streaming and Spark MLLib, Scalding, Oozie.

Cloud Technologies: Amazon Web Services (AWS), Google Cloud platform

NoSQL Databases: HBase, Cassandra

Programming: Python, BigQuery, Core Java, Scala, Shell Scripting, PL/SQL

Web Development: HTML, JavaScript, CSS, XML, JSP, Servlets.

IDE: IntelliJ, Eclipse

Databases: Oracle, MS SQL Server, Snowflake, Teradata

PROFESSIONAL EXPERIENCE

Confidential, MD

Data Engineer

Responsibilities:

  • Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both RDBMS and NoSQL data stores for data access and analysis.
  • Used all major ETL transformations to load the tables through Informatica mappings.
  • Created Hive queries and tables that helped the line of business identify trends by applying strategies on historical data before promoting them to production.
  • Installed Hadoop, Map Reduce, HDFS, AWS and developed multiple Map Reduce jobs in PIG and Hive for data cleaning and pre-processing.
  • Implemented solutions for ingesting data from various sources and processing the Data-at-Rest utilizing Big Data technologies such as Hadoop, Map Reduce Frameworks, HBase, Hive.
  • Implemented Spark GraphX application to analyse guest behaviour for data science segments.
  • Data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Worked on batch processing of data sources using Apache Spark and Elasticsearch.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Worked on migrating PIG scripts and Map Reduce programs to Spark Data frames API and Spark SQL to improve performance.
  • Developed Talend mappings using various transformations, sessions, and workflows; Teradata was the target database, and the sources were a combination of flat files, Oracle tables, Excel files, and a Teradata database.
  • Good level of experience in Core Java and J2EE technologies such as JDBC, Servlets, and JSP.
  • Hands-on knowledge of core Java concepts like exceptions, collections, data structures, multithreading, serialization, and deserialization.
  • Created Hive external tables to stage data and then moved the data from staging to the main tables (a sketch of this pattern follows this section).
  • Implemented Installation and configuration of multi-node cluster on Cloud using Amazon Web Services (AWS) on EC2.
  • Created Data Pipelines as per the business requirements and scheduled it using Oozie Coordinators.
  • Worked with the NoSQL database HBase for real-time data analytics.
  • Able to assess business rules, collaborate with stakeholders, and perform source-to-target data mapping, design, and review.
  • Designed and developed a Sybase/Open server API to access data from an In Memory KDB/Q database.
  • Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as MapReduce, Hive, Pig, and Sqoop.
  • Created scripts for importing data into HDFS/Hive using Sqoop from DB2.
  • Loaded data from different sources (databases and files) into Hive using the Talend tool.
  • Conducted POC's for ingesting data using Flume.
  • The objective of this project was to build a data lake as a cloud-based solution in AWS using Apache Spark.
  • Experienced in deploying Apache Solr/ZooKeeper clusters in the cloud and on premises, and worked on data storage and disaster recovery for Solr/ZooKeeper.
  • Worked on data modelling and advanced SQL with columnar databases on AWS.
  • Worked on Sequence files, RC files, Map side joins, bucketing, Partitioning for Hive performance enhancement and storage improvement.
  • Developed BASH scripts to parse the raw data, populate staging tables and store the refined data in partitioned DB2 tables for Business analysis.
  • Worked on managing and reviewing Hadoop log files. Tested and reported defects in an Agile Methodology perspective.
  • Extensive experience with search engines like Apache Lucene, Apache Solr, and Elasticsearch.

Environment: Hadoop, Cloudera, Talend, Scala, Spark, HDFS, KDB, Hive, Pig, Sqoop, DB2, SQL, Linux, Solr, YARN, NDM, Informatica, AWS, Windows & Microsoft Office.
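
A minimal sketch of the Hive staging-to-main pattern referenced in the bullets above, written in PySpark for illustration (the original work used Hive scripts and Scala); the database, table, and column names are hypothetical.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-staging-load")
             .enableHiveSupport()
             .getOrCreate())

    # Dynamic partitioning is needed for the partitioned insert below.
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    # External staging table over files landed by the ingestion (e.g. Sqoop) jobs.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS stg.customer_txn (
            txn_id BIGINT, customer_id BIGINT, amount DECIMAL(12,2), txn_ts STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/data/landing/customer_txn'
    """)

    # Move staged rows into the partitioned main table with light cleanup on the way in.
    spark.sql("""
        INSERT OVERWRITE TABLE core.customer_txn PARTITION (txn_date)
        SELECT txn_id, customer_id, amount, to_date(txn_ts) AS txn_date
        FROM stg.customer_txn
        WHERE txn_id IS NOT NULL
    """)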

Confidential, St. Louis, MO

Data Engineer

Responsibilities:

  • Worked as a Data Engineer utilizing Big Data and Hadoop ecosystem components for building highly scalable data pipelines.
  • Worked in Agile development environment and participated in daily scrum and other design related meetings.
  • Analyzed customer behavior data, symptoms data, transaction data, and campaign data to identify trends and patterns using visualization techniques such as the Seaborn library in Python.
  • Wrote Python scripts to predict the number of people affected by certain diseases by collecting predicted (symptom) data from medical sectors, evaluating it against outcome data, and raising awareness using machine learning models such as logistic regression.
  • Involved in converting Hive/SQL queries into Spark transformations using Scala.
  • Worked on data manipulation of raw marketing data in different formats from multiple sources and prepared the data for sentiment analysis of customer medical-issue data using packages like NLTK (natural language processing and text analysis with the Natural Language Toolkit in Python).
  • Worked on Spark SQL, created data frames by loading data from Hive tables and created prep data and stored in AWS S3.
  • Extraction of large amounts of data for analysis and reporting; responsible for documentation of all analyses as well as data discrepancies in both Spark and Python.
  • Imported customer data into Python using the Pandas library and performed various data analyses, finding patterns in the data that helped drive key decisions.
  • Responsible for loading the customer's data and event logs from Kafka into Redshift through Spark Streaming.
  • Developed batch scripts to fetch the data from AWS S3 storage and do required transformations in Scala using Spark framework.
  • Used Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model, which gets the data from Kafka in near real time and persists it to Redshift clusters (see the sketch after this section).
  • Developed low-latency applications and interpretable models using machine learning algorithms.
  • Used AWS SageMaker to quickly build, train, and deploy machine learning models.
  • Worked with different performance metrics such as F1 score, precision, recall, log loss, accuracy, and AUC.
  • Worked with machine learning algorithms like regressions (linear, logistic), SVMs, and decision trees.
  • Through thorough systematic search, demonstrated performance surpassing state-of-the-art (deep learning) baselines.
  • Optimization of Hive queries using best practices and the right parameters, and using technologies like Hadoop, YARN, Python, and PySpark.
  • Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Created Sqoop Scripts to import and export customer profile data from RDBMS to S3 buckets.
  • Used the Spark DataFrame and Spark APIs to implement batch processing of jobs.
  • Used Apache Kafka and Spark Streaming to get data from Adobe Live Stream REST API connections.
  • Automated creation and termination of AWS EMR clusters.
  • Used Python libraries such as NumPy, Matplotlib, Pandas, and scikit-learn to create dashboards and visualizations, working in IDEs such as Spyder and Jupyter Notebook.
  • Used various Spark concepts like broadcast variables, caching, and dynamic allocation to design more scalable Spark applications.
  • Implemented continuous integration and deployment using CI/CD tools like Jenkins, GIT, Maven.
  • Installed and Configured Jenkins Plugins to support the project specific tasks.
  • Developed a data warehouse model in Snowflake for over 100 datasets using WhereScape.
  • Heavily involved in testing Snowflake to understand the best possible way to use the cloud resources.
  • Developed ELT workflows using NiFi to load data into Hive and Teradata.
  • Worked on migrating jobs from the NiFi development cluster to the pre-prod and production clusters.
  • Designed and Developed ETL Processes in AWS Glue to migrate Campaign data from external sources like S3, ORC/Parquet/Text Files into AWS Redshift.
  • Data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
  • Created external tables with partitions using Hive, AWS Athena, and Redshift.
  • Performed data manipulations using various Talend components like tMap, tJavaRow, tJava, tOracleInput, tOracleOutput, tMSSQLInput, and many more.

Environment: AWS EMR, S3, Spark, Hive, Sqoop, Scala, Java, MySQL, Oracle DB, Athena, Redshift, Snowflake, NiFi, Teradata.
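
A hedged Spark Structured Streaming sketch of the Kafka-to-Redshift path referenced above; the broker address, topic, table, and JDBC endpoint are hypothetical, and it assumes the spark-sql-kafka connector and a Redshift-compatible JDBC driver are available on the Spark classpath.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("kafka-to-redshift").getOrCreate()

    # Read the event stream from Kafka (requires the spark-sql-kafka package).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
              .option("subscribe", "customer-events")              # hypothetical topic
              .load())

    parsed = events.select(F.col("key").cast("string").alias("event_key"),
                           F.col("value").cast("string").alias("payload"),
                           "timestamp")

    def write_batch(batch_df, batch_id):
        # Each micro-batch is appended to Redshift over JDBC; assumes the table exists.
        (batch_df.write
         .format("jdbc")
         .option("url", "jdbc:redshift://example-cluster:5439/dev")  # hypothetical endpoint
         .option("dbtable", "public.customer_events")
         .option("user", "etl_user")
         .option("password", "****")
         .mode("append")
         .save())

    query = (parsed.writeStream
             .foreachBatch(write_batch)
             .option("checkpointLocation", "s3://bucket/checkpoints/customer-events")
             .start())
    query.awaitTermination()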

Confidential, Des Moines, IA

Cloud Engineer GCP/Data Engineer

Responsibilities:

  • Worked on Google Cloud Platform (GCP) services like Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
  • Set up GCP firewall rules to allow or deny traffic to and from VM instances based on specified configuration, and used GCP Cloud CDN (content delivery network) to deliver content from GCP cache locations, drastically improving user experience and latency.
  • Worked on GKE topology diagrams covering masters, slaves, RBAC, Helm, kubectl, and ingress controllers.
  • Created projects, VPCs, subnetworks, and GKE clusters for the QA3, QA9, and prod environments using Terraform.
  • Managed AWS infrastructure as code (IaC) using Terraform; wrote new Python scripts to support new functionality in Terraform; provisioned highly available EC2 instances using Terraform and CloudFormation; and set up build and deployment automation for Terraform scripts using Jenkins.
  • Designed and architected the various layers of the data lake.
  • Designed star schemas in BigQuery.
  • Performed data quality issue analysis using SnowSQL by building analytical warehouses on Snowflake.
  • Monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver for all environments.
  • Built a program with Python and Apache Beam and executed it on Cloud Dataflow to run data validation between raw source files and BigQuery tables (see the sketch after this section).
  • Built a configurable Scala- and Spark-based framework to connect to common data sources like MySQL, Oracle, Postgres, SQL Server, Salesforce, and BigQuery and load the data into BigQuery.
  • Hands-on experience in architecting ETL transformation layers and writing Spark jobs to do the processing.
  • Application development with Oracle Forms and Reports, OBIEE, Discoverer, Report Builder, and ETL development.
  • Wrote Flume configuration files for importing streaming log data into HBase with Flume.
  • Imported several transactional logs from web servers with Flume to ingest the data into HDFS.
  • Installed and configured Pig and wrote Pig Latin scripts to convert data from text files to Avro format.
  • Created Partitioned Hive tables and worked on them using HiveQL.
  • Loading Data into HBase using Bulk Load and Non-bulk load.
  • Worked on the continuous integration tool Jenkins and automated jar builds at the end of each day.
  • Worked with Tableau; integrated Hive and Tableau Desktop reports and published them to Tableau Server.
  • Developed MapReduce programs in Java for parsing the raw data and populating staging Tables.
  • Experience in setting up the whole app stack, and setting up and debugging Logstash to send Apache logs to AWS Elasticsearch.
  • Experienced in designing RESTful services using Java-based APIs like Jersey.
  • Supported setting up the QA environment and updating configurations for implementing scripts with Pig, Hive, and Sqoop.

Environment: Linux, AWS, EC2, RDS, ELB (Elastic Load Balancing), S3, CloudWatch, CloudFormation, Route53, Lambda, BigQuery, Dataproc, Data Lake, ETL, HiveQL, Map Reduce, Oracle, Spark, SQL, Python, YARN, Pig.
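
A minimal Apache Beam sketch of the Dataflow validation job referenced above, reduced to a row-count comparison between a raw file and a BigQuery table; the project, bucket, file path, and table name are hypothetical, and it assumes a Beam release that provides beam.io.ReadFromBigQuery.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    RAW_FILE = "gs://raw-bucket/adobe/events.csv"   # hypothetical source file
    BQ_TABLE = "my-project:analytics.events"        # hypothetical target table

    def run():
        options = PipelineOptions(runner="DataflowRunner", project="my-project",
                                  region="us-central1",
                                  temp_location="gs://raw-bucket/tmp")
        with beam.Pipeline(options=options) as p:
            raw_count = (p
                         | "ReadRaw" >> beam.io.ReadFromText(RAW_FILE, skip_header_lines=1)
                         | "CountRaw" >> beam.combiners.Count.Globally())
            bq_count = (p
                        | "ReadBQ" >> beam.io.ReadFromBigQuery(table=BQ_TABLE)
                        | "CountBQ" >> beam.combiners.Count.Globally())
            # Compare the two singleton counts via a side input and record the result.
            _ = (raw_count
                 | "Compare" >> beam.Map(
                       lambda raw, bq: f"raw={raw}, bigquery={bq}, match={raw == bq}",
                       bq=beam.pvalue.AsSingleton(bq_count))
                 | "WriteResult" >> beam.io.WriteToText("gs://raw-bucket/validation/result"))

    if __name__ == "__main__":
        run()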

Confidential, Seattle, WA

Hadoop/Spark Developer

Responsibilities:

  • Involved in creating data ingestion pipelines for collecting health care and providers data from various external sources like FTP Servers and S3 buckets.
  • Involved in migrating the existing Teradata data warehouse to AWS S3-based data lakes.
  • Involved in migrating existing traditional ETL jobs to Spark and Hive Jobs on new cloud data lake.
  • Wrote complex Spark applications for performing various de-normalizations of the datasets and creating a unified data analytics layer for downstream teams.
  • Primarily responsible for fine-tuning long-running Spark applications, writing custom Spark UDFs, and troubleshooting failures.
  • Involved in building a real time pipeline using Kafka and Spark streaming for delivering event messages to downstream application team from an external rest-based application.
  • Involved in creating Hive scripts for performing ad hoc data analysis required by the business teams.
  • Worked extensively on migrating on prem workloads to AWS Cloud.
  • Worked on utilizing AWS cloud services like S3, EMR, Redshift, Athena, and the Glue metastore.
  • Used broadcast variables in Spark, effective and efficient joins, caching, and other capabilities for data processing (see the sketch after this section).
  • Involved in continuous Integration of application using Jenkins.

Environment: Linux, AWS EMR, Spark, Hive, HDFS, Sqoop, Kafka, Oozie, HBase, Scala, Map Reduce.
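
A minimal PySpark sketch of the broadcast-join and custom-UDF techniques referenced above; the S3 paths, table layouts, and the normalize_npi helper are hypothetical examples, not the original application's code.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("provider-denorm").getOrCreate()

    # Hypothetical datasets: a large claims fact table and a small provider dimension.
    claims = spark.read.parquet("s3://lake/curated/claims/")
    providers = spark.read.parquet("s3://lake/curated/providers/")

    # Broadcasting the small dimension avoids shuffling the large fact table.
    denorm = claims.join(F.broadcast(providers), on="provider_id", how="left")

    # A simple custom UDF of the cleanup kind mentioned above; built-in functions are
    # preferred where available, since Python UDFs add serialization overhead.
    @F.udf(returnType=StringType())
    def normalize_npi(npi):
        return npi.strip().zfill(10) if npi else None

    denorm = denorm.withColumn("npi", normalize_npi(F.col("npi")))

    # The claims data is assumed to carry a service_year column used for partitioning.
    (denorm.write.mode("overwrite")
           .partitionBy("service_year")
           .parquet("s3://lake/analytics/claims_denorm/"))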

Confidential

Software Engineer

Responsibilities:

  • Involved in requirements analysis and the design of an object-oriented domain model.
  • Involved in detailed documentation and wrote functional specifications for the module.
  • Involved in development of the application with Java and J2EE technologies.
  • Developed and maintained an elaborate services-based architecture utilizing open-source technologies like Hibernate ORM and the Spring Framework.
  • Developed server-side services using core Java multithreading, Struts MVC, Java, EJB, Spring, and web services (SOAP, WSDL, AXIS).
  • Responsible for developing the DAO layer using Spring MVC and configuration XMLs for Hibernate, and also managing CRUD operations (insert, update, and delete).
  • Designed, developed, and implemented JSPs in the presentation layer for the Submission, Application, and Reference implementations.
  • Developed JavaScript for client-side data entry validations and front-end validation.
  • Deployed Web, presentation, and business components on Apache Tomcat Application Server.
  • Developed PL/SQL procedures for different use-case scenarios.
  • Involved in post-production support and testing; used JUnit for unit testing of the module.
  • Worked on SnowSQL and Snowpipe.
  • Converted Talend joblets to support Snowflake functionality.
  • Created Snowpipe for continuous data load.
  • Used COPY to bulk load the data (see the sketch after this section).
  • Created data sharing between two Snowflake accounts.
  • Created internal and external stage and transformed data during load.
  • Redesigned views in Snowflake to increase performance.

Environment: Java/J2EE, JSP, XML, Spring Framework, Hibernate, Eclipse (IDE), JavaScript, Ant, SQL, PL/SQL, Oracle, Windows, UNIX, SOAP, Jasper Reports, Snowflake.
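
A hedged sketch of the stage, bulk COPY, and Snowpipe work listed above, using the snowflake-connector-python client; the account, warehouse, stage, pipe, and table names are hypothetical, and the AUTO_INGEST pipe assumes an external stage with cloud notifications already configured.

    import snowflake.connector  # pip install snowflake-connector-python

    # Hypothetical connection parameters for illustration only.
    conn = snowflake.connector.connect(
        account="xy12345", user="etl_user", password="****",
        warehouse="LOAD_WH", database="SALES_DB", schema="STAGING",
    )
    cur = conn.cursor()

    # Internal stage plus a bulk COPY, with a light transformation during the load.
    cur.execute("CREATE OR REPLACE STAGE orders_stage FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)")
    cur.execute("PUT file:///data/orders/*.csv @orders_stage")  # upload local files to the stage
    cur.execute("""
        COPY INTO orders (order_id, order_date, amount_usd)
        FROM (SELECT $1, TO_DATE($2, 'YYYY-MM-DD'), $3::NUMBER(12,2) FROM @orders_stage)
        ON_ERROR = 'CONTINUE'
    """)

    # A pipe of the same shape gives continuous loading (Snowpipe) from an external stage.
    cur.execute("""
        CREATE PIPE IF NOT EXISTS orders_pipe AUTO_INGEST = TRUE AS
        COPY INTO orders FROM @orders_ext_stage FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)

    cur.close()
    conn.close()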
