Sr Data Engineer Resume
Omaha, NE
SUMMARY
- 7+ years of overall IT experience, with a focus on Big Data technologies: designing and implementing data pipelines using AWS Cloud, Java, and Spark (Python/Scala).
- Expertise in Hadoop, HDFS, MapReduce and the Hadoop ecosystem, including Hive, HBase, HBase-Hive integration, Spark Core, Spark SQL, Kafka, Sqoop, and Oozie.
- Good understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, ApplicationMaster, ResourceManager, NodeManager, and the MapReduce programming paradigm.
- Experience in AWS services such as EMR, S3, CloudFormation stacks, Glue, Redshift, Athena, Aurora RDS, CloudWatch, SNS, Lambda, and Step Functions.
- Good exposure to and experience with Spark, Scala, Big Data, and the AWS stack.
- Used Spark modules including Spark Core, Spark SQL, Spark Streaming, Datasets, and DataFrames.
- Used Spark DataFrame operations to validate incoming data and to perform analytics on Hive data.
- Worked on Spark Streaming and Structured Streaming with Apache Kafka for real-time data processing.
- Strong experience troubleshooting long-running Spark applications, designing highly fault-tolerant Spark applications, and fine-tuning them.
- Developed multiple Kafka producers and consumers from scratch per the software requirement specifications.
- Extracted real-time feeds using Kafka and Spark Streaming, processed the data as DataFrames, and saved it in JSON/ORC format in HDFS (see the sketch at the end of this summary).
- Experienced in shell scripting, used extensively to automate deployment and configuration-management tasks.
- Developed cross-platform products while working with Hadoop file formats such as ORC, Avro, Parquet, JSON, and delimited files.
- Analyzed data with HiveQL, Pig Latin, and MapReduce programs in Java.
- Extended Hive core functionality by implementing custom UDFs.
- Performed ad-hoc queries on structured data using HiveQL, and used partitioning, bucketing, and join techniques in Hive for faster data access.
- Imported and exported data between HDFS/Hive and relational databases using Sqoop.
- Hands-on experience creating RDDs and DataFrames for the required input data and performing data transformations using Spark and Scala.
- Hands-on experience with Hortonworks and Cloudera Hadoop environments.
- Experienced in working with NoSQL databases like Cassandra and HBase.
- Proficient in SQL: querying, data extraction/transformation, and developing queries for a wide range of applications.
- Experienced with Waterfall, Agile, and Scrum software development processes.
- Strong knowledge of version control systems such as Bitbucket and GitHub.
- Good experience in Core Java, Java EE technologies, JDBC, Servlets, and JSP.
- Active team player with excellent interpersonal skills; a keen learner with self-commitment and innovation, able to meet deadlines and handle pressure while coordinating multiple tasks.
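Below is a minimal, illustrative PySpark Structured Streaming sketch of the Kafka-to-HDFS flow described above; the broker address, topic name, event schema, and paths are assumed placeholders rather than project values.

```python
# Minimal PySpark Structured Streaming sketch: Kafka JSON feed -> ORC files on HDFS.
# Requires the spark-sql-kafka connector package on the classpath; the broker, topic,
# schema, and paths below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
       .option("subscribe", "events")                     # placeholder topic
       .load())

# Kafka delivers the value as bytes; cast to string and parse the JSON payload.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("orc")
         .option("path", "hdfs:///data/events")            # placeholder output path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .start())
query.awaitTermination()
```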
TECHNICAL SKILLS
Hadoop/Big Data Technologies: HDFS, Hive, HBase, Sqoop, Yarn, Spark, Spark SQL, Kafka
Hadoop Distributions: Cloudera and AWS EMR
Languages: Java, Python, Scala
Reporting: Tableau
Operating Systems: Linux, Unix and Windows
Databases: Teradata, Oracle, DB2, SQL Server, MySQL
Build Tools: Maven, Ant, Jenkins
Version Control: Git, SVN, CVS
PROFESSIONAL EXPERIENCE
Sr Data Engineer
Confidential, Omaha, NE
Responsibilities:
- Migrated legacy IBM DataStage ETL pipelines to containerized Python applications.
- Designed and implemented a scalable PySpark data-processing framework for incoming customer information.
- Implemented data-quality checks on incoming data per business requirements in PySpark (see the sketch after this list).
- Communicated with the client and gathered requirements for change requests.
- Deployed, scheduled, and executed applications in Cybermation.
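A minimal sketch of the kind of PySpark data-quality check described above; the column names, rules, and paths are illustrative assumptions, not the actual client specification.

```python
# Minimal PySpark data-quality sketch: flag rows that fail simple business rules and
# split the feed into clean and rejected sets. Columns, rules, and paths are
# illustrative assumptions.
from functools import reduce

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-checks-sketch").getOrCreate()
customers = spark.read.parquet("/data/incoming/customers")   # placeholder input path

# One boolean rule per quality check.
checks = {
    "missing_customer_id": F.col("customer_id").isNull(),
    "bad_email": ~F.col("email").rlike(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "negative_balance": F.col("balance") < 0,
}

# Add one flag column per rule, then split the data on "failed any rule".
flagged = customers
for name, condition in checks.items():
    flagged = flagged.withColumn(name, condition)

failed_any = reduce(lambda a, b: a | b, [F.col(name) for name in checks])
rejected = flagged.filter(failed_any)
clean = flagged.filter(~failed_any).drop(*checks.keys())

rejected.write.mode("overwrite").parquet("/data/rejected/customers")   # placeholder
clean.write.mode("overwrite").parquet("/data/clean/customers")         # placeholder
```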
Environment: Unix, Docker, Jenkins, Cybermation, IBM DataStage, DB2.
Sr Data Engineer
Confidential, Detroit, MI
Responsibilities:
- Worked on building centralized Data Lake on AWS Cloud utilizing primary services like S3, EMR, Redshift and Athena.
- Worked on migrating datasets and ETL workloads from on-premises systems to AWS Cloud services.
- Built a series of PySpark applications and Hive scripts to produce various analytical datasets needed by digital marketing teams.
- Worked extensively on building and automating data ingestion pipelines, moving terabytes of data from existing data warehouses to the cloud.
- Worked extensively on fine-tuning Spark applications and providing production support for various pipelines running in production.
- Worked closely with business and data science teams to ensure all requirements were translated accurately into our data pipelines.
- Worked on the full spectrum of data engineering pipelines: data ingestion, data transformation, and data analysis/consumption.
- Worked with Azure Databricks, Azure Data Factory, PySpark, and other relevant technologies in the Microsoft Azure cloud.
- Worked on automating infrastructure setup, including launching and terminating EMR clusters.
- Created Hive external tables on top of datasets loaded into S3 buckets and wrote Hive scripts to produce a series of aggregated datasets for downstream analysis (see the first sketch after this list).
- Built a real-time streaming pipeline utilizing Kafka, Spark Streaming, and Redshift.
- Created Kafka producers using the Kafka Java Producer API to connect to an external REST live-stream application and publish messages to Kafka topics (see the second sketch after this list).
- Used Talend for data integration.
- Developed microservice onboarding tools leveraging Python and Jenkins, allowing easy creation and maintenance of build jobs and Kubernetes deployments and services.
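An illustrative sketch of a Hive external table over an S3 dataset plus a small aggregation, expressed through spark.sql for consistency with the other examples; the bucket, database, table, and column names are assumed placeholders.

```python
# Sketch: Hive external table over an S3 dataset plus a small aggregation via spark.sql.
# Bucket, database, table, and column names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-on-s3-sketch")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS marketing")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS marketing.click_events (
        user_id    STRING,
        campaign   STRING,
        clicked_at TIMESTAMP
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 's3://example-bucket/click_events/'
""")

# Register any partitions already present under the S3 prefix.
spark.sql("MSCK REPAIR TABLE marketing.click_events")

# Aggregate clicks per campaign per day for downstream analysis.
daily = spark.sql("""
    SELECT event_date, campaign, COUNT(*) AS clicks
    FROM marketing.click_events
    GROUP BY event_date, campaign
""")
daily.write.mode("overwrite").saveAsTable("marketing.daily_campaign_clicks")
```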
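The project used the Kafka Java Producer API; below is an equivalent minimal sketch in Python (kafka-python) showing the same idea of reading a REST live stream and publishing JSON messages, with a hypothetical URL, broker, and topic.

```python
# Minimal Kafka producer sketch: poll a REST live stream and publish JSON messages.
# The URL, broker address, and topic name are hypothetical placeholders; the actual
# project used the Kafka Java Producer API rather than kafka-python.
import json

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",                        # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

with requests.get("https://example.com/live-feed", stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        event = json.loads(line)
        # Key by an id field (if present) so related events land in the same partition.
        producer.send("live-events",
                      key=str(event.get("id", "")).encode("utf-8"),
                      value=event)

producer.flush()
```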
Environment: AWS S3, EMR, Redshift, Athena, Glue, Spark, Python, Java, Hive, Kafka, IAM Roles
Sr Big Data Engineer
Confidential, San Francisco, CA
Responsibilities:
- Built custom input adapters for ingesting gigabytes of behavioral event logs per day from external sources such as FTP servers and S3 buckets.
- Created Sqoop scripts to import/export user profile and other lookup data from RDBMS to the S3 data store.
- Developed various Spark applications using Python (PySpark) to perform cleansing, transformation, and enrichment of the clickstream data.
- Involved in data cleansing, event enrichment, data aggregation, de-normalization, and data preparation needed for machine learning and reporting.
- Utilized the Spark RDD, DataFrame, and Spark SQL APIs to implement batch processing jobs.
- Troubleshot Spark applications to improve error tolerance and reliability.
- Fine-tuned Spark applications/jobs to improve efficiency and overall processing time of the pipelines.
- Created Kafka producers to send live-stream JSON data to various Kafka topics.
- Developed Spark Streaming applications to consume data from Kafka topics and insert the processed streams into Snowflake.
- Utilized Spark's in-memory capabilities to handle large datasets.
- Used broadcast variables in PySpark, efficient joins, transformations, and other capabilities for data processing (see the first sketch after this list).
- Created new jobs and updated existing jobs using AutoSys.
- Experienced in working with EMR clusters and S3 in the AWS cloud.
- Created Hive tables and loaded and analyzed data using Hive scripts.
- Implemented partitioning (both dynamic and static partitions) and bucketing in Hive (see the second sketch after this list).
- Involved in continuous integration of the application using Jenkins.
- Interacted with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.
- Used Kubernetes to deploy, load-balance, scale, and manage Docker containers.
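A minimal sketch of a broadcast join in PySpark, the kind of join optimization mentioned above; paths and column names are illustrative.

```python
# Broadcast join sketch: a small lookup DataFrame is shipped to every executor so the
# large event DataFrame avoids a shuffle. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

events = spark.read.parquet("s3://example-bucket/events/")       # large fact data
profiles = spark.read.parquet("s3://example-bucket/profiles/")   # small lookup data

# broadcast() hints Spark to replicate the small side instead of shuffling both sides.
enriched = events.join(broadcast(profiles), on="user_id", how="left")
enriched.write.mode("overwrite").parquet("s3://example-bucket/enriched_events/")
```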
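The bullet above refers to Hive partitioning and bucketing; the following sketch shows the same partition-plus-bucket layout written through Spark's DataFrame writer rather than raw HiveQL, with assumed table and column names.

```python
# Sketch of writing a partitioned, bucketed table from PySpark. This uses Spark's own
# writer (not Hive DDL); the database, table, and column names are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partition-bucket-sketch")
         .enableHiveSupport()
         .getOrCreate())

events = spark.table("analytics.events_staging")   # assumed existing staging table

# Partition by date (one directory per event_date value, derived from the data itself)
# and bucket by user_id so joins and lookups on user_id read fewer files.
(events.write
    .mode("overwrite")
    .format("orc")
    .partitionBy("event_date")
    .bucketBy(32, "user_id")
    .sortBy("user_id")
    .saveAsTable("analytics.events_bucketed"))
```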
Environment: Spark, Kafka, Hive, Java, Scala, S3, EMR, Redshift, Athena, Glue
Big Data Engineer
Confidential, NYC, NY
Responsibilities:
- Used Sqoop to import and export data between Oracle/PostgreSQL and HDFS for analysis.
- Migrated existing MapReduce programs to Spark models using Python.
- Migrated data from the data lake (Hive) into S3 buckets.
- Performed data validation between the data present in the data lake and the S3 buckets.
- Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data.
- Designed batch-processing jobs using Apache Spark, achieving roughly ten-fold speedups over the equivalent MapReduce jobs.
- Used Kafka for real-time data ingestion.
- Created separate Kafka topics and read data from them.
- Moved data from S3 buckets into the Snowflake data warehouse for generating reports.
- Wrote Hive queries for data analysis to meet business requirements.
- Migrated an existing on-premises application to AWS.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting.
- Created many Spark UDFs and Hive UDAFs for functions not available out of the box in Hive and Spark SQL (see the sketch after this list).
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Implemented performance-optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Good knowledge of Spark tuning parameters such as memory, cores, and executors.
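A minimal PySpark UDF sketch along the lines of the custom functions mentioned above; the masking rule and names are illustrative assumptions.

```python
# Minimal PySpark UDF sketch: register a Python function for use from DataFrames and
# Spark SQL when no built-in covers the logic. The masking rule is an assumption.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

def mask_account(account_number):
    """Keep only the last four characters of an account number."""
    if account_number is None:
        return None
    return "*" * max(len(account_number) - 4, 0) + account_number[-4:]

mask_account_udf = udf(mask_account, StringType())
spark.udf.register("mask_account", mask_account, StringType())   # usable in SQL too

accounts = spark.createDataFrame([("1234567890",), (None,)], ["account_number"])
accounts.select(mask_account_udf("account_number").alias("masked")).show()
spark.sql("SELECT mask_account('9876543210') AS masked").show()
```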
Environment: Apache Hadoop Framework, HDFS, YARN, HIVE, HBASE, AWS (S3, EMR), Scala, Spark, SQOOP
Java Developer
Confidential
Responsibilities:
- Involved in requirements analysis and the design of an object-oriented domain model.
- Involved in detailed documentation and writing functional specifications for the module.
- Involved in development of the application with Java and J2EE technologies.
- Developed and maintained a services-based architecture utilizing open-source technologies such as Hibernate ORM and the Spring Framework.
- Developed server-side services using Java multithreading, Struts MVC, EJB, Spring, and web services (SOAP, WSDL, Axis).
- Responsible for developing the DAO layer using Spring MVC and Hibernate configuration XMLs, and for managing CRUD operations (insert, update, and delete).
- Designed, developed, and implemented JSPs in the presentation layer for the Submission, Application, and Reference implementations.
- Developed JavaScript for client-side data-entry and front-end validation.
- Deployed web, presentation, and business components on the Apache Tomcat application server.
- Developed PL/SQL procedures for different use-case scenarios.
- Involved in post-production support and testing; used JUnit for unit testing of the module.
Environment: Java/J2EE, JSP, XML, Spring Framework, Hibernate, Eclipse (IDE), JavaScript, Ant, SQL, PL/SQL, Oracle, Windows, UNIX, SOAP, Jasper Reports.