Sr. Data Engineer Resume
Columbus, OH
SUMMARY
- 8+ years of professional IT experience, including 4+ years of hands-on experience with Hadoop on Cloudera and Hortonworks distributions; working environment includes MapReduce, HDFS, HBase, ZooKeeper, Hive, and Sqoop, with AWS and Azure cloud providers.
- Good experience in data analysis, cleansing, transformation, migration, integration, import, and export using ETL tools such as Informatica and SSIS.
- Experience in designing and developing data pipelines for data ingestion and transformation using Python and PySpark.
- Hands-on experience deploying machine learning pipelines with Docker and Kubernetes on cloud platforms.
- Designed and implemented Apache Spark streaming applications using Python and Scala.
- Hands-on experience with HDFS, Hive, Pig, the Hadoop MapReduce framework, Sqoop, and Spark.
- In-depth understanding of Hadoop architecture and its components, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, MapReduce, and Spark.
- Experienced in processing different file formats, such as Avro, XML, JSON, and SequenceFile, using MapReduce programs.
- Experience in database design using PL/SQL to write stored procedures, functions, and triggers, and strong experience writing complex queries for Oracle.
- Working experience importing and exporting data between relational database systems (RDBMS) and HDFS using Sqoop.
- Well versed in big data on AWS cloud services such as EC2, S3, Glue, EMR, DynamoDB, and Redshift.
- Good experience developing SQL against relational databases such as Oracle and SQL Server.
- Capable of processing large sets of structured, semi-structured, and unstructured data and supporting systems application architecture.
- Proficient with development tools such as Eclipse, NetBeans, and Maven.
- In-depth understanding of MapReduce and AWS cloud concepts and their critical role in analyzing huge, complex datasets.
- Good knowledge of Impala, Mahout, Spark SQL, Storm, Avro, Kafka, Hue, and AWS.
- Excellent understanding of data warehousing concepts and best practices.
- Involved in the full development life cycle of data warehousing projects; expertise in enhancements, bug fixes, performance tuning, troubleshooting, impact analysis, and research.
- Experience in data processing, including collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.
- Strong problem-solving and analytical skills, with the ability to make balanced, independent decisions.
- Experience in analyzing data using HiveQL, Pig Latin, and custom MapReduce programs in Java.
- Good team player with strong interpersonal, organizational, and communication skills, combined with self-motivation, initiative, and project-management abilities.
PROFESSIONAL EXPERIENCE
Sr. Data Engineer
Confidential, Columbus, OH
Responsibilities:
- Implemented Sqoop for large data transfers from RDBMS to HDFS/HBase/Hive and vice-versa.
- Implemented partitioning and bucketing in Hive for better organization of the data.
- Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data sets processing and storage.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
- Developed Spark jobs to clean data obtained from various feeds to make it suitable for ingestion into Hive tables for analysis.
- Responsible for managing data coming from different sources.
- Imported and exported data between environments such as MySQL and HDFS and deployed jobs into production.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
- Worked on processing data from Kafka topics using Spark Structured Streaming (see the sketch at the end of this section).
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
- Developed multiple Kafka producers and consumers from scratch per the requirement specifications.
- Worked on writing the data to Snowflake from PySpark applications.
- Developed ETL extracts from Snowflake and SQL Server databases to send data to various vendors and automated them using scheduling tools.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Developed Spark Application by using Python (PySpark).
- Ran short-term ad-hoc queries and jobs on data stored in S3 using AWS EMR.
- Migrated ETL jobs to Pig scripts that perform transformations, joins, and pre-aggregations before storing the data in HDFS.
- Stored the output of data in Avro and Parquet file formats for better performance.
- Wrote Hive jobs to parse the logs and structure them in a tabular format to facilitate effective querying of the log data.
- Implemented a continuous delivery pipeline with Docker and GitHub.
- Involved in SQL development, unit testing, and performance tuning, and ensured testing issues were resolved based on defect reports.
- Developed Spark code using Scala and Spark-SQL for faster testing and data processing.
- Involved in preparing SQL and PL/SQL coding convention and standards.
- Involved in Data mapping specifications to create and execute detailed system test plans.
Environment: HBase, Sqoop, HDFS, Hive, AWS, EC2, S3, EMR, Snowflake, SQL Server, PySpark, Spark, Kafka, Python, MySQL, Scala, Hadoop, MapReduce, Kubernetes, Docker.
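Illustration: a minimal PySpark sketch (not the production code) of the Kafka-to-Parquet streaming flow described in this section, assuming JSON messages on the feed; the broker address, topic name, schema, and HDFS paths are hypothetical placeholders, and the spark-sql-kafka connector is assumed to be available on the cluster.

```python
# Sketch: read a real-time Kafka feed with Spark Structured Streaming,
# parse the JSON payload into columns, and persist it as Parquet on HDFS.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-feed-to-parquet").getOrCreate()

# Assumed schema of the JSON messages on the feed topic (placeholder fields).
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
       .option("subscribe", "events-topic")                 # placeholder topic
       .option("startingOffsets", "latest")
       .load())

# Kafka values arrive as bytes; cast to string and parse the JSON payload.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events/parquet")      # placeholder output path
         .option("checkpointLocation", "hdfs:///chk/events")  # needed so the job can recover offsets
         .outputMode("append")
         .start())

query.awaitTermination()
```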
Sr. Data Engineer
Confidential, CA
Responsibilities:
- Worked closely with data scientists on enterprise-level data gathering to predict consumer behavior, such as which products a user has bought, and to make recommendations based on recognized patterns.
- Involved in end-to-end data processing like ingestion, processing, quality checks, and splitting.
- Migrated data from Oracle and MySQL into HDFS using Sqoop and imported flat files of various formats into HDFS.
- Used Sqoop to load data from MySQL into HDFS on a regular basis.
- Proposed an automated system that uses shell scripts to schedule and run the Sqoop jobs.
- Developed various automated scripts for DI (Data Ingestion) and DL (Data Loading) using Python & Java MapReduce.
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
- Performed data validation by comparing record-wise counts between source and destination (see the sketch at the end of this section).
- Ran Spark jobs on a Databricks cluster and stored the data in Delta Lake tables on Azure Blob Storage.
- Used Amazon CloudWatch to monitor and track resources on AWS.
- Wrote Python and Scala code for Spark jobs.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
- Refined terabytes of patient data from different sources and created hive tables.
- Created various types of data visualizations using Python and Tableau.
- Developed MapReduce jobs for data cleaning and preprocessing.
- Imported and exported data between HDFS/Hive and an Oracle database using Sqoop.
- Responsible for managing data coming from different sources.
- Responsible for loading data from UNIX file systems into HDFS.
- Installed and configured Hive and wrote Hive UDFs.
- Wrote Pig scripts to process unstructured data and create structured data for use with Hive.
- Developed a Spark pipeline to transfer data from the data lake to Cassandra in the cloud to make the data available for decision-making.
- Developed a machine learning pipeline to predict patient risk scores from claims data.
Environment: Oracle, MySQL, Hadoop, HBase, Python, Tableau, Sqoop, HDFS, Hive, Spark, PySpark, Control-M, MapReduce, Azure Data Lake, Azure SQL Server
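Illustration: a minimal PySpark sketch of the source-versus-destination record-count validation mentioned above; the JDBC URL, credentials, and table names are hypothetical placeholders, and the real job would feed a scheduler alert rather than simply raising an error.

```python
# Sketch: compare the row count of a MySQL source table with the count
# landed in the corresponding Hive table after ingestion.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ingest-count-check")
         .enableHiveSupport()
         .getOrCreate())

# Count rows in the source table over JDBC (placeholder connection details).
source_count = (spark.read.format("jdbc")
                .option("url", "jdbc:mysql://mysql-host:3306/sales")
                .option("dbtable", "orders")
                .option("user", "etl_user")
                .option("password", "***")
                .load()
                .count())

# Count rows in the destination Hive table (placeholder database/table).
target_count = spark.table("staging.orders").count()

if source_count != target_count:
    raise ValueError(f"Count mismatch: source={source_count}, target={target_count}")
print(f"Validation passed: {source_count} records in both source and target")
```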
Data Engineer
Confidential, Bothell, WA
Responsibilities:
- Imported data from different relational data sources, such as RDBMS and Teradata, into HDFS using Sqoop.
- Imported bulk data into HBase using MapReduce programs.
- Performed analytics on time-series data stored in HBase using the HBase API.
- Designed and implemented Incremental Imports into Hive tables.
- Used the REST API to access HBase data and perform analytics (see the sketch at the end of this section).
- Worked on loading and transforming large sets of structured, semi-structured, and unstructured data.
- Installed and configured Hadoop HDFS, MapReduce, Pig, Hive, and Sqoop.
- Experience using Cloudera Impala for real-time query processing.
- Created dashboard reports using Power BI and Tableau.
- Used a variety of SSIS transformations, such as Multicast, Conditional Split, OLE DB Command, Script Component, and Lookup, while loading data into the destination.
- Experience with task workflow scheduling and monitoring technologies such as Oozie, as well as a thorough understanding of Zookeeper.
- Created Docker containers from scratch.
- Experience with Agile methodologies (Scrum).
- Managed Docker orchestration and Docker containerization using Kubernetes.
- Excellent knowledge of and expertise with ETL tools such as Alteryx, Matillion, and SSIS.
- Worked on administration activities, providing installation, upgrades, patching, and configuration for all Hadoop hardware and software components.
- Configured and implemented Apache Hadoop technologies, including the Hadoop Distributed File System (HDFS), the MapReduce framework, Pig, Hive, Sqoop, and Flume.
- Implemented Kerberos for authenticating all the services in Hadoop Cluster.
- Involved in collecting, aggregating, and moving data from servers to HDFS using Apache Flume.
- Wrote Hive jobs to parse the logs and structure them in a tabular format to facilitate effective querying of the log data.
- Wrote Python scripts to analyze the data.
- Developed shell and Python scripts to automate and provide control flow to Pig scripts.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Imported and exported data between NoSQL stores and HDFS.
- Worked with the Avro data serialization system to handle JSON data formats.
- Worked on different file formats, such as SequenceFiles, XML files, and Map files, using MapReduce programs.
- Involved in unit testing and delivered unit test plans and results documents using JUnit.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive scripts.
Environment: Teradata, Hadoop, HDFS, Kerberos, HBase, Python, MapReduce, Hive, Sqoop, Flume, Avro, JSON, Rest API, NoSQL
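Illustration: a minimal Python sketch of reading time-series cells through the HBase REST gateway, as referenced above; the gateway host, table name, column family, and row-key layout are hypothetical placeholders. The HBase REST API returns row keys, column qualifiers, and values base64-encoded, so they are decoded before use.

```python
# Sketch: fetch one row of time-series data from HBase over REST and
# compute a simple aggregate on the decoded cell values.
import base64
import requests

HBASE_REST = "http://hbase-rest-host:8080"  # placeholder REST gateway endpoint
TABLE = "sensor_ts"                          # placeholder table name

def get_row(row_key: str) -> dict:
    """Return {column: (timestamp, value)} for one row, base64-decoded."""
    resp = requests.get(f"{HBASE_REST}/{TABLE}/{row_key}",
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    cells = resp.json()["Row"][0]["Cell"]
    return {
        base64.b64decode(c["column"]).decode():
            (c["timestamp"], base64.b64decode(c["$"]).decode())
        for c in cells
    }

# Example: average the readings for one device/day (placeholder row key and column family).
row = get_row("device42#2016-03-01")
readings = [float(v) for column, (_, v) in row.items() if column.startswith("d:reading")]
print(sum(readings) / len(readings) if readings else "no readings")
```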
Data Engineer
Confidential
Responsibilities:
- Processed the raw log files from the set-top boxes using Java MapReduce code and shell scripts and stored them as text files in HDFS.
- Extensively involved in designing Azure Resource Manager templates and custom build steps using PowerShell.
- Worked with Apache Sqoop, Flume, Java MapReduce programs, Hive queries, and Pig scripts.
- Generated the required reports for the operations team from the ingested data using Oozie workflows and Hive queries.
- Worked with reporting and BI tools such as Microsoft SQL Server Reporting Services (SSRS) and SAP Crystal Reports.
- Worked on NoSQL database systems, such as MongoDB and CouchDB
- Wrote MapReduce code to turn unstructured and semi-structured data into structured data and load it into Hive tables (see the sketch at the end of this section).
- Coordinated all Scrum ceremonies, including sprint planning, daily standups, sprint retrospectives, sprint demos, story grooming, and release planning.
- Involved in a Spark Streaming solution for time-sensitive, revenue-generating reports to keep pace with data from upstream set-top box (STB) systems.
- Worked on the SFDC OData connector to get data from Node.js services, which in turn fetch the data from HBase.
- Used AWS S3 to store and retrieve data for external applications.
- Responsible for functional requirements gathering, code reviews, deployment scripts, and procedures, offshore coordination, and on-time deliverables.
- Leveraged Google Cloud Platform Services to process and manage the data from streaming and file-based sources
- Designed, configured, and deployed Microsoft Azure for a multitude of applications utilizing the Azure stack (including Compute, Web & Mobile, Blobs, Resource Groups, Azure SQL, Cloud Services, and ARM), focusing on high availability, fault tolerance, and auto-scaling.
Environment: Apache Hadoop, HDFS, Pig, Hive, Flume, Kafka, MapReduce, Sqoop, Spark, Oozie, Linux, NodeJS, SFDC, ODATA, AWS, Agile Scrum, GCP
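Illustration: the log-structuring step above was written as Java MapReduce; the Hadoop Streaming mapper below is a simplified Python sketch of the same idea, with a hypothetical set-top-box log layout and target Hive columns. It would be run map-only (for example via hadoop-streaming with -numReduceTasks 0), writing tab-delimited records under the location of a Hive external table.

```python
#!/usr/bin/env python
# Sketch of a map-only Hadoop Streaming mapper: parse semi-structured set-top-box
# log lines into tab-delimited fields that line up with a Hive table's columns.
import re
import sys

# Assumed line layout, e.g. "2016-07-01 12:30:05 STB-1234 CHANNEL_CHANGE ch=105"
LOG_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<device>\S+) (?P<event>\S+) (?P<attrs>.*)"
)

for line in sys.stdin:
    match = LOG_PATTERN.match(line.strip())
    if not match:
        continue  # drop malformed records
    # Emit columns for a Hive table such as:
    #   stb_events(event_date STRING, event_time STRING, device_id STRING,
    #              event_type STRING, attrs STRING)
    #   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    print("\t".join([match.group("date"), match.group("time"),
                     match.group("device"), match.group("event"),
                     match.group("attrs")]))
```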
Data Analyst
Confidential
Responsibilities:
- Exported the analyzed data to the Relational databases using Sqoop for performing visualization and generating reports for the Business Intelligence team.
- Developed simple to complex MapReduce jobs.
- Analyzed the data by running Hive queries and Pig scripts to understand user behavior, and created partitioned tables in Hive as part of the role.
- Administered and supported the Hortonworks distribution.
- Wrote Korn shell, Bash, and Perl scripts to automate most database maintenance tasks.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Monitored running MapReduce programs on the cluster.
- Responsible for loading data from UNIX file systems to HDFS.
- Created documents and executed software designs in PHP involving complicated workflows and multiple product areas.
- Owned all requests for an alternate UNIX/Oracle-based system, including bug fixes, change requests, and tuning, and performed implementation, testing, and documentation for it.
- Consulted with project managers, business analysts, and development teams on application development and business plans.
- Installed and configured Hive and Created Hive UDFs.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that invoke and run MapReduce jobs in the backend.
- Implemented the workflows using the Apache Oozie framework to automate tasks.
- Developed scripts and automated data management from end to end and sync up between the clusters.
- Designed, developed, tested, and deployed Power BI scripts and performed detailed analytics.
- Performed DAX queries and functions in Power BI.
Environment: Apache Hadoop, Java, Bash, ETL, MapReduce, Hive, Pig, Hortonworks, deployment tools, Data tax, flat files, Oracle 11g/10g, MySQL, Windows NT, UNIX, Sqoop, Oozie, Tableau.