Senior Big Data Engineer Resume
Round Rock, TX
SUMMARY
- 8+ years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
- Proficient with HDFS, Hive, Sqoop, Pig, Oozie, HBase, NiFi, Spark, Scala, Kafka, ZooKeeper, and ETL tooling (DataStage).
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Hands-on experience installing, configuring, and using Hadoop ecosystem components such as Hadoop MapReduce, HDFS, HBase, Hive, Sqoop, Pig, ZooKeeper, and Flume.
- Hands-on experience with Test-Driven Development and Software Development Life Cycle (SDLC) methodologies such as Agile and Scrum.
- Developed Python scripts to call REST APIs and extract data to AWS S3 (see the sketch at the end of this summary).
- Worked extensively with AWS components such as Elastic MapReduce (EMR).
- Experience collecting log and JSON data into HDFS using Flume and processing it with Hive/Pig.
- Experience transferring streaming data from different sources into HDFS and NoSQL databases using Apache Flume; cluster coordination services through ZooKeeper.
- Installation, configuration, and administration experience with Big Data platforms: Cloudera Manager (Cloudera) and MCS (MapR).
- Experience writing Sqoop scripts for importing and exporting data between RDBMS and HDFS.
- Developed Python code to gather data from HBase and designed the solution for implementation in Spark.
- Developed custom Kafka producers and consumers for publishing to and subscribing to Kafka topics.
- Extensive experience with Hadoop architecture and its components, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce concepts.
- Experience with other Hadoop ecosystem tools such as ZooKeeper, Oozie, and Impala.
- Experience working with Hortonworks and Cloudera environments.
- Good knowledge of implementing data processing techniques using Apache HBase to handle and format data as required.
- Installed and configured Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
- Experience using build/deploy tools such as Jenkins, Docker, and OpenShift for continuous integration and deployment of microservices.
- Good working experience with Spark (Spark Streaming, Spark SQL), Scala, and Kafka; worked on reading multiple data formats from HDFS using Scala.
- Used Hive analytic functions such as RANK, DENSE_RANK, and ROW_NUMBER for in-depth analysis.
- Strong experience using HDFS, MapReduce, Hive, Spark, Sqoop, Oozie, and HBase.
- Used the Databricks XML plug-in to parse incoming data in XML format and generate the required XML output.
- Measured the efficiency of the Hadoop/Hive environment, ensuring SLAs were met.
- Ability to work effectively in cross-functional team environments, with excellent communication and interpersonal skills.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
- Experience using Kafka brokers to initiate Spark contexts and process live streaming data.
- Worked on Spark SQL: created DataFrames by loading data from Hive tables, prepared the data, and stored it in AWS S3.
- Hands-on experience using other Amazon Web Services such as Auto Scaling, Redshift, DynamoDB, and Route 53.
- Excellent programming skills with experience in Java, C, SQL, and Python.
- Worked with various programming languages using IDEs such as Eclipse, NetBeans, and IntelliJ, along with tools such as PuTTY and Git.
- Experienced working with SDLC, Agile, and Waterfall methodologies.
- Excellent experience designing and developing enterprise applications for the J2EE platform using Servlets, JSP, Struts, Spring, Hibernate, and web services.
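As referenced in the REST-API bullet above, a minimal Python sketch of the REST-to-S3 extraction pattern. The endpoint URL, bucket name, and key prefix are hypothetical placeholders, not values from the actual projects.

```python
# Minimal sketch of the REST-to-S3 extraction pattern; names below are hypothetical.
import json
import boto3
import requests

API_URL = "https://api.example.com/v1/records"   # hypothetical endpoint
BUCKET = "raw-data-landing"                      # hypothetical S3 bucket
PREFIX = "rest_extracts"

def extract_to_s3():
    """Call the REST API and land the JSON payload in S3 as-is."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    payload = response.json()

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{PREFIX}/records.json",
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
    )

if __name__ == "__main__":
    extract_to_s3()
```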
TECHNICAL SKILLS
Hadoop Eco System: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Flume, Apache Storm, Apache Airflow, HBase
Programming Languages: Java, PL/SQL, SQL, Python, Scala, PySpark, C, C++
Cluster Mgmt & Monitoring: CDH 4, CDH 5, Hortonworks Ambari 2.5
Databases: Oracle 9i/10g/11g/12c, MySQL, SQL Server, MS Access
NoSQL Databases: MongoDB, Cassandra, HBase
Workflow Mgmt Tools: Oozie, Apache Airflow
Visualization & ETL tools: Tableau, D3.js, Informatica, Talend
Cloud Technologies: AWS, MS Azure, Snowflake
IDEs: Eclipse, Jupyter Notebook, Spyder, PyCharm, IntelliJ
Version Control Systems: Git, SVN
Operating Systems: Unix, Linux, Windows
PROFESSIONAL EXPERIENCE
Confidential, Round Rock, TX
Senior Big Data Engineer
Responsibilities:
- Created YAML files for each data source, including Glue table stack creation.
- Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
- Developed Lambda functions and assigned IAM roles to run Python scripts, along with various triggers (SQS, EventBridge, SNS).
- Experience developing Kafka producers and consumers for streaming millions of events per second.
- Involved in the design and deployment of a Hadoop cluster and various Big Data analytic tools, including Pig, Hive, HBase, Oozie, ZooKeeper, Sqoop, Flume, Spark, Impala, and Cassandra, with the Hortonworks distribution.
- Created a Lambda deployment function and configured it to receive events from S3 buckets.
- Experienced working with MapReduce design patterns to solve complex MapReduce programs.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Experience with job workflow scheduling and monitoring tools like Oozie, and good knowledge of ZooKeeper.
- Wrote UNIX shell scripts to automate jobs and scheduled cron jobs with crontab.
- Implemented techniques with Spark for improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and Spark on YARN.
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python, along with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
- Developed various mappings with sources, targets, and transformations using Informatica Designer.
- Involved in writing custom MapReduce, Pig, and Hive programs.
- Worked on ingesting data through cleansing and transformation, leveraging AWS Lambda, AWS Glue, and Step Functions.
- Configured ZooKeeper to coordinate and support Kafka, Spark, Spark Streaming, HBase, and HDFS.
- Used Sqoop to channel data between RDBMS sources and HDFS.
- Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean, consistent data.
- Experienced in implementing a log producer in Scala that watches application logs, transforms incremental logs, and sends them to a Kafka- and ZooKeeper-based log collection platform.
- Automated data processing and loading into the Hadoop Distributed File System with Oozie.
- Hands-on experience installing and configuring Cloudera Apache Hadoop ecosystem components such as Flume, HBase, ZooKeeper, Oozie, Hive, Sqoop, and Pig.
- Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau to HiveServer2 for generating interactive reports.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats (see the sketch after this list).
- Involved in converting MapReduce programs into Spark transformations using Spark RDDs with Scala and Python.
- Used Spark Streaming with Python to receive real-time data from Kafka and store the stream data in HDFS and NoSQL databases such as HBase and Cassandra.
- Installed and configured Sqoop to import and export data between Hive and relational databases.
- Collected data from an AWS S3 bucket using Spark Streaming in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Used Apache NiFi to copy data from local file system to HDP.
- Worked on dimensional and relational data modeling using star and snowflake schemas, OLTP/OLAP systems, and conceptual, logical, and physical data models in Erwin.
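A simplified PySpark sketch of the multi-format extraction, transformation, and aggregation work referenced above. The S3 paths, join key, and column names are illustrative assumptions rather than actual project code.

```python
# Illustrative PySpark sketch: read multiple file formats, aggregate, and persist.
# File paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("multi-format-etl").getOrCreate()

# Extract: the same pipeline can pull JSON and Parquet sources.
json_df = spark.read.json("s3://raw-zone/events/")          # hypothetical path
parquet_df = spark.read.parquet("s3://raw-zone/orders/")    # hypothetical path

# Transform: join and aggregate into a prepared data set.
prep_df = (
    json_df.join(parquet_df, on="customer_id", how="inner")
           .groupBy("customer_id")
           .agg(F.count("*").alias("event_count"),
                F.sum("order_amount").alias("total_spend"))
)

# Load: write the prepared data back to S3 as Parquet.
prep_df.write.mode("overwrite").parquet("s3://prep-zone/customer_summary/")
```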
Environment: Erwin, Big Data, Hadoop, Oracle, PL/SQL, Scala, Spark-SQL, PySpark, Python, Kafka, SAS, MDM, Oozie, SSIS, T-SQL, ETL, HDFS, Cosmos, Pig, Sqoop, MS Access.
Confidential, Burr Ridge, IL
Big Data Engineer
Responsibilities:
- Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
- Designed several DAGs (Directed Acyclic Graphs) for automating ETL pipelines (see the sketch after this list).
- Utilized Sqoop, Kafka, Flume, and Hadoop FileSystem APIs for implementing data ingestion pipelines.
- Responsible for data services and data movement infrastructures
- Worked on architecting the ETL transformation layers and writing Spark jobs to do the processing.
- Implemented data integration in developing large-scale system software, with experience in Hadoop ecosystem components such as HBase, Sqoop, ZooKeeper, Oozie, Hive, and Pig.
- Experienced in building a data warehouse on the Azure platform using Azure Databricks and Data Factory.
- Implemented Spark Kafka streaming to pick up data from Kafka and send it to the Spark pipeline.
- Wrote PySpark and Spark SQL transformations in Azure Databricks to perform complex transformations for business rule implementation.
- Designed and implemented Sqoop incremental and delta imports on tables without primary keys or dates from Teradata and SAP HANA, appending directly into the Hive warehouse.
- Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Worked with Confluence and Jira.
- Wrote Azure PowerShell scripts to copy or move data from the local file system to HDFS/Blob storage.
- Performed the migration of Hive and MapReduce jobs from on-premises MapR to the AWS cloud using EMR and Qubole.
- Involved in Sqoop implementation for loading data from various RDBMS sources into Hadoop systems and vice versa.
- Designed and implemented a configurable data delivery pipeline, built with Python, for scheduled updates to customer-facing data stores.
- Assisted with cluster maintenance, cluster monitoring, and adding/removing cluster nodes; installed and configured Hadoop, MapReduce, and HDFS; and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Created and maintained various Shell and Python scripts for automating processes, and optimized MapReduce code and Pig scripts for performance tuning and analysis.
- Analyzed the system for new enhancements/functionality and performed impact analysis of the application for implementing ETL changes.
- Integrated Apache Spark with Kafka to perform web analytics; uploaded clickstream data from Kafka to HDFS, HBase, and Hive by integrating with Spark.
- Built performant, scalable ETL processes to load, cleanse and validate data
- Collaborated with team members and stakeholders on the design and development of the data environment.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Installed and configured Hadoop ecosystem components.
- Wrote Flume configuration files for importing streaming log data into HBase.
- Imported several transactional logs from web servers with Flume to ingest the data into HDFS.
- Used Flume and a spool directory to load data from the local file system (LFS) into HDFS.
- Installed and configured Pig and wrote Pig Latin scripts to convert data from text files to Avro format.
- Created Partitioned Hive tables and worked on them using HiveQL.
- Loaded data into HBase using both bulk and non-bulk loads.
- Worked with Tableau: integrated Hive with Tableau Desktop reports and published them to Tableau Server.
- Developed MapReduce programs in Java for parsing raw data and populating staging tables.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Analyzed the SQL scripts and designed the solution to implement using Scala.
- Used Spark SQL to load JSON data, create SchemaRDDs, and load them into Hive tables, and handled structured data using Spark SQL.
- Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Tested Apache Tez for building high-performance batch and interactive data processing applications on Pig and Hive jobs.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Automated all the jobs for pulling data from the FTP server and loading it into Hive tables using Oozie workflows.
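A minimal sketch of the kind of ETL DAG referenced above, assuming Airflow 2.x. The DAG id, schedule, and the stubbed ingest/transform callables are hypothetical.

```python
# Minimal Airflow 2.x DAG sketch for an ETL pipeline; task logic is stubbed out.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    """Placeholder: pull source data (e.g., via Sqoop/Kafka/Flume ingestion jobs)."""
    pass

def transform():
    """Placeholder: run Spark/Hive transformations on the ingested data."""
    pass

default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="etl_pipeline",                 # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Run the transform step only after ingestion succeeds.
    ingest_task >> transform_task
```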
Environment: Hadoop, HDFS, MapReduce, Kafka, ZooKeeper, Hive, HBase, PL/SQL, Oracle, MongoDB, Azure, Data Factory, Databricks, Data Storage, Data Lake, Impala, Python, Pig, Oozie, Linux, Windows
Confidential, Charlotte, NC
Big Data Engineer
Responsibilities:
- Extensively used Agile methodology as the organization standard to implement the data models.
- Created several types of data visualizations using Python and Tableau.
- Extracted large data sets from AWS using SQL queries to create reports.
- Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database.
- Strong understanding of AWS components such as EC2 and S3
- Good knowledge of writing MapReduce jobs through Pig, Hive, and Sqoop.
- Involved in HBase setup and in storing data into HBase for further analysis.
- Worked on the implementation of a log producer in Scala that watches application logs, transforms incremental logs, and sends them to a Kafka- and ZooKeeper-based log collection platform.
- Implemented a Python codebase for branch management over Kafka features.
- Experience developing MapReduce programs using Apache Hadoop for analyzing big data as per requirements.
- Hands-on experience installing and configuring Cloudera Apache Hadoop ecosystem components such as Flume, HBase, ZooKeeper, Oozie, Hive, Sqoop, and Pig.
- Involved in analyzing system failures, identifying root causes, and recommending courses of action; documented system processes and procedures for future reference.
- Migrated data from on-premises systems to AWS storage buckets.
- Experience using ZooKeeper and Oozie operational services to coordinate clusters and schedule workflows.
- Designed and developed automated processes using shell scripting for data movement.
- Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
- Developed workflows in Oozie to automate the tasks of loading data into NiFi and pre-processing with Pig.
- Experience setting up the whole application stack, and setting up and debugging Logstash to send Apache logs to AWS Elasticsearch.
- Worked successfully with output results from the Kafka server.
- Extensively worked with partitions, dynamic partitioning, and bucketed tables in Hive; designed both managed and external tables; and worked on optimization of Hive queries.
- Designed Oozie workflows for job scheduling and batch processing.
- Developed Scala scripts using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
- Implemented data access jobs through Pig, Hive, Tez, Solr, Accumulo, HBase, and Storm.
- Created SAS ODS reports using SAS EG, SAS SQL, and OLAP Cubes.
- Implemented Spark Scala code for data validation in Hive.
- Implemented automated workflows for all jobs using Oozie and shell scripts.
- Used Spark SQL functions to move data from staging Hive tables to fact and dimension tables (see the sketch after this list).
- Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
- Defined Kafka offset storage in ZooKeeper.
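A simplified sketch of the Spark SQL stage-to-fact load referenced above. The database, table, and column names are hypothetical.

```python
# Sketch of moving data from a staging Hive table into a fact table with Spark SQL.
# Database, table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("stage-to-fact")
    .enableHiveSupport()
    .getOrCreate()
)

# Join the staging table to dimension tables and overwrite the fact table.
spark.sql("""
    INSERT OVERWRITE TABLE dw.fact_sales
    SELECT s.sale_id,
           d.date_key,
           c.customer_key,
           s.amount
    FROM   stage.sales s
    JOIN   dw.dim_date d     ON s.sale_date = d.calendar_date
    JOIN   dw.dim_customer c ON s.customer_id = c.customer_id
""")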
Environment: Big Data, Hadoop, Hive, Sqoop, HDFS, Scala, Spark, Python, Oozie, AWS, Lambda, Talend, SAS, Unix, Oracle
Confidential
Data Engineer
Responsibilities:
- Built machine learning models to showcase Big Data capabilities using PySpark and MLlib (see the sketch after this list).
- Enhanced the data ingestion framework by creating more robust and secure data pipelines.
- Implemented data streaming capability using Kafka and Talend for multiple data sources.
- Implemented a continuous delivery pipeline with Docker, GitHub, and AWS.
- Worked wif multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
- Managed the S3 data lake; responsible for maintaining and handling inbound and outbound data requests through the big data platform.
- Configured ZooKeeper and worked on Hadoop High Availability with the ZooKeeper failover controller to add support for a scalable, fault-tolerant data solution.
- Working knowledge of cluster security components like Kerberos, Sentry, SSL/TLS etc.
- Involved in the development of agile, iterative, and proven data modeling patterns that provide flexibility.
- Knowledge of implementing JILs to automate jobs in the production cluster.
- Worked wif SCRUM team in delivering agreed user stories on time for every Sprint.
- Worked on analyzing and resolving the production job failures in several scenarios.
- Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.
- Ingested data from various RDBMS sources.
- Wrote Python code to manipulate and organize data frames such that all attributes in each field were formatted identically.
- Created a PowerPoint presentation on the information discovered using data visualization techniques such as bar graphs and choropleth maps
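A minimal PySpark sketch of the kind of MLlib model referenced in the first bullet of this list. The input Hive table and feature columns are illustrative assumptions.

```python
# Minimal PySpark ML sketch: assemble features and fit a logistic regression model.
# The input table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").enableHiveSupport().getOrCreate()

df = spark.table("analytics.training_data")      # hypothetical Hive table
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

assembler = VectorAssembler(
    inputCols=["feature_1", "feature_2", "feature_3"],  # hypothetical features
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Fit the pipeline on the training split and score the held-out split.
model = Pipeline(stages=[assembler, lr]).fit(train_df)
predictions = model.transform(test_df)
predictions.select("label", "prediction").show(10)
```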
Environment: Hadoop, MapReduce, HDFS, Pig, HiveQL, MySQL, UNIX Shell Scripting, Tableau, Java, Spark, SSIS.
Confidential
Data Engineer
Responsibilities:
- Research and recommend suitable technology stack for Hadoop migration considering current enterprise architecture.
- Responsible for building scalable distributed data solutions using Hadoop.
- Developed Spark jobs and Hive Jobs to summarize and transform data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark data frames, Scala and Python.
- Experienced in developing Spark scripts for data analysis in both Python and Scala.
- Wrote Scala scripts to make Spark Streaming work with Kafka as part of Spark-Kafka integration efforts (a Python sketch of this pattern follows this list).
- Built on-premises data pipelines using Kafka and Spark for real-time data analysis.
- Created reports in TABLEAU for visualization of the data sets created and tested Spark SQL connectors.
- Implemented complex Hive UDFs to execute business logic within Hive queries.
- Developed different kinds of custom filters and handled pre-defined filters on HBase data using the API.
- Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
- Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive, and then loading the data into HDFS.
- Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
- Collected and aggregated large amounts of log data and staged the data in HDFS for further analysis.
- Experience in managing and reviewing Hadoop Log files.
- Used Sqoop to transfer data between relational databases and Hadoop.
- Worked on HDFS to store and access huge datasets within Hadoop.
- Good hands-on experience with GitHub.
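The streaming work above was written in Scala; the following is a Python sketch of the equivalent Spark Structured Streaming pattern for reading from Kafka and persisting to HDFS. The broker address, topic, and output paths are hypothetical, and the spark-sql-kafka connector is assumed to be on the classpath.

```python
# Python sketch of the Spark-Kafka streaming pattern (the project work itself was in Scala).
# Broker address, topic, and output paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Read a stream of events from Kafka and keep the message payload as a string.
events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
         .option("subscribe", "app-logs")                    # hypothetical topic
         .option("startingOffsets", "latest")
         .load()
         .select(col("value").cast("string").alias("message"))
)

# Persist the stream to HDFS as Parquet with checkpointing for fault tolerance.
query = (
    events.writeStream
          .format("parquet")
          .option("path", "hdfs:///data/streams/app_logs")          # hypothetical path
          .option("checkpointLocation", "hdfs:///checkpoints/app_logs")
          .start()
)

query.awaitTermination()
```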
Environment: Cloudera Manager (CDH5), HDFS, Sqoop, Pig, Hive, Oozie, Kafka, Flume, Java, Git.