Data Engineer Resume
Somerset
SUMMARY
- 7+ years of experience in software engineering, including over 4 years of Hadoop and Scala (Spark) development and administration and 2 years as a Data Analyst.
- Over 4 years of experience with Hadoop architecture and its components, such as HDFS NameNode and DataNode, MapReduce JobTracker and TaskTracker, and the MapReduce programming paradigm.
- Well versed in the Hadoop framework and in analysis, design, development, documentation, deployment, and integration using SQL and Big Data technologies.
- Experience using Hadoop ecosystem components such as HDFS, YARN, MapReduce, Spark, Pig, Sqoop, Hive, and Kafka.
- Experience with Data warehousing and Data mining, using one or more NoSQL Databases like HBase, Cassandra, and MongoDB.
- Experience in using Sqoop to ingest data from RDBMS to HDFS.
- Experience in cluster coordination using ZooKeeper; worked with file formats such as Text, ORC, Avro, and Parquet and compression techniques such as Gzip and Zlib.
- Experienced in using Python libraries such as NumPy, SciPy, python-twitter, and Pandas.
- Worked on visualization tools like Tableau for report creation and further analysis.
- Experienced with the Spark processing framework, including Spark SQL, and with data warehousing and ETL processes.
- Developed an end-to-end ETL pipeline using Spark SQL and Scala on the Spark engine; imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
- Experience with Spark Streaming and writing Spark jobs.
- Experience developing high throughput streaming applications from Kafka queues and writing enriched data back to outbound Kafka queues.
- Experience using Sqoop to move data from HDFS to relational database systems (RDBMS) such as Oracle, DB2, and SQL Server, and from RDBMS to HDFS.
- Good understanding of AWS S3, EC2, Kinesis, and DynamoDB.
- Used RStudio for data pre-processing and building machine learning algorithms on datasets.
- Good knowledge of applying NLP, statistical models, machine learning, and data mining solutions to various business problems using R and Python.
- Experienced in real-time analytics with Spark RDDs, DataFrames, and the Streaming API.
- Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data.
- Knowledge of integrating data from various sources such as RDBMS, spreadsheets, and text files.
- Good understanding of JIRA, including maintaining JIRA dashboards.
- Knowledge of Java IDEs such as Eclipse and IntelliJ.
- Used Maven for building projects.
- Ability to work independently as well as in a team and able to effectively communicate with customers, peers and management at all levels in and outside the organization.
- Hands-on experience with Hortonworks and Cloudera Hadoop environments.
- Provided production support, including root cause analysis, bug fixing, and prompt updates to business users on day-to-day production issues.
- Developed DAGs and automated the process for the data science teams.
- Developed ad hoc queries for moving data from HDFS to Hive and analyzed the data using HiveQL.
- Integrated Slack notifications with Jenkins deployments to notify the relevant users about deployments.
- Participated in daily Scrum meetings to discuss sprint progress and actively helped make the meetings more productive.
TECHNICAL SKILLS
Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, HBase, Kafka, Oozie, Spark, ZooKeeper, NiFi, Amazon Web Services
Machine Learning: Decision Tree, Neural Networks, ANN & RNN, PCA, SVM, K-NN, Deep learning.
Hadoop Technologies and Distributions: Apache Hadoop, Cloudera CDH5.13, MapR
Operating System: Linux (CentOS, Ubuntu), Windows (XP/7/8/10)
Languages: Java, Shell scripting, Pig Latin, Scala, Python, R
Databases: MySQL, Teradata, DB2, Oracle
NoSQL: HBase, Cassandra, MongoDB
Application Servers: Apache Tomcat, JDBC, ODBC
BI Tools: Power BI, Tableau, Talend
PROFESSIONAL EXPERIENCE
Confidential, Somerset
Data Engineer
Responsibilities:
- Involved in review of functional and non-functional requirements.
- Developed and maintained help desk metrics for an IT group supporting thousands of end users.
- Reviewed, analyzed, and evaluated incoming enhancement requests that required input from more than one team.
- Worked on large sets of structured, semi-structured, and unstructured data.
- Used Sqoop to import data from relational databases.
- Wrote multiple MapReduce jobs for data cleaning and preprocessing.
- Ran Hive queries on large datasets to generate insights.
- Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Generated intermediate outputs as Alteryx (.yxdb) and Tableau (.tde) files and published the final outputs of the Alteryx macros to Tableau Server.
- Assisted with data capacity planning and node forecasting.
- Designed and developed Spark jobs for streaming real-time data received from RabbitMQ and IBM MQ through Kafka and Spark Streaming.
- Worked with the Apache Spark streaming and batch frameworks; created Spark jobs for data transformation and aggregation.
- Designed workflows that schedule Hive processes for data ingested into HDFS using Sqoop.
- Developed Hive queries to process the data and generate datasets for visualization.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Used ZooKeeper to manage coordination among the clusters.
- Developed scripts and batch jobs to schedule various Hadoop programs.
- Wrote various Lambda services to automate functionality in the cloud.
- Reported the data to analysts for further tracking of trends among various consumers.
- Used Spark for interactive queries, processing of streaming data, and integration with NoSQL databases for large volumes of data.
- Worked with the DevOps team to cluster the NiFi pipeline on EC2 nodes, integrated with Spark, Kafka, and Postgres running on other instances, using SSL handshakes.
- Worked with CI/CD using Jenkins for timely builds and test runs.
- Worked on a NiFi data pipeline to process large data sets and configured lookups for data validation and integrity.
- Developed a Jenkins script integrated with the Git repository for build, testing, code review, and deployment of the built JAR file, shell scripts, and Oozie workflows to the destination HDFS paths.
Environment: Sqoop, MapReduce, Pig, Hive, Oozie, ZooKeeper, Java, Shell scripting, Spark, Spark SQL, Flume
Confidential, New York
Data Engineer
Responsibilities:
- Worked with Hive joins and used HQL to query the databases, eventually developing complex Hive UDFs.
- Installed the OS and administered the Hadoop stack with the Cloudera CDH5 distribution (with YARN), including configuration management, monitoring, debugging, and performance tuning.
- Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, and slots configuration.
- Installed Cloudera Manager and CDH, installed the JCE policy file, created a Kerberos principal for the Cloudera Manager Server, and enabled Kerberos using the wizard.
- Leveraged Chef to manage and maintain builds in various environments; planned hardware and software installation on the production cluster and coordinated with multiple teams to complete it.
- Conducted exploratory data analysis using Python's Matplotlib and Seaborn to identify underlying patterns and correlations between features.
- Worked with NoSQL databases such as HBase, creating tables to load large sets of semi-structured data coming from source systems.
- Configured Kerberos authentication in the cluster.
- Used the MapR File System, Ambari, and Cloudera Manager for installation and management of Hadoop clusters.
- Wrote Scala programs using Spark and Spark SQL to perform aggregations.
- Developed web services in the Play Framework using Scala to build a streaming data platform.
- Worked with data modelers to understand the financial data model and provided suggestions on the logical and physical data models.
- Performed table partitioning and monthly and yearly data archival activities.
- Developed Python scripts to collect Redshift CloudWatch metrics and automate loading the data points into the Redshift database.
- Developed scripts for loading application call logs to S3 and used AWS Glue ETL to load them into Redshift for the data analytics team.
- Installed IBM HTTP Server, WebSphere Plugins, and WebSphere Application Server Network Deployment (ND).
- Set up high availability for a major production cluster and designed automatic failover control using ZooKeeper and quorum journal nodes.
- Provided troubleshooting and best-practice methodology for development teams, including process automation and new application onboarding.
- Produced unit tests for Spark transformations and helper methods, and designed data processing pipelines.
- Configured IBM HTTP Server, WebSphere Plugins, and WebSphere Application Server Network Deployment (ND) for user workload distribution.
- Wrote multiple batch jobs to process hourly and daily data received from multiple sources such as Adobe and NoSQL databases.
- Tested the processed data against various test cases to verify that it met business requirements.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
Environment: Cloudera CDH5.13, Ambari, IBM WebSphere, Hive, Python, HBase, Spark, Scala, MapReduce, HDFS, Sqoop, AWS, Flume, Linux, Shell Scripting, Tableau
Confidential
Jr. Data Engineer
Responsibilities:
- Gathered business requirements, developed the strategy for data cleansing and data migration, wrote functional and technical specifications, created source-to-target mappings, designed data profiling and data validation jobs in Informatica, and created ETL jobs in Informatica.
- Worked on a Hadoop cluster that ranged from 4 to 8 nodes during the pre-production stage and was sometimes extended to 24 nodes in production.
- Built APIs that allow customer service representatives to access the data and answer queries.
- Designed changes to transform current Hadoop jobs to HBase.
- Fixed defects efficiently and worked with the QA and BA teams for clarifications.
- Responsible for cluster maintenance and monitoring, commissioning and decommissioning DataNodes, troubleshooting, and managing and reviewing data backups and log files.
- Extended the functionality of Hive with custom UDFs and UDAFs.
- The new Business Data Warehouse (BDW) improved query and report performance, reduced report development time, and established a self-service reporting model in Cognos for business users.
- Implemented bucketing and partitioning in Hive to assist users with data analysis.
- Used Oozie scripts to deploy the application and Perforce for secure version control.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Developed database management systems for easy access, storage, and retrieval of data.
- Performed database activities such as indexing, performance tuning, and backup and restore.
- Expertise in writing Hadoop jobs for analyzing data using HiveQL (queries), Pig Latin (data flow language), and custom MapReduce programs in Java.
- Applied performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Expert in creating Hive UDFs using Java to analyze the data efficiently.
- Responsible for loading data from the BDW Oracle database and Teradata into HDFS using Sqoop.
- Implemented AJAX, JSON, and JavaScript to create interactive web screens.
- Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries.
Environment: Cloudera CDH4.3, Hadoop, Pig, Hive, MapReduce, HDFS, Sqoop, Impala, Tableau, Flume, Oozie, Linux.
Confidential
Data Analyst
Responsibilities:
- Gathered all the sales analysis report prototypes from the business analysts in different business units.
- Worked with Master SSIS packages to execute a set of packages that load data from various sources onto the Data Warehouse on a timely basis.
- Involved in data extraction, transformation, and loading (ETL) from source systems.
- Responsible for ETL design: identifying the source systems, designing source-to-target relationships, data cleansing, data quality, and creating source specifications and ETL design documents.
- Cleansed customer information data received from legacy systems and transformed it into staging tables and target tables in DB2.
- Used external tables to transform and load data from legacy systems into target tables.
- Used data transformation tools such as DTS, SSIS, Informatica, or DataStage.
- Conducted Design reviews with the business analysts, content developers and DBAs.
- Designed, developed, and maintained Enterprise Data Architecture for enterprise data management including business intelligence systems, data governance, data quality, enterprise metadata tools, data modeling, data integration, operational data stores, data marts, data warehouses, and data standards.
- Performed daily incremental loads of the fact table from the source system to the staging table.
- Coded SQL stored procedures and triggers.
- Used various transformations in SSIS Data Flow and Control Flow, including For Loop containers and Fuzzy Lookups, and implemented event handlers and error handling in SSIS packages.
- Used Cloudera Navigator for auditing and viewing data.
- Extracted tables from various databases for code review.
- Generated document coding to create metadata names for database tables.
- Analyzed metadata and table data for comparison and confirmation.
- Adhered to document deadlines for assigned databases.
- Ran routine reports on a scheduled basis as well as ad hoc reports based on key performance indicators.
- Developed DataStage jobs to cleanse, transform, and load data to the Data Warehouse, and sequencers to encapsulate the DataStage job flow.
- Designed data visualizations to analyze and communicate findings.
Environment: Linux, Erwin, SQL Server 2000/2005, Crystal Reports 9.0, HTML, Data Stage Version 7.0, Oracle, Toad, MS Excel, Pow
