Data Engineer Resume
NJ
SUMMARY
- 8 years of overall IT experience across a variety of industries, including hands-on experience in Big Data analytics and development.
- Experience in collecting, processing, and aggregating large amounts of streaming data using Kafka and Spark Streaming.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
- Good knowledge of Apache NiFi for automating and managing data flow between systems.
- Experience in designing Data Marts following Star Schema and Snowflake Schema methodologies.
- Experienced in working with distributed ecosystems.
- Experience in data management and implementation of Big Data applications using Spark and Hadoop frameworks.
- Experience in analyzing data using Spark SQL, HiveQL, and Pig Latin.
- Familiarity with Amazon Web Services along with provisioning and maintaining AWS resources such as EMR, S3 buckets, EC2 instances, RDS and others.
- Hands-on experience building streaming applications using Spark Streaming and Kafka with minimal/no data loss or duplicates (see the streaming sketch after this list).
- Excellent technical and analytical skills with a clear understanding of design goals and development for OLTP and dimensional modeling for OLAP.
- Strong experience and knowledge of HDFS, MapReduce, and Hadoop ecosystem components like Hive, Pig, and Sqoop, as well as NoSQL databases such as MongoDB and Cassandra.
- Extensive work in Text Analytics, developing different Statistical Machine Learning, Data Mining solutions to various business problems and generating data visualizations using R, Python and Tableau.
- Hands-on experience implementing LDA and Naïve Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, Neural Networks, and Principal Component Analysis, with good knowledge of Recommender Systems.
- Performed statistical and graphical analytics using NumPy, pandas, Matplotlib, and BI tools such as Tableau.
- Experience in using visualization tools like Tableau, ggplot2 and d3.js for creating dashboards.
- Performed statistical modeling with machine learning to derive insights from data under the guidance of a Principal Data Scientist.
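A minimal PySpark sketch of the kind of Kafka-to-HDFS streaming ingestion referenced above; the broker addresses, topic name, event schema, and paths are illustrative placeholders rather than project specifics.

```python
# Minimal sketch: fault-tolerant Kafka ingestion with Spark Structured Streaming.
# Broker addresses, topic names, schema, and paths are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the raw Kafka stream; the value arrives as bytes and is parsed from JSON.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Checkpointing plus the file sink gives end-to-end exactly-once delivery,
# so records are neither lost nor duplicated across restarts.
query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/events")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```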
TECHNICAL SKILLS
Big Data/Hadoop Technologies: MapReduce, Spark, Spark SQL, Azure, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, Flink, YARN, Oozie, Zookeeper, Hue, Ambari Server
Languages: HTML5, DHTML, WSDL, CSS3, C, C++, XML, R/R Studio, SAS Enterprise Guide, SAS, R (Caret, Weka, ggplot), Perl, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, Ajax, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), JavaScript, Shell Scripting
NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB
Web Design Tools: HTML, CSS, JavaScript, JSP, jQuery, XML
Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans.
Public Cloud: EC2, IAM, S3, Autoscaling, CloudWatch, Route53, EMR, RedShift, Glue, Athena, SageMaker.
Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall
PROFESSIONAL EXPERIENCE
Confidential, NJ
Data Engineer
Responsibilities:
- Implemented RESTful web services to interact with the Redis cache framework.
- Data intake happens through Sqoop, and ingestion through MapReduce and HBase.
- Extensively worked on Spark Streaming and Apache Kafka to fetch live stream data.
- Interacted with other data scientists and architected custom solutions for data visualization using tools like Tableau and R packages.
- Developed predictive models using Python and R to predict customer churn and classify customers.
- Documented best practices and the target approach for the CI/CD pipeline.
- Constructed product-usage SDK data and data aggregations using PySpark, Scala, Spark SQL, and Hive context in partitioned Hive external tables maintained in an AWS S3 location for reporting, data science dashboarding, and ad-hoc analyses.
- Involved in data processing using an ETL pipeline orchestrated by AWS Data Pipeline using Hive.
- Installed Kafka Manager to monitor consumer lag and Kafka metrics; also used it for adding topics, partitions, etc.
- Created Kafka streaming data pipelines to consume data from multiple sources and perform transformations using Scala.
- Handled importing data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop; configured the sources, channels, and sinks by which data is ingested into HDFS.
- Responsible for performing various transformations like sort, join, aggregation, and filter in order to retrieve various datasets using Apache Spark.
- Experienced in extracting appropriate features from datasets in order to handle bad, null, and partial records using Spark SQL.
- Developed various data loading strategies and performed various transformations for analyzing the datasets using the Hortonworks distribution of the Hadoop ecosystem.
- Wrote Spark RDD transformations, actions, DataFrames, and case classes for the required input data and performed data transformations using Spark Context to convert RDDs to DataFrames.
- Worked on storing DataFrames into Hive as tables using Python (PySpark); see the sketch after this list.
- Experienced in ingesting data into HDFS from relational databases like Teradata using Sqoop and exporting data back to Teradata for data storage.
- Experience in developing various Spark applications using spark-shell (Scala).
- Involved in creating Hive tables, loading them with data, and writing Hive queries that invoke and run MapReduce jobs in the backend.
- Designed and implemented incremental imports into Hive tables and wrote Hive queries to run on Tez.
- Wrote Hive jobs to parse logs and structure them in tabular format to facilitate effective querying of the log data.
- Extracted data from multiple sources, applied transformations, loaded data into HDFS.
- Migrated ETL jobs to Pig scripts to perform transformations, joins, and pre-aggregations before storing the data in HDFS.
- Hands-on experience developing Apache Spark applications using RDD transformations, Spark Core, Spark MLlib, Spark Streaming, and Spark SQL.
- Involved in writing optimized Pig scripts along with developing and testing Pig Latin scripts.
- Implemented workflows using the Apache Oozie framework to automate tasks.
- Worked on different file formats like Sequence files, XML files and Map files using MapReduce Programs.
- Exported data from HDFS to a Cassandra (NoSQL) database using Sqoop and ran CQL commands on Cassandra to obtain the required datasets.
- After performing all transformations, the data is stored in MongoDB (NoSQL) using Sqoop.
- Created and imported various collections and documents into MongoDB and performed operations such as query, project, aggregate, sort, and limit.
- Involved in unit testing and delivered unit test plans and results documents using JUnit and MRUnit.
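A minimal PySpark sketch of storing aggregated data into a partitioned Hive external table on S3, in line with the PySpark/Hive work described above; the raw_sdk_events source table, the analytics.product_usage table, and the bucket path are hypothetical names used only for illustration.

```python
# Minimal sketch: persisting an aggregated DataFrame into a partitioned Hive
# external table backed by S3. Table, column, and bucket names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("usage-aggregation")
         .enableHiveSupport()
         .getOrCreate())

# Allow dynamic partition inserts into the Hive table.
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

# Aggregate raw SDK usage events with Spark SQL (raw_sdk_events is a placeholder).
usage = spark.sql("""
    SELECT product_id, COUNT(*) AS event_count, event_date
    FROM raw_sdk_events
    GROUP BY product_id, event_date
""")

# External table whose location points at S3, so reporting, dashboarding, and
# ad-hoc queries can read it directly through Hive.
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.product_usage (
        product_id STRING,
        event_count BIGINT
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 's3://example-bucket/analytics/product_usage/'
""")

# insertInto is positional: the partition column (event_date) must come last.
usage.write.mode("append").insertInto("analytics.product_usage")
```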
Environment: Hadoop, HDFS, MapReduce, Spark, Sqoop, Oozie, Pig, Kerberos, Hive, Flume, Tez, Linux, Java, Eclipse, Cassandra, Python, MongoDB.
Confidential, St. Louis, MO
Data Engineer
Responsibilities:
- Designed end-to-end scalable architecture to solve business problems using various Azure components like HDInsight, Data Factory, Data Lake, Storage, and Machine Learning Studio.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL Activity.
- Developed Spark applications using Scala and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the sketch after this list).
- Wrote multiple Hive UDFs using core Java and OOP concepts, and Spark functions within Python programs.
- Wrote Spark applications for Data validation, cleansing, transformations, and custom aggregations.
- Imported data from various sources into Spark RDD for processing.
- Developed custom aggregate functions using Spark SQL and performed interactive querying.
- Worked on installing the cluster, commissioning and decommissioning of data nodes, NameNode high availability, capacity planning, and slot configuration.
- Cleaned and consistent data was then streamed through Kafka into Spark, and manipulations were performed on the real-time data with Python and Scala.
- Built a machine learning based coupon purchase recommendation engine by training the model on historical customer purchase data across the retail domain.
- Simulated real-time scenarios using the scikit-learn and TensorFlow libraries on batch data, with the resulting model being used in the real-time models.
- Developed Spark applications for the entire batch processing by using Scala.
- Automatically scaled up EMR instances based on the data.
- Stored the time-series transformed data from the Spark engine, built on top of a Hive platform, to Amazon S3 and Redshift.
- Facilitated deployment of a multi-cluster environment using AWS EC2 and EMR, apart from deploying Docker containers for cross-functional deployment.
- Visualized the results using Tableau dashboards, and the Python Seaborn library was used for data interpretation in deployment.
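A minimal PySpark sketch of extracting data from multiple file formats, aggregating it, and landing the result in S3 as Parquet, along the lines described above; the S3 paths and column names are hypothetical.

```python
# Minimal sketch: multi-format extraction, aggregation, and Parquet output to S3.
# Paths and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("usage-patterns").getOrCreate()

# Source data arrives in different formats; normalise to a common column set.
csv_events = (spark.read.option("header", "true")
              .csv("s3://example-bucket/raw/csv/")
              .select("customer_id", "feature",
                      F.col("duration_sec").cast("double").alias("duration_sec")))
json_events = (spark.read.json("s3://example-bucket/raw/json/")
               .select("customer_id", "feature",
                       F.col("duration_sec").cast("double").alias("duration_sec")))

all_events = csv_events.unionByName(json_events)

# Aggregate usage per customer and feature to surface usage patterns.
usage_patterns = (all_events
                  .groupBy("customer_id", "feature")
                  .agg(F.count("*").alias("sessions"),
                       F.sum("duration_sec").alias("total_duration_sec")))

# Store as Parquet in S3; from there the data can be COPY-loaded into Redshift.
(usage_patterns.write
               .mode("overwrite")
               .parquet("s3://example-bucket/curated/usage_patterns/"))
```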
Environment: Spark, AWS, EC2, EMR, Hive, MS SQL Server, Genie Logs, Kafka, Sqoop, Spark SQL, Spark Streaming, Scala, Python, Tableau
Confidential
Data Engineer
Responsibilities:
- Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing.
- Used Kafka capabilities such as distribution, partitioning, and the replicated commit log service for messaging systems by maintaining feeds.
- Involved in requirement gathering and business analysis, and translated business requirements into technical designs in Hadoop and Big Data.
- Involved in the Sqoop implementation, which helps in loading data from various RDBMS sources to Hadoop systems and vice versa.
- Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket (see the sketch after this list).
- Involved in analyzing system failures, identifying root causes, and recommending courses of action; documented system processes and procedures for future reference.
- Involved in configuring the Hadoop cluster and load balancing across the nodes.
- Involved in Hadoop installation, commissioning, decommissioning, balancing, troubleshooting, monitoring, and debugging the configuration of multiple nodes using the Hortonworks platform.
- Configured Spark Streaming to get ongoing information from Kafka and store the stream information in HDFS.
- Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
- Involved in performance tuning of Spark jobs by using caching and taking full advantage of the cluster environment.
- Wrote scripts for Location Analytics project deployment on a Linux cluster/farm and AWS Cloud deployment using Python.
- Worked extensively on Informatica Partitioning when dealing with huge volumes of data.
- Used Teradata external loaders like MultiLoad, TPump, and FastLoad in Informatica to load data into the Teradata database.
- Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.
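A minimal Python (boto3) sketch of the kind of CSV-to-S3 loading and folder management described above; the bucket name, local directories, and key prefixes are hypothetical.

```python
# Minimal sketch: loading local CSV files into an S3 bucket under per-dataset
# prefixes. Bucket and path names are illustrative placeholders.
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "example-ingest-bucket"

def upload_csv_files(local_dir: str, prefix: str) -> None:
    """Upload every CSV under local_dir to s3://BUCKET/prefix/."""
    for name in os.listdir(local_dir):
        if not name.endswith(".csv"):
            continue
        key = f"{prefix}/{name}"
        s3.upload_file(os.path.join(local_dir, name), BUCKET, key)
        print(f"uploaded s3://{BUCKET}/{key}")

if __name__ == "__main__":
    # One key prefix ("folder") per source system keeps logs and objects tidy.
    upload_csv_files("/data/exports/sales", "raw/sales")
    upload_csv_files("/data/exports/customers", "raw/customers")
```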
Environment: Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Flume, Spark, Impala, Cassandra, HDFS, Scala, Spark RDD, Spark SQL, Kafka
Confidential
Data Engineer
Responsibilities:
- Extracted feeds from social media sites such as Facebook and Twitter using Python scripts.
- Developed and implemented Apache NiFi across various environments and wrote QA scripts in Python for tracking files.
- Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression, and then loaded data into the Parquet Hive tables from Avro Hive tables.
- Involved in running Hive scripts through Hive, Hive on Spark, and Spark SQL.
- Enhancing Data Ingestion Framework by creating more robust and secure data pipelines.
- Involved in the complete Big Data flow of the application, starting from data ingestion from upstream into HDFS, and processing and analyzing the data in HDFS.
- Implemented reporting in PySpark and Zeppelin, and querying through Airpal and AWS Athena.
- Wrote JUnit tests and integration test cases for those microservices.
- Worked heavily with Python, C++, Spark, SQL, Airflow, and Looker.
- Proven experience with ETL frameworks (Airflow, Luigi, and the team's own open-sourced garcon).
- Created Hive schemas using performance techniques like partitioning and bucketing (see the sketch after this list).
- Created data models for AWS Redshift and Hive from dimensional data models.
- Implemented a prototype for the complete requirements using Splunk, Python, and machine learning concepts.
- Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
- Used SSIS to transform and load data into a SQL database via FTP from text files and MS Excel sources; managed security by assigning permissions and roles; designed scripts to automate maintenance tasks.
- Built PL/SQL procedures, functions, triggers, and packages to summarize data into summary tables used for generating reports with improved performance.
- Developed Star and Snowflake schema based dimensional models to build the data warehouse.
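A minimal PySpark sketch of writing an Avro-backed staging table out as a partitioned, bucketed Parquet table with Snappy compression registered in the Hive metastore, in the spirit of the Hive schema work described above; it uses the Spark writer API (Spark bucketing semantics), and the database, table, and column names are hypothetical.

```python
# Minimal sketch: partitioned, bucketed Parquet table with Snappy compression,
# loaded from an Avro-backed staging table. All names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive-parquet-tables")
         .enableHiveSupport()
         .getOrCreate())

orders = spark.table("staging.orders_avro")

# Partitioning by date prunes whole directories at query time; bucketing by
# customer_id clusters rows to speed up joins and sampling on that key.
(orders.write
       .partitionBy("order_date")
       .bucketBy(32, "customer_id")
       .sortBy("customer_id")
       .format("parquet")
       .option("compression", "snappy")
       .mode("overwrite")
       .saveAsTable("analytics.orders_parquet"))
```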
Environment: ER/Studio, Python, OLAP, OLTP, Oracle, ETL, SQL, PL/SQL, Teradata, SSIS, SSRS, T-SQL, XML