Sr. Big Data Engineer Resume
Pittsburgh, PA
SUMMARY
- 10 years of experience developing and implementing Big Data solutions and data mining applications on Hadoop using HDFS, MapReduce, HBase, Pig, Hive, Sqoop, Flume, Kafka, Storm, Spark, Oozie, Zookeeper, Flink, and NiFi.
- Experience in installing, configuring, supporting, and managing Hadoop clusters and building highly scalable Big Data solutions on multiple distributions, i.e., Cloudera and Hortonworks, and on NoSQL platforms (HBase & Cassandra).
- Experience in design, development, unit testing, integration, debugging, implementation, and production support, as well as client interaction and understanding of business applications, business data flow, and data relationships.
- Experience in Machine Learning and data mining with large sets of structured and unstructured data, data acquisition, data validation, predictive modeling, and data visualization, using programming languages such as R and Python along with Big Data technologies such as Hadoop and Spark.
- In-depth understanding/knowledge of Hadoop Architecture and its various components such as Job Tracker, Task Tracker, YARN, Name Node, Data Node and MapReduce concepts.
- Expertise in setting up Hadoop in a pseudo-distributed environment, along with Hive, Pig, HBase, and Sqoop, on the Ubuntu operating system; in command of setup, configuration, and security for Hadoop clusters using Kerberos.
- Knowledge of provisioning new Hadoop users, including setting up Linux accounts and Kerberos principals and testing HDFS, Hive, Pig, and MapReduce access for the new users.
- Familiar with data architecture including data ingestion pipeline design, Hadoop information architecture, data modeling and data mining, machine learning and advanced data processing.
- Extensive experience with ETL and query tools for Big Data, such as Pig Latin and HiveQL.
- Expert in implementing advanced procedures like text analytics and processing using Apache Spark written in Scala.
- Deep understanding of statistical modeling, multivariate analysis, model testing, problem analysis, model comparison, and validation.
- Experience using various packages in R and Python, such as scikit-learn, ggplot2, caret, dplyr, pandas, NumPy, seaborn, SciPy, matplotlib, Beautiful Soup, and Rpy2.
- Extensive knowledge of developing Spark Streaming jobs using RDDs (Resilient Distributed Datasets), leveraging PySpark and spark-shell as appropriate.
- Leveraged big data technologies such as Apache Hadoop, Apache Flink, and Apache NiFi, combined with newer technologies such as Docker, to enable a fault-tolerant runtime engine that runs the machine learning lifecycle pipeline for model training and prediction/inference at large scale.
- Expertise in Core Java, Data Structures, Algorithms, Object-Oriented Design (OOD), and Java concepts such as Collections Framework, Exception Handling, I/O System and Multi-Threading.
- Developed core modules in large cross-platform applications using Java, J2EE, Hibernate, Python, Spring, JSP, Servlets, EJB, JDBC, JavaScript, XML, and HTML.
- Extensive experience in working with Oracle, MS SQL Server, DB2, MySQL RDBMS databases.
TECHNICAL SKILLS
Big Data/Hadoop Technologies: Hadoop, HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Flume, Spark, Kafka, Storm, Drill, Zookeeper, and Oozie.
Languages: C, C++, Java, Python, Scala
Application Servers: WebLogic, WebSphere, JBoss, Tomcat.
Cloud Computing Tools: Amazon AWS (S3, EMR, EC2, Lambda, VPC, Route 53, CloudWatch)
Databases: Microsoft SQL Server 2008 … MySQL 4.x/5.x, Oracle 10g, 11g, 12c, DB2, Teradata, Netezza
NoSQL Databases: HBase, Cassandra, MongoDB, MariaDB.
Build Tools: Jenkins, Maven, Ant, Toad, SQL Loader, RTC, RSA, Control-M, Oozie, Hue, SOAP UI
Modeling: Rational Rose, StarUML, Visual Paradigm for UML
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos 7.0/6.0.
Operating Systems: UNIX, Linux, Windows, macOS, Sun Solaris
PROFESSIONAL EXPERIENCE
Confidential, Pittsburgh PA
Sr. Big Data Engineer
Responsibilities:
- Created data pipelines for ingestion and aggregation events, loading consumer response data from an AWS S3 bucket into Hive external tables on HDFS to serve as the feed for an AWS QuickSight dashboard.
- Used ANT automated build scripts to compile and package the application and implemented Log4j
- Used Hive join queries to join multiple source-system tables and load the results into Elasticsearch.
- Created PySpark Scripts to improve the performance of the application
- Envisioned the architectural scheme, structure, features, functionality, and user-interface design.
- Evolved the overall master data model, including functions, entities within those functions, and attributes within those entities, as the platform design was completed and business needs shifted.
- Applied data warehousing solutions while working with a variety of database technologies.
- Developed Scala jobs to incorporate SQL queries, improving performance when extracting the same data from a PostgreSQL database environment.
- Wrote Spark programs in Scala and Python for data quality checks.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions for building common learner data models that consume data from Kafka in near real time and persist it to Cassandra (a streaming sketch follows this role's Environment line).
- Good understanding of Cassandra architecture, replication strategy, gossip, snitch, etc.
- Designed column families in Cassandra, ingested data from RDBMS sources, performed data transformations, and exported the transformed data to Cassandra per business requirements.
- Consumed XML messages using Kafka and processed the XML file using Spark Streaming to capture UI updates.
- Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files (a flattening sketch follows this role's Environment line).
- Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
- Involved in loading data from REST endpoints to Kafka producers and transferring the data to Kafka brokers.
- Used HiveQL to analyze partitioned and bucketed data and executed Hive queries on Parquet tables to perform data analysis that met the business specification logic.
- Worked extensively with AWS cloud services such as EC2, S3, EBS, RDS, and VPC.
- Migrated an existing on-premises application to AWS; used services such as EC2 and S3 for small-data-set processing and storage, and maintained the Hadoop cluster on AWS EMR.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
- Worked with Elastic MapReduce (EMR) and set up a Hadoop environment on AWS EC2 instances.
- Worked on connecting Cassandra database to the Amazon EMR File System for storing the database in S3.
- Implemented Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for data analysis and engineering workloads.
- Developed an API for Spark jobs to connect to AWS S3 buckets and push the final JSON export, which is the entry point for clients to access the data.
- Designed Oozie workflows to schedule Spark jobs on the cluster and generate the JSON exports every day.
- Used DbVisualizer, PuTTY, IntelliJ, and Excel on a regular basis, and used Flowdock to track one-on-one communication between the teams.
- Implemented partitions and buckets and developed Hive queries to process the data and generate data cubes for visualization; involved in cluster maintenance, monitoring, and troubleshooting.
Environment: Python, AWS, Scala, Spark, Docker, Spark RDD, AWS EC2, AWS S3, Cassandra, Java, PySpark, Oozie, DbVisualizer, PuTTY, IntelliJ, Excel, SQL, YARN, Spark SQL, HDFS, Hive, Maven, Apache Kafka, Shell scripting, Linux, PostgreSQL, Git, and Agile Methodologies.
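A minimal PySpark sketch of the Kafka-to-Cassandra streaming pattern described above, assuming the spark-sql-kafka and spark-cassandra-connector packages are on the classpath; the brokers, topic, keyspace, table, and JSON schema are hypothetical placeholders rather than the actual production model.

```python
# Minimal sketch of the Kafka -> Spark Structured Streaming -> Cassandra pattern.
# Topic, keyspace, table, schema, and hosts below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = (SparkSession.builder
         .appName("learner-stream-sketch")
         .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder host
         .getOrCreate())

# Illustrative schema for the JSON payload carried in the Kafka message value.
event_schema = StructType([
    StructField("learner_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
       .option("subscribe", "learner_events")               # hypothetical topic
       .load())

events = (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
             .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # foreachBatch lets each micro-batch reuse the batch Cassandra writer.
    (batch_df.write
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="learner_ks", table="learner_events")  # hypothetical table
        .mode("append")
        .save())

query = (events.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "/tmp/checkpoints/learner_events")
         .start())
query.awaitTermination()
```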
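A companion sketch of the JSON-flattening preprocessing job mentioned above, assuming a hypothetical nested `response` document with an `answers` array; paths and field names are illustrative only.

```python
# Sketch of flattening nested JSON documents into a flat file with Spark DataFrames.
# Input/output paths, field names, and the nested layout are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("flatten-json-sketch").getOrCreate()

docs = spark.read.json("s3://example-bucket/raw/responses/")  # placeholder input path

flat = (docs
        # Explode a hypothetical array of answers so each element becomes a row.
        .withColumn("answer", F.explode_outer("response.answers"))
        # Pull nested attributes up to top-level columns with dot notation.
        .select(
            F.col("response.id").alias("response_id"),
            F.col("response.user.id").alias("user_id"),
            F.col("answer.question_id").alias("question_id"),
            F.col("answer.value").alias("answer_value"),
        ))

# Write the flattened result as a header-bearing CSV "flat file".
(flat.write
     .mode("overwrite")
     .option("header", "true")
     .csv("s3://example-bucket/flat/responses/"))  # placeholder output path
```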
Confidential, San Antonio TX
Sr. Big Data Engineer
Responsibilities:
- Developed and refined the Spark process for the ODS (Operational Data Store), enhancing the performance of data ingestion from the raw and refined zones through publishing to the Postgres core, using Python and PySpark (a JDBC publish sketch follows this role's Environment line).
- Responsible for the ingestion of data from various APIs and writing modules to store data in S3 buckets.
- Validating data fields from the refined zone to ensure the integrity of the published table.
- Converted ingested data (CSV, XML, JSON) into compressed Parquet file format (a Parquet conversion sketch follows this role's Environment line).
- Created business models from business cases and enterprise architecture requirements for process monitoring, improvement, and reporting, and led the team in developing business intelligence solutions.
- Experience performing transformations and actions on RDDs, DataFrames, and Datasets using Apache Spark.
- Good Knowledge of Spark and Hadoop Architecture and experience in using Spark for data processing.
- Hands-on experience using Google Stackdriver to monitor logs from both GKE and GCP instances, and configured alerts from Stackdriver.
- Developed gsutil scripts for gzip compression, backup, and transfer to the edge node, covering all file-operation requirements for BigQuery load jobs.
- Worked with Hadoop configuration files for application logs (yarn-site.xml, yarn-default.xml, mapred-site.xml) and set up log-aggregation properties in the config files.
- Worked on data that was a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
- Aggregated data from multiple sources and performed resampling to handle the issue of imbalanced data.
- Coded in PostgreSQL to publish 10 million records from more than 90 tables, ensuring the integrity of real-time data flow.
- Experienced as a Senior ETL Developer (Hadoop ETL / Teradata / Vertica / Informatica / DataStage / Mainframe), Subject Matter Expert (SME), Production Support Analyst, and QA Tester.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions for building a common learner data model that gets data from Kafka in near real time, and used Apache NiFi to ingest and persist it to HBase.
- Worked extensively with TFS (Microsoft) to deploy production-level code, in conjunction with Git.
- Constructed robust, high-volume data pipelines and architecture to prepare data for analysis by the client.
- Used statistical learning methods such as logistic regression, linear regression, hypothesis testing, and ANOVA throughout the project lifecycle.
- Developed Oozie workflow for scheduling and orchestrating the ETL process
- Architected complete, scalable data warehouse and ETL pipelines to ingest and process millions of rows daily from 30+ data sources, allowing powerful insights and driving daily business decisions.
- Implemented optimization techniques for data retrieval, storage, and data transfer.
- Created HBase tables to load large sets of structured, semi-structured, and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
- Designed monitoring dashboards for pipelines using Kibana.
Environment: Spark 2.4, HBase 1.2, Tableau 10, Power BI, Python 2.7 and 3.4, Scala, PySpark, HDFS, Flume 1.6, Hive, Zeppelin, PostgreSQL, MySQL, TFS, Linux, Spark SQL, Kafka, NiFi, Sqoop 1.4.6, AWS (S3)
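A minimal sketch of publishing a refined Spark DataFrame to Postgres over JDBC, as referenced in the ODS bullet above; the URL, credentials, table name, and input path are placeholders, and the PostgreSQL JDBC driver jar is assumed to be on the Spark classpath.

```python
# Sketch of publishing refined data to a Postgres table over JDBC.
# Host, database, credentials, table, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("publish-postgres-sketch").getOrCreate()

refined = spark.read.parquet("s3://example-bucket/refined/ods_core/")  # placeholder input

(refined.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://pg-host:5432/ods")   # placeholder host/database
    .option("dbtable", "core.published_table")             # hypothetical target table
    .option("user", "etl_user")                            # placeholder credentials
    .option("password", "****")
    .option("driver", "org.postgresql.Driver")
    .option("batchsize", "10000")                          # larger batches help bulk publishes
    .mode("append")
    .save())
```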
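A sketch of the CSV/JSON-to-snappy-Parquet conversion mentioned above; paths are illustrative, and reading XML is noted as requiring the spark-xml package, which is an assumption about the setup rather than part of core Spark.

```python
# Sketch of converting ingested CSV/JSON (and, with the spark-xml package, XML)
# into snappy-compressed Parquet. Paths and options are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("to-parquet-sketch").getOrCreate()

# CSV with a header row; schema inference is fine for a sketch, explicit schemas in production.
csv_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3://example-bucket/refined/accounts.csv"))

# Newline-delimited JSON.
json_df = spark.read.json("s3://example-bucket/refined/events.json")

# XML would use the spark-xml reader, e.g.
#   spark.read.format("com.databricks.spark.xml").option("rowTag", "record").load(path)
# which requires the spark-xml package on the classpath.

for name, df in [("accounts", csv_df), ("events", json_df)]:
    (df.write
       .mode("overwrite")
       .option("compression", "snappy")  # snappy is Spark's Parquet default, stated explicitly here
       .parquet("s3://example-bucket/published/{}/".format(name)))
```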
Confidential, Blue Ash OH
Data Engineer
Responsibilities:
- Performed data ingestion from on-premises to AWS Redshift as a part of the data lake ingestion team.
- Tested, debugged, and deployed multiple scripts across multiple environments and handled data in different file formats like txt, xlsx, csv, and Parquet.
- Ingested data from various source systems like MySQL and Oracle to Redshift database and used Autosys to automate jobs.
- Debugged and modified ETL jobs in Talend for data ingestion into Redshift and used Git for maintaining/versioning the code.
- Worked across environments like MySQL, Oracle, AWS S3, Athena, EMR, EC2, Redshift, Talend studio, and Autosys.
- Created SQL scripts for the manual ingestion process, to be used whenever the automated job fails or is under maintenance (a Redshift COPY sketch follows this role's Environment line).
- Developed Python scripts for data splitting, eliminating 50% of the manual work in the ingestion process (a splitting sketch also follows this role's Environment line).
- Collaborated with a team of 4 to ingest more than 1,500 files in a short period and gave knowledge-transfer sessions to multiple new joiners on the offshore team, including outside working hours.
Environment: Big Data, JDBC, NoSQL, Spark, YARN, Hive, Pig, Scala, NiFi, IntelliJ IDEA, AWS EMR, Python, Hadoop, Redshift
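A sketch of the manual-ingestion fallback mentioned above: running a Redshift COPY from S3 through psycopg2 when the automated job is down. The cluster endpoint, IAM role, table, and S3 prefix are hypothetical placeholders.

```python
# Sketch of a manual Redshift COPY from S3 via psycopg2.
# Endpoint, credentials, IAM role, table, and S3 prefix are placeholders.
import psycopg2

COPY_SQL = """
    COPY staging.accounts
    FROM 's3://example-bucket/ingest/accounts/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    CSV
    IGNOREHEADER 1;
"""

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="****",
)
try:
    with conn.cursor() as cur:
        cur.execute(COPY_SQL)
    conn.commit()  # the COPY only becomes visible after commit
finally:
    conn.close()
```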
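A minimal sketch of the data-splitting helper referenced above: it breaks a large CSV into fixed-size chunks, repeating the header in each so the chunks can be ingested independently. File names, paths, and the chunk size are assumptions.

```python
# Sketch of a CSV-splitting helper for the ingestion process.
# Paths and chunk size are illustrative placeholders.
import csv
import os


def split_csv(src_path, out_dir, rows_per_file=100000):
    """Split src_path into numbered chunk files of at most rows_per_file rows each."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    base = os.path.splitext(os.path.basename(src_path))[0]

    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        part, rows = 0, []
        for row in reader:
            rows.append(row)
            if len(rows) == rows_per_file:
                _write_chunk(out_dir, base, part, header, rows)
                part, rows = part + 1, []
        if rows:  # flush the final partial chunk
            _write_chunk(out_dir, base, part, header, rows)


def _write_chunk(out_dir, base, part, header, rows):
    dest = os.path.join(out_dir, "{}_part{:04d}.csv".format(base, part))
    with open(dest, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)   # repeat the header in every chunk
        writer.writerows(rows)


if __name__ == "__main__":
    split_csv("/data/ingest/accounts.csv", "/data/ingest/split")  # placeholder paths
```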
Confidential
Hadoop Developer
Responsibilities:
- Worked with product owners, designers, QA, and other engineers in an Agile development environment to deliver timely solutions per customer requirements.
- Transferred data from different data sources into HDFS using Kafka producers, consumers, and brokers (a producer sketch follows this role's Environment line).
- Used Oozie for automating the end-to-end data pipelines and Oozie coordinators for scheduling the workflows.
- Involved in creating Hive tables, loading data, and writing Hive queries and views using HiveQL.
- Applied Hive queries over HBase-backed SerDe tables to perform data analysis and meet the data requirements of downstream applications.
- Designed the end-to-end ETL flow for a feed with millions of records flowing in daily, using the Apache tools/frameworks Hive, Pig, Sqoop, and HBase for the entire workflow; loaded and transformed large sets of structured and semi-structured data, including Avro and sequence files.
- Worked on migrating all existing jobs to Spark to improve performance and decrease execution time.
- Implemented Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Used Hive join queries to join multiple source-system tables and load the results into Elasticsearch.
- Responsible for developing a data pipeline with Amazon AWS to extract data from web logs and store it in Amazon EMR.
- Experience with data formats such as JSON, Avro, Parquet, and ORC, and compression codecs such as Snappy and bzip2.
- Coordinated with the testing team for bug fixes and created documentation for recorded data, agent usage and release cycle notes.
Environment: HDFS, Python, Oozie, Hive, HBase, Impala, Spark, AWS.
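A minimal sketch of the Kafka producer side of the ingestion described above, using the kafka-python client; the brokers, topic, source file, and key field are hypothetical placeholders, and downstream consumers (not shown) land the records in HDFS.

```python
# Sketch of a Kafka producer shipping newline-delimited source records to a topic.
# Brokers, topic, source path, and key field are hypothetical placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092", "broker2:9092"],       # placeholder brokers
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
    acks="all",                                               # wait for full acknowledgement
)

with open("/data/source/events.jsonl") as fh:                 # placeholder source file
    for line in fh:
        event = json.loads(line)
        # Key by a hypothetical id field so related events stay in one partition.
        producer.send("ingest.events",
                      key=str(event.get("id", "")).encode("utf-8"),
                      value=event)

producer.flush()   # make sure buffered records reach the brokers before exiting
producer.close()
```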