
Senior Data Engineer Resume


Weehawken, NJ

SUMMARY

  • Over 8 years of professional experience as a Big Data Engineer working with the Apache Hadoop ecosystem (HDFS, MapReduce, Hive, Sqoop, Oozie, HBase, Spark with Scala, Kafka) and Big Data analytics.
  • Experience in designing and implementing large-scale data pipelines for data curation using Spark/Databricks along with Python and Scala.
  • Excellent understanding of Hadoop architecture and its components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
  • Highly experienced in developing Hive Query Language (HiveQL) and Pig Latin scripts.
  • Experienced in using distributed computing architectures such as AWS products (EC2, Redshift, EMR, Elasticsearch, Athena, and Lambda), Hadoop, Python, and Spark, with effective use of MapReduce, SQL, and Cassandra to solve big data problems.
  • Extensive knowledge of the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
  • Experience in developing CI/CD (continuous integration and continuous deployment) pipelines and automation using Jenkins, Git, Docker, and Kubernetes for ML model deployment.
  • Worked with Azure Cloud Services (PaaS & IaaS), Azure Databricks, Azure Synapse Analytics, SQL Azure, GCP, Data Factory, Azure Analysis Services, Application Insights, Azure HDInsight, Key Vault, and Azure Data Lake for data ingestion, ETL processes, data integration, data migration, and AI solutions.
  • Expertise in Data Migration, Data Profiling, Data Cleansing, Transformation, Integration, Data Import, and Data Export using multiple ETL tools such as Informatica PowerCenter.
  • Experience in designing, building, and implementing a complete Hadoop ecosystem comprising MapReduce, HDFS, Hive, Impala, Pig, Sqoop, Oozie, HBase, MongoDB, and Spark.
  • Experience with client-server application development using Oracle PL/SQL, SQL*Plus, SQL Developer, TOAD, and SQL*Loader.
  • Used an AWS EMR Spark cluster and Cloud Dataflow on GCP to compare the efficiency of a developed pipeline in a proof of concept.
  • Working experience in migrating several databases to Snowflake.
  • Strong experience in architecting highly performant databases using PostgreSQL, PostGIS, MySQL, and Cassandra.
  • Extensive experience in loading and analyzing large datasets with the Hadoop framework (MapReduce, HDFS, Pig, Hive, Flume, and Sqoop).
  • Hands-on experience in application development using Java, RDBMS, and Linux shell scripting, as well as Object-Oriented Programming (OOP), multithreading in Core Java, and JDBC.
  • Worked extensively with NoSQL databases such as MongoDB and Cassandra.
  • Excellent working experience in Scrum/Agile and Waterfall project execution methodologies.
  • Hands-on experience in scheduling data ingestion processes into data lakes using Apache Airflow.
  • Good knowledge of and hands-on experience with Python modules such as NumPy, Pandas, Matplotlib, Scikit-learn, and PySpark.
  • Hands-on experience with Snowflake utilities, Snowpipe, and SnowSQL using Python (see the sketch after this summary).
  • Experience in importing and exporting terabytes of data between HDFS and relational database systems using Sqoop.
  • Good experience working with analysis tools like Tableau for regression analysis, pie charts, and bar graphs.
  • Experienced in implementing a log producer in Scala that watches application logs, transforms incremental log entries, and sends them to a Kafka and ZooKeeper based log collection platform.
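
A minimal Python sketch of loading data into Snowflake with the Snowflake connector, in the spirit of the Snowpipe/SnowSQL work above; the account, credentials, file path, and table names are placeholders rather than actual project values:

    import snowflake.connector  # pip install snowflake-connector-python

    # All connection details below are illustrative placeholders.
    conn = snowflake.connector.connect(
        user="etl_user",
        password="********",
        account="xy12345.us-east-1",
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="STAGING",
    )
    cur = conn.cursor()
    try:
        # Stage a local file on the table stage, then load it with COPY INTO.
        cur.execute("PUT file:///tmp/orders.csv @%ORDERS_RAW")
        cur.execute(
            "COPY INTO ORDERS_RAW FROM @%ORDERS_RAW "
            "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
        )
    finally:
        cur.close()
        conn.close()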

TECHNICAL SKILLS

Languages: Python, Java, R, Scala, SQL, PL/SQL, T-SQL, NoSQL.

Web Technologies: HTML, CSS, XML.

Big Data ecosystem: Hadoop, Hive, Pig, Spark, Sqoop, Oozie, Kafka, Zookeeper, Cloudera, Hortonworks.

Databases: Oracle, SQL Server, Postgres, Neo4j, MongoDB, Cassandra.

Development Tools: Jupyter, Anaconda, Eclipse, SSIS, SSRS, PyCharm.

Visualization Tools: Tableau, Power BI.

Cloud Technologies: Azure, AWS (S3, Redshift, Glue, EMR, Lambda, Athena), GCP, SQL Data Warehouse.

Automation/Scheduling: Jenkins, Docker, Kubernetes, Airflow.

Version Control: Git, SVN.

PROFESSIONAL EXPERIENCE

Confidential

Senior Data Engineer

Responsibilities:

  • Ingested data into the data lake (S3) and used AWS Glue to expose the data to Redshift.
  • Configured EMR clusters for data ingestion and used dbt (data build tool) to transform the data in Redshift.
  • Worked with GCP and several of its services, including Pub/Sub, Cloud Storage, and BigQuery.
  • Scheduled jobs in Airflow to automate the ingestion process into the data lake.
  • Used GCP Pub/Sub for ingestion from streaming sources and replicated data to servers with Dataflow.
  • Built data pipelines using Azure Data Factory and Azure Databricks, loading data into Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling and granting database access.
  • Extracted, transformed, and loaded (ETL) data from multiple federated data sources (JSON, relational databases, etc.) with DataFrames in Spark.
  • Worked on writing, testing, and debugging SQL code for transformations using dbt (data build tool).
  • Orchestrated multiple ETL jobs using AWS Step Functions and Lambda, and used AWS Glue for loading and preparing analytics data for customers.
  • Involved in writing Java and Node.js APIs for AWS Lambda to manage some of the AWS services.
  • Worked with AWS Lambda to run code without managing servers, triggering runs from S3 and SNS events (a hedged sketch of this pattern follows this list).
  • Developed data transition programs from DynamoDB to AWS Redshift (ETL process) using AWS Lambda, creating Python functions for specific events based on use cases.
  • Implemented the AWS cloud computing platform using RDS, Python, DynamoDB, S3, and Redshift.
  • Designed SQL-, SSIS-, and Python-based batch and real-time ETL pipelines to extract data from transactional and operational databases and load it into data warehouses.
  • Developed scripts using Jenkins integrated with the Git repository for build, testing, code review, and deployment.
  • Built data pipelines in Airflow on GCP for ETL jobs using different Airflow operators.
  • Worked on a CI/CD solution using Git, Jenkins, Docker, and Kubernetes to set up and configure the big data architecture on the AWS cloud platform.
  • Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro, JSON, and XML files; skilled in using columnar file formats such as RC, ORC, and Parquet.
  • Developed Scala scripts and UDFs involving both DataFrames and RDDs using Spark SQL for aggregation and queries, writing data back into the OLTP system directly or through Sqoop.
  • Performed database activities such as indexing and performance tuning.
  • Collected data with Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Experience in moving data between GCP and Azure using Azure Data Factory.
  • Responsible for loading and transforming huge sets of structured, semi-structured, and unstructured data.
  • Used AWS EMR to create Hadoop and Spark clusters, which were used to submit and execute Python applications in production.
  • Designed and developed end-to-end ETL processing from Oracle to AWS using Amazon S3, EMR, and Spark.
  • Wrote SQL and PL/SQL scripts to extract data from the database to meet business requirements and for testing purposes.
  • Facilitated training sessions to demo dbt for various teams and sent weekly communications on different topics related to data engineering.
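
A hedged Python sketch of the S3-triggered Lambda loading pattern referenced above, using boto3 and the Redshift Data API; the bucket, cluster, role ARN, and table names are illustrative assumptions rather than the actual project values:

    import urllib.parse

    import boto3

    # Placeholder identifiers for illustration only.
    REDSHIFT_CLUSTER = "analytics-cluster"
    REDSHIFT_DATABASE = "analytics"
    REDSHIFT_DB_USER = "etl_user"
    COPY_IAM_ROLE = "arn:aws:iam::123456789012:role/redshift-copy-role"

    redshift_data = boto3.client("redshift-data")

    def lambda_handler(event, context):
        """Triggered by an S3 PUT event; asks Redshift to COPY the new object."""
        record = event["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        copy_sql = (
            f"COPY staging.events FROM 's3://{bucket}/{key}' "
            f"IAM_ROLE '{COPY_IAM_ROLE}' FORMAT AS JSON 'auto'"
        )
        response = redshift_data.execute_statement(
            ClusterIdentifier=REDSHIFT_CLUSTER,
            Database=REDSHIFT_DATABASE,
            DbUser=REDSHIFT_DB_USER,
            Sql=copy_sql,
        )
        return {"statementId": response["Id"]}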

Confidential, Weehawken, NJ

Data Engineer

Responsibilities:

  • Developed a data pipeline using Spark, Hive, and HBase to ingest customer behavioral data and financial histories into the Hadoop cluster for analysis.
  • Working experience on the Azure Databricks cloud to organize data into notebooks and make it easy to visualize using dashboards.
  • Performed ETL on data from different source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and Azure Data Lake Analytics; ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure Synapse) and processed the data in Azure Databricks.
  • Managed Spark on Databricks through proper troubleshooting, estimation, and monitoring of the clusters.
  • Performed data aggregation and validation on Azure HDInsight using Spark scripts written in Python.
  • Implemented Azure Stream Analytics for processing real-time geospatial data for location-based targeted sales campaigns.
  • Performed monitoring and management of the Hadoop cluster using Azure HDInsight.
  • Created partitioned tables in Databricks using Spark and designed a data model using a Snowflake data warehouse on Azure (see the PySpark sketch after this list).
  • Used Hive, Impala, and Sqoop utilities and Oozie workflows for data extraction and data loading.
  • Extensively worked on shell scripts for running SAS programs in batch mode on UNIX.
  • Wrote Python scripts to parse XML documents and load the data into the database.
  • Integrated NiFi with Snowflake to optimize client session handling.
  • Optimized query performance by converting T-SQL queries to SnowSQL, establishing joins, and creating clustered indexes.
  • Created stored procedures to import data into the Elasticsearch engine.
  • Worked with the Data Governance, Data Quality, and Metadata Management team to understand the project.
  • Used Spark SQL to process huge amounts of structured data to aid better analysis for our business teams.
  • Created HBase tables to store various formats of data coming from different sources.
  • Responsible for importing log files from various sources into HDFS using Flume.
  • Developed routines to capture and report data quality issues and exceptional scenarios.
  • Installed and configured Hadoop and was responsible for maintaining the cluster and managing and reviewing Hadoop log files.
  • Involved in troubleshooting at the database level, error handling, and performance tuning of queries and procedures.
  • Worked on SAS Visual Analytics and SAS Web Report Studio for data presentation and reporting.
  • Extensively used SAS macros to parameterize reports so that users could choose the summary and sub-setting variables from the web application.
  • Developed data warehouse models in Snowflake for various datasets using WhereScape.
  • Responsible for translating business and data requirements into logical data models in support of enterprise data models, ODS, OLAP, OLTP, and operational data structures.
  • Provided thought leadership for the architecture and design of Big Data analytics solutions for customers, actively drove Proof of Concept (POC) and Proof of Technology (POT) evaluations, and implemented an Azure data solution.
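
A minimal PySpark sketch of the Databricks partitioned-table pattern mentioned above; the ADLS path, column names, and target table are illustrative assumptions:

    from pyspark.sql import SparkSession, functions as F

    # In a Databricks notebook `spark` is already provided; building it here keeps the sketch self-contained.
    spark = SparkSession.builder.appName("curated-transactions").getOrCreate()

    # Placeholder ADLS Gen2 path; the real container and storage account differ.
    raw = spark.read.json("abfss://raw@examplestorage.dfs.core.windows.net/transactions/")

    curated = (
        raw.withColumn("event_date", F.to_date("event_ts"))
           .filter(F.col("amount").isNotNull())
    )

    # Write a table partitioned by date for downstream dashboards and Synapse queries.
    (curated.write
            .mode("overwrite")
            .partitionBy("event_date")
            .saveAsTable("curated.transactions"))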

Confidential, Bentonville, AR

Hadoop Developer

Responsibilities:

  • Performed Hive partitioning and bucketing, executed different types of joins on Hive tables, and implemented Hive SerDes such as JSON and Avro.
  • Worked with Hadoop ecosystem components like HBase, Sqoop, Zookeeper, Oozie, Hive, and Pig on the Cloudera Hadoop distribution.
  • Experienced in loading data from the UNIX file system to HDFS.
  • Used Google Cloud Platform (GCP) services to process and manage data from streaming sources.
  • Extracted streaming data using Kafka (an illustrative consumer sketch follows this list).
  • Worked on transferring objects from the Teradata platform to the Snowflake platform.
  • Created HBase tables to load large sets of structured, semi-structured, and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
  • Involved in business requirements, system analysis, and design of the data warehouse application.
  • Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Migrated ETL jobs to Pig scripts to perform transformations, joins, and some pre-aggregations before storing the data in HDFS.
  • Worked with different file formats like Sequence files, XML files, and Map files using MapReduce programs.
  • Performed loading and transformations on large sets of structured, semi-structured, and unstructured data.
  • Involved in creating Oozie workflows and coordinator jobs to kick off jobs on time based on data availability.
  • Used Flume to collect, aggregate, and store web log data from different sources such as web servers and network devices, and pushed it to HDFS.
  • Wrote scripts to deploy monitors and checks and to automate critical sysadmin functions.
  • Managed and scheduled jobs on a Hadoop cluster.
  • Performed tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log files.
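
An illustrative Python sketch of consuming streaming data from Kafka before landing it downstream; the topic name and broker addresses are assumptions, not the actual cluster details:

    import json

    from kafka import KafkaConsumer  # pip install kafka-python

    # Topic and broker addresses are placeholders.
    consumer = KafkaConsumer(
        "clickstream-events",
        bootstrap_servers=["broker1:9092", "broker2:9092"],
        auto_offset_reset="earliest",
        enable_auto_commit=True,
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )

    for message in consumer:
        event = message.value
        # In the real pipeline, events like this were batched and written to HDFS / HBase.
        print(event.get("user_id"), event.get("page"))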

Confidential

Hadoop Developer

Responsibilities:

  • Worked extensively on ingesting disparate data sets from different sources into the data lake. Used Sqoop to bring data to a landing zone, from where the existing automated ingestion process starts and creates Hive tables.
  • Created and automated jobs for extracting the delta load from different data sources like MySQL, DB2, and Oracle and pushing the result set to a landing zone.
  • Developed UNIX scripts for creating batch load and driver code to bring huge amounts of data and large numbers of tables as a history load from relational databases to the big data platform.
  • Used IBM CDC (Change Data Capture) to get near real-time transactional data.
  • Used Talend in implementing the DF2.0 ingestion framework.
  • Worked with systems analysts and business users to understand requirements for feed generation.
  • Created Health Allies Eligibility and Health Allies Transactional feed extracts using Hive, HBase, Python, and UNIX to migrate feed generation from a mainframe application called CES (Consolidated Eligibility Systems) to big data.
  • Used bucketing concepts in Hive to improve the performance of HQL queries.
  • Created a reusable Python script and added it to the distributed cache in Hive to generate fixed-width data files using an offset file.
  • Knowledge of handling Hive queries using Spark SQL, which integrates with the Spark environment (see the sketch after this list).
  • Developed Pig queries to load data into HBase.
  • Created ORC tables to improve performance for reporting purposes.
  • Created a MapReduce program that looks at current and prior versions of data in HBase to identify transactional updates. These updates are loaded into Hive external tables, which are in turn referenced by Hive scripts during transactional feed generation.
  • Created Spark Scala scripts with UDFs to transform and aggregate data using RDD transformations, DataFrames, etc., and stored the result set into the OLTP system through Sqoop.
  • Worked on creating a data model to establish parent-child relationships for the tables in Elasticsearch.
  • Loaded data from Hadoop into Elasticsearch using the Hive context in Spark.
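
A minimal PySpark sketch of running Hive queries through Spark SQL, as referenced above; the table and column names are hypothetical stand-ins for the eligibility feed tables:

    from pyspark.sql import SparkSession

    # enableHiveSupport() lets Spark SQL query the existing Hive metastore tables.
    spark = (SparkSession.builder
             .appName("feed-generation")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical external table populated by the MapReduce delta job.
    daily_updates = spark.sql("""
        SELECT member_id, plan_code, updated_ts
        FROM eligibility_updates_ext
        WHERE load_date = current_date()
    """)

    daily_updates.write.mode("overwrite").saveAsTable("feeds.eligibility_daily")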

Confidential

Data Engineer

Responsibilities:

  • Participated in meetings with different teams to understand the project needs and requirements.
  • Collaborated with product managers, engineers, finance, and operations teams to understand the current situation and help with key decision making.
  • Developed Python scripts to facilitate data collection from MySQL server.
  • Extracted and organized information from manually conducted cases and exported it to structured data using Python with the re (regular expression) module.
  • Specified data types to reduce memory requirements.
  • Analyzed the data through Exploratory Data Analysis (EDA) and worked on missing values.
  • Worked on feature extraction and on creating new features using the Pandas package.
  • Used Principal Component Analysis (PCA) in feature engineering to analyze high dimensional data.
  • Worked on outlier identification using Box Plot with Matplotlib, NumPy, and Pandas.
  • Built customer segmentation models using K-Means and Gaussian Mixture Model clustering algorithms (a simplified sketch follows this list).
  • Developed Decision Tree model to identify key predictors for the models.
  • Used a Random Forest classifier to check how booking volume changes with booking price ranges to determine the optimum booking price.
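
A simplified scikit-learn sketch of the K-Means segmentation approach above; the feature columns, sample values, and cluster count are illustrative assumptions:

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Hypothetical customer features of the kind pulled from MySQL and cleaned with Pandas.
    customers = pd.DataFrame({
        "bookings_per_year": [2, 15, 4, 30, 1, 22],
        "avg_booking_price": [120.0, 80.0, 300.0, 95.0, 45.0, 60.0],
    })

    # Scale features so no single column dominates the distance metric.
    scaled = StandardScaler().fit_transform(customers)

    # n_clusters=3 is a placeholder; in practice the count was tuned (e.g. with the elbow method).
    kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
    customers["segment"] = kmeans.fit_predict(scaled)

    print(customers.groupby("segment").mean())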

Confidential

Software Developer

Responsibilities:

  • Developed the web interfaces using JSP. Created multiple JAX-WS and JAX-RS based web services. Used SoapUI to test the SOAP services and for load testing.
  • Used JSP and Servlets for server-side transactions, creating requests and calling SOAP web services.
  • Developed a web services component using XML, WSDL, and SOAP with a DOM parser to transfer and transform data between applications.
  • Provided custom reports built using SQL and Excel to management. Implemented login authentication in JSP by verifying against database security tables.
  • Knowledge of creating responsive web pages using Bootstrap and CSS3. Designed and developed Entity Beans and Session Beans. Scheduled hot and cold backups using mysqldump.
  • Configured and deployed EJB Entity and Session Beans on WebLogic Server.
  • Developed business logic code using Servlets at the back end of the system.
  • Involved in developing the database tables to hold lender information.
  • Responsible for designing the front end using HTML. Developed JSPs and Servlets to provide dynamic content to the HTML pages.
  • Developed data access components and multilingual screen generator classes. Developed JSPs for client-side validations. Experience with JavaScript, jQuery, and other JS frameworks.
  • Developed the interface to automatically forward quote requests to qualified lenders using SMTP.
  • Developed test cases to test the business logic.
Environment: Java, Servlets, JavaScript, JDBC, jQuery, CVS, Eclipse, WebLogic Server, JSP, MySQL, Toad, and Linux.
