Sr. Data Engineer Resume
Boston, MA
SUMMARY
- Over 8 years of IT experience in the design, development, maintenance and support of Big Data applications, including 3+ years of experience in Data Engineering, Data Pipeline Design, Development and Implementation as a Data Engineer/Data Developer and Data Modeler.
- Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification and Testing, in both Waterfall and Agile methodologies.
- Strong experience in writing scripts using the Python, PySpark and Spark APIs for analyzing data (a minimal sketch appears after this summary).
- Python libraries: PySpark, Pytest, PyMongo, PyExcel, Psycopg, NumPy and Pandas.
- Hands-on experience with Spark Core, Spark SQL and Spark Streaming, and in creating and handling DataFrames in Spark with Scala.
- Experience in NoSQL databases; worked on table row key design and on loading and retrieving data for real-time data processing and performance improvements based on data access patterns.
- Good working exposure to AWS Glue, EMR, Lambda, Step Functions, SNS, SQS and EC2.
- Experience in migrating data from on-premises big data platforms to AWS cloud architecture.
- Extensive experience in Hadoop architecture and its various components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce concepts.
- Experience in building large-scale, highly available web applications. Working knowledge of web services and other integration patterns.
- Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Expertise in Python and Scala; wrote user-defined functions (UDFs) for Hive and Pig using Python.
- Hands-on experience with AWS IAM, S3, EC2, RDS, EMR, Athena, Lambda, Redshift, Redshift Spectrum and Redshift Datashare.
- Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
- Experience in Extraction, Transformation and Loading (ETL) of data from various sources into Data Warehouses, as well as data processing such as collecting, aggregating and moving data from various sources using Apache Flume, Kafka, Power BI and Microsoft SSIS.
- Experience setting up instances behind an Elastic Load Balancer in AWS for high availability, and in cloud integration with AWS using Elastic MapReduce (EMR).
- Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS-specific features.
- Good working experience with AWS infrastructure services: Amazon Simple Storage Service (Amazon S3), EMR, Lambda functions and Amazon Elastic Compute Cloud (Amazon EC2).
- Expertise in designing complex mappings, in performance tuning, and in Slowly Changing Dimension tables and Fact tables.
- Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
- Excellent interpersonal and communication skills; creative, research-minded, technically competent and results-oriented, with problem-solving and leadership skills. Able to work effectively in cross-functional team environments.
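As a hedged illustration of the PySpark and Spark SQL scripting mentioned above, the following minimal sketch shows the kind of DataFrame analysis involved; the input path and column names are hypothetical placeholders rather than details from any actual engagement.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("usage-analysis").getOrCreate()

# Load a hypothetical CSV extract into a DataFrame (path and schema are placeholders)
usage = spark.read.csv("s3://example-bucket/usage/*.csv", header=True, inferSchema=True)

# DataFrame API: event counts and average session length per customer
summary = (
    usage.groupBy("customer_id")
    .agg(F.count("*").alias("events"), F.avg("session_seconds").alias("avg_session_seconds"))
    .orderBy(F.desc("events"))
)

# The same aggregation expressed in Spark SQL over a temporary view
usage.createOrReplaceTempView("usage")
summary_sql = spark.sql(
    "SELECT customer_id, COUNT(*) AS events, AVG(session_seconds) AS avg_session_seconds "
    "FROM usage GROUP BY customer_id ORDER BY events DESC"
)

summary.show(10)
```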
PROFESSIONAL EXPERIENCE
Sr. Data Engineer
Confidential, Boston, MA
Responsibilities:
- Involved in requirements gathering and business analysis, and translated business requirements into technical designs in Hadoop and Big Data.
- Involved in Hadoop installation, commissioning, decommissioning, balancing, troubleshooting, monitoring and debugging the configuration of multiple nodes using the Hortonworks platform.
- Implemented and maintained application CI/CD using the AWS CI/CD stack.
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Implemented Spark using Python and Spark SQL for faster data processing, and worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done in Python (PySpark).
- Used AWS for Tableau Server scaling and secured Tableau Server on AWS, protecting the Tableau environment using Amazon VPC, security groups and AWS IAM.
- Designed and developed ETL processes in AWS Glue to migrate source data to the Integrated Data Marketplace (IDM) on the Redshift EDW.
- Developed PySpark code for AWS Glue jobs and for EMR (a minimal sketch appears after this list).
- Wrote Spark SQL and embedded the SQL in Scala files to generate JAR files for submission to the Hadoop cluster.
- Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data. Created various types of data visualizations using Python and Tableau.
- Developed data pipelines using Sqoop, HQL, Spark and Kafka to ingest enterprise message delivery data into HDFS.
- Developed a workflow in Oozie to automate the tasks of loading data into HDFS and pre-processing it with Pig and Hive.
- Involved in analyzing system failures, identifying root causes and recommending courses of action; documented system processes and procedures for future reference.
- Developed audit logging that can be incorporated into all Glue scripts to capture all ETL activities.
- Served as a subject matter expert for data architecture, AWS, architectural best practices, design patterns, tools and platforms.
- Managed AWS environments in accordance with policies and security guidelines, and assisted with building and improving those policies.
- Created Glue jobs in Python using the Glue, Boto3 and Spark APIs for all data transformations.
- Developed Python scripts to extract data from web server output files and load it into HDFS.
- Designed the data architecture for a project in a cloud computing environment, using Amazon Web Services (AWS) to host the databases.
- Involved in HBase setup and in storing data into HBase for further analysis.
- Created Power BI reports on data in ADLS, the Tabular Model and SQL Server.
- Created various data pipelines using Spark, Scala and Spark SQL for faster processing of data.
- Optimized the user experience in the AWS cloud. Established data governance processes and standards.
- Designed the number of partitions and the replication factor for Kafka topics based on business requirements.
- Used HiveQL to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
- Assisted in cluster maintenance, cluster monitoring, and adding and removing cluster nodes; installed and configured Hadoop, MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
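The Glue-based ETL and audit-logging work described above could look roughly like the following minimal sketch; the catalog database, table, column and S3 path names are hypothetical placeholders, not details from the actual project.

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Standard Glue job boilerplate: resolve job arguments and build contexts
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table from the Glue Data Catalog (hypothetical database/table names)
source = glue_context.create_dynamic_frame.from_catalog(
    database="landing_zone",
    table_name="message_delivery",
)

# A simple Spark SQL transformation before landing the data in the curated zone
source.toDF().createOrReplaceTempView("deliveries")
curated = spark.sql(
    "SELECT delivery_id, customer_id, delivered_at "
    "FROM deliveries WHERE delivered_at IS NOT NULL"
)

# Lightweight audit logging of ETL activity (row counts), as described above
print(f"[AUDIT] {args['JOB_NAME']}: read={source.count()} wrote={curated.count()}")

# Write the result to S3 as Parquet (hypothetical bucket/prefix)
curated.write.mode("overwrite").parquet("s3://example-curated-bucket/deliveries/")

job.commit()
```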
Environment: HDFS, Hive, Scala, Glue, DataStage, Spark, Redshift, Power BI, AWS, Cloudera, SQL, Terraform, Splunk, RDBMS, Python, Elasticsearch, Data Lake, Kerberos, Jira, Confluence, Databricks, Data Factory, Git, Kafka
Data Engineer
Confidential, Green Bay, WI
Responsibilities:
- Worked on Big Data Hadoop cluster implementation and data integration in developing large-scale system software.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Involved in migrating the platform from Cloudera to EMR.
- Developed analytical component using Scala, Spark and Spark Streaming.
- Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
- Involved in creating Hive tables, loading them with data and writing Hive queries that invoke and run MapReduce jobs in the backend.
- Developed Python scripts using DataFrames/SQL/Datasets and RDDs in Spark for data aggregation and queries, and wrote data to AWS Redshift staging tables using AWS Glue jobs and workflows.
- Assisted in exporting data into Cassandra and writing column families to provide fast listing outputs.
- Used Zookeeper for providing coordinating services to the cluster.
- Involved in file movements between HDFS and AWS S3, and extensively worked with S3 buckets in AWS.
- Automated and monitored the complete AWS infrastructure with Terraform.
- Created data partitions on large data sets in S3 and DDL on the partitioned data.
- Imported data from sources like HDFS/HBase into Spark RDDs.
- Created Glue Catalog tables for the landing zone using crawlers.
- Used the Spark Streaming and Spark SQL APIs to process the files.
- Worked extensively with Sqoop for exporting data from HDFS to relational database systems/mainframes and, vice versa, loading data into HDFS.
- Created workflows using generic Glue jobs to automate the ETL process.
- Stored data in AWS S3 (similar to HDFS) and ran EMR programs on the data stored in S3.
- Converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size.
- Developed and executed a migration strategy to move the Data Warehouse from an Oracle platform to AWS Redshift.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
- Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
- Developed a Storm topology to ingest data from various sources into the Hadoop Data Lake.
- Developed a web application using the HBase and Hive APIs to compare schemas between HBase and Hive tables.
- Played a vital role in using the Scala/Akka framework for web-based applications.
- Connected to AWS EMR over SSH and ran spark-submit jobs.
- Developed a Python script to import data from SQL Server into HDFS and created Hive views on the data in HDFS using Spark.
- Developed UDFs in Java for Hive and Pig, and worked on reading multiple data formats on HDFS using Scala.
- Developed a workflow in Oozie to automate the tasks of loading data into HDFS and pre-processing it with Hive.
- Created scripts to append data from a temporary HBase table to the target HBase table in Spark.
- Worked with both SQL and NoSQL databases, and with AWS.
- Developed complex, multi-step data pipelines using Spark.
- Involved in creating ETL flows using Pig, loading data and writing Pig Latin queries that run internally as MapReduce jobs.
- Involved in writing Unix/Linux shell scripts for scheduling jobs and for running Pig scripts and HiveQL.
- Worked with the Hue UI for scheduling jobs, file browsing, job browsing and Metastore management.
- Designed and developed a system to collect data from multiple portals using Kafka and then process it using Spark (a minimal sketch appears after this list).
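A Kafka-to-Spark ingestion flow of the kind described above might look roughly like the following sketch using Spark Structured Streaming; the broker address, topic name and HDFS paths are hypothetical, and the spark-sql-kafka connector is assumed to be available on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Assumes the Kafka connector is supplied, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 portal_ingest.py
spark = SparkSession.builder.appName("portal-ingest").getOrCreate()

# Subscribe to a Kafka topic (broker and topic names are hypothetical)
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "portal_events")
    .load()
)

# Kafka delivers key/value as binary; cast the payload to strings for downstream parsing
parsed = events.select(col("key").cast("string"), col("value").cast("string"))

# Land the raw stream in HDFS as Parquet, with a checkpoint for fault tolerance
query = (
    parsed.writeStream.format("parquet")
    .option("path", "hdfs:///data/raw/portal_events")  # assumed landing path
    .option("checkpointLocation", "hdfs:///checkpoints/portal_events")
    .start()
)
query.awaitTermination()
```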
Environment: Hadoop, HDFS, Hive, Python, Sqoop, Glue, Spark, Scala, Cloudera CDH4, Oracle, Elasticsearch, Kerberos, DataStage, SFTP, Data Lake, Impala, Jira, Wiki, Alteryx, Teradata, Shell/Perl Scripting, Kafka, AWS (EC2, S3).
Data Analyst / Data Engineer
Confidential, Pittsburgh, PA
Responsibilities:
- Gathered business requirements, defined and designed the data sourcing, and worked with the data warehouse architect on the development of logical data models.
- Created sophisticated visualizations, calculated columns and custom expressions, and developed map charts, cross tables, bar charts, tree maps and complex reports involving property controls and custom expressions.
- Investigated market sizing, competitive analysis and positioning for product feasibility. Worked on Business forecasting, segmentation analysis and Data mining.
- Created several types of data visualizations using Python (Matplotlib) and Tableau. Extracted data using SQL queries to create reports.
- Performed reverse engineering using Erwin to redefine entities, attributes and relationships in the existing database.
- Analyzed functional and non-functional business requirements, translated them into technical data requirements, and created or updated logical and physical data models. Developed a data pipeline using Kafka to store data in HDFS.
- Performed regression testing for Golden Test Cases from State (end-to-end test cases) and automated the process using Python scripts.
- Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying.
- Used pandas, NumPy, seaborn, SciPy, Matplotlib, scikit-learn and NLTK in Python for developing various machine learning algorithms. Expertise in R, MATLAB, Python and their respective libraries.
- Researched reinforcement learning and control (TensorFlow, Torch) and machine learning models (scikit-learn).
- Responsible for the design and development of Python programs/scripts to prepare, transform and harmonize data sets in preparation for modeling.
- Worked with Market Mix Modeling to strategize advertisement investments and better balance the ROI on advertisements.
- Used grid search to evaluate the best hyperparameters for the model and K-fold cross-validation to train the model for best results (a minimal sketch appears after this list).
- Processed the image data through the Hadoop distributed system using MapReduce, then stored it in HDFS.
- Created session beans and controller servlets for handling HTTP requests from Talend.
- Performed data visualization and designed dashboards with Tableau, and generated complex reports including charts, summaries and graphs to interpret the findings for the team and stakeholders.
- Wrote documentation for each report including purpose, data source, column mapping, transformation and user group.
- Utilized Waterfall methodology for team and project management.
- Used Git for version control with the Data Engineer team and Data Scientist colleagues.
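The grid search with K-fold cross-validation mentioned above could be sketched roughly as follows; the data set, estimator and parameter values are illustrative assumptions, not the project's actual model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Hypothetical stand-in data; the real workflow would load prepared features instead
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Candidate hyperparameters to evaluate (illustrative values only)
param_grid = {"n_estimators": [100, 200], "max_depth": [5, 10, None]}

# Grid search with K-fold cross-validation to select the best hyperparameters
cv = KFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X, y)

print("Best params:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```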
Environment: Spark, YARN, Hive, Pig, Scala, Mahout, NiFi, TDD, Python, Hadoop, DynamoDB, Kibana, NoSQL, Sqoop, MySQL.