Senior Data Engineer Resume
Edison, NJ
SUMMARY
- 8+ years of professional IT experience in project development, implementation, deployment, and maintenance using Big Data technologies, designing and implementing complete end-to-end Hadoop-based data analytics solutions using HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, HBase, Spark integration with Cassandra, Avro, Solr, and ZooKeeper.
- 7+ years of experience as a developer using Big Data technologies such as Databricks/Spark and the Hadoop ecosystem.
- Hands-on experience with Unified Data Analytics on Databricks, including the Databricks Workspace user interface, managing Databricks notebooks, and Delta Lake with Python and Spark SQL.
- Good understanding of Spark architecture on Databricks and of Structured Streaming.
- Set up Databricks on AWS and Microsoft Azure, configured the Databricks Workspace for business analytics, and managed clusters in Databricks.
- Experience in developing data pipelines using AWS services including EC2, S3, Redshift, Glue, Lambda functions, Step functions, CloudWatch, SNS, DynamoDB, SQS.
- Proficiency in multiple databases like MongoDB, MySQL, ORACLE, and MS SQL Server.
- Served as the team's Jira administrator: provisioned access, worked assigned tickets, and partnered with project developers to test product requirements, bugs, and new improvements.
- Created snowflake schemas by normalizing the dimension tables as appropriate and creating a sub-dimension named Demographic as a subset of the Customer dimension.
- Experienced with Pivotal Cloud Foundry (PCF) on Azure VMs to manage the containers created by PCF.
- Hands-on experience with test-driven development (TDD), behavior-driven development (BDD), and acceptance test-driven development (ATDD) approaches.
- Managed databases and Azure data platform services (Azure Data Lake Storage (ADLS), Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB), SQL Server, Oracle, and data warehouses; built multiple data lakes.
- Extensive experience in text analytics, generating data visualizations using R and Python, and creating dashboards using tools like Tableau and Power BI.
- Worked with Google Cloud Dataflow and BigQuery to manage and move data within a 200-petabyte cloud data lake for GDPR compliance, and designed star schemas in BigQuery.
- Provided full life-cycle support for logical/physical database design, schema management, and deployment; adept at the database deployment phase, with strict configuration management and close coordination across teams.
- Experience in writing code in R and Python to manipulate data for data loads, extracts, statistical analysis, modeling, and data munging.
- Utilized Kubernetes and Docker as the runtime environment for the CI/CD system to build, test, and deploy; experienced in creating and running Docker images containing multiple microservices.
- Extensive hands-on experience with distributed computing architectures such as AWS products (e.g., EC2, Redshift, EMR, and Elasticsearch), Hadoop, Python, and Spark, plus effective use of Azure SQL Database, MapReduce, Hive, SQL, and PySpark to solve big data problems.
- Strong experience with Microsoft Azure Machine Learning Studio for data import/export, data preparation, exploratory data analysis, summary statistics, feature engineering, machine learning model development, and model deployment to server systems.
- Proficient in statistical methodologies including hypothesis testing, ANOVA, time series, principal component analysis, factor analysis, cluster analysis, and discriminant analysis.
- Worked with various text analytics libraries like Word2Vec and LDA; experienced with hyperparameter tuning techniques like grid search and random search, and with model performance tuning using ensembles and deep learning.
- Utilized analytical applications like R, SPSS, Rattle and Python to identify trends and relationships between different pieces of data, draw appropriate conclusions and translate analytical findings into risk management and marketing strategies that drive value.
- Expertise in transforming business resources and requirements into manageable data formats and analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across a massive volume of structured and unstructured data.
- Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
- Knowledge of working with proofs of concept (PoCs) and gap analysis; gathered the data needed for analysis from different sources and prepared it for exploration using data munging and Teradata.
- Well experienced in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
- Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality (see the sketch following this summary).
- Skilled in performing data parsing, ingestion, manipulation, architecture, modelling, and preparation, with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt, and reshape.
- Expertise in designing complex mappings, in performance tuning, and in slowly changing dimension tables and fact tables.
- Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
- Experienced in building automated regression scripts in Python for validation of ETL processes across multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
- Proficiency in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
- Excellent communication skills; work successfully in fast-paced, multitasking environments both independently and in collaborative teams; a self-motivated, enthusiastic learner.
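A minimal sketch of the Python-based Hive extension pattern referenced above (the script, table, and column names are hypothetical): Hive streams rows to the script as tab-separated text on stdin, and the script writes transformed rows back to stdout.

```python
#!/usr/bin/env python
# clean_phone.py - hypothetical Hive TRANSFORM script: normalizes a phone
# number column streamed in as tab-separated rows (customer_id, phone).
import re
import sys

for line in sys.stdin:
    customer_id, phone = line.rstrip("\n").split("\t")
    digits = re.sub(r"\D", "", phone)                 # keep digits only
    normalized = digits[-10:] if len(digits) >= 10 else digits
    print(f"{customer_id}\t{normalized}")
```

In Hive the script would be attached and invoked roughly as: ADD FILE clean_phone.py; SELECT TRANSFORM(customer_id, phone) USING 'python clean_phone.py' AS (customer_id, phone_normalized) FROM customers; (all names illustrative).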
TECHNICAL SKILLS
Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, HBase, YARN, Kafka, Sqoop, Impala, Oozie, ZooKeeper, Spark, Ambari, Elasticsearch, MongoDB, Avro, Storm, Parquet, Snappy, AWS
Cloud Technologies: AWS, Azure, Google Cloud Platform (GCP)
IDEs: IntelliJ, Eclipse, Spyder, Jupyter
Databases & Warehouses: Oracle 11g/10g/9i, MySQL, DB2, MS SQL Server, HBase, NoSQL, MS Access, Teradata
Programming / Query Languages: Java, Python, Scala, SQL, PL/SQL, NoSQL, PySpark, Linux shell scripting
Data Engineer/Big Data Tools / Cloud / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, Mahout, MLlib, Oozie, ZooKeeper, AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Salesforce, NiFi, GCP, Google Cloud Shell, Linux, BigQuery, Bash Shell, Unix, Tableau, Power BI, SAS, Web Intelligence, Crystal Reports
Version Controllers: GIT, SVN, Bitbucket
ETL Tools: Informatica, Talend
Operating Systems: UNIX, LINUX, Mac OS, Windows.
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapReduce, AWS EMR
PROFESSIONAL EXPERIENCE
Confidential, Edison, NJ
Senior Data Engineer
Responsibilities:
- Designed and deployed Hadoop clusters and various Big Data analytics tools, including Pig, Hive, HBase, Oozie, Sqoop, Kafka, and Spark, on the Cloudera distribution.
- Worked on the Cloudera distribution deployed on AWS EC2 instances.
- Hands-on experience with Cloudera Hue to import data through the GUI.
- Integrated Apache Kafka with the Spark Streaming process to consume data from external REST APIs and run custom functions.
- Performed performance tuning of Spark jobs using caching and by taking full advantage of the cluster environment.
- Developed Spark scripts using Scala shell commands as per requirements.
- Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
- Scheduled the Oozie workflow engine to run multiple Hive and Pig jobs.
- Ran Hadoop streaming jobs to process terabytes of text data; worked with different file formats such as text, SequenceFile, Avro, ORC, and Parquet.
- Configured, supported, and maintained all network, firewall, storage, load balancer, operating system, and software components in AWS EC2.
- Implemented Amazon EMR for Big Data processing on a Hadoop cluster of virtual servers backed by Amazon EC2 and S3.
- Worked on custom Pig loaders and storage classes to handle a variety of data formats such as JSON and XML.
- Designed and deployed multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM, CloudFormation), focusing on high availability, fault tolerance, and auto-scaling via AWS CloudFormation.
- Supported storage in AWS using Elastic Block Store (EBS), S3, and Glacier; created volumes and configured snapshots for EC2 instances.
- Implemented a generalized solution model using AWS SageMaker.
- Extensive expertise using the core Spark APIs and processing data on an EMR cluster.
- Worked on ETL migration services by developing and deploying AWS Lambda functions to build a serverless data pipeline whose output is registered in the Glue Data Catalog and queried from Athena (see the Lambda/Glue/Athena sketch after this list).
- Created S3 buckets and managed their policies; utilized S3 and Glacier for storage and backup on AWS.
- Managed IAM users by creating new users, granting limited access as needed, and assigning roles and policies to specific users.
- Acted as technical liaison between the customer and the team on all AWS technical aspects.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Performed data analysis with Cassandra using Hive External tables.
- Designed the Column families in Cassandra.
- Ran Hadoop streaming jobs to process terabytes of XML-format data.
- Used the Spark API over Hadoop YARN as the execution engine for data analytics with Hive.
- Implemented the YARN Capacity Scheduler in various environments and tuned configurations according to per-application job loads.
- Configured the continuous integration system to execute suites of automated tests at desired frequencies using Jenkins, Maven, and Git.
- Worked in an AWS-hosted Databricks environment and used Spark Structured Streaming to consume data from Kafka topics in real time and perform merge operations on Delta Lake tables (see the streaming-merge sketch after this list).
- Designed, developed, and implemented ETL processes to support change data capture (CDC) on the Databricks platform.
- Used Snowflake extensively for ETL operations, including moving data from Snowflake to S3 and from S3 to Snowflake.
- Proficient with Snowflake architecture and concepts.
- Experience with Agile and Scrum methodologies; involved in designing, creating, and managing continuous build and integration environments.
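A hedged sketch of the serverless Lambda/Glue/Athena pipeline pattern described above, assuming hypothetical bucket, crawler, database, and table names; it uses only standard boto3 calls (start_crawler, start_query_execution).

```python
# Hypothetical AWS Lambda handler: after new files land in S3, refresh the
# Glue Data Catalog via a crawler and kick off an Athena validation query.
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

def lambda_handler(event, context):
    # Re-crawl the landing prefix so new partitions/tables appear in the catalog.
    glue.start_crawler(Name="sales-landing-crawler")        # hypothetical crawler

    # Query the catalogued table through Athena; results land in an S3 location.
    response = athena.start_query_execution(
        QueryString="SELECT count(*) FROM sales_landing",   # illustrative query
        QueryExecutionContext={"Database": "analytics_db"}, # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    return {"query_execution_id": response["QueryExecutionId"]}
```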
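And a minimal PySpark sketch of the streaming-merge pattern (Kafka topic consumed with Structured Streaming, each micro-batch upserted into a Delta Lake table); the brokers, topic, key column, and target table are hypothetical, and the MERGE uses the standard delta-spark API.

```python
# Minimal sketch: consume a Kafka topic with Structured Streaming and upsert
# each micro-batch into a Delta Lake table via MERGE (names are hypothetical).
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical brokers
          .option("subscribe", "orders")                       # hypothetical topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def upsert_batch(batch_df, batch_id):
    # MERGE the micro-batch into the target Delta table on the business key.
    target = DeltaTable.forName(spark, "lakehouse.orders")     # hypothetical table
    (target.alias("t")
           .merge(batch_df.alias("s"), "t.order_id = s.order_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(events.writeStream
       .foreachBatch(upsert_batch)
       .option("checkpointLocation", "/mnt/checkpoints/orders")
       .start())
```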
Environment: Hadoop, HDFS, Hive, Spark, Cloudera, AWS EC2, AWS S3, AWS EMR, Sqoop, Kafka, YARN, Shell Scripting, Scala, Pig, Databricks, Snowflake, Oozie, Agile methods, MySQL
Confidential - Chicago, IL
Data Engineer
Responsibilities:
- Experienced in development using the Cloudera distribution.
- Hands-on experience with Azure cloud services (PaaS & IaaS): Azure Synapse Analytics, Azure SQL, Data Factory, Azure Analysis Services, HDInsight, Azure Monitoring, Key Vault, and Azure Data Lake.
- Worked extensively on running Spark jobs in the Azure HDInsight environment.
- Used Spark as the data processing framework and worked on performance tuning of production jobs.
- Ingested data from MS SQL Server into Azure data storage.
- Created tabular models on Azure Analysis Services to meet business reporting requirements.
- Good experience working with Azure Blob and Data Lake storage and loading data into Azure Synapse Analytics (SQL DW).
- As a Hadoop developer, was responsible for managing the data pipelines and the data lake.
- Experienced working with the Snowflake data warehouse.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Designed custom Spark REPL application to handle similar datasets.
- Used Hadoop scripts for HDFS (Hadoop Distributed File System) data loading and manipulation.
- Performed Hive test queries on local sample files and HDFS files.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Analyzed the Hadoop cluster and various Big Data analytics tools, including Pig, Hive, HBase, Spark, and Sqoop.
- Worked extensively with the Talend Admin Console and scheduled jobs in the Job Conductor.
- Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization, and user report generation.
- Implemented auto balance and data reconciliation measures during data receipt, stage load and production load process.
- Wrote several Teradata SQL queries using Teradata SQL Assistant for ad hoc data pull requests.
- Created Teradata objects such as tables and views.
- Worked extensively on converting Oracle scripts into Teradata scripts.
- Conducted webinar sessions for the offshore/onsite team on Teradata XML services training; contributed white papers to knowledge portals.
- Analyzed user request patterns and implemented various performance optimization measures including implementing partitions and buckets in HiveQL.
- Assigned names to columns using the case class option in Scala.
- Assisted in loading large sets of data (structured, semi-structured, and unstructured) into HDFS.
- Developed Spark SQL to load tables into HDFS to run select queries on top.
- Developed analytical component using Scala, Spark, and Spark Stream.
- Used visualization tools such as Power View for Excel and Tableau for visualizing data and generating reports.
- Worked on the NoSQL databases HBase and MongoDB.
- Performed validation and verification of software at all testing phases, including functional testing, system integration testing, end-to-end testing, regression testing, sanity testing, user acceptance testing, smoke testing, disaster recovery testing, production acceptance testing, and pre-prod testing.
- Good experience logging defects in Jira and Azure DevOps.
- Developed Python scripts to parse JSON documents and load the data into the database (see the sketch after this list).
- Generated various graphical capacity-planning reports using Python packages like NumPy and Matplotlib.
- Analyzed generated logs and predicted/forecasted the next occurrence of events using various Python libraries.
- Used Python APIs to extract daily data from multiple vendors.
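A small illustrative sketch of the JSON-parse-and-load pattern mentioned above, assuming a hypothetical newline-delimited JSON file, staging table, and ODBC connection string.

```python
# Hypothetical sketch: parse newline-delimited JSON documents and bulk-insert
# selected fields into a SQL Server staging table via pyodbc.
import json
import pyodbc

CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dbhost;DATABASE=staging;Trusted_Connection=yes"  # hypothetical

def load_events(path):
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:                      # one JSON document per line
            doc = json.loads(line)
            rows.append((doc["event_id"], doc["event_type"], doc.get("user_id")))

    with pyodbc.connect(CONN_STR) as conn:
        cur = conn.cursor()
        cur.fast_executemany = True          # speeds up the bulk insert
        cur.executemany(
            "INSERT INTO stg_events (event_id, event_type, user_id) VALUES (?, ?, ?)",
            rows,
        )
        conn.commit()

if __name__ == "__main__":
    load_events("events.jsonl")              # hypothetical input file
```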
Environment: Hadoop, Azure, Spark, Hive, Oozie, Java, Linux, Maven, MS-SQL, Oracle 11g/10g, Zookeeper, MySQL.
Confidential - Charlotte, NC
Data Engineer
Responsibilities:
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Used Sqoop to import and export data between Oracle/PostgreSQL and HDFS for analysis.
- Migrated Existing MapReduce programs to Spark Models using Python.
- Responsible for logical and physical data modelling for various data sources on Confidential Redshift.
- Designed and Developed ETL jobs to extract data from Salesforce replica and load it in data mart in Redshift.
- Performed data validation between the data present in the data lake and the S3 bucket.
- Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data.
- Designed batch processing jobs using Apache Spark that ran roughly ten-fold faster than the equivalent MapReduce jobs.
- Used Kafka for real time data ingestion.
- Created different topics for reading data in Kafka.
- Read data from different topics in Kafka.
- Created database objects such as stored procedures, UDFs, triggers, indexes, and views using T-SQL in both OLTP and relational data warehouse environments in support of ETL.
- Developed complex ETL packages using SQL Server 2008 Integration Services to load data from various sources such as Oracle, SQL Server, and DB2 into the staging database and then into the data warehouse.
- Created report models from cubes as well as the relational data warehouse to build ad hoc and chart reports.
- Written Hive queries for data analysis to meet the business requirements.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Created many Spark UDFs and UDAFs for functions not available out of the box in Hive and Spark SQL (see the sketch after this list).
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Implemented various performance optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Good knowledge of Spark platform parameters such as memory, cores, and executors.
- Used the ZooKeeper implementation in the cluster to provide concurrent access to Hive tables with shared and exclusive locking.
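A minimal PySpark sketch of registering a custom UDF so it can be called from Spark SQL over a Hive-backed table; the function name, masking logic, and table are hypothetical.

```python
# Minimal sketch: register a Python UDF so Spark SQL / Hive queries can call
# a function that does not exist natively (names are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def mask_account(account_number):
    """Keep only the last four characters, mask the rest."""
    if account_number is None:
        return None
    return "*" * max(len(account_number) - 4, 0) + account_number[-4:]

# Register for use inside SQL text (Spark SQL and Hive-backed tables).
spark.udf.register("mask_account", mask_account, StringType())

masked = spark.sql("""
    SELECT customer_id, mask_account(account_number) AS account_masked
    FROM warehouse.accounts
""")
masked.show(5)
```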
Environment: Linux, Apache Hadoop Framework, HDFS, YARN, Hive, HBase, AWS (S3, EMR), Scala, Spark, Sqoop, MS SQL Server 2014, Teradata, ETL, SSIS, Alteryx, Tableau (Desktop 9.x/Server 9.x), Python 3.x (Scikit-Learn/SciPy/NumPy/Pandas), AWS Redshift, Spark (PySpark, MLlib, Spark SQL).
Confidential - Memphis, TN
Hadoop Developer
Responsibilities:
- Experience in Big Data analytics and design in the Hadoop ecosystem using MapReduce programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, and Kafka.
- Installed and configured Hadoop MapReduce, HDFS and developed multiple MapReduce jobs in Java for data cleansing and preprocessing.
- Loaded data from the UNIX file system to HDFS; installed and configured Hive and wrote Hive UDFs; imported and exported data to/from HDFS and Hive using Sqoop.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, manage and review data backups, manage and review Hadoop log files.
- Worked hands on with ETL process using Informatica.
- Handled importing of data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
- Integrated the Redshift SSO cluster with Talend.
- Integrated IAM roles into Talend components.
- Extracted the data from Teradata into HDFS using Sqoop.
- Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
- Wrote various data normalization jobs for new data ingested into Redshift.
- Advanced knowledge of Confidential Redshift and MPP database concepts.
- Migrated the on-premises database structure to the Confidential Redshift data warehouse.
- Analyzed the data by running Hive queries and Pig scripts to understand user behavior.
- Exported the patterns analyzed back into Teradata using Sqoop.
- Developed and implemented an R and Shiny application showcasing machine learning for business forecasting; developed predictive models in Python and R for customer churn prediction and customer classification (see the sketch after this list).
- Worked with applications such as R, SPSS, and Python to develop neural network algorithms and cluster analysis, and used ggplot2 and Shiny in R to understand data and develop applications.
- Partner with technical and non-technical resources across the business to leverage their support and integrate our efforts.
- Partner with infrastructure and platform teams to configure, tune tools, automate tasks and guide the evolution of internal big data ecosystem; serve as a bridge between data scientists and infrastructure/platform teams.
- Implemented Big Data Analytics and Advanced Data Science techniques to identify trends, patterns, and discrepancies on petabytes of data by using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, MapReduce, and Azure Machine Learning.
- Performed data analysis using regression, data cleaning, Excel VLOOKUP, histograms, and the TOAD client, and presented the analysis with suggested solutions for investors.
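A hedged scikit-learn sketch of the kind of churn-classification model referenced above; the CSV path, feature columns, and label are hypothetical.

```python
# Illustrative churn-classification sketch (hypothetical dataset and columns).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("customer_churn.csv")            # hypothetical input file
features = ["tenure_months", "monthly_charges", "support_tickets"]  # hypothetical
X = df[features]
y = df["churned"]                                 # 0/1 label, hypothetical

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
```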
Environment: Hadoop, MapReduce, HDFS, UNIX, Hive, Sqoop, Cassandra, ETL, Oozie, Big Data ecosystems, Pig, Cloudera, Python, Informatica Cloud Services, Salesforce, Unix scripts, Flat Files, XML files, Redshift.
Confidential
Data Engineer
Responsibilities:
- Documented the complete process flow to describe program development, logic, testing, implementation, application integration, and coding.
- Recommended structural changes and enhancements to systems and databases.
- Conducted Design reviews and technical reviews with other project stakeholders.
- Was part of the complete project life cycle, from requirements through production support.
- Created test plan documents for all back-end database modules.
- Used MS Excel, MS Access, and SQL to write and run various queries.
- Worked extensively on creating tables, views, and SQL queries in MS SQL Server.
- Worked with internal architects, assisting in the development of current- and target-state data architectures.
- Coordinated with business users to design new reporting in an appropriate, effective, and efficient way based on user needs and existing functionality.
- Troubleshot, fixed, and deployed many Python bug fixes for the two main applications that were a primary source of data for both customers and the internal customer service team.
- Wrote Python scripts to parse JSON documents and load the data into the database.
- Generated various graphical capacity-planning reports using Python packages like NumPy and Matplotlib.
- Analyzed generated logs and predicted/forecasted the next occurrence of events using various Python libraries.
- Built models using techniques such as regression, tree-based ensemble methods, time series forecasting, KNN, clustering, and Isolation Forest.
- Worked on data that was a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
- Extensively performed large data reads/writes to and from CSV and Excel files using pandas.
- Tasked with maintaining RDDs using Spark SQL.
- Communicated and coordinated with other departments to collect business requirements.
- Created Autosys batch processes to fully automate the model to pick the latest and best-fitting bond for each market.
- Created a framework using Plotly, Dash, and Flask for visualizing trends and understanding patterns for each market from historical data (see the sketch after this list).
- Used Python APIs to extract daily data from multiple vendors.
- Used Spark and Spark SQL for data integration and manipulation; worked on a POC for creating a Docker image on Azure to run the model.
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
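A minimal Plotly/Dash sketch of the trend-visualization framework mentioned above; the data file and column names are hypothetical (Dash serves the app through Flask under the hood).

```python
# Minimal Dash app sketch: plot a per-market trend line from historical data
# (file and column names are hypothetical).
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html

df = pd.read_csv("market_history.csv")            # hypothetical: date, market, price
fig = px.line(df, x="date", y="price", color="market", title="Price trend by market")

app = Dash(__name__)                              # Dash runs on a Flask server
app.layout = html.Div([
    html.H3("Market trends"),
    dcc.Graph(figure=fig),
])

if __name__ == "__main__":
    app.run(debug=True)                           # older Dash 2.x releases use app.run_server(debug=True)
```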
Environment: SQL, SQL Server 2012, MS Office, MS Visio, Jupyter, R 3.1.2, Python, SSRS, SSIS, SSAS, MongoDB, HBase, HDFS, Hive, Pig, SQL Server Management Studio, Business Intelligence Development Studio.