Sr. Data Engineer Resume
Charlotte, North Carolina
SUMMARY
- 8+ years of IT experience, currently working in a Big Data capacity with the Hadoop ecosystem across internal and cloud-based platforms.
- Designed and implemented complete end-to-end Hadoop-based data analytics solutions using Big Data technologies: HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, and HBase.
- Experience in working with different Hadoop distributions like CDH and Hortonworks.
- Experienced in improving the performance and optimization of existing Hadoop algorithms with Spark, using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Proficient in Big Data technologies including Hadoop, Apache NiFi, HiveQL, the HBase NoSQL database, Sqoop, Spark, Scala, Oozie, and Pig, as well as Oracle Database and Unix shell scripting.
- Implemented enterprise data lakes using Apache NiFi.
- Designed and developed microservice components for the business using Spring Boot.
- Experience developing Pig Latin and HiveQL scripts for data analysis and ETL, extending the default functionality by writing User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) for custom, data-specific processing (see the PySpark UDF sketch after this list).
- Strong knowledge of distributed systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
- Good experience creating data ingestion pipelines, data transformations, data management, data governance, and real-time streaming at the enterprise level.
- Hands-on expertise with AWS databases such as RDS (Aurora), Redshift, DynamoDB, and ElastiCache (Memcached & Redis).
- Designed and deployed multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
- Experience using SDLC methodologies such as Waterfall and Agile Scrum for design and development.
- Expert in the Hive data warehouse tool: creating tables, distributing data via partitioning and bucketing, and writing and optimizing HiveQL queries.
- In-depth understanding of Hadoop architecture and its components, such as the ResourceManager, ApplicationMaster, NameNode, DataNode, and HBase design principles.
- Experience with Cloudera distributions (CDH4/CDH5).
- Experience developing iterative algorithms using Spark Streaming in Scala and Python to build near real-time dashboards.
- Experience migrating data between RDBMS and unstructured sources and HDFS using Sqoop.
- Experience with Agile methodologies, including Scrum.
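Below is a minimal PySpark sketch of the UDF pattern referenced above; the table, column names, and normalization rule are hypothetical placeholders rather than details from a specific engagement.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

# Hypothetical rule: normalize free-form status codes before loading into Hive.
def normalize_status(raw):
    if raw is None:
        return "UNKNOWN"
    return raw.strip().upper().replace("-", "_")

normalize_status_udf = F.udf(normalize_status, StringType())

# Assumed table and column names, used only to illustrate the pattern.
df = spark.table("staging.claims_raw")
cleaned = df.withColumn("status_code", normalize_status_udf(F.col("status_code")))
cleaned.write.mode("overwrite").saveAsTable("analytics.claims_clean")
```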
TECHNICAL SKILLS
Big Data Tools: Hadoop ecosystem (Hadoop 3.0, MapReduce), Spark 2.3, Airflow 1.10.8, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3
BI Tools: SSIS, SSRS, SSAS.
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: SQL, PL/SQL, and UNIX shell scripting.
Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
Cloud Platform: AWS, Azure, Google Cloud.
Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena
Databases: Oracle, Teradata R15/R14.
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
ETL/Data Warehouse Tools: Informatica 9.6/9.1 and Tableau.
Operating System: Windows, Unix, Sun Solaris
PROFESSIONAL EXPERIENCE
Confidential, Charlotte, North Carolina
Sr. Data Engineer
Responsibilities:
- Worked extensively on migrating existing on-prem data pipelines to the AWS cloud for better scalability and easier infrastructure maintenance.
- Used Amazon Elastic Compute Cloud (EC2) for computational tasks and Simple Storage Service (S3) as the storage mechanism.
- Used AWS services such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
- Used AWS Simple Workflow (SWF) for automating and scheduling data pipelines.
- Automated the creation and termination of EMR clusters as part of starting the data pipelines (see the boto3 sketch after this list).
- Utilized the Glue Data Catalog as a common metastore between EMR clusters and the Athena query engine, with S3 as the storage layer for both.
- Good experience with analysis tools such as Tableau and Splunk for regression analysis, pie charts, and bar graphs.
- Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks to automate the ingestion, processing, and delivery of both structured and semi-structured data, in batch and real-time streaming modes.
- Wrote complex Hive scripts to perform various data analyses and create reports requested by business stakeholders.
- Applied efficient and scalable data transformations on the ingested data using Spark framework.
- Built Spark scripts utilizing Scala shell commands as required.
- Worked closely with machine learning teams to deliver feature datasets in an automated manner to support model training and model scoring.
- Performed Spark join optimizations; troubleshot, monitored, and wrote efficient code using Scala.
- Gained good knowledge in troubleshooting and performance tuning Spark applications and Hive scripts to achieve optimal performance.
- Performed Data Migration to GCP.
- Leveraged cloud and GPU computing technologies, such as AWS and GCP, for automated machine learning and analytics pipelines.
- Developed and deployed the outcomes using Spark and Scala code on a Hadoop cluster running on GCP.
- Experience with Google Cloud components, Google container builders, GCP client libraries, and the Cloud SDK.
- Experience building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
- Developed and deployed data pipelines in clouds such as AWS and GCP.
- Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
- Worked with AWS and GCP clouds using GCP Cloud Storage, Dataproc, Dataflow, and BigQuery, along with EMR, S3, Glacier, and EC2 instances with EMR clusters.
- Stored data files in Google Cloud Storage buckets on a daily basis; used Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.
- Experience working with multiple cloud platforms: AWS, Azure, and GCP.
- Working knowledge of GCP tools such as BigQuery, Pub/Sub, Cloud SQL, and Cloud Functions.
- Implemented cloud integrations to GCP and Azure with bi-directional flow setups for data migrations.
- Hands-on experience with AWS services such as EC2, S3, ELB, RDS, SQS, EBS, VPC, AMI, SNS, CloudWatch, CloudTrail, CloudFormation, AWS Config, Auto Scaling, CloudFront, IAM, and Route 53.
- Built and maintained Docker/Kubernetes container clusters managed by Kubernetes on GCP, using Linux, Bash, Git, and Docker.
- Developed solutions to big data and cloud issues, such as deploying Docker containers and Kubernetes (k8s) pods on GCP.
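A minimal sketch of the EMR cluster automation referenced above, using boto3; the cluster name, instance types, EMR release, and S3 log bucket are placeholder assumptions.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Spin up a transient cluster at the start of a pipeline run (illustrative settings).
response = emr.run_job_flow(
    Name="nightly-pipeline-cluster",            # placeholder name
    ReleaseLabel="emr-5.30.0",                  # assumed EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # auto-terminate once steps finish
    },
    LogUri="s3://example-bucket/emr-logs/",     # placeholder bucket
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
cluster_id = response["JobFlowId"]

# Explicit teardown if the pipeline terminates the cluster itself.
emr.terminate_job_flows(JobFlowIds=[cluster_id])
```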
Environment: AWS, Hive, Spark, Scala, Tableau, Splunk, Hadoop, HDFS, GCP
Confidential, Omaha, Nebraska
Sr. Big Data Engineer
Responsibilities:
- Replaced existing MapReduce programs and Hive queries with Spark applications written in Scala.
- Developed data pipelines using Flume, Sqoop, and Pig to extract data from weblogs and store it in HDFS.
- Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.
- Experience working with MVC-architecture-based frameworks such as Node.js and AngularJS.
- Created and maintained server-side Node.js applications.
- Worked on data that was a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
- Worked on NGP Onboarding to implement Delete Processor in Cosmos/Azure SQL.
- Developed and scheduled data load jobs in the Data Studio tool / Cosmos UI using SCOPE scripts over both structured and non-structured streams.
- Evaluated the customer/seller health score using Python scripts.
- Developed Shell scripts for scheduling and automating the job flow.
- Applied Hive tuning techniques such as partitioning, bucketing, and memory optimization.
- Hands-on experience with Sqoop import, export, and eval.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala; initial versions were implemented in Python (PySpark).
- Automated ETL processes using Python, Unix shell, and Perl scripting.
- Performed data transfers using Azure Synapse and PolyBase.
- Used DistCp while loading historic data into Hive.
- Implemented Hadoop security using Kerberos.
- Involved in designing and development of data warehouse using Hive.
- Loaded data from different sources such as HDFS, HBase and RDBMS into Hive.
- Created Hive tables, views and external tables.
- Monitored and optimized Hive queries performance.
- Used Spark SQL to load data, created schema RDDs on top of it that load into Hive tables, and handled structured data using Spark SQL.
- Wrote Python scripts to parse XML documents and load the data in database.
- Converted HQL into Spark transformations using Spark RDDs with Python and Scala (see the PySpark sketch after this list).
- Wrote a Pig script that picks up data from one HDFS path, performs aggregation, and loads the results into another path, which later populates another domain table.
- Packaged this script into a JAR and passed it as a parameter in the Oozie workflow.
- Developed JSON definitions for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
- Worked with SnowSQL; involved in migrating objects from Teradata to Snowflake.
- Created data sharing between two Snowflake accounts and unit-tested data between Redshift and Snowflake.
- Built an ETL process that utilizes a Spark JAR, inside which the business analytical model executes.
- Knowledge in deploying and managing cloud infrastructure using AWS CloudFormation, AWS IAM, AWS Security Groups and AWS VPC.
- Experienced in designing and building Big Data solutions using Hadoop, Hive, Spark and Kafka.
- Familiar with Apache Airflow for scheduling and monitoring ETL jobs.
- Expertise in developing code using Python, Scala and SQL.
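A minimal sketch of converting a HiveQL query into Spark transformations with PySpark, as referenced above; the database, table, and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes Hive support is configured on the cluster; names below are placeholders.
spark = (SparkSession.builder
         .appName("hql-to-spark")
         .enableHiveSupport()
         .getOrCreate())

# The original-style HiveQL can be run as-is through Spark SQL.
hql = """
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales.transactions
    GROUP BY customer_id
"""
agg_sql = spark.sql(hql)

# Equivalent DataFrame transformation, which is what the HQL is converted into.
agg_df = (spark.table("sales.transactions")
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total_amount")))

agg_df.write.mode("overwrite").saveAsTable("sales.customer_totals")
```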
Environment: Hadoop, Map Reduce, HDFS, Hive, Impala, UNIX, Linux, Tableau, Teradata, Cassandra, Sqoop, Oozie, SQL, Kafka, Spark, Scala, Java, Azure, Spark SQL, Spark-Streaming, pig, NoSQL, Solr, GIT.
Confidential, New York, NY
Sr. Data Engineer
Responsibilities:
- Worked with Hadoop eco system covering HDFS, HBase, YARN and MapReduce.
- Worked with Oozie Workflow Engine in running workflow jobs with actions that run Hadoop MapReduce, Hive, Spark jobs.
- Experience in creating and managing Azure Blob Storage for staging data.
- Hands on Experience in setting up Azure Data factory and creating the ingestion pipelines to pull data to Azure Data Lake store and Azure Blob storage.
- Experience in developing Azure Data Bricks notebooks to perform data transformations.
- Proficient in writing complex SQL statements and stored procedures.
- Experience with Azure SQL Data Warehouse, Azure Synapse Analytics, and Power BI.
- Experience in developing and deploying Azure Cognitive Services for data analysis and Machine learning.
- Hands on experience in developing big data solutions using HDInsight, Azure Databricks, and Azure Stream Analytics.
- Knowledge of Azure DevOps Server, Azure Resource Manager, and Azure Security Center.
- Developed dynamic Data Factory pipelines using parameters and triggered them as desired using events such as file availability on Blob storage.
- Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines (see the DAG sketch after this list).
- Performed data mapping and data design (data modeling) to integrate data across multiple databases into the EDW.
- Responsible for the design and development of advanced Python programs to prepare, transform, and harmonize data sets in preparation for modeling.
- Hands-on experience with Hadoop/Big Data technologies for storage, querying, processing, and analysis of data.
- Created pipelines in ADF using linked services to extract, transform, and load data from different sources such as Azure SQL and Blob storage.
- Experienced with Node.js and PostgreSQL, and in building pipelines in AWS with ETL tools such as Spark and Glue.
- Worked on AWS Elastic Beanstalk for fast deployment of applications developed with Java, Node.js, and Python on familiar servers such as Apache.
- Used Node.js for standalone UI testing.
- Developed Spark/Scala and Python code for a regular expression (regex) project in a Hadoop/Hive environment for big data resources.
- Automated the monthly data validation process to check the data for nulls and duplicates, and created reports and metrics to share with business teams.
- Experience in working with Azure Data Lake Analytics and U-SQL for data transformation activities.
- Experience in using REST API calls, PowerShell, and Azure CLI to manage Azure Data Factory artifacts.
- Experience in documenting the design and implementation of Azure Data Factory solutions.
- Experience in working with Azure DevOps for continuous integration and deployment.
- Used clustering techniques like K-means to identify outliers and to classify unlabeled data.
- Performed data gathering, data cleaning, and data wrangling using Python.
- Experimented with Ensemble methods to increase accuracy of training model with different Bagging and Boosting methods.
- Worked on a Snowflake environment to remove redundancy and loaded real-time data from various sources into HDFS using Kafka.
- Identified target groups by conducting Segmentation analysis using Clustering techniques like K-means.
- Conducted model optimization and comparison using stepwise function based on AIC value.
- Used cross-validation to test models with different batches of data to optimize models and prevent overfitting.
- Explored and analyzed customer specific features by using Matplotlib, Seaborn in Python and dashboards in Tableau.
- Utilized domain knowledge and application portfolio knowledge to play a key role in defining the future state of large, business technology programs.
- Working experience with container orchestration tools such as Kubernetes.
- Deployed and managed several Kubernetes clusters in production, with experience in components such as pods, node autoscaling, deployments, and services.
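A minimal Airflow DAG sketch of the scheduling pattern referenced above (Airflow 1.10-style imports); the DAG name, owner, and task bodies are hypothetical.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10-style import


def extract(**context):
    # Placeholder for pulling data from the source system.
    print("extracting data")


def load(**context):
    # Placeholder for loading data into the target store.
    print("loading data")


default_args = {
    "owner": "data-engineering",        # placeholder owner
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_daily_pipeline",    # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract, provide_context=True)
    load_task = PythonOperator(task_id="load", python_callable=load, provide_context=True)

    extract_task >> load_task
```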
Environment: Hadoop, HDFS, Hbase, Oozie, Spark, Machine Learning, Big Data, Python, PySpark, DB2, MongoDB, Elastic Search, Web Services.
Confidential, St. Louis, MO
Big Data Engineer
Responsibilities:
- As a Data Engineer, provided technical expertise and aptitude to Hadoop technologies as they relate to the development of analytics.
- Responsible for the planning and execution of big data analytics, predictive analytics, and machine learning initiatives.
- Very good hands-on experience with advanced Big Data technologies such as the Spark ecosystem (Spark SQL, MLlib, SparkR, and Spark Streaming), Kafka, and predictive analytics (MLlib and R ML packages, including the H2O ML library).
- Designed and developed Spark jobs for performing ETL on large volumes of medical membership and claims data.
- Created Airflow Scheduling scripts in Python.
- Worked on migrating data from on-prem SQL Server to cloud databases such as Azure Synapse and Azure SQL DB.
- Involved in importing real-time data into Hadoop using Kafka and implemented Oozie jobs for the daily loads.
- Developed machine learning, statistical analysis, and data visualization applications for challenging data processing problems.
- Compiled data from various sources, both public and private databases, to perform complex analysis and data manipulation for actionable results.
- Implemented Dynamic Data Masking in Azure SQL Database and Azure Synapse Analytics with different masking functions and data types, using the Azure portal and T-SQL commands.
- Designed and developed Natural Language Processing models for sentiment analysis.
- Used predictive modeling with tools in SAS, SPSS, and Python.
- Experience designing star schemas and snowflake schemas for data warehouse and ODS architectures.
- Applied clustering algorithms, i.e. hierarchical and K-means, with the help of scikit-learn and SciPy (see the K-means sketch after this list).
- Developed web page and form validation with team using Angular, Node.js, HTML.
- Developed visualizations and dashboards using ggplot2, Tableau.
- Worked on development of data warehouse, data lake, and ETL systems using relational and non-relational (SQL and NoSQL) tools.
- Experience in managing large-scale, geographically-distributed database systems, including relational (Oracle, SQL server) and NoSQL (MongoDB, Cassandra) systems.
- Built and analyzed datasets using R, SAS, MATLAB, and Python (in decreasing order of usage).
- Applied linear regression in Python and SAS to understand the relationship between different attributes of dataset and causal relationship between them.
- Performed complex pattern recognition of financial time series data and forecasting of returns using ARMA and ARIMA models and exponential smoothing for multivariate time series data.
- Used Cloudera Hadoop YARN to perform analytics on data in Hive.
- Wrote Hive queries for data analysis to meet the business requirements.
- Expertise in Business Intelligence and data visualization using Tableau.
- Expert in Agile and Scrum Process.
- Worked on setting up AWS EMR clusters to process monthly workloads.
- Wrote PySpark User Defined Functions (UDFs) for various use cases and applied business logic wherever necessary in the ETL process.
- Wrote Spark SQL and Spark scripts (PySpark) in the Databricks environment to validate the monthly account-level customer data stored in S3.
- Worked in large-scale database environments like Hadoop and MapReduce, with an understanding of the working mechanics of Hadoop clusters, nodes, and the Hadoop Distributed File System (HDFS).
- Interfaced with large-scale database system through an ETL server for data extraction and preparation.
- Identified patterns, data quality issues, and opportunities and leveraged insights by communicating opportunities with business partners.
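A minimal scikit-learn sketch of the K-means segmentation referenced above; the feature matrix is synthetic and the choice of k is an assumption (in practice k would come from the elbow method or silhouette scores).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a customer feature matrix: rows are customers, columns are features.
rng = np.random.default_rng(42)
features = rng.normal(size=(500, 4))

# Scale features so no single attribute dominates the distance metric.
scaled = StandardScaler().fit_transform(features)

# Fit K-means with an assumed k=4.
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(scaled)

print("cluster sizes:", np.bincount(labels))
print("inertia:", kmeans.inertia_)
```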
Environment: Hadoop, Hive, Oozie, Java, Linux, Maven, Apache NiFi, Oracle 11g/10g, Zookeeper, MySQL, Spark, AirFlow, Machine learning, AWS, MS Azure, Cassandra, Avro, HDFS, GitHub, Hive, Pig, Linux, Python (Scikit-Learn/SciPy/NumPy/Pandas), SAS, SPSS, MySQL, Bitbucket, Eclipse, XML, PL/SQL, SQL connector, JSON, Tableau, Jenkins.
Confidential
Data Analyst
Responsibilities:
- Planned, designed, and implemented application database code objects, such as stored procedures and views.
- Built and maintained SQL scripts, indexes, and complex queries for data analysis and extraction.
- Automated the configuration management of database and Big Data systems.
- Performed schema management, database sizing, and privilege maintenance.
- Created Sqoop scripts to ingest data from HDFS to Teradata and from SQL Server to HDFS and PostgreSQL (see the Sqoop sketch after this list).
- Installed and monitored PostgreSQL databases using standard monitoring tools such as Nagios.
- Performed daily log analysis using the pgBadger tool, along with query tuning.
- Maintained custom vacuum strategies at the table and database level.
- Provided database coding to support business applications using Sybase T-SQL.
- Performed quality assurance and testing of SQL server environment.
- Used Erwin tool for dimensional modelling (Star schema) of the staging database as well as the relational data warehouse.
- Developed new processes to facilitate import and normalization, including data files for counterparties.
- Worked with business stakeholders, application developers, and production teams across functional units to identify business needs and discuss solution options.
- Developed parameter and dimension-based reports, drill-down reports, matrix reports, charts, and Tabular reports using Tableau Desktop.
- Retrieved data from data warehouse and generated a series of meaningful business reports using SSRS.
- Expertise in Client-Server application development using Oracle 12c/11g/10g/9i, PL/SQL, SQL PLUS, TOAD and SQL LOADER.
- Validated and tested reports, then published the reports to the report server.
- Designed, coded, tested, and debugged custom queries using Microsoft T-SQL and SQL Reporting Services.
- Experience in Oracle Dynamic SQL, Records Collections and PL/SQL Tables.
- Conducted research to collect and assemble data for databases; was responsible for the design and development of relational databases for collecting data.
- Built data input and designed data collection screens; managed database design, maintenance, administration, and security for the company.
- Responsible for developing, supporting, and maintaining the ETL (Extract, Transform, and Load) processes using Informatica PowerCenter.
- Designed and developed several ETL scripts using Informatica and UNIX shell scripts.
- Analyzed source data coming from Oracle, flat files, and MS Excel, and coordinated with the data warehouse team in developing the dimensional model.
- Created FTP, ODBC, Relational connections for the sources and targets.
- Experience in database programming in PL/SQL (stored procedures, triggers, and packages).
- Developed UNIX shell scripts to control the process flow for Informatica workflows to handle high volume data.
- Used SQL to test various reports and ETL Jobs load in development, testing and production.
- Prepared Test cases based on Functional Requirements Document.
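A minimal sketch of wrapping a Sqoop ingest job in Python, related to the Sqoop scripts mentioned above; the JDBC URL, credentials file, source table, and HDFS paths are placeholder assumptions.

```python
import subprocess

# Hypothetical connection details; credentials would normally come from a secured password file.
SQLSERVER_JDBC = "jdbc:sqlserver://dbhost.example.com:1433;databaseName=sales"

sqoop_cmd = [
    "sqoop", "import",
    "--connect", SQLSERVER_JDBC,
    "--username", "etl_user",                     # placeholder user
    "--password-file", "/user/etl/.sqoop.pwd",    # placeholder HDFS password file
    "--table", "customers",                       # placeholder source table
    "--target-dir", "/data/raw/customers",        # placeholder HDFS landing path
    "--num-mappers", "4",
]

# Run the Sqoop job and raise if it exits with a non-zero status.
subprocess.run(sqoop_cmd, check=True)
```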
Environment: PostgreSQL, MS SQL Server, Windows Advanced Server, VB, XML, DTS, Query Analyzer, SSRS, SQL Profiler, Enterprise Manager, PL/SQL, Informatica PowerCenter 9.x