Data Engineer Resume
Irving, TX
SUMMARY
- Data Engineer with 7+ years of professional IT experience with Hadoop ecosystem components covering ingestion, data modeling, querying, processing, storage, analysis, and data integration, and in implementing enterprise-level Big Data systems.
- A skilled developer with strong problem-solving, debugging, and analytical capabilities who actively engages in understanding customer requirements, works independently and collaboratively, and communicates effectively with non-technical coworkers.
- Experience in installing, configuring, and using Apache Hadoop ecosystem components such as Hadoop Distributed File System (HDFS), MapReduce, YARN, Spark, NiFi, Pig, Hive, Flume, HBase, Oozie, ZooKeeper, and Sqoop, along with Scala.
- Hands-on experience in creating real-time data streaming solutions using Apache Spark Core, Spark SQL and DataFrames, Kafka, Spark Streaming, and Apache Storm.
- Excellent knowledge of Hadoop architecture and the daemons of Hadoop clusters, including NameNode, DataNode, ResourceManager, NodeManager, and JobHistory Server.
- Expertise in administering the Hadoop Cluster using Hadoop Distributions like Apache Hadoop & Cloudera.
- Extensive development experience in IDEs such as Eclipse, NetBeans, and Forte.
- Proficient in creating complex data ingestion pipelines, data transformations, data management and governance processes, and real-time streaming engines at an enterprise level.
- Proficient in working with NoSQL technologies such as HBase, Cassandra, and MongoDB.
- Extensive experience in working with ETL tool environments such as SSIS and Informatica and reporting tool environments such as SQL Server Reporting Services (SSRS).
- Experience in data warehousing concepts such as star, galaxy, and snowflake schemas, data marts, and the Kimball methodology used in relational and multidimensional data modeling.
- Hands-on experience in coding MapReduce/YARN programs using Java, Scala, and Python for analyzing Big Data.
- Good experience in working with cloud environments such as Amazon Web Services (AWS), Microsoft Azure, and GCP.
- Established connections to multiple Redshift clusters (Bank Prod, Card Prod, SBBDA Cluster) and provided the access needed for pulling information for analysis.
- Strong knowledge in extraction, transformation, and loading of data directly from different source systems such as flat files, Excel, Hyperion, Oracle, SQL Server, and Redshift.
- Worked on extensive migration of Hadoop and Spark Clusters to GCP, AWS and Azure.
- Experience in Azure Cloud, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Azure Cosmos DB (NoSQL), Azure Big Data technologies (Hadoop and Apache Spark), and Databricks.
- Used Kafka and Spark Streaming for data ingestion and cluster handling in real time processing.
- Good knowledge in using Apache NiFi to automate data movement between different Hadoop systems.
- Experienced in migrating HiveQL into Impala to minimize query response time.
- Extensive experience in the implementation of Continuous Integration (CI), Continuous Delivery and Continuous Deployment (CD) on various Java based Applications using Jenkins, TeamCity, Azure DevOps, Maven, Git, Nexus, Docker and Kubernetes.
- Expertise in Natural Language Processing (NLP), Text Mining, Topic Modelling, Sentiment Analysis.
- Proficient at using Spark APIs to explore, cleanse, aggregate, transform, and store machine sensor data.
- Experience in creating DataFrames using PySpark and performing operations on them in Python (see the sketch following this summary).
- Developed ETL/Hadoop-related Java code, created RESTful APIs using the Spring Boot framework, developed web apps using Spring MVC and JavaScript, and built internal coding frameworks.
- Excellent Experience in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
- Imported and exported data between different data sources and HDFS using Sqoop, and performed transformations using Hive and MapReduce before loading the data back into HDFS.
- Good Understanding and experience in Machine Learning Algorithms and Techniques like Classification, Clustering, Regression, Decision Trees, Random Forest, NLP, ANOVA, SVMs, Artificial Neural Networks.
- Extensive experience in Text Analytics, developing different Statistical Machine Learning solutions to various business problems and generating data visualizations using Python and R.
- Expertise in SQL Server Analysis Services (SSAS), SQL Server Reporting Services (SSRS) tools and in development of T-SQL, Oracle PL/SQL Scripts, Stored Procedures and Triggers for business logic implementation.
- Experience in creating interactive dashboards and creative visualizations using tools such as Tableau and Power BI.
- Hands-on experience with Microsoft Azure components such as HDInsight, Data Factory, Data Lake Storage, Blob Storage, and Cosmos DB.
- Experience in using distributed computing architectures such as AWS products (e.g., EC2, Redshift, EMR, and Elasticsearch), migrating raw data to Amazon S3, and performing refined data processing.
- Extensive skills in Linux and UNIX shell commands.
- Excellent working experience in Scrum / Agile framework and Waterfall project execution methodologies.
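The following is a minimal, illustrative PySpark sketch of the DataFrame work described in the summary above; the file path and column names (events.csv, user_id, amount, event_ts) are hypothetical placeholders, not actual project artifacts.

```python
# Minimal PySpark sketch: create a DataFrame and run a few common operations.
# The path and column names below are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

# Read a CSV file into a DataFrame, inferring the schema from the data.
events = spark.read.csv("s3a://example-bucket/events.csv", header=True, inferSchema=True)

# Typical operations: filter, derive a column, aggregate.
daily_totals = (
    events
    .filter(F.col("amount") > 0)
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("user_id", "event_date")
    .agg(F.sum("amount").alias("total_amount"))
)

daily_totals.show(10)
spark.stop()
```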
TECHNICAL SKILLS
Big Data Ecosystem: HDFS, MapReduce, Spark, YARN, Hive, Pig, HBase, Sqoop, Flume, Kafka, Oozie, ZooKeeper, Impala
Hadoop Technologies: Apache Hadoop 1.x, Apache Hadoop 2.x, Cloudera CDH4/CDH5, Hortonworks
Programming Languages: Python, Scala, Shell Scripting, HiveQL
Machine Learning: Regression, Decision Tree, Clustering, Random Forest Classification, SVM, NLP
Operating Systems: Windows (XP/7/8/10), Linux (Ubuntu, Centos)
NoSQL Databases: HBase, Cassandra, MongoDB
Databases (RDBMS): MySQL, Teradata, DB2, Oracle
Container/Cluster Managers: Docker, Kubernetes
BI Tools: Tableau, Power BI
Cloud: AWS, Azure
Web Development: HTML, XML, CSS
IDE Tools: Eclipse, Jupyter, Anaconda, PyCharm
Development Methodologies: Agile, Waterfall
PROFESSIONAL EXPERIENCE
Confidential
Data Engineer
Responsibilities:
- Wrote test cases for converting CSV files to Parquet format and for other scenarios.
- Migrated all the ATF supplier data to GCP in readable format.
- Integrated with the SIM application and built the ETL processes in Scala.
- Built and hosted, as part of the Data Lake team, an API to check firearm serial numbers and serve firearm descriptions.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Worked on PySpark APIs for data transformations.
- Performed data ingestion to Hadoop (Sqoop imports) and carried out validations and consolidations on the imported data.
- Added data from sources within Confidential and from two external data sources to the Data Lake, and worked on the UI according to the dashboard requirements.
- Built data pipelines to process the data using different processing methodologies in PySpark (a minimal CSV-to-Parquet sketch follows this section).
- Implemented data ingestion routines, both real-time and batch, using best practices in data modeling and ETL/ELT processes, leveraging cloud technologies and big data tools.
- Developed, built, and executed programs and scripts to automate various tasks using languages and tools such as Python, SQL, Snowflake, and Tableau.
- Performed data analytics on large data sets using Spark SQL and Python.
- Involved in Spark performance tuning to optimize resource and time utilization.
- Set up CI/CD automation on GitHub.
- Developed and maintained CI/CD (continuous integration/continuous deployment) pipelines that build, test, and deploy applications across multiple environments (development, testing, quality assurance, and production).
- Responsible for daily integrity checks, deployments, and releases.
Environment: Python, Azure, Spark, SparkSQL, Jira, GIT, Scala, parquet, CSV, Hadoop, Hive, Agile Scrum, CI/CD, Datalake, ETL/ELT, Snowflake, Tableau.
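A minimal sketch of the kind of CSV-to-Parquet PySpark step referenced above; the storage paths and the supplier_id partition column are hypothetical placeholders, not the actual pipeline.

```python
# Minimal PySpark sketch of a CSV-to-Parquet conversion step.
# Paths and the partition column ("supplier_id") are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw CSV drop, keeping the header row and inferring column types.
raw = spark.read.csv("gs://example-landing/supplier_data/*.csv", header=True, inferSchema=True)

# Basic consolidation before writing: drop exact duplicates and fully empty rows.
cleaned = raw.dropDuplicates().na.drop(how="all")

# Write out as Parquet, partitioned for downstream query pruning.
(
    cleaned.write
    .mode("overwrite")
    .partitionBy("supplier_id")
    .parquet("gs://example-curated/supplier_data_parquet/")
)

spark.stop()
```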
Confidential, Irving, TX
Data Engineer
Responsibilities:
- Extensively involved in installation and configuration of the Cloudera Hadoop distribution.
- Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
- Developed Spark applications for performing large-scale transformations and denormalization of relational datasets.
- Gained real-time experience with Kafka and Storm on the HDP 2.2 platform for real-time analysis.
- Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
- Created reports for the BI team by using Sqoop to export data into HDFS and Hive.
- Performed analysis on unused user navigation data by loading it into HDFS and writing MapReduce jobs. The analysis provided inputs to the new APM front-end developers and the Lucent team.
- Loaded data from multiple data sources (SQL, DB2, and Oracle) into HDFS using Sqoop and loaded it into Hive tables.
- Created Hive queries to process large sets of structured, semi-structured, and unstructured data and stored the results in managed and external tables.
- Developed complex HiveQL queries using the JSON SerDe.
- Created HBase tables to load large sets of structured data.
- Involved in importing real-time data to Hadoop using Kafka and implemented the Oozie job for daily imports (a streaming-ingestion sketch follows this section).
- Performed real-time event processing of data from multiple servers in the organization using Apache Storm integrated with Apache Kafka.
- Managed and reviewed Hadoop log files.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Worked on PySpark APIs for data transformations.
- Performed data ingestion to Hadoop (Sqoop imports) and carried out validations and consolidations on the imported data.
- Extended Hive and Pig core functionality by writing custom UDFs for data analysis.
- Upgraded the current Linux version to RHEL 5.6.
- Expertise in hardening Linux servers and in compiling, building, and installing Apache Server from source with minimal modules.
- Worked with JSON, Parquet, and other Hadoop file formats.
- Built S3 buckets, managed their policies, and used S3 and Glacier for storage and backup on AWS.
- Worked with other teams to help develop the Puppet infrastructure to conform to various requirements, including security and compliance of managed servers.
- Built a VPC and established the site-to-site VPN connection between the data center and AWS.
- Managed and administered AWS services: CLI, EC2, VPC, S3, ELB, Glacier, Route 53, CloudTrail, IAM, and Trusted Advisor.
- Created automated pipelines in AWS CodePipeline to deploy Docker containers to AWS ECS using services such as CloudFormation, CodeBuild, CodeDeploy, S3, and Puppet.
- Worked on different Java technologies such as Hibernate, Spring, JSP, and Servlets, and developed both server-side and client-side code for the web application.
- Used GitHub for continuous integration services.
Environment: Agile Scrum, MapReduce, Hive, Pig, Sqoop, Spark, Scala, Oozie, Flume, Java, HBase, Kafka, Python, Storm, JSON, Parquet, GIT, JSON SerDe, Cloudera.
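The sketch below illustrates one common way to land a Kafka topic on HDFS using Spark Structured Streaming; it is not the Storm topology used in this role, and the broker address, topic name, and HDFS paths are placeholders. It assumes the spark-sql-kafka connector is on the Spark classpath.

```python
# Illustrative sketch (not the original Storm topology): reading a Kafka topic with
# Spark Structured Streaming and landing it on HDFS. Broker, topic, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Subscribe to the Kafka topic; Kafka delivers key/value as binary columns.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events-topic")
    .load()
    .select(col("value").cast("string").alias("payload"),
            col("timestamp"))
)

# Append micro-batches to HDFS as Parquet, with a checkpoint for fault tolerance.
query = (
    stream.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/events")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```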
Confidential, Plano, TX
Data Engineer
Responsibilities:
- Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
- Implemented a proof of concept deploying the product in an AWS S3 bucket and Snowflake.
- Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehouse, and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful and valuable information for better decision-making.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation, queries, and writing back into the S3 bucket.
- Experience in data cleansing and data mining.
- Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the Spark engine and Spark SQL for data analysis, providing results to the data scientists for further analysis.
- Prepared scripts to automate the ingestion process using Python and Scala as needed from various sources such as APIs, AWS S3, Teradata, and Snowflake.
- Designed and developed Spark workflows in Scala to pull data from the AWS S3 bucket and Snowflake and apply transformations to it.
- Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
- Automated the resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
- Created Python scripts to read CSV, JSON, and Parquet files from S3 buckets and load them into AWS S3, DynamoDB, and Snowflake.
- Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 bucket, or to HTTP requests through Amazon API Gateway.
- Migrated data from the AWS S3 bucket to Snowflake by writing a custom read/write Snowflake utility function in Scala.
- Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the data lake in the Confidential AWS S3 bucket.
- Profiled structured, unstructured, and semi-structured data across various sources to identify patterns in the data and implemented data quality metrics using the necessary queries or Python scripts, depending on the source.
- Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run in Airflow.
- Created a DAG using the Email operator, Bash operator, and Spark Livy operator to execute on an EC2 instance (a minimal DAG sketch follows this section).
- Deployed the code to EMR via CI/CD using Jenkins.
- Extensively used Code Cloud for code check-ins and checkouts for version control.
Environment: Agile Scrum, MapReduce, Snowflake, Pig, Spark, Scala, Hive, Kafka, Python, Airflow, JSON, Parquet, CSV, Codecloud, AWS
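A minimal Airflow 2.x DAG sketch along the lines of the DAG bullet above; the spark-submit command stands in for the Livy-based submission, and the DAG id, schedule, script path, and email address are hypothetical.

```python
# Minimal Airflow DAG sketch. The spark-submit command, script name, and email
# address are placeholders; the actual DAG used a Livy operator for submission.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.email import EmailOperator

with DAG(
    dag_id="s3_to_snowflake_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Submit the Spark ingestion job (placeholder for the Livy-based submission).
    run_spark_job = BashOperator(
        task_id="run_spark_ingest",
        bash_command="spark-submit --master yarn s3_to_snowflake.py {{ ds }}",
    )

    # Notify the team once the daily load finishes.
    notify = EmailOperator(
        task_id="notify_team",
        to="data-team@example.com",
        subject="Daily S3-to-Snowflake load complete",
        html_content="The {{ ds }} load finished successfully.",
    )

    run_spark_job >> notify
```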
Confidential, San Francisco, CA
Data Engineer
Responsibilities:
- Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop/Big Data tools.
- Integrated Oozie with Pig, Hive, and Sqoop and developed Oozie workflows for scheduling and orchestrating the Extract, Transform, and Load (ETL) process within the Cloudera Hadoop environment.
- Implemented Kafka high-level consumers to get data from Kafka partitions and move it into HDFS.
- Managed workflow and scheduling for complex MapReduce jobs using Apache Oozie.
- Designed physical and logical data models based on Relational (OLTP), Dimensional (OLAP) on snowflake schema using Erwin modeler to build an integrated enterprise data warehouse.
- Worked on creating Hive tables and wrote Hive queries for data analysis to meet business requirements, and used Sqoop to import and export data from Oracle and MySQL.
- Extended Spark, Hive, and Pig functionality by writing custom UDFs and hooking the UDFs into larger Spark applications to be used as in-line functions.
- Worked with Apache Spark, which provides a fast engine for large-scale data processing, integrated with Scala.
- Developed quality-check modules in PySpark and SQL to validate data in the data lake and automated the process to trigger the modules before the data gets ingested.
- Experience in developing multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats including XML, JSON, and CSV.
- Good experience in building pipelines using Azure Data Factory and moving the data into Azure Data Lake Store.
- Worked on development of the Confidential Data Lake and on building the Confidential Data Cube on a Microsoft Azure HDInsight cluster.
- Used HiveQL to analyze data and create summarized data for consumption in Power BI.
- Used Sqoop to import and export data between databases and HDFS and the Data Lake on AWS.
- Created a data pipeline using processor groups and multiple processors in Apache NiFi for flat file and RDBMS sources as part of a proof of concept (POC) on Amazon EC2.
- Performed data cleaning and handled missing values in Python using backward/forward filling methods, and applied feature engineering, feature normalization, and label encoding techniques using scikit-learn preprocessing (a minimal sketch follows this section).
- Involved in writing Python OOP code for quality, logging, monitoring, debugging, and code optimization.
- Used SSIS and T-SQL stored procedures to transfer data from OLTP databases to staging area and finally transferred into data marts and performed action in XML.
- Developed Batch processing solutions with Azure Databricks and Azure Event.
- Analyzed, designed, and built modern data solutions using Azure PaaS services to support visualization of data.
- Addressed overfitting and underfitting by tuning the hyperparameters of the machine learning algorithms using Lasso and Ridge regularization, and used Git to coordinate team development.
- Supported MapReduce programs running on the cluster and wrote MapReduce jobs using the Java API.
- Experience in designing Cloud Azure Architecture and Implementation plans for hosting complex application workloads on MS Azure.
- Applied various machine learning algorithms and statistical modeling techniques such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering to identify volume, using the scikit-learn package in Python and MATLAB.
- Working knowledge of Amazon's Elastic Cloud Compute (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as Storage mechanism.
- Created the Lambda script in Python for executing the EMR jobs
- Created presentations for data reporting by using pivot tables, VLOOKUP and other advanced Excel functions.
- Performed Exploratory Data Analysis (EDA) to maximize insights from the dataset, detect outliers, and extract important variables through graphical and numerical visualizations. Supported various reporting teams and have experience with the data visualization tools Tableau and Power BI.
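A minimal sketch of the missing-value handling and preprocessing steps described above, assuming a pandas DataFrame; the input file and column names (sensor_reading, category) are hypothetical.

```python
# Minimal sketch of the cleaning steps described above: backward/forward filling of
# missing values, normalization, and label encoding. Column names are illustrative only.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df = pd.read_csv("example_input.csv")

# Fill gaps forward first, then backward for any leading NaNs.
df = df.ffill().bfill()

# Normalize a numeric feature into the [0, 1] range.
scaler = MinMaxScaler()
df[["sensor_reading"]] = scaler.fit_transform(df[["sensor_reading"]])

# Encode a categorical column as integer labels.
encoder = LabelEncoder()
df["category_id"] = encoder.fit_transform(df["category"])

print(df.head())
```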
Confidential
Data Engineer
Responsibilities:
- Created data pipelines in multiple instances to load the data from DynamoDB to store in HDFS location.
- Successfully executed Performance tuning of MapReduce jobs by analysing and reviewing Hadoop log files.
- Involved in the Partitioning and Bucketing of the data stored in Hive Metadata.
- Used Apache Flume to collect and aggregate large amounts of log data and staging data in HDFS.
- Used Proc SQL, Proc Import, SAS Data Step to clean, validate and manipulate data.
- Developed SAS datasets by extracting Data from Oracle servers and Flat files.
- Collected Log data from web servers to integrate into HDFS location.
- Wrote MapReduce programs to handle semi-structured and unstructured data such as JSON and Avro data files and Sequence files for log data.
- Developed Kafka producers and consumers for message handling (a minimal sketch follows this section's environment list).
- Extensively involved in developing RESTful APIs using the JSON library of the Play framework.
- Worked on extending the core functionalities of Hive and Pig by writing UDFs in Java.
- Involved in importing data from MS SQL Server and MySQL into Hadoop using Sqoop.
- Identified and created Sqoop scripts to batch data periodically into HDFS.
- Developed Oozie workflows to collect and manage data for end-to-end processing.
- Analysed large and critical datasets using HDFS, HBase, MapReduce, Hive, Hive UDF, Pig, Sqoop, Zookeeper.
- Migrated HiveQL queries into SparkSQL for improved performance.
Environment: Cloudera, MapReduce, Hadoop, HDFS, SAS, SAS Macro, SAS Graph, SAS Access, Pig Scripts, Hive Scripts, HBase, Sqoop, Zookeeper, Oozie, Oracle, Shell Scripting.
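A minimal kafka-python sketch of the producer/consumer message handling mentioned above; the broker address, topic name, and message fields are placeholders, and kafka-python is one of several client libraries that could have been used.

```python
# Minimal kafka-python sketch of a producer/consumer pair for message handling.
# Broker address, topic name, and message shape are placeholders.
import json
from kafka import KafkaConsumer, KafkaProducer

TOPIC = "log-events"
BROKERS = "localhost:9092"

# Producer: serialize dicts as JSON and send them to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"host": "web01", "level": "INFO", "msg": "request served"})
producer.flush()

# Consumer: read messages from the beginning of the topic and deserialize the JSON.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
for record in consumer:
    print(record.value)   # each value is a dict like the one produced above
    break                 # stop after one message in this sketch
```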