
Data Engineer Resume


Round Rock, TX

SUMMARY

  • Highly experienced Data Engineer with over 6 years of experience in Data Extraction, Data Modeling, Data Wrangling, Statistical Modeling, Data Mining, Machine Learning, and Data Visualization.
  • Domain knowledge and experience in the Health, Retail, Banking, and Manufacturing industries.
  • Expertise in transforming business resources and requirements into manageable data formats and analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
  • 5+ years of diversified IT experience across end-to-end data analytics platforms such as Big Data, Hadoop, Java/J2EE development, Informatica, Data Modeling, and System Analysis in the Banking, Finance, Insurance, and Telecom domains.
  • Worked for 4 years with the AWS Big Data/Hadoop ecosystem on the implementation of a Data Lake.
  • Experience across the layers of the Hadoop framework - Storage (HDFS), Analysis (Pig and Hive), and Engineering (jobs and workflows) - extending functionality by writing custom UDFs.
  • Extensive experience in developing data warehouse applications using Hadoop, Informatica, Oracle, Teradata, and MS SQL Server on UNIX and Windows platforms; experienced in creating complex mappings using various transformations and developing Extraction, Transformation and Loading (ETL) strategies with Informatica 9.x/8.x.
  • Proficient in Hive Query Language and experienced in Hive performance optimization using Static Partitioning, Dynamic Partitioning, Bucketing, and Parallel Execution (see the sketch after this list).
  • As a Data Architect, designed and maintained high-performance ELT/ETL processes.
  • Experience in analyzing data using HiveQL, Pig Latin, custom MapReduce programs in Java, and custom UDFs.
  • Good understanding of Hadoop architecture and components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and MapReduce concepts.
  • Knowledge of AWS (Amazon Web Services) cloud computing infrastructure.
  • Created modules for streaming data into the Data Lake using Spark Streaming.
  • Experience in dimensional data modeling with Star Schema, Snowflake Schema, and Fact and Dimension tables, and with concepts such as Lambda Architecture, batch processing, and Oozie.
  • Extensively used Informatica client tools: Source Analyzer, Warehouse Designer, Mapping Designer, Mapplet Designer, ETL Transformations, Informatica Repository Manager, Informatica Server Manager, Workflow Manager, and Workflow Monitor.
  • 4+ years of experience writing shell and Python scripts.
  • Worked closely with Dev and QA teams to review pre- and post-processed data to ensure data accuracy and integrity.
  • Working experience with functional programming languages such as Scala, as well as Java.
  • Extensive knowledge of Data Modeling, Data Conversion, Data Integration, and Data Migration, with specialization in Informatica PowerCenter.
  • Expertise in extracting, transforming, and loading data from heterogeneous systems such as flat files, Excel, Oracle, Teradata, and MS SQL Server.
  • Extensive knowledge of UNIX/Linux commands, scripting, and deploying applications on servers.
  • Strong skills in algorithms, data structures, Object-oriented design, Design patterns, documentation, and QA/testing.
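
A minimal PySpark sketch of the Hive partitioning and bucketing optimization referenced above; the database, table, and column names (staging.orders, analytics.orders_bucketed, order_date, customer_id) are hypothetical placeholders, not from an actual engagement.

# Write a partitioned, bucketed Hive table so queries can prune partitions
# and use bucketed joins. Names below are illustrative only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partitioning-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Allow dynamic partitioning so partition values come from the data itself.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

orders = spark.table("staging.orders")  # hypothetical source table

(
    orders.write
    .mode("overwrite")
    .partitionBy("order_date")      # partition column for pruning
    .bucketBy(16, "customer_id")    # bucketing for faster joins/sampling
    .sortBy("customer_id")
    .format("parquet")
    .saveAsTable("analytics.orders_bucketed")
)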

TECHNICAL SKILLS

Databases: MySQL Server, Oracle DB, HiveQL, Spark SQL, PostgreSQL, HBase, MongoDB, DynamoDB, Redshift, Snowflake.

Big Data: HDFS, MapReduce, Pig, Hive, Kafka, Sqoop, Spark Streaming, Spark SQL, Oozie, Zookeeper.

Statistical Methods: Hypothesis Testing, ANOVA, Time Series, Confidence Intervals, Bayes' Law, Principal Component Analysis (PCA), Dimensionality Reduction, Cross-Validation, Autocorrelation

Machine Learning: Regression Analysis, Bayesian Methods, Decision Trees, Random Forests, Support Vector Machines, Neural Networks, Sentiment Analysis, K-Means Clustering, KNN and Ensemble Methods, Natural Language Processing (NLP), AWS SageMaker, Azure ML Studio

ETL Tools: Azure Data Factory, AWS Glue.

Data Visualization: Tableau, Matplotlib, Seaborn, ggplot2.

Languages: Python, Scala, Shell scripting, R, SAS, SQL, T-SQL.

Operating Systems: Linux, UNIX (shell scripting via the PuTTY client), and Windows (PowerShell)

Cloud: Azure, AWS.

IDE Tools: Databricks, PyCharm, IntelliJ IDEA, Anaconda.

PROFESSIONAL EXPERIENCE

Confidential

Data Engineer

Responsibilities:

  • Evaluated business requirements and prepared detailed specifications, following project guidelines, for the programs to be developed.
  • Implemented a big data framework: Hadoop, HDFS, Apache Spark, Hive, MapReduce, and Sqoop.
  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, Pair RDDs, and Spark on YARN.
  • Managed and reviewed Hadoop log files to identify issues when a job failed, and used HUE for UI-based Pig script execution and Oozie scheduling.
  • Involved in creating a data lake by extracting customer data from various data sources into HDFS, including data from Excel, databases, and server log data.
  • Developed Python code to gather data from HBase and designed the solution for implementation using PySpark.
  • Developed PySpark code to mimic the transformations performed in the on-premises environment.
  • Analyzed the SQL scripts and designed solutions to implement them using PySpark.
  • Automated workflows using shell scripts to pull data from various databases into Hadoop, and developed scripts to automate the process and generate reports.
  • Designed partition counts and replication factors for Kafka topics based on business requirements, and worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done in Python (PySpark).
  • Used various Spark transformations and actions for cleansing the input data, and used the Spark application master to monitor Spark jobs and capture their logs.
  • Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster data processing, and worked on an extensible framework for building high-performance batch and interactive data processing applications on Hive.
  • Migrated SQL Server and Oracle databases to Microsoft Azure Cloud and Snowflake.
  • Created a data ingestion framework in Snowflake for both batch and real-time data in different file formats using Snowflake Stages and Snowpipe.
  • Migrated data using Azure Database Migration Service (DMS).
  • Collaborated with customers to define and meet requirements.
  • Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data into DataFrames, and loaded the data into Cassandra (a minimal sketch of this pattern follows the list).
  • Developed various Python scripts to find vulnerabilities in SQL queries through SQL injection testing, permission checks, and performance analysis.
  • Exported event weblogs to HDFS by creating an HDFS sink that deposits the weblogs directly in HDFS, and used Elasticsearch as a distributed RESTful web service with MVC for parsing and processing XML data.
  • Worked on the Cloudera distribution of the Hadoop ecosystem and installed and configured Flume, Hive, Pig, Sqoop, and Oozie on the Hadoop cluster.
  • Integrated Oozie with MapReduce, Pig, Hive, and Sqoop and developed Oozie workflows for scheduling and orchestrating the ETL process within the Cloudera Hadoop system.
  • Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
  • Used Impala connectivity from the user interface (UI) and queried the results using Impala SQL.
  • Developed UNIX shell scripts to load many files into HDFS from the Linux File system.
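
A hedged sketch of the Spark Streaming bullet above, shown with Structured Streaming rather than the original RDD-based DStream API: events are read from Kafka, parsed into a DataFrame, and appended to Cassandra per micro-batch. The broker address, topic, schema, keyspace, and table are illustrative, and the Kafka and spark-cassandra-connector packages are assumed to be on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-cassandra-sketch").getOrCreate()

# Hypothetical event schema for the incoming JSON payload.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
    .option("subscribe", "events")                      # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

def write_to_cassandra(batch_df, batch_id):
    # Each micro-batch is appended to the target Cassandra table.
    (batch_df.write
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="analytics", table="events")  # hypothetical keyspace/table
        .mode("append")
        .save())

query = (
    events.writeStream
    .foreachBatch(write_to_cassandra)
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .start()
)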

Environment: Python, Cassandra, AWS Redshift, S3, EC2, EMR, Glue, Lambda, Snowflake, Java, Spark, Oozie, DB Visualizer, PuTTY, IntelliJ, Excel, SQL, YARN, Spark SQL, HDFS, Hive, Maven, Apache Kafka, shell scripting, Linux, PostgreSQL, Git, and Agile methodologies.

Confidential

Data Engineer

Responsibilities:

  • Designed solutions for streaming data applications using Apache Storm.
  • Created and maintained optimal data pipeline architecture in the Microsoft Azure cloud using Data Factory and Azure Databricks.
  • Created architectural solutions that leverage the best Azure analytics tools to solve the specific needs of the Chevron use case.
  • Designed and presented technical solutions to end users in a way that is easy to understand and buy into.
  • Educated client/business users on the pros and cons of various Azure PaaS and SaaS solutions, ensuring the most cost-effective approaches were taken into consideration.
  • Created self-service reporting in Azure Data Lake Storage Gen2 using an ELT approach.
  • Created Spark vectorized pandas user-defined functions for data manipulation and wrangling (see the sketch after this list).
  • Worked on real-time data streaming using AWS Kinesis, EMR, and AWS Glue.
  • Transferred data in logical stages from the system of record to raw, refined, and produce zones for easy translation and denormalization.
  • Set up Azure infrastructure such as storage accounts, integration runtimes, service principal IDs, and app registrations to enable scalable and optimized support for business users' analytical requirements in Azure.
  • Wrote Spark and Spark SQL transformations in Azure Databricks to perform complex transformations for business rule implementation.
  • Created Data Factory pipelines that bulk copy multiple tables at once from relational databases to Azure Data Lake Gen2.
  • Created a custom logging framework for ETL pipeline logging using Append Variable activities in Data Factory.
  • Enabled monitoring and Azure Log Analytics to alert the support team on usage and statistics of the daily runs.
  • Took proof-of-concept project ideas from the business, then led, developed, and created production pipelines that deliver business value using Azure Data Factory.
  • Kept data separated and secure across national boundaries through multiple data centers and regions.
  • Implemented continuous integration/continuous delivery best practices using Azure DevOps, ensuring code versioning.
  • Utilized Ansible playbooks for code pipeline deployment.
  • Delivered denormalized data to Power BI consumers for modeling and visualization from the produce layer in the Data Lake.
  • Worked in a SAFe (Scaled Agile Framework) team with daily standups, sprint planning, and quarterly planning.
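
A minimal sketch of the vectorized pandas UDF work mentioned above, as it might run in Azure Databricks; the cleansing rule and column names are illustrative only.

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("pandas-udf-sketch").getOrCreate()

@pandas_udf(DoubleType())
def normalize_amount(amount: pd.Series) -> pd.Series:
    # Vectorized wrangling: strip currency symbols/commas and coerce to float.
    return pd.to_numeric(
        amount.str.replace(r"[^0-9.\-]", "", regex=True), errors="coerce"
    )

# Hypothetical input column; malformed values become null rather than failing the job.
raw = spark.createDataFrame(
    [("$1,200.50",), ("€89.99",), ("bad-value",)], ["amount_text"]
)
clean = raw.withColumn("amount", normalize_amount("amount_text"))
clean.show()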

ENVIRONMENT: Azure Data Lake Gen2, Azure Data Factory, Spark, Databricks, Azure DevOps, Agile, Power BI, Python, R, SQL, Scaled Agile team environment

Confidential, Round rock, TX

Data Engineer

Responsibilities:

  • This project was focused on customer clustering. Used the ETL DataStage Director to schedule and run jobs, test and debug their components, and monitor performance statistics.
  • Installed Hadoop, MapReduce, HDFS, and AWS tooling and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
  • Architected, designed, and developed business applications and data marts for reporting. Involved in different phases of the development life cycle, including Analysis, Design, Coding, Unit Testing, Integration Testing, Review, and Release, per the business requirements.
  • Implemented a Spark GraphX application to analyze guest behavior for data science segments.
  • Worked on batch processing of data sources using Apache Spark and Elasticsearch.
  • Developed Big Data solutions focused on pattern matching and predictive modeling.
  • Collaborated with the EDW team on high-level design documents for the extract, transform, validate, and load (ETL) process: data dictionaries, metadata descriptions, file layouts, and flow diagrams.
  • Developed an estimation model for various product and service bundled offerings to optimize and predict gross margin.
  • Designed the OLTP system environment and maintained metadata documentation. Used a forward engineering approach for designing and creating databases for the OLAP model.
  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, Pair RDDs, and Spark on YARN.
  • Participated in a highly immersive Data Science program involving data manipulation and visualization, web scraping, machine learning, Python programming, SQL, Git, Unix commands, NoSQL, MongoDB, and Hadoop.
  • Worked on migrating PIG scripts and Map Reduce programs to Spark Data frames API and Spark SQL to improve performance.
  • Involved in creating UNIX shell scripts for database connectivity and executing queries in parallel job execution.
  • Worked closely with the ETL Developers in designing and planning the ETL requirements for reporting, as well as with business and IT management in the dissemination of project progress updates, risks, and issues.
  • Performed scoring and financial forecasting for collection priorities using Python and SAS.
  • Handled importing data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS
  • Worked in AWS environment for development and deployment of custom Hadoop applications.
  • Managed existing team members and led the recruiting and onboarding of a larger Data Science team that addresses analytical knowledge requirements.
  • Developed predictive causal model using annual failure rate and standard cost basis for the new bundled services.
  • Used classification techniques including Random Forest and Logistic Regression to quantify the likelihood of each user referring (a minimal sketch follows this list).
  • Designed and implemented end-to-end systems for Data Analytics and Automation, integrating custom visualization tools using R, Tableau, and Power BI.
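
A hedged sketch of the referral-likelihood classification mentioned above, using scikit-learn's Logistic Regression and Random Forest with predict_proba; the features and data here are synthetic placeholders, not project data.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in features (e.g. tenure, spend, visit counts) and labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for name, model in [
    ("logistic_regression", LogisticRegression(max_iter=1000)),
    ("random_forest", RandomForestClassifier(n_estimators=200, random_state=42)),
]:
    model.fit(X_train, y_train)
    referral_probability = model.predict_proba(X_test)[:, 1]  # likelihood of referring
    print(name, "AUC:", round(roc_auc_score(y_test, referral_probability), 3))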

ENVIRONMENT: IBM DataStage, Python, Spark framework, AWS, Redshift, MS Excel, NoSQL, Tableau, T-SQL, ETL, RNN, LSTM, MS Access, XML, MS Office 2007, Outlook, MS SQL Server.

Confidential

Software Engineer

Responsibilities:

  • Analyzed large data sets by running Hive queries and Pig scripts.
  • Involved in creating Hive tables and loading and analyzing data using hive queries.
  • Developed Simple to complex MapReduce Jobs using Hive and Pig.
  • Involved in running Hadoop jobs for processing millions of records of text data.
  • Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
  • Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Involved in unit testing of MapReduce jobs using MRUnit.
  • Involved in loading data from the Linux file system to HDFS.
  • Integrated the Snowflake data warehouse into the pipeline with ingestion from the ETL pipeline.
  • Responsible for managing data from multiple sources.
  • Experienced in running Hadoop Streaming jobs to process terabytes of XML-format data (a minimal sketch follows this list).
  • Loaded and transformed large sets of structured and semi-structured data.
  • Assisted in exporting analyzed data to relational databases using Sqoop.
  • Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
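
A minimal sketch of a Python Hadoop Streaming mapper of the kind referenced above; the <event type="..."> XML layout, file paths, and jar name are hypothetical, and the reducer would sum the emitted counts per event type.

#!/usr/bin/env python3
# mapper.py - counts XML event elements by type from standard input.
#
# Example launch (paths and jar name are illustrative):
#   hadoop jar hadoop-streaming.jar \
#     -input /data/raw/events -output /data/out/event_counts \
#     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
import re
import sys

EVENT_TYPE = re.compile(r'<event[^>]*\btype="([^"]+)"')

for line in sys.stdin:
    # Emit one (event_type, 1) pair per event element found on the line.
    for event_type in EVENT_TYPE.findall(line):
        print(f"{event_type}\t1")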

Environment: Hadoop, HDFS, Pig, Hive, MapReduce, HBase, Sqoop, Linux, Java, Python
