Sr. Data Engineer Resume
Irving, TX
SUMMARY
- Over 6 years of strong experience in Data Analysis and Data Mining with large data sets of Structured and Unstructured data, Data Acquisition, Data Validation, Predictive Modeling, Statistical Modeling, Data Modeling, and Data Visualization. Adept in statistical programming languages such as R, Python, and SAS, as well as Apache Spark and Big Data technologies including Hadoop, Hive, and Pig.
- 3+ years of experience in Hadoop 2.0. Led development of enterprise-level solutions utilizing Hadoop utilities such as Spark, MapReduce, Sqoop, Pig, Hive, HBase, Zookeeper, Phoenix, Oozie, Flume, streaming jars, custom SerDes, etc. Worked on proofs of concept with Kafka and Storm.
- Deep analytical skills and understanding of Big Data and algorithms using Hadoop, MapReduce, NoSQL, and distributed computing tools.
- Experience in developing enterprise-level solutions using batch processing (Apache Pig) and streaming frameworks (Spark Streaming, Apache Kafka, and Apache Flink).
- Expertise in synthesizing Machine learning, Predictive Analytics and Big data technologies into integrated solutions.
- Experienced in Dimensional Data Modeling using ER/Studio, Erwin, and Sybase PowerDesigner; Star Schema/Snowflake modeling; FACT and Dimension tables; and Conceptual, Logical, and Physical data modeling.
- Procedural knowledge in cleansing and analyzing data using HiveQL, Pig Latin, and custom MapReduce programs in Java.
- Experienced in writing custom UDFs and UDAFs for extending Hive and Pig core functionalities.
- Experience in importing and exporting data using Sqoop between HDFS and Relational Database Systems (RDBMS) such as Teradata.
- Strong experience in design and development of Business Intelligence solutions using Data Modeling, Dimensional Modeling, ETL processes, Data Integration, OLAP, and client/server applications.
- Worked extensively with Dimensional modeling, Data migration, Data cleansing, Data profiling, and ETL Processes features for data warehouses.
- Hands-on experience in formatting and ETL'ing raw data in various formats such as Avro, ORC, Parquet, CSV, and JSON. Experience with Elasticsearch and MDM solutions.
- Excellent knowledge of Machine Learning, Mathematical Modeling, and Operations Research. Comfortable with R, Python, SAS, and relational databases. Deep understanding of and exposure to the Big Data ecosystem.
- Extensively used SQL, NumPy, Pandas, Scikit-learn, Spark, and Hive for data analysis and model building.
- Experience with processing frameworks such as Spark and Spark SQL.
- Experienced in data architecture, including data ingestion pipeline design, Hadoop information architecture, data modeling and data mining, machine learning, and advanced data processing.
- Experienced in Factor Analysis, and in model testing and validation using ROC plots, k-fold cross validation, and data visualization.
- Hands-on experience in implementing LDA and Naive Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, Neural Networks, and Principal Component Analysis.
- Expertise in managing the entire data science project life cycle, actively involved in all phases including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling (decision trees, regression models, neural networks, SVM, clustering), and dimensionality reduction using Principal Component Analysis (a minimal sketch of such a pipeline appears after this list).
- Strong experience and knowledge in data visualization with Tableau, creating line and scatter plots, bar charts, histograms, pie charts, dot charts, box plots, time series, error bars, multiple chart types, multiple axes, subplots, etc.
- Experienced with Integration Services (SSIS), Reporting Services (SSRS), and Analysis Services (SSAS).
- Expertise in Normalization to 3NF/De-normalization techniques for optimum performance in relational and dimensional database environments.
- Well-versed in version control and CI/CD tools such as Git, SourceTree, and Bitbucket, and Amazon Web Services (AWS) products including S3, EC2, EMR, and RDS.
- Experience in all stages of the SDLC (Agile, Waterfall): writing technical design documents, development, testing, and implementation of enterprise-level data marts and data warehouses.
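The following is a minimal sketch of the kind of end-to-end modeling pipeline referenced above (feature scaling, PCA-based dimensionality reduction, a classifier, and k-fold validation). The input file, column names, and parameters are hypothetical placeholders, not specifics from any project.

```python
# Minimal sketch: feature scaling + PCA + classifier, validated with k-fold CV.
# File name, columns, and hyperparameters below are illustrative placeholders.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("training_data.csv")           # hypothetical input file
X = df.drop(columns=["label"])                  # feature columns
y = df["label"]                                 # binary target column

pipeline = Pipeline([
    ("scale", StandardScaler()),                # feature scaling
    ("pca", PCA(n_components=10)),              # dimensionality reduction
    ("clf", LogisticRegression(max_iter=1000))  # statistical model
])

# 5-fold cross validation on the full pipeline
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print("Mean ROC AUC:", scores.mean())
```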
TECHNICAL SKILLS
OLAP/Reporting Tools: SSRS, SSAS, MDX, Tableau, PowerBI
Relational Databases: SQL Server 2014/2012/2008 R2/2005, Oracle 11g, SQL Azure, MS Access
SQL Server Tools: Microsoft Visual Studio 2010/2013/2015, SQL Server Management Studio
Big Data Ecosystem: HDFS, NiFi, MapReduce, Oozie, Hive/Impala, Pig, Sqoop, Zookeeper, HBase, Spark, Scala, Kafka, Apache Flink, AWS (EC2, S3, EMR)
Other Tools: MS Office 2003/2007/2010/2013, PowerPivot, PowerBuilder, Git, CI/CD, Jupyter Notebook
Programming Languages: C, SQL, PL/SQL, T-SQL, Java, Batch scripting, R, Python
Data Warehousing & BI: Star Schema, Snowflake schema, Facts and Dimensions tables, SAS, SSIS, and Splunk
Operating Systems: Windows XP/Vista/7/8 and 10; Windows 2003/2008R2/2012 Servers
PROFESSIONAL EXPERIENCE
Confidential, Irving, TX
Sr. Data Engineer
Responsibilities:
- Involved in requirements gathering, analysis, design, development, change management, deployment.
- Involved in the development of real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster.
- Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications; executed machine learning use cases with Spark ML and MLlib.
- Extracted data from heterogeneous sources and applied complex business logic to normalize raw network data that BI teams use to detect anomalies.
- Designed and developed Flink pipelines to consume streaming data from Kafka, applying business logic to massage, transform, and serialize the raw data.
- Developed a common Flink module for serializing and deserializing AVRO data by applying a schema.
- Developed a Spark Streaming pipeline to batch real-time data, detect anomalies by applying business logic, and write the anomalies to an HBase table (a minimal sketch appears after this list).
- Implemented a layered architecture for Hadoop to modularize the design. Developed framework scripts to enable quick development and designed reusable shell scripts for Hive, Sqoop, Flink, and Pig jobs. Standardized error handling, logging, and metadata management processes.
- Indexed processed data and created dashboards and alerts in Splunk to be used and acted on by support teams.
- Responsible for operations and support of the Big Data analytics platform, Splunk, and Tableau visualizations.
- Managed, designed, and developed a dashboard control panel for customers and administrators using Tableau, PostgreSQL, and REST API calls.
- Designed and developed applications using Apache Spark, Scala, Python, NiFi, S3, and AWS EMR on the AWS cloud to format, cleanse, validate, create schemas, and build data stores on S3.
- Developed a CI/CD pipeline to automate builds and deployments to the Dev, QA, and production environments.
- Supported production jobs and developed several automated processes to handle errors and notifications; also tuned the performance of slow jobs through design improvements and configuration changes to PySpark jobs.
- Created standard report Subscriptions and Data Driven Report Subscriptions.
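The sketch below illustrates the streaming anomaly-detection pattern described in this role, using PySpark Structured Streaming. The broker address, topic, event schema, threshold rule, and output paths are illustrative placeholders; the production jobs applied site-specific business logic and wrote to HBase through a connector.

```python
# Minimal sketch: read network events from Kafka, flag anomalies with a simple
# threshold rule, and write matches out in micro-batches. All names, paths, and
# the threshold are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("anomaly-stream").getOrCreate()

event_schema = StructType([
    StructField("device_id", StringType()),
    StructField("metric", StringType()),
    StructField("value", DoubleType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder brokers
       .option("subscribe", "network-events")              # placeholder topic
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", event_schema).alias("e"))
          .select("e.*"))

# Stand-in for the anomaly rule: flag metrics above a fixed threshold.
anomalies = events.filter(F.col("value") > 100.0)

query = (anomalies.writeStream
         .outputMode("append")
         .format("parquet")                                 # HBase sink would require a connector
         .option("path", "/data/anomalies")                 # placeholder output path
         .option("checkpointLocation", "/data/checkpoints/anomalies")
         .start())
query.awaitTermination()
```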
Environment: Hadoop, MapReduce, Spark, Spark MLlib, Tableau, SQL, Excel, Pig, Hive, AWS, PostgreSQL, Python, PySpark, Flink, Kafka, SQL Server 2012, T-SQL, CI/CD, Git, XML.
Confidential, Plano, TX
Big Data Engineer/Spark Developer
Responsibilities:
- Involved in the implementation of new statistical algorithms and operators on Hadoop and SQL platforms, utilizing optimization techniques, linear regression, K-means clustering, Naive Bayes, and other approaches.
- Developed a Spark batch job to automate the creation and metadata updates of external Hive tables built on top of datasets residing in HDFS.
- Developed a Spark common module for data serialization, converting complex objects into sequences of bits using AVRO, Parquet, JSON, and CSV formats.
- Worked on ER Modeling, Dimensional Modeling (Star Schema, Snowflake Schema), Data Warehousing, and OLAP tools.
- Populated HDFS and PostgreSQL with huge amounts of data using Apache Kafka.
- Designed and developed a REST API (Commerce API) that provides functionality to connect to PostgreSQL through Java services.
- Designed a Batch Audit Process in batch/shell scripts to monitor each ETL job and report its status, including table name, start and finish times, number of rows loaded, and outcome.
- Developed Spark jobs in PySpark to perform ETL from SQL Server to Hadoop (see the sketch after this list).
- Responsible for continuous monitoring and management of the Elastic MapReduce (EMR) cluster through the AWS console.
- Designed and implemented data acquisition and ingestion, and managed the Hadoop infrastructure and other analytics tools (Splunk, Tableau).
- Working knowledge of build automation and CI/CD pipelines.
- Developed Python scripts to automate the data ingestion pipeline for multiple data sources and deployed Apache NiFi in AWS.
- Designed and developed Tableau visualizations, including dashboards built with calculations, parameters, calculated fields, groups, sets, and hierarchies.
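A minimal sketch of the SQL Server-to-Hadoop ETL pattern described above, using a PySpark JDBC read, light cleansing, a Parquet write to HDFS, and an external Hive table over the result. The connection details, table names, and paths are hypothetical placeholders.

```python
# Minimal sketch: pull a table from SQL Server over JDBC, apply light
# transformations, land it on HDFS as Parquet, and expose it to Hive.
# Host, database, table, credentials, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("sqlserver-to-hdfs")
         .enableHiveSupport()
         .getOrCreate())

source = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales")  # placeholder
          .option("dbtable", "dbo.orders")                                   # placeholder
          .option("user", "etl_user")
          .option("password", "etl_password")
          .load())

cleaned = (source
           .dropDuplicates(["order_id"])                 # basic cleansing
           .withColumn("load_date", F.current_date()))   # audit column

cleaned.write.mode("overwrite").parquet("/data/warehouse/orders")

# Register a table over the Parquet dataset; supplying LOCATION keeps the
# data external to the Hive warehouse directory.
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders_ext
    USING PARQUET
    LOCATION '/data/warehouse/orders'
""")
```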
Environment: Hadoop, MapReduce, Spark, Spark MLlib, Tableau, SQL, Excel, Pig, Hive, AWS, PostgreSQL, Python, PySpark, Flink, Kafka, SQL Server 2012, T-SQL, CI/CD, Git, XML.
Confidential, Bentonville, AR
Data Scientist/Python Developer
Responsibilities:
- Performed Data Integration, Extraction, Transformation, and Load (ETL) Processes
- In the preprocessing phase, used Pandas to remove or replace missing data and balanced the dataset by over-sampling the minority label class and under-sampling the majority label class as part of the data cleaning process.
- Developed models relying on Linear Regression, Multiple Regression, Decision Trees, Random Forest, Logistic Regression, and Naive Bayes.
- Validated and selected models using k-fold cross validation and confusion matrices, optimized models for a high recall rate, and implemented ensemble models such as Boosting and Bagging (a minimal sketch appears after this list).
- Provided data mining solutions for Association Rules, Classification, and Clustering problems.
- Used Python to identify trends and relationships between different pieces of data and drew appropriate conclusions.
- Developed a REST API to serve the data generated by the prediction model to other customers and teams.
- Generated complex reports in various formats, such as list reports and summary reports, using advanced data manipulation techniques in SAS Enterprise Guide.
- Analyzed data and generated analytical reports using SAS, MS SQL.
- Created Visual Charts, Graphs, Maps, Area Maps, Dashboards and Storytelling using Tableau.
- Developed, maintained, and supported a Continuous Integration (CI/CD) framework based on Jenkins.
- Implemented, tuned and tested the model on AWS EC2 with the best performing algorithm and parameters.
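A minimal sketch of the model validation and ensemble comparison step described above, using scikit-learn on synthetic data; the models, parameters, and scoring choices are illustrative rather than the exact project configuration.

```python
# Minimal sketch: compare a bagging-style and a boosting-style ensemble with
# k-fold cross validation (optimizing recall) and inspect a confusion matrix.
# Synthetic data and hyperparameters are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix

# Imbalanced synthetic dataset standing in for the real one
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Rank candidates by cross-validated recall (favoring fewer missed positives).
for name, model in models.items():
    recall = cross_val_score(model, X, y, cv=5, scoring="recall").mean()
    print(f"{name}: mean recall = {recall:.3f}")

# Inspect the confusion matrix of the chosen model on a held-out split.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
best = models["random_forest"].fit(X_train, y_train)
print(confusion_matrix(y_test, best.predict(X_test)))
```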
Environment: AWS EC2, S3, Python (Scikit-Learn/Numpy/Pandas/Matplotlib), Machine Learning (Logistic Regression/Support Vector Machine/Gradient Boosting/Random Forest), Tableau, SAS, MS SQL.
Confidential, Plano, TX
SQL Data Analyst
Responsibilities:
- Designed and created Data Marts in data warehouse database
- Used MS SQL Server Management Studio 2008 to create complex Stored Procedures and Views using T-SQL.
- Collected data from many sources, converted it into comma-delimited flat text files, and imported the data into SQL Server for data manipulation.
- Responsible for deploying reports to Report Manager and troubleshooting any errors during execution.
- Scheduled the reports to run on a daily and weekly basis in Report Manager and emailed them to the director and analysts for review in Excel sheets.
- Created several reports for claims handling that had to be exported to PDF format.
- Analyzed business requirements and provided excellent and efficient solutions
Environment: SQL Server 2008, Microsoft Visual Studio 2008, MS Office, SSRS