Sr. Big Data Engineer Resume
North Chicago, IL
SUMMARY
- Data Engineering professional with solid foundational skills and a proven track record of implementation across a variety of data platforms. Self-motivated, with a strong adherence to personal accountability in both individual and team settings.
- Over 7 years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation as a Data Engineer/Data Developer and Data Modeler.
- Strong experience in the Software Development Life Cycle (SDLC), including the Requirements Analysis, Design Specification, and Testing phases, in both Waterfall and Agile methodologies.
- Strong experience in writing scripts using the Python, PySpark, and Spark APIs for data analysis.
- Extensively used Python libraries such as PySpark, pytest, PyMongo, cx_Oracle, PyExcel, Boto3, psycopg, embedPy, NumPy, and Beautiful Soup (a brief illustrative sketch follows this summary).
- Proficient in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
- Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Worked on Jenkins pipelines to build Docker containers, with exposure to deploying them to Kubernetes Engine.
- Good working knowledge of the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
- Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data Governance, Metadata Management, Master Data Management, and Configuration Management.
- Professional with in-depth IT experience in data warehousing, especially using cloud technologies, Informatica DEI/BDM, PowerCenter, IICS tools, and Databricks.
- Skilled in systems analysis, E-R/dimensional data modeling, database design, and implementing RDBMS-specific features.
- Well experienced in normalization and denormalization techniques for optimum performance in relational and dimensional database environments.
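A minimal sketch of the kind of Python data-analysis scripting described above, combining Boto3 and pandas. The bucket, key, and checks are hypothetical placeholders, not drawn from any specific engagement.

```python
import io

import boto3
import pandas as pd

# Hypothetical bucket and key, used purely for illustration.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="example-data-bucket", Key="exports/daily_metrics.csv")

# Load the extract into pandas and run simple profiling-style checks.
df = pd.read_csv(io.BytesIO(obj["Body"].read()))
print(df.shape)                                       # row/column counts
print(df.isna().mean().sort_values(ascending=False))  # null ratio per column
print(df.describe(include="all").T)                   # basic summary statistics
```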
TECHNICAL SKILLS
Big Data Tools: Hadoop 3.0 ecosystem, MapReduce, Spark 2.3, Airflow 1.10.8, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3
BI Tools: SSIS, SSRS, SSAS.
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: SQL, PL/SQL, NoSQL query languages, and UNIX shell scripting.
Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
Cloud Platform: AWS, Kubernetes, Azure, Google Cloud.
Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena
Databases: Oracle 12c/11g.
OLAP Tools: Tableau, SSAS, Business Objects
ETL Tools: Informatica PowerCenter 9.6.1, Informatica BDM 10.2.2
Operating System: Windows, Unix, Sun Solaris
PROFESSIONAL EXPERIENCE
Confidential, North Chicago, IL
Sr. Big Data Engineer
Responsibilities:
- Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
- Migrated data from PostgreSQL to Snowflake
- Designed several DAGs (Directed Acyclic Graphs) to automate ETL pipelines (a minimal Airflow sketch follows this list)
- Performed data extraction, transformation, loading, and integration in data warehouses, operational data stores, and master data management systems
- Worked on architecting the ETL transformation layers and writing Spark jobs to perform the processing.
- Designed and built infrastructure for the Google Cloud environment from scratch
- Experienced in dimensional modeling (star schema, snowflake schema), transactional modeling, and SCDs (slowly changing dimensions)
- Leveraged cloud and GPU computing technologies such as AWS and GCP for automated machine learning and analytics pipelines
- Worked with Confluence and Jira
- Designed and implemented a configurable, Python-based data delivery pipeline for scheduled updates to customer-facing data stores
- Strong understanding of AWS components such as EC2 and S3
- Responsible for data services and data movement infrastructures
- Experienced in ETL concepts, building ETL solutions and Data modeling
- Worked with the continuous integration tool Jenkins and automated end-of-day JAR builds.
- Integrated Hive with Tableau, built Tableau Desktop reports, and published them to Tableau Server.
- Developed MapReduce programs in Java for parsing the raw data and populating staging Tables.
- Experience setting up the whole application stack, and configuring and debugging Logstash to send Apache logs to AWS Elasticsearch.
- Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster data processing.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks
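A minimal sketch of the kind of Airflow DAG referenced above; the DAG id, schedule, and task callables are hypothetical placeholders, and the import paths follow the Airflow 1.10.x style listed under Technical Skills.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x import path

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

# Placeholder callables standing in for the real extract/transform/load logic.
def extract(**context):
    pass

def transform(**context):
    pass

def load(**context):
    pass

with DAG(
    dag_id="daily_etl_pipeline",          # hypothetical name
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract, provide_context=True)
    t_transform = PythonOperator(task_id="transform", python_callable=transform, provide_context=True)
    t_load = PythonOperator(task_id="load", python_callable=load, provide_context=True)

    t_extract >> t_transform >> t_load   # linear dependency chain
```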
Environment: HDFS, Hive, Spark (PySpark, Spark SQL, Spark MLlib), Kafka, Linux, Python 3.x (scikit-learn, NumPy, pandas), Tableau 10.1, GitHub, AWS EMR/EC2/S3/Redshift, Pig, JSON and Parquet file formats, MapReduce, Spring Boot, Cassandra, Swamp, Data Lake, Sqoop, Oozie, Teradata utilities.
Confidential, Cincinnati, OH
Big Data Engineer
Responsibilities:
- Transforming business problems into Big Data solutions and defining Big Data strategy and roadmap; installing, configuring, and maintaining data pipelines
- Developed the features, scenarios, and step definitions for BDD (Behavior-Driven Development) and TDD (Test-Driven Development) using Cucumber, Gherkin, and Ruby.
- Designing the business requirement collection approach based on the project scope and SDLC methodology.
- Creating pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data to and from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including the write-back tool.
- Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis. Working with data governance and data quality teams to design various models and processes.
- Involved in all steps and the scope of the project's reference data approach to MDM; created a Data Dictionary and source-to-target mapping in the MDM data model.
- Analyzed SQL scripts and designed the solutions to implement using PySpark.
- Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
- Used SQL Server Integration Services (SSIS) for extraction, transformation, and loading of data into the target system from multiple sources
- Involved in unit testing the code and provided feedback to the developers; performed unit testing of the application using NUnit.
- Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake Schemas.
- Created and maintained SQL Server scheduled jobs executing stored procedures to extract data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization
- Optimized the algorithm with stochastic gradient descent; fine-tuned algorithm parameters with manual tuning and automated tuning such as Bayesian optimization.
- Wrote research reports describing the experiments conducted, results, and findings, and made strategic recommendations to technology, product, and senior management. Worked closely with regulatory delivery leads to ensure robustness of prop trading control frameworks using Hadoop, Python, Jupyter Notebook, Hive, and NoSQL.
- Wrote production-level machine learning classification models and ensemble classification models from scratch using Python and PySpark to predict binary values for certain attributes within a given time frame (a minimal sketch follows this list).
- Performed day-to-day Git support for different projects; responsible for the design and maintenance of Git repositories and access control strategies.
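A minimal sketch of a PySpark binary-classification workflow of the kind mentioned above; the Hive table name, feature columns, and label column are hypothetical, and logistic regression stands in for whichever models were actually used.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("binary-classifier").enableHiveSupport().getOrCreate()

# Hypothetical Hive table holding labeled training data.
df = spark.table("analytics.training_features")

assembler = VectorAssembler(
    inputCols=["feature_a", "feature_b", "feature_c"],  # hypothetical feature columns
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="label")

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, lr]).fit(train)

# Area under ROC on the held-out split.
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(test))
print(f"Test AUC: {auc:.3f}")
```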
Environment: Hadoop, Kafka, Spark, Sqoop, Docker, Spark SQL, TDD, Spark Streaming, Hive, Scala, Pig, NoSQL, Impala, Oozie, HBase, Data Lake, ZooKeeper, AWS (Glue, Lambda, Step Functions, SQS), Unix/Linux shell scripting, Python, PyCharm, Linux, Informatica PowerCenter
Confidential
Data Engineer/ Data Analyst
Responsibilities:
- Experience in Big Data analytics and design in the Hadoop ecosystem using MapReduce programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, and Kafka
- Built the Oozie pipeline, which performs several actions such as file moves, Sqoop imports from the source Teradata or SQL systems into the Hive staging tables, aggregations per business requirements, and loads into the main tables.
- Ran Apache Hadoop, CDH, and MapR distributions via Elastic MapReduce (EMR) on EC2.
- Performed forking whenever there was scope for parallel processing, to optimize data latency.
- Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
- Developed a Pig script that picks up data from one HDFS path, performs aggregation, and loads the results into another path, which later populates another domain table. Converted this script into a JAR and passed it as a parameter in the Oozie script
- Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the SQL activity. Built an ETL job that uses a Spark JAR to execute the business analytical model.
- Hands-on experience with Git Bash commands such as git pull to pull code from source and develop it per the requirements, git add to stage files, git commit after the code builds, and git push to the pre-prod environment for code review; later used screwdriver.yaml, which builds the code and generates artifacts that are released into production
- Created the logical data model from the conceptual model and converted it into the physical database design using Erwin. Involved in transforming data from legacy tables to HDFS and HBase tables using Sqoop.
- Connected to AWS Redshift through Tableau to extract live data for real time analysis.
- Developed Data mapping, Transformation and Cleansing rules for the Data Management involving OLTP and OLAP.
- Involved in creating UNIX shell scripts; used table defragmentation, partitioning, compression, and indexing for improved performance and efficiency.
- Developed reusable objects such as PL/SQL program units and libraries, database procedures and functions, and database triggers to be used by the team, satisfying the business rules.
- Used SQL Server Integration Services (SSIS) for extraction, transformation, and loading of data into the target system from multiple sources
- Developed and implemented an R/Shiny application showcasing machine learning for business forecasting. Developed predictive models using Python and R to predict customer churn and classify customers.
- Partner with infrastructure and platform teams to configure, tune tools, automate tasks and guide the evolution of internal big data ecosystem; serve as a bridge between data scientists and infrastructure/platform teams.
- Implemented Big Data Analytics and Advanced Data Science techniques to identify trends, patterns, and discrepancies on petabytes of data by using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, MapReduce, and Azure Machine Learning.
- Performed data analysis using regressions, data cleaning, Excel VLOOKUP, histograms, and the TOAD client; presented the analysis and suggested solutions for investors
- Rapid model creation in Python using pandas, NumPy, scikit-learn, and Plotly for data visualization (a minimal sketch follows this list). These models were then implemented in SAS, where they interface with MS SQL databases and are scheduled to update on a regular basis.
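A minimal sketch of rapid pandas/scikit-learn model creation as described in the last bullet; the file name, target column, and choice of a random forest are hypothetical stand-ins.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical flat extract of customer features with a binary target column.
df = pd.read_csv("customer_features.csv")
X = df.drop(columns=["churned"])
y = df["churned"].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate with AUC on the held-out split.
print("Test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```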
Environment: MapReduce, Spark, Hive, Pig, Sqoop, HBase, AWS, Oozie, Impala, Kafka, JSON, XML, PL/SQL, SQL, Azure, HDFS, Unix, Python, PySpark.
Confidential
Data Analyst
Responsibilities:
- Responsible for data extraction and data ingestion from different data sources into Hadoop Data Lake by creating ETL pipelines using Pig, and Hive.
- Responsible for importing data to HDFS using Sqoop from different RDBMS servers and exporting data using Sqoop to the RDBMS servers after aggregations for other ETL operations.
- Experience in designing and developing applications in PySpark (Python) to compare the performance of Spark with Hive (a minimal sketch follows this list).
- Headed negotiations to find optimal solutions with project teams and clients
- Mapped client business requirements to internal requirements of trading platform products
- Supported revenue management using statistical and quantitative analysis, developed several statistical approaches and optimization models.
- Led the business analysis team of four members, in absence of the Team Lead
- Added value by providing innovative solutions and improved methods of data presentation, focusing on the business need and business value of the solution. Worked on Internet Marketing - Paid Search channels.
- Achieved near-real-time reporting by adopting an event-based processing approach, instead of micro-batching, to deal with data coming from Kafka.
- Wrote applications using Spring Boot that read data from Kafka and write it to MapR-DB (the MapR version of HBase).
- Wrote applications that both produced data to Kafka and consumed data from it.
- Created performance dashboards in Tableau, Excel, and PowerPoint for the key stakeholders
- Incorporated predictive modeling (a rule engine) to evaluate the customer/seller health score using Python scripts, performed computations, and integrated the results with the Tableau visualizations.
- Worked with stakeholders to communicate campaign results, strategy, issues or needs.
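A minimal sketch of the Spark-versus-Hive comparison mentioned above: the same aggregation is run through Spark SQL against a Hive-managed table and timed. The database, table, and column names are hypothetical; the identical HiveQL statement can be timed separately through the Hive CLI or Beeline for a baseline.

```python
import time

from pyspark.sql import SparkSession

# Hive support lets Spark SQL read the same warehouse tables that HiveQL queries do.
spark = (
    SparkSession.builder
    .appName("spark-vs-hive-comparison")   # hypothetical application name
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical aggregation over a Hive table.
query = """
    SELECT region, COUNT(*) AS order_count, SUM(amount) AS revenue
    FROM sales.orders
    GROUP BY region
"""

start = time.time()
spark.sql(query).show()
print(f"Spark SQL elapsed: {time.time() - start:.1f}s")
```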
Environment: Cloudera CDH4.3, Hadoop, Pig, Hive, Informatica, HBase, MapReduce, HDFS, Sqoop, Impala, SQL, Tableau, Python, SAS, Flume, Oozie, Linux.