Sr Data Engineer Resume
Chicago, IL
SUMMARY
- 8+ years of technical IT experience in the design, development, and maintenance of enterprise analytical solutions using Big Data technologies.
- Proven data and business expertise in Retail, Finance and Healthcare domains.
- Specialized in the Big Data ecosystem: data acquisition, ingestion, modeling, storage, analysis, integration, processing, and database management.
- A Data Science enthusiast with strong problem-solving, debugging, and analytical capabilities who actively engages in understanding and delivering on business requirements.
- Expert in Oracle database design and administration.
- Expertise in using database tools such as MongoVue, MongoDB Compass, and Toad.
- Proficient in evaluating and choosing the right technologies for building data pipelines spanning ingestion, curation, and consumption, for both batch and streaming use cases in cloud and on-premises environments.
- Result-oriented professional experienced in creating data mapping documents, writing functional specifications and queries, normalizing data from 1NF to 3NF/4NF, requirements gathering, system and data analysis, data architecture, database design and modeling, and the development, implementation, and maintenance of OLTP and OLAP databases.
- Vast experience in designing, creating, testing, and maintaining the complete data management lifecycle (data ingestion, data curation, and data provisioning), with in-depth knowledge of Spark APIs such as Spark SQL, the DataFrame DSL, and Streaming, working with file formats like Parquet and JSON, and tuning Spark application performance from various aspects.
- Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames, and RDDs, with knowledge of Spark MLlib.
- Experience working with Informatica MDM (Siperian) to design, develop, test, review, and optimize MDM solutions.
- Strong working experience on NoSQL databases and their integration with the Hadoop cluster - HBase, Cassandra, MongoDB, DynamoDB, and Cosmos DB
- Experience in MVC and Microservices Architecture and Docker containers.
- Expertise in Java programming with a good understanding of OOP, I/O, Collections, Exception Handling, Lambda Expressions, and Annotations.
- Provided full life cycle support to logical/physical database design, schema management and deployment.
- Adept at database deployment phase with strict configuration management and controlled coordination with different teams.
- Experience in writing code in R and Python to manipulate data for data loads, extracts, statistical analysis, modeling, and data munging.
- Familiar with current software development practices such as Agile Software Development, Scrum, Test-Driven Development (TDD), and Continuous Integration (CI).
- Hands-on experience setting up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.
- Utilized Kubernetes and Docker for the runtime environment for the CI/CD system to build, test, and deploy.
- Worked on Kubernetes to provide platform-as-a-service on private and public clouds.
- Improved query performance by transitioning log storage from Cassandra to Azure SQL Data Warehouse.
- Experience in creating and running Docker images with multiple microservices.
- Experience in building and architecting multiple data pipelines, including end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinating tasks among the team.
- Experience using the Stackdriver service and Dataproc clusters in GCP to access logs for debugging.
- Extensive hands-on experience with distributed computing architectures such as AWS products (e.g. EC2, Redshift, EMR, and Elasticsearch), Hadoop, Python, and Spark, and effective use of Azure SQL Database, MapReduce, Hive, SQL, and PySpark to solve big data problems.
- Developed custom-built ETL solution, batch processing, and real-time data ingestion pipeline to move data in and out of the Hadoop cluster using PySpark and Shell Scripting.
- Sound knowledge in developing highly scalable and resilient Restful APIs, ETL solutions, and third-party platform integrations as part of Enterprise Site platform.
- Experience in implementing pipelines using the ELK stack (Elasticsearch, Logstash, Kibana) and developing stream processes using Apache Kafka.
- Strong experience with Microsoft Azure Machine Learning Studio for data import, export, and preparation.
- Proficient in statistical methodologies including Hypothesis Testing, ANOVA, Time Series, and Principal Component Analysis.
- Good knowledge in Informatica Master Data Management (MDM) and Informatica Data Quality (IDQ).
- Performed daily processing of internal and external customer data management services.
- Good understanding of Data Modeling (Dimensional and Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension Tables.
- Extensive experience in writing SQL and PL/SQL scripts to validate the database systems and for backend database testing.
- Experience in data retrieval methods using Universes, Personal Data files, Stored Procedures, and free hand SQL.
- Good at conceptualizing and building solutions quickly; recently developed a Data Lake using a pub-sub architecture.
- Developed a pipeline using Python and Kafka to load data from a server into Hive, with automatic ingestion and quality audits of the data into the RAW layer of the Data Lake (a simplified sketch follows this list).
- Installed both Cloudera (CDH4) and Hortonworks (HDP 1.3-2.1) Hadoop clusters on EC2 (Ubuntu 12.04, CentOS 6.5), on clusters ranging from 10 to 100 nodes.
- Architected complete scalable data pipelines, data warehouse for optimized data ingestion.
- Collaborated with data scientists and architects on several projects to create data mart as per requirement.
- Conducted complex data analysis and reported the results.
- Constructed data staging layers and fast real-time systems to feed BI applications and machine learning algorithms.
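A minimal sketch of the Kafka-to-Hive ingestion with quality audits mentioned above (the pipeline loading data into the RAW layer of the Data Lake); the topic, table, and column names are hypothetical placeholders, not the actual project code.

```python
# Hypothetical sketch: ingest a Kafka topic into a RAW Hive table with a simple quality audit.
# Topic, table, and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("kafka-to-hive-raw")
         .enableHiveSupport()
         .getOrCreate())

def audit_and_load(batch_df, batch_id):
    """Reject rows missing a primary key, then append the rest to the RAW layer."""
    rejected = batch_df.filter(F.col("order_id").isNull())
    accepted = batch_df.filter(F.col("order_id").isNotNull())
    rejected.write.mode("append").saveAsTable("raw.orders_rejects")
    accepted.write.mode("append").saveAsTable("raw.orders")

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load()
          .selectExpr("CAST(value AS STRING) AS payload")
          .select(F.from_json("payload", "order_id STRING, amount DOUBLE").alias("r"))
          .select("r.*"))

(stream.writeStream
 .foreachBatch(audit_and_load)
 .option("checkpointLocation", "/tmp/checkpoints/orders")
 .start())
```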
TECHNICAL SKILLS
Hadoop Components / Big Data: HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Sqoop, Impala, ZooKeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos, PySpark, Apache Airflow, Snowflake
Languages: Scala, Python, SQL, HiveQL
IDE Tools: Eclipse, IntelliJ, PyCharm
Cloud platform: AWS (Lambda, DynamoDB, S3, EC2, EMR, RDS), MS Azure (Azure Databricks, ADF, Azure Data Explorer, Azure HDInsight, ADLS), GCP
Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL databases (HBase, Cassandra, MongoDB)
Big Data Technologies: Hadoop, HDFS, Hive, Pig, Oozie, Sqoop, Spark, Machine Learning, Pandas, NumPy, Seaborn, Impala, ZooKeeper, Flume, Airflow, Informatica, Snowflake, Databricks, Kafka, Cloudera
Data Analysis Libraries: Pandas, NumPy, SciPy, Scikit-learn, NLTK, Plotly, Matplotlib
Containerization: Docker, Kubernetes
CI/CD Tools: Jenkins, Bamboo, GitLab
Software Methodologies: Agile, Scrum, Waterfall
Development Tools: Eclipse, PyCharm, IntelliJ, SSMS, Microsoft Office Suite
Programming Languages: Python (Pandas, SciPy, NumPy, Scikit-Learn, Statsmodels, Matplotlib, Plotly, Seaborn, Keras, TensorFlow, PyTorch), PySpark, T-SQL/SQL, PL/SQL, HiveQL, Scala, UNIX Shell Scripting
Relational Databases: MS SQL Server, Oracle, DB2, PostgreSQL
NoSQL Databases: Cassandra, MongoDB, Azure Cosmos DB
Reporting Tools/ETL Tools: Power BI, Tableau, DataStage, Pentaho, Informatica, Cognos, Talend, Azure Data Factory, Azure Databricks, Arcadia, SSIS, SSRS, SSAS, ER Studio
Version Control Tools: GitHub, Azure DevOps, SVN, Bitbucket
PROFESSIONAL EXPERIENCE
Confidential, Chicago, IL
Sr Data Engineer
Responsibilities:
- Installing, configuring, and maintaining Data Pipelines
- Transforming business problems into Big Data solutions and defining the Big Data strategy and roadmap.
- Designing the business requirement collection approach based on the project scope and SDLC methodology.
- Involved in Data Mapping Specifications to create and execute detailed system test plans. The data mapping specifies what data will be extracted from the data warehouse.
- Designed and deployed data pipelines using Azure cloud platform (HDInsight, DataLake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
- Maintaining the data pipeline (Kafka) for the entire traffic.
- Monitoring activities related to design and performance and coming up with solutions.
- Design, plan, and execute MDM integration with treasury DW.
- Integrated Apache Kafka for data ingestion.
- Generated consumer group lag metrics from Kafka using its API; used Kafka for building real-time data pipelines between clusters (a lag-monitoring sketch follows this list).
- Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
- Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labelling, and all cleansing and conforming tasks (a UDF sketch follows this list).
- Migrated an entire Oracle database to BigQuery and built data pipelines in Airflow on GCP for ETL jobs using different Airflow operators.
- Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Developed solutions leveraging ETL tools and identified opportunities for process improvement using Informatica and Python.
- Created external batches to execute mappings and mapplets using the Informatica Workflow Designer to integrate Shire's data from varied sources such as Oracle, DB2, flat files, and SQL databases and load it into landing tables of the Informatica MDM Hub.
- Designed, Installed, Configured core Informatica/Siperian MDM Hub components such as Informatica MDM Hub Console, Hub Store, Hub Server, Cleanse Match Server, Cleanse Adapter, Data Modeling.
- Monitored database performance. Worked on AWR, ADDM, ASH reports for performance tuning.
- Migrated from Apache Kafka to Confluent Kafka with ease, with no data loss and zero downtime.
- Installed GoldenGate Director to monitor the GoldenGate processes and parameter files; also installed the WebLogic server and client to fulfill the GoldenGate Director installation requirements.
- Developed UNIX shell scripts to monitor the database and schedule jobs; monitored server-level performance by checking CPU, I/O, and paging, and worked on system-level reports.
- Developed shell scripts to check the listener and database uptime, generate automated tablespace and health reports, and automate AWR reports for database monitoring.
- Developed and deployed the solution using Spark and Scala code on a Hadoop cluster running on GCP.
- Developed multi-cloud strategies, making better use of GCP (for its PaaS offerings) and Azure (for its SaaS offerings).
- Experience in developing end-to-end ETL pipelines using Snowflake, Alteryx, and Apache NiFi for both relational and non-relational databases (SQL and NoSQL).
- Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs.
- Performed analytics using real time integration capabilities of AWS Kinesis (Data Streams) on streamed data.
- Experience in code deployment, orchestration, and scheduling using tools such as Kubernetes and Docker Swarm.
- Designed and implemented multiple ETL solutions across various data sources through extensive SQL scripting, ETL tools, Python, shell scripting, and scheduling tools; performed data profiling and data wrangling of XML, web feeds, and files using Python, Unix, and SQL.
- Implemented monitoring and logging with an ELK-style stack (Mirth Connect, Elasticsearch, Kibana).
- Expertise in reporting tools such as Grafana and Kibana (ELK) for setting up graphs and charts for better visual representation of test results.
- Involved in setting up the Apache Airflow service in GCP.
- Created ETL Mapping with Talend Integration Suite to pull data from Source, apply transformations, and load data into target database.
- Loaded data from different sources into a data warehouse and performed data aggregations for business intelligence using Python.
- Used Sqoop to channel data between RDBMS sources and HDFS.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation across multiple file formats.
- Used SSIS to build automated multi-dimensional cubes.
- Used Spark Streaming to receive real-time data from Kafka and stored the stream data in HDFS using Python and in NoSQL databases such as HBase and Cassandra.
- Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Used SQL Server Management Studio (SSMS) to verify the data in the database against the given requirements.
- Validated the test data in DB2 tables on Mainframes and on Teradata using SQL queries.
- Identified and documented Functional/Non-Functional and other related business decisions.
- Developed REST APIs in Node.js using Express. Converted existing APIs and implemented new APIs in the client's middleware stack.
- Worked extensively in the implementation and deployment of REST API and Microservices.
- Created reusable TypeScript components and services to consume REST APIs using a component-based architecture.
- Automated and scheduled recurring reporting processes using UNIX shell scripting and Teradata utilities such as MLOAD, BTEQ, and FastLoad.
- Implemented Actimize Anti-Money Laundering (AML) system to monitor suspicious transactions and enhance regulatory compliance.
- Worked on Dimensional and Relational Data Modelling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modelling.
- Automated the data processing with Oozie to automate data loading into the Hadoop Distributed File System.
- Developed automated regression scripts in Python for validation of ETL processes across multiple databases such as AWS Redshift, Oracle, MongoDB, T-SQL, and SQL Server.
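A minimal sketch of computing consumer group lag from Kafka, as referenced above, using the kafka-python client; the broker address, topic, and group id are hypothetical placeholders.

```python
# Hypothetical sketch: report per-partition lag for a consumer group using kafka-python.
# Broker, topic, and group id are illustrative placeholders.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="broker:9092",
    group_id="orders-etl",          # consumer group whose lag we want to measure
    enable_auto_commit=False,
)

partitions = [TopicPartition("orders", p) for p in consumer.partitions_for_topic("orders")]
end_offsets = consumer.end_offsets(partitions)   # latest offset per partition

for tp in partitions:
    committed = consumer.committed(tp) or 0      # last committed offset for the group
    lag = end_offsets[tp] - committed
    print(f"partition={tp.partition} committed={committed} end={end_offsets[tp]} lag={lag}")

consumer.close()
```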
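A minimal sketch of the PySpark cleansing and conforming work with custom UDFs described above; the column names, paths, and UDF logic are hypothetical examples.

```python
# Hypothetical sketch: a custom UDF plus built-in functions for cleansing/conforming columns.
# Column names, paths, and rules are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("cleanse-conform").getOrCreate()

@F.udf(returnType=StringType())
def normalize_phone(raw):
    """Keep digits only so phone numbers conform to a single format."""
    return "".join(ch for ch in raw if ch.isdigit()) if raw else None

df = spark.read.parquet("s3a://landing/customers/")          # placeholder input path

cleansed = (df
    .withColumn("phone", normalize_phone(F.col("phone")))
    .withColumn("email", F.lower(F.trim(F.col("email"))))
    .withColumn("load_label", F.lit("daily_batch"))          # simple data labelling
    .dropDuplicates(["customer_id"]))

# Example aggregation after conforming
cleansed.groupBy("state").agg(F.count("*").alias("customer_count")).show()
```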
Environment: Cloudera Manager (CDH5), Hadoop, PySpark, HDFS, NiFi, Pig, Hive, S3, Kafka, Scrum, Git, Sqoop, Oozie, Spark, Azure, Informatica, ELK, MDM Hub, Tableau, GCP, OLTP, OLAP, HBase, Cassandra, Apache Airflow, SQL Server, Python, Shell Scripting, XML, Unix.
Confidential, Dallas, TX
Sr Data Engineer
Responsibilities:
- Worked with the business team to gather requirements and helped them with their test cases.
- Played a vital role in design and development for building the common architecture for retail data across geographies.
- Worked on designing and developing five different flows: point of sale, store traffic, labor, customer survey, and audit data.
- Developed a common framework using Spark to ingest data from different data sources (e.g., Teradata to S3 and S3 to Snowflake).
- Developed reusable Spark scripts and functions for data processing that can be leveraged in different data pipelines.
- Worked on ingesting the real-time data using Kafka.
- Designed, developed Azure (AAS & SSAS) cubes for data visualization.
- Used Sqoop to ingest the data from Oracle database and store them on S3.
- Worked on ingesting data from JSON and CSV files using Spark on EMR and storing the output data in Parquet format on S3 (a conversion sketch follows this list).
- Integrated on-premises data (MySQL, HBase) with the cloud (Blob Storage, Azure SQL DB) and applied transformations to load it back to Azure Synapse using Azure Data Factory.
- Built ETL pipelines on Snowflake; the resulting data products are used by stakeholders for querying and serve as backend objects for visualizations (a Snowflake load sketch follows this list).
- Configured Spark Streaming to receive real-time data from Apache Flume and stored the stream data using Scala in Azure Table storage, with Data Lake used for storage and all types of processing and analytics. Created data frames using Spark DataFrames.
- Implemented custom-built input adapters using Spark, Hive, and Sqoop to ingest data for analytics from various sources (Snowflake, MS SQL, MongoDB) into HDFS. Imported data from web servers and Teradata using Sqoop, Flume, and Spark Streaming API.
- Processed huge datasets by leveraging Spark Context, SparkSQL, and Spark Streaming
- Worked on performance tuning of Spark jobs by adjusting memory parameters and cluster configuration.
- Monitored and configured orchestration tools like Kubernetes.
- Excellent knowledge on AWS services (S3, EMR, Athena, EC2), Snowflake and Big Data technologies.
- Providing knowledge transition to support team.
- Used Airflow for scheduling and orchestration of the data pipelines.
- Improved security by using Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory, and Apache Ranger for authentication.
- Managed resources and scheduling across the cluster using Azure Kubernetes Service
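A minimal sketch of the JSON/CSV-to-Parquet conversion on EMR described above; the bucket names, paths, and partition column are hypothetical.

```python
# Hypothetical sketch: read JSON and CSV landing files with Spark on EMR and write Parquet to S3.
# Bucket names, paths, and partition column are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("landing-to-parquet").getOrCreate()

sales_json = spark.read.json("s3://landing-bucket/sales/json/")
sales_csv = (spark.read
             .option("header", "true")
             .option("inferSchema", "true")
             .csv("s3://landing-bucket/sales/csv/"))

# Combine both sources and stamp the load date for partitioning
combined = (sales_json.unionByName(sales_csv, allowMissingColumns=True)
            .withColumn("load_date", F.current_date()))

(combined.write
 .mode("overwrite")
 .partitionBy("load_date")
 .parquet("s3://curated-bucket/sales/"))
```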
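A minimal sketch of loading curated S3 data into Snowflake with the Python connector, in the spirit of the Snowflake ETL pipelines above; the account, credentials, stage, and table names are hypothetical placeholders.

```python
# Hypothetical sketch: load curated Parquet files from an external stage into a Snowflake table.
# Account, credentials, stage, and table names are illustrative placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.us-east-1",
    user="etl_user",
    password="********",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="RETAIL",
)

try:
    cur = conn.cursor()
    cur.execute("""
        COPY INTO RETAIL.SALES
        FROM @CURATED_STAGE/sales/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)
    print(cur.fetchall())   # per-file load results returned by COPY INTO
finally:
    conn.close()
```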
Environment: Hive, Spark SQL, Spark, PySpark, EMR, Tableau, Sqoop, AWS, Python, Snowflake, Teradata, Azure AAS & SSAS, Apache Kafka.
Confidential, Memphis, TN
Big Data Engineer / Hadoop Developer
Responsibilities:
- Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks to automate the ingestion, processing, and delivery of both structured and unstructured data, in batch and as real-time streams, using Python programming.
- Built data warehouse structures and created fact, dimension, and aggregate tables through dimensional modeling with Star and Snowflake schemas.
- Applied transformations to data loaded into Spark DataFrames and performed in-memory computation to generate the output response.
- Good knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
- Used Spark Data Frames API over platforms to perform analytics on Hive data and used Spark Data Frame operations to perform required validations in the data.
- Built end-to-end ETL models to sort vast amounts of customer feedback and derive actionable insights and tangible business solutions.
- Optimized workflows by building DAGs in Apache Airflow to schedule the ETL jobs and implemented additional Apache Airflow components such as pools, executors, and multi-node functionality (a DAG sketch follows this list).
- Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
- Wrote Spark applications for Data Validation, Cleansing, Transformation, and custom aggregation and used Spark engine, Spark SQL for data analysis and provided to the data scientists for further analysis.
- Prepared scripts to automate the ingestion process using PySpark as needed from various sources such as APIs, AWS S3, Teradata, and Snowflake.
- Created a business category mapping system that automatically maps customers' business category information to any source website's category system; category platforms include Google, Facebook, Yelp, and Bing.
- Developed a data quality control model to monitor business information changes over time; the model flags outdated customer information using different APIs for validation and updates it with correct data.
- Responsible for monitoring the sentiment prediction model for customer reviews and ensuring a high-performance ETL process.
- Performed data cleaning, pre-processing, and modelling using Spark and Python.
- Implemented real-time, data-driven, secured REST APIs for data consumption using AWS (Lambda, API Gateway, Route 53, Certificate Manager, CloudWatch, Kinesis) and Snowflake (a handler sketch follows this list).
- Developed automation scripts to transfer data from on-premises clusters to Google Cloud Platform (GCP).
- Loaded file data from the ADLS server into GCP storage buckets and created Hive tables for the end users.
- Involved in performance tuning and optimization of long-running Spark jobs and queries (Hive/SQL).
- Implemented Real-time streaming of AWS CloudWatch Logs to Splunk using Kinesis Firehose.
- Developed, using an object-oriented methodology, a dashboard to monitor all network access points and network performance metrics using Django, Python, MongoDB, and JSON.
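A minimal sketch of an Airflow DAG scheduling ETL steps as described above; the DAG id, pool, and callables are hypothetical placeholders.

```python
# Hypothetical sketch: a small Airflow DAG chaining extract and load steps on a daily schedule.
# DAG id, pool name, and callables are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    print("pulling source files")

def load(**context):
    print("loading curated tables")

with DAG(
    dag_id="daily_feedback_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract, pool="etl_pool")
    load_task = PythonOperator(task_id="load", python_callable=load)

    # extract must finish before load starts
    extract_task >> load_task
```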
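A minimal sketch of a Lambda handler behind API Gateway for secured data consumption, as in the REST APIs above; the fetch_reviews() lookup is a hypothetical stand-in for the real Snowflake/Kinesis-backed query.

```python
# Hypothetical sketch: an API Gateway-backed Lambda handler returning data as JSON.
# fetch_reviews() is a placeholder for the real data-store query.
import json

def fetch_reviews(customer_id):
    """Placeholder for the actual backend lookup."""
    return [{"customer_id": customer_id, "rating": 5}]

def lambda_handler(event, context):
    params = event.get("queryStringParameters") or {}
    customer_id = params.get("customer_id")
    if not customer_id:
        return {"statusCode": 400, "body": json.dumps({"error": "customer_id is required"})}

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(fetch_reviews(customer_id)),
    }
```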
Environment: Hive, Spark SQL, PySpark, EMR, Tableau, Sqoop, AWS, Presto, Python, Snowflake, Teradata, Azure AAS & SSAS, Kafka.
Confidential
AWS Python Developer
Responsibilities:
- Allotted permissions, policies, and roles to users and groups using AWS Identity and Access Management (IAM).
- Developed a fully automated continuous integration system using Git, Jenkins, MySQL and custom tools developed in Python and Bash which saved $85K YOY.
- Developed complex Hive scripts for processing the data and created dynamic partitions and bucketing in Hive to improve query performance.
- Developed server-side software modules and client-side user interface components and deployed entirely in Compute Cloud of Amazon Web Services (AWS).
- Implemented Lambda to configure the DynamoDB Auto Scaling feature and implemented a data access layer to access AWS DynamoDB data.
- Automated the nightly build to run quality control using Python with the Boto3 library, ensuring the pipeline does not fail and reducing effort by 70%.
- Worked with AWS services such as AWS SNS to send out automated emails and messages using Boto3 after the nightly run (a notification sketch follows this list).
- Created AWS Lambda functions, provisioned EC2 instances in the AWS environment, implemented security groups, and administered Amazon VPCs.
- Developed Python AWS serverless Lambda functions with concurrency and multi-threading to speed up processing by executing callables asynchronously.
- Monitored containers on AWS EC2 machines using the Datadog API and ingested and enriched data into the internal cache system.
- Chunked larger datasets into smaller pieces using Python scripts for faster data processing (a chunking sketch follows this list).
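A minimal sketch of the Boto3 SNS notification after the nightly run described above; the topic ARN and message content are hypothetical placeholders.

```python
# Hypothetical sketch: publish a nightly-run status notification to an SNS topic with Boto3.
# Topic ARN and message text are illustrative placeholders.
import boto3

sns = boto3.client("sns", region_name="us-east-1")

def notify_nightly_status(failed_checks):
    """Send a pass/fail summary of the nightly quality-control run."""
    subject = "Nightly QC: FAILED" if failed_checks else "Nightly QC: PASSED"
    body = "\n".join(failed_checks) or "All quality checks passed."
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:nightly-qc",
        Subject=subject,
        Message=body,
    )

notify_nightly_status([])   # example call after the nightly run completes
```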
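A minimal sketch of chunking a large extract into smaller pieces for faster processing, as mentioned above; the file path, chunk size, and per-chunk transformation are hypothetical.

```python
# Hypothetical sketch: process a large CSV extract in fixed-size chunks instead of one load.
# Path, chunk size, and the per-chunk transformation are illustrative placeholders.
import pandas as pd

def process_chunk(chunk):
    """Placeholder transformation applied to each chunk."""
    return chunk.dropna(subset=["id"])

total_rows = 0
for chunk in pd.read_csv("large_extract.csv", chunksize=100_000):
    cleaned = process_chunk(chunk)
    # Write the header only for the first chunk, then append
    cleaned.to_csv("cleaned_extract.csv", mode="a", header=(total_rows == 0), index=False)
    total_rows += len(cleaned)

print(f"processed {total_rows} rows in chunks")
```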
Environment: AWS, S3, EC2, Lambda, IAM, Datadog, CLI, Ansible, MySQL, Python, Git, Jenkins, DynamoDB, CloudWatch, Step Functions
Confidential
SQL Developer
Responsibilities:
- Worked in the development of applications, especially in the UNIX environment and familiar with all its commands.
- Reviewed basic SQL queries and edited inner, left, & right joins in Tableau Desktop by connecting live/dynamic and static datasets.
- Reported and created dashboards for Global Services & Technical Services using SSRS, Oracle, and Excel.
- Deployed Excel VLOOKUP, PivotTable, and Access Query functionalities to research data issues.
- Involved in reviewing business requirements and analyzing data sources from Excel, Oracle SQL Server for design, development, testing, and production rollover of reporting and analysis projects.
- Used a test-driven approach for developing the application and implemented unit tests using the Python unittest framework.
- Successfully migrated the Django database from SQLite to MySQL and then to PostgreSQL with complete data integrity.
- Performed API testing using the Postman tool for various request methods such as GET, POST, PUT, and DELETE on each URL to check responses and error handling.
- Performed debugging and troubleshooting the web applications using Git as a version-controlling tool to collaborate and coordinate with the team members.
- Developed and executed various MySQL database queries from Python using the MySQL Connector/Python and MySQL database packages.
- Designed and maintained databases using Python and developed a Python-based RESTful API (web service) using SQLAlchemy and PostgreSQL (a minimal endpoint sketch follows this list).
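A minimal sketch of a Python REST endpoint backed by PostgreSQL through SQLAlchemy, as described above; the connection string, table, and route are hypothetical, and Flask is assumed as the web framework for illustration.

```python
# Hypothetical sketch: a small Flask + SQLAlchemy REST endpoint reading from PostgreSQL.
# Connection string, table, and route are illustrative placeholders.
from flask import Flask, jsonify
from sqlalchemy import create_engine, text

app = Flask(__name__)
engine = create_engine("postgresql+psycopg2://app_user:secret@localhost:5432/appdb")

@app.route("/customers/<int:customer_id>", methods=["GET"])
def get_customer(customer_id):
    with engine.connect() as conn:
        row = conn.execute(
            text("SELECT id, name, email FROM customers WHERE id = :id"),
            {"id": customer_id},
        ).mappings().first()
    if row is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(dict(row))

if __name__ == "__main__":
    app.run(debug=True)
```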
