Data Engineer Resume
Texas
SUMMARY
- Data Engineer with 8 years of experience in Big Data and analytics, spanning storage, querying, processing, and analysis, with hands-on cloud infrastructure experience developing data pipelines.
- Expertise in designing scalable big data solutions and data warehouse models on large-scale distributed data and in performing a wide range of analytics.
- Worked across the full SDLC under Agile Scrum, from feasibility analysis and conceptual design through implementation, including documentation, user training, and operations support. Eager to contribute to team success through hard work, attention to detail, and strong organizational skills.
- Experienced in querying Snowflake, Oracle, Redshift, and MS SQL Server databases for OLTP and OLAP workloads
- Developed Spark and Hive Jobs to summarize and transform data
- Strong understanding of RDBMS concepts, performance tuning, and query optimization
- Ingested data from different sources into HDFS using Sqoop and Flume and performed transformations using Hive and MapReduce
- Worked on data processing: collecting, aggregating, and moving data from various sources using Apache Flume and Kafka
- Developed automated regression scripts in Python to validate ETL processes across multiple databases including AWS Redshift, Oracle, MongoDB, and SQL Server.
- Responsible for the design and development of Spark SQL scripts based on functional specifications
- Worked on cluster coordination services through Zookeeper
- Deployed data pipelines with CI/CD process
- Worked on ETL pipelines over S3 Parquet files in data lakes using AWS Glue
- Used Python and shell scripting to build pipelines
- Performed advanced procedures such as text analytics and processing, leveraging the in-memory computing capabilities of Spark with Scala
- Responsible for resolving issues, troubleshooting performance problems in Hadoop clusters, and fine-tuning failing Spark applications
- Hands-on experience in developing SQL scripts
- Involved in file movement between HDFS and AWS S3 and worked extensively with S3 buckets
- Helped implement and automate detective controls in the cloud environment to alert on critical security issues
- Monitored applications migrated to AWS using CloudTrail, CloudWatch, and AWS Config
- Wrote Python scripts to automate launching EMR clusters and configuring Hadoop applications (a brief sketch follows this summary)
- Experienced in Waterfall and Agile development (SCRUM) methodologies
- Experience building and supporting data transformations, data structures, metadata, dependency management, and workload management.
- Implemented and maintained security controls that reduce risk and allow risk-based reporting on cloud security posture.
- Worked closely with AWS cloud security subject matter experts and served as an advisor to IT product management staff on dynamic and static code scans, vulnerability scans, web application scans, and other cloud security reviews
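
The EMR automation mentioned above can be illustrated with a minimal boto3 sketch; the cluster name, release label, instance types, and log bucket below are assumed placeholders rather than details from the actual engagement.

# Illustrative sketch only: automating an EMR cluster launch with boto3.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

response = emr.run_job_flow(
    Name="etl-cluster",                      # hypothetical cluster name
    ReleaseLabel="emr-6.4.0",                # assumed EMR release
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://example-bucket/emr-logs/",  # placeholder log bucket
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster started:", response["JobFlowId"])
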
TECHNICAL SKILLS
Data Tools / Technologies: Spark, Spark SQL, Spark Streaming, Hive, Sqoop, Hadoop, HDFS, MapReduce, Pig, Flume, Kafka, Zookeeper, Airflow, Data Lake
Programming Languages: Python, SQL, HiveQL, T-SQL, NoSQL, Shell Scripting, Java
NoSQL Databases: HBase, MongoDB, DynamoDB
Tools: PyCharm, Visual Studio Code, Tableau, Databricks, MySQL Workbench, Maven, Jupyter Notebook, Git, Eclipse, Informatica, Terraform
Databases: Microsoft SQL Server 2008/2010/2012, MySQL 4.x/5.x, Oracle 11g/12c
Cloud Platforms/Services: Snowflake, AWS, AWS CLI, EC2, S3, EMR, IAM, Redshift, DynamoDB, AWS Lambda, Glue, Athena, VPC, Databricks
Hadoop Distributions: Cloudera, Hortonworks
Platforms: UNIX, Windows, Linux
PROFESSIONAL EXPERIENCE
Confidential, Texas
Data Engineer
Responsibilities:
- Involved in gathering requirements from different teams to design the ETL migration process from existing RDBMS systems to the Hadoop cluster using Sqoop.
- Created Bash Scripts to load data from Linux/UNIX file system into HDFS
- Developed Hive queries for data transformation and data analysis.
- Loaded data from existing DWH sources (Teradata and Oracle) into HDFS using Sqoop and loaded it into partitioned Hive tables.
- Converted existing Sqoop and Hive jobs to Spark SQL applications that read data from RDBMS sources over JDBC and write to Hive tables
- Used Hive optimizations such as partitioning, bucketing, map joins, and table statistics for efficient data access, reducing execution time by 30%
- Collaborated with the Predictive Analytics Engineering team to develop end-to-end data solutions for building a data lake and migrating the data warehouse from on-premises systems to the Hadoop cluster
- Developed PySpark scripts that reduced organizational costs by 30% by migrating customer data from the DWH (Teradata) to Hadoop.
- Built data pipelines using Apache NiFi to pull structured data from Splunk for analysis and created Hive tables
- Wrote advanced SQL queries against Snowflake and saved the results as Delta tables.
- Developed Spark jobs to sessionize clickstream data residing in Snowflake
- Worked on implementing a log producer in Scala that sends logs to a Kafka- and ZooKeeper-based log collection platform
- Experienced in handling JSON datasets and wrote custom Python functions that are reused by various applications across the enterprise
- Responsible for assessing and improving the quality of customer data.
- Worked on the end-to-end data quality process setup on AWS for the entire health insurance division
- Designed ETL pipelines in Amazon EMR using PySpark to process raw data from Amazon S3 and copy it into Amazon Redshift; created views to enable fast access to the data and improved view performance using sort keys (see the sketch after this section)
- Developed ETL pipelines over S3 Parquet files in the data lake using AWS Glue
- Created and scheduled cron jobs to automate the execution of mass data quality checks
- Developed a Java-based ETL tool that extracts data from sources such as IBM Cognos (XML) and MySQL and loads it into target tables in a MySQL database
- Monitored datasets on EC2 instances with EBS volumes attached.
- Involved in migrating the quality monitoring tool's code from AWS EC2 to AWS Lambda to reduce costs incurred from reserved EC2 instances
Environment: Apache Spark, Apache Hive, Python, AWS, S3, Pyspark, NIFI, Zookeeper, Kafka, Oracle Primavera, Hadoop, Data Lake, EMR
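
A minimal PySpark sketch of the S3-to-EMR-to-Redshift pattern referenced in this section; bucket names, paths, and column names are assumed placeholders, and the final Redshift load is summarized in a comment rather than shown.

# Illustrative sketch only: process raw S3 data on EMR and stage it for Redshift.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3-to-redshift-etl").getOrCreate()

# Read raw data landed in S3 (path is hypothetical)
raw = spark.read.json("s3://example-raw-bucket/events/")

# Basic cleansing/transformation before loading to the warehouse
cleaned = (raw
           .filter(F.col("event_id").isNotNull())
           .withColumn("event_date", F.to_date("event_ts")))

# Stage the transformed data back to S3 as Parquet; a Redshift COPY command
# (or a JDBC write) would then load the staged files into the target table.
cleaned.write.mode("overwrite").parquet("s3://example-curated-bucket/events/")
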
Confidential, Ohio
Data Engineer
Responsibilities:
- Involved in designing and extracting data from different ETL tools and then applying transformation logic
- Used FastLoad and MultiLoad utilities for loading data into staging and target tables
- Developed complex mapping in DataStage
- Worked on data profiling and created logical datasets in Snowflake to administer the quality monitoring process
- Developed code in Hadoop technologies and performed unit testing
- Designed and developed Spark Scala ingestion pipelines both in real-time and batch
- Bulk loaded data from external and internal stages into Snowflake and performed transformations based on business requirements using Databricks, Spark SQL, PySpark, S3, and Delta.
- Developed ETL pipelines using a combination of Python, Snowflake, and SnowSQL, and wrote advanced SQL queries against Snowflake
- Developed Spark Streaming programs to consume data from Kafka and apply both stateless and stateful transformations (see the sketch after this section)
- Built and implemented automated procedures to split large files into smaller batches to facilitate FTP transfer, reducing execution time by 60%
- Created Spark programs to parse, analyze and implement the solutions based on customer needs
- Migrated all fact/OLAP tables written in Hive to PySpark
- Ingested data and performed RDD transformations using Spark to run streaming analytics in Databricks
- Worked on implementing a log producer in Scala for application logs, transforming incremental logs and sending them to Kafka- and ZooKeeper-based log collection platforms
- Transformed data using AWS Glue DynamicFrames with PySpark
- Implemented AWS Step Functions to automate Amazon SageMaker related tasks such as publishing data to S3
- Created graphs/charts with detailed data analysis results
- Participated in Agile Scrum daily stand-up meetings to discuss work progress and blockers
- Actively participated in sprint reviews held every 15 days to demo work to clients and gather their feedback.
Environment: Databricks, Spark, Python, AWS, S3, Snowflake, Pyspark, SparkSQL, Kafka, Zookeeper
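
A minimal sketch of the Kafka consumption described above, written with PySpark Structured Streaming; the broker address, topic name, and window size are assumed placeholders, and the spark-sql-kafka connector package is expected on the classpath.

# Illustrative sketch only: stateless and stateful processing of a Kafka stream.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-streaming").getOrCreate()

# Read a stream from Kafka (broker and topic are hypothetical)
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "app-logs")
          .load())

# Stateless transformation: parse and filter each record independently
parsed = events.selectExpr("CAST(value AS STRING) AS line", "timestamp")
errors = parsed.filter(F.col("line").contains("ERROR"))

# Stateful transformation: a windowed count, which keeps state across micro-batches
counts = errors.groupBy(F.window("timestamp", "5 minutes")).count()

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
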
Confidential
Data Engineer
Responsibilities:
- Designed, built, and maintained the data pipelines that bring data into data lakes.
- Involved in designing ETL pipelines to automate data ingestion and facilitate data analysis
- Ingested streaming data from multiple sources into HDFS for storage and analysis using Apache Flume
- Worked with global teams to feed data into various environments so that downstream applications could be tested
- Used Hive to run MapReduce jobs for data aggregation and transformation across multiple file formats including XML, JSON, CSV, and other compressed formats
- Used Spark transformations to build both simple and complex ETL applications that create structured data from incoming unstructured data
- Created HBase tables to store variable data formats coming from different portfolios
- Designed internal and external table schemas in Hive with appropriate static and dynamic partitions for efficiency (see the sketch after this section).
- Involved in preparing design, unit, and integration tests documents.
- Developed Hadoop solutions on AWS, in roles spanning developer to admin, utilizing the Hortonworks Hadoop stack
- Managed AWS role-based security, Hadoop administration, and load balancing on AWS EC2 clusters.
Environment: Apache Spark, Apache Hive, Python, AWS, S3, Snowflake, Pyspark, Spark SQL, Kafka, Oracle Primavera, Hadoop, Data Lake
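
A minimal sketch of the partitioned Hive table design mentioned above, issued through PySpark for consistency with the other examples; the table names, columns, and S3 location are assumed placeholders.

# Illustrative sketch only: an external Hive table with dynamic partitioning.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning")
         .enableHiveSupport()
         .getOrCreate())

# External table whose partitions map to directories under the (placeholder) location
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_ext (
        order_id STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS PARQUET
    LOCATION 's3://example-bucket/warehouse/sales_ext/'
""")

# A tiny stand-in for the real staging data
spark.createDataFrame(
    [("o1", 10.0, "2020-01-01"), ("o2", 25.5, "2020-01-02")],
    ["order_id", "amount", "order_date"],
).createOrReplaceTempView("staging_sales")

# Dynamic partitioning derives the partition value from the data itself
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE sales_ext PARTITION (order_date)
    SELECT order_id, amount, order_date FROM staging_sales
""")
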
Confidential
Data Engineer
Responsibilities:
- Created consumption views on top of metrics to reduce the running time for complex queries.
- Involved in functional, integration, regression, smoke, and performance testing; tested Hadoop MapReduce jobs developed in Python, Pig, and Hive
- Generated custom SQL to verify dependencies for the daily, weekly, and monthly jobs.
- Using Nebula metadata, registered business and technical datasets for the corresponding SQL scripts
- Created performance dashboards in Tableau, Excel, and PowerPoint for key stakeholders
- Implemented Defect Tracking process using JIRA tool by assigning bugs to Development Team
- Developed Spark code and Spark-SQL/streaming for faster testing and processing of data.
- Evaluated the traffic and performance of Daily Deals PLA ads and compared those items with non-deal items to assess the potential for increasing ROI; suggested improvements and modified existing BI components (reports, stored procedures)
- As part of data migration, wrote many SQL scripts to reconcile data mismatches and worked on loading history data from Teradata to Snowflake.
- Created metric tables and end-user views in Snowflake to feed data for Tableau refreshes.
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files (see the sketch after this section).
Environment: Hadoop, MapReduce, Hive, Apache Spark, Sqoop, Snowflake, Nebula, Teradata, SQL Server, Python, Pig, GitHub, Tableau
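
A minimal sketch of querying CSV and text files with Spark SQL, shown here in PySpark for consistency with the other examples (the original work used Scala, as noted above); file paths and column names are assumed placeholders.

# Illustrative sketch only: Spark SQL over CSV and plain-text files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-format-queries").getOrCreate()

# CSV with a header row, letting Spark infer the schema
orders = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("/data/orders.csv"))
orders.createOrReplaceTempView("orders")

daily_totals = spark.sql("""
    SELECT order_date, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_date
""")
daily_totals.show()

# Plain text files come back as a single 'value' column
logs = spark.read.text("/data/app.log")
logs.filter(logs.value.contains("ERROR")).show(truncate=False)
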
Confidential
Data Analyst
Responsibilities:
- Created database objects such as tables, views, stored procedures, and triggers
- Designed and implemented data integration modules and ETL data analysis techniques to validate business rules and identify low-quality and missing data in the existing enterprise data warehouse (see the sketch after this section)
- Performed analysis and presented results using SQL, Python, SSIS, MS Access, Excel, and Visual Basic scripts
- Analyzed and validated findings, creating reports, presentations, and visualizations
- Designed data tables in coordination with client services and internal departments
- Coordinated with QA testers for end-to-end unit testing and postproduction testing
- Performed Tableau Server admin duties like installation, configuration, security, migration, upgrades, maintenance, and monitoring
- Created dashboards in Tableau to deliver interactive, reliable reporting and visually compelling, accurate dashboards; tested, cleaned, and standardized data to meet business standards
- Worked with relational DBMS environments and ER diagramming
- Resolved and troubleshot complex issues
Environment: SQL Server, Python, Tableau, MS Excel, MS Power Point.
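
A minimal Python sketch of the kind of data-quality validation described above; the connection string, table, and column names are hypothetical placeholders.

# Illustrative sketch only: flag rows with missing or out-of-range values.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://user:password@dsn_name")  # hypothetical DSN

customers = pd.read_sql(
    "SELECT customer_id, email, signup_date FROM dbo.customers", engine
)

# Identify low-quality rows: missing emails or signup dates in the future
issues = customers[
    customers["email"].isna()
    | (pd.to_datetime(customers["signup_date"]) > pd.Timestamp.now())
]

print(f"{len(issues)} of {len(customers)} rows failed validation")
issues.to_csv("customer_quality_issues.csv", index=False)
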