Big Data Engineer Resume
Houston, TX
SUMMARY
- 8 years of demonstrated experience in the IT industry with expert-level skills in the Big Data Hadoop ecosystem, Apache Spark, PySpark, Scala, Python, Kafka, Data Warehousing, and Data Lakes.
- Solid experience in building Data Lakes and Data Warehouses using AWS and Google Cloud.
- Expert in building batch and streaming data pipelines using cutting-edge big data technologies and cloud services.
- Expert-level knowledge of Hadoop Distributed File System (HDFS) architecture and YARN.
- Used AWS services like S3, Lambda, RDS, Redshift, Glue, EMR, EC2, Athena, IAM, Step Functions, QuickSight, etc.
- Used GCP services like Cloud Storage, BigQuery, Cloud Composer, Dataflow, Dataproc, Cloud Functions, etc.
- Experienced in Hive partitioning, bucketing, and query optimization through set parameters; performed different types of joins on Hive tables and implemented Hive SerDes such as Avro and JSON.
- Experienced in importing and exporting data from different databases like SQL Server, Oracle, Teradata, etc.
- Experienced in automating workflows with Apache Airflow and AWS Step Functions.
- Experience in ZooKeeper configuration to provide cluster coordination services.
- Extensive experience in creating RDDs and Datasets in Spark from the local file system and HDFS.
- Hands-on experience in writing different RDD (Resilient Distributed Dataset) transformations and actions using Scala.
- Created DataFrames and performed analysis using Spark SQL, and used the RDD and DataFrame APIs to access a variety of data sources using Scala, PySpark, pandas, and Python (see the PySpark sketch following this summary).
- Excellent knowledge of Spark core architecture.
- Working knowledge of Spark Streaming and the Spark machine learning libraries.
- Created EMR transient and long-running clusters in AWS for data processing (ETL) and log analysis.
- Deployed various Hadoop applications on EMR (Hadoop, Hive, Spark, HBase, Hue, Glue, Oozie, Presto, etc.) based on needs.
- Experience in integrating Hive with AWS S3 to read and write data from and to S3, and created partitions in Hive.
- Extensively worked on ETL/data pipelines to transform data and load it from AWS S3 to Snowflake and Redshift.
- Extensively utilized EMRFS (Elastic MapReduce File System) for reading and writing serialized (SerDe) data between HDFS and EMRFS.
- Experience in using Presto on EMR to query different types of data sources, including RDBMS and NoSQL databases.
- Experience in creating ad-hoc reports and developing data visualizations using enterprise reporting tools like Tableau, Power BI, Business Objects, etc.
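A minimal PySpark sketch of the DataFrame and Spark SQL usage summarized above; the file path, column names, and the orders dataset are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

# Build a Spark session with Hive support so tables and views can be queried with Spark SQL.
spark = (SparkSession.builder
         .appName("summary-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read a CSV file from HDFS (path and schema are placeholders).
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///data/raw/orders.csv"))

# Register the DataFrame as a temporary view and analyze it with Spark SQL.
orders.createOrReplaceTempView("orders")
daily_totals = spark.sql("""
    SELECT order_date, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_date
""")

# Convert the small aggregate to pandas for downstream analysis.
daily_totals_pd = daily_totals.toPandas()
print(daily_totals_pd.head())
```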
TECHNICAL SKILLS
Big Data Tools: Hadoop, Hive, Apache Spark, PySpark, HBase, Kafka, YARN, Sqoop, Impala, Oozie, Pig, MapReduce, ZooKeeper and Flume
Hadoop Distributions: EMR, Cloudera, Hortonworks.
Cloud Services: AWS - EC2, S3, EMR, RDS, Glue, Presto, Lambda, Redshift; Azure - Data Lakes, Blob Storage; GCP - Cloud Storage, BigQuery, Compute Engine, Cloud Composer, Dataproc, Dataflow, Pub/Sub
BI and Data Visualizations: ETL - Informatica, SSIS, Talend; Tableau and Power BI
Relational Databases: Oracle, SQL Server, Teradata, MySQL, PostgreSQL and Netezza
NoSQL Databases: Cassandra, MongoDB and HBase
Programming Languages: Scala, Python and R
Scripting: Python and Shell scripting
Build Tools: Apache Maven and SBT, Jenkins, Bitbucket
Version Control: GIT and SVN
Operating Systems: Unix, Linux, Mac OS, CentOS, Ubuntu and Windows
Tools: PuTTY, PuTTYgen, Eclipse, IntelliJ and Toad
PROFESSIONAL EXPERIENCE
Confidential - Houston, TX
Big Data Engineer
Responsibilities:
- Collaborated with technical, application, and security leads to deliver reliable and secure Big Data infrastructure and tools using technologies like Spark, container services, and AWS services.
- Developed data processing pipelines in Spark and other big data technologies.
- Designed and deployed high performance systems with reliable monitoring, logging practices and dashboards.
- Designed and developed batch and streaming data pipelines to load data into the Data Lake and Data Warehouse using services like AWS S3, Redshift, Glue, EMR, Lambda, Step Functions, IAM, QuickSight, RDS, CloudWatch Events, and CloudTrail.
- Developed APIs to access the enterprise metadata platform for registering datasets and migrated all data pipeline applications from the legacy platform to the new platform.
- Developed and modified all of our data applications using Python.
- Worked with Information Security teams to create data policies, develop interfaces and retention models, and deploy the solution to production.
- Designed, architected, and developed solutions leveraging big data technology (open source, AWS) to ingest, process, and analyze large, disparate data sets to exceed business requirements.
- Used AWS services like S3, RDS, Redshift, Athena, Lambda, EC2, EMR, IAM, Step Functions, CloudWatch, Glue, QuickSight, EKS, etc.
- Used Apache Airflow as the orchestration service and created several DAGs as part of batch and streaming pipelines (see the Airflow sketch following this section).
- Developed Python and PySpark code for ETL and used it in AWS services like Lambda, Glue, EC2, EMR, etc.
- Used AWS Lambda, Kinesis, S3, and Redshift for streaming pipelines.
- Used AWS Lambda, Glue, Athena, S3, Redshift, and EMR for batch and streaming pipelines.
- Extensively used PySpark and Python for building and modifying existing pipelines.
- Worked on creating pandas DataFrames and Spark DataFrames where needed in order to optimize performance.
- Created/modified data pipeline applications using Spark, Python, PySpark, AWS EMR, S3, Lambda, pandas, Spark SQL, Glue, Presto, and Snowflake.
- Extensively worked on EMR and Spark jobs, and created IAM roles and S3 bucket policies based on enterprise requests.
- Created complex SQL queries using Spark SQL for all data transformations when loading data into Snowflake (see the PySpark-to-Snowflake sketch following this section).
- Experience in working with Kafka streaming applications for real-time data needs.
- Deployed applications to Docker containers using Jenkins and triggered jobs to load data from S3 to Snowflake tables, or from Snowflake to S3, based on user requirements.
- Collaborated with Data Analysts to understand the data and their end requirements, transformed the data using PySpark and Spark SQL, and created DataFrames.
- Extensively worked on file formats like Parquet, Avro, CSV, and JSON.
- Experience in using Postman to test endpoints for various applications and validate the schema, data types, file formats, etc.
- Monitored applications using PagerDuty and Splunk logs based on the daily, weekly, and monthly data loads.
- Experience working with Docker images and updating Docker images as necessary.
- Worked on Databricks, created managed clusters and performed data transformations.
Environment: PySpark, Python, AWS- EMR, S3, Lambda Functions, IAM, Security Groups, Glue, Presto, Git, Jenkins, Docker, Snowflake, SQL, DataFrames, pandas, Containers, PagerDuty, Splunk and Databricks
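A minimal Apache Airflow sketch of the kind of batch DAG referenced above; the DAG ID, task names, S3 path, and load targets are hypothetical, and operator import paths vary by Airflow version (this assumes Airflow 2.x).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_from_s3(**context):
    # Placeholder: pull the raw file listing for the run date from S3.
    print("extracting s3://example-bucket/raw/ for", context["ds"])


def load_to_redshift(**context):
    # Placeholder: trigger the downstream load (for example, a COPY into Redshift).
    print("loading curated data for", context["ds"])


with DAG(
    dag_id="daily_batch_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_from_s3", python_callable=extract_from_s3)
    load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)

    # Run the extract task before the load task.
    extract >> load
```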
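A hedged PySpark sketch of the S3-to-Snowflake loads described above, assuming the Spark-Snowflake connector is available on the cluster; the connection option values, S3 path, and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-to-snowflake").getOrCreate()

# Read curated Parquet data from S3 (path is a placeholder).
events = spark.read.parquet("s3://example-bucket/curated/events/")

# Transform with Spark SQL before loading.
events.createOrReplaceTempView("events")
daily = spark.sql("""
    SELECT event_date, event_type, COUNT(*) AS event_count
    FROM events
    GROUP BY event_date, event_type
""")

# Snowflake connection options (all values are placeholders).
sf_options = {
    "sfURL": "example_account.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "****",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
}

# Write the aggregate to a Snowflake table through the connector.
(daily.write
 .format("net.snowflake.spark.snowflake")
 .options(**sf_options)
 .option("dbtable", "DAILY_EVENT_COUNTS")
 .mode("overwrite")
 .save())
```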
Confidential - Manchester, CT
Data Engineer
Responsibilities:
- Designed, developed, and deployed Data Lake and Data Warehouse in AWS Cloud.
- Worked on creating data pipelines using Python and Apache Airflow, integrating various data sources.
- Processed data ingestion using Spark APIs for loading data into data lakes and AWS S3.
- Determined data availability, processed data into Parquet or Avro file formats, and stored it on S3 and in the data lake.
- Configured and developed the Hadoop environment with AWS EC2, EMR, S3, Glue, Presto, Redshift, Athena, and Kinesis.
- Created batch scripts to fetch data from AWS S3 storage and performed Spark operations (transformations and actions) using Scala and Spark/PySpark.
- Performed ETL tasks using AWS Glue to move data from various sources to S3 and configured crawlers to handle the schemas.
- Worked on Apache Spark to read file formats like Parquet and converted HiveQL queries into Spark transformations using Spark RDDs and DataFrames in Scala.
- Extensively used Spark core (SparkContext), Spark SQL, and Spark Streaming for real-time data.
- Performed schema level transformations using Apache Spark and Scala in Databricks environment for downstream users.
- Collaborated with QA and Data Scientists to support system, integration, and user acceptance testing.
- Worked on Apache Kafka and Storm for providing real-time analytics.
- Extensively worked with streaming data in Kafka, importing XML data for real-time streaming into HBase and the HDFS file system (see the Kafka streaming sketch following this section).
- Utilized Snappy compression with Avro and Parquet files to leverage storage in HDFS.
- Extensively worked on Hive to create value-added procedures and wrote Hive UDFs (User Defined Functions) to make the functions reusable for different models.
- Implemented partitioning and bucketing in Hive tables for query optimization (see the Hive partitioning sketch following this section).
- Worked on Impala to create views for business use-case requirements on top of Hive tables.
- Extensively worked on window functions and aggregate functions in Hive.
- Developed Shell and Python scripts to automate data loads and data validations
Environment: Apache Spark, PySpark, Hive, Databricks, Azure Data Lakes, Azure Blob, Snowflake, AWS S3, EMR, Glue, Presto, Hadoop, Git, Eclipse, Scala, Python and Apache Airflow.
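A minimal Spark Structured Streaming sketch of the Kafka-to-HDFS flow mentioned above (the original pipeline also targeted HBase and parsed XML payloads); broker addresses, the topic, and paths are placeholders, and the job assumes the spark-sql-kafka package is on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Subscribe to a Kafka topic (bootstrap servers and topic are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "device-events")
       .load())

# Kafka delivers key/value as binary; keep the payload as a string here
# (the real pipeline parsed the XML payload before persisting it).
events = raw.select(col("value").cast("string").alias("payload"),
                    col("timestamp"))

# Continuously append the stream to HDFS as Parquet with a checkpoint.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streams/device_events/")
         .option("checkpointLocation", "hdfs:///checkpoints/device_events/")
         .outputMode("append")
         .start())

query.awaitTermination()
```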
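A hedged sketch of Hive partitioning submitted through Spark SQL, as referenced above; the table names and the staging table are placeholders, and the Python UDF is an illustrative Spark SQL stand-in for the Java Hive UDFs typically used.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

# Hive support lets Spark SQL manage partitioned Hive tables.
spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Allow dynamic partition inserts (Hive-style set parameters).
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# A partitioned Hive table stored as Parquet (names are placeholders).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_curated (
        order_id STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS PARQUET
""")

# Register a small Python UDF for reuse in SQL statements.
spark.udf.register("normalize_id",
                   lambda s: s.strip().upper() if s else None,
                   StringType())

# Dynamic-partition insert from an assumed staging table into the partitioned table.
spark.sql("""
    INSERT INTO TABLE sales_curated PARTITION (order_date)
    SELECT normalize_id(order_id), amount, order_date
    FROM sales_staging
""")
```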
Confidential - Houston, TX
Data Engineer
Responsibilities:
- Used Agile software development methodology in defining the problem, gathering requirements, development iterations, business modeling and communicating with the technical team for development of the system.
- Built Real-time and batch pipelines using GCP services.
- Modeled and developed a new data warehouse from scratch and then migrated that data warehouse to BigQuery. Automated data pipelines and quality control checks.
- Architected and implemented a solution to migrate the data platform from Hadoop to Google Cloud Platform.
- Worked on developing continuous integration (CI), continuous delivery (CD), and continuous training (CT) for the ML system using Cloud Build.
- Worked heavily with the business to determine reporting requirements and to explore the data with each other.
- Used Cloud Functions, Dataflow, Cloud Pub/Sub, and BigQuery to build streamlined dashboards to monitor services.
- Created jobs using Cloud Composer (Airflow DAGs) to migrate data from the Data Lake (Cloud Storage), transform it using Dataproc, and ingest it into BigQuery for further analysis.
- Member of the Enterprise Architecture Committee and presented multiple times to this group and other business leaders.
- Built batch and streaming jobs using GCP services like BigQuery, Pub/Sub, Dataproc, Dataflow, Cloud Run, Compute Engine, and Cloud Composer.
- Created data pipelines to read data from single and multiple sources using Dataflow and saved the data into Cloud Storage and BigQuery.
- Helped implement Power BI in the organization and developed Power BI reports for every department in the company.
- Developed PySpark scripts for ETL using Dataproc (see the Dataproc-to-BigQuery sketch following this section).
- Good experience with GCP Stackdriver Trace, Profiler, Logging, Error Reporting, and Monitoring.
- Set up and implemented Continuous Integration and Continuous Delivery (CI/CD) processes using Git and Jenkins.
- Experience in developing APIs in the cloud.
- Good knowledge of IAM roles and Cloud Security.
- Designed Power BI data visualizations using cross tabs, maps, scatter plots, pie charts, etc.
Environment: Apache Spark, AWS EMR, S3, Redshift, Oozie, Hive, Sqoop, Scala, Python, Java, Maven, SBT, IntelliJ, MySQL, Oracle, HDFS, Spark RDD, PySpark, Cloudera, Hortonworks.
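A hedged PySpark sketch of the Dataproc-to-BigQuery step referenced above, assuming the spark-bigquery connector is attached to the Dataproc cluster; the bucket, dataset, and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-to-bigquery").getOrCreate()

# Read curated Parquet files from the Cloud Storage data lake (path is a placeholder).
orders = spark.read.parquet("gs://example-datalake/curated/orders/")

# Light transformation before loading into the warehouse.
orders.createOrReplaceTempView("orders")
summary = spark.sql("""
    SELECT order_date, region, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_date, region
""")

# Write to BigQuery via the spark-bigquery connector; the temporary GCS bucket
# is used for the indirect write path.
(summary.write
 .format("bigquery")
 .option("table", "analytics.daily_order_summary")
 .option("temporaryGcsBucket", "example-temp-bucket")
 .mode("overwrite")
 .save())
```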
Confidential - NYC, NY
Data Engineer
Responsibilities:
- Migrated the on-prem HDFS ecosystem to the AWS Cloud.
- Migrated the data lake and Data Warehouse from HDFS and Hive to AWS S3 and Redshift.
- Analyzed large and critical data sets using Cloudera, HDFS, HBase, Hive, UDF, Sqoop, YARN, PySpark and Apache Spark
- Worked with unstructured and semi-structured data of 2 PB in size and developed multiple Kafka producers and consumers with customized partitions to get optimized results.
- Worked with AWS EMR and S3, processed data directly in S3, imported it into HDFS on the EMR cluster, and utilized Spark for analysis.
- Developed Hive scripts, Unix shell scripts, and PySpark and Scala programs for all ETL loading processes, converting the files into Parquet format on the HDFS filesystem.
- Developed Spark operations used for streaming, transformations, and aggregations, and generated daily snapshots of customer data.
- Created Apache Spark RDDs (Resilient Distributed Datasets) using Scala programming for analyzing large data sets with filters, maps, flatMaps, count, distinct, etc. (see the RDD sketch following this section).
- Extensively used Sqoop to import/export data between RDBMS and Hive tables, including incremental imports.
- Leveraged the Snappy compression format with Avro and Parquet file formats for storage in HDFS.
- Worked on Hive tables to create value-added procedures and wrote UDFs (User Defined Functions) to make the functions reusable.
- Developed Partitioning and Bucketing techniques in Hive tables for query optimization for Enterprise Business Intelligence and Data Analytics teams.
- Worked on the Impala query engine for data stored on a distributed cluster for various business use cases on top of Hive tables.
- Streamlined Hadoop jobs and workflow operations using Oozie, scheduled through Autosys on a monthly and quarterly basis.
- Extensively used the build tools Maven and SBT and the IntelliJ IDEA for building projects.
Environment: Hadoop Ecosystem, Oracle, Informatica, ETL, Pig, HBase, SQL Server, Eclipse, MySQL, Sqoop, Hive, Shell scripting, MapReduce jobs
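A minimal PySpark RDD sketch of the transformations and actions listed above (the original work used Scala; this shows the same operations in PySpark); the input path and record layout are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# Load delimited customer records from HDFS (path and layout are placeholders).
lines = sc.textFile("hdfs:///data/raw/customers/")

# Transformations: split records, drop malformed rows, and project customer IDs.
records = lines.map(lambda line: line.split(","))
valid = records.filter(lambda fields: len(fields) >= 3)
customer_ids = valid.map(lambda fields: fields[0])

# flatMap example: explode a pipe-delimited tags field into individual tags.
tags = valid.flatMap(lambda fields: fields[2].split("|"))

# Actions: count records and distinct values for a daily snapshot check.
print("valid records:", valid.count())
print("distinct customers:", customer_ids.distinct().count())
print("distinct tags:", tags.distinct().count())
```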
Confidential
Database Developer
Responsibilities:
- Responsible for extracting data from various flat files using SQL scripts and ETL, validating the data, and loading it into the data warehouse.
- Worked on UNIX shell scripts to extract data from various sources such as DB2 and Oracle.
- Performed data analysis on both OLTP and OLAP data using complex SQL queries and delivered data solutions.
- Extensively wrote SQL queries and created views and stored procedures in the Oracle database.
- Extracted data from multiple databases using database links, writing complex PL/SQL queries to eliminate duplicate data (see the SQL sketch following this section).
- Developed complex and ad-hoc reports using multiple data providers for patient services and commercial teams.
- Created ETL workflows to extract data from CRM applications and load it into the Oracle data warehouse.
Environment: Relational Databases - MySQL, SQL Server, Oracle, DB2; JIRA, SQL, PL/SQL, Informatica (ETL) and BI tools.
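An illustrative Python/cx_Oracle sketch of the kind of deduplicating extraction over a database link described above (the original work was done directly in SQL and PL/SQL); the credentials, link name, and table/column names are placeholders.

```python
import cx_Oracle

# Connection details are placeholders.
connection = cx_Oracle.connect("etl_user", "****", "dbhost/ORCLPDB1")

# ROW_NUMBER() keeps the latest row per patient_id when the same record arrives
# from a local table and from a remote table over a database link.
DEDUP_SQL = """
    SELECT patient_id, source_system, updated_at
    FROM (
        SELECT p.*,
               ROW_NUMBER() OVER (
                   PARTITION BY patient_id
                   ORDER BY updated_at DESC
               ) AS rn
        FROM (
            SELECT patient_id, source_system, updated_at FROM patients_local
            UNION ALL
            SELECT patient_id, source_system, updated_at FROM patients_remote@crm_link
        ) p
    )
    WHERE rn = 1
"""

cursor = connection.cursor()
cursor.execute(DEDUP_SQL)
for patient_id, source_system, updated_at in cursor:
    # Placeholder: each deduplicated row would be loaded into a warehouse staging table here.
    print(patient_id, source_system, updated_at)

cursor.close()
connection.close()
```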