Data Engineer Resume
San Francisco, CA
SUMMARY
- 6+ years of experience across all phases of the Software Development Life Cycle, including big data and cloud-based applications spanning multiple technologies and business domains.
- Proficient in designing and developing complex ETL pipelines on Databricks, leveraging Apache Spark and MPP databases such as Snowflake and Redshift.
- Experienced in building and managing effective end-to-end big data pipelines using Python, the Hadoop ecosystem (Hive, Sqoop), Spark, and Airflow.
- Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation.
- Excellent understanding of Hadoop architecture and big data components such as HDFS, MapReduce, Hive, HBase, Oozie, and Sqoop, with working experience in Spark Core, Spark SQL, Spark Streaming, and Kafka with Python.
- Sound knowledge of distributed systems architecture and parallel processing frameworks.
- Good working experience with Hadoop architecture and components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Python.
- Experienced in designing, building, and deploying multiple applications on the AWS stack, focusing on high availability, fault tolerance, and auto-scaling.
- Experienced in developing Spark applications using the Spark RDD, DataFrame, Spark SQL, and Spark Streaming APIs.
- Strong experience with Spark Streaming, Spark SQL, batch processing, and other Spark components such as accumulators, broadcast variables, different levels of caching, and optimization techniques for Spark jobs.
- Strong understanding of partitioning and bucketing in Hive; designed both managed and external tables in Hive to optimize performance.
- Worked extensively on fine-tuning Spark applications to improve performance and on troubleshooting failures in Spark applications.
- Experience in writing REST APIs in Python for large-scale applications.
- Experience in developing advanced SQL queries and stored procedures, ensuring appropriate data filters are applied based on business requirements.
- Responsible for SQL query performance tuning.
- Strong experience working with databases such as Oracle 10g, SQL Server 2016, and MySQL, and proficient in writing complex SQL queries.
- Worked on both Hortonworks Sandbox and Cloudera Hadoop distributions (Linux, Red Hat).
- Knowledge of Delta Lake
- Experienced in writing Shell Scripts to automate processes as per business requirements
- Worked in an agile development environment to analyze, develop, test and deploy potential use cases for the business.
- Strong knowledge of NLP and of supervised and unsupervised machine learning algorithms using both scikit-learn and the Spark ML API.
- Good communication and presentation skills; willing to learn and adapt to new technologies and third-party products.
TECHNICAL SKILLS
Big Data Tools/Technologies: Spark, Spark SQL, Spark Streaming, Hive, Sqoop, Hadoop, HDFS, YARN, MapReduce, Pig, Impala, Flume, Kafka, Zookeeper, Airflow, Oozie, Delta Lake
Programming Languages: Python, Boto3, SQL, HiveQL, T-SQL, NoSQL, Shell Scripting, Java
NoSQL Databases: HBase, MongoDB, Cassandra, DynamoDB
Tools: PyCharm, Visual Studio Code, Tableau, Databricks, MySQL Workbench, Maven, Jupyter Notebook, Git, Eclipse, Informatica
Databases: Microsoft SQL Server 2008, 2010/2012, MySQL 4.x/5.x, Oracle 11g, 12c
Cloud Platforms/Services: Snowflake, AWS, AWS CLI, EC2, S3, EMR, IAM, Redshift, DynamoDB, AWS Lambda, Glue, Athena, VPC, Databricks
Hadoop Distributions: Cloudera, Hortonworks
PROFESSIONAL EXPERIENCE
Confidential, San Francisco, CA
Data Engineer
Responsibilities:
- Designed, developed, and unit tested data pipelines that load data from Snowflake and perform transformations based on business requirements using Databricks, Spark SQL, PySpark, S3, and Delta Lake.
- Created Databricks notebooks using Spark SQL and PySpark, and automated and scheduled data pipelines using Databricks jobs.
- Developed Spark jobs to sessionize clickstream data residing in Snowflake.
- Conducted performance tuning and achieved a 70% performance improvement for key data pipelines.
- Wrote advanced SQL queries against Snowflake and saved the results as Delta tables.
- Worked with multiple data formats such as Parquet, JSON, CSV, Delta, Excel, and Google Sheets.
- Fine-tuned Spark applications to improve the overall processing time of the pipelines.
- Developed a Spark data pipeline that reads data from Google Drive, performs transformations, and ingests the data into Snowflake.
- Developed Spark Streaming applications to consume data from Kafka topics and load the processed streams into Snowflake.
- Worked with REST APIs to query data from endpoints and load it into internal tables.
- Handled large datasets using partitioning, Spark in-memory capabilities, Spark broadcasts, efficient joins, and transformations during the ingestion process.
- Developed a data pipeline that extracts data from different sources and loads it into Redshift using Spark, EMR, S3, and Python.
- Developed ETL jobs using AWS Glue and PySpark.
- Designed and developed solutions using AWS services such as Lambda, Glue, and Redshift.
- Worked with Python packages such as Boto3 on AWS.
- Worked collaboratively and effectively with a range of people in a Scrum/Agile environment.
Environment: Databricks, Spark, Python, AWS, S3, Snowflake, PySpark, Spark SQL, Delta Lake, Kafka, REST API, Redshift, EMR, AWS Glue, AWS Lambda, AWS Athena, Boto3
Confidential, Addison, TX
AWS/Big Data Engineer
Responsibilities:
- Worked with data scientists and business analysts to gather and understand specific requirements and extract business-relevant stories.
- Developed/maintained the data pipeline to ingest streaming and transactional data across different data sources using Spark, Kafka, Redshift, S3, and Python.
- Developed Spark Streaming applications to consume data from Kafka topics and insert the processed streams into HBase.
- Created Databricks notebooks using SQL and Python and automated them using Databricks jobs.
- Handled large datasets using partitioning, Spark in-memory capabilities, Spark broadcasts, efficient joins, and transformations during the ingestion process.
- Developed Spark jobs with Python to process JSON data, then used Spark SQL to perform joins and store the data in S3.
- Developed Spark jobs to load data from various sources into MongoDB after applying transformations using Spark and Python.
- Developed an effective end-to-end data pipeline using Python, the Spark API, and the Hadoop ecosystem.
- Automated, maintained, and scheduled Spark data pipelines using Apache Airflow.
- Wrote complex SQL queries, joins, and user-defined functions (UDFs) to implement business logic.
- Developed advanced SQL queries and stored procedures to generate data reports, ensuring appropriate data filters were applied based on business requirements.
- Performed SQL query performance tuning.
- Imported and exported data between HDFS and RDBMS using Sqoop.
- Worked with the Hadoop ecosystem on AWS using Elastic MapReduce (EMR).
- Stored big data in S3 and DynamoDB in a scalable and secure manner.
- Created Sqoop scripts to import and export user profile data between the RDBMS and the S3 data lake.
- Processed big data with AWS Lambda and Glue ETL.
- Knowledge of job workflow scheduling and monitoring tools such as Airflow and ZooKeeper.
- Used Hive, Spark, and shell scripting to perform various ETL operations.
- Scheduled clusters with CloudWatch and monitored operational alerts for various workflows.
- Participated in project and task estimation to support smooth sprint execution under Agile methodology.
- Responsible for project documentation and maintenance after delivery.
Environment: Hive, Sqoop, Spark, Spark SQL, Kafka, AWS EMR, AWS S3, AWS DynamoDB, Python, PySpark, ZooKeeper, AWS Lambda, AWS Glue, Data Lake, Redshift
Confidential, Atlanta, GA
Data Engineer
Responsibilities:
- Implemented a data pipeline to process semi-structured data by integrating millions of records from different data sources using Python and the Spark API, and stored the processed data in HDFS.
- Created, scheduled, and monitored Spark jobs in Databricks and received job alerts.
- Developed batch scripts to fetch data from AWS S3 and apply the required transformations in Python using Apache Spark and AWS EMR.
- Developed ETL jobs using AWS Glue, AWS Glue crawlers, Spark, and EMR.
- Used Glue crawlers to populate the AWS Glue Data Catalog with table definitions.
- Used Spark, Python, and EMR to process streaming data from different sources in real time.
- Migrated data from various relational data platforms to Hadoop and built a data warehouse on Hadoop ecosystem tools such as Hive, Oozie, and Sqoop.
- Experienced in loading and transforming large sets of structured data using Spark.
- Wrote Spark applications in Python to perform data cleansing, validation, transformation, and summarization according to requirements.
- Built a data ingestion layer using Spark and Sqoop on a distributed cluster.
- Developed an ETL pipeline with Sqoop and Hive to frequently bring in data from the source and make it available for consumption.
- Configured periodic incremental imports of data from Oracle into HDFS using Sqoop.
- Extensive experience working with structured data using HiveQL, join operations, and custom UDFs, and in optimizing Hive queries.
- Gathered requirements from the client and estimated timelines for developing complex Hive queries for logistics applications.
- Extensively used HiveQL to query data in Hive tables and loaded data into HBase tables.
- Created Hive tables, loaded data, and implemented Hive queries to analyze user request patterns, applying performance optimizations such as partitioning and bucketing in Hive.
- Developed HiveQL queries for reporting purposes.
- Involved in designing HDFS storage with an efficient number of block replicas.
- Set up Oozie workflows for Hive and Sqoop actions.
Environment: Apache Hadoop, Spark, Hive Warehouse, HDFS, ZooKeeper, UNIX, MySQL, Oracle, Oozie, Sqoop, AWS, Databricks
Confidential
Software Engineer
Responsibilities:
- Migrated the existing data from SQL Server to Hadoop and performed ETL operations on it.
- Involved in developing ETL data pipelines using Spark API for batch processing.
- Implemented Sqoop jobs to import data from RDBMS to HDFS in formats such as Avro, ORC, SequenceFile, and text.
- Created Sqoop scripts to export data from HDFS to RDBMS for reporting teams and data visualizations.
- Implemented Sqoop incremental imports from SQL Server for tables without primary keys or date columns, appending directly into the Hive warehouse.
- Implemented Hive partitioning, dynamic partitions, and bucketing.
- Implemented Sqoop jobs to import data from traditional warehouses to Hive tables.
- Converted Hive/SQL queries into Spark transformations using Spark DataFrames, Spark RDDs, and Python.
- Explored the use of Spark to improve the performance of, and optimize, existing algorithms in Hadoop using SparkContext, Spark SQL, and Spark on YARN.
- Used the Oozie workflow engine to manage and automate several types of Hadoop jobs, such as Hive and Sqoop.
- Developed ETL workflows using Informatica for periodic data loads from different source systems to client databases.
- Wrote complex SQL queries, joins, and user-defined functions (UDFs) to implement business logic.
- Created and modified several UNIX shell scripts according to the changing needs of the project and client requirements.
- Responsible for performance tuning and query optimization.
Environment: Spark, Hadoop, HDFS, Hive, HBase, Sqoop, SQL, Shell Scripting, NoSQL, Informatica
