
Data Engineer Resume


IL

SUMMARY

  • Results-driven IT professional with more than 7 years of overall IT experience across multiple client domains and technology spectrums. Experienced in building highly scalable, large-scale applications with different technologies using Cloud, Big Data, DevOps, and Spring Boot. Comfortable in different working environments such as Agile and Waterfall, with experience on multi-cloud, migration, and scalable application projects.
  • Built Spark data pipelines with various optimization techniques using Python and Scala.
  • Experience working with various Hadoop distributions such as Cloudera, Hortonworks, and MapR.
  • Experienced in data transfer between HDFS and RDBMS using tools like Sqoop, Talend and Spark.
  • Extensive experience deploying cloud-based applications using Amazon Web Services such as Amazon EC2, S3, RDS, IAM, Auto Scaling, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda, EMR, Redshift, and DynamoDB.
  • Expert in ingesting data for incremental loads from various RDBMS sources using Apache Sqoop.
  • Developed scalable applications for real-time ingestion into various databases using Apache Kafka.
  • Worked on ETL migration services by developing and deploying AWS Lambda functions to generate a serverless data pipeline that writes to the Glue Data Catalog and can be queried from Athena.
  • Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Built real-time data pipelines by developing Kafka producers and Spark Streaming applications for consumption.
  • Designed and developed logical and physical data models that utilize concepts such as Star Schema, Snowflake Schema and Slowly Changing Dimensions.
  • Developed Pig Latin scripts and MapReduce jobs for large data transformations and loads.
  • Experience in designing, developing, and maintaining data lake projects using different big data tool stacks.
  • Experience in building Scala applications for loading data into NoSQL databases (MongoDB).
  • Implemented various optimization techniques in Hive and Spark scripts for data transformations.
  • Expert in writing automation scripts using shell scripting and Python.
  • Migrated data from different data sources like Oracle, MySQL, Teradata to Hive, HBase and HDFS.
  • Experienced in building Jupyter notebooks using PySpark for extensive data analysis.
  • Experience in working with various cloud distributions like AWS, Azure and GCP.
  • Experience in ingesting and exporting data from Apache Kafka using Apache Spark Streaming.
  • Implemented streaming applications to consume data from Event Hub and Pub/Sub.
  • Experience in using different optimized file formats like Avro, Parquet, and SequenceFile.
  • Excellent understanding of handling database issues and connections with SQL and NoSQL databases like MongoDB, Cassandra, HBase, and SQL Server.
  • Experience in using Azure cloud tools like Azure Data Factory, Azure Data Lake, and Azure Synapse.
  • Created partitioned and bucketed Hive tables in Parquet format with Snappy compression, then loaded data into the Parquet Hive tables from Avro Hive tables (see the sketch after this list).
  • Developed scalable applications using AWS tools like Redshift, DynamoDB.
  • Worked on building pipelines using Snowflake for extensive data aggregations.
  • Experience on GCP tools like BigQuery, Pub/Sub, Cloud SQL and Cloud functions.
  • Built custom dashboards using Power BI for reporting purposes.
  • Experience in building continuous integration and deployments using Jenkins, Travis CI.
  • Expert in building containerized apps using tools like Docker, Kubernetes and Terraform.
  • Experience in building metrics dashboards and alerts using Grafana and Kibana.
  • Worked on containerization technologies like Docker and Kubernetes for scaling applications.
  • Experience in various integration tools like Talend and NiFi for ingesting batch and streaming data.
  • Experience in migrating data warehouse applications into Snowflake.
  • Experience working in Agile environments and Waterfall models.
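
A minimal PySpark sketch of the partitioned, bucketed, Snappy-compressed Parquet load described in the bullet above; the table names, columns, and bucket count are hypothetical placeholders rather than details from an actual project.

    from pyspark.sql import SparkSession

    # Hive-enabled session; assumes a Hive metastore is configured for the cluster
    spark = (SparkSession.builder
             .appName("avro-to-parquet")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical Avro-backed Hive table holding the raw data
    src = spark.table("staging.claims_avro")

    # Write as Snappy-compressed Parquet, partitioned and bucketed for faster scans and joins
    (src.write
        .mode("overwrite")
        .format("parquet")
        .option("compression", "snappy")
        .partitionBy("load_date")
        .bucketBy(32, "claim_id")
        .sortBy("claim_id")
        .saveAsTable("curated.claims_parquet"))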

TECHNICAL SKILLS

Bigdata Ecosystem: HDFS, MapReduce, YARN, Hive, HBase, Pig, Impala, Sqoop, Oozie, Tez, ZooKeeper, Spark.

Cloud Environment: AWS, Azure and GCP

NoSQL: HBase, Cassandra, MongoDB

Databases: Oracle 11g/10g, Teradata, DB2, MS-SQL Server, MySQL, MS-Access, PostgreSQL.

Programming Languages: Scala, Python, SQL, PL/SQL, Linux shell scripts, PySpark

BI Tools: Tableau, Power BI, Apache Superset

Alerting & Logging: Grafana, Kibana, Spark, Splunk, CloudWatch

Automation: Airflow, NiFi, Oozie

Software Tools: Kubernetes, Docker, Jenkins, SAS

PROFESSIONAL EXPERIENCE

Confidential, IL

Data Engineer

Responsibilities:

  • Used AWS Athena extensively to ingest structured data from S3 into other systems such as Redshift or to produce reports (see the first sketch after this list).
  • Used the Spark Streaming APIs to perform on-the-fly transformations and actions for building the common learner data model, which receives data from Kinesis in near real time.
  • Performed end-to-end architecture and implementation assessment of various AWS services such as Amazon EMR, Redshift, S3, Athena, Glue, and Kinesis.
  • Used Hive as the primary query engine on EMR and built external table schemas for the data being processed.
  • Developed Python code to gather the data from HBase and designed the solution for implementation using PySpark.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
  • Created an AWS RDS (Relational Database Service) instance to serve as the Hive metastore, making it possible to integrate the metadata from 20 EMR clusters into a single RDS instance and avoid data loss even if an EMR cluster was terminated.
  • Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
  • Worked on data preprocessing and cleaning to perform feature engineering, and applied data imputation techniques for the missing values in the dataset using Python.
  • Involved in the development of a shell script that collects and stores user-generated logs in AWS S3 (Simple Storage Service) buckets; these logs record all user actions and serve as a security indicator to detect cluster termination and safeguard data integrity.
  • Created Spark tasks by building RDDs in Python and data frames in Spark SQL to analyze data and store it in S3 buckets.
  • Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines (see the second sketch after this list).
  • Implemented partitioning and bucketing in the Apache Hive database, which improves query retrieval performance.
  • Designed and deployed ETL pipelines on S3 Parquet files in a data lake using AWS Glue.
  • Created a CloudFormation template in JSON format to leverage content delivery with cross-region replication through Amazon Virtual Private Cloud.
  • Used an AWS CodeCommit repository to store program logic and scripts and then replicate them to new clusters.
  • Used multi-node Redshift technology to implement columnar data storage, advanced compression, and massively parallel processing.
  • Worked with the Snowflake cloud data warehouse and AWS S3 buckets to integrate data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
  • Worked on the code transfer of a quality monitoring program from AWS EC2 to AWS Lambda, as well as the creation of logical datasets to administer quality monitoring on Snowflake warehouses.
  • Used Apache Kafka as an open-source message broker for reliable, asynchronous exchange and worked with the Kafka cluster using ZooKeeper.
  • Developed XSLT files for transforming the XML response from the web service into HTML as per the business requirements, and used different XML technologies such as XPath.
  • Developed the UI screens using HTML5, CSS3, Ajax, jQuery, Angular 8.0/4.0 and was involved in resolving cross browser JavaScript issues.
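
A minimal boto3 sketch of querying S3-backed data through Athena, as in the first bullet above; the database, query, region, and result bucket are hypothetical placeholders.

    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Kick off a query against a hypothetical Glue-catalogued table
    resp = athena.start_query_execution(
        QueryString="SELECT event_date, COUNT(*) AS events FROM clickstream GROUP BY event_date",
        QueryExecutionContext={"Database": "datalake"},
        ResultConfiguration={"OutputLocation": "s3://athena-results-bucket/queries/"},
    )
    query_id = resp["QueryExecutionId"]

    # Poll until the query reaches a terminal state
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    # Fetch the first page of results on success
    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        print(rows[:5])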
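
And a minimal Apache Airflow (2.x) sketch for the pipeline authoring and scheduling mentioned above; the DAG name, schedule, and task bodies are hypothetical placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        # Placeholder: pull the day's files from S3
        pass

    def load():
        # Placeholder: load the staged data into Redshift
        pass

    with DAG(
        dag_id="daily_s3_to_redshift",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task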

Environment: Amazon Web Services, Elastic MapReduce cluster, EC2, CloudFormation, Amazon S3, Amazon Redshift, DynamoDB, CloudWatch, Hive, Java, Python, HBase, Apache Spark, Spark SQL, Shell Scripting, Tableau, Cloudera.

Confidential

Data Engineer

Responsibilities:

  • Developed scalable real-time applications for ingesting clickstream data using Kafka Streams and Spark Streaming (see the sketch after this list).
  • Worked on Talend integrations to ingest data from multiple sources into Data Lake.
  • Experience in migrating existing legacy applications into optimized data pipelines using Spark with Scala and Python, supporting testability and observability.
  • Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Implemented cloud integrations with GCP and Azure, setting up bi-directional flows for data migrations.
  • Pushed application logs and data stream logs to a Kibana server for monitoring and alerting purposes.
  • Developed optimized and tuned ETL operations in Hive and Spark scripts using techniques such as partitioning, bucketing, vectorization, serialization, and configuring memory and the number of executors.
  • Used an Azure copy activity to load data from an on-premises SQL Server into an Azure SQL Data Warehouse.
  • Worked on redesigning the existing architecture and implementing it on Azure SQL.
  • Experience with Azure SQL database configuration and tuning automation, vulnerability assessment, auditing, and threat detection.
  • Integrated data storage solutions with Spark, especially Azure Data Lake Storage, Blob Storage, and Snowflake storage.
  • Involved in migrating objects from Teradata to Snowflake and created Snowpipe pipelines for continuous data loading.
  • Experience designing solutions in Azure tools such as Azure Data Factory, Azure Data Lake, Azure SQL, Azure SQL Data Warehouse (SQL DWH), and Azure Functions.
  • Worked on migrating data from HDFS to Azure HDInsight and Azure Databricks.
  • Migrated existing processes and data from our on-premises SQL Server and other environments to Azure Data Lake.
  • Implemented multiple modules in microservices to expose data through RESTful APIs.
  • Developed Jenkins pipelines for continuous integration and deployment purposes.
  • Experience analyzing the performance of Snowflake datasets.
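
A minimal PySpark Structured Streaming sketch for the Kafka clickstream ingestion described in the first bullet above; the broker, topic, event schema, and ADLS Gen2 paths are hypothetical placeholders (outside Databricks, the Kafka source also requires the spark-sql-kafka connector).

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

    # Hypothetical clickstream event schema
    schema = StructType([
        StructField("user_id", StringType()),
        StructField("page", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    # Read the Kafka topic and parse the JSON payload into columns
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "clickstream")
              .option("startingOffsets", "latest")
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    # Land the parsed events in a hypothetical ADLS Gen2 container as Parquet
    (events.writeStream
           .format("parquet")
           .option("path", "abfss://raw@mydatalake.dfs.core.windows.net/clickstream/")
           .option("checkpointLocation", "abfss://raw@mydatalake.dfs.core.windows.net/_chk/clickstream/")
           .outputMode("append")
           .start()
           .awaitTermination())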

Environment: PySpark, Kafka, Spark, Sqoop, Hive, Azure, Databricks, Grafana, Jenkins, Azure Data Lake, Azure SQL, Python, Shell, Microservices, RESTful APIs

Confidential

Data Engineer

Responsibilities:

  • Contributed to the analysis of functional requirements by collaborating with business users/product owners/developers.
  • Worked on analyzing Hadoop clusters using various big data analytic tools such as Pig, HBase database, and Sqoop.
  • Transferred data from HDFS to Relational Database Systems using Sqoop for Business Intelligence, visualization, and user report generation.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
  • Created Hive Tables, used Sqoop to load claims data from Oracle, and then put the processed data into the target database. Imported metadata into Hive and migrated existing tables and applications to work on Hive and AWS cloud.
  • Used Spark Streaming APIs to perform transformations and actions on the fly for building the common learner data model, which gets the data from Kafka in near real time and persists it to HBase.
  • Deployed services on AWS and utilized Lambda function to trigger the data pipelines.
  • Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.
  • Responsible for building a data lake using various AWS cloud services such as S3, EMR, and Redshift.
  • Worked closely with Business Analysts to gather requirements and design reliable and scalable data pipelines using AWS EMR.
  • Contributed to the conversion of HiveQL into Spark transforms utilizing Spark RDD and Scala programming.
  • Integrated Kafka and Spark Streaming for high-efficiency throughput and reliability.
  • Managed jobs using the Fair Scheduler and developed scripts with Oozie workflows.
  • Worked on Spark and Hive to perform the transformations required to link daily ingested data to historical data (see the sketch after this list).
  • Analyzed Snowflake Event, Change, and Job data and built a dependency tree-based model based on the occurrence of incidents for each application service present internally.
  • Performed Sqoop ingestion from MS SQL Server and SAP HANA views using Oozie workflows.
  • Created Cassandra tables to store various data formats of data coming from different sources.
  • Designed and implemented data integration applications in a Hadoop environment for data access and analysis using the NoSQL data store Cassandra.
  • Used Cloudera Manager for installation and management of Hadoop Cluster.
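
A minimal PySpark sketch of the daily-to-historical linking described above; the table names and key columns are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, row_number
    from pyspark.sql.window import Window

    spark = (SparkSession.builder
             .appName("daily-to-history")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical Hive tables: today's ingest and the accumulated history
    daily = spark.table("staging.claims_daily")
    history = spark.table("warehouse.claims_history")

    # Keep only the most recent record per claim_id across both sets
    latest = Window.partitionBy("claim_id").orderBy(col("updated_at").desc())
    merged = (daily.unionByName(history)
                   .withColumn("rn", row_number().over(latest))
                   .filter(col("rn") == 1)
                   .drop("rn"))

    # Write to a new table rather than overwriting the table being read
    merged.write.mode("overwrite").saveAsTable("warehouse.claims_history_new")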

Environment: Hadoop 3.0, Hive 2.1, J2EE, JDBC, Pig 0.16, HBase 1.1, Sqoop, NoSQL, Impala, Java, Spring, MVC, XML, Spark, Scala, PL/SQL, HDFS, JSON, Hibernate.

Confidential

Jr. Software Engineer

Responsibilities:

  • Developed MapReduce jobs in Java for data cleaning and preprocessing.
  • Imported and exported data into HDFS and Hive using Sqoop.
  • Experienced in building streaming jobs to process terabytes of XML-format data using Flume.
  • Worked on batch data ingestion using Sqoop from various sources like Teradata, Oracle.
  • Worked on various Pig Latin scripts for data transformations and cleansing.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries.
  • Ran Spark SQL operations on JSON, converting the data into a tabular format with DataFrames, then saving and publishing the data to Hive and HDFS (see the sketch after this list).
  • Developed and refined shell scripts for data input and validation with various parameters, and developed custom shell scripts to execute Spark jobs.
  • Worked with JSON files: parsed them, saved the data in external tables, and refined the data for future use.
  • Took part in design, code, and test inspections to discover problems throughout the life cycle, and explained technical considerations and upgrades to clients in appropriate meetings.
  • Created data-processing pipelines by building Spark jobs in Scala for data transformation and analysis.
  • Worked on data cleaning using Pig scripts and stored the results in HDFS.
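
A minimal PySpark sketch of the JSON-to-tabular flow described above; the HDFS paths, field names, and table name are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = (SparkSession.builder
             .appName("json-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical HDFS path holding raw JSON events
    raw = spark.read.json("hdfs:///data/raw/orders/")

    # Flatten nested fields into tabular columns (field names are placeholders)
    flat = (raw.select(
                col("order_id"),
                col("customer.id").alias("customer_id"),
                explode(col("items")).alias("item"))
               .select("order_id", "customer_id", "item.sku", "item.qty"))

    # Persist to Hive and also publish as Parquet on HDFS
    flat.write.mode("overwrite").saveAsTable("sales.order_items")
    flat.write.mode("overwrite").parquet("hdfs:///data/curated/order_items/")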

Environment: Cloudera, HDFS, Hive, MapReduce, Pig, Sqoop, Oracle, Java, Python, Oozie, Impala, Tableau
