
Data Engineer Resume


VA

SUMMARY

  • Data Engineering professional with 7+ years of experience in software analysis, design, development, testing, and implementation of Cloud, Big Data, BigQuery, Spark, Scala, and Hadoop solutions, and in building and maintaining data pipelines.
  • Expert in developing data models and pipeline architectures and in providing ETL solutions for project models.
  • Vast experience in designing, creating, testing, and maintaining complete data management solutions, from data ingestion and curation through data processing/transformation and data provisioning, with in-depth knowledge of Oracle Database and Spark APIs (Spark SQL, DSL, and Streaming), working with file formats such as Parquet and JSON and tuning Spark application performance from various aspects.
  • Skilled in designing, implementing, and actively tuning ETL architectures for better performance.
  • Proficient in data processing with Hadoop MapReduce & Apache Spark.
  • Extensive experience in understanding security requirements of Hadoop and data governance.
  • Proficient in Oracle Database, SQL, PostgreSQL, Python programming and DBMS concepts.
  • Worked with BI and data visualization tools and services such as Tableau, Amazon QuickSight, Plotly, and Matplotlib.
  • Strong programming skills in Python and Scala to build efficient and robust data pipelines.
  • Experience in using and tuning relational databases (e.g. Microsoft SQL Server, Oracle, MySQL) and columnar databases (e.g. Amazon Redshift, Microsoft SQL Data Warehouse).
  • Expertise in end-to-end Data Processing jobs to analyze data using MapReduce, Spark, and Hive.
  • Strong experience working in Linux/Unix environments and writing shell scripts.
  • Extensive experience with Apache Airflow and Bash/Python scripting for scheduling tasks and process automation. Hands-on experience with ETL and ELT tools such as Kafka, NiFi, and AWS Glue.
  • Worked with Jenkins for CI/CD and New Relic dashboards for pipeline event logging.
  • Designed, implemented, and developed large-scale solutions to solve complex problems involving different types of data from multiple areas.
  • Worked with streaming ingest services such as Kafka, Kinesis, Flume, and JMS; also imported and exported data between RDBMS and the Hadoop platform using Apache Sqoop.
  • Designed, configured, and deployed Amazon Web Services (AWS) for a multitude of applications utilizing the AWS stack (including EC2, Glue, Lambda, SNS, S3, RDS, CloudWatch, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling.
  • Used Jenkins pipelines to drive all microservice builds out to the Docker registry and deploy them to Kubernetes; created and managed pods using Kubernetes.
  • Built and maintained Docker container clusters managed by Kubernetes on GCP (Google Cloud Platform) using Linux, Bash, Git, and Docker. Utilized Kubernetes and Docker as the runtime environment of the CI/CD system to build, test, and deploy.
  • Involved in developing test environments on Docker containers and configuring the Docker containers using Kubernetes.
  • To achieve continuous delivery goals in a highly scalable environment, used Docker coupled with the load-balancing tool Nginx.
  • Virtualized servers using Docker for test- and dev-environment needs, and automated configuration using Docker containers.
  • Experience in creating Docker containers leveraging existing Linux containers and AMIs, in addition to creating Docker containers from scratch.
  • Designed and developed AWS data pipelines to migrate data from sources such as Teradata and Oracle into Amazon S3 (a PySpark sketch of this pattern follows this summary).
  • Implemented conceptual, logical, and physical models and metadata solutions for data modeling.
  • Have experience in building data models and dimensional modeling with star and snowflake schemas for OLAP and ODS applications.
  • Experience in designing and testing highly scalable, mission-critical systems and Spark jobs in both Scala and PySpark, as well as Kafka.
  • Developed a pipeline using Spark and Kafka to load data from a server to Hive, with automated ingestion and quality audits of the data into the RAW layer of the data lake.
  • Strong analytical experience in determining database structural requirements from existing systems and providing sound reusable architectural solutions.
  • Able to work in both Azure and AWS clouds in parallel.
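The Teradata/Oracle-to-S3 migration work mentioned above follows a common JDBC-pull pattern. The snippet below is a minimal PySpark sketch of that pattern, not the actual project code; the connection URL, credentials, table name, and bucket are placeholders.

```python
# Hypothetical sketch of an RDBMS -> S3 pull with PySpark's generic JDBC
# reader; hostnames, credentials, and table names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rdbms_to_s3_migration")
    .getOrCreate()
)

# Read a source table over JDBC (an Oracle URL is shown; Teradata works the
# same way with its own driver and URL format).
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB")  # placeholder
    .option("dbtable", "SALES.ORDERS")                              # placeholder
    .option("user", "etl_user")                                     # placeholder
    .option("password", "********")
    .option("fetchsize", 10000)
    .load()
)

# Land the data in S3 as partitioned Parquet for downstream consumption.
(
    source_df
    .write.mode("overwrite")
    .partitionBy("order_date")                                      # placeholder column
    .parquet("s3a://my-data-lake/raw/orders/")                      # placeholder bucket
)
```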

TECHNICAL SKILLS

Programming: Python, PySpark, Shell Scripting

Big Data: Apache Spark, Hadoop, HDFS, MapReduce, Hive, Oozie, HBase, Impala, Hue

Big Data Platforms: Cloudera, Hortonworks, Palantir

Database Technologies: Snowflake (cloud), Teradata, IBM DB2, Oracle, SQL Server, MySQL, NoSQL, HBase, Cassandra, MongoDB

Data Warehousing: Amazon Redshift, Talend

Cloud Services: Amazon Web Services (AWS), Azure

Data visualization and reporting tools: Tableau, Amazon QuickSight

Scheduling tools: Apache Airflow, Linux Cron, Windows scheduler

Tools: Terraform, ETL, GitHub, JIRA, Rally, Confluence, Jenkins, Jupyter Lab, IntelliJ, Databricks

Operating Systems: Windows, Linux, macOS

PROFESSIONAL EXPERIENCE

Data Engineer

Confidential, VA

Responsibilities:

  • Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
  • Implemented a proof of concept deploying this product in an AWS S3 bucket and Snowflake.
  • Utilized AWS services with a focus on big data architecture/analytics/enterprise data warehouse and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful and valuable information for better decision-making.
  • Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation and queries, writing results back into the S3 bucket.
  • Experience in data cleansing and data mining.
  • Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
  • Gained working knowledge of the Snowflake data warehouse by extracting data from the data lake and sending it to other stages of integration. Also used Snowflake to maintain and develop complex SQL queries, views, functions, and reports; ETL pipelines were used with both SQL and NoSQL sources.
  • Used Spark Streaming to divide streaming data into batches as an input to the Spark engine for batch processing.
  • Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the Spark engine and Spark SQL for data analysis, providing results to data scientists for further analysis.
  • Prepared scripts to automate the ingestion process using Python and Scala as needed from various sources such as APIs, AWS S3, Teradata, and Snowflake.
  • Designed and developed Spark workflows using Scala to pull data from the AWS S3 bucket and Snowflake and apply transformations to it.
  • Implemented Spark RDD transformations to map business analysis logic and applied actions on top of the transformations.
  • Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
  • Created scripts in Python to read CSV, JSON, and Parquet files from S3 buckets and load them into AWS S3, DynamoDB, and Snowflake.
  • Involved in designing and deploying multi-tier applications using AWS services such as EC2, S3, RDS, DynamoDB, SNS, and SQS, focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
  • Supported continuous storage in AWS using Elastic Block Store, S3, and Glacier; created volumes and configured snapshots for EC2 instances.
  • Implemented a generalized solution model using AWS SageMaker and Amazon Polly.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Worked on the Hortonworks-based Hadoop platform deployed on a 120-node cluster to build the data lake, utilizing Spark, Hive, and NoSQL for data processing.
  • Worked on Apache Spark 2.0, utilizing the Spark SQL and Streaming components to support intraday and real-time data processing.
  • Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 bucket, or to HTTP requests via Amazon API Gateway (a hedged handler sketch follows this list).
  • Migrated data from the AWS S3 bucket to Snowflake by writing custom read/write Snowflake utility functions in Scala.
  • Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the Confidential data lake AWS S3 bucket.
  • Profiled structured, unstructured, and semi-structured data across various sources to identify patterns in the data and implemented data quality metrics using the necessary queries or Python scripts based on the source.
  • Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run in Airflow.
  • Created DAGs using the EmailOperator, BashOperator, and Spark Livy operator to execute tasks on an EC2 instance (an illustrative DAG sketch also follows this list).
  • Deployed the code to EMR via CI/CD using Jenkins.
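The Lambda bullet above refers to event-driven scripts; the following is a hedged sketch of what an S3-triggered handler of that kind commonly looks like, not the project's actual function. The bucket/key handling follows the standard S3 event schema, and the logged fields are illustrative.

```python
# Hedged sketch of an S3-triggered AWS Lambda handler; the logic here simply
# logs metadata for each newly created object.
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered by S3 ObjectCreated events; logs basic metadata per object."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        head = s3.head_object(Bucket=bucket, Key=key)
        print(json.dumps({
            "bucket": bucket,
            "key": key,
            "size_bytes": head["ContentLength"],
        }))
    return {"statusCode": 200}
```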
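Likewise, the Airflow DAG described above (EmailOperator, BashOperator, Livy) can be sketched as follows. This is a minimal illustrative DAG under assumed names: the Livy step is approximated here with a BashOperator spark-submit, and the DAG id, schedule, script path, and email address are placeholders.

```python
# Illustrative Airflow 2.x DAG along the lines described above.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.email import EmailOperator

with DAG(
    dag_id="s3_to_snowflake_daily",          # placeholder
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    run_spark_job = BashOperator(
        task_id="run_spark_ingest",
        # Submits the ingestion job on the EC2/EMR edge node (placeholder path).
        bash_command="spark-submit /opt/jobs/s3_to_snowflake.py",
    )

    notify = EmailOperator(
        task_id="notify_on_success",
        to="data-team@example.com",          # placeholder address
        subject="s3_to_snowflake_daily succeeded",
        html_content="Daily ingestion completed.",
    )

    # Send the notification only after the Spark job finishes.
    run_spark_job >> notify
```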

Environment: Agile Scrum, MapReduce, Snowflake, Pig, Spark, Scala, Hive, Kafka, Python, Airflow, JSON, Parquet, CSV, Bitbucket, AWS

Data Engineer

Confidential, Austin TX

Responsibilities:

  • Developed new Spark SQL ETL logic in the big data environment for the migration and availability of the facts and dimensions used for analytics.
  • Developed a PySpark SQL application for big data migration from Teradata to Hadoop, reducing memory utilization in Teradata analytics.
  • Gathered requirements and led the team in developing the big data environment and migrating Spark ETL logic.
  • Responsible for end-to-end design and development in PySpark SQL to meet the requirements.
  • Advised the business on best practices in PySpark SQL while making sure the solution met business needs.
  • Involved in preparing, distributing, and collaborating on client-specific quality documentation for big data and Spark developments, along with regular monitoring to ensure that modifications or enhancements were reflected in Confidential schedulers.
  • Migrated data from Teradata to Hadoop and prepared data using Hive tables.
  • Analyzed large data sets to determine the optimal way to aggregate and report on them.
  • Accessed Hive tables using the Spark Hive context (Spark SQL) and used Scala for interactive operations (a PySpark sketch follows this list).
  • Developed Spark SQL logic that mimics the Teradata ETL logic and pointed the output delta back to newly created Hive tables as well as the existing Teradata dimension, fact, and aggregate tables.
  • Made sure the data matched between Teradata and the PySpark SQL logic.
  • Created views on top of the Hive tables and provided them to customers for analytics.
  • Analyzed the Hadoop cluster and different big data analytics tools including Pig, HBase, and Sqoop.
  • Worked with Linux systems and RDBMS databases on a regular basis in order to ingest data using Sqoop.
  • Collected and aggregated large amounts of web log data from different sources such as web servers and mobile and network devices using Apache Flume, and stored the data in HDFS for analysis.
  • Strong knowledge of creating and monitoring clusters on the Hortonworks Data Platform.
  • Developed Unix shell scripts to load large numbers of files into HDFS from the Linux file system.
  • Developed Custom Input Formats in MapReduce jobs to handle custom file formats and to convert them into key-value pairs.
  • Involved in creating Hive tables, loading data, writing Hive queries, and creating partitions and buckets for optimization.
  • Designed and implemented Hive queries and functions for evaluating, filtering, loading, and storing data.
  • Extensive experience in writing UNIX shell scripts and automation of the ETL processes using UNIX shell scripting.
  • Designed, developed, tested, implemented, and supported data warehousing ETL and Hadoop technologies.
  • Involved in performance tuning of the ETL process by addressing various performance issues at the extraction and transformation stages.
  • Worked with BI teams in generating reports and designing ETL workflows with Tableau.
  • Prepared the Technical Specification document for the ETL job development.
  • Involved in loading data from the UNIX file system and FTP to HDFS.
  • Used Hive to perform transformations, event joins, and some pre-aggregations before storing the data in HDFS.
  • Developed UDFs in Java to enhance the functionality of Pig and Hive scripts.
  • Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
  • Implemented daily cron jobs that automated parallel tasks of loading data into HDFS and pre-processing it with Pig, using Oozie coordinator jobs.
  • Worked on the Ad hoc queries, Indexing, Replication, Load balancing, Aggregation in MongoDB.
  • Experience in managing MongoDB environment from availability, performance and scalability perspectives.
  • Developed Spark code using Scala and Spark-SQL for faster testing and data processing.
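The Hive-backed Spark SQL work described above (accessing Hive tables through the Spark Hive context and rewriting Teradata ETL logic) follows the pattern sketched below in PySpark. Database, table, and column names are illustrative only, not taken from the project.

```python
# Minimal PySpark sketch of Spark SQL over Hive tables; all object names are
# placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("teradata_logic_on_hive")
    .enableHiveSupport()          # lets spark.sql() resolve Hive metastore tables
    .getOrCreate()
)

# Re-express a Teradata-style aggregation as Spark SQL over a Hive fact table.
daily_sales = spark.sql("""
    SELECT store_id,
           order_date,
           SUM(order_amount) AS total_sales
    FROM   analytics_db.fact_orders        -- placeholder Hive table
    GROUP  BY store_id, order_date
""")

# Write the output delta back to a Hive table for downstream views.
(
    daily_sales
    .write.mode("overwrite")
    .saveAsTable("analytics_db.agg_daily_sales")   # placeholder target table
)
```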

Environment: HDFS, MapReduce, Cloudera, HBase, Hive, Pig, Elasticsearch, Kibana, Sqoop, Spark, MongoDB, Scala, Flume, Azure Container, Oozie, Zookeeper, AWS, Maven, Linux, Bitbucket, UNIX Shell Scripting, Spark-SQL, Ad hoc queries, Teradata, Tableau.

Data Engineer

Confidential, VA

Responsibilities:

  • Performed data profiling to learn about behavior with various features such as traffic pattern, location, date, and time.
  • Designed and developed Flink pipelines to consume streaming data from Kafka and applied business logic to massage, transform, and serialize the raw data.
  • Developed common Flink module for serializing and deserializing AVRO data by applying schema.
  • Maintained and developed complex SQL queries, views, functions and reports that qualify customer requirements on Snowflake.
  • Built analytical warehouses in Snowflake and queried data in staged files by referencing metadata columns in a staged file.
  • Designed and developed Spark workflows using Scala to pull data from the Azure Blob Storage container and Snowflake and apply transformations to it.
  • Migrated data from the Azure Blob Storage container to Snowflake by writing custom read/write Snowflake utility functions in Scala.
  • Installed and configured Apache Airflow for Azure Blob Storage and the Snowflake data warehouse and created DAGs to run in Airflow.
  • Implemented user provisioning, password resets, and the creation and mapping of groups to users using the Azure identity management feature; installed and configured it for user provisioning and day-to-day identity administration.
  • Developed key modules and custom requirements in the project and performed user access administration using Azure Active Directory.
  • Managed user access and login security for Azure IAM applications.
  • Coordinated with clients and the on-site team to gather enhancement requirements, provide status updates, and handle issues.
  • Implemented a layered architecture for Hadoop to modularize the design. Developed framework scripts to enable quick development. Designed reusable shell scripts for Hive, Sqoop, Flink, and Pig jobs. Standardized error handling, logging, and metadata management processes.
  • Evaluated and worked on Azure Data Factory as an ETL tool to process business-critical data into aggregated Hive tables in the cloud. Deployed and developed big data applications such as Spark, Hive, Kafka, and Flink in the Azure cloud.
  • Developed a queryable state for Flink in Scala to query streaming data and enriched the functionality of the framework.
  • Implemented Spark with Scala, utilizing DataFrames and the Spark SQL API for faster processing of data.
  • Involved in the ingestion, transformation, manipulation, and computation of data using StreamSets and Spark with Scala.
  • Involved in data ingestion into MySQL using Flink pipelines for full and incremental loads from a variety of sources such as web servers, RDBMS, and data APIs.
  • Worked on Spark data sources, DataFrames, Spark SQL, and Streaming using Scala (a PySpark streaming sketch follows this list).
  • Worked extensively on Azure components such as Databricks, Virtual Machines, and Blob Storage.
  • Experience in developing Spark applications using Scala and SBT.
  • Experience in integrating the Spark-MySQL connector and JDBC connector to save data processed in Spark to MySQL.
  • Used Flink Streaming with the pipelined Flink engine to process data streams and deploy new APIs, including the definition of flexible windows.
  • Built a data pipeline with Kafka producers (Node.js) streaming data into large-scale Kafka clusters, with events consumed by large-scale Spark/Flink consumers.
  • Expertise in using different file formats such as text, CSV, Parquet, and JSON.
  • Experience in writing custom compute functions using Spark SQL and performing interactive querying.
  • Responsible for masking and encrypting sensitive data on the fly.
  • Responsible for creating and maintaining DAGs using Apache Airflow.
  • Responsible for setting up a MySQL cluster on an Azure Virtual Machine instance.
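The Kafka-to-consumer streaming flow described above was built with Scala, Flink, and Spark; the sketch below shows the analogous pattern in PySpark Structured Streaming. The broker address, topic, output path, and checkpoint location are placeholders, not project values.

```python
# Hedged PySpark analogue of a Kafka streaming consumer landing data as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka_stream_consumer").getOrCreate()

# Subscribe to a Kafka topic as a streaming DataFrame.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
    .option("subscribe", "clickstream-events")           # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers bytes; cast key and payload to strings before any parsing.
events = raw.select(col("key").cast("string"), col("value").cast("string"))

# Persist the stream as Parquet with checkpointing for fault-tolerant sinks.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "abfss://landing@storageacct.dfs.core.windows.net/events/")  # placeholder
    .option("checkpointLocation", "/tmp/checkpoints/clickstream/")               # placeholder
    .start()
)
query.awaitTermination()
```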

Environment: Spark 2.2, Scala, Linux, MySQL 5.8, Kafka 1.0, Striim, StreamSets, Spark SQL, Spark Structured Streaming, Azure Data Factory, Azure Blob Storage, Azure Virtual Machines, Databricks, Apache Flink.

Confidential

Hadoop Engineer

Responsibilities:

  • Installed and configured Apache Hadoop to test the maintenance of log files in the Hadoop cluster.
  • Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
  • Installed Oozie workflow engine to run multiple Hive and Pig Jobs.
  • Set up and benchmarked Hadoop/HBase clusters for internal use.
  • Developed Java MapReduce programs for the analysis of sample log files stored in the cluster.
  • Developed simple to complex MapReduce jobs using Hive and Pig.
  • Developed MapReduce programs for data analysis and data cleaning (a Hadoop Streaming sketch in Python follows this list).
  • Developed Pig Latin scripts for the analysis of semi-structured data.
  • Developed and was involved in creating industry-specific UDFs (user-defined functions).
  • Used Hive and created Hive tables and involved in data loading and writing Hive UDFs.
  • Used Sqoop to import data into HDFS and Hive from other data systems.
  • Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
  • Migrated ETL processes from RDBMS to Hive to test easier data manipulation.
  • Developed Hive queries to process data for visualization.
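The MapReduce data-cleaning work above can be illustrated with a Hadoop Streaming job in Python. The mapper/reducer pair below counts requests per HTTP status code from web-server logs; the log layout and field position are assumptions for illustration, not details from the resume.

```python
# mapper.py -- hedged sketch of a Hadoop Streaming mapper for simple log
# analysis: emits "<status_code>\t1" per well-formed line.
import sys

for line in sys.stdin:
    fields = line.strip().split()
    if len(fields) < 9:
        continue                      # drop malformed lines (data cleaning)
    status_code = fields[8]           # assumed position of the HTTP status code
    print(f"{status_code}\t1")
```

```python
# reducer.py -- sums the counts emitted by the mapper; Hadoop Streaming
# delivers keys in sorted order, so a running total per key is enough.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{count}")
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print(f"{current_key}\t{count}")
```

A job like this would typically be launched with the hadoop-streaming JAR, shipping the two scripts via -files and passing them as -mapper and -reducer along with -input and -output paths.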

Environment: Apache Hadoop, HDFS, Cloudera Manager, CentOS, Java, MapReduce, Eclipse, Hive, Pig, Sqoop, Oozie, and SQL.
