
Data Engineer Resume


New Jersey

SUMMARY

  • 7+ years of experience in software analysis, design, development, testing, and implementation of Cloud, Big Data, BigQuery, Spark, Scala, and Hadoop solutions.
  • Expertise in Big Data technologies, data pipelines, SQL, cloud-based RDS, distributed databases, serverless architecture, data mining, web scraping, and cloud technologies such as AWS EMR and CloudWatch.
  • Hands-on experience designing and implementing data engineering pipelines and analyzing data using Hadoop ecosystem tools such as HDFS, Spark, Sqoop, Hive, Flume, Kafka, Impala, PySpark, Oozie, and HBase.
  • Experience implementing end-to-end Big Data solutions on the Hadoop framework; designed and executed big data solutions on multiple distributions such as Cloudera (CDH3 & CDH4) and Hortonworks.
  • Strong knowledge of writing PySpark UDFs and generic UDFs to incorporate complex business logic into data pipelines (a minimal sketch appears after this list).
  • Extensive experience designing, building, testing, and maintaining the complete data management lifecycle, from data ingestion through data curation to data provisioning, with in-depth knowledge of Spark APIs (Spark SQL, DSL, Streaming), file formats such as Parquet and JSON, and performance tuning of Spark applications.
  • Gathered and translated business requirements into technical designs and implemented the physical aspects of those designs by creating materialized views, views, and lookups.
  • Experience designing and testing highly scalable, mission-critical systems, including Spark jobs in both Scala and PySpark as well as Kafka.
  • Able to work in parallel across both GCP and AWS clouds.
  • Hands-on experience with ETL and ELT tools such as Airflow, Kafka, NiFi, and AWS Glue.
  • Expertise in end-to-end Data Processing jobs to analyze data using MapReduce, Spark, and Hive.
  • Strong Experience in working with Linux/Unix environments, writing Shell Scripts.
  • Developed a pipeline using Spark and Kafka to load data from a server into Hive, with automated ingestion and data-quality audits into the raw layer of the Data Lake.
  • Developed end-to-end analytical/predictive model applications leveraging business intelligence and insights with both structured and unstructured data in a Big Data environment.
  • Strong experience using Spark Streaming, Spark SQL, and other Spark components such as accumulators, broadcast variables, different levels of caching, and optimizations for Spark jobs.
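
A minimal sketch of the kind of PySpark UDF referenced in the summary above; the column names and the tiering rule are illustrative assumptions, not details from any specific engagement.

    # Minimal PySpark UDF sketch; column names and the tiering rule are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

    @F.udf(returnType=StringType())
    def loyalty_tier(total_spend):
        # Encapsulates business logic that is awkward to express in plain SQL.
        if total_spend is None:
            return "unknown"
        if total_spend >= 10000:
            return "gold"
        if total_spend >= 1000:
            return "silver"
        return "bronze"

    df = spark.createDataFrame([(1, 250.0), (2, 15000.0)], ["customer_id", "total_spend"])
    df.withColumn("tier", loyalty_tier("total_spend")).show()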

TECHNICAL SKILLS

  • Hadoop
  • Spark
  • Hive
  • YARN
  • HDFS
  • Zookeeper
  • HBase
  • Kafka
  • Oracle
  • Teradata
  • DB2
  • Python/R/SQL/Scala 2.11.11
  • AWS EC2
  • EMR
  • Lambda
  • Terraform
  • ISPARQL

PROFESSIONAL EXPERIENCE

Data Engineer

Confidential, New Jersey

Responsibilities:

  • Participated in all phases, including analysis, design, coding, testing, and documentation; gathered requirements and performed business analysis.
  • Developed entity-relationship diagrams and modeled transactional databases and the data warehouse using ER/Studio and PowerDesigner.
  • Designed and developed complex data pipelines using Sqoop, Spark, and Hive to ingest, transform, and analyze customer behavior data.
  • Implemented Spark applications in Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
  • Maintained data pipeline uptime of 99.9% while ingesting streaming and transactional data across 7 primary data sources using Spark, Redshift, S3, and Python.
  • Ingested data from disparate data sources using a combination of SQL, the Google Analytics API, and the Salesforce API with Python, creating data views for use in BI tools like Tableau.
  • Worked with two different datasets, one using HiveQL and the other using Pig Latin.
  • Experience moving raw data between different systems using Apache NiFi.
  • Participated in building a data lake in AWS.
  • Automated the data flow process using NiFi, with hands-on experience tracking the data flow in real time using NiFi.
  • Wrote Terraform scripts for CloudWatch alerts.
  • Developed a data warehouse model in Snowflake for over 100 datasets using WhereScape and created reports in Looker based on Snowflake connections.
  • Wrote MapReduce code in Python to remove certain security issues in the data.
  • Synchronized both unstructured and structured data using Pig and Hive according to the business prospectus.
  • Used Pig Latin on the client-side cluster and HiveQL on the server-side cluster.
  • Imported the complete data from RDBMS into the HDFS cluster using Sqoop.
  • Worked in an AWS environment with technologies such as S3, EC2, EMR, Glue, CFT, and Lambda, and databases including Oracle, SMS, DynamoDB, and MongoDB.
  • Created external tables and moved data into them from managed tables.
  • Performed subqueries in Hive and partitioned and bucketed the imported data using HiveQL (a partitioning sketch appears after this list).
  • Moved this partitioned data into different tables as per business requirements.
  • Invoked external UDF/UDAF/UDTF Python scripts from Hive using the Hadoop Streaming approach, which is supported by Ganglia.
  • Validated the data from SQL Server to Snowflake to make sure it matched correctly.
  • Set up work schedules using Oozie, identified errors in the logs, and rescheduled/resumed jobs.
  • Able to handle the whole dataset using HWI (Hive Web Interface) via the Cloudera Hadoop distribution UI.
  • Enhanced the existing product with new features such as user roles (Lead, Admin, Developer), ELB, Auto Scaling, S3, CloudWatch, CloudTrail, and RDS scheduling.
  • Created monitors, alarms, and notifications for EC2 hosts using CloudWatch, CloudTrail, and SNS.
  • Worked with Informatica 9.5.1 and Informatica 9.6.1 Big Data Edition, and scheduled the jobs.
  • After data transformation is complete, the transformed data is moved to a Spark cluster, where it goes live to the application using Spark Streaming and Kafka (a streaming sketch appears after this list).
  • Created RDDs in Spark.
  • Extracted data from the data warehouse (Teradata) onto Spark RDDs.
  • Worked on stateful transformations in Spark Streaming.
  • Good hands-on experience loading data into Hive from Spark RDDs.
  • Worked on Spark SQL UDFs and Hive UDFs; also worked with Spark accumulators and broadcast variables.
  • Used decision trees as a model for evaluating both classification and regression.
  • Collaborated with the infrastructure, network, database, application, and BI teams to ensure data quality and availability.
  • Developed, created, and tested environments for different applications by provisioning Kubernetes clusters on AWS using Docker, Ansible, and Terraform.
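
As a rough illustration of the Spark Streaming and Kafka flow described earlier in this list, the sketch below lands Kafka messages into a raw-layer path with Spark Structured Streaming; the broker address, topic, and paths are hypothetical placeholders, not values from the project.

    # Sketch only: broker, topic, and paths are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka-raw-ingest").getOrCreate()

    raw_stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
        .option("subscribe", "customer-events")             # hypothetical topic
        .option("startingOffsets", "latest")
        .load()
    )

    # Keep the payload plus ingestion metadata for quality audits in the raw layer.
    events = raw_stream.select(
        F.col("key").cast("string"),
        F.col("value").cast("string").alias("payload"),
        F.col("timestamp").alias("ingest_ts"),
    )

    query = (
        events.writeStream.format("parquet")
        .option("path", "/data/lake/raw/customer_events")            # raw-layer location
        .option("checkpointLocation", "/data/lake/_chk/customer_events")
        .trigger(processingTime="1 minute")
        .start()
    )
    query.awaitTermination()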
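
The Hive partitioning and the movement of data from managed into external tables described above could look roughly like the statements below, issued through spark.sql; the table names, columns, and location are made up for illustration, and bucketing (CLUSTERED BY ... INTO n BUCKETS) is left out for brevity.

    # Illustrative only: table, column names, and location are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-partition-sketch").enableHiveSupport().getOrCreate()

    # A small managed staging table to move data from (stand-in for the real source).
    staging = spark.createDataFrame(
        [(1, 120.50, "debit", "2023-01-01"), (2, 75.00, "credit", "2023-01-01")],
        ["customer_id", "txn_amount", "txn_type", "load_date"],
    )
    staging.write.mode("overwrite").saveAsTable("staging_customer_txn")

    # External table partitioned by load date; data lands under an explicit location.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS customer_txn (
            customer_id BIGINT,
            txn_amount  DOUBLE,
            txn_type    STRING
        )
        PARTITIONED BY (load_date STRING)
        STORED AS PARQUET
        LOCATION '/data/lake/curated/customer_txn'
    """)

    # Move the staged rows into the matching partition of the external table.
    spark.sql("""
        INSERT OVERWRITE TABLE customer_txn PARTITION (load_date = '2023-01-01')
        SELECT customer_id, txn_amount, txn_type
        FROM staging_customer_txn
        WHERE load_date = '2023-01-01'
    """)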

Environment: Hadoop, Sqoop, Hive, HDFS, YARN, PySpark, Zookeeper, HBase, Apache Spark, Scala, AWS EC2, S3, EMR, RDS, VPC, Lambda, Redshift, Glue, Athena, Data Lake, Terraform, Snowflake, Kafka, Oracle, Python, RESTful web services.

Data Engineer

Confidential, Virginia

Responsibilities:

  • Worked closely with business analysts to convert business requirements into technical requirements and prepared low- and high-level documentation.
  • Performed transformations using Hive and MapReduce; hands-on experience copying .log and Snappy files into HDFS from Greenplum using Flume and Kafka, loading data into HDFS, and extracting data into HDFS from MySQL using Sqoop.
  • Imported required tables from RDBMS to HDFS using Sqoop and used Storm/Spark Streaming and Kafka to stream data in real time into HBase.
  • Experience building and architecting multiple data pipelines with end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinated tasks among the team.
  • Developed views and templates with Python and Django's view controller and templating language to create a user-friendly website interface.
  • Worked on Google Cloud Platform (GCP) services, creating queries in BigQuery for different datasets and other cloud functions.
  • Worked on Snowflake and built its logical and physical data models as per the required changes.
  • Wrote Python scripts to maintain raw file archival in a GCS bucket.
  • Hands-on experience transferring data from various servers and clients to GCP using Bigtable.
  • Experience writing MapReduce jobs for text mining, working with the predictive analysis team, and working with Hadoop components such as HBase, Spark, YARN, Kafka, Zookeeper, Pig, Hive, Sqoop, Oozie, Impala, and Flume using Java.
  • Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Storage, and BigQuery.
  • Opened SSH tunnels to Google Dataproc to access the YARN manager and monitor Spark jobs.
  • Wrote Hive UDFs as per requirements and to handle different schemas and XML data.
  • Wrote programs in Python with Apache Beam and executed them on Cloud Dataflow to run data validation between raw source files and BigQuery tables.
  • Implemented ETL code to load data from multiple sources into HDFS using Pig Scripts.
  • Developed data pipelines using Python and Hive to load data into the data lake; performed data analysis and data mapping for several data sources.
  • Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python (a minimal Beam sketch appears after this list).
  • Defined virtual warehouse sizing for Snowflake for different types of workloads.
  • Developed Java MapReduce programs to analyze sample log files stored in the cluster.
  • Designed a new member and provider booking system that allows providers to book new slots, sending the member leg and provider leg directly to TP through Datalink.
  • Wrote a Python program to maintain raw file archival in a GCS bucket.
  • Analyzed various types of raw files such as JSON, CSV, and XML with Python using Pandas, NumPy, etc.
  • Developed Spark applications using Scala for easy Hadoop transitions; hands-on experience writing Spark jobs and using the Spark Streaming API in Scala and Python.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive; developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
  • Designed and developed user-defined functions (UDFs) for Hive, developed Pig UDFs to pre-process data for analysis, and have experience with UDAFs for custom data-specific processing.
  • Created Airflow Scheduling scripts in Python.
  • Automated the existing scripts for performance calculations using scheduling tools like Airflow.
  • Designed and developed the core data pipeline code, involving work in Python, built on Kafka and Storm.
  • Good knowledge of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive for optimized performance.
  • Performance tuning using partitioning and bucketing of Impala tables.
  • Created cloud-based software solutions written in Scala using Spray IO, Akka, and Slick.
  • Hands-on experience fetching live stream data from DB2 into HBase tables using Spark Streaming and Apache Kafka.
  • Experience in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
  • Worked on NoSQL databases including HBase and Cassandra.
  • Populated HDFS and Cassandra with huge amounts of data using Apache Kafka.
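
A minimal Apache Beam sketch of the Pub/Sub-to-BigQuery flow noted earlier in this list, runnable on Cloud Dataflow; the project, subscription, table, and schema are placeholder assumptions.

    # Sketch only: project, subscription, table, and schema are placeholders.
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)  # pass --runner=DataflowRunner etc. on the command line

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/events-sub")  # hypothetical
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table="my-project:analytics.events",                          # hypothetical
                schema="event_id:STRING,event_ts:TIMESTAMP,payload:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )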

Environment: MapReduce, HDFS, Hive, Pig, HBase, Python, SQL, Sqoop, Flume, Oozie, Impala, Scala, Spark, Apache Kafka, Play, GCP, Snowflake, Akka, Zookeeper, J2EE, Red Hat Linux, HP ALM, Eclipse, Cassandra, SSIS.

Data Engineer

Confidential, Atlanta

Responsibilities:

  • Responsible for designing and implementing end-to-end data pipelines using Big Data tools including HDFS, Hive, and Spark.
  • Extracted, parsed, cleaned, and ingested incoming web feed data and server logs into HDFS, handling structured and unstructured data.
  • Loaded CSV/TXT/Avro/Parquet files using PySpark in the Spark framework, processed the data by creating Spark DataFrames and RDDs, and saved the files in Parquet format in HDFS to load into the fact table (a minimal PySpark sketch appears after this list).
  • Worked extensively on tuning SQL queries and database modules for optimum performance.
  • Wrote complex SQL queries, including CTEs, recursive CTEs, subqueries, and joins.
  • Good experience with database, data warehouse, and schema concepts such as the snowflake schema.
  • Worked on clusters with many nodes; communicated with business users and source data owners to gather reporting requirements and to access and discover source data content, quality, and availability.
  • Imported millions of structured records from relational databases using Sqoop import, processed them with Spark, and stored the data in HDFS in CSV format.
  • Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
  • Integrated data stored in S3 with Databricks to perform ETL processes using PySpark and Spark SQL.
  • Used Spark SQL to load data from JSON, create schema RDDs, and load them into Hive tables.
  • Used Scala SBT to develop Scala-coded Spark projects and executed them using spark-submit.
  • Expertise in Spark and Spark SQL, and in tuning and debugging the Spark cluster (YARN).
  • Improved efficiency by modifying existing data pipelines on Matillion to load the data into AWS Redshift.
  • Deployed the Airflow server and set up DAGs for scheduled tasks (a minimal DAG sketch appears after this list).
  • Very good experience with HashiCorp Vault for writing secrets to and reading them from lockboxes.
  • Migration of MicroStrategy reports and data from Netezza to IIAS.
  • Experienced with batch processing of data sources using Apache Spark.
  • Extensive use of Python libraries, Pylint, and the behave auto-testing framework.
  • Well versed with Pandas DataFrames and Spark DataFrames.
  • Developed PowerCenter mappings to extract data from various databases and flat files and load it into the data mart using PySpark and Airflow.
  • Created data partitions on large datasets in S3 and DDL on the partitioned data.
  • Implemented rapid provisioning and lifecycle management using Amazon EC2 and custom Bash scripts.
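
A rough PySpark sketch of the S3-to-Parquet loading described earlier in this list; the bucket paths, schema handling, and partition column are illustrative assumptions.

    # Sketch only: bucket paths and partition column are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("s3-etl-sketch").getOrCreate()

    # Read raw CSV files from S3, letting Spark infer the schema for illustration.
    raw = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv("s3a://my-raw-bucket/sales/")        # hypothetical source bucket
    )

    # Light transformation: standardize column names and add a load date for partitioning.
    cleaned = (
        raw.toDF(*[c.strip().lower().replace(" ", "_") for c in raw.columns])
        .withColumn("load_date", F.current_date())
    )

    # Write back to S3 as partitioned Parquet, ready to load into a fact table.
    (
        cleaned.write.mode("overwrite")
        .partitionBy("load_date")
        .parquet("s3a://my-curated-bucket/sales_fact/")  # hypothetical target bucket
    )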
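
The Airflow DAG setup mentioned above might look roughly like the sketch below; the DAG id, schedule, and commands are placeholder assumptions rather than the actual jobs.

    # Sketch only: DAG id, schedule, and the commands are placeholders.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {
        "owner": "data-engineering",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
    }

    with DAG(
        dag_id="daily_sales_etl",              # hypothetical DAG id
        start_date=datetime(2023, 1, 1),
        schedule_interval="0 2 * * *",         # run daily at 02:00
        catchup=False,
        default_args=default_args,
    ) as dag:
        extract_load = BashOperator(
            task_id="spark_s3_to_parquet",
            bash_command="spark-submit /opt/jobs/s3_etl_sketch.py",  # hypothetical job path
        )

        refresh_reports = BashOperator(
            task_id="refresh_reporting_views",
            bash_command="python /opt/jobs/refresh_views.py",        # hypothetical script
        )

        extract_load >> refresh_reports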

Environment: Unix Shell Script, Python 2 & 3, Scheduler (Cron), Jenkins, Artifactory, Matillion, EMR, Databricks, PyCharm, Spark SQL, Hive, SQL, Jupyter, MicroStrategy, PuTTY, Power BI, AWS.

SQL Developer

Confidential

Responsibilities:

  • Analyzed requirements and impact by participating in Joint Application Development sessions with business.
  • Created various scripts (using different database objects) and SSIS packages (using different tasks) to Extract, Transform and Load data from various servers to client databases.
  • Optimized Stored Procedures and long running queries using indexing strategies and query optimization techniques.
  • Leveraged dynamic SQL for improving performance and efficiency.
  • Performed optimization and performance tuning on Oracle PL/SQL procedures and SQL Queries.
  • Developed PL/SQL objects (views, packages, functions, and procedures) and used SQL*Loader for data migration.
  • Successfully developed and deployed SSIS packages into QA/UAT/Production environments and used package configuration to export various package properties.
  • Developed Tableau workbooks to perform year-over-year, quarter-over-quarter, YTD, QTD, and MTD analyses.
  • Worked with a team of developers to design, develop, and implement a BI solution for sales, product, and customer KPIs.
  • Created and analyzed complex dashboards in Tableau using various data sources such as Excel sheets and SQL Server.
  • Developed SSRS reports and configured SSRS subscriptions per specifications provided by internal and external clients.
  • Designed and coded application components in an Agile environment utilizing a test-driven development approach.
  • Extensively worked on Excel using pivot tables and complex formulas to manipulate large data structures.
  • Interacted with the other departments to understand and identify data needs and requirements and worked with other members of the organization to deliver and address those needs.
  • Designed and created distributed reports in multiple formats such as Excel, PDF and CSV using SQL Server 2008 R2 Reporting Services (SSRS).

Environment: SQL Server 2008 R2, SSMS, SSIS, SSRS, XML, MS Access.
