
Data Engineer Resume


Seattle, WA

SUMMARY

  • Professional software developer with 8+ years of technical expertise across all phases of the software development life cycle (SDLC), specializing in design and analysis using Big Data/Hadoop and Spark technologies.
  • Knowledge in Spark Core, Spark - SQL, Spark Streaming and machine learning using Scala and Python Programming languages.
  • Worked on Open Source Apache Hadoop, Cloudera Enterprise (CDH) and Hortonworks Data Platform (HDP)
  • Extensive experience in building batch and streaming data pipelines using cutting-edge technologies (Docker, Kubernetes, Hadoop, AWS, and Azure).
  • Designed and Developed applications using Apache Spark, Scala, Python, Redshift, Nifi, S3, AWS EMR on AWS cloud to format, cleanse, validate, create schema and build data stores on S3.
  • Hands on experience on major components in Hadoop Ecosystem like Hadoop Map Reduce, HDFS, HIVE, PIG, Pentaho, HBase, Zookeeper, Task Tracker, Name Node, Data Node, Sqoop, Oozie, Cassandra, Flume and Avro.
  • Developed various Map Reduce applications to perform ETL workloads on terabytes of data
  • Expertise in working with HIVE data warehouse infrastructure-creating tables, data distribution by implementing partitioning and bucketing, writing and optimizing the HQL queries.
  • Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
  • Used standard Python modules e.g. csv, robotparser, itertools, pickle, jinja2, lxml for development.
  • Good working knowledge of Snowflake and Teradata databases.
  • Understanding of structured data sets, data pipelines, ETL tools, and data reduction, transformation, and aggregation techniques; knowledge of tools such as dbt and DataStage.
  • Good hands-on experience with Python, SQL, and R.
  • Good understanding of RDD operations in Apache Spark, including transformations and actions, persistence/caching, accumulators, broadcast variables, and broadcast optimization (see the sketch after this list).
  • Hands on experience in performing aggregations on data using Hive Query Language (HQL).
  • Good experience in extending the core functionality of Hive and Pig by developing user-defined functions to provide custom capabilities to these languages.
  • Expertise in Hadoop Ecosystem components HDFS, Map Reduce, Hive, Pig, Sqoop, Hbase and Flume for Data Analytics.
  • Hands-on expertise with AWS databases such as RDS (Aurora), Redshift, DynamoDB, and ElastiCache (Memcached & Redis).
  • Hands-on experience fetching live stream data from DB2 into HBase tables using Spark Streaming and Apache Kafka.
  • Experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.
  • Capable of processing large sets of structured, semi-structured and unstructured data sets.
  • Experience in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
  • Expertise in writing Map-Reduce Jobs in Java for processing large sets of structured, semi-structured and unstructured data sets and store them in HDFS.
  • Hands-on experience in handling database issues and connections with SQL and NoSQL databases like MongoDB, Cassandra, Redis, CouchDB, DynamoDB by installing and configuring various packages in python.
  • Experience in developing Custom UDFs for datasets in Pig and Hive.
  • Analyzed the latest Big Data analytics technologies and their innovative applications in both business intelligence analysis and new service offerings.
  • Designed and Developed Shell Scripts and Sqoop Scripts to migrate data in and out of HDFS
  • Designed and Developed Oozie workflows to execute MapReduce jobs, Hive scripts, shell scripts and sending email notifications
  • Worked on pipeline and partitioning parallelism techniques and ensured load balancing of data
  • Deployed different partitioning methods like Hash by field, Round Robin, Entire, Modulus, and Range for bulk data loading
  • Hands-on experience working with input file formats such as Parquet, JSON, and Avro.
  • Worked on Extraction, Transformation, and Loading (ETL) of data from multiple sources like Flat files, XML files and Databases.
  • Used Agile Development Methodology and Scrum for the development process
  • Worked extensively on the design and development of business processes using Sqoop, Pig, Hive, and HBase.
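
For illustration, a minimal PySpark sketch of the RDD operations referenced above (transformations and actions, caching, and a broadcast variable). The input path, record layout, and lookup values are hypothetical placeholders, not project code.

```python
# Minimal PySpark RDD sketch: transformations, actions, caching, and a broadcast variable.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-ops-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical lookup table, shipped to every executor once via a broadcast variable.
region_lookup = sc.broadcast({"01": "WEST", "02": "EAST"})

lines = sc.textFile("s3://example-bucket/events/*.csv")        # hypothetical path

parsed = (lines
          .map(lambda line: line.split(","))                   # transformation
          .filter(lambda cols: len(cols) >= 3)                 # transformation
          .map(lambda cols: (region_lookup.value.get(cols[0], "UNKNOWN"),
                             float(cols[2]))))

parsed.cache()                                                 # persistence/caching

totals = parsed.reduceByKey(lambda a, b: a + b)                # wide transformation
print(totals.take(10))                                         # action triggers execution
print("records processed:", parsed.count())                    # second action reuses the cache

spark.stop()
```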

TECHNICAL SKILLS

OLAP/Reporting Tools: SSRS, SSAS, MDX, Tableau, PowerBI.

Relational Databases: SQL Server 2014/2012/2008 R2/2005, Oracle 11g, Azure SQL Database, MS Access, PostgreSQL

SQL Server Tools: Microsoft Visual Studio 2010/2013/2015, SQL Server Management Studio

Big Data Ecosystem: HDFS, Nifi, Map Reduce, Oozie, Hive/Impala, Pig, Sqoop, Zookeeper and Hbase, Spark, Scala, Kafka, Apache Flink, AWS- EC2, S3, EMR.

Other Tools: MS Office 2003/2007/2010/2013, Power Pivot, PowerBuilder, Git, CI/CD, Jupyter Notebook.

Programming languages: C, SQL, PL/SQL, T-SQL, Batch scripting, R, Python

Data Warehousing & BI: Star Schema, Snowflake schema, Facts and Dimensions tables, SAS, SSIS, and Splunk

Operating Systems: Windows XP/Vista/7/8 and 10; Windows 2003/2008R2/2012 Servers

PROFESSIONAL EXPERIENCE

Confidential

Data Engineer

Responsibilities:

  • Developed Spark scripts using Python in the PySpark shell during development.
  • Experienced in Hadoop production support tasks, analyzing application and cluster logs.
  • Created Hive tables, loaded them with data, and wrote Hive queries to process the data. Created partitions and used bucketing on Hive tables with the required parameters to improve performance, and developed Pig and Hive UDFs per business use cases (see the partitioning sketch after this list).
  • Worked extensively on ETL processes using SSIS packages.
  • Experience in both on-prem and cloud solutions. Extensive experience with the Microsoft Azure ecosystem: Power BI Service, ADF v2, Azure SQL Server, Azure SQL IaaS/SaaS/PaaS, Azure SQL DW, Azure Blob Storage, and AKV management. Also, hands-on experience with AWS cloud solutions such as AWS RDS for SQL Server, MySQL, Amazon Redshift, AWS EC2, and AWS S3 buckets.
  • Experience in data conversion and data migration using SSIS and DTS services across different databases such as Oracle, MS Access, and flat files.
  • Created data pipeline for different events of ingestion, aggregation and load consumer response data in AWS S3 bucket into Hive external tables in HDFS location to serve as feed for tableau dashboards
  • Worked on various data formats like AVRO, Sequence File, JSON, Map File, Parquet and XML
  • Created pipelines in Airflow using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob storage, and Azure SQL Data Warehouse, including write-back.
  • Used REST APIs with Python to ingest data into BigQuery.
  • In-depth knowledge of Snowflake database, schema, and table structures.
  • Experience in using Snowflake Clone and Time Travel.
  • Implemented a one-time migration of multi-state-level data from SQL Server to Snowflake using Python and SnowSQL.
  • Extracted data from data lakes and the EDW into relational databases for analysis and more meaningful insights using SQL queries and PySpark.
  • Created storage with Amazon S3 for storing data. Worked on transferring data from Kafka topic into AWS S3 storage
  • Implemented ETL jobs using Nifi to import from multiple databases such as Teradata, MS-SQL to HDFS for Business Intelligence
  • Hands-on experience with Amazon EC2, Amazon S3, AWS Glue, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, Cloud Front, CloudWatch and other services of the AWS family.
  • Responsible for creating on-demand tables over S3 files using Lambda functions and AWS Glue with Python and PySpark.
  • Used AWS Glue for data transformation, validation, and cleansing.
  • Used the Python Boto3 SDK to configure AWS services such as Glue, EC2, and S3 (see the Boto3 sketch at the end of this section).
  • Utilized SQOOP, Kafka, Flume and Hadoop Filesystem APIs for implementing data ingestion pipelines
  • Worked on real time streaming, performed transformations on the data using Kafka and Spark Streaming
  • Hands on experience in Hadoop administration and support activities for installations and configuring Apache Big Data Tools and Hadoop clusters using Cloudera Manager
  • Handled Hadoop cluster installations in various environments such as Unix, Linux and Windows
  • Assisted in upgrading, configuration and maintenance of various Hadoop infrastructures like Pig, Hive, and HBase
  • Expertise in Creating, Debugging, Scheduling and Monitoring jobs using Airflow and Oozie
  • Designed and published visually rich and intuitive Tableau dashboards and crystal reports for executive decision making
  • Experienced in working with SQL, T-SQL, PL/SQL scripts, views, indexes, stored procedures, and other components of database applications
  • Experienced in working with Hadoop from Cloudera Data Platform and running services through Cloudera manager
  • Used Agile Scrum methodology/ Scrum Alliance for development.
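
As referenced above, a rough sketch of writing and querying a partitioned, bucketed table from PySpark. The database, table, and column names are hypothetical, the tiny inline DataFrame stands in for the real feed, and the 32-bucket count is an assumption.

```python
# Sketch: write a partitioned, bucketed table with Spark and query it with partition pruning.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partition-bucket-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Tiny stand-in for the real consumer-response feed.
df = spark.createDataFrame(
    [("r1", "c100", "ok", "2020-01-01"),
     ("r2", "c200", "late", "2020-01-02")],
    ["response_id", "consumer_id", "response_text", "load_date"])

# Partition by load date and bucket by consumer id to cut shuffle on frequent joins/aggregations.
(df.write
   .partitionBy("load_date")
   .bucketBy(32, "consumer_id")
   .sortBy("consumer_id")
   .mode("overwrite")
   .saveAsTable("default.consumer_response"))

# Filtering on the partition column lets the engine prune partitions instead of scanning everything.
spark.sql("""
    SELECT consumer_id, COUNT(*) AS responses
    FROM default.consumer_response
    WHERE load_date = '2020-01-01'
    GROUP BY consumer_id
""").show()
```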

Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Spark MLlib, Scala, Python, Kafka, Hive, Sqoop, Amazon AWS, Elasticsearch, Impala, Cassandra, Tableau, Talend, Cloudera, MySQL, Linux.
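
The Boto3 usage mentioned above can be sketched roughly as below. Bucket, key, and Glue job names are hypothetical, and credentials are assumed to come from the environment or an IAM role.

```python
# Hedged Boto3 sketch: upload a processed file to S3, start a Glue job, and list running EC2 instances.
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")
ec2 = boto3.client("ec2")

# Land a cleansed extract in the raw zone of the data lake.
s3.upload_file("consumer_response.parquet",
               "example-datalake-bucket",
               "raw/consumer_response/consumer_response.parquet")

# Trigger a Glue ETL job that builds the on-demand table over the S3 files.
run = glue.start_job_run(JobName="build-consumer-response-table")
print("Glue job run id:", run["JobRunId"])

# Quick operational check: which EC2 instances are currently running?
reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}])
for reservation in reservations["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["InstanceType"])
```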

Confidential, Seattle, WA

Data Engineer

Responsibilities:

  • Involved in requirements gathering, analysis, design, development, change management, deployment.
  • Experienced in designing and deployment of Hadoop cluster and various Big Data components including HDFS, MapReduce, Hive, Sqoop, Pig, Oozie, Zookeeper in Cloudera distribution.
  • Worked closely with data scientists to assist with feature engineering, model-training frameworks, and model deployments at scale.
  • Migrated an existing on-premises data to AWS S3. Used AWS services like EC2 and S3 for data sets processing and storage.
  • Experienced in maintaining the Hadoop cluster on Hortonworks on GCP.
  • Developed real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster.
  • Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
  • Documented and managed migration and development process of Airflow Data Pipelines using Airflow DAGs.
  • Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
  • Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system (see the streaming sketch after this list).
  • Partitioned data streams using Kafka; designed and configured a Kafka cluster to accommodate a heavy throughput of 1 million messages per second and used Kafka producer APIs to produce messages.
  • Involved in loading and transforming large Datasets from relational databases into HDFS and vice-versa using Sqoop imports and export.
  • Responsible for loading Data pipelines from webservers and Teradata using Sqoop with Kafka and Spark Streaming API.
  • Expertise in Snowflake for creating and maintaining tables and views.
  • Designed and Developed applications using Apache Spark, Scala, Python, Redshift, Nifi, S3, AWS EMR on AWS cloud to format, cleanse, validate, create schema and build data stores on S3.
  • Extracted data from heterogeneous sources and performed complex business logic on network data to normalize raw data which can be utilized by BI teams to detect anomalies.
  • Developed Spark jobs in PySpark to perform ETL from SQL Server to Hadoop and worked on Spark Streaming using Kafka to submit the job and start the job working in Live manner.
  • Designed and developed Flink pipelines to consume streaming data from Kafka and applied business logic to massage and transform and serialize raw data.
  • Developed entire frontend and backend modules using Python on the Django web framework and created the user interface (UI) using JavaScript, Bootstrap, and HTML5/CSS, with Cassandra and MySQL as data stores.
  • Extensively worked with Text, Avro, Parquet, CSV, and JSON file formats and developed a common Spark data-serialization module for converting complex objects into sequences of bits using these formats.
  • Responsible for operations and support of Big Data Analytics platform and Power BI visualizations.
  • Managed, developed, and designed a dashboard control panel for customers and administrators using Tableau, PostgreSQL, and REST API calls.
  • Worked with the Ab Initio team in development/enhancement of the existing models by adding extensions according to the business needs.
  • Developed CI-CD pipeline to automate build and deploy to Dev, QA, and production environments.
  • Supported production jobs and developed several automated processes to handle errors and notifications. Also, tuned performance of slow jobs by improving design and configuration changes of PySpark jobs.
  • Supporting Continuous storage in AWS using Elastic Block Storage, S3, Glacier. Created Volumes and configured Snapshots for EC2 instances.
  • Developed and executed automated test scripts using QTP (VBScript) for functional, regression, smoke, ad hoc, and end-to-end testing.
  • Created standard report Subscriptions and Data Driven Report Subscriptions.
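
As noted above, a hedged sketch of a Spark streaming job that consumes a Kafka topic and lands the records on S3. Broker, topic, and path names are hypothetical; this uses the Structured Streaming API rather than the original DStream jobs and assumes the spark-sql-kafka connector is on the classpath (e.g. via --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<version>).

```python
# Sketch: read a Kafka topic with Spark Structured Streaming and append Parquet files to S3.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-s3-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "web-clickstream")
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers key/value as binary; cast the payload to strings for downstream parsing.
parsed = events.select(col("key").cast("string"),
                       col("value").cast("string"),
                       col("timestamp"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/streams/clickstream/")
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/clickstream/")
         .outputMode("append")
         .start())

query.awaitTermination()
```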

Environment: Hadoop, MapReduce, Spark, Spark MLlib, Tableau, Airflow, SQL, Excel, Pig, Hive, AWS, PostgreSQL, Python, PySpark, Flink, Kafka, Sqoop, SQL Server 2012, T-SQL, CI/CD, Git, XML.

Confidential

Data Engineer

Responsibilities:

  • Installed and configured Hadoop MapReduce, HDFS, developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Good understanding and related experience with Hadoop stack - internals, Hive, Pig and Map/Reduce
  • Wrote MapReduce jobs to discover trends in data usage by users.
  • Involved in managing and reviewing Hadoop log files
  • Load and transform large sets of structured, semi structured and unstructured data
  • Import the data from different sources like HDFS/HBase into Spark RDD.
  • Importing and exporting data into HDFS and HIVE using Sqoop.
  • Implemented a generalized solution model using AWS SageMaker.
  • Developed a Python application for Google Analytics aggregation and reporting and used Django configuration to manage URLs and application parameters.
  • Developed REST APIs using Python with the Flask and Django frameworks and integrated various data sources including Java, JDBC, RDBMS, shell scripting, spreadsheets, and text files.
  • Involved in gathering the requirements, designing, development and testing.
  • Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run Airflow workflows (see the DAG sketch after this list).
  • Developed PIG scripts for source data validation and transformation.
  • Designed and developed tables in HBase for storing aggregated data from Hive.
  • Developed Spark scripts by using Scala shell commands as per the requirement.
  • Developed Spark core and Spark-SQL scripts using Scala for faster data processing.
  • Involved in code review and bug fixing for improving the performance.
  • Worked on importing and exporting data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.
  • Implemented the workflows using Apache Oozie framework to automate tasks.
  • Implemented Partitioning, Bucketing in Hive for better organization of the data
  • Worked on Big Data Integration and Analytics based on Hadoop, SOLR, Spark, Kafka, Storm and web Methods technologies.
  • Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
  • Performed real-time streaming of data using Spark with Kafka into data stores such as DynamoDB and Cassandra.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala and related tools and systems
  • Generate final reporting data using Tableau for testing by connecting to the corresponding Hive tables using Hive ODBC connector.
  • Strongly recommended bringing in Elasticsearch and was responsible for its installation, configuration, and administration.
  • Developing and maintaining efficient ETL Talend jobs for Data Ingest.
  • Worked on the Talend RTX ETL tool, developing and scheduling jobs in Talend Integration Suite.
  • Modified reports and Talend ETL jobs based on the feedback from QA testers and Users in development and staging environments.
  • Involved in migrating Hadoop jobs into higher environments such as SIT, UAT, and Prod.
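
As referenced above, a hedged Airflow sketch: a daily DAG that copies files from an S3 stage into a Snowflake table. The DAG id, connection parameters, stage, and table names are hypothetical; it assumes Airflow 2.x and the snowflake-connector-python package, and real credentials would normally come from Airflow connections or a secrets backend.

```python
# Sketch of a daily Airflow DAG that runs a Snowflake COPY INTO from an external S3 stage.
from datetime import datetime

import snowflake.connector
from airflow import DAG
from airflow.operators.python import PythonOperator


def load_s3_into_snowflake():
    # COPY INTO pulls any new files from the external S3 stage into the target table.
    conn = snowflake.connector.connect(
        account="example_account",
        user="example_user",
        password="example_password",
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="RAW",
    )
    try:
        conn.cursor().execute(
            "COPY INTO RAW.CONSUMER_RESPONSE "
            "FROM @RAW.S3_CONSUMER_STAGE "
            "FILE_FORMAT = (TYPE = PARQUET) "
            "MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
        )
    finally:
        conn.close()


with DAG(
    dag_id="s3_to_snowflake_daily",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="copy_s3_stage_into_snowflake",
        python_callable=load_s3_into_snowflake,
    )
```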

Environment: Cloudera, HDFS, Hive, Scala, MapReduce, Storm, Java, HBase, Pig, Sqoop, shell scripts, Oozie, Oozie Coordinator, MySQL, Tableau, Elasticsearch, Talend, SFTP, Spark RDD, Kafka, Python, Hortonworks, IntelliJ, Azkaban, Ambari/Hue, Jenkins, Apache Nifi.

Confidential, Westwood, MA

Data Engineer

Responsibilities:

  • Design and implement database solutions in Azure SQL Data Warehouse and Azure SQL (see the connection sketch after this list).
  • Architect & implement medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Design & implement migration strategies for traditional systems on Azure (lift and shift, Azure Migrate, and other third-party tools).
  • Worked on ETL migration services by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena.
  • Engage with business users to gather requirements, design visualizations and provide training to use self-service BI tools.
  • Created Airflow Scheduling scripts in Python.
  • Used various sources to pull data into Power BI such as SQL Server, Excel, Oracle, SQL Azure etc.
  • Propose architectures considering cost/spend in Azure and develop recommendations to right-size data infrastructure.
  • Develop conceptual solutions & create proof-of-concepts to demonstrate viability of solutions.
  • Technically guide projects through to completion within target timeframes.
  • Collaborate with application architects and DevOps.
  • Identify and implement best practices, tools and standards.
  • Designed, set up, maintained, and administered Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.
  • Built complex distributed systems involving large-scale data handling, metrics collection, data pipeline construction, and analytics.
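
As referenced above, a hedged sketch of querying Azure SQL Data Warehouse (dedicated SQL pool) from Python over ODBC. The server, database, login, and table names are hypothetical, and the ODBC Driver 17 for SQL Server is assumed to be installed.

```python
# Sketch: connect to Azure SQL DW via pyodbc and run a simple aggregate query.
import pyodbc

conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=example-dw-server.database.windows.net,1433;"
    "DATABASE=example_dw;"
    "UID=example_user;"
    "PWD=example_password;"
    "Encrypt=yes;TrustServerCertificate=no;"
)

with pyodbc.connect(conn_str) as conn:
    cursor = conn.cursor()
    # Simple aggregate over a hypothetical fact table in the warehouse.
    cursor.execute(
        "SELECT TOP (10) product_key, SUM(sales_amount) AS total_sales "
        "FROM dbo.FactSales GROUP BY product_key ORDER BY total_sales DESC"
    )
    for row in cursor.fetchall():
        print(row.product_key, row.total_sales)
```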

Environment: Azure SQL data warehouse, AzureSQL, Azure Data Lake, Azure Data Factory, Data Lake Analytics, Stream Analytics, NoSQL DB, SQL Server, Oracle, Excel.

Confidential, Santa Clara, CA

PL/SQL Developer

Responsibilities:

  • Involved in the creation of tables, join conditions, nested queries, views, sequences, and synonyms for business application development.
  • Developed database triggers, packages, functions, and stored procedures using PL/SQL and maintained the scripts for various data feeds.
  • Created Indexes for faster retrieval of the customer information and enhance the database performance.
  • Created dynamic SQL to support dynamic nature of front-end applications.
  • Utilized SQL*Loader to perform bulk data loads into database tables from external data files.
  • Used UNIX for automatic scheduling of jobs. Involved in unit testing of newly created PL/SQL blocks of code.
  • Created data pipelines for different events to load data from DynamoDB into an AWS S3 bucket and then into an HDFS location.
  • To increase the performance of collection-based DML, used FORALL, and scanned collections using FIRST, LAST, and NEXT in loops (see the sketch after this list).
  • To conserve resources, used anonymous blocks within IF statements.
  • Used PRAGMA EXCEPTION_INIT to name the system exceptions raised by the program.
  • Wrote complex queries to generate reports as per client request as a part of production support.
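
As referenced above, a rough sketch of collection-based DML with BULK COLLECT and FORALL, wrapped in an anonymous PL/SQL block and executed from Python with cx_Oracle. The connection details and the customers table and columns are hypothetical.

```python
# Sketch: run an anonymous PL/SQL block (BULK COLLECT + FORALL) through cx_Oracle.
import cx_Oracle

plsql_block = """
DECLARE
    TYPE t_id_tab IS TABLE OF customers.customer_id%TYPE;
    l_ids t_id_tab;
BEGIN
    -- Pull the target keys into a collection in one round trip.
    SELECT customer_id
    BULK COLLECT INTO l_ids
    FROM customers
    WHERE status = 'INACTIVE';

    -- FORALL sends the whole batch of updates to the SQL engine at once.
    FORALL i IN 1 .. l_ids.COUNT
        UPDATE customers
        SET archived_flag = 'Y'
        WHERE customer_id = l_ids(i);

    COMMIT;
END;
"""

connection = cx_Oracle.connect("example_user", "example_password",
                               "example-host/EXAMPLEPDB")
try:
    connection.cursor().execute(plsql_block)
finally:
    connection.close()
```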

Environment: Oracle8i, Windows XP, PL/SQL, SQL*PLUS, SQL Developer, UNIX.

Confidential, Chattanooga, TN

SQL Developer/ DBA

Responsibilities:

  • Participated in analysis, design, development, testing, and implementation of various financial Systems using Oracle, Developer and PL/SQL.
  • The system consisted of various functional modules.
  • Defined database structures and mapping and transformation logic; created external table scripts for loading data from source systems for ETL (Extract, Transform, and Load) jobs.
  • Wrote UNIX Shell Scripts to run database jobs on server side.
  • Developed new and modified existing packages, Database triggers, stored procedure and other code modules using PL/SQL in support of business requirements.
  • Worked with various functional experts to translate their functional knowledge into business rules implemented as working code modules such as procedures and functions.

Environment: Unix, PL/SQL, Oracle and Developer.

Confidential

SQL developer / DBA

Responsibilities:

  • Involved in complete Software Development Lifecycle (SDLC).
  • Wrote complex SQL Queries, Stored Procedure, Triggers, Views & Indexes using DML, DDL commands and user defined functions to implement the business logic.
  • Advised optimization of queries by looking at execution plan for better tuning of Database.
  • Performed Normalization & De-normalization on existing tables for faster query results.
  • Wrote T-SQL queries and procedures to generate DML scripts that modified database objects dynamically based on inputs.
  • Created SSIS packages to import and export data from various CSV files, flat files, Excel spreadsheets, and SQL Server.
  • Designed and developed different types of reports like matrix, tabular, chart reports using SSRS.
  • Involved in migrating SQL Server 2012 databases to SQL Server 2014.
  • Maintained positive communication and working relationship with all business levels.
  • Coordinated with onshore/offshore and stakeholder teams for task clarification, fixes, and reviews.
  • Reviewed, analyzed and implemented necessary changes in appropriate areas to enhance and improve existing systems.

Environment: SQL Server, T-SQL, SQL, SSIS and SSRS.
