
Sr. Data Engineer Resume


Scottsdale, Arizona

SUMMARY

  • 9+ years of professional experience in Big Data development, primarily using the Hadoop and Spark ecosystems.
  • Experience in design, development, and implementation of Big Data applications using Hadoop ecosystem frameworks and tools like HDFS, MapReduce, Sqoop, Spark, Scala, Storm, HBase, Kafka, and Flume.
  • Hands-on experience with HDP and GCP, including BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
  • Expertise in ingesting, processing, exporting, and analyzing terabytes of structured and unstructured data on Hadoop clusters in the Information Security and Technology domains.
  • Experience working with SDLC methodologies such as Agile Scrum for developing and delivering applications.
  • Experience in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
  • In-depth knowledge of Hadoop architecture and experience working with Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce.
  • Demonstrated experience in delivering data and analytics solutions leveraging AWS, Azure, or similar cloud data lakes.
  • Strong knowledge of designing and developing QlikView and QlikSense dashboards by extracting data from sources such as SQL Server, Oracle, SAP, flat files, Excel files, and XML files.
  • Hands-on experience with AWS (Amazon Web Services): Elastic MapReduce (EMR), S3 storage, EC2 instances, and data warehousing.
  • Worked with various file formats such as CSV, JSON, and XML.
  • Expertise in writing DDL and DML scripts in SQL and HQL for analytics applications on RDBMS.
  • Expertise in developing streaming applications in Scala using Kafka and Spark Structured Streaming (see the sketch after this list).
  • Experience importing and exporting data between HDFS and RDBMS systems like Teradata (Sales Data Warehouse) and SQL Server, and non-relational systems like HBase, using Sqoop with efficient column mappings while maintaining uniformity.
  • Strong expertise in relational database systems such as Oracle, MS SQL Server, Teradata, MS Access, and DB2, with design and database development using SQL, PL/SQL, SQL*Plus, TOAD, and SQL*Loader. Highly proficient in writing, testing, and implementing triggers, stored procedures, functions, packages, and cursors using PL/SQL.
  • Experience working with Flume and NiFi for loading log files into Hadoop.
  • Experience working with NoSQL databases like HBase and Cassandra.
  • Extensive ETL testing and automation experience on the Amazon Web Services platform, with expertise in ETL tools such as Redpoint Data Management, Redpoint Interaction, IBM DataStage, and Neolane.
  • Experienced in creating shell scripts to push data loads from various sources on the edge nodes onto HDFS.
  • Experience implementing and orchestrating data pipelines using Oozie and Airflow.
  • Working knowledge of AWS cloud services such as EMR, S3, Redshift, and CloudWatch for big data development.
  • Strong knowledge of data warehousing implementation concepts in Redshift; completed a POC with Matillion and Redshift for a DW implementation.
  • Experience in data integration and data warehousing using ETL tools such as Informatica PowerCenter, AWS Glue, SQL Server Integration Services (SSIS), and Talend.
  • Experience with Snowflake multi-cluster warehouses.
  • Experience working with build and automation tools such as Maven, SBT, Git, SVN, and Jenkins.
  • Experience understanding specifications for data warehouse ETL processes and interacting with designers and end users on informational requirements.
  • Worked with Cloudera and Hortonworks distributions.
  • Experienced in performing code reviews and closely involved in smoke testing and retrospective sessions.
  • Experienced with Microsoft Business Intelligence tools: developing SSIS (Integration Services), SSAS (Analysis Services), and SSRS (Reporting Services) solutions, and building key performance indicators and OLAP cubes.
  • Good exposure to star and snowflake schemas and data modeling across different data warehouse projects.
  • Troubleshooting production incidents requiring detailed analysis of issues in web and desktop applications, Autosys batch jobs, and databases.
  • Strong troubleshooting and production support skills and the ability to interact with end users.
  • Worked closely with technical teams, business teams, and product owners.
  • Strong analytical and problem-solving skills and the ability to follow through wif projects from inception to completion.
  • Ability to work effectively in cross-functional team environments; excellent communication and interpersonal skills.
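
A minimal Scala sketch of the Kafka plus Spark Structured Streaming pattern referenced above. The broker address, topic name, event schema, and HDFS paths are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath; production code differed per engagement.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    object ClickstreamIngest {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-structured-streaming-sketch")
          .getOrCreate()
        import spark.implicits._

        // Hypothetical event schema; real topics and fields vary per project.
        val schema = new StructType()
          .add("eventId", StringType)
          .add("userId", StringType)
          .add("eventTime", TimestampType)

        // Read the Kafka topic as a stream and parse the JSON payload.
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092") // placeholder brokers
          .option("subscribe", "events")                     // placeholder topic
          .option("startingOffsets", "latest")
          .load()
          .select(from_json($"value".cast("string"), schema).as("e"))
          .select("e.*")

        // Land the parsed events on HDFS as Parquet with checkpointing.
        events.writeStream
          .format("parquet")
          .option("path", "hdfs:///data/events")             // placeholder sink path
          .option("checkpointLocation", "hdfs:///checkpoints/events")
          .start()
          .awaitTermination()
      }
    }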

TECHNICAL SKILLS

Hadoop/Big Data Technologies: Sqoop, Flume, Hive, Impala

NoSQL Databases: HBase, Cassandra, MongoDB

Monitoring and Reporting: Tableau, Custom Shell Scripts

Hadoop Distributions: Hortonworks, Cloudera, MapReduce, Spark

Build and Deployment Tools: Maven, SBT, Git, SVN, Jenkins

Programming and Scripting: Scala, Java, PL/SQL, JavaScript, Shell Scripting, Python, Pig Latin, HiveQL

Java Technologies: J2EE, Java Mail API, JDBC

Databases: Oracle, MySQL, MS SQL Server

Analytics Tools: Tableau, Microsoft SSIS, SSAS and SSRS

Web Dev. Technologies: HTML, XML, JSON, CSS, jQuery, JavaScript

ETL Tools: Informatica PowerCenter, QlikSense, QlikView

Operating Systems: Linux, Unix, Windows 8, Windows 7, Windows Server 2008/2003

AWS Services: EC2, EMR, S3, Redshift, Lambda, Athena, Snowflake

PROFESSIONAL EXPERIENCE

Confidential, Scottsdale, Arizona

Sr. Data Engineer

Responsibilities:

  • Involved in all phases of the SDLC, including requirement gathering, design, analysis and testing of customer specifications, development, and deployment of the application.
  • Involved in designing and deploying a large application utilizing almost the entire AWS stack (including EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
  • Worked on a migration project moving current applications from a traditional datacenter to AWS using AWS services.
  • Launched Amazon EC2 cloud instances using Amazon Web Services (Linux/Ubuntu/RHEL) and configured launched instances with respect to specific applications.
  • Installed applications on AWS EC2 instances and configured storage on S3 buckets; assisted the team in deploying to AWS and the cloud platform.
  • Managed IAM policies providing access to different AWS resources, and designed and refined the workflows used to grant access.
  • Implemented and maintained monitoring and alerting of production and corporate servers/storage using AWS CloudWatch.
  • Designed AWS CloudFormation templates to create custom-sized VPCs, subnets, and NAT to ensure successful deployment of web applications and database templates.
  • Launched compute (EC2) and database (Aurora, Cassandra) instances from the AWS Management Console and CLI.
  • Installed and configured Splunk Universal Forwarders on both UNIX (Linux, Solaris, and AIX) and Windows Servers.
  • Hands-on experience customizing Splunk dashboards, visualizations, and configurations using customized Splunk queries.
  • Built multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
  • Implemented Docker for packaging the final code and setting up development and testing environments using Docker Hub, Docker Swarm, and Docker container networking.
  • Elasticsearch experience including capacity planning and cluster maintenance; continuously looked for ways to improve and set a high bar for quality.
  • Implemented a real-time log analytics pipeline using Elasticsearch.
  • Set up and configured Elasticsearch in a POC test environment to ingest over a million records from an Oracle DB.
  • Designed the data models used in data-intensive AWS Lambda applications aimed at complex analysis, creating analytical reports for end-to-end traceability, lineage, and definition of key business elements from Aurora.
  • Deployed applications on AWS using Elastic Beanstalk; integrated continuous delivery (CI/CD) using Jenkins and Puppet.
  • Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad hoc queries. This allowed for a more reliable and faster reporting interface, giving sub-second query response for basic queries.
  • Responsible for designing logical and physical data models for various data sources on Confidential Redshift.
  • Wrote scripts and an indexing strategy for a migration to Confidential Redshift from SQL Server and MySQL databases.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Responsible for moving data between different AWS compute and storage services by using AWS Data Pipeline.
  • Designed and developed ETL processes in AWS Glue to move campaign information from outside sources such as S3 (ORC/Parquet/text files) into AWS Redshift (a simplified sketch of the load pattern follows this list).
  • Strong background in data warehousing, business intelligence, and ETL processes (Informatica, AWS Glue), with expertise working on large datasets and analysis.
  • Performed data extraction, aggregation, and consolidation of Adobe data inside AWS Glue using PySpark.
  • Set up databases in AWS using RDS and storage using S3 buckets, and configured instance backups to an S3 bucket.
  • Configured and deployed OpenStack Enterprise master hosts and OpenStack node hosts.
  • Experienced in deploying applications on the Apache web server, Nix, and application servers such as Tomcat and JBoss.
  • Extensively used Splunk Search Processing Language (SPL) queries, reports, alerts, and dashboards.
  • Installed and implemented the Splunk App for Enterprise Security, documented best practices for the installation, and performed knowledge transfer on the process.
  • Used Splunk DB Connect for real-time data integration between Splunk Enterprise and databases.
  • Virtualized servers using Docker for test and development environment needs, and automated configuration using Docker containers.
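
A simplified, hypothetical sketch of the S3-to-Redshift load pattern described above, written here as a plain Spark job with a JDBC sink rather than the actual AWS Glue scripts; the bucket, cluster endpoint, table names, and credentials are placeholders.

    import java.util.Properties
    import org.apache.spark.sql.{SaveMode, SparkSession}

    object S3ToRedshiftLoad {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("s3-to-redshift-sketch")
          .getOrCreate()

        // Read campaign extracts landed on S3 (placeholder bucket/prefix).
        val campaigns = spark.read.parquet("s3a://example-bucket/campaigns/")

        // Light cleanup; the real Glue jobs carried richer transformations.
        val cleaned = campaigns
          .dropDuplicates("campaign_id")
          .na.fill("unknown", Seq("channel"))

        // JDBC write into Redshift; URL, schema, and table are placeholders.
        val props = new Properties()
        props.setProperty("user", sys.env.getOrElse("REDSHIFT_USER", ""))
        props.setProperty("password", sys.env.getOrElse("REDSHIFT_PASSWORD", ""))
        props.setProperty("driver", "com.amazon.redshift.jdbc42.Driver")

        cleaned.write
          .mode(SaveMode.Append)
          .jdbc("jdbc:redshift://example-cluster:5439/dev", "analytics.campaigns", props)
      }
    }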

Environment: Amazon Web Services, IAM, S3, RDS, EC2, VPC, CloudWatch, AWS Glue, GCP, Informatica PowerCenter 10.x/9.x, IDQ, Bitbucket, Chef, Puppet, Ansible, Docker, Apache HTTPD, Apache Tomcat, JBoss, JUnit, Cucumber, Python.

Confidential, Malvern, PA

Sr. Big Data Engineer/ Cloud Data Engineer

Responsibilities:

  • Worked on DB2 SQL connections from Spark Scala code to select, insert, and update data in the database.
  • Used broadcast joins in Spark to join smaller datasets to large datasets without shuffling data across nodes.
  • Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau through HiveServer2 for generating interactive reports.
  • Developed a Spark application for loading CSV file data and applying business validations on the data frame to separate valid and invalid records; wrote the valid data frame into the actual Hive partitioned table and the invalid data frame into an error table, both partitioned by load date and load type (a simplified sketch follows this list).
  • Worked on dimensional and relational data modeling using star and snowflake schemas, OLTP/OLAP systems, and conceptual, logical, and physical data modeling using Erwin.
  • Implemented Spark scripts using SparkSession, Python, and Spark SQL to access Hive tables in Spark for faster data processing.
  • Experience scheduling ETL jobs using SQL Agent and external scheduling tools.
  • Developed and deployed the results using Spark and Scala code on a Hadoop cluster running on GCP.
  • Developed Spark programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
  • Created data sharing between two Snowflake accounts.
  • Worked on building data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
  • Implemented Spark jobs using Python and Spark SQL for faster testing and processing of data.
  • Used Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Python, as well as in NoSQL databases such as HBase and Cassandra.
  • Developed data processing applications in Scala using Spark RDDs as well as DataFrames with Spark SQL APIs.
  • Worked with the SparkSession object, Spark SQL, and DataFrames for faster execution of Hive queries.
  • Imported data from sources like SQL Server into Spark RDDs and developed a data pipeline using Kafka and Spark to store data in HDFS.
  • Experience in change implementation, monitoring, and troubleshooting of AWS Snowflake databases and cluster-related issues.
  • Used Spark SQL to load JSON data, create a schema RDD, and load it into Hive tables, and handled structured data using Spark SQL.
  • Reviewed explain plans for SQL queries in Redshift.
  • Designed tables and columns in Redshift for data distribution across nodes in the cluster, keeping columnar database design considerations in mind.
  • Worked closely with business analysts and enterprise architects to understand the rules provided by the business.
  • Created shell scripts to access the staging location on edge nodes and move specified inbound files to the HDFS publish location, and used D-series to invoke the invoker code (Spring Boot) on schedule.
  • Worked extensively with Sqoop for importing and exporting data between HDFS and relational database systems/mainframes.
  • Wrote unit test cases for the Spark Scala code using FunSuite.
  • Worked on loading data into the Snowflake DB in the cloud from various sources.
  • Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Experience with QlikView scripting, set analysis, and Section Access.
  • Configured multiple AWS services such as EMR and EC2 to maintain compliance with organization standards.
  • Implemented NiFi flow topologies to perform cleansing operations before moving data into HDFS.
  • Used Apache NiFi to copy data from the local file system to HDP.
  • Worked on big data integration and analytics based on Hadoop, Solr, Spark, Kafka, Storm, and webMethods technologies.
  • Worked with Tidal Enterprise Scheduler to schedule daily batch jobs with ease.
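
A simplified Scala sketch of the CSV validation flow described above. The input path, validation rule, and Hive table names are illustrative placeholders; the real job applied project-specific business rules and wrote to pre-created tables partitioned by load_date and load_type.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object CsvValidationLoad {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("csv-validation-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Placeholder input path plus load metadata columns.
        val raw = spark.read.option("header", "true")
          .csv("hdfs:///landing/accounts/*.csv")
          .withColumn("load_date", current_date())
          .withColumn("load_type", lit("full"))

        // Hypothetical rule: account_id present and balance numeric.
        val isValid = col("account_id").isNotNull &&
          col("balance").cast("double").isNotNull

        val valid   = raw.filter(isValid)
        val invalid = raw.filter(!isValid)

        // Append into pre-created Hive tables whose trailing columns are the
        // partition columns (load_date, load_type).
        valid.write.mode("append").insertInto("edw.accounts")
        invalid.write.mode("append").insertInto("edw.accounts_error")
      }
    }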

Environment: Scala, Spark Core, Spark SQL, GCP, Apache Hadoop 2.7.6, Spark 2.3, Hive SQL, Snowflake, Spring Boot, CDH5, HDFS, Cassandra, ZooKeeper, Kafka, Oracle 19c, MySQL, Redshift, Shell Script, AWS, S3, EC2, Tomcat 8, Hive, QlikView, QlikSense

Confidential, St Louis, Missouri

Sr. Big Data Engineer

Responsibilities:

  • Understood the business and user requirements from the client to deliver better documentation.
  • Configured event-based logging for ELK to push application errors and EMR errors specifically.
  • Worked extensively on disaster recovery applications to maintain stability when a regional disaster occurs.
  • Configured EMR to process millions of customers' data using Spark applications in less than half an hour.
  • Created custom UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation queries, writing results back into OLTP through Sqoop (a minimal UDF sketch follows this list).
  • Customized Hive UDFs to derive a structured format from unstructured customer data and loaded it into the HBase environment from the database using Sqoop.
  • Implemented Scala over Spark RDD structures to rewrite Hive/SQL queries for faster data processing.
  • Developed serverless infrastructure using multiple AWS services.
  • Configured multiple AWS services such as EMR, EC2, and S3 to maintain compliance with organization standards.
  • Configured Lambdas using YAML- and JSON-parameterized CFTs.
  • Used AWS Redshift, S3, Spectrum, and Athena services to query large amounts of data stored on S3, creating a virtual data lake without having to go through an ETL process.
  • Used Redpoint Interaction to generate automatic emails to various customers on a daily and weekly basis.
  • Devised PL/SQL stored procedures, functions, triggers, views, and packages; made use of indexing, aggregation, and materialized views to optimize query performance.
  • Migrated Matillion pipelines and Looker reports from Amazon Redshift to the Snowflake data warehouse.
  • Automated and scheduled daily data loads of QVW documents using QlikView Publisher.
  • Worked extensively on SQL, PL/SQL, and UNIX shell scripting.
  • Configured event notification subscriptions on S3, SNS topics, and Lambda to process the data based on the required marker files.
  • Most notable clients include Technology Crossover Ventures, Redpoint Ventures, VantagePoint Venture Capital, and Jafco Ventures.
  • Worked on MongoDB (NoSQL) to store unstructured data before processing with HiveQL.
  • Pushed queue-processed messages to mobile devices using Storm and Kafka.
  • Utilized the Matillion ETL solution to develop pipelines that extract and transform data from multiple sources and load it into Snowflake; pushed application and transformation incremental logs to Kafka and ZooKeeper using a marker file monitored by a log producer written in Scala.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python.
  • Worked on sequence files, RC files, map-side joins, bucketing, and partitioning for Hive performance enhancement and storage improvement.
  • Delivered different visualization patterns for business analysts based on the structured, transformed data.
  • Used AWS dev and pre-prod environments to test the application with simulated data and obtain performance results, and maintained a stabilized production environment for better application services.
  • Maintained a regular cadence of quality deliveries to meet business requirements.
  • Performed business analysis to deliver client requirements and detailed documentation explaining the functional requirements, working thoroughly through the business requirements.
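
A minimal sketch, in Scala, of the custom-UDF-for-aggregation pattern mentioned above. The spend-tier rule, table, and column names are hypothetical; the production UDFs implemented project-specific aggregation logic.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object CustomerAggregates {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("custom-udf-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Hypothetical UDF: bucket customers by total spend.
        val spendTier = udf((total: Double) =>
          if (total >= 10000) "platinum"
          else if (total >= 1000) "gold"
          else "standard")
        spark.udf.register("spend_tier", spendTier)

        // DataFrame-side aggregation over a placeholder Hive table.
        val customers = spark.table("edw.transactions")
          .groupBy("customer_id")
          .agg(sum("amount").as("total_spend"))
          .withColumn("tier", spendTier(col("total_spend")))

        // The registered UDF is equally usable from Spark SQL.
        customers.createOrReplaceTempView("customer_spend")
        spark.sql(
          "SELECT tier, count(*) AS customers FROM customer_spend GROUP BY tier"
        ).show()
      }
    }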

Environment: Spark, Scala, Python, AWS, Kafka, PL/SQL, Redpoint, Hive, Matillion, Sqoop, Storm, ELK, Jenkins, Redshift, S3, Athena.

Confidential, Rochester, MN

Big Data Engineer

Responsibilities:

  • Involved in architecture design, development, and implementation of Hadoop deployment, backup, and recovery systems.
  • Developed MapReduce programs in Python on Hadoop to parse the raw data, populate staging tables, and store the refined data in partitioned Hive tables.
  • Enabled speedy reviews and first-mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and Pig to pre-process the data.
  • Converted applications that were on MapReduce to PySpark, which performed the business logic.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Implemented Spark jobs using Scala and Spark SQL for faster testing and processing of data.
  • Imported Teradata datasets onto the Hive platform using Teradata JDBC connectors.
  • Involved in writing FastLoad and MultiLoad scripts to load the tables.
  • Worked with SQL Assistant to execute queries and stored procedures and update the tables.
  • Worked on extracting XML files using XPath and storing them in Hive tables.
  • Developed multiple Kafka producers and consumers as per the software requirement specifications (a minimal producer sketch follows this list).
  • Involved in designing the tables in Teradata while importing the data.
  • Developed the UNIX shell scripts for creating the reports from Hive data.
  • Experienced in managing and reviewing the Hadoop log files.
  • Main duties were resolving incidents and performing code migration from lower environments to production in case of code-related issues.
  • Responsible for code deployment into the production environment.
  • Developed Hive jobs to parse the logs and structure them in a tabular format to facilitate effective querying of the log data.
  • Developed Scala scripts and UDFs using DataFrames in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
  • Analyzed production issues to determine root causes and provided fix recommendations to the support team; created, developed, and tracked solutions to reported application errors.
  • Noted interruptions or bugs in operation and carried out mitigation and problem management.
  • Assisted with troubleshooting and issue resolution relating to current applications, providing assistance to development.
  • Understanding of ETL concepts of data flow, data enrichment, data consolidation, change data capture, and transformation.
  • Coordinated with support teams during application deployments.
  • Worked on system issues on production clusters such as file system issues, connection issues, and system slowness, and monitored the HDFS file system for all digital analytics.
  • Extensively used UNIX for shell scripting and pulling logs from the server.
  • Used Solr/Lucene for indexing and querying the JSON formatted data.
  • Worked on different file formats like sequence files, XML files, and map files using MapReduce programs.
  • Worked with the Avro data serialization system to handle JSON data formats.
  • Implemented workflows using the Apache Oozie framework to automate tasks.
  • Completed integration testing and tracked and resolved defects.
  • Worked on AWS services like EC2 and S3 for small datasets.
  • Involved in loading data from the UNIX file system to HDFS.
  • Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce extraction jobs, and ZooKeeper to provide coordination services to the cluster.
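
A minimal Scala sketch of a Kafka producer along the lines described above, using the standard Kafka Java client. The broker list, topic, and payload are placeholders.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

    object LogEventProducer {
      def main(args: Array[String]): Unit = {
        // Placeholder broker list; serializers for simple String key/value pairs.
        val props = new Properties()
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put(ProducerConfig.ACKS_CONFIG, "all")

        val producer = new KafkaProducer[String, String](props)
        try {
          // In the real pipeline the payload came from parsed application logs.
          val record = new ProducerRecord[String, String](
            "app-logs", "host-01", """{"level":"INFO","msg":"started"}""")
          producer.send(record)
        } finally {
          producer.close()
        }
      }
    }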

Environment: Hadoop Hortonworks 2.2, Hive, Pig, HBase, Scala, Sqoop, Flume, Oozie, AWS, S3, EC2, EMR, Spring, Kafka, SQL Assistant, Python, UNIX, Teradata.

Confidential, San Francisco, CA

Big data/Cloud Developer

Responsibilities:

  • Involved in building scalable distributed data solutions using PySpark on Cloudera Hadoop with Azure Data Factory pipelines.
  • Explored the PySpark framework on Azure Databricks to improve the performance and optimization of existing Hadoop algorithms using PySpark Core, Spark SQL, and Spark Streaming APIs.
  • Ingested data from relational databases into HDFS on a regular basis using Sqoop incremental imports.
  • Involved in developing PySpark applications to process and analyze text data from emails, complaints, forums, and clickstreams to achieve comprehensive customer care.
  • Extracted structured data from multiple relational data sources as DataFrames in Spark SQL on Databricks.
  • Implemented large-scale transformations and scheduling on Azure Databricks for advanced data analytics and provided data to downstream applications.
  • Worked on integrating Kafka for stream processing, website tracking, and log aggregation.
  • Involved in configuring and developing Kafka producers, consumers, topics, and brokers using Java.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which gets its data from Kafka in near real time and persists it into Cassandra and Redshift, implementing large data lake pipelines.
  • Handled large datasets using partitions and broadcasts in PySpark, with effective and efficient joins, transformations, and other operations applied during the ingestion process itself.
  • Involved in data modeling and ingesting data into Cassandra using CQL, Java APIs, and other drivers.
  • Involved in converting data from Avro format to Parquet format and vice versa (a brief sketch follows this list).
  • Transformed the DataFrames per the requirements of the data science team.
  • Involved in accessing Hive tables using HiveContext, transforming the data, and storing it to HBase.
  • Involved in creating Hive tables from a wide range of data formats like text, sequence, Avro, Parquet, and ORC.
  • Experienced in working with the Spark ecosystem using Spark SQL and Python queries on different formats like text and CSV files.
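
A brief Scala sketch of the Avro-to-Parquet conversion mentioned above (the actual work on this project used PySpark on Databricks). It assumes the spark-avro package is available; all paths are placeholders.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object AvroParquetConvert {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("avro-parquet-sketch")
          .getOrCreate()

        // Avro landing zone -> curated Parquet (placeholder paths).
        val avroDf = spark.read.format("avro").load("hdfs:///landing/events_avro/")
        avroDf.write.mode(SaveMode.Overwrite).parquet("hdfs:///curated/events_parquet/")

        // The reverse direction follows the same pattern.
        val parquetDf = spark.read.parquet("hdfs:///curated/events_parquet/")
        parquetDf.write.mode(SaveMode.Overwrite).format("avro")
          .save("hdfs:///exports/events_avro/")
      }
    }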

Environment: Spark, Scala, Maven, Hive, Kafka, Spark Streaming, PySpark, Oozie, MapReduce, Kerberos, Azure, Java.

Confidential

Hadoop Developer

Responsibilities:

  • Worked on extracting data from the Oracle database and loading it into the Hive database.
  • Used Spark Structured Streaming to perform the necessary transformations and actions on the fly from Kafka topics in real time and persist the results in Cassandra using the required connectors and drivers (a minimal sketch follows this list).
  • Integrated Kafka, Spark, and Cassandra for streamlined analytics to create a predictive model.
  • Worked on modifying and executing UNIX shell script files for processing data and loading it into HDFS.
  • Worked extensively on optimizing transformations for better performance.
  • Involved in carrying out important design decisions in creating UDFs and partitioning the data in Hive tables at two different levels based on related columns for efficient retrieval and processing of queries.
  • Tuned many options to get a performance boost, such as trying different executor counts and memory settings.
  • The team was also involved in maintenance, adding support for stable time zones across all records in the database.
  • Uploaded and processed more than 20 terabytes of data from various structured and unstructured, heterogeneous sources into HDFS using Sqoop and Flume, enforcing and maintaining uniformity across all tables.
  • Developed complex transformations using HiveQL to build aggregate/summary tables.
  • Developed UDFs in Java to implement functions according to the specifications.
  • Developed Spark scripts configured according to business logic, with good knowledge of the available actions.
  • Well versed with the HL7 international standards, as the data was organized according to this format.
  • Formatted and built analytics on top of the datasets that complied with HL7 standards.
  • Analyzed the JSON data using the Hive SerDe API to deserialize and convert it into a readable format.
  • Used Pig to do transformations, event joins, and some pre-aggregations before storing the data in HDFS.
  • Involved in improving and optimizing application performance using partitioning and bucketing in Hive tables, and developing efficient queries using map-side joins and indexes.
  • Worked with the downstream team to generate reports in Tableau.
  • Conducted code reviews to ensure stable system operations.
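
A minimal Scala sketch of the Kafka-to-Cassandra Structured Streaming flow described above, assuming Spark 2.4+ and the DataStax spark-cassandra-connector; the brokers, topic, schema, keyspace, and table names are placeholders.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    object KafkaToCassandra {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-to-cassandra-sketch")
          .config("spark.cassandra.connection.host", "cassandra-host") // placeholder
          .getOrCreate()
        import spark.implicits._

        // Hypothetical sensor payload schema.
        val schema = new StructType()
          .add("sensor_id", StringType)
          .add("reading", DoubleType)
          .add("event_time", TimestampType)

        // Stream from Kafka and parse the JSON value column.
        val readings = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
          .option("subscribe", "sensor-readings")            // placeholder topic
          .load()
          .select(from_json($"value".cast("string"), schema).as("r"))
          .select("r.*")

        // Write each micro-batch to Cassandra through the connector.
        val writeToCassandra: (DataFrame, Long) => Unit = (batch, _) =>
          batch.write
            .format("org.apache.spark.sql.cassandra")
            .options(Map("keyspace" -> "telemetry", "table" -> "readings")) // placeholders
            .mode("append")
            .save()

        readings.writeStream
          .foreachBatch(writeToCassandra)
          .option("checkpointLocation", "hdfs:///checkpoints/readings")
          .start()
          .awaitTermination()
      }
    }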

Environment: CDH 5.1.x, Hadoop, HDFS, MapReduce, Sqoop, Flume, Hive, SQL Server, TOAD, Oracle, Solr/Lucene, PL/SQL, Eclipse, Java, Shell scripting, Vertica, Unix, Cassandra.
