
Senior Data Engineer Resume


Sunnyvale, CA

SUMMARY

  • 8+ years of experience in Information Technology, including 6 years of Hadoop/Big Data processing and 2 years of exploratory data analysis.
  • Comprehensive working experience implementing Big Data projects using Apache Hadoop, Pig, Hive, HBase, Spark, Sqoop, Flume, Zookeeper, and Oozie.
  • Firm grip on data modeling, data marts, database performance tuning, and NoSQL/MapReduce systems.
  • Experience in managing and reviewing Hadoop log files
  • Real-time Hadoop/Big Data experience in storage, querying, processing, and analysis of data.
  • Hands-on experience setting up workflows with Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs (a minimal Airflow DAG sketch follows this list).
  • Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
  • Experience in setting up Hadoop clusters on cloud platforms like AWS.
  • Customized dashboards and performed identity and access management in AWS.
  • Worked with data serialization formats (Avro, Parquet, JSON, CSV) for converting complex objects into byte sequences.
  • Expertise in converting existing AWS infrastructure to serverless architecture (AWS Lambda, Kinesis) and deploying it with AWS CloudFormation.
  • Experience working on Hortonworks, Cloudera, and MapR distributions.
  • Excellent working knowledge of the HDFS filesystem and Hadoop daemons such as the ResourceManager, NodeManager, NameNode, DataNode, Secondary NameNode, and containers.
  • In-depth understanding of Apache Spark job execution components such as the DAG, lineage graph, DAG scheduler, task scheduler, stages, and tasks.
  • Experience working on Spark and Spark Streaming.
  • Hands-on experience with major components in Hadoop Ecosystem like Map Reduce, HDFS, YARN, Hive, Pig, HBase, Sqoop, Oozie, Cassandra, Impala and Flume.
  • Experience with the Hadoop 2.0 YARN architecture and developing YARN applications on it.
  • Worked on Performance Tuning to ensure that assigned systems were patched, configured and optimized for maximum functionality and availability. Implemented solutions that reduced single points of failure and improved system uptime to 99.9% availability
  • Experience with distributed systems, large-scale non-relational data stores and multi-terabyte data warehouses.
  • Expertise in extending Hive and Pig core functionality by writing custom UDFs and UDAFs.
  • Designed and created Hive external tables using a shared metastore (instead of Derby) with partitioning, dynamic partitioning, and bucketing.
  • Worked with different file formats like TEXTFILE, SEQUENCEFILE, AVRO, ORC, and PARQUET for Hive querying and processing.
  • Proficient in NoSQL databases like HBase.
  • Experience in importing and exporting data using Sqoop between HDFS and Relational Database Systems.
  • Populated HDFS with vast amounts of data using Apache Kafka and Flume.
  • Knowledge in Kafka installation & integration with Spark Streaming.
  • Hands-on experience building data pipelines using Hadoop components Sqoop, Hive, Pig, MapReduce, Spark, and Spark SQL.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data in various formats like text, zip, XML, and JSON.
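
Illustrative sketch only (not taken from the original projects): a minimal Airflow 2.x DAG of the kind described in the workflow-scheduling bullet above. The DAG name, schedule, commands, and paths are hypothetical placeholders.

# Minimal Airflow DAG sketch for scheduling a daily Hadoop job.
# Assumes Airflow 2.x; the spark-submit command and paths are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_hadoop_pipeline",          # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Ingest step: pull data into HDFS (placeholder shell command).
    ingest = BashOperator(
        task_id="ingest_to_hdfs",
        bash_command="echo 'sqoop import ...'",
    )

    # Transform step: run a Spark job over the ingested data.
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit /opt/jobs/transform.py",  # hypothetical path
    )

    ingest >> transform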

TECHNICAL SKILLS

Big Data Frameworks: Hadoop (HDFS, MapReduce), Spark, Spark SQL, Spark Streaming, Hive, Impala, Kafka, HBase, Flume, Pig, Sqoop, Oozie, Cassandra.

Big Data distributions: Cloudera, Hortonworks, Amazon EMR, Azure

Programming languages: Python, Shell scripting.

Operating Systems: Windows, Linux (Ubuntu, Cent OS)

Databases: Oracle, Microsoft SQL Server (MSSQL), MySQL, PostgreSQL, NoSQL

Designing Tools: UML, Visio

IDEs: Eclipse, NetBeans

Python Technologies: Pandas, NumPy, NLP.

Web Technologies: XML, HTML, JavaScript, jQuery, JSON

Linux Experience: System Administration Tools, Puppet

Development methodologies: Agile, Waterfall

Logging Tools: Log4j

Application / Web Servers: Apache Tomcat, WebSphere

Messaging Services: ActiveMQ, Kafka, JMS

Version Tools: Git and CVS

Others: Putty, WinSCP, Data Lake, Talend, AWS, Terraform

PROFESSIONAL EXPERIENCE

Confidential, Sunnyvale, CA

Senior Data Engineer

Responsibilities:

  • Built and architected multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinated tasks among the team.
  • Created Airflow scheduling scripts in Python and Scala.
  • Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data in HDFS using Scala (a PySpark sketch of this pattern follows this list).
  • Automated resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
  • Installed and configured Apache Airflow against the GCS buckets and the BigQuery data warehouse, and created DAGs to run in Airflow, including Bigtable jobs.
  • Wrote shell scripts to monitor the health of the Hadoop daemon services and respond accordingly to any warning or failure conditions.
  • Familiarity with NoSQL databases such as Cassandra.
  • Wrote shell scripts for rolling day-to-day processes and automated their execution.
  • Involved in story-driven agile development methodology and actively participated in daily scrum meetings.
  • Installed and configured MapReduce, Hive, and HDFS. Developed Spark scripts in Java as required to read/write JSON files. Imported and exported data into HDFS and Hive using Sqoop.
  • Populated HDFS and Cassandra with huge amounts of data using Apache Kafka.
  • Integrated Apache Storm with Kafka to perform web analytics. Uploaded clickstream data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
  • Compared self-hosted Hadoop with GCP's Dataproc, and explored Bigtable (managed HBase) use cases and performance evaluation.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Developed Hive scripts in Hive QL to de-normalize and aggregate the data.
  • Created HBase tables and column families to store the user event data.
  • Scheduled and executed workflows in Oozie to run various jobs.
  • Streamed data in real time using Spark with Kafka.
  • Architected, designed, and developed Hadoop ETL using Kafka.
  • Responsible for creating Hive tables, loading the structured data resulted from MapReduce jobs into the tables and writing hive queries to further analyze the logs to identify issues and behavioral patterns.
  • Created Kafka applications that monitor consumer lag within Apache Kafka clusters; used in production by multiple report suites.
  • Used GCP and native GCP tools to load data onto the platform, as well as on-prem Kafka, Confluent Kafka, NiFi, Pub/Sub, etc. to move data between on-prem and cloud platforms.
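
A minimal PySpark sketch of the Kafka-to-HDFS streaming pattern referenced above (the production pipeline was written in Scala). The broker address, topic, and HDFS paths are placeholders, and the spark-sql-kafka-0-10 package is assumed to be on the classpath.

# PySpark Structured Streaming sketch: read from Kafka, write to HDFS as Parquet.
# Broker, topic, and HDFS paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka_to_hdfs").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
    .option("subscribe", "events")                        # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    # Kafka delivers key/value as binary; keep the value as a string payload.
    .select(col("value").cast("string").alias("payload"))
)

query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/events/")               # placeholder HDFS path
    .option("checkpointLocation", "hdfs:///checkpoints/events/")
    .outputMode("append")
    .start()
)

query.awaitTermination()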

Environment: Hadoop, Kafka, Teradata, HBase, Oozie, Zookeeper, Flume, Airflow, GCP, NiFi, BigQuery, Bigtable

Confidential, Cleveland, OH

Senior Data Engineer

Responsibilities:

  • Worked on importing and exporting data from Oracle and DB2 into HDFS and HIVE using Sqoop for analysis, visualization and to generate reports.
  • Used Stackdriver Monitoring in GCP to check alerts for the applications running on Google Cloud Platform, which were deployed on GCP using Google Cloud Deployment Manager.
  • Worked with Google Cloud Functions for event-driven data ingestion and routed data to Pub/Sub.
  • Implemented PySpark and Spark SQL for faster testing and processing of data.
  • Developed multiple MapReduce jobs in Java for data cleaning.
  • Developed Hive UDF to parse the staged raw data to get the Hit Times of the claims from a specific branch for a particular insurance type code.
  • Stored and loaded data from HDFS to Amazon S3 and backed up the namespace data.
  • Worked with data delivery teams to set up new Hadoop users, including setting up Linux users, setting up Kerberos principals, and testing HDFS and Hive access.
  • Implemented large scale pub/sub message queues using Apache Kafka
  • Involved in creating Hadoop streaming jobs using Python.
  • Ran data formatting scripts in Python and created terabyte-scale CSV files to be consumed by Hadoop MapReduce jobs.
  • Ran many performance tests using the cassandra-stress tool in order to measure and improve the read and write performance of the cluster.
  • Created data model for structuring and storing the data efficiently. Implemented partitioning and bucketing of tables in Cassandra.
  • Worked on migrating MapReduce programs into PySpark transformations.
  • Built wrapper shell scripts to hold Oozie workflow.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala (see the PySpark sketch after this list).
  • Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie.
  • Developed and deployed the results using Spark and Scala code on a Hadoop cluster running on GCP.
  • Enabled concurrent access to Hive tables with shared and exclusive locking, supported by the ZooKeeper implementation in the cluster.
  • Designed and developed an event-triggered data pipeline based on Cloud Pub/Sub for ingestion of PII and non-PII data into the landing area on Google Cloud Storage (GCS) buckets.
  • Worked on various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
  • Expert in implementing advanced procedures like text analytics and processing using in-memory computing capabilities such as Apache Spark, written in Scala.
  • Used Scala to write the code for all the use cases in Spark and Spark SQL.
  • Expertise in implementing Spark and Scala application using higher order functions for both batch and interactive analysis requirement.
  • Implemented Spark batch jobs.
  • Designed the data models to be used in data intensive AWS Lambda applications which are aimed to do complex analysis creating analytical reports for end-to-end traceability, lineage, definition of Key Business elements from Aurora.
  • Created a Lambda deployment function and configured it to receive events from an S3 bucket.
  • Worked with Spark core, Spark Streaming and Spark SQL module of Spark.
  • Worked on reading multiple data formats on HDFS using PySpark.
  • Worked on distributed/cloud computing (MapReduce/Hadoop, Pig, HBase, Avro, Zookeeper, etc.) and Amazon Web Services (S3, EC2, EMR, etc.).
  • Provided ad-hoc queries and data metrics to the Business Users using Hive, Pig.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Used Hive and created Hive tables and involved in data loading and writing Hive UDFs.
  • Developed a POC for Apache Kafka and implemented a real-time streaming ETL pipeline using the Kafka Streams API.
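
As an illustration of converting a HiveQL query into Spark transformations (see the bullet above), a minimal PySpark sketch; the table and column names are hypothetical.

# Sketch of rewriting a HiveQL aggregation as PySpark DataFrame transformations.
# Table and column names (claims, branch, insurance_type) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hive_to_spark").enableHiveSupport().getOrCreate()

# Original HiveQL (for reference):
#   SELECT branch, insurance_type, COUNT(*) AS hits
#   FROM claims
#   GROUP BY branch, insurance_type;

claims = spark.table("claims")  # reads the Hive table via the shared metastore

hits = (
    claims
    .groupBy("branch", "insurance_type")
    .agg(F.count("*").alias("hits"))
)

hits.show()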

Environment: Hadoop, MapReduce, HDFS, Hive, Hue, Pig, HBase, Cloudera, Impala, MongoDB, Kafka, Teradata, Oozie, Zookeeper, Flume, GCP.

Confidential, Boca Raton, FL

Senior Big Data Developer

Responsibilities:

  • Ability to spin up different AWS instances, including EC2-Classic and EC2-VPC, using CloudFormation templates.
  • Used the AWS CLI to suspend an AWS Lambda function processing an Amazon Kinesis stream and then resume it (a boto3 sketch of this operation follows this list).
  • Developed highly optimized Spark applications to perform various data cleansing, validation, transformation and summarization activities according to the requirement
  • Developed Spark applications using Scala, utilizing DataFrames and the Spark SQL API for faster processing of data.
  • Used Airflow for scheduling the Hive, Spark and MapReduce jobs.
  • Developed and deployed Dataflow jobs to write event data from Pub/Sub to BigQuery and from Pub/Sub to Pub/Sub (see the Apache Beam sketch after this list).
  • Implemented GCS, BigQuery, App Engine, Dataflow, Google Container Engine, and VPC.
  • Worked on POC to check various cloud offerings including Google Cloud Platform (GCP).
  • Used the GCP Cloud Console to monitor Dataproc clusters and jobs and Stackdriver to monitor dashboards; performed tuning and optimization of memory-intensive jobs and provided L3 support for applications in the production environment.
  • Provided the permissions and required access to all the Pub/Sub topics and sinks to push/write the data to Stackdriver.
  • Worked with and learned a great deal from Amazon Web Services (AWS) cloud services like EC2, S3, EBS, RDS, and VPC.
  • Implemented a variety of AWS computing and networking services to meet application needs.
  • Developed Spark scripts as per the requirement.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed Spark jobs and Hive Jobs to summarize and transform data.
  • Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL database for huge volume of data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
  • Substantial experience working with Big Data infrastructure tools such as Python and on GCP using Cloud Storage, Cloud Pub/Sub, BigQuery and Data Studio
  • Advanced knowledge on Confidential Redshift and MPP database concepts.
  • Migrated on premise database structure to Confidential Redshift data warehouse
  • Knowledge of the GCP ecosystem: BigQuery and Bigtable.
  • Built a data pipeline consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and analyze operational data.
  • Worked on ETL migration services by developing and deploying AWS Lambda functions for a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena.
  • Worked on batch processing and real-time data processing on Spark Streaming using the Lambda architecture.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on the fly for building the common learner data model, which gets data from Kafka in near real time and persists it into Cassandra.
  • Designed the solution and developed the data ingestion program using Sqoop, MapReduce, shell scripts, and Python.
  • Responsible for building scalable distributed data solutions using Amazon EMR clusters.
  • Worked on POCs with Apache Spark using Scala to introduce Spark into the project.
  • Consumed data from Kafka using Apache Spark.
  • Analyzed the SQL scripts and designed the solution to implement using Scala.
  • Involved in file movements between HDFS and AWS S3 and extensively worked with S3 buckets in AWS.
  • Designed AWS CloudFormation templates to create VPCs, subnets, and NAT to ensure successful deployment of web applications and database templates.
  • Created S3 buckets, managed S3 bucket policies, and utilized S3 and Glacier for storage and backup on AWS.
  • Ingested syslog messages, parsed them, and streamed the data to Apache Kafka.
  • Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the data into HDFS.
  • Exported the analyzed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.
  • Collecting and aggregating large amounts of log data using Flume and staging data in HDFS for further analysis
  • Performed transformations, cleaning and filtering on imported data using Hive, MapReduce, and loaded final data into HDFS
  • Worked extensively in Python and built the custom ingest framework.
  • Analyzed the data by performing Hive queries (Hive QL) to study customer behavior.
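
A minimal Apache Beam (Python SDK) sketch of the Pub/Sub-to-BigQuery Dataflow job described above. The subscription, table, and schema are placeholders; the pipeline would be deployed as a Dataflow job via the DataflowRunner with the usual project/region options.

# Apache Beam streaming sketch: read JSON events from Pub/Sub, write to BigQuery.
# Subscription, table, and schema are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/events-sub"  # placeholder
        )
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-project:analytics.events",                         # placeholder
            schema="event_id:STRING,event_ts:TIMESTAMP,payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )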
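
The AWS CLI suspend/resume of the Lambda-Kinesis event source mapping mentioned above can also be sketched with boto3; a minimal sketch, assuming the event source mapping UUID is known (the UUID below is a placeholder).

# boto3 sketch: pause and resume a Lambda function's Kinesis event source mapping.
# Equivalent to `aws lambda update-event-source-mapping --uuid ... --no-enabled`
# followed by `--enabled`; the mapping UUID is a placeholder.
import boto3

lambda_client = boto3.client("lambda")

MAPPING_UUID = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

def suspend_mapping(uuid: str) -> None:
    """Stop the Lambda function from polling the Kinesis stream."""
    lambda_client.update_event_source_mapping(UUID=uuid, Enabled=False)

def resume_mapping(uuid: str) -> None:
    """Resume polling the Kinesis stream."""
    lambda_client.update_event_source_mapping(UUID=uuid, Enabled=True)

if __name__ == "__main__":
    suspend_mapping(MAPPING_UUID)
    # ... perform maintenance on the downstream consumer ...
    resume_mapping(MAPPING_UUID)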

Environment: Windows 10, Office 365, Hive, Oozie, PySpark, MapReduce, Python, AWS, Sqoop, Spark, SQL, NoSQL, HBase, Flume, Impala, Kafka, Hortonworks, Hue, XML, JSON, CSV.

Confidential

Data Engineer

Responsibilities:

  • Worked on Hive for exposing data for further analysis and for generating transformation files from different analytical formats to text files.
  • Wrote shell scripts to monitor the health of the Hadoop daemon services and respond accordingly to any warning or failure conditions (a Python sketch of this check follows this list).
  • Managed and scheduled Jobs on a Hadoop cluster.
  • Implemented and maintained various projects in Java.
  • Assisted in designing, building, and maintaining a database to analyze life cycle of checking and debit transactions.
  • Excellent application development skills with strong experience in Object Oriented Analysis.
  • Extensively involved throughout Software Development Life Cycle (SDLC).
  • Strong experience with XML, Web Services, WSDL, SOAP, and TCP/IP.
  • Strong experience in software and system development using JSP, Servlet, JSF, EJB, JDBC, Struts, Maven, Subversion, Trac, JUnit, and SQL.
  • Worked with several clients, handling day-to-day requests and responsibilities.
  • Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, HBase, Zookeeper, and Sqoop.
  • Involved in analyzing system failures, identifying root causes, and recommending courses of action.
  • Designed ERDs (Entity Relationship Diagrams) for the relational database.
  • Extensively used SQL, PL/SQL, Triggers, and Views using IBM DB2.
  • Utilized Python and MySQL from day to day to debug and fix issues with client processes.
  • Developed, tested, and implemented a financial-services application to bring multiple clients into standard database format.
  • Rich experience in database design and hands-on experience with large database systems (Oracle 8i/9i, DB2) and languages like SQL and PL/SQL.
  • Worked with WebLogic Application Server and WebSphere Application Server application deployment technology.
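
A minimal Python sketch of the daemon health check described in the first bullet of this section, assuming the JDK's jps utility is available on the node; the daemon list and the alerting step are illustrative only.

# Sketch of a Hadoop daemon health check (Python rendering of the shell scripts
# described above). Assumes `jps` is on the PATH; the daemon list and the
# alerting step are illustrative placeholders.
import subprocess

EXPECTED_DAEMONS = {"NameNode", "DataNode", "ResourceManager", "NodeManager"}

def running_daemons() -> set:
    """Return the set of Hadoop daemon names reported by `jps`."""
    output = subprocess.run(["jps"], capture_output=True, text=True, check=True).stdout
    return {line.split()[1] for line in output.splitlines() if len(line.split()) > 1}

def check_health() -> None:
    missing = EXPECTED_DAEMONS - running_daemons()
    if missing:
        # Placeholder alert: in practice this could mail an operator or page on-call.
        print(f"WARNING: daemons not running: {', '.join(sorted(missing))}")
    else:
        print("All expected Hadoop daemons are running.")

if __name__ == "__main__":
    check_health()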

Environment: Apache Hadoop, HDFS, Cassandra, MapReduce, HBase, Impala, Kafka, MySQL, Amazon, DB Visualizer, Linux, Sqoop, Apache Hive, Apache Pig, InfoSphere, Python, Scala, NoSQL, Flume, Oozie

Confidential

Data Analyst

Responsibilities:

  • Created SQL scripts for the Oracle database.
  • Acquired data from primary and secondary data sources and maintained databases.
  • Worked with management to prioritize business and information needs.
  • Provided support for analytics process monitoring and troubleshooting.
  • Redesigned the data after data cleaning and data mining, and produced results used for creating reports and dashboards in Power BI.
  • Interacted with business clients on organizational statistics, which helped to reduce losses and increase margins.
  • Interpreted data and analyzed results using statistical techniques.
  • Developed and implemented databases, data collection systems, data analytics, and other strategies that optimize statistical efficiency and quality.
  • Supported business users by answering complex business questions via ad-hoc SQL queries (a pandas/MySQL sketch follows this list).
  • Developed and maintained applications and databases by evaluating client needs, analyzing requirements, and developing software systems.
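
A minimal sketch of the ad-hoc SQL analysis workflow mentioned above, using pandas against MySQL in a Jupyter-style session. The connection string, table, and column names are hypothetical, and SQLAlchemy plus a MySQL driver such as PyMySQL are assumed.

# Sketch of an ad-hoc analysis query run from Python against MySQL.
# Connection string, table, and column names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://analyst:password@db-host/sales")  # placeholder

query = """
    SELECT region, SUM(amount) AS total_sales
    FROM transactions
    WHERE txn_date >= '2020-01-01'
    GROUP BY region
    ORDER BY total_sales DESC
"""

# Pull the aggregated result into a DataFrame for further analysis or export
# to a Power BI data source.
sales_by_region = pd.read_sql(query, engine)
print(sales_by_region.head())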

Environment: Jupyter, MySQL, Excel, Power BI, GIT, GitHub, JSON, RESTful, HTML5, CSS3, JavaScript, Rally, Agile/Scrum
