Senior Data Engineer Resume OR - Hire IT People

SUMMARY

Senior Data Engineer with 6+ years of Big Data experience across the Hadoop ecosystem, covering ingestion, data modeling, querying, processing, storage, analysis, data integration and implementation of enterprise-level Big Data systems.
Excellent knowledge of Hadoop architecture and the daemons of Hadoop clusters, including NameNode, DataNode, ResourceManager, NodeManager and JobHistory Server.
Experience working with Hortonworks and Cloudera Hadoop distributions, MapR and EMR.
Experience in designing end-to-end scalable architectures to solve business problems using various Azure services: HDInsight, Data Factory, Data Lake, Databricks and Machine Learning Studio.
Experience in configuring Azure Linked Services and Integration Runtimes to set up pipelines using Azure Data Factory (ADF) and automating them using Azure Scheduler.
Experience in using Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
Migrated an existing on-premises application to AWS. Designed, built and deployed a multitude of applications utilizing the AWS stack (EC2, S3, EMR), focusing on high availability, fault tolerance and auto-scaling.
Experienced with Amazon Web Services (AWS) cloud services: Elastic Compute Cloud (EC2), Simple Storage Service (S3), EBS, RDS and Elastic MapReduce (EMR).
Experience in developing web-based client/server applications. Designed and developed professional web applications using front-end technologies like HTML, CSS, jQuery, Bootstrap and Angular 2, and back-end technologies like Servlets, JSP, JDBC, Spring, Hibernate, Spring MVC and Web Services.
Hands-on experience in coding MapReduce/YARN programs using Java, Scala and Python for analyzing Big Data, and strong experience in building data pipelines using Big Data technologies.
Experience in Creating real-time data streaming solutions using Spark Core, Spark SQL, Kafka, Spark Streaming, Apache Storm.
Experience in importing and exporting data between relational databases such as MySQL, Teradata, Oracle and DB2 and HDFS using Sqoop; also experienced with data formats like JSON, Avro, Parquet, RC and ORC and compression codecs like Snappy and Gzip.
Experience in Extraction, Transformation and Loading (ETL) of data from multiple sources like flat files and databases, and integration with popular NoSQL databases for huge volumes of data.
Hands-on experience working with NoSQL databases including HBase, Cassandra and MongoDB and their integration with Hadoop clusters for huge volumes of data.
Experience in data processing tasks such as collecting and aggregating data from various sources using Apache Kafka and Flume.
Hands-on experience in working with Flume to load log data from multiple sources directly into HDFS.
Configured Spark Streaming to receive real-time data from Kafka, store the stream data in HDFS and process it using Spark and Scala; exposure to using Apache Kafka to develop data pipelines of logs as streams of messages using producers and consumers.
Strong experience with Pig, Hive and Impala analytical functions, and with extending Hive, Impala and Pig core functionality by writing custom User Defined Functions (UDFs).
Expertise in working with the Hive data warehouse tool: creating tables, distributing data by implementing partitioning and bucketing, and writing and optimizing HiveQL queries.
Expertise in working with internal and external tables in Hive and batch processing jobs using MapReduce and Hive.
Experience in analyzing data using HiveQL, Pig Latin and custom MapReduce programs in Java, and in writing Hive ACID tables and Pig queries for data analysis to meet business requirements.
In-depth understanding of Spark architecture including Spark Core, RDDs, DataFrames, Datasets, Spark SQL and Spark Streaming, and experience in importing data from HDFS into Spark RDDs for in-memory computation to generate the output response.
Hands-on experience using Hive tables from Spark, performing transformations and creating DataFrames on Hive tables using Spark SQL.
Experience in converting Hive/SQL queries into RDD transformations using Spark, Scala and PySpark.
Worked with Apache Spark, which provides a fast and general engine for large-scale data processing, integrated with PySpark and the functional programming language Scala.
Experience in developing and designing POCs deployed on the YARN cluster and comparing the performance of Spark with Hive and SQL/Oracle.
Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems.
Expertise in Oozie for configuring, scheduling, automating and managing job workflows based on time-driven and data-driven triggers.
Designed and developed automation test scripts using Python. Experience working with Python, Linux/UNIX and shell scripting.
Experience with BI tools like Tableau for report creation and further analysis.
Good knowledge of machine learning algorithms, including supervised and unsupervised techniques.

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, MapReduce, YARN, Pig, Hive, Impala, Sqoop, Talend, Flume, Kafka, Oozie, Spark, ZooKeeper, NiFi, Glue

Hadoop Technologies and Distributions: Apache Hadoop, YARN, Cloudera CDH3, CDH4, Hortonworks, MapR

Operating System: Linux, Ubuntu, Windows (7/8/10)

Languages: C, Java, Scala, Python, Shell Scripting

Databases: Oracle, MySQL, Teradata, DB2

NoSQL: HBase, Cassandra, MongoDB

IDE Tools: Eclipse, NetBeans, IntelliJ

Java Technologies: Servlets, JSP, JDBC, Spring, Hibernate, Spring MVC, Spring Boot, Spring Security, Spring REST

Web Technologies: HTML, CSS, Bootstrap, JavaScript, jQuery, AJAX, Angular 2, PHP

Cloud Services: AWS (EC2, S3, EBS, RDS, EMR, IAM), Azure (ADLS, Data Factory, Databricks, HDInsight)

Build Tools: Maven, SBT, CBT

Version Control: SVN, Git, Bitbucket

BI Tools: Power BI, Tableau

PROFESSIONAL EXPERIENCE

Confidential, OR

Senior Data Engineer

Responsibilities:

Hands-on development and implementation on Big Data Management Platform (BMP) using Hadoop 2.x, HDFS, MapReduce/Yarn/Spark, Hive, Airflow, Sqoop and other Hadoop eco-system components as Data Storage and Retrieval systems.
Automating the jobs and retrieving the data from Teradata and pushing the result dataset to Hadoop Distributed File System and running MR and Hive jobs using Airflow (Workflow management).
Created data pipelines from Teradata to Snowflake.
Imported data from AWS S3 storage into Snowflake and performed transformations and actions on Snowflake.
Teradata concepts were used for the early instance creation with the DBMS concepts.
Helped troubleshoot Scala problems while working with MicroStrategy to produce illustrative reports and dashboards along with ad hoc analysis.
Developed Python scripts to automate manual tasks for middleware applications.
Worked extensively on the user interface for a few modules using HTML, JSPs, JavaScript and Python. Wrote Python migration scripts for the web application.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala, and in implementing encryption mechanisms using Python.
Created a load balancer on AWS EC2 for an unstable cluster.
Worked on Hive to expose data for further analysis and to transform files from different analytical formats into text files. Also used HBase in conjunction with Pig/Hive as and when required for real-time, low-latency queries.
Worked on importing and exporting data from Teradata and DB2 into Snowflake and Hive using Airflow DAGs.
Optimized Hive queries to extract the customer information from HDFS. Designed logical data models, generated DDL and DML scripts to extract the data as per requirement.
Developed MapReduce jobs to calculate the total usage of data by commercial routers in different locations.
Wrote MapReduce jobs to generate reports on the number of activities created on a particular day from data dumped from multiple sources, and wrote the output back to HDFS.
Performed transformations, cleaning and filtering on imported data using Hive and MapReduce, and loaded the final data into HDFS.
Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team. Created databases using HBase/Python MapReduce to replace Oracle databases.
Good experience in setting up and configuring clusters in AWS. Documented a tool to perform chunked uploads of big data into Google BigQuery.
Designed and developed automation test scripts using Python. Created visual trends and calculations in Tableau on customers and products data as per client requirement.
Wrote Spark programs in Scala and Python for data quality checks. Provided fault tolerance in the presence of machine failures using a streaming tool. Reported the data to analysts for further tracking of trends across various consumers.

Environment: Hadoop MapReduce, Snowflake, Hive, Sqoop, Teradata, HBase, ZooKeeper, Shell Scripts, PySpark, Spark, Python, Spark SQL, Spark Streaming, Kafka, Oracle, IntelliJ, AWS, S3, EC2, Databricks.

Confidential, Plano TX

Senior Data Engineer

Responsibilities:

Involved in the complete Big Data flow of the application, from data ingestion from upstream into HDFS through processing and analyzing the data in HDFS.
Responsible for developing efficient MapReduce programs on AWS cloud for more than 20 years' worth of claim data to detect and separate fraudulent claims.
Uploaded and processed more than 30 terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop and Flume.
Used TIBCO Jaspersoft Studio for iReport analysis on AWS cloud.
Hands-on experience in installing, configuring and monitoring HDFS clusters (on-premises and AWS cloud).
Implemented a process to automatically update the Hive tables by reading a change file provided by users.
Expertise in designing and optimizing complex Spark SQL queries, joins and transformation rules to create DataFrames as per the requirements.
Implemented Extract/Transform/Load through Kafka-Spark-MongoDB integration as per the requirements.
Transferred data from different data sources into HDFS using Kafka producers, consumers and brokers, with ZooKeeper as the coordinator between the different Kafka brokers.
Used Spark SQL to load JSON data, create DataFrames and load them into Hive tables, and handled structured data using Spark SQL.
Developed Spark code in Scala using the Spark Core, Spark Streaming and Spark SQL APIs for faster testing and processing of data. Loaded data into Spark RDDs and performed in-memory computation to generate the output response with less memory usage.
Developed and maintained workflow scheduling jobs in Airflow for importing data from RDBMS to Hive. Developed Spark jobs and Hive jobs to summarize and transform data.
Data extraction, Data integration from different sources into Hadoop by ETL pipelines - Sqoop, Hive, Spark.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, DataFrames and Scala.
Developed Spark programs in Scala and applied principles of functional programming to process complex unstructured and structured data sets.
Analyzed cluster configurations and set the driver memory, executor memory and number of cores accordingly.
Transformed data formats and connected to the MongoDB environment from Spark as required, with the resulting data written into MongoDB.
Automation of all the jobs starting from pulling the Data from Oracle and pushing the result dataset to Hadoop Distributed File System and running MR and Hive jobs using Airflow (Work Flow management).
Developed flow XML files using Apache NiFi, a workflow automation tool, to ingest data into HDFS.
Worked on performance tuning of Apache NiFi workflows to optimize data ingestion speeds.
Hands-on experience in Spark Streaming to ingest data from multiple data sources into HDFS.
Migrated an existing on-premises application to AWS. Designed, built and deployed a multitude of applications utilizing the AWS stack (EC2, S3, EMR), focusing on high availability, fault tolerance and auto-scaling.
Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
Worked with Amazon Web Services (AWS) cloud services: Elastic Compute Cloud (EC2), Simple Storage Service (S3), EBS, RDS and Elastic MapReduce (EMR).
Dealt with several source systems (RDBMS, HDFS, S3) and file formats (text, JSON, ORC, Parquet, Avro) to ingest, transform and persist data in Hive for further downstream consumption.
Used Reporting tools like Tableau to connect with Hive for generating daily reports of data.
Used the secure global infrastructure and security features of AWS to protect data in the cloud. Hands-on experience with AWS cloud services (VPC, EC2, S3, RDS, Redshift, Data Pipeline, EMR, DynamoDB, WorkSpaces, Lambda, Kinesis, SNS, SQS).
Good experience with AWS Elastic Block Store (EBS), its different volume types and the use of various EBS volume types based on requirements.

Environment: Hadoop MapReduce, Hive, Sqoop, NiFi, Teradata, HBase, ZooKeeper, Shell Scripts, Spark, Python, Spark SQL, Spark Streaming, Kafka, Oracle, IntelliJ, Azure Data Lake, Data Factory, Databricks.

Confidential, Irving TX

Big Data Developer

Responsibilities:

Created pipelines in Azure Data Factory (ADF) by configuring Linked Services and Integration Runtimes to extract, transform and load data from different sources into Azure Data Lake Store (ADLS), Azure SQL, Blob Storage, Azure SQL Data Warehouse and the write-back tool, and back.
Extracted, parsed, cleaned and ingested incoming web feed data and server logs into HDInsight and Azure Data Lake Store, handling both structured and unstructured data.
Planned and developed roadmaps and deliverables to advance the migration of existing on-premises systems and applications to Azure cloud, proposed architectures that account for cost/spend in Azure, and developed recommendations to right-size data infrastructure.
Set up the Databricks Enterprise Platform environment and created a cross-account role in Azure for Databricks to provision Spark clusters with access to Azure Data Lake Store (ADLS).
Provisioned Hadoop and Spark clusters on Azure HDInsight to build an on-demand data warehouse that processes petabytes of data and provides datasets to the data scientists.
Programmed in Hive, Spark SQL and Python to streamline the incoming data and build the data pipelines to get the useful insights, and orchestrated pipelines using Azure Data Factory.
Worked with ORC and Parquet file formats on HDInsight, Azure Blobs and Azure Tables to store raw data.
Imported data from different sources into HDFS using Sqoop and created Hive internal and external tables.
Performed incremental loads using Sqoop commands and automated the process for the consistent refresh and made the data available to Data Scientist team to train their models.
Wrote shell scripts and automated incremental loads and Sqoop, Hive and Spark jobs using Oozie and crontab schedulers.
Involved in Hive performance optimizations like partitioning and bucketing, performing several types of joins on Hive tables and implementing Hive SerDes like JSON and Avro.
Worked on migrating an existing feed from Hive to Spark. To reduce the latency of feeds, the existing HQL was transformed to run using Spark SQL and HiveContext.
Built an efficient real-time data processing pipeline using Kafka, Spark Streaming and HBase for processing incoming trades instantly.
Developing NiFi flows to monitor the Logs of NiFi, Yarn applications and sending alerts to Operations team.
Wrote a Kafka producer in Java to consume messages from JMS queues and used Avro serialization to send the stream to Kafka brokers for partitioning and distribution across the cluster.
Wrote a Kafka-Spark Streaming module acting as a Kafka consumer, which executes the business logic on the trades using Spark Structured Streaming, DataFrames and RDD methods.
Developed Spark applications using Python (PySpark) for easy Hadoop transitions and used Spark APIs over Hortonworks Hadoop YARN to perform analytics on data in Hive.
Developed Spark code using Spark SQL and Spark Streaming for faster testing and processing of data.
Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism and memory tuning. Handled large datasets using partitioning, Spark in-memory capabilities, broadcast variables, efficient joins and transformations during the ingestion process itself.
Developed Python and Scala scripts and UDFs using Spark SQL, Datasets, DataFrames and RDDs in Spark for data aggregation.

Environment: HDFS, Hadoop MapReduce, Hive, Sqoop, Airflow, RDBMS, HBase, ZooKeeper, Shell Scripting, Spark, Scala, Spark SQL, Spark Streaming, Kafka, Oracle, MongoDB, IntelliJ, SBT, AWS

Confidential, Phoenix AZ

Hadoop Developer

Responsibilities:

As a Hadoop Developer, worked on Hadoop ecosystem components including Hive, Spark, HBase, ZooKeeper, Oozie, Spark Streaming and MCS (MapR Control System) with the MapR distribution.
Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
Worked with data extracted from two different sources, MySQL and web servers, and used Sqoop to import and export data between HDFS and RDBMS for visualization and report generation.
Developed simple and complex MapReduce programs in Java for Data Analysis on different data formats.
Collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis and used Sqoop to efficiently transfer data between databases and HDFS.
Built code for real-time data ingestion using Java, MapR Streams (Kafka) and Storm.
Worked on importing metadata into Hive using Sqoop and migrated existing tables and applications to work on Hive for further analysis according to business requirements.
Created Hive internal and external tables defined with appropriate static and dynamic partitions for efficiency, and developed Hive queries and UDFs to analyze and transform the data in HDFS.
Used HiveQL to analyze partitioned and bucketed data, and executed Hive queries on Parquet tables stored in Hive to perform data analysis that meets the business specification logic.
Worked on Spark using Scala and Spark SQL for faster testing and processing of data.
Used Spark for interactive queries, processing of streaming data and integration with popular SQL databases.
Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs; used Spark SQL to load JSON data, create DataFrames, load them into Hive tables and handle structured data.
Used Spark Streaming APIs to perform transformations and actions for building a common data model that gets data from Kafka in near real time and persists it to Cassandra.
Good understanding of Cassandra architecture, replication strategy, gossip and snitch; used the DataStax Spark Cassandra Connector to load data to and from Cassandra.
Experienced in creating data models for the client's transactional logs, and analyzed data from Cassandra tables for quick searching, sorting and grouping using the Cassandra Query Language (CQL).
Implemented a process to automatically update the Hive tables by reading a change file provided by business.
Used Oozie and Control-M workflow engines for managing and scheduling Hadoop jobs.
Wrote shell scripts to monitor the health of Hadoop daemon services and respond accordingly to any warnings and failure conditions.
Used Oozie Operational Services for batch processing and scheduling workflows dynamically.
Validated with various POC applications to eventually adopt them to benefit from the Big Data Hadoop.
Worked with Amazon Web Services (AWS) cloud services like EC2, S3, EBS, RDS and VPC. Migrated an existing on-premises application to AWS.
Used AWS computing and networking services to meet application needs. Hands-on experience with AWS services such as Redshift clusters and Route 53 domain configuration.
Built an on-demand, secure EMR launcher with custom spark-submit steps using S3 events, SNS, KMS and a Lambda function. Extensive knowledge of working with NiFi.

Environment: HDFS, Hadoop MapReduce, Hive, Sqoop, RDBMS, HBase, ZooKeeper, Shell Scripting, Spark, Scala, Kafka, Cassandra.

Confidential

Hadoop Developer

Responsibilities:

Experience in configuration, supporting and monitoring Hadoop cluster using Cloudera distribution.
Worked in an Agile Scrum development model, analyzing the Hadoop cluster and different Big Data analytic tools including MapReduce, Pig, Hive, Flume, Oozie and Sqoop.
Configured Hadoop MapReduce, HDFS, developed MapReduce jobs in Java for data cleaning, preprocessing.
Established custom MapReduce programs to analyze data and used Pig Latin to clean unwanted data.
Involved in creating Hive tables and writing Hive queries that run internally as MapReduce jobs.
Implemented partitioning, dynamic partitions and buckets in Hive to improve performance.
Involved in loading and transforming data sets in different formats, including structured and semi-structured data.
Involved in scheduling Oozie workflow engine to run jobs automatically.
Implemented NoSQL databases like HBase for storing and processing different formats of data.
Involved in Testing and coordination with business in User testing.
Involved in Unit testing and delivered Unit test plans and results documents.

Environment: Apache Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, Oozie, HBase, UNIX shell scripting, ZooKeeper, Java, Eclipse.

Confidential

Java/ J2EE Developer

Responsibilities:

Worked in the Agile/Scrum development environment and actively participated in scrum meetings and involved in the analysis, design, and development phase of the application.
Developed the application using JSP, Servlets and Hibernate.
Designed components for the project using the design patterns such as Model-View-Controller (MVC).
Extensively used the Spring Core for Dependency Injection (DI), Inversion of Control (IOC).
Used Hibernate as the ORM tool to communicate with the database and wrote queries in Hibernate Query Language (HQL).
Created tables, triggers, stored procedures, SQL queries, joins, constraints & views for Oracle database.
Used Jersey API to implement Restful web service to retrieve JSON response.
Designed and developed the client-side graphical UI using HTML, CSS, Bootstrap, JavaScript, Angular and jQuery.
Used Log4j for debugging and performed unit testing using JUnit.
Used Maven scripts to create JAR and WAR files and deployed the application on the server.
Worked with Git version control to manage the code repository.
Worked with JIRA for bug tracking, issue tracking and project management.

Environment: JSP, Servlets, Spring, Hibernate, Web Services, Angular 2, HTML, CSS, JavaScript, jQuery, AJAX, Oracle, Eclipse, Apache Tomcat, Maven.
