Senior Big Data Engineer Resume
SUMMARY:
- Over 8 years of IT experience in analysis, design, development, implementation, maintenance, and support, with experience in developing strategic methods for deploying big data technologies to efficiently solve Big Data processing requirements.
- Expertise in using major Hadoop ecosystem components such as HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper, and Hue.
- Integrated Flume with Kafka, using Flume as both a producer and a consumer (the Flafka pattern).
- Used Kafka for activity tracking and log aggregation.
- Experience in transferring data with Sqoop between HDFS and relational database systems (RDBMS) such as Oracle, DB2, and SQL Server, in both directions.
- Experience in designing star and snowflake schemas for data warehouse and ODS architectures.
- Developed reports and dashboards using Tableau for quick reviews presented to business and IT users.
- Used Sqoop to import data from relational databases (RDBMS) into HDFS and Hive, storing it in formats such as Text, Avro, Parquet, SequenceFile, and ORC with compression codecs such as Snappy and Gzip.
- Set up Azure infrastructure such as storage accounts, integration runtimes, service principal IDs, and app registrations to enable scalable and optimized support for business users' analytical requirements in Azure.
- Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them.
- Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
- Worked on Amazon Web Services (AWS) to integrate EMR with Spark 2, S3 storage, and Snowflake.
- Excellent knowledge of Kafka architecture.
- Ability to work effectively in cross-functional team environments; excellent communication and interpersonal skills.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
- Worked with NoSQL databases such as HBase, Cassandra, DynamoDB (AWS), and MongoDB.
- Well experienced in normalization and denormalization techniques for optimum performance in relational and dimensional database environments.
- Strong analytical and problem-solving skills and the ability to follow through with projects from inception to completion.
- Good working experience with Spark (Spark Streaming, Spark SQL), Scala, and Kafka; worked on reading multiple data formats on HDFS using Scala.
- Worked on Spark SQL, creating DataFrames by loading data from Hive tables, preparing data, and storing it in AWS S3.
- Extensive usage of Azure Portal, Azure PowerShell, storage accounts, and Azure data management.
- Excellent programming skills with experience in Java, C, SQL, and Python.
- Used Spark Streaming to receive real-time data from Kafka and store the streamed data to HDFS and NoSQL databases such as HBase and Cassandra using Python (a sketch follows this summary).
- Responsible for running Hadoop streaming jobs to process terabytes of XML data; utilized cluster coordination services through Zookeeper.
- Developed a Python script to call REST APIs and extract data to AWS S3.
- Experience in working with MapReduce programs, Pig scripts, and Hive commands to deliver the best results.
- Hands-on experience installing and configuring Cloudera Apache Hadoop ecosystem components such as Flume, HBase, Zookeeper, Oozie, Hive, Sqoop, and Pig.
- Hands-on experience using other Amazon Web Services such as Auto Scaling, Redshift, DynamoDB, and Route 53.
- Experience with operating systems: Linux (including Red Hat) and UNIX.
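A minimal PySpark sketch of the Kafka-to-HDFS streaming pattern referenced above, written with Structured Streaming for brevity rather than the older DStream API; the broker address, topic name, and HDFS paths are hypothetical placeholders and the Spark-Kafka connector package is assumed to be on the classpath.

```python
# Minimal sketch: consume a Kafka topic with Spark Structured Streaming and
# persist the raw events to HDFS as Parquet. Broker, topic, and paths are
# placeholders, not values from any specific engagement.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-to-hdfs")
         .getOrCreate())

# Read the stream; the Kafka source exposes key/value as binary columns.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
          .option("subscribe", "activity-events")             # hypothetical topic
          .option("startingOffsets", "latest")
          .load()
          .select(col("value").cast("string").alias("payload"),
                  col("timestamp")))

# Write micro-batches to HDFS; the checkpoint directory tracks Kafka offsets.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/activity")              # hypothetical path
         .option("checkpointLocation", "hdfs:///checkpoints/activity")
         .outputMode("append")
         .start())

query.awaitTermination()
```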
TECHNICAL SKILLS:
BigData/Hadoop Technologies: MapReduce, Spark, Spark SQL, Azure, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, YARN, Oozie, Zookeeper, Hue, Ambari Server
Languages: Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), PL/SQL, SQL, Shell Scripting
NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB
Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans
Public Cloud (AWS): EC2, IAM, S3, Auto Scaling, CloudWatch, Route 53, EMR, Redshift
Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall
Build Tools: Jenkins, Toad, SQL*Loader, PostgreSQL, Talend, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos
Databases: Microsoft SQL Server 2008/2010/2012, MySQL 4.x/5.x, Oracle 11g/12c, DB2, Teradata, Netezza
Operating Systems: All versions of Windows, UNIX, Linux, macOS, Sun Solaris
WORK EXPERIENCE:
Confidential
Senior Big Data Engineer
Responsibilities:
- Designed and implemented a configurable data delivery pipeline, built with Python, for scheduled updates to customer-facing data stores.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
- Compiled data from various sources to perform complex analysis for actionable results
- Measured efficiency of the Hadoop/Hive environment, ensuring SLAs were met.
- Optimized the TensorFlow model for efficiency.
- Worked as L1 support on Jira requests for Kafka.
- Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes.
- Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
- Designed several DAGs (Directed Acyclic Graphs) for automating ETL pipelines (see the sketch at the end of this role).
- Responsible for data services and data movement infrastructures
- Used HBase to store the Kafka topic, partition number, and offset values; also used the Phoenix JAR to connect to HBase tables.
- Built performant, scalable ETL processes to load, cleanse and validate data
- Migrated databases to the Azure SQL cloud platform and performed the related performance tuning.
- Installed and Configured Sqoop to import and export the data into Hive from Relational databases.
- Expertise in Python and Scala; created user-defined functions (UDFs) for Hive and Pig using Python.
- Involved in creating HiveQL on HBase tables and importing work order data efficiently into Hive tables.
- Extensive experience with Hadoop ecosystem components such as Hadoop, MapReduce, HDFS, HBase, Hive, Sqoop, Pig, Zookeeper, and Flume.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Experienced in ETL concepts, building ETL solutions and Data modeling
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and pre-processing.
- Used Oozie operational services for batch processing and scheduling workflows dynamically.
- Worked on architecting the ETL transformation layers and writing Spark jobs to do the processing.
- Configured Zookeeper and worked on Hadoop high availability with the Zookeeper failover controller, adding support for a scalable, fault-tolerant data solution.
- Aggregated daily sales team updates to send report to executives and to organize jobs running on Spark clusters
- Extensive usage of Azure Portal, Azure PowerShell, storage accounts, and Azure data management.
- Developed Scala scripts and UDFs using DataFrames/SQL/Datasets and RDDs in Spark for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
- Used HBase to store the majority of data, which needed to be divided based on region.
- Designed Oozie workflows for job scheduling and batch processing.
- Loaded application analytics data into the data warehouse at regular intervals.
- Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Created self-service reporting in Azure Data Lake Store Gen2 using an ELT approach.
- Experience in writing Sqoop scripts for importing and exporting data between RDBMS and HDFS.
- Implemented a Python codebase for branch management over Kafka features.
- Worked on Confluence and Jira.
Environment: Azure, Kafka, MapReduce, Scala, Python, Spark, Hadoop, Hive, HBase, Pig, Zookeeper, Oozie, HDFS.
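A minimal sketch of the Airflow DAG pattern used for the ETL scheduling work described above; the DAG name, schedule, and task callables are illustrative placeholders, not the production pipeline.

```python
# Minimal sketch of an Airflow DAG for a daily ETL pipeline; the task bodies,
# schedule, and retry settings are assumptions for illustration only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    """Pull the day's records from the source system (placeholder)."""
    pass


def transform():
    """Cleanse and reshape the extracted data (placeholder)."""
    pass


def load():
    """Write the transformed data to the warehouse (placeholder)."""
    pass


with DAG(
    dag_id="daily_etl_pipeline",          # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Task ordering: extract -> transform -> load
    t_extract >> t_transform >> t_load
```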
Confidential
Big Data Engineer
Responsibilities:
- Good working experience with Spark (Spark Streaming, Spark SQL), Scala, and Kafka; worked on reading multiple data formats on HDFS using Scala.
- Involved in AWS cloud services such as EC2, S3, RDS, ELB, EBS, VPC, Route 53, Auto Scaling groups, CloudWatch, CloudFront, and IAM to build configuration and troubleshooting for server migration from physical servers to the cloud on various Amazon Machine Images (AMIs).
- Worked with NoSQL databases like HBase, creating tables to load large sets of semi-structured data coming from various sources.
- Hands-on experience installing and configuring Cloudera Apache Hadoop ecosystem components such as Flume, HBase, Zookeeper, Oozie, Hive, Sqoop, and Pig.
- Created a Lambda deployment function and configured it to receive events from S3 buckets (see the sketch at the end of this role).
- Wrote UNIX shell scripts to automate jobs and scheduled cron jobs for job automation using Crontab.
- Responsible for running Hadoop streaming jobs to process terabytes of XML data; utilized cluster coordination services through Zookeeper.
- Migrated data from on-premises storage to AWS storage buckets.
- Developed a Python script to transfer data from on-premises systems to AWS S3.
- Worked on ingesting data through cleansing and transformations, leveraging AWS Lambda, AWS Glue, and Step Functions.
- Created YAML files for each data source, including Glue table stack creation.
- Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
- Developed Lambda functions and assigned IAM roles to run Python scripts, along with various triggers (SQS, EventBridge, SNS).
- Involved in Sqoop implementation, which helps in loading data from various RDBMS sources to Hadoop systems and vice versa.
- Experience in using Kafka and Kafka brokers to initiate Spark context and process live streaming data.
- Developed custom Kafka producers and consumers for publishing and subscribing to different Kafka topics.
- Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
- Experience with other Hadoop ecosystem tools such as Zookeeper, Oozie, and Impala.
- Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer
- Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Worked with relational SQL and NoSQL databases, including Oracle, Hive, and HBase, along with Sqoop.
- Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean and consistent data.
- Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
- Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau to HiveServer2 for generating interactive reports.
- Used Sqoop to channel data from different sources of HDFS and RDBMS.
- Strong understanding of AWS components such as EC2 and S3
- Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them.
- Imported documents into HDFS and HBase and created HAR files.
- Installed and configured Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
- Worked on implementation of a log producer in Scala that watches for application logs, transforms incremental logs, and sends them to a Kafka- and Zookeeper-based log collection platform.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
- Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Used Apache NiFi to copy data from local file system to HDP.
- Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
- Automated data processing with Oozie, scheduling data loads into the Hadoop Distributed File System.
Environment: Erwin, Big Data, Hadoop, Oracle, PL/SQL, Scala, Spark SQL, PySpark, Python, Kafka 1.1, SAS, SQL, MDM, Oozie, SSIS, T-SQL, ETL, HDFS, Cosmos, Pig, Sqoop, MS Access.
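A minimal sketch of the S3-triggered Lambda pattern mentioned above: the handler receives S3 "ObjectCreated" events and copies the new objects into a staging prefix for downstream processing. Bucket names and the prefix are hypothetical.

```python
# Minimal sketch of an AWS Lambda handler for S3 event notifications; the
# staging bucket and "incoming/" prefix are placeholders.
import urllib.parse

import boto3

s3 = boto3.client("s3")

STAGING_BUCKET = "example-staging-bucket"  # hypothetical target bucket


def lambda_handler(event, context):
    """Triggered by S3 event notifications configured on the source bucket."""
    records = event.get("Records", [])
    for record in records:
        src_bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the event payload.
        src_key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Copy the incoming object into the staging area for downstream jobs.
        s3.copy_object(
            Bucket=STAGING_BUCKET,
            Key=f"incoming/{src_key}",
            CopySource={"Bucket": src_bucket, "Key": src_key},
        )

    return {"status": "ok", "processed": len(records)}
```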
Confidential, Atlanta, GA
Big Data Engineer
Responsibilities:
- Responsible for design, development, and data modeling of Spark SQL scripts based on functional specifications.
- Designed and developed extract, transform, and load (ETL) mappings, procedures, and schedules, following the standard development lifecycle
- Experience in configuring Zookeeper to coordinate the servers in clusters and to maintain data consistency, which is important for decision making in the process.
- Experience in using Zookeeper and Oozie operational services to coordinate clusters and schedule workflows.
- Developed custom Kafka producers and consumers for publishing and subscribing to different Kafka topics.
- Used the Airflow workflow engine to manage interdependent jobs and to automate several types of Hadoop jobs, such as Python MapReduce, Spark, Hive, and Sqoop, as well as system-specific jobs.
- Used Spark Streaming to receive real-time data from Kafka and store the streamed data to HDFS and NoSQL databases such as HBase and Cassandra using Scala.
- Well-informed on Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, and MapReduce programming.
- Worked on distributed frameworks such as Apache Spark and Presto in Amazon EMR and Redshift, and interacted with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
- Developed workflow in Oozie to automate the tasks of loading the data into Nifi and pre-processing with Pig.
- Developed Sqoop jobs for performing incremental loads from RDBMS into HDFS and further applied Spark transformations
- Good programming experience with Python and Scala.
- Used HBase/Phoenix to support front end applications that retrieve data using row keys
- Configured Zookeeper to coordinate and support Kafka, Spark, Spark Streaming, HBase, and HDFS.
- Worked in writing Spark SQL scripts for optimizing the query performance
- Used Hive Queries in Spark-SQL for analysis and processing the data.
- Developed a data pipeline using Sqoop, HQL, Spark, and Kafka to ingest enterprise message delivery data into HDFS.
- Cloudera certified developer for Apache Hadoop. Good knowledge of Cassandra, Hive, Pig, HDFS, Sqoop and Map Reduce.
- Designed various dimension tables using HBase and written scripts to automate the data loading to dimension tables
- Hands on experience in installation, configuration, supporting and managing Hadoop Clusters
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive (see the sketch at the end of this role).
- Enhanced and optimized product Spark code to aggregate, group, and run data mining tasks using the Spark framework, and handled JSON data.
- Experience working with EMR clusters in the AWS cloud and with S3.
- Implemented real time system with Kafka and Zookeeper.
- Built a Big Data analytical framework for processing healthcare data for medical research using Python, Java, Hadoop, Hive, and Pig; integrated R scripts with MapReduce jobs.
- Created batch scripts to retrieve data from AWS S3 storage and to make the appropriate transformations in Scala using the Spark framework.
- Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster processing of data.
- Developed data pipelines using Spark, Hive and Sqoop to ingest, transform and analyze operational data.
- Involved in writing parsers using Python
- Implemented Hive UDFs and did performance tuning for better results.
- Tuned and developed SQL on HiveQL, Drill, and Spark SQL.
- Experience in using Sqoop to import and export the data from Oracle DB into HDFS and HIVE
- Developed Spark code using Spark RDD and Spark-SQL/Streaming for faster processing of data
- Implemented partitioning, data modeling, dynamic partitions, and buckets in Hive for efficient data access.
Environment: Cloudera CDH, Hadoop, AWS, Pig, Hive, Informatica, HBase, MapReduce, HDFS, Sqoop, Impala, SQL, Tableau, Python, SAS, Flume, JavaScript, Oozie, Linux, NoSQL, MongoDB, Talend, Git.
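A minimal sketch of the Spark-over-YARN analytics pattern referenced above: Spark SQL queries an existing Hive table and writes the aggregate back as a partitioned Hive table. The database, table, and column names are hypothetical.

```python
# Minimal sketch: run Spark SQL against Hive tables and persist a partitioned
# aggregate; sales_db.orders and its columns are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-analytics")
         .enableHiveSupport()      # lets Spark read/write Hive metastore tables
         .getOrCreate())

# Query an existing Hive table with Spark SQL.
daily_sales = spark.sql("""
    SELECT region, to_date(order_ts) AS order_date, SUM(amount) AS total_amount
    FROM sales_db.orders              -- hypothetical Hive table
    GROUP BY region, to_date(order_ts)
""")

# Persist the result as a Hive table partitioned by date for efficient access.
(daily_sales.write
 .mode("overwrite")
 .partitionBy("order_date")
 .format("parquet")
 .saveAsTable("sales_db.daily_sales_agg"))

spark.stop()
```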
Confidential
Data Engineer
Responsibilities:
- Worked on development of data ingestion pipelines using the Talend ETL tool and bash scripting with big data technologies including but not limited to Hive, Impala, Spark, and Kafka.
- Experience in developing scalable & secure data pipelines for large datasets.
- Gathered requirements for ingestion of new data sources including life cycle, data quality check, transformations, and metadata enrichment.
- Supported data quality management by implementing proper data quality checks in data pipelines.
- Delivered data engineering services such as data exploration, ad-hoc ingestion, and subject-matter expertise to data scientists using big data technologies.
- Built machine learning models to showcase big data capabilities using PySpark and MLlib (see the sketch at the end of this role).
- Enhanced the data ingestion framework by creating more robust and secure data pipelines.
- Implemented data streaming capability using Kafka and Talend for multiple data sources.
- Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
- S3 data lake management: responsible for maintaining and handling inbound and outbound data requests through the big data platform.
- Working knowledge of cluster security components like Kerberos, Sentry, SSL/TLS etc.
- Involved in the development of agile, iterative, and proven data modeling patterns that provide flexibility.
- Knowledge of implementing JILs to automate jobs in the production cluster.
- Troubleshot users' analysis bugs (Jira and IRIS tickets).
- Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
- Worked on analyzing and resolving the production job failures in several scenarios.
- Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.
Environment: Spark, Redshift, Python, HDFS, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile Methodology.
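A minimal sketch of a PySpark MLlib pipeline (feature assembly plus logistic regression) of the kind used to showcase big data ML capability above; the input path and column names are hypothetical.

```python
# Minimal sketch of a PySpark MLlib classification pipeline; the dataset path
# and feature columns are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Load a labeled dataset from the data lake (hypothetical path and schema).
df = spark.read.parquet("hdfs:///data/curated/churn_features")

assembler = VectorAssembler(
    inputCols=["tenure_days", "monthly_charges", "support_calls"],  # hypothetical
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="label")

train, test = df.randomSplit([0.8, 0.2], seed=42)

# Fit the two-stage pipeline: assemble features, then train the classifier.
model = Pipeline(stages=[assembler, lr]).fit(train)

# Evaluate on the held-out split using area under the ROC curve.
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(test))
print(f"Test AUC: {auc:.3f}")

spark.stop()
```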
Confidential
Data Engineer
Responsibilities:
- Responsibilities include gathering business requirements, developing strategy for data cleansing and data migration, writing functional and technical specifications, creating source to target mapping, designing data profiling and data validation jobs in Informatica, and creating ETL jobs in Informatica.
- Built APIs that will allow customer service representatives to access the data and answer queries.
- Designed changes to transform current Hadoop jobs to HBase.
- Extended the functionality of Hive with custom UDFs and UDAFs.
- Implemented bucketing and partitioning using Hive to assist users with data analysis.
- Used Oozie scripts for deployment of the application and Perforce as the secure versioning software.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Developed database management systems for easy access, storage, and retrieval of data.
- Performed DB activities such as indexing, performance tuning, and backup and restore.
- Expertise in writing Hadoop Jobs for analyzing data using Hive QL (Queries), Pig Latin (Data flow language), and custom MapReduce programs in Java.
- Expert in creating Hive UDFs using Java to analyze the data efficiently.
- Responsible for loading the data from BDW Oracle database, Teradata into HDFS using Sqoop.
- Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB.
- Involved in loading and transforming large sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries (see the sketch at the end of this role).
Environment: Cloudera CDH, Hadoop, Pig, Hive, Map Reduce, HDFS, Sqoop, Impala, Tableau, Flume, Oozie, Linux.
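A minimal sketch of the load-transform-analyze flow described above, expressed in PySpark for consistency with the other examples in this resume even though the role itself ran the analysis as Hive queries; the input path, nested fields, and view name are hypothetical.

```python
# Minimal sketch: load semi-structured JSON, apply a light transformation, and
# analyze it with a Hive-style SQL aggregate; paths and fields are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("semi-structured-analysis")
         .enableHiveSupport()
         .getOrCreate())

# Read semi-structured JSON event logs from HDFS (hypothetical location).
events = spark.read.json("hdfs:///landing/events/*.json")

# Light transformation: flatten a nested field and drop incomplete rows.
cleaned = (events
           .withColumn("user_id", col("user.id"))
           .dropna(subset=["user_id", "event_type"]))

cleaned.createOrReplaceTempView("events_cleaned")

# Analyze with a Hive-style aggregate query.
summary = spark.sql("""
    SELECT event_type, COUNT(*) AS event_count
    FROM events_cleaned
    GROUP BY event_type
    ORDER BY event_count DESC
""")
summary.show()

spark.stop()
```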