Senior Big Data Engineer Resume
SUMMARY:
- Over 8 years of IT experience in analysis, design, development, implementation, maintenance, and support, with experience in developing strategic methods for deploying big data technologies to efficiently solve Big Data processing requirements.
- Expertise in using major Hadoop ecosystem components such as HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper, and Hue.
- Integrated Flume with Kafka, using Flume as both a producer and a consumer (the Flafka pattern).
- Used Kafka for activity tracking and log aggregation.
- Experience in transferring data with Sqoop between HDFS and relational database systems (RDBMS) such as Oracle, DB2, and SQL Server, in both directions.
- Experience in designing star and snowflake schemas for data warehouse and ODS architectures.
- Developed reports and dashboards using Tableau for quick reviews presented to business and IT users.
- Used Sqoop to import data from relational databases (RDBMS) into HDFS and Hive, storing it in formats such as Text, Avro, Parquet, SequenceFile, and ORC with compression codecs such as Snappy and Gzip.
- Set up Azure infrastructure such as storage accounts, integration runtimes, service principal IDs, and app registrations to enable scalable and optimized support for business users' analytical requirements in Azure.
- Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them.
- Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
- Worked on Amazon Web Services (AWS) to integrate EMR with Spark 2, S3 storage, and Snowflake.
- Excellent knowledge of Kafka architecture.
- Ability to work effectively in cross-functional team environments; excellent communication and interpersonal skills.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
- Worked with NoSQL databases such as HBase, Cassandra, DynamoDB (AWS), and MongoDB.
- Well experienced in normalization and denormalization techniques for optimum performance in relational and dimensional database environments.
- Strong analytical and problem-solving skills and the ability to follow through with projects from inception to completion.
- Good working experience with Spark (Spark Streaming, Spark SQL), Scala, and Kafka; worked on reading multiple data formats on HDFS using Scala.
- Worked on Spark SQL, creating DataFrames by loading data from Hive tables, preparing data, and storing it in AWS S3.
- Extensive usage of Azure Portal, Azure PowerShell, storage accounts, and Azure data management.
- Excellent programming skills with experience in Java, C, SQL, and Python.
- Used Spark Streaming to receive real-time data from Kafka and store the streamed data to HDFS and NoSQL databases such as HBase and Cassandra using Python (a sketch follows this summary).
- Responsible for running Hadoop streaming jobs to process terabytes of XML data; utilized cluster coordination services through Zookeeper.
- Developed a Python script to call REST APIs and extract data to AWS S3.
- Experience in working with MapReduce programs, Pig scripts, and Hive commands to deliver the best results.
- Hands-on experience installing and configuring Cloudera Apache Hadoop ecosystem components such as Flume, HBase, Zookeeper, Oozie, Hive, Sqoop, and Pig.
- Hands-on experience using other Amazon Web Services such as Auto Scaling, Redshift, DynamoDB, and Route 53.
- Experience with operating systems: Linux (including Red Hat) and UNIX.
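A minimal PySpark sketch of the Kafka-to-HDFS streaming pattern referenced above, written with Structured Streaming for brevity rather than the older DStream API; the broker address, topic name, and HDFS paths are hypothetical placeholders and the Spark-Kafka connector package is assumed to be on the classpath.

```python
# Minimal sketch: consume a Kafka topic with Spark Structured Streaming and
# persist the raw events to HDFS as Parquet. Broker, topic, and paths are
# placeholders, not values from any specific engagement.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-to-hdfs")
         .getOrCreate())

# Read the stream; the Kafka source exposes key/value as binary columns.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
          .option("subscribe", "activity-events")             # hypothetical topic
          .option("startingOffsets", "latest")
          .load()
          .select(col("value").cast("string").alias("payload"),
                  col("timestamp")))

# Write micro-batches to HDFS; the checkpoint directory tracks Kafka offsets.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/activity")              # hypothetical path
         .option("checkpointLocation", "hdfs:///checkpoints/activity")
         .outputMode("append")
         .start())

query.awaitTermination()
```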
TECHNICAL SKILLS:
BigData/Hadoop Technologies: MapReduce, Spark, Spark SQL, Azure, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, YARN, Oozie, Zookeeper, Hue, Ambari Server
Languages: Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), PL/SQL, SQL, Shell Scripting
NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB
Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans
Public Cloud (AWS): EC2, IAM, S3, Auto Scaling, CloudWatch, Route 53, EMR, Redshift
Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall
Build Tools: Jenkins, Toad, SQL*Loader, PostgreSQL, Talend, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos
Databases: Microsoft SQL Server 2008/2010/2012, MySQL 4.x/5.x, Oracle 11g/12c, DB2, Teradata, Netezza
Operating Systems: All versions of Windows, UNIX, Linux, macOS, Sun Solaris
WORK EXPERIENCE:
Confidential
Senior Big Data Engineer
Responsibilities:
- Designed and implemented a configurable data delivery pipeline, built with Python, for scheduled updates to customer-facing data stores.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
- Compiled data from various sources to perform complex analysis for actionable results
- Measured efficiency of the Hadoop/Hive environment, ensuring SLAs were met.
- Optimized the TensorFlow model for efficiency.
- Worked as L1 support on Jira requests for Kafka.
- Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes.
- Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
- Designed several DAGs (Directed Acyclic Graphs) for automating ETL pipelines (see the sketch at the end of this role).
- Responsible for data services and data movement infrastructures
- Used HBase to store the Kafka topic, partition number, and offset values; also used the Phoenix JAR to connect to HBase tables.
- Built performant, scalable ETL processes to load, cleanse and validate data
- Migrated databases to the Azure SQL cloud platform and performed the related performance tuning.
- Installed and Configured Sqoop to import and export the data into Hive from Relational databases.
- Expertise in Python and Scala; created user-defined functions (UDFs) for Hive and Pig using Python.
- Involved in creating HiveQL on HBase tables and importing work order data efficiently into Hive tables.
- Extensive experience with Hadoop ecosystem components such as Hadoop, MapReduce, HDFS, HBase, Hive, Sqoop, Pig, Zookeeper, and Flume.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Experienced in ETL concepts, building ETL solutions and Data modeling
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and pre-processing.
- Used Oozie operational services for batch processing and scheduling workflows dynamically.
- Worked on architecting the ETL transformation layers and writing Spark jobs to do the processing.
- Configured Zookeeper and worked on Hadoop high availability with the Zookeeper failover controller, adding support for a scalable, fault-tolerant data solution.
- Aggregated daily sales team updates to send report to executives and to organize jobs running on Spark clusters
- Extensive usage of Azure Portal, Azure PowerShell, storage accounts, and Azure data management.
- Developed Scala scripts and UDFs using DataFrames/SQL/Datasets and RDDs in Spark for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
- Used HBase to store the majority of data, which needed to be divided based on region.
- Designed Oozie workflows for job scheduling and batch processing.
- Loaded application analytics data into the data warehouse at regular intervals.
- Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Created self-service reporting in Azure Data Lake Store Gen2 using an ELT approach.
- Experience in writing Sqoop scripts for importing and exporting data between RDBMS and HDFS.
- Implemented a Python codebase for branch management over Kafka features.
- Worked on Confluence and Jira.
Environment: Azure, Kafka, MapReduce, Scala, Python, Spark, Hadoop, Hive, HBase, Pig, Zookeeper, Oozie, HDFS.
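A minimal sketch of the Airflow DAG pattern used for the ETL scheduling work described above; the DAG name, schedule, and task callables are illustrative placeholders, not the production pipeline.

```python
# Minimal sketch of an Airflow DAG for a daily ETL pipeline; the task bodies,
# schedule, and retry settings are assumptions for illustration only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    """Pull the day's records from the source system (placeholder)."""
    pass


def transform():
    """Cleanse and reshape the extracted data (placeholder)."""
    pass


def load():
    """Write the transformed data to the warehouse (placeholder)."""
    pass


with DAG(
    dag_id="daily_etl_pipeline",          # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Task ordering: extract -> transform -> load
    t_extract >> t_transform >> t_load
```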
Confidential
Big Data Engineer
Responsibilities:
- Good working experience with Spark (Spark Streaming, Spark SQL), Scala, and Kafka; worked on reading multiple data formats on HDFS using Scala.
- Involved in AWS cloud services such as EC2, S3, RDS, ELB, EBS, VPC, Route 53, Auto Scaling groups, CloudWatch, CloudFront, and IAM to build configuration and troubleshooting for server migration from physical servers to the cloud on various Amazon Machine Images (AMIs).
- Worked with NoSQL databases like HBase, creating tables to load large sets of semi-structured data coming from various sources.
- Hands-on experience installing and configuring Cloudera Apache Hadoop ecosystem components such as Flume, HBase, Zookeeper, Oozie, Hive, Sqoop, and Pig.
- Created a Lambda deployment function and configured it to receive events from S3 buckets (see the sketch at the end of this role).
- Wrote UNIX shell scripts to automate jobs and scheduled cron jobs for job automation using Crontab.
- Responsible for running Hadoop streaming jobs to process terabytes of XML data; utilized cluster coordination services through Zookeeper.
- Migrated data from on-premises storage to AWS storage buckets.
- Developed a Python script to transfer data from on-premises systems to AWS S3.
- Worked on ingesting data through cleansing and transformations, leveraging AWS Lambda, AWS Glue, and Step Functions.
- Created YAML files for each data source, including Glue table stack creation.
- Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
- Developed Lambda functions and assigned IAM roles to run Python scripts, along with various triggers (SQS, EventBridge, SNS).
- Involved in Sqoop implementation, which helps in loading data from various RDBMS sources to Hadoop systems and vice versa.
- Experience in using Kafka and Kafka brokers to initiate Spark context and process live streaming data.
- Developed custom Kafka producers and consumers for publishing and subscribing to different Kafka topics.
- Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
- Experience with other Hadoop ecosystem tools such as Zookeeper, Oozie, and Impala.
- Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer
- Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Worked with relational SQL and NoSQL databases, including Oracle, Hive, and HBase, along with Sqoop.
- Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean and consistent data.
- Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
- Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau to HiveServer2 for generating interactive reports.
- Used Sqoop to channel data from different sources of HDFS and RDBMS.
- Strong understanding of AWS components such as EC2 and S3
- Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them.
- Imported documents into HDFS and HBase and created HAR files.
- Installed and configured Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
- Worked on implementation of a log producer in Scala that watches for application logs, transforms incremental logs, and sends them to a Kafka- and Zookeeper-based log collection platform.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
- Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Used Apache NiFi to copy data from local file system to HDP.
- Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
- Automated data processing with Oozie, scheduling data loads into the Hadoop Distributed File System.
Environment: Erwin, Big Data, Hadoop, Oracle, PL/SQL, Scala, Spark SQL, PySpark, Python, Kafka 1.1, SAS, SQL, MDM, Oozie, SSIS, T-SQL, ETL, HDFS, Cosmos, Pig, Sqoop, MS Access.
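A minimal sketch of the S3-triggered Lambda pattern mentioned above: the handler receives S3 "ObjectCreated" events and copies the new objects into a staging prefix for downstream processing. Bucket names and the prefix are hypothetical.

```python
# Minimal sketch of an AWS Lambda handler for S3 event notifications; the
# staging bucket and "incoming/" prefix are placeholders.
import urllib.parse

import boto3

s3 = boto3.client("s3")

STAGING_BUCKET = "example-staging-bucket"  # hypothetical target bucket


def lambda_handler(event, context):
    """Triggered by S3 event notifications configured on the source bucket."""
    records = event.get("Records", [])
    for record in records:
        src_bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the event payload.
        src_key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Copy the incoming object into the staging area for downstream jobs.
        s3.copy_object(
            Bucket=STAGING_BUCKET,
            Key=f"incoming/{src_key}",
            CopySource={"Bucket": src_bucket, "Key": src_key},
        )

    return {"status": "ok", "processed": len(records)}
```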
Confidential, Atlanta, GA
Big Data Engineer
Responsibilities:
- Responsible for design, development, and data modeling of Spark SQL scripts based on functional specifications.
- Designed and developed extract, transform, and load (ETL) mappings, procedures, and schedules, following the standard development lifecycle
- Experience in configuring Zookeeper to coordinate the servers in clusters and to maintain data consistency, which is important for decision making in the process.
- Experience in using Zookeeper and Oozie operational services to coordinate clusters and schedule workflows.
- Developed custom Kafka producers and consumers for publishing and subscribing to different Kafka topics.
- Used the Airflow workflow engine to manage interdependent jobs and to automate several types of Hadoop jobs, such as Python MapReduce, Spark, Hive, and Sqoop, as well as system-specific jobs.
- Used Spark Streaming to receive real-time data from Kafka and store the streamed data to HDFS and NoSQL databases such as HBase and Cassandra using Scala.
- Well-informed on Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, and MapReduce programming.
- Worked on distributed frameworks such as Apache Spark and Presto in Amazon EMR and Redshift, and interacted with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
- Developed workflow in Oozie to automate the tasks of loading the data into Nifi and pre-processing with Pig.
- Developed Sqoop jobs for performing incremental loads from RDBMS into HDFS and further applied Spark transformations
- Good programming experience with Python and Scala.
- Used HBase/Phoenix to support front end applications that retrieve data using row keys
- Configured Zookeeper to coordinate and support Kafka, Spark, Spark Streaming, HBase, and HDFS.
- Worked in writing Spark SQL scripts for optimizing the query performance
- Used Hive Queries in Spark-SQL for analysis and processing the data.
- Developed a data pipeline using Sqoop, HQL, Spark, and Kafka to ingest enterprise message delivery data into HDFS.
- Cloudera certified developer for Apache Hadoop. Good knowledge of Cassandra, Hive, Pig, HDFS, Sqoop and Map Reduce.
- Designed various dimension tables using HBase and written scripts to automate the data loading to dimension tables
- Hands on experience in installation, configuration, supporting and managing Hadoop Clusters
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive (see the sketch at the end of this role).
- Enhanced and optimized product Spark code to aggregate, group, and run data mining tasks using the Spark framework, and handled JSON data.
- Experience working with EMR clusters in the AWS cloud and with S3.
- Implemented real time system with Kafka and Zookeeper.
- Built a Big Data analytical framework for processing healthcare data for medical research using Python, Java, Hadoop, Hive, and Pig; integrated R scripts with MapReduce jobs.
- Created batch scripts to retrieve data from AWS S3 storage and to make the appropriate transformations in Scala using the Spark framework.
- Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster processing of data.
- Developed data pipelines using Spark, Hive and Sqoop to ingest, transform and analyze operational data.
- Involved in writing parsers using Python
- Implemented Hive UDFs and did performance tuning for better results.
- Tuned and developed SQL on HiveQL, Drill, and Spark SQL.
- Experience in using Sqoop to import and export the data from Oracle DB into HDFS and HIVE
- Developed Spark code using Spark RDD and Spark-SQL/Streaming for faster processing of data
- Implemented partitioning, data modeling, dynamic partitions, and buckets in Hive for efficient data access.
Environment: Cloudera CDH, Hadoop, AWS, Pig, Hive, Informatica, HBase, MapReduce, HDFS, Sqoop, Impala, SQL, Tableau, Python, SAS, Flume, JavaScript, Oozie, Linux, NoSQL, MongoDB, Talend, Git.
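A minimal sketch of the Spark-over-YARN analytics pattern referenced above: Spark SQL queries an existing Hive table and writes the aggregate back as a partitioned Hive table. The database, table, and column names are hypothetical.

```python
# Minimal sketch: run Spark SQL against Hive tables and persist a partitioned
# aggregate; sales_db.orders and its columns are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-analytics")
         .enableHiveSupport()      # lets Spark read/write Hive metastore tables
         .getOrCreate())

# Query an existing Hive table with Spark SQL.
daily_sales = spark.sql("""
    SELECT region, to_date(order_ts) AS order_date, SUM(amount) AS total_amount
    FROM sales_db.orders              -- hypothetical Hive table
    GROUP BY region, to_date(order_ts)
""")

# Persist the result as a Hive table partitioned by date for efficient access.
(daily_sales.write
 .mode("overwrite")
 .partitionBy("order_date")
 .format("parquet")
 .saveAsTable("sales_db.daily_sales_agg"))

spark.stop()
```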
Confidential
Data Engineer
Responsibilities:
- Worked on development of data ingestion pipelines using the Talend ETL tool and bash scripting with big data technologies including but not limited to Hive, Impala, Spark, and Kafka.
- Experience in developing scalable & secure data pipelines for large datasets.
- Gathered requirements for ingestion of new data sources including life cycle, data quality check, transformations, and metadata enrichment.
- Supported data quality management by implementing proper data quality checks in data pipelines.
- Delivered data engineering services such as data exploration, ad-hoc ingestion, and subject-matter expertise to data scientists using big data technologies.
- Built machine learning models to showcase big data capabilities using PySpark and MLlib (see the sketch at the end of this role).
- Enhanced the data ingestion framework by creating more robust and secure data pipelines.
- Implemented data streaming capability using Kafka and Talend for multiple data sources.
- Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
- S3 data lake management: responsible for maintaining and handling inbound and outbound data requests through the big data platform.
- Working knowledge of cluster security components like Kerberos, Sentry, SSL/TLS etc.
- Involved in the development of agile, iterative, and proven data modeling patterns that provide flexibility.
- Knowledge of implementing JILs to automate jobs in the production cluster.
- Troubleshot users' analysis bugs (Jira and IRIS tickets).
- Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
- Worked on analyzing and resolving the production job failures in several scenarios.
- Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.
Environment: Spark, Redshift, Python, HDFS, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile Methodology.
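A minimal sketch of a PySpark MLlib pipeline (feature assembly plus logistic regression) of the kind used to showcase big data ML capability above; the input path and column names are hypothetical.

```python
# Minimal sketch of a PySpark MLlib classification pipeline; the dataset path
# and feature columns are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Load a labeled dataset from the data lake (hypothetical path and schema).
df = spark.read.parquet("hdfs:///data/curated/churn_features")

assembler = VectorAssembler(
    inputCols=["tenure_days", "monthly_charges", "support_calls"],  # hypothetical
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="label")

train, test = df.randomSplit([0.8, 0.2], seed=42)

# Fit the two-stage pipeline: assemble features, then train the classifier.
model = Pipeline(stages=[assembler, lr]).fit(train)

# Evaluate on the held-out split using area under the ROC curve.
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(test))
print(f"Test AUC: {auc:.3f}")

spark.stop()
```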
Confidential
Data Engineer
Responsibilities:
- Responsibilities include gathering business requirements, developing strategy for data cleansing and data migration, writing functional and technical specifications, creating source to target mapping, designing data profiling and data validation jobs in Informatica, and creating ETL jobs in Informatica.
- Built APIs that will allow customer service representatives to access the data and answer queries.
- Designed changes to transform current Hadoop jobs to HBase.
- Extended the functionality of Hive with custom UDFs and UDAFs.
- Implemented bucketing and partitioning using Hive to assist users with data analysis.
- Used Oozie scripts for deployment of the application and Perforce as the secure versioning software.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Developed database management systems for easy access, storage, and retrieval of data.
- Performed DB activities such as indexing, performance tuning, and backup and restore.
- Expertise in writing Hadoop Jobs for analyzing data using Hive QL (Queries), Pig Latin (Data flow language), and custom MapReduce programs in Java.
- Expert in creating Hive UDFs using Java to analyze the data efficiently.
- Responsible for loading the data from BDW Oracle database, Teradata into HDFS using Sqoop.
- Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB.
- Involved in loading and transforming large sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries (see the sketch at the end of this role).
Environment: Cloudera CDH, Hadoop, Pig, Hive, Map Reduce, HDFS, Sqoop, Impala, Tableau, Flume, Oozie, Linux.
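A minimal sketch of the load-transform-analyze flow described above, expressed in PySpark for consistency with the other examples in this resume even though the role itself ran the analysis as Hive queries; the input path, nested fields, and view name are hypothetical.

```python
# Minimal sketch: load semi-structured JSON, apply a light transformation, and
# analyze it with a Hive-style SQL aggregate; paths and fields are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("semi-structured-analysis")
         .enableHiveSupport()
         .getOrCreate())

# Read semi-structured JSON event logs from HDFS (hypothetical location).
events = spark.read.json("hdfs:///landing/events/*.json")

# Light transformation: flatten a nested field and drop incomplete rows.
cleaned = (events
           .withColumn("user_id", col("user.id"))
           .dropna(subset=["user_id", "event_type"]))

cleaned.createOrReplaceTempView("events_cleaned")

# Analyze with a Hive-style aggregate query.
summary = spark.sql("""
    SELECT event_type, COUNT(*) AS event_count
    FROM events_cleaned
    GROUP BY event_type
    ORDER BY event_count DESC
""")
summary.show()

spark.stop()
```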