
Senior Big Data Engineer Resume


SUMMARY

  • 6+ years of professional IT experience with expertise in Big Data, Hadoop, Spark, Hive, Impala, Sqoop, Flume, Kafka, SQL, ETL development, report development, database development, and data modeling, along with strong knowledge of several database architectures.
  • Experienced in developing custom UDFs for Pig and Hive to incorporate methods and functionality of Python/Java into Pig Latin and HQL (HiveQL).
  • Designed pipelines to extract data from Snowflake and perform data transformations and filtering before pushing it to the data warehouse.
  • Expert in core Java and JDBC, and proficient in using Java APIs for application development.
  • Expert in JavaScript, JavaScript MVC patterns, object-oriented JavaScript design patterns, and AJAX calls.
  • Good experience with Tableau for data visualization and analysis of large data sets.
  • Leveraged and integrated Google Cloud Storage and BigQuery, connected to Tableau for end-user web-based dashboards and reports.
  • Good working experience with application and web servers such as JBoss and Apache Tomcat.
  • Good knowledge of Amazon Web Services (AWS) offerings such as Athena, EMR, and EC2, which provide fast and efficient processing for Teradata big data analytics.
  • Expert in big data architectures such as the Hadoop distributed system (Azure, Hortonworks, Cloudera), MongoDB and other NoSQL stores, HDFS, and the MapReduce parallel-processing framework.
  • Developed Spark-based applications to load streaming data with low latency using Kafka and PySpark (a minimal sketch of this pattern follows this summary).
  • Hands-on experience with Hadoop/big data technologies, with extensive experience in storage, querying, processing, and analysis of data.
  • Experienced in development of Big Data projects using Hadoop, Hive, HDP, Pig, Flume, Storm and MapReduce open-source tools.
  • Experienced in installation, configuration, supporting and managing Hadoop clusters.
  • Experienced in working with MapReduce programs using Apache Hadoop for working with Big Data.
  • Experienced in development, support, and maintenance of the ETL (Extract, Transform and Load) processes using Talend Integration Suite.
  • Experienced in installation, configuration, supporting and monitoring Hadoop clusters using Apache, Cloudera distributions and AWS.
  • Strong hands-on experience with AWS services, including but not limited to EMR, S3, EC2, Route 53, RDS, ELB, DynamoDB, and CloudFormation.
  • Hands-on experience with the Hadoop ecosystem and big data technologies, including Spark, Kafka, HBase, Scala, Pig, Impala, Sqoop, Oozie, Flume, and Storm.
  • Worked on Spark and Spark Streaming, using the core Spark API to explore Spark features and build data pipelines.
  • Experienced in working with different scripting technologies like Python, UNIX shell scripts.
  • Extensive knowledge of IDE tools such as MyEclipse, RAD, IntelliJ, and NetBeans.
  • Expert in working with Amazon EMR, Spark, Kinesis, S3, ECS, ElastiCache, DynamoDB, and Redshift.
  • Experienced in installation, configuration, support, and management of the Cloudera Hadoop platform, including CDH4 and CDH5 clusters.
  • Proficient in multiple databases, including NoSQL databases (MongoDB, Cassandra), MySQL, Oracle, and MS SQL Server.
  • Experience in database design, entity relationships, and database analysis, and in programming SQL, PL/SQL stored procedures, packages, and triggers in Oracle.
  • Experience in working with different data sources like Flat files, XML files and Databases.
  • Ability to tune Big Data solutions to improve performance and end-user experience.
  • Managed multiple tasks and worked under tight deadlines in a fast-paced environment.
  • Excellent analytical and communication skills, which help in understanding business logic and building good relationships between stakeholders and team members.
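
As a reference for the Kafka/PySpark streaming pattern mentioned above, here is a minimal, hypothetical sketch of a Structured Streaming job; the broker address, topic name, schema, and paths are placeholders, not details from an actual engagement.

```python
# Minimal PySpark Structured Streaming sketch: consume JSON events from Kafka
# and land them in HDFS with low latency. Requires the spark-sql-kafka package.
# Broker, topic, schema, and paths below are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-stream-load").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
       .option("subscribe", "events")                       # placeholder topic
       .option("startingOffsets", "latest")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")             # placeholder sink path
         .option("checkpointLocation", "hdfs:///chk/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```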

TECHNICAL SKILLS

Big Data Technologies: HDFS, Map Reduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, Apache Spark, Spark Streaming, Impala

Hadoop Distribution: Cloudera, Hortonworks, AWS EMR

Languages: SQL, Python, Scala, Regular Expressions, PL/SQL, Pig Latin, HiveQL, Shell Scripting

Operating Systems: Windows, UNIX, LINUX, UBUNTU, CENTOS

Portals/Application Servers: WebLogic, WebSphere Application Server, JBoss

Build Automation tools: SBT, Ant, Maven

Databases: Amazon RDS, Amazon Redshift, Oracle, SQL Server, MySQL, MS Access, Teradata, Cassandra, HBase, MongoDB

ETL Tools: Informatica Power Center, Talend Open Studio for Big Data

Cloud Technologies: AWS, GCP, Snowflake, Azure Data Factory, Azure Data Lakes, Azure Blob Storage, Azure Synapse Analytics, Amazon S3, EMR, Redshift, Lambda, Athena

PROFESSIONAL EXPERIENCE

Confidential

Senior Big Data Engineer

Responsibilities:

  • Worked on analyzing the Hadoop stack and different big data analytics tools, including Pig, Hive, the HBase database, and Sqoop.
  • Wrote multiple MapReduce programs for the extraction, transformation, and aggregation of data from more than 20 sources in multiple file formats, including XML, JSON, CSV, and other compressed file formats.
  • Designed pipelines to extract data from Snowflake and perform data transformations and filtering before pushing it to the data warehouse.
  • Implemented Spark Core in Scala to process data in memory.
  • Performed job functions using Spark APIs in Scala for real-time analysis and fast querying.
  • Involved in creating Spark applications in Scala using cache, map, reduceByKey, and other functions to process data.
  • Created Oozie workflows for Hadoop-based jobs, including Sqoop, Hive, and Pig.
  • Created Hive external tables, loaded the data into them, and queried the data using HQL.
  • Performed data validation on the ingested data using MapReduce by building a custom model to filter out invalid records and cleanse the data.
  • Handled the import of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
  • Wrote HiveQL queries, configuring the number of reducers and mappers needed for the output.
  • Transferred data between Pig scripts and Hive using HCatalog, and transferred relational database data using Sqoop.
  • Configured and maintained different topologies in the Storm cluster and deployed them on a regular basis.
  • Responsible for building scalable distributed data solutions using Hadoop; installed and configured Hive, Pig, Oozie, and Sqoop on the Hadoop cluster.
  • Developed simple to complex MapReduce jobs in Java and implemented equivalent logic using Hive and Pig.
  • Ran many performance tests using the Cassandra-stress tool to measure and improve the read and write performance of the cluster.
  • Configured Kafka, Storm, and Hive to receive and load real-time messages.
  • Supported MapReduce programs running on the cluster and handled cluster monitoring, maintenance, and troubleshooting.
  • Analyzed the data by performing Hive queries (HiveQL) and running Pig scripts (Pig Latin).
  • Provided cluster coordination services through Zookeeper; installed and configured Hive and wrote Hive UDFs.
  • Worked on the Analytics Infrastructure team to develop a stream-filtering system on top of Apache Kafka and Storm.
  • Worked on a POC for Spark and Scala parallel processing, streaming data in real time using Spark with Kafka.
  • Worked extensively with PySpark to build big data flows (see the sketch after this list).
  • Gained good hands-on experience with Apache Spark in the current project.
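
The following is a minimal, hypothetical PySpark sketch of the kind of batch flow described above: reading CSV and JSON sources, filtering out invalid records, aggregating, and writing the result to HDFS. Paths, column names, and the validation rule are assumptions for illustration only.

```python
# Illustrative PySpark batch flow: read CSV and JSON sources, filter out invalid
# records, aggregate, and persist the result to HDFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-ingest-validate").getOrCreate()

# Hypothetical landing locations for the raw files.
csv_df = spark.read.option("header", "true").csv("hdfs:///landing/csv/")
json_df = spark.read.json("hdfs:///landing/json/")

# Align both sources to a common set of columns before unioning them.
common_cols = ["record_id", "source", "amount"]
combined = (csv_df.select(common_cols)
            .unionByName(json_df.select(common_cols))
            .withColumn("amount", F.col("amount").cast("double")))

# Basic cleansing: drop rows with missing keys or non-positive amounts.
valid = combined.filter(F.col("record_id").isNotNull() & (F.col("amount") > 0))

# Simple aggregation, analogous to a map/reduceByKey step.
summary = valid.groupBy("source").agg(F.sum("amount").alias("total_amount"))

summary.write.mode("overwrite").parquet("hdfs:///curated/source_totals/")
```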

Environment: Hadoop, Spark, HDFS, Hive, Pig, HBase, Big Data, Apache Storm, Oozie, Sqoop, Kafka, Flume, Zookeeper, MapReduce, Cassandra, Scala, Linux, NoSQL, MySQL Workbench, Java, Eclipse, SQL Server.

Confidential

Sr. Big Data Engineer

Responsibilities:

  • Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
  • Responsible for Big data initiatives and engagement including analysis, brainstorming, POC, and architecture.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop/big data concepts.
  • Installed and Configured Apache Hadoop clusters for application development and Hadoop tools.
  • Installed and configured Hive, wrote Hive UDFs, and used a repository of UDFs for Pig Latin.
  • Developed data pipeline using Pig, Sqoop to ingest cargo data and customer histories into HDFS for analysis.
  • Migrated the existing on-prem code to AWS EMR cluster.
  • Installed and configured Hadoop Ecosystem components and Cloudera manager using CDH distribution.
  • Coordinated with Hortonworks support team through support portal to sort out the critical issues during upgrades.
  • Worked on modeling of dialog processes and business processes and on coding business objects, Query Mapper, and JUnit files.
  • Created automated pipelines in AWS CodePipeline to deploy Docker containers in AWS ECS using S3.
  • Used HBase NoSQL Database for real time and read/write access to huge volumes of data in the use case.
  • Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data into DataFrames, and loaded the data into HBase.
  • Developed an AWS Lambda function to invoke a Glue job as soon as a new file is available in the inbound S3 bucket (see the sketch after this list).
  • Created Spark jobs to apply data cleansing and validation rules to new source files in the inbound bucket and route rejected records to a reject-data S3 bucket.
  • Developed AWS CloudFormation templates, set up Auto Scaling for EC2 instances, and was involved in automated provisioning of the AWS cloud environment using Jenkins.
  • Created HBase tables to load large sets of semi-structured data coming from various sources.
  • Responsible for loading the customer's data and event logs from Kafka into HBase using REST API.
  • Created tables along with sort and distribution keys in AWS Redshift.
  • Created shell scripts and Python scripts to automate daily tasks, including production tasks.
  • Created, altered, and deleted Kafka topics as required.
  • Used cloud computing on the multi-node cluster, deployed the Hadoop application with cloud S3 storage, and used Elastic MapReduce (EMR) to run MapReduce jobs.
  • Developed analytics enablement layer using ingested data that facilitates faster reporting and dashboards.
  • Created Hive external tables to stage data and then moved the data from staging to the main tables.
  • Implemented the big data solution using Hadoop, Hive, and Informatica to pull/load the data into HDFS.
  • Developed applications using Angular 6 and lambda expressions in Java to store and process the data.
  • Implemented the Angular 6 Router to enable navigation from one view to the next as the agent performs application tasks.
  • Pulled data from the Hadoop data lake ecosystem and massaged it with various RDD transformations.
  • Used PySpark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark data frames, Scala and Python.
  • Developed and maintained batch data flow using HiveQL and Unix scripting.
  • Designed and Developed Real time processing Application using Spark, Kafka, Scala and Hive to perform streaming ETL and apply Machine Learning.
  • Developed and executed data pipeline testing processes and validated business rules and policies.
  • Implemented different data formatter capabilities and publishing to multiple Kafka Topics.
  • Written automated HBase test cases for data quality checks using HBase command line tools.
  • Involved in development of Hadoop System and improving multi-node Hadoop Cluster performance.
  • Developed and implemented Apache NiFi across various environments and wrote QA scripts in Python for tracking files.
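
The S3-to-Lambda-to-Glue trigger mentioned above generally follows the pattern sketched below; this is a hypothetical example, and the Glue job name and argument keys are placeholders rather than actual project values.

```python
# Hypothetical AWS Lambda handler: an S3 "ObjectCreated" event invokes this
# function, which starts a Glue job for the newly arrived file.
import urllib.parse

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    started_runs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Kick off the Glue job, passing the new object's location as job arguments.
        response = glue.start_job_run(
            JobName="inbound-file-cleansing",      # placeholder Glue job name
            Arguments={
                "--source_bucket": bucket,
                "--source_key": key,
            },
        )
        started_runs.append(response["JobRunId"])
    return {"started_job_runs": started_runs}
```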

Environment: Hadoop 3.0, MapReduce, Hive 3.0, Agile, HBase 1.2, NoSQL, AWS, EC2, Kafka, Pig 0.17, HDFS, Java 8, Hortonworks, Spark, PL/SQL, Python, Jenkins.

Confidential

Sr. Big Data Developer

Responsibilities:

  • Involved in the complete SDLC of the big data project, including requirements analysis, design, coding, testing, and production.
  • Extensively used Sqoop to import/export data between RDBMS and Hive tables, performed incremental imports, and created Sqoop jobs based on the last saved value.
  • Involved in implementing the data preparation solution, which is responsible for data transformation as well as handling user stories.
  • Developed and tested ETL pipelines for data ingestion, preparation, and dispatch jobs in Azure Data Factory with Blob storage.
  • Worked on migrating existing SQL data and reporting feeds to Hadoop.
  • Developed a Pig script to read CDC files and ingest them into HBase.
  • Worked on HBase table setup and a shell script to automate the ingestion process.
  • Created Hive external tables on top of HBase to be used for feed generation.
  • Scheduled automated runs for production ETL data pipelines in Talend Open Studio for Big Data.
  • Worked on migration of an existing feed from Hive to Spark; to reduce feed latency, the existing HQL was transformed to run using Spark SQL and HiveContext (a minimal sketch follows this list).
  • Worked on log monitoring using Splunk; set up Splunk forwarders and built dashboards in Splunk.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and Azure Data Lake Analytics; ingested data into one or more Azure services (Azure Data Lake Storage Gen2, Azure Storage, Azure SQL DW) and processed the data in Azure Databricks.
  • Worked on migration of data from on-prem SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
  • Created dispatcher jobs using Sqoop export to dispatch the data into Teradata target tables.
  • Involved in indexing the files using Solr to remove duplicates in type 1 insert jobs.
  • Implemented a new Pig approach for SCD Type 1 jobs using Pig Latin scripts.
  • Created Hive target tables using HQL to hold the data after all the Pig ETL operations.
  • Created HQL scripts to perform data validation once the transformations were done as per the use case.
  • Implemented Snappy compression on HBase tables to free up and reclaim space in the cluster.
  • Gained hands-on experience accessing HBase data and performing CRUD operations against it.
  • Integrated a SQL layer on top of HBase to get the best read and write performance, using the salting feature.
  • Wrote shell scripts to automate the process by scheduling and calling the scripts from the scheduler.
  • Created Hive scripts to load the historical data and to partition it.
  • Collaborated closely with both the onsite and offshore teams.
  • Worked closely with the app support team to deploy the developed jobs into production.
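
The Hive-to-Spark migration referenced above typically looks like the minimal sketch below; this is a hypothetical example using the SparkSession API with Hive support, and the table and column names are placeholders rather than the actual feed definitions.

```python
# Rough sketch of running an existing HQL feed through Spark SQL with Hive support
# (the HiveContext pattern expressed with the newer SparkSession API).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-feed-on-spark")
         .enableHiveSupport()   # lets Spark SQL read existing Hive tables
         .getOrCreate())

# The original HQL can usually be reused as-is through spark.sql(...).
feed_df = spark.sql("""
    SELECT customer_id,
           SUM(txn_amount) AS total_amount
    FROM   warehouse.transactions                 -- placeholder Hive table
    WHERE  txn_date >= date_sub(current_date(), 7)
    GROUP BY customer_id
""")

# Write the result back to a Hive table for downstream consumers.
feed_df.write.mode("overwrite").saveAsTable("warehouse.weekly_customer_feed")
```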

Environment: Hadoop, HDFS, Map Reduce, Hive, Flume, Sqoop, PIG, Java (JDK 1.6), Eclipse, MySQL and Ubuntu, Zookeeper, SQL Server, Talend Open Studio for Big Data, Shell Scripting.

Confidential

Hadoop Developer

Responsibilities:

  • Implemented solutions for ingesting data from various sources and processing the data at rest using big data technologies such as Hadoop, MapReduce frameworks, HBase, and Hive.
  • Used Sqoop to efficiently transfer data between databases and HDFS and used Flume to stream the log data from servers.
  • Developed full SDLC of AWS Hadoop cluster based on client's business needs.
  • Involved in loading and transforming large sets of structured, semi-structured and unstructured data from relational databases into HDFS using Sqoop imports.
  • Implemented an enterprise-grade platform (MarkLogic) for ETL from mainframe to NoSQL (Cassandra).
  • Responsible for importing log files from various sources into HDFS using Flume.
  • Analyzed data using HiveQL to generate payer reports from payment summaries for transmission to payers.
  • Imported millions of structured data from relational databases using Sqoop import to process using Spark and stored the data into HDFS in CSV format.
  • Used the DataFrame API in Scala to work with distributed collections of data organized into named columns.
  • Performed data profiling and transformation on the raw data using Pig, Python, and Java.
  • Developed predictive analytics using Apache Spark Scala APIs.
  • Involved in big data analysis using Pig and user-defined functions (UDFs).
  • Created Hive External tables and loaded the data into tables and query data using HQL.
  • Implemented Spark Graph application to analyze guest behavior for data science segments.
  • Made enhancements to the traditional data warehouse based on a star schema, updated data models, and performed data analytics and reporting using Tableau.
  • Involved in migration of data from the existing RDBMS (Oracle and SQL Server) to Hadoop using Sqoop for processing.
  • Developed Shell, Perl, and Python scripts to automate and provide Control flow to Pig scripts.
  • Developed a prototype for big data analysis using Spark, RDDs, DataFrames, and the Hadoop ecosystem with CSV, JSON, and Parquet files on HDFS (see the sketch after this list).
  • Developed Hive SQL scripts for performing transformation logic and loading the data from staging zone to landing zone and Semantic zone.
  • Involved in creating Oozie workflow and Coordinator jobs for Hive jobs to kick off the jobs on time for data availability.
  • Worked on the Oozie scheduler to automate the pipeline workflow and orchestrate the Sqoop, Hive, and Pig jobs that extract the data in a timely manner.
  • Exported the generated results to Tableau for testing by connecting to the corresponding Hive tables using Hive ODBC connector.
  • Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization, and user report generation.
  • Managed and led the development effort with the help of a diverse internal and overseas group.
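
The prototype work described above can be pictured with the minimal, hypothetical PySpark sketch below; file paths, table aliases, and column names are assumptions for the example only.

```python
# Illustrative prototype: load Sqoop-imported CSV from HDFS into a DataFrame,
# explore it with Spark SQL, and persist both raw and summarized data as Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-format-prototype").getOrCreate()

# CSV landed in HDFS by the Sqoop import (schema inferred for the prototype).
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///sqoop/orders/"))

# Register a temporary view so the data can be explored with plain SQL.
orders.createOrReplaceTempView("orders")
daily_totals = spark.sql("""
    SELECT order_date,
           COUNT(*)         AS order_count,
           SUM(order_total) AS revenue
    FROM orders
    GROUP BY order_date
""")

# Persist the raw and summarized data as Parquet for downstream analysis.
orders.write.mode("overwrite").parquet("hdfs:///prototype/orders_parquet/")
daily_totals.write.mode("overwrite").parquet("hdfs:///prototype/daily_totals/")
```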

Environment: Big Data, Spark, YARN, Hive, Pig, JavaScript, JSP, HTML, Ajax, Scala, Python, Hadoop, AWS, DynamoDB, Kibana, Cloudera, EMR, JDBC, Redshift, NoSQL, Sqoop, MySQL.
