
Sr. Big Data Engineer Resume


FL

SUMMARY

  • Highly dedicated, inspiring, and expert Sr. Big Data Engineer with over 7 years of IT industry experience exploring various technologies, tools, and databases such as Big Data, Hive, Spark, Python, Sqoop, CDL (Cassandra), SQL, PL/SQL, and Redshift, while always making sure to live in the world I cherish most, i.e., the DATA WORLD.
  • Excellent Programming skills at a higher level of abstraction using Scala and Python.
  • Experience in streaming analytics using Spark Streaming, DStreams, and Databricks Delta.
  • Strong experience and knowledge of real-time data analytics using Spark Streaming, Kafka, and Flume.
  • Worked on reading multiple data formats on HDFS using Scala.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premise databases to Azure Data Lake Store using Azure Data Factory.
  • Experienced in building data ingestion platforms from various channels into Hadoop using PySpark and the Spark Streaming framework.
  • Good experience in tracking and logging end-to-end software application builds using Azure DevOps.
  • Good experience in creating and designing data ingestion pipelines using technologies such as Apache Storm and Kafka.
  • Good working experience using Sqoop to import data from RDBMS into HDFS and vice versa.
  • Working exposure to ingestion tools like StreamSets and Apache NiFi.
  • Debugged Pig and Hive scripts and optimized and debugged MapReduce jobs.
  • Developed and designed automation frameworks using Python and shell scripting.
  • Implemented the PySpark SQL module, which provides optimized data queries through the Spark session.
  • Experience with Unix/Linux systems, shell scripting, and building data pipelines.
  • Hands-on experience in using Bitbucket, Subversion and Git as source code version control.
  • Experience with all stages of the SDLC and the Agile development model, from requirement gathering through deployment and production support.
  • Extensive knowledge of developing Spark SQL jobs using DataFrames.
  • Experience with the Control-M/Autosys scheduling tools and running ETL jobs through them.
  • Involved in daily Scrum meetings to discuss development progress and active in making Scrum meetings more productive.

TECHNICAL SKILLS

Big Data Space: Hadoop, MapReduce, Pig, Hive, HBASE, YARN, Kafka, Flume, Sqoop, Impala, Oozie, Zookeeper, Spark, Ambari, Elastic Search, Solr, MongoDB, Cassandra, Avro, Storm, Parquet, Snappy, AWS

Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR, Amazon EMR

Databases & warehouses: NoSQL, Oracle, DB2, MySQL, SQL Server, MS Access, Teradata.

Java Space: Core Java, J2EE, JDBC, JNDI, JSP, EJB, Struts, Spring Boot, REST, SOAP, JMS

Languages: Python, Java, JRuby, SQL, PL/SQL, Scala, JavaScript, Shell Scripts, C/C++

Web Technologies: HTML, CSS, JavaScript, AJAX, JSP, DOM, XML, XSLT

IDE: Eclipse, NetBeans, JDeveloper, IntelliJ IDEA.

Operating systems: UNIX, Linux, Mac OS, Windows variants

RDBMS: Teradata, Oracle 9i/10g/11g, MS SQL Server, MySQL, DB2.

Version controls: GIT, SVN, CVS

ETL Tools: Informatica, AB Initio, Talend

Reporting: Cognos TM1, Tableau, SAP BO, Power BI

PROFESSIONAL EXPERIENCE

Sr. Big Data Engineer

Confidential, FL

Responsibilities:

  • Analyzed large and critical datasets using Cloudera, HDFS, MapReduce, Hive, Hive UDF, Pig, Sqoop and Spark.
  • Developed Spark applications using Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Involved in building a data pipeline and performed analytics using the AWS stack (EMR, EC2, S3, RDS, Lambda, Kinesis, Athena, SQS, Redshift, and ECS).
  • Core person in data ingestion team, involved in designing data flow pipelines.
  • Configured Linux on multiple Hadoop environments, setting up Dev, Test, and Prod clusters within the same configuration.
  • Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Performed ETL of big datasets into Databricks Delta Lake.
  • Utilized Spark's in-memory capabilities to handle large datasets on the S3 data lake.
  • Used Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model that gets the data from Kafka in near real time and persists it to Cassandra (see the sketch after this list).
  • Design/Implement large scale pub-sub message queues using Apache Kafka
  • Built and Deployed Industrial scale Data Lake on premise and Cloud platforms.
  • Worked on Kafka messaging queue for Data Streaming in both batch and real-time applications.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load a huge number of CSV files with different schemas into Hive ORC tables.
  • Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity.
  • Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
  • Worked on Python to place data into JSON files for testing Django Websites. Created scripts for data modeling and data import and export.
  • Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
  • Responsible for building scalable distributed data solutions using an Amazon EMR cluster environment.
  • Experienced in writing live real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
  • Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms
  • Developed a Spark Streaming module for consumption of Avro messages from Kafka.
  • Worked with Linux systems and RDBMS databases on a regular basis in order to ingest data using Sqoop.
  • Designed ETL data pipelines using a combination of tools like Hive, Spark SQL, and PySpark.
  • Worked on importing and exporting data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.
  • Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run Airflow.
  • Good experience in logging defects in Jira and Azure DevOps tools.
  • Responsible for design and maintenance of the Bitbucket repositories, views, and access control strategies.
  • Wrote various data normalization jobs for new data ingested into Redshift
  • Used JIRA for bug tracking and CVS for version control.
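
For illustration, the Kafka-to-Cassandra streaming consumption referenced above might look roughly like the minimal PySpark sketch below, assuming Spark Structured Streaming and the DataStax Spark Cassandra Connector on the classpath; the broker, topic, keyspace, table, and schema names are hypothetical placeholders.

```python
# Minimal sketch: consume JSON events from Kafka and persist each micro-batch
# to Cassandra via the Spark Cassandra Connector (placeholder names throughout).
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("learner-events-stream").getOrCreate()

# Hypothetical event schema for the learner data model
schema = StructType([
    StructField("learner_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
          .option("subscribe", "learner-events")                # placeholder topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # Requires the DataStax Spark Cassandra Connector package
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(table="learner_events", keyspace="analytics")     # placeholder names
     .mode("append")
     .save())

query = (events.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "/tmp/checkpoints/learner-events")
         .start())
query.awaitTermination()
```

Writing through foreachBatch keeps the Cassandra persistence in plain batch write syntax while the checkpoint location gives near-real-time, at-least-once delivery.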

Environment: Hadoop, Map Reduce, Hive, HDFS, PIG, Scala, Airflow, Spark SQL, PySpark, Databricks, Delta Lake, Azure Data Factory, Azure Databricks, Azure DevOps, Oracle, Control M, Snowflake, Python, Git, Unix/Linux, JIRA

Big Data Engineer

Confidential, Charlotte, NC

Responsibilities:

  • Primary responsibilities include building scalable distributed data solutions using Hadoop ecosystem.
  • Prepared the Technical Specification document for the ETL job development.
  • Developed various Python scripts to find vulnerabilities in SQL queries by doing SQL injection, permission checks, and analysis.
  • Maintained the data lake in Hadoop by building data pipelines using Sqoop, Hive, and PySpark.
  • Used Spark to process the data before ingesting it into HBase; both batch and real-time Spark jobs were created using Scala.
  • Developed an ETL pipeline to extract archived logs from disparate sources and store them in the S3 data lake.
  • Used Spark SQL to load JSON data, create SchemaRDDs, and load them into Hive tables, and handled structured data using Spark SQL.
  • Designed the migration strategy and successfully migrated databases from on-premise to the AWS cloud.
  • Performed real-time event processing of data from multiple servers in the organization using Apache Storm integrated with Apache Kafka.
  • Implemented the PySpark SQL module, which provides optimized data queries through the Spark session.
  • Implemented near-real-time data processing using StreamSets and the Spark/Databricks framework.
  • Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with reference tables and historical metrics.
  • Migrated data from AWS S3 bucket to Snowflake by writing custom read/write snowflake utility function using Scala.
  • Implemented Spark SQL to connect to Hive, read the data, and distribute processing to make it highly scalable.
  • Involved in loading data from LINUX file system to HDFS.
  • Used PySpark to create a batch job for merging multiple small files (Kafka stream files) into fewer, larger files in Parquet format (see the sketch after this list).
  • Worked with the Kafka message queue for Spark Streaming.
  • Created Airflow Scheduling scripts in Python
  • Developed and tested Databricks technologies on the Databricks platform.
  • Involved in Hadoop cluster tasks like adding and removing nodes without any effect on running jobs and data.
  • Followed agile methodology for the entire project.
  • Worked on configuration and automation of workflows using Control-M
  • Created tickets in JIRA for the tasks and created branches in Git.
  • Experienced in Extreme Programming, Test-Driven Development and Agile Scrum
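
The small-file compaction job mentioned above could look roughly like this minimal PySpark sketch; the paths, partition value, and target file count are hypothetical.

```python
# Minimal sketch: compact many small Kafka-landed files into a few larger
# Parquet files (placeholder paths; file count is a tuning choice).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-file-compaction").getOrCreate()

# Read the small files landed by the streaming job (placeholder path)
df = spark.read.parquet("hdfs:///data/raw/kafka_landing/dt=2020-01-01")

# Coalesce to a handful of larger output files and rewrite as Parquet
(df.coalesce(8)
   .write
   .mode("overwrite")
   .parquet("hdfs:///data/compacted/dt=2020-01-01"))
```

coalesce avoids a full shuffle when only reducing the file count; repartition would be the alternative if the data also needed rebalancing.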

Environment: Apache NiFi, Delta Lake, ETL, Airflow, Hive, HDFS, Spark SQL, PySpark, Scala, Kafka, Git, Control-M, Linux.

Big Data Engineer

Confidential, MI

Responsibilities:

  • Developed new Spark SQL ETL logic in Big Data for the migration and availability of the facts and dimensions used for analytics.
  • Involved in requirement gathering from the business analysts and participated in discussions with users and functional analysts on the business logic implementation.
  • Designed data flows in Hadoop, starting from data ingestion using Sqoop, through data transformation using Pig/PySpark scripts, to final storage as Hive tables.
  • Played a key role in a team developing an initial prototype of a NiFi big data pipeline.
  • Developed, tested, and maintained pipelines by connecting various data sources and building the final products.
  • Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
  • Developed Kafka producers and consumers, HBase clients, and Spark jobs using Scala APIs, along with components on HDFS and Hive.
  • Developed and implemented ETL pipelines using Python, SQL, Spark, and PySpark to ingest data and updates into relevant databases (see the sketch after this list).
  • Created views on top of the Hive tables and provided them to customers for analytics.
  • Developed Unix shell scripts to load a large number of files into HDFS from the Linux file system.
  • Reviewed Kafka cluster configurations and provided best practices to get peak performance.
  • Involved in creating Hive tables, loading data, writing Hive queries, and creating partitions and buckets for optimization.
  • Designed and implemented HIVE queries and functions for evaluation, filtering, loading and storing of data.
  • Extensive experience in writing UNIX shell scripts and automating the ETL processes using UNIX shell scripting.
  • Involved in performance tuning of the ETL process by addressing various performance issues at the extraction and transformation stages.
  • Involved in developing automated workflows for daily incremental loads, moved data from RDBMS to Data Lake.
  • Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka and Zookeeper.
  • Used Hive to do transformations, event joins, and some pre-aggregations before storing the data onto HDFS.
  • Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats like text and CSV files.
  • Implemented development activities in complete agile model using JIRA, and GIT.
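
A PySpark ETL step of the kind described above, transforming ingested data and storing it as a partitioned Hive table, might look roughly like this sketch; the database, table, and column names are hypothetical.

```python
# Minimal sketch: cleanse Sqoop-landed data and persist it as a partitioned
# Hive ORC table for downstream analytics (placeholder names throughout).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = (SparkSession.builder
         .appName("facts-dims-etl")
         .enableHiveSupport()          # needed to write managed Hive tables
         .getOrCreate())

# Read raw data previously ingested with Sqoop (placeholder table)
raw = spark.table("staging.orders_raw")

# Basic cleansing and typing before loading the fact table
orders = (raw.filter(col("order_id").isNotNull())
             .withColumn("order_date", to_date(col("order_ts"))))

# Persist as a partitioned Hive table; analysts can then build views on top
(orders.write
       .mode("overwrite")
       .partitionBy("order_date")
       .format("orc")
       .saveAsTable("analytics.fact_orders"))
```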

Environment: ETL, HDFS, Hive, Python, Apache NiFi, Kafka, PySpark, Scala, UNIX, Linux, Shell Scripting, Control-M, Spark SQL, Git.

Hadoop Developer

Confidential

Responsibilities:

  • Participated in all the phases of the Software Development Life Cycle (SDLC), including development, testing, implementation, and maintenance.
  • Installed and configured Hadoop MapReduce, HDFS and developed multiple MapReduce jobs in Java for data cleansing and preprocessing.
  • Involved in loading data from UNIX file system to HDFS.
  • Installed and configured Hive and wrote Hive UDFs.
  • Imported and exported data into HDFS and Hive using Sqoop.
  • Used Cassandra CQL and Java APIs to retrieve data from Cassandra tables (see the sketch after this list).
  • Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, manage and review data backups, manage and review Hadoop log files.
  • Worked hands-on with the ETL process using Informatica.
  • Handled importing of data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
  • Extracted the data from Teradata into HDFS using Sqoop.
  • Analyzed the data by performing Hive queries and running Pig scripts to understand user behavior.
  • Exported the analyzed patterns back into Teradata using Sqoop.
  • Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
  • Installed the Oozie workflow engine to run multiple Hive jobs.
  • Developed Hive queries to process the data and generate the data cubes for visualization.
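
The Cassandra reads mentioned above used CQL through the Java APIs; purely for illustration, the equivalent call pattern with the DataStax Python driver (cassandra-driver) is sketched below, with the host, keyspace, table, and query values as hypothetical placeholders.

```python
# Hedged sketch: retrieve rows from a Cassandra table with CQL using the
# DataStax Python driver (the resume bullet used the Java APIs).
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-node1"])          # placeholder contact point
session = cluster.connect("user_activity")      # placeholder keyspace

# Simple parameterized CQL read; prepared statements are preferred for repeated queries
rows = session.execute(
    "SELECT user_id, event_type, event_ts FROM events WHERE user_id = %s",
    ("u-12345",)
)
for row in rows:
    print(row.user_id, row.event_type, row.event_ts)

cluster.shutdown()
```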

Environment: Hadoop, MapReduce, HDFS, UNIX, Hive, Sqoop, Cassandra, ETL, Pig Script, Cloudera, Oozie.

Hadoop Developer

Confidential

Responsibilities:

  • Extracted data files from MySQL and Oracle through Sqoop, placed them in HDFS, and processed them.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data.
  • Responsible for managing data coming from different sources.
  • Involved in loading data from the UNIX file system to HDFS.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Experienced in running Hadoop Streaming jobs to process terabytes of XML and CSV format data (see the sketch after this list).
  • Loaded structured data from Oracle into Cassandra (NoSQL) using Sqoop.
  • Worked on file formats: CSV, XML, JSON, Avro, Parquet.
  • Used compression formats Snappy, bz2, and Avro.
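
A Hadoop Streaming job of the kind described above pairs a Python mapper and reducer with the hadoop-streaming jar; a minimal mapper sketch for CSV input is shown below, with the column position and key entirely hypothetical.

```python
#!/usr/bin/env python
# mapper.py - minimal Hadoop Streaming mapper for CSV input; emits
# (product_id, 1) per well-formed record. Column position is hypothetical;
# a companion reducer would sum the counts per key.
import sys
import csv

for record in csv.reader(sys.stdin):
    if len(record) < 3:
        continue                     # skip malformed rows
    product_id = record[2]           # hypothetical column position
    sys.stdout.write("%s\t1\n" % product_id)
```

Such a job would be launched with something like `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input ... -output ...`, with the input and output paths elided here.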

Environment: HDFS, Hive, Map Reduce, Eclipse, Oracle, MySQL, Unix, Sqoop, Cassandra, Shell Scripting.
