
Senior Big Data Engineer Resume


West Chester, PA

SUMMARY

  • Over 8 years of IT development experience, including experience with the Big Data/Hadoop ecosystem and related technologies.
  • Writing code to create single-threaded, multi-threaded, or user-interface event-driven applications, either stand-alone or ones that access servers or services.
  • Experience in analyzing data using Hive, Pig and custom MR programs in Java.
  • Experienced in writing MapReduce programs and UDFs for both Pig and Hive in Java.
  • Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Good understanding of Hadoop architecture and hands-on experience with Hadoop components such as Job Tracker, Task Tracker, Name Node, Data Node, and MapReduce concepts, as well as the HDFS framework.
  • Extensively worked on AWS services such as EC2, S3, EMR, FSx, Lambda, CloudWatch, RDS, Auto Scaling, CloudFormation, SQS, ECS, EFS, DynamoDB, Route 53, Glue, etc.
  • Experienced with cloud platforms: Hadoop on Azure, AWS EMR, Cloudera Manager, and Hadoop run directly on EC2 (non-EMR).
  • Solid experience in Big Data analytics with hands-on experience installing, configuring, and using ecosystem components such as Hadoop MapReduce, HDFS, HBase, Zookeeper, Hive, Sqoop, Pig, Flume, Cassandra, Kafka, Spark, Oozie, Airflow, and NiFi (ETL).
  • Experience in analyzing data from multiple sources and creating reports with interactive dashboards using Power BI.
  • Experience in importing and exporting data with Sqoop between HDFS and relational database systems, and loading it into partitioned Hive tables.
  • Ran Apache Hadoop, CDH, and MapR distributions on Amazon Elastic MapReduce (EMR) on EC2.
  • Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
  • Experienced in managing databases and Azure Data Platform services (Azure Data Lake (ADLS), Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB), SQL Server, Oracle, data warehouses, etc. Built multiple data lakes.
  • Experience in data load management, importing and exporting data using Sqoop and Flume.
  • Extensive knowledge of designing reports, scorecards, and dashboards using Power BI.
  • Expertise in preparing interactive data visualizations using Tableau from data in different sources.
  • Hands-on experience with VPN, PuTTY, WinSCP, and CI/CD (Jenkins).
  • Strong experience in analyzing large data sets by writing PySpark scripts and Hive queries (a minimal PySpark sketch follows this list).
  • Experience in dealing with log files to extract data and copy it into HDFS using Flume.
  • Experience in integrating Hive and HBase for effective operations.
  • Experienced in handling different file formats such as text files, Avro data files, sequence files, XML, and JSON files.
  • Experience with Impala, Solr, MongoDB, HBase, Spark, and Kubernetes.
  • Hands-on knowledge of writing code in Scala and Core Java, as well as R.
  • Worked on proofs of concept (PoCs) and gap analysis; gathered the necessary data for analysis from different sources and prepared it for data exploration using data munging and Teradata.
  • Well experienced in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
  • Good experience in using data modeling techniques and deriving results with SQL and PL/SQL queries.
  • Experience working with different databases such as Oracle, SQL Server, and MySQL, and writing stored procedures, functions, joins, and triggers for different data models.
  • Experience in testing and documenting software for client applications.
  • Experienced with code versioning and dependency management systems such as Git, SVN, and Maven.
  • Expertise in Waterfall and Agile (Scrum) methodologies.
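
A minimal PySpark sketch of the kind of data-set analysis described above (referenced from the PySpark/Hive bullet); the table and column names (web_logs, status, event_date) are hypothetical placeholders, not from any specific engagement.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hive support lets one session mix DataFrame code with Hive-managed tables.
spark = (SparkSession.builder
         .appName("log-analysis")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical partitioned Hive table.
logs = spark.table("web_logs")

# Count error responses per day and surface the noisiest days first.
daily_errors = (logs
                .filter(F.col("status") >= 400)
                .groupBy("event_date", "status")
                .count()
                .orderBy(F.desc("count")))

daily_errors.show(20, truncate=False)
```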

TECHNICAL SKILLS

Hadoop Ecosystem: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Flume, Apache Storm, Apache Airflow, HBase

Programming Languages: Java, PL/SQL, SQL, Python, Scala, PySpark, C, C++

Cluster Mgmt & Monitoring: CDH 4, CDH 5, Hortonworks Ambari 2.5

Databases: MySQL, SQL Server, Oracle, MS Access

NoSQL Databases: MongoDB, Cassandra, HBase

Workflow mgmt. tools: Oozie, Apache Airflow

Visualization & ETL tools: Tableau, BananaUI, D3.js, Informatica, Talend

Cloud Technologies: Azure, AWS

IDEs: Eclipse, Jupyter Notebook, Spyder, PyCharm, IntelliJ

Version Control Systems: Git, SVN

Operating Systems: Unix, Linux, Windows

PROFESSIONAL EXPERIENCE

Confidential, West Chester, PA

Senior Big Data Engineer

Responsibilities:

  • Designed the business requirement collection approach based on the project scope and SDLC methodology.
  • Developed Apache Spark applications for processing data from various streaming sources.
  • Developed a Spark job in Java that indexes data into Elasticsearch from external Hive tables stored in HDFS.
  • Responsible for developing a data pipeline with Amazon AWS to extract data from weblogs and store it in HDFS; worked extensively with Sqoop for importing metadata from Oracle.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and AWS Lambda functions in Java to perform event-driven processing; created Lambda jobs and configured roles using the AWS CLI (a minimal Lambda sketch follows this list).
  • Wrote MapReduce programs and Hive UDFs in Java.
  • Experience in using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
  • Developed custom Kafka producers and consumers for publishing to and subscribing to different Kafka topics.
  • Migrated MapReduce jobs to Spark jobs to achieve better performance.
  • Worked on designing the MapReduce and YARN flow, writing MapReduce scripts, performance tuning, and debugging.
  • Implemented data quality checks in the ETL tool Talend; good knowledge of data warehousing.
  • Installed applications on AWS EC2 instances and configured storage on S3 buckets.
  • Stored data in AWS S3, used like HDFS, and ran EMR jobs on the stored data.
  • Responsible for developing a data pipeline using Spark, Scala, and Apache Kafka to ingest data from the CSL source and store it in a protected HDFS folder (a minimal streaming sketch follows this list).
  • Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.
  • Designed and developed RDD seeds using Scala and Cascading; streamed data to Spark Streaming using Kafka.
  • Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds.
  • Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Used IAM to detect and stop risky identity behaviors using rules, machine learning, and other statistical algorithms.
  • Responsible for managing data coming from different sources through Kafka.
  • Used the Spark DataFrame API in Scala for analyzing data.
  • Developed end-to-end data processing pipelines that receive data through the distributed messaging system Kafka and persist it into Cassandra.
  • Used the AWS CLI to suspend an AWS Lambda function and to automate backups of ephemeral data stores to S3 buckets and EBS.
  • Extracted and updated data in HDFS using Sqoop import and export.
  • Developed Hive UDFs to incorporate external business logic into Hive scripts, and developed join data set scripts.
  • Worked with various HDFS file formats such as Parquet and JSON for serializing and deserializing data.
  • Worked on AWS Lambda functions in Python that invoke scripts to perform various transformations and analytics on large data sets in EMR clusters.
  • Exposure to Spark, Spark Streaming, Spark MLlib, Snowflake, and Scala; created and handled DataFrames in Spark with Scala.
  • Implemented many Kafka ingestion jobs to consume data for real-time and batch processing.
  • Developed a NiFi workflow to pick up data from an SFTP server and send it to a Kafka broker.
  • Developed Oozie workflows to run multiple Hive, Pig, Tealeaf, MongoDB, Git, Sqoop, and Spark jobs.
  • Worked on setting up and configuring AWS EMR clusters and used Amazon IAM to grant users fine-grained access to AWS resources.
  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.
  • Good understanding of NoSQL databases and hands-on experience writing applications against HBase, Cassandra, and MongoDB.
  • Worked with Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, PySpark, Impala, Tealeaf, pair RDDs, NiFi, DevOps, and Spark on YARN.
  • Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
  • Good experience in using relational databases: Oracle, MySQL, SQL Server, and PostgreSQL.
  • Developed Java MapReduce programs for the analysis of sample log files stored in the cluster.
  • Very good implementation experience with object-oriented concepts, multithreading, and Java/Scala.
  • Experienced with Scala and Spark, improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, pair RDDs, and Spark on YARN.
  • Evaluated client needs and translated their business requirements into functional specifications, thereby onboarding them onto the Hadoop ecosystem.
  • Implemented a cluster for the NoSQL tool HBase as part of a PoC to address HBase limitations.
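
A minimal sketch of an event-driven AWS Lambda handler in Python, in the spirit of the Lambda bullets above; the trigger shape is the standard S3 put event, but the bucket and follow-up logic are hypothetical.

```python
import json
import logging

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered by an S3 put event; logs each new object and inspects its metadata."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        logger.info("New object s3://%s/%s", bucket, key)

        # Hypothetical follow-up step: fetch object metadata before further processing.
        head = s3.head_object(Bucket=bucket, Key=key)
        logger.info("Size: %s bytes", head["ContentLength"])

    return {"statusCode": 200, "body": json.dumps("processed")}
```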
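
A minimal Spark Structured Streaming sketch for the Kafka-to-HDFS ingestion described above; the original pipeline was built in Scala, and the broker list, topic name, and HDFS paths below are hypothetical placeholders. The job also assumes the spark-sql-kafka connector is on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Hypothetical brokers and topic; substitute the real cluster values.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
       .option("subscribe", "weblogs")
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers key/value as binary; cast the payload to string for parsing.
events = raw.select(F.col("value").cast("string").alias("line"),
                    F.col("timestamp"))

# Land the stream in a protected HDFS folder as Parquet; paths are placeholders.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///protected/weblogs/")
         .option("checkpointLocation", "hdfs:///checkpoints/weblogs/")
         .outputMode("append")
         .start())

query.awaitTermination()
```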

Environment: Hadoop (HDFS, MapReduce), Kafka, Scala, AWS services (Lambda, EMR, Auto Scaling), YARN, IAM, PostgreSQL, Spark, Impala, MongoDB, Java, Pig, DevOps, HBase, Oozie, Hue, Sqoop, Flume, Oracle, NiFi, Git.

Confidential, Deerfield Beach, FL

Big Data Engineer

Responsibilities:

  • Involved in requirement gathering and business analysis, and translated business requirements into technical designs in Hadoop and Big Data.
  • Installed and configured Apache Hadoop to test the maintenance of log files in the Hadoop cluster.
  • Installed and configured Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
  • Enhanced the data ingestion framework by creating more robust and secure data pipelines.
  • Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database, and SQL Data Warehouse environment.
  • Used Windows Azure SQL Reporting Services to create reports with tables, charts, and maps.
  • Populated HDFS and PostgreSQL with huge amounts of data using Apache Kafka.
  • Used NiFi to ping Snowflake to keep the client session alive.
  • Developed MapReduce programs for data analysis and data cleaning.
  • Expertise in Snowflake for creating and maintaining tables and views.
  • Used Power BI and Power Pivot to develop data analysis prototypes, and used Power View and Power Map to visualize reports.
  • Worked on importing and exporting data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.
  • Involved in building business intelligence reports and dashboards on the Snowflake database using Tableau.
  • Involved in large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Developed simple to complex MapReduce jobs using Hive and Pig.
  • Set up and benchmarked Hadoop/HBase clusters for internal use.
  • Developed Java MapReduce programs for the analysis of sample log files stored in the cluster.
  • Generated ad-hoc reports in Excel Power Pivot and shared them through Power BI with decision makers for strategic planning.
  • Performed data preparation using Pig Latin to get the data into the required format.
  • Primarily involved in the data migration process using Azure, integrating with a GitHub repository and Jenkins.
  • Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
  • Used Git for version control with the data engineering team and data scientist colleagues.
  • Built machine learning models to showcase Big Data capabilities using PySpark and MLlib (a minimal MLlib sketch follows this list).
  • Implemented data streaming capability using Kafka and Talend for multiple data sources.
  • Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
  • Processed image data through the Hadoop distributed system using Map and Reduce, then stored the results in HDFS.
  • Working knowledge of cluster security components like Kerberos, Sentry, SSL/TLS etc.
  • Involved in the development of agile, iterative, and proven data modeling patterns that provide flexibility.
  • Knowledge of implementing JILs to automate jobs in the production cluster.
  • Worked on analyzing and resolving the production job failures in several scenarios.
  • Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.
  • Involved in Migrating Objects from Teradata to Snowflake.
  • Extracted and loaded data into the data lake environment (MS Azure) using Sqoop, where it was accessed by business users.
  • Developed Spark scripts using Python on Azure HDInsight for data aggregation and validation, and verified their performance against MR jobs.
  • Analyzed data from multiple sources and created reports with interactive dashboards using Power BI.
  • Used Scala to store streaming data to HDFS and implemented Spark for faster data processing.
  • Developed an Apache Storm, Kafka, and HDFS integration project to perform real-time data analysis.
  • Created session beans and controller servlets for handling HTTP requests from Talend.
  • Performed data visualization and designed dashboards with Tableau, and generated complex reports including charts, summaries, and graphs to interpret the findings for the team and stakeholders.
  • Wrote documentation for each report including purpose, data source, column mapping, transformation, and user group.
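
A minimal PySpark MLlib sketch of the model-building mentioned above; the training table and feature/label column names are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = (SparkSession.builder
         .appName("mllib-demo")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical training table with numeric features and a binary label column.
df = spark.table("training_data")

# Assemble raw columns into a single feature vector, then fit a simple classifier.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

predictions = model.transform(test)
predictions.select("label", "prediction", "probability").show(10)
```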

Environment: Hadoop, Snowflake, NiFi, Spark, PySpark, Jenkins, Kafka, Talend, Spark SQL, Spark MLlib, Power BI, MS Azure, Azure Databricks, Azure SQL, Azure Data Factory (ADF), Azure Data Lake, Pig, Python, NLTK, Pandas, Tableau, Ubuntu, Teradata, GitHub

Confidential, Hunt Valley, MD

Data Engineer

Responsibilities:

  • Experience in job management using the Fair Scheduler; developed job processing scripts using Oozie workflows.
  • Used Kafka and Kafka brokers, initiated the Spark context, and processed live streaming information with RDDs; used Kafka to load data into HDFS and NoSQL databases.
  • Used Zookeeper to store offsets of messages consumed for a specific topic and partition by a specific consumer group in Kafka.
  • Worked with and learned a great deal from AWS cloud services such as EC2, S3, EBS, RDS, and VPC.
  • Developed a Pig program for loading and filtering streaming data into HDFS using Flume.
  • Experienced in handling data from different datasets, joining them, and pre-processing them using Pig join operations.
  • Developed an HBase data model on top of HDFS data to perform real-time analytics using the Java API.
  • Used Spark and Hive to implement the transformations needed to join the daily ingested data with historic data.
  • Used the Spark Streaming APIs to perform the necessary transformations and actions on the fly for building the common learner data model, which gets data from Kafka in near real time.
  • Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and memory tuning.
  • Responsible for implementing the ETL process through Kafka-Spark-HBase integration per the requirements of the customer-facing API.
  • Developed reusable objects such as PL/SQL program units and libraries, database procedures and functions, and database triggers to be used by the team, satisfying the business rules.
  • Optimized existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
  • Built pipelines to move hashed and un-hashed data from XML files to the data lake.
  • Expertise in analyzing data using Pig scripting, Hive queries, Spark (Python), and Impala.
  • Experienced in writing live real-time processing using Spark Streaming with Kafka.
  • Created Cassandra tables to store various formats of data coming from different sources.
  • Involved in importing real-time data into Hadoop using Kafka and implemented the Oozie job for daily imports.
  • Helped maintain and troubleshoot UNIX and Linux environments.
  • Developed PySpark and Spark SQL code to process data in Apache Spark on Amazon EMR and perform the necessary transformations based on the STMs developed (a minimal transformation sketch follows this list).
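
A minimal PySpark/Spark SQL sketch of the source-to-target style transformation described in the last bullet; the S3 paths, column names, and mapping rules are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stm-transform").getOrCreate()

# Hypothetical source extract landed on S3 by an upstream job.
src = spark.read.parquet("s3://example-bucket/landing/orders/")
src.createOrReplaceTempView("orders_src")

# Express the mapping rules in Spark SQL, mirroring a source-to-target mapping document.
target = spark.sql("""
    SELECT order_id,
           UPPER(TRIM(customer_name))      AS customer_name,
           CAST(order_ts AS DATE)          AS order_date,
           ROUND(quantity * unit_price, 2) AS order_amount
    FROM orders_src
    WHERE order_status <> 'CANCELLED'
""")

# Write the conformed data back out, partitioned for downstream consumers.
(target.write
       .mode("overwrite")
       .partitionBy("order_date")
       .parquet("s3://example-bucket/curated/orders/"))
```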

Environment: Spark, Kafka, Hadoop, HDFS, Spark SQL, AWS, Python, MapReduce, Pig, Hive, Oracle 11g, MySQL, MongoDB, HBase, Oozie, Zookeeper, Tableau.

Confidential

Data Engineer

Responsibilities:

  • Developed a Python utility to validate HDFS tables against source tables.
  • Loaded data into S3 buckets using AWS Glue and PySpark (a minimal Glue sketch follows this list).
  • Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
  • Automated all the jobs for pulling data from an FTP server and loading it into Hive tables using Oozie workflows.
  • Implemented code in Python to retrieve and manipulate data.
  • Designed an ETL process using Informatica to load data from flat files and Excel files into the target Oracle data warehouse database.
  • Configured AWS Identity and Access Management (IAM) groups and users for improved login authentication.
  • Interacted with the business community and database administrators to identify the business requirements and data realities.
  • Responsible for developing Python wrapper scripts that extract a specific date range using Sqoop by passing the custom properties required for the workflow.
  • Involved in filtering data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables.
  • Designed and developed UDFs to extend functionality in both Pig and Hive.
  • Imported and exported data using Sqoop between MySQL and HDFS on a regular basis.
  • Developed a shell script to create staging and landing tables with the same schema as the source and to generate the properties used by Oozie jobs.
  • Conducted systems design, feasibility, and cost studies and recommended cost-effective cloud solutions such as Amazon Web Services (AWS).
  • Developed Oozie workflows for executing Sqoop and Hive actions.
  • Built various graphs for business decision making using the Python matplotlib library (a minimal plotting sketch follows this list).
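
A minimal AWS Glue PySpark job sketch for loading data into S3, as described above; the Glue Data Catalog database, table, and bucket names are hypothetical placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate: resolve the job name and build the contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical Glue Data Catalog source table.
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Light cleanup with the DataFrame API before landing the data on S3.
cleaned = source.toDF().dropDuplicates(["order_id"])

# Write to an S3 bucket as Parquet; bucket and prefix are placeholders.
(cleaned.write
        .mode("overwrite")
        .parquet("s3://example-bucket/curated/orders/"))

job.commit()
```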
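
A minimal matplotlib sketch of the kind of decision-support graph mentioned in the last bullet; the monthly figures are made-up placeholder values for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display, e.g. on a cluster edge node
import matplotlib.pyplot as plt

# Hypothetical monthly order totals pulled from the warehouse.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
orders = [1200, 1350, 1280, 1610, 1750, 1690]

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(months, orders, color="steelblue")
ax.set_title("Orders per month")
ax.set_ylabel("Order count")
fig.tight_layout()
fig.savefig("orders_per_month.png")
```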

Environment: Python, HDFS, Spark, Hive, Sqoop, AWS, Oozie, ETL, Pig, Oracle 10g, MySQL, NoSQL, HBase, Windows.

Confidential

Hadoop Developer

Responsibilities:

  • Installed and configured Pig and wrote Pig Latin scripts.
  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing (an illustrative streaming-mapper sketch follows this list).
  • Wrote MapReduce jobs using Pig Latin; involved in ETL, data integration, and migration.
  • Imported and exported data into HDFS and Hive using Sqoop.
  • Loaded data from various sources such as OLE DB and flat files into a SQL Server database using SSIS packages, and created data mappings to load data from source to destination.
  • Created Hive tables and worked on them using HiveQL; experienced in defining job flows.
  • Imported and exported data between HDFS and an Oracle database using Sqoop.
  • Involved in creating Hive tables, loading data, and writing Hive queries that run internally as MapReduce jobs; developed a custom file system plugin for Hadoop so it can access files on the Data Platform.
  • Created batch jobs and configuration files to create an automated process using SSIS.
  • Automated the process of extracting various files, such as flat and Excel files, from sources such as FTP and SFTP (Secure FTP).
  • Deployed and scheduled reports using SSRS to generate daily, weekly, monthly, and quarterly reports.
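
The cleaning jobs above were written in Java; purely as an illustration, here is a minimal Hadoop Streaming-style mapper in Python that performs the same kind of record cleaning. The tab-delimited field layout and file names are assumptions, not the original job's schema.

```python
#!/usr/bin/env python
"""Hadoop Streaming mapper: drop malformed rows and emit cleaned key/value pairs.

Assumes tab-delimited input lines of the form: user_id <TAB> timestamp <TAB> url.
Invoked schematically as: hadoop jar hadoop-streaming.jar -mapper clean_mapper.py ...
"""
import sys

EXPECTED_FIELDS = 3

for line in sys.stdin:
    parts = line.rstrip("\n").split("\t")
    # Skip rows that do not have the expected number of fields.
    if len(parts) != EXPECTED_FIELDS:
        continue
    user_id, timestamp, url = (p.strip() for p in parts)
    # Skip rows missing required values.
    if not user_id or not timestamp:
        continue
    # Emit user_id as the key so a reducer can aggregate per user.
    print("{}\t{}\t{}".format(user_id, timestamp, url.lower()))
```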

Environment: Hadoop, MapReduce, Pig, MS SQL Server, SQL Server Business Intelligence Development Studio, Hive, HBase, SSIS, SSRS, Report Builder, Office, Excel, flat files, .NET, T-SQL.
