
Lead Data Engineer & Big Data Architect Resume


Wilmington, DE

SUMMARY

  • 10+ years of experience in Hadoop, Big Data frameworks, PySpark, Scala, and data pipeline design, development, and implementation in an end-to-end IT environment.
  • Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
  • Solid experience in Linux and the Hadoop ecosystem, including HDFS, MapReduce, Sqoop, Flume, Kafka, Pig, Hive, Spark, Storm, HBase, Oozie, and ZooKeeper.
  • Expertise with the Spark engine, creating batch jobs with incremental loads from HDFS, S3, Kinesis, sockets, and other AWS sources.
  • In-depth knowledge of distributed computing systems and parallel processing techniques for efficiently handling Big Data.
  • Extensive working experience using Sqoop to import data from RDBMSs into HDFS and vice versa.
  • Ingested data using Sqoop from various RDBMSs such as Oracle, MySQL, and Microsoft SQL Server into HDFS.
  • Firm understanding of Hadoop architecture and its components, including HDFS, YARN, MapReduce, Hive, Pig, HBase, Kafka, and Oozie.
  • Strong experience building Spark applications using Scala and Python as programming languages.
  • Good experience troubleshooting and fine-tuning long-running Spark applications.
  • Strong experience using the Spark RDD API, DataFrame/Dataset API, Spark SQL, and Spark ML for building end-to-end data pipelines (see the batch-pipeline sketch after this list).
  • Good experience working with real-time streaming pipelines using Kafka and Spark Streaming.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Detailed exposure to Hive concepts such as partitioning, bucketing, join optimizations, SerDes, built-in UDFs, and custom UDFs.
  • Imported data using Sqoop from MySQL into S3 buckets on a regular basis.
  • Hands-on experience using BI tools like Splunk/Hunk.
  • Strong experience with AWS services such as EC2, VPC, CloudFront, Elastic Beanstalk, Route 53, RDS, and S3.
  • Hands-on experience with data extraction, transformation, and loading in Hive, Pig, and HBase.
  • Experience in the successful implementation of ETL solutions between OLTP and OLAP databases in support of decision support systems, with expertise in all phases of the SDLC.
  • Experience creating DStreams from sources like Flume and Kafka and performing Spark transformations and actions on them.
  • Worked on Confidential AWS services such as EMR and EC2 for fast and efficient processing of Big Data.
  • Experience integrating Apache Kafka with Apache Storm and creating Storm data pipelines for real-time processing.
  • Worked on improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, RDDs, and Spark on YARN.
  • Procedural knowledge in cleansing and analyzing data using HiveQL, Pig Latin and custom MapReduce programs in Java.
  • Hands-on experience with Solr, indexing files directly from HDFS for both structured and semi-structured data.
  • Strong experience in RDBMS technologies like MySQL, Oracle, Postgres, and DB2.
  • Experienced in developing Spark applications using the Spark Core, Spark SQL, and Spark Streaming APIs.
  • Expert-level experience in designing, building, and managing applications to process large amounts of data in a Hadoop/DevOps (GCP) ecosystem.
  • Extensive experience performance-tuning applications on Hadoop/GCP and configuring Hadoop/GCP systems to maximize performance.
  • Involved in configuring and working with Flume to load data from multiple sources directly into HDFS.
  • Experience in NoSQL Databases like HBase, Cassandra, Redis and MongoDB.
  • Hands-on experience with the Hortonworks and Cloudera Distributed Hadoop (CDH) distributions.
  • Worked with Apache NiFi flows to convert raw XML data into JSON and Avro.
  • Experience understanding security requirements for Hadoop and integrating with Kerberos authentication and authorization infrastructure.
  • Experience with predictive intelligence and smooth maintenance in Spark Streaming using Conviva and Spark MLlib.
  • Experience with MPP databases such as HP Vertica and Impala.
  • Involved in installing the Cloudera distribution of Hadoop on Amazon EC2 instances.
  • Experience in the deployment of Big Data solutions and the underlying infrastructure of Hadoop clusters using the Cloudera, MapR, and Hortonworks distributions.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
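
A minimal PySpark sketch of the kind of batch pipeline referenced above: an incremental DataFrame load from an HDFS/S3 landing zone, a Spark SQL-style aggregation, and a write to a partitioned Hive/Parquet table. The paths, column names, and target table are hypothetical placeholders, not details from any specific engagement.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Spark session with Hive support so the result can be saved as a Hive table
    spark = (SparkSession.builder
             .appName("daily-usage-aggregation")          # hypothetical job name
             .enableHiveSupport()
             .getOrCreate())

    # Incremental load from an S3/HDFS landing zone (placeholder path)
    raw = spark.read.json("s3a://landing-bucket/events/dt=2021-01-01/")

    # DataFrame API transformations: filter, derive a date, aggregate per customer
    daily = (raw
             .filter(F.col("event_type") == "click")      # hypothetical column
             .withColumn("event_date", F.to_date("event_ts"))
             .groupBy("event_date", "customer_id")
             .agg(F.count("*").alias("events")))

    # Publish as a partitioned Parquet-backed Hive table for downstream consumers
    (daily.write
          .mode("overwrite")
          .partitionBy("event_date")
          .format("parquet")
          .saveAsTable("analytics.daily_customer_events"))  # hypothetical table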

TECHNICAL SKILLS

Big Data Technologies: HDFS, MapReduce, Hive, Sqoop, Oozie, Scala, Spark

Programming Languages: Java, C#, Python, T-SQL

Web Technologies: HTML5, CSS3, Node.js, AngularJS, Express.js, jQuery

Cloud Platform: AWS

Web/Application Servers: Tomcat, JBoss, WebSphere, WebLogic

Databases: Oracle, MySQL, MS SQL Server 2012, Snowflake, HBase

IDE and development tools: Eclipse, NetBeans, IntelliJ

Build tools: ANT, MAVEN

Repositories: CVS, GitHub, SVN

PROFESSIONAL EXPERIENCE

Confidential, Wilmington, DE

Lead Data Engineer & Big Data Architect

Responsibilities:

  • Involved in working with Confidential Elastic MapReduce (EMR) and setting up environments on Confidential AWS EC2 Linux/Windows instances. Populated HDFS and Cassandra with large volumes of data using Apache Kafka.
  • Worked on importing and exporting data into HDFS and Hive using Sqoop.
  • Involved in moving all log files generated from various sources to HDFS for further processing through Pig.
  • Optimized EMR workloads for different types of data loads by choosing the right compression, cluster type, instance type, storage type, and EMRFS in order to analyze data with low cost and high scalability.
  • Created jobs for scheduling batch processing and used AWS Lambda as a scheduler.
  • Migrated existing on-premises Teradata and MS SQL Server tables into the Snowflake data warehouse.
  • Extensively used the Cloudera distribution of the Hadoop ecosystem throughout the project.
  • Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Collected and aggregated large amounts of web log data from different sources such as web servers, mobile and network devices using Apache Flume, and stored the data in HDFS for analysis.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Developed generic data frameworks and data products using Apache Spark and Scala to maintain the highest availability and performance while striving for simplicity.
  • Created pipelines to move data from on-premises servers to Azure Data Lake.
  • Strong experience working with Elastic MapReduce (EMR) and setting up environments on Confidential AWS EC2 instances.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Created partitions and buckets based on State for further processing using bucket-based Hive joins (see the bucketing sketch after this list).
  • Wrote Hive UDFs to sort structure fields and return complex data types.
  • Used different data formats (text format and ORC format) while loading the data into HDFS.
  • Worked with Kerberos and integrated it with the Hadoop cluster to make it stronger and more secure against unauthorized access.
  • Ingested data into HBase using the HBase shell as well as the HBase client API.
  • Imported data from different sources such as AWS S3 and the local file system into Spark RDDs.
  • Worked as onshore lead to gather business requirements and guided the offshore team in a timely fashion.
  • Migrated on-premises data (SQL Server) to Azure Data Lake Store (ADLS) using Azure Data Factory.
  • Used Kibana and Elasticsearch to identify Kafka message failure scenarios.
  • Implemented reprocessing of failed Kafka messages using the offset id.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
  • Used the Spark API to generate pair RDDs using Java.
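
A minimal sketch of the partition/bucket layout referenced above, using Spark's native bucketing (bucketBy) as an approximation of the Hive bucketed-join setup; the database, table, path, and column names are illustrative placeholders, not the project's actual schema.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("bucketed-orders")                  # hypothetical app name
             .enableHiveSupport()
             .getOrCreate())

    # Staged orders written earlier in the pipeline (placeholder path)
    orders = spark.read.parquet("hdfs:///staging/orders/")

    # Partition by load date and bucket by State so that joins keyed on State
    # can use the bucket metadata instead of a full shuffle.
    (orders.write
           .partitionBy("load_date")
           .bucketBy(32, "state")
           .sortBy("state")
           .format("parquet")
           .mode("overwrite")
           .saveAsTable("sales.orders_bucketed"))         # hypothetical table

Joining two tables bucketed the same way on the State column lets the engine plan a bucket-wise join rather than shuffling both sides.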

Environment: AWS, Redshift, MapReduce, Cloudera, Snowflake, Kafka, Spark, Lambda, Hadoop, HBase, Scala, Sqoop, Tableau, Informatica, Python, Hive, PL/SQL, Oracle, T-SQL, SQL Server, OLTP, OLAP, Oozie, Unix, Shell Scripting, HDFS, YARN, Pig, Kerberos, Azure, ADF, Azure Databricks, Teradata, Java, Talend, HUE, HCatalog, Flume, Solr, Git, Maven.

Confidential, Scottsdale, AZ

Sr Data Engineer / Big Data Developer

Responsibilities:

  • Data ingestion to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processing of the data in Azure Databricks. Imported data from different sources such as AWS S3 and the local file system into Spark RDDs.
  • Built pipelines to move hashed and un-hashed data from Azure Blob to Data Lake.
  • Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and memory tuning.
  • Created custom Solr query segments to optimize search matching.
  • Implemented PySpark data pipelines utilizing DataFrames and Spark SQL for faster processing of data.
  • Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files, with transformations in GCP.
  • Installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
  • Collected and aggregated large amounts of web log data from different sources such as webservers, mobile and network devices using Apache Flume and stored the data into HDFS for analysis.
  • Worked with the NLTK library for NLP data processing, calculating polarity and finding patterns (see the polarity sketch after this list).
  • Set up a connection between S3 and AWS SageMaker (machine learning platform) for predictive analytics and uploaded inference data to Redshift.
  • Built an artificial neural network using TensorFlow in Python to predict a customer's probability of canceling their connection (churn-rate prediction).
  • Loaded Salesforce data every 15 minutes on an incremental basis to the BigQuery raw and UDM layers using SOQL, Google Dataproc, GCS buckets, Hive, Spark, Scala, Python, gsutil, and shell scripts.
  • Experience working with Apache Solr for indexing and querying.
  • Expertise in implementing Spark using Scala and Spark SQL for faster testing and processing of data; responsible for managing data from different sources.
  • Worked with Kerberos and integrated it with the Hadoop cluster to make it stronger and more secure against unauthorized access.
  • Ingested data into HBase using the HBase shell as well as the HBase client API.
  • Wrote UDFs in Scala and PySpark to meet specific business requirements.
  • Experience managing Azure Data Lake (ADLS) and Data Lake Analytics, with an understanding of how to integrate with other Azure services. Created and maintained SQL Server scheduled jobs executing stored procedures to extract data from Oracle into SQL Server.
  • Designed and implemented an incremental job to read data from DB2 and load it into Hive tables, and connected Tableau via HiveServer2 to generate interactive reports.
  • Used Sqoop to move data between HDFS and various RDBMS sources.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
  • Created pipelines to move data from on-premises servers to Azure Data Lake.
  • Strong experience working with Elastic MapReduce (EMR) and setting up environments on Confidential AWS EC2 instances.
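
A hedged sketch of the NLTK polarity step referenced above, using the VADER sentiment analyzer; the sample text and cutoff thresholds are illustrative, not values from the project.

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")                 # one-time lexicon download

    analyzer = SentimentIntensityAnalyzer()

    def polarity(text: str) -> float:
        """Return VADER's compound polarity score in [-1, 1]."""
        return analyzer.polarity_scores(text)["compound"]

    sample = "The support team resolved my issue quickly."    # placeholder text
    score = polarity(sample)
    label = "positive" if score >= 0.05 else "negative" if score <= -0.05 else "neutral"
    print(score, label)

In a pipeline, the same function could be wrapped in a PySpark UDF and applied per record.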

Environment: Hadoop, Cloudera, Scala, Hive, Splunk, Kafka, MapReduce, Sqoop, Spark, Apache Drill, Apache Arrow, AutoSys, Tableau, UNIX, Linux, Python, GitHub, Jenkins, Azure Cloud, Azure Data Factory, Azure HDInsight, Azure Blob Storage, Azure Data Explorer, Azure Event Hubs, Databricks

Confidential, San Francisco, CA

Sr Big Data Developer / Scala & PySpark Developer

Responsibilities:

  • Populated HDFS and Cassandra with huge amounts of data using Apache Kafka.
  • Worked on importing and exporting data into HDFS and Hive using Sqoop.
  • Involved in moving all log files generated from various sources to HDFS for further processing through Pig.
  • Involved in creating Hive tables, loading data, writing Hive queries, and creating partitions and buckets for optimization.
  • Analyzed large data sets by running Hive queries and Pig scripts.
  • Experience creating scripts for data modeling and for data import and export. Extensive experience deploying, managing, and developing MongoDB clusters.
  • Created partitions and buckets based on State for further processing using bucket-based Hive joins.
  • Wrote Hive UDFs to sort structure fields and return complex data types.
  • Used different data formats (text format and ORC format) while loading the data into HDFS.
  • Involved in creating Hive internal and external tables, loading them with data, and writing Hive scripts that run internally as MapReduce jobs.
  • Experience developing Pig Latin and HiveQL scripts for data analysis and ETL purposes.
  • Created Hive internal and external tables, partitions, and buckets for further analysis using Hive joins.
  • Used Spark stream processing to bring data in-memory and implemented RDD transformations and actions to process it in units.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Managing and scheduling Jobs on a Hadoop cluster using Oozie workflows and Oozie Coordinator engine.
  • Used Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Python and in NoSQL databases such as HBase and Cassandra.
  • Expertise in implementing Spark using Scala and Spark SQL for faster testing and processing of data; responsible for managing data from different sources.
  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and processing.
  • Also used Spark SQL to handle structured data in Hive.
  • Closely worked with Kafka Admin team to set up Kafka cluster setup on the QA and Production environments.
  • Used Kibana and Elasticsearch to identify Kafka message failure scenarios.
  • Implemented reprocessing of failed Kafka messages using the offset id (see the offset-reprocessing sketch after this list).
  • Implemented Kafka producer and consumer applications on a Kafka cluster set up with the help of ZooKeeper.
  • Used the Spark API to generate pair RDDs using Java.
  • Involved in the end-to-end process of Hadoop jobs using technologies such as Sqoop, Pig, Hive, Spark, and shell scripts (for scheduling of jobs); extracted and loaded data into the data lake environment.
  • Developed Spark jobs for data cleansing and processing of flat files.
  • Worked on job management using the Fair Scheduler and developed job-processing scripts using Oozie workflows.
  • Migrated the Big Data platform from on-premises Hadoop to Google Cloud Platform (GCP) for one of the projects we were working on.
  • Used Google Cloud Functions with Python to load data into BigQuery on arrival of CSV files in GCS buckets.
  • Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Installation and configuration of Apache Hadoop on Confidential AWS (EC2) systems.
  • Used Reporting tools like Tableau to connect with Hive for generating daily reports of data.
  • Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
  • Created firewall rules to access Google Dataproc from other machines.
  • Opened SSH tunnels to Google Dataproc on GCP to access the YARN resource manager and monitor Spark jobs.
  • Submitted Spark jobs using gsutil and spark-submit for execution on the Dataproc cluster.
  • Wrote a Python program to maintain raw file archival in GCS buckets.
  • Extended Hive and Pig core functionality with custom User-Defined Functions (UDFs), User-Defined Table-Generating Functions (UDTFs), and User-Defined Aggregating Functions (UDAFs) written in Python.
  • Used Spring Kafka API calls to process messages smoothly on the Kafka cluster setup.
  • Developed Spark applications in Scala and built them using SBT.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Experienced in handling terabyte- to petabyte-scale datasets using partitions, Spark in-memory capabilities, broadcasts, effective and efficient joins, and transformations during the ingestion process itself.
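
A hedged sketch of the offset-based reprocessing referenced above, using the kafka-python client (the project itself may have used a different client or Spring Kafka); the broker, topic, group id, failed offset, and handle() function are hypothetical placeholders.

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(
        bootstrap_servers="broker1:9092",          # placeholder broker
        group_id="reprocess-failed-events",        # placeholder consumer group
        enable_auto_commit=False,                  # commit manually, only on success
    )

    # Rewind one partition to the offset id captured when the message failed
    tp = TopicPartition("events", 0)               # placeholder topic / partition
    consumer.assign([tp])
    consumer.seek(tp, 154320)                      # placeholder failed offset id

    for record in consumer:
        handle(record.value)                       # hypothetical reprocessing handler
        consumer.commit()                          # advance committed offset only after success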

Environment: Hadoop YARN, AWS, Impala, Cassandra, Spark Core, GCP, Spark SQL, Scala, Kafka, Hive, HBase, Pig, Sqoop, MapR, Confidential Tableau, Oozie, Jenkins, Oracle 11g, Core Java, HTML, Cloudera, Oracle 12c, RedHat Linux, Python

Confidential, Seattle, WA

Sr Scala, PySpark Developer / Machine Learning Engineer

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop.
  • This project downloads data generated by sensors from car activity; the data is collected into HDFS from online aggregators via Kafka.
  • Experience creating Kafka producers and consumers for Spark Streaming, which receives data from the patients' different learning systems.
  • Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing (see the streaming sketch after this list).
  • Built pipelines to move hashed and un-hashed data from Azure Blob to Data Lake.
  • Created pipelines to move data from on-premises servers to Azure Data Lake.
  • Built a Hortonworks cluster on Confidential Azure to extract actionable insights from data collected from IoT sensors installed in excavators.
  • Installed a Hortonworks Hadoop cluster on the Confidential Azure cloud in the UK region to satisfy the customer's data locality needs.
  • Implemented OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse.
  • Experience in AWS spinning up EMR clusters to process large volumes of data stored in S3 and push them to HDFS.
  • Implemented Spark SQL to access Hive tables in Spark for faster processing of data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Interacted with Cloudera support, logged issues in the Cloudera portal, and fixed them per the recommendations.
  • Upgraded the Cloudera Hadoop ecosystems in the cluster using Cloudera distribution packages.
  • Used Apache Oozie for scheduling and managing the Hadoop Jobs. Extensive experience with Confidential Web Services (AWS).
  • Developed Python/Django application for Google Analytics aggregation and reporting.
  • Good understanding of NoSQL databases such as HBase, Cassandra and MongoDB.
  • Supported MapReduce programs running on the cluster and wrote custom MapReduce scripts for data processing in Java.
  • Worked with Apache NiFi for data ingestion; triggered shell scripts and scheduled them using NiFi.
  • Monitored all NiFi flows to get notifications if no data flows through a flow for more than a specified time.
  • Monitored workload, job performance and capacity planning using Cloudera Manager.
  • Worked on migrating Pig scripts and MapReduce programs to the Spark DataFrames API and Spark SQL to improve performance. Involved in moving all log files generated from various sources to HDFS for further processing through Flume, and processed the files using Piggybank.
  • Used Flume to collect, aggregate, and store web log data from different sources such as web servers, mobile and network devices, and pushed it into HDFS.
  • Used Flume to stream through the log data from various sources.
  • Used the Avro file format compressed with Snappy for intermediate tables for faster processing, and the Parquet file format for published tables; created views on the tables.
  • Created Sentry policy files to give business users access to the required databases and tables, viewable from Impala in the dev, test, and prod environments.
  • Implemented test scripts to support test driven development and continuous integration.
  • Good understanding of ETL tools and how they can be applied in a Big Data environment.
  • Involved in Agile methodologies, daily Scrum meetings, Sprint planning.
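
A minimal Structured Streaming sketch of the Kafka-to-HDFS micro-batch flow referenced above (the original work used DStream-style Spark Streaming, so this is an approximation); it assumes the spark-sql-kafka connector is on the classpath, and the brokers, topic, paths, and 30-second trigger are illustrative placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sensor-stream")                               # hypothetical app name
             .getOrCreate())

    # Subscribe to the sensor topic (broker and topic are placeholders)
    raw = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker1:9092")
                .option("subscribe", "car-sensor-events")
                .option("startingOffsets", "latest")
                .load())

    events = raw.selectExpr("CAST(key AS STRING) AS sensor_id",
                            "CAST(value AS STRING) AS payload",
                            "timestamp")

    # Each 30-second trigger becomes one micro-batch written to HDFS as Parquet
    query = (events.writeStream
                   .format("parquet")
                   .option("path", "hdfs:///data/sensors/events/")             # placeholder
                   .option("checkpointLocation", "hdfs:///checkpoints/sensors/")
                   .trigger(processingTime="30 seconds")
                   .start())

    query.awaitTermination()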

Environment: Hadoop, Cloudera, Jupyter, PySpark, Airflow, SQL Server, MySQL, AWS, Python 3.x (Scikit-Learn/SciPy/NumPy/Pandas), GitHub, Docker, Jenkins, Agile, ETL, Machine Learning (Naïve Bayes, KNN, Regressions, Random Forest, SVM, XGBoost, Ensemble), AWS Redshift, Spark (PySpark, MLlib, Spark SQL), Tableau
