Senior Big Data Engineer Resume


Greenwood, CO

PROFESSIONAL SUMMARY:

  • Data Engineer with 8+ years of IT industry experience, with hands-on experience installing, configuring, and using Hadoop ecosystem components such as HDFS, MapReduce, HBase, Zookeeper, Hive, Sqoop, Pig, Flume, Cassandra, Kafka, and Spark. Experienced in Agile methodologies including Extreme Programming, Scrum, and Test-Driven Development (TDD).
  • Excellent knowledge of Hadoop architecture, including HDFS, Job Tracker, Task Tracker, NameNode, DataNode, and the MapReduce programming paradigm.
  • Experience tuning Spark jobs for storage and processing efficiency.
  • Experienced in integrating Hadoop with Kafka and in loading clickstream data into HDFS.
  • Experienced in loading datasets into Hive for ETL (Extract, Transform, Load) operations.
  • Experience developing big data applications and services on the Amazon Web Services (AWS) platform using EMR, S3, EC2, Lambda, and CloudWatch, and cloud data warehousing with AWS Redshift.
  • Experience in analyzing data using HQL, Pig Latin and custom MapReduce programs in Python.
  • Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
  • Hands-on experience developing ETL jobs in the Hadoop ecosystem using Oozie and StreamSets.
  • Good understanding of Azure big data technologies such as Azure Data Lake Analytics, Azure Data Lake Store, Azure Databricks, and Azure Data Factory; created a POC for moving data from flat files and SQL Server using U-SQL jobs.
  • Proficient with tools such as Erwin (Data Modeler, Model Mart, Navigator), ER Studio, IBM Metadata Workbench, Oracle data profiling tools, Informatica, Oracle Forms, Oracle Reports, SQL*Plus, Toad, and Crystal Reports.
  • Extensive experience designing, developing, and deploying various kinds of reports in SSRS using relational and multidimensional data.
  • Developed Apache Spark jobs using Scala and Python for faster data processing, and used the Spark Core and Spark SQL libraries for querying.
  • Proficient in Hive optimization techniques such as bucketing and partitioning (see the illustrative sketch at the end of this summary).
  • Experience importing and exporting data with Sqoop between relational database systems and HDFS.
  • Extensive experience importing and exporting data using data ingestion tools such as Flume.
  • Experienced with Docker and Kubernetes on multiple cloud providers, from helping developers build and containerize applications (CI/CD) to deploying them on public or private clouds.
  • Experience developing MapReduce programs with Apache Hadoop to analyze big data as per requirements.
  • Played a key role in migrating Teradata objects into a Snowflake environment.
  • Excellent knowledge of cloud data warehouse systems: AWS Redshift, S3 buckets, and Snowflake.
  • Worked with HBase to perform quick lookups (updates, inserts, and deletes) in Hadoop.
  • Experience with the Oozie workflow scheduler to manage Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows.
  • Extensive experience using Maven as a build tool to produce deployable artifacts from source code.
  • Involved in writing data transformations and data cleansing using Pig operations, with good experience retrieving and processing data using Hive.
  • Experience creating Spark Streaming jobs to process huge data sets in real time.
  • Experience in text analytics, developing statistical machine learning and data mining solutions for various business problems, generating data visualizations using R, SAS, and Python, and creating dashboards with tools such as Tableau.
  • Created Power BI reports using visualizations such as bar charts, clustered column charts, waterfall charts, gauges, pie charts, and treemaps.
  • Expertise in relational database systems (RDBMS) such as MySQL, Oracle, and MS SQL, and NoSQL database systems such as HBase, MongoDB, and Cassandra.
  • Experience with Software development tools such as JIRA, GIT, SVN.
  • Experience developing storytelling dashboards for data analytics, designing reports and visualization solutions using Tableau Desktop, and publishing them to Tableau Server.
  • Tested, cleaned, and standardized data to meet business standards using Execute SQL Task, Conditional Split, Data Conversion, and Derived Column components in different environments.
  • Flexible working across operating systems such as Unix/Linux (CentOS, Red Hat, Ubuntu) and Windows environments.
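
A minimal, illustrative sketch of the Hive partitioning and bucketing approach referenced above, written with PySpark. The database, table, column names, and source path are hypothetical placeholders, not details taken from any specific project.

```python
# Illustrative sketch only: writing a partitioned, bucketed Hive table with PySpark.
# The database, table, column names, and source path are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partition-bucket-sketch")
         .enableHiveSupport()           # lets Spark create and manage Hive tables
         .getOrCreate())

orders = spark.read.parquet("hdfs:///data/raw/orders")   # hypothetical source

(orders.write
    .mode("overwrite")
    .partitionBy("order_date")          # partitions prune scans by date
    .bucketBy(32, "customer_id")        # buckets speed up joins/lookups on customer_id
    .sortBy("customer_id")
    .saveAsTable("sales_db.orders"))
```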

TECHNICAL SKILLS:

Languages: SQL, PL/SQL, Python, Java, Scala, C, HTML, Unix/Linux shell scripting

ETL Tools: Matillion (AWS Redshift), Alteryx, Informatica PowerCenter, Ab Initio

Big Data: HDFS, MapReduce, Spark, Airflow, YARN, NiFi, HBase, Hive, Pig, Flume, Sqoop, Kafka, Oozie, Hadoop, Zookeeper, Spark SQL

RDBMS: Oracle 9i/10g/11g/12c, Teradata, MySQL, MS SQL

NoSQL: MongoDB, HBase, Cassandra

Cloud Platform: Microsoft Azure, AWS (Amazon Web Services)

Concepts and Methods: Business Intelligence, Data Warehousing, Data Modeling, Requirement Analysis

Data Modeling Tools: ERwin, PowerDesigner, Embarcadero ER Studio, IBM Rational Software Architect, MS Visio; star schema modeling, snowflake schema modeling, fact and dimension tables

Application Servers: Apache Tomcat, WebSphere, WebLogic, JBoss

Other Tools: Azure Databricks, Azure Data Explorer, Azure HDInsight, Power BI

Operating Systems: UNIX, Windows, Linux

PROFESSIONAL EXPERIENCE:

Confidential, Greenwood, CO

Senior Big Data Engineer

Responsibilities:

  • Created data pipelines for ingestion and aggregation events, loading consumer response data from the AWS S3 bucket into Hive external tables in HDFS to serve as the feed for Tableau dashboards.
  • Developed Python scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, writing data back into the RDBMS through Sqoop.
  • Worked extensively on Spark and MLlib to develop a regression model for cancer data.
  • Hands on design and development of an application using Hive (UDF).
  • Involved in implementing and integrating various NoSQL databases such as HBase and Cassandra.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data using Pig; imported and exported data between MySQL, HDFS, and NoSQL databases with Sqoop on a regular basis; and designed and developed Pig scripts to process data in batch for trend analysis.
  • Synchronized both unstructured and structured data using Pig and Hive based on business requirements.
  • Used Pig Latin on the client-side cluster and HiveQL on the server-side cluster.
  • Imported complete datasets from the RDBMS to the HDFS cluster using Sqoop.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
  • Experienced with AWS services for managing applications in the cloud and creating or modifying instances.
  • Installed and configured OpenShift platform in managing Docker containers and Kubernetes Clusters.
  • Experience in designing and developing POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
  • Created Hive-based scripts to analyze requirements and process data, designing the cluster to handle huge amounts of data and cross-examining data loaded by Hive and MapReduce jobs.
  • Designed AWS CloudFormation templates to create VPCs, subnets, and NAT to ensure successful deployment of web applications and database templates.
  • Created S3 buckets and managed their policies, and utilized S3 and Glacier for storage and backup on AWS.
  • Wrote MapReduce code in Python to eliminate certain security issues in the data.
  • Created Hive tables as per requirements, as internal or external tables defined with appropriate static or dynamic partitions and bucketing for efficiency.
  • Loaded and transformed large sets of structured and semi-structured data using Hive.
  • Handled billions of log lines coming from several clients and analyzed them using big data technologies such as Hadoop (HDFS), Apache Kafka, and Apache Storm.
  • Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for processing and storing small data sets, and maintained the Hadoop cluster on AWS EMR.
  • Used Kafka to load data onto the Hadoop file system and move the same data into the Cassandra NoSQL database.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
  • Configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS (a minimal sketch follows this list).
  • Migrated existing MapReduce programs to Spark using Scala and Python.
  • Implemented Spark SQL to connect to Hive, read the data, and distribute processing to make it highly scalable.
  • Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.
  • Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team using Tableau.
  • Consumed XML messages using Kafka and processed the xml file using Spark Streaming to capture UI updates.
  • Experienced in writing live, real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
  • Developed simple to complex MapReduce streaming jobs in Python, implemented using Hive and Pig. Worked with architects, stakeholders, and the business to design the information architecture of the Smart Data Platform for multi-state deployment on a Kubernetes cluster.
  • Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
  • Worked in an AWS environment for the development and deployment of custom Hadoop applications.
  • Involved in designing and developing enhancements to product features.
  • Involved in designing and developing enhancements of CSG using AWS APIs.
  • Created and maintained various DevOps-related tools for the team, such as provisioning scripts, deployment tools, and development and staging environments on AWS, Rackspace, and other cloud platforms.
  • Implemented Composite Server for data virtualization needs and created multiple views for restricted data access using a REST API.
  • Integrated Cassandra as a distributed, persistent metadata store to provide metadata resolution for network entities on the network.
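
A minimal sketch of the Kafka-to-HDFS streaming flow described above, using the PySpark Structured Streaming API as one possible approach (it assumes the spark-sql-kafka package is available to Spark). The broker address, topic name, and paths are hypothetical placeholders.

```python
# Illustrative sketch: consuming a Kafka topic with Spark Structured Streaming and
# landing the stream in HDFS. Broker, topic, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "consumer-response")
          .option("startingOffsets", "latest")
          .load()
          .select(col("key").cast("string"), col("value").cast("string"), "timestamp"))

query = (events.writeStream
         .format("parquet")                                        # write stream as Parquet files
         .option("path", "hdfs:///data/streams/consumer-response")
         .option("checkpointLocation", "hdfs:///checkpoints/consumer-response")
         .trigger(processingTime="1 minute")                       # micro-batch every minute
         .start())

query.awaitTermination()
```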

Environment: HDFS, Hive, Scala, Sqoop, Spark, Tableau, YARN, Cloudera, SQL, Terraform, Splunk, RDBMS, Elasticsearch, Kerberos, Jira, Confluence, Shell/Perl scripting, Zookeeper, AWS (EC2, S3, EMR, Redshift, ECS, Glue, VPC, RDS, etc.), Ranger, Git, Kafka, OpenShift, CI/CD (Jenkins), Kubernetes

Confidential, Atlanta, GA

Big Data Engineer

Responsibilities:

  • Involved in analyzing business requirements and prepared detailed specifications that follow project guidelines required for project development.
  • Used PySpark for DataFrames, ETL, data mapping, transformation, and loading in a complex, high-volume environment.
  • Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team using Tableau.
  • Implemented Copy Activity and custom Azure Data Factory pipeline activities.
  • Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
  • Designed and implemented a configurable data delivery pipeline, built with Python, for scheduled updates to customer-facing data stores.
  • Proficient in machine learning techniques (decision trees, linear/logistic regression) and statistical modeling.
  • Implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
  • Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes.
  • Optimized the TensorFlow model for efficiency.
  • Built performant, scalable ETL processes to load, cleanse and validate data
  • Implemented business use case in Hadoop/Hive and visualized in Tableau
  • Created data pipelines for business reports and processed streaming data using an on-premises Kafka cluster.
  • Processed data from Kafka topics and displayed the real-time streams in dashboards.
  • Extracted, transformed, and loaded data from source systems into Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Extensively used Apache Kafka, Apache Spark, HDFS, and Apache Impala to build near real-time data pipelines that ingest, transform, store, and analyze clickstream data to provide a more personalized user experience.
  • Designed several DAGs (Directed Acyclic Graphs) for automating ETL pipelines.
  • Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF v1/v2).
  • Wrote UNIX shell scripts to automate jobs and scheduled cron jobs for job automation using crontab.
  • Built business applications and data marts for reporting; involved in different phases of the development lifecycle including analysis, design, coding, unit testing, integration testing, review, and release per business requirements.
  • Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines (a minimal sketch follows this list).
  • Created Spark code to process streaming data from the Kafka cluster and load it into a staging area for processing.
  • Experienced in ETL concepts, building ETL solutions and Data modeling
  • Worked on architecting the ETL transformation layers and writing Spark jobs to do the processing.
  • Aggregated daily sales team updates to send reports to executives and to organize jobs running on Spark clusters.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Worked on developing ETL workflows on the obtained data using Scala, processing it in HDFS and HBase with Oozie.
  • Compiled data from various sources to perform complex analysis for actionable results
  • Measured the efficiency of the Hadoop/Hive environment, ensuring SLAs were met.
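
A minimal sketch of the kind of Airflow pipeline definition referenced above, assuming Airflow 2.x. The DAG id, schedule, and commands are hypothetical placeholders used only to show the authoring pattern (dependencies, retries, and an SLA on each task).

```python
# Illustrative sketch: a minimal Airflow 2.x DAG that schedules an ETL pipeline.
# The dag_id, schedule, and commands are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "sla": timedelta(hours=1),            # SLA miss events can drive alerting
}

with DAG(
    dag_id="daily_sales_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 6 * * *",        # run every day at 06:00
    catchup=False,
    default_args=default_args,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract.py")
    transform = BashOperator(task_id="transform", bash_command="spark-submit transform.py")
    load = BashOperator(task_id="load", bash_command="python load.py")

    extract >> transform >> load          # upstream/downstream dependencies
```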

Environment: Kafka, Impala, PySpark, Azure, HDInsight, Data Factory, Databricks, Data Lake, Apache Beam, Cloud Shell, Tableau, Cloud SQL, MySQL, Postgres, SQL Server, Python, Scala, Spark, Hive, Spark SQL, NoSQL, MongoDB, TensorFlow, Jira.

Confidential, Wheeling, WV

Big Data Engineer

Responsibilities:

  • Experience loading data into Spark RDDs and performing advanced procedures such as text analytics and processing, using Spark's in-memory computation capabilities with Scala to generate the output response.
  • Extracted the data from Teradata into HDFS/Dashboards using Spark Streaming.
  • Used Spark-Streaming APIs to perform necessary transformations and actions on the fly for building the common learner data model which gets the data from Kafka in near real time and Persists into Cassandra.
  • Experience writing scripts using Python (and Go) and familiarity with the following tools: AWS Lambda, AWS S3, AWS EC2, AWS Redshift, and PostgreSQL on AWS.
  • Developed Autosys scripts to schedule the Kafka streaming and batch job.
  • Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Worked extensively with AWS and related components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
  • Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for processing and storing small data sets, and maintained the Hadoop cluster on AWS EMR.
  • Used the DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyze data from Cassandra tables for quick searching, sorting, and grouping (a minimal sketch follows this list).
  • Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and memory tuning.
  • Developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job, for workflow management and automation using Airflow.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Developed MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the EDW
  • Implemented schema extraction for Parquet and Avro file Formats in Hive.
  • Developed Hive scripts in Hive QL to de-normalize and aggregate the data.
  • Used Spark API over Cloudera Hadoop Yarn to perform analytics on data in Hive.
  • Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, with a view to adopting the former in the project.
  • Worked on migrating Map Reduce programs into Spark transformations using Spark and Scala.
  • Worked with BI team to create various kinds of reports using Tableau based on the client's needs.
  • Experience querying Parquet files by loading them into Spark DataFrames using a Zeppelin notebook.
  • Experience writing Sqoop scripts for importing and exporting data between RDBMS and HDFS.
  • Ingested data from the RDBMS, performed data transformations, and then exported the transformed data to Cassandra for data access and analysis.
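
A minimal sketch of loading a Spark DataFrame into Cassandra with the DataStax Spark-Cassandra connector mentioned above, assuming the connector JAR is on the Spark classpath. The host, keyspace, table, and source path are hypothetical placeholders.

```python
# Illustrative sketch: writing to and reading from Cassandra via the DataStax
# spark-cassandra-connector. Host, keyspace, table, and paths are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-cassandra-sketch")
         .config("spark.cassandra.connection.host", "cassandra-node1")
         .getOrCreate())

learner_df = spark.read.parquet("hdfs:///data/staging/learner_events")

(learner_df.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="analytics", table="learner_events")
    .mode("append")
    .save())

# Read back for quick searching and grouping; key-based lookups stay in CQL,
# heavier aggregations run in Spark.
cassandra_df = (spark.read
                .format("org.apache.spark.sql.cassandra")
                .options(keyspace="analytics", table="learner_events")
                .load())
cassandra_df.groupBy("event_type").count().show()
```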

Environment: Hadoop Yarn, Spark-Core, Spark-Streaming, Spark-SQL, AWS Cloud, Scala, Python, Kafka, Hive, Sqoop, Elastic Search, Impala, Cassandra, Tableau, Talend, Cloudera, MySQL, Linux

Confidential

Data Engineer

Responsibilities:

  • Used Python programs to automate the process of combining large SAS datasets and data files and converting them into Teradata tables for data analysis.
  • Ran log aggregation, website activity tracking, and commit logs for distributed systems using Apache Kafka.
  • Developed interfaces in SQL for data calculations and data manipulations.
  • Ingested data into the Indie-Data Lake using an open-source Hadoop distribution to process structured, semi-structured, and unstructured datasets, using open-source Apache tools such as Flume and Sqoop into the Hive environment.
  • Developed Python programs to manipulate data read from Teradata data sources and convert it into CSV files (a minimal sketch follows this list).
  • Worked on MicroStrategy report development and analysis, providing mentoring, guidance, and troubleshooting to analysis team members in solving complex reporting and analytical problems.
  • Extensively used filters, facts, Consolidations, Transformations and Custom Groups to generate reports for Business analysis.
  • Used MS Excel and Teradata for data pools and ad-hoc reports for business analysis.
  • Performed in-depth analysis of data and prepared weekly, biweekly, and monthly reports using SQL, MS Excel, and UNIX.
  • Experience in automation scripting using shell and Python.
  • Assisted with the design and development of MicroStrategy dashboards and interactive documents using MicroStrategy Web and Mobile.
  • Continuously monitored and managed the Hadoop cluster through Ganglia and Nagios.
  • Installed the Oozie workflow engine to run multiple Hive and Pig jobs, which run independently based on time and data availability.
  • Used Kafka for building real-time data pipelines between clusters.
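
A minimal sketch of the Teradata-to-CSV conversion described above, assuming the open-source teradatasql Python driver. The host, credentials, query, and output file name are hypothetical placeholders.

```python
# Illustrative sketch: extracting a Teradata result set to CSV in Python.
# Assumes the `teradatasql` driver; host, credentials, and the query are placeholders.
import csv

import teradatasql

QUERY = "SELECT customer_id, region, total_spend FROM sales_db.customer_summary"

with teradatasql.connect(host="td-prod", user="etl_user", password="***") as con:
    with con.cursor() as cur:
        cur.execute(QUERY)
        columns = [d[0] for d in cur.description]   # column names from cursor metadata
        rows = cur.fetchall()

with open("customer_summary.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(columns)                        # header row
    writer.writerows(rows)
```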

Environment: AWS, Kafka, Spark, Python, SQL, UNIX, MS Excel, Hive, Pig, Hadoop

Confidential

Data Engineer

Responsibilities:

  • Involved in review of functional and non-functional requirements.
  • Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs.
  • Created consumption views on top of metrics to reduce the running time for complex queries.
  • Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
  • Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
  • Installed and configured Pig and wrote Pig Latin scripts.
  • Designed and implemented a MapReduce-based large-scale parallel relation-learning system.
  • Setup and benchmarked Hadoop/HBase clusters for internal use
  • Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
  • Wrote MapReduce jobs using Pig Latin. Involved in ETL, data integration, and migration.
  • Imported and exported data between HDFS and Oracle Database using Sqoop.
  • Imported data using Sqoop to load data from Oracle to HDFS on a regular basis.
  • Wrote Hive queries for data analysis to meet the business requirements.
  • Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files (a minimal sketch follows this list).
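
A minimal sketch of querying a CSV file with Spark SQL, as described above. The original work used Scala; this PySpark equivalent is shown only for consistency with the other sketches, and the path and column names are hypothetical placeholders.

```python
# Illustrative sketch: loading a CSV file and querying it with Spark SQL.
# The path and columns are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-csv-sketch").getOrCreate()

orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///data/raw/orders.csv"))

orders.createOrReplaceTempView("orders")

# Hive-style analysis query over the temporary view
spark.sql("""
    SELECT region, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM orders
    GROUP BY region
    ORDER BY total_amount DESC
""").show()
```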

Environment: Hadoop, MapReduce, HDFS, Hive, Java, Hadoop distribution of Cloudera, Pig, HBase, Linux, XML, Tableau, Eclipse, Oracle 10g, Toad
