
Senior Big Data Engineer Resume


AZ

SUMMARY

  • 8+ years of diversified experience in software design and development, including experience as a Big Data Engineer solving business use cases for several clients, with expertise in backend applications.
  • Excellent experience in designing, developing, documenting, and testing ETL jobs and mappings in server and parallel jobs using DataStage to populate tables in data warehouses and data marts.
  • Experience with Hadoop distributions such as Cloudera and Hortonworks.
  • Deep understanding of MapReduce with Hadoop and Spark. Good working knowledge of the Big Data ecosystem, including Hadoop (HDFS, Hive, Pig, Impala) and Spark (Spark SQL, Spark MLlib, Spark Streaming).
  • Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
  • Establishes and executes the Data Quality Governance Framework, which includes an end-to-end process and data quality framework for assessing decisions that ensure the suitability of data for its intended purpose.
  • Integrated Kafka with Spark Streaming for real-time data processing (a minimal sketch follows this list).
  • Skilled in data parsing, data manipulation, and data preparation, including profiling and describing data contents.
  • Extensive experience in Text Analytics, generating data visualizations using R, Python and creating dashboards using tools like Tableau.
  • Experienced with the Hadoop ecosystem and Big Data components including Apache Spark, Scala, Python, HDFS, MapReduce, and Kafka.
  • Good exposure to Apache Hadoop MapReduce programming, Pig scripting, distributed applications, and HDFS. Good knowledge of Hadoop cluster architecture and cluster monitoring.
  • Good knowledge of job orchestration tools like Oozie, Zookeeper, and Airflow.
  • Utilized analytical applications like SPSS, Rattle and Python to identify trends and relationships between different pieces of data, draw appropriate conclusions and translate analytical findings into risk management and marketing strategies that drive value.
  • Experience with Apache Spark, Spark Streaming, Spark SQL, and NoSQL databases like HBase, Cassandra, and MongoDB.
  • Worked on Snowflake schemas and data warehousing.
  • Good knowledge in Database Creation and maintenance of physical data models with Oracle, Teradata, Netezza, DB2, MongoDB, HBase and SQL Server databases.
  • Experienced in writing complex SQL queries, including stored procedures, triggers, joins, and subqueries.
  • Interprets problems and provides solutions to business problems using data analysis, data mining, optimization tools, machine learning techniques, and statistics.
  • Experienced in dimensional modeling (star schema, snowflake schema), transactional modeling, and SCD (slowly changing dimensions).
  • Expertise in Creating, Debugging, Scheduling and Monitoring jobs using Airflow.
  • Worked on Microsoft Azure services like HDInsight clusters, Blob storage, ADLS, Data Factory, and Logic Apps, and completed a POC on Azure Databricks.
  • Experienced with JSON-based RESTful web services and XML-based SOAP web services; worked on various applications using Python IDEs like Sublime Text and PyCharm.
  • Excellent at building and publishing customized interactive reports and dashboards with custom parameters and user filters, producing tables, graphs, and listings using tools such as Tableau.
  • Experience with Unix/Linux systems with scripting experience and building data pipelines.
  • Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLAP reporting.
  • Ability to independently multi-task, be a self-starter in a fast-paced environment, communicate fluidly and dynamically with the team, and perform continuous process improvements with out-of-the-box thinking.
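As a minimal illustration of the Kafka and Spark Streaming integration mentioned above, the sketch below uses PySpark Structured Streaming to read a Kafka topic and land the stream as Parquet; the broker address, topic name, and output paths are assumed placeholders, not actual project configuration.

```python
# Minimal sketch: consume a Kafka topic with PySpark Structured Streaming
# and persist the stream as Parquet. Broker, topic, and paths are placeholders.
# Requires the spark-sql-kafka connector on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-stream-demo")
         .getOrCreate())

# Read the raw Kafka stream; key/value arrive as bytes and are cast to strings.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
          .option("subscribe", "events-topic")                # placeholder topic
          .option("startingOffsets", "latest")
          .load()
          .select(col("key").cast("string"), col("value").cast("string")))

# Write each micro-batch to HDFS as Parquet, with checkpointing for recovery.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")               # placeholder output path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .start())

query.awaitTermination()
```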

TECHNICAL SKILLS

Big Data: Cloudera Distribution, HDFS, YARN, DataNode, NameNode, ResourceManager, NodeManager, MapReduce, Pig, Sqoop, Kafka, HBase, Hive, Flume, Cassandra, Spark, Storm, Scala, Impala

Programming: Python, PySpark, Scala, Java, C, C++, Shell script, Perl script, SQL, PL/SQL

Databases: Snowflake (cloud), Teradata, IBM DB2, Oracle, SQL Server, MySQL, NoSQL

Cloud Technologies: AWS, Microsoft Azure

Frameworks: Django REST framework, MVC, Hortonworks

ETL/Reporting: Ab Initio, Informatica, Tableau

Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistant, Postman

Machine Learning Techniques: Linear & Logistic Regression, Classification and Regression Trees, Random Forest, Association Rules, NLP and Clustering

Database Modeling: Dimension Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling

Visualization/Reporting: Tableau, ggplot2, matplotlib, SSRS and Power BI

Web/App Server: UNIX server, Apache Tomcat

Operating System: UNIX, Windows, Linux, Sun Solaris

PROFESSIONAL EXPERIENCE

Confidential, AZ

Senior Big Data Engineer

Responsibilities:

  • Developed solutions to leverage ETL tools and identified opportunities for process improvements using Informatica and Python.
  • Coordinated with the team to develop and automate a framework that generates daily ad-hoc reports and extracts from enterprise data.
  • Designed and implemented ETL pipelines from various relational databases to the data warehouse using Apache Airflow (see the Airflow DAG sketch after this list).
  • Worked on analyzing Hadoop cluster using different big data analytic tools including Flume, Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Spark and Kafka.
  • Migrated on-premise data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Storage (ADLS) using Azure Data Factory (ADF V1/V2).
  • Played a key role in migrating Cassandra and the Hadoop cluster to AWS and defined different read/write strategies.
  • Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats including XML, JSON, CSV, and other compressed file formats.
  • Extracted files from Hadoop and dropped them into S3 on a daily/hourly basis.
  • Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, the bq command-line utility, Dataproc, and Stackdriver; installation, configuration, and deployment of product software on new edge nodes that connect to the Kafka cluster for data acquisition.
  • Worked on creating a framework using Apache Beam on Dataflow for processing streaming data from Google Cloud Pub/Sub in both batch and parallel modes (see the Beam sketch after this list).
  • Created scripts to read CSV, JSON, and Parquet files from S3 buckets in Python and load them into AWS S3, DynamoDB, and Snowflake.
  • Used Apache NiFi to copy data from the local file system to HDP. Thorough understanding of various modules of AML including Watch List Filtering, Suspicious Activity Monitoring, CTR, CDD, and EDD.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
  • Used Spark Streaming to receive real-time data from Kafka and stored the stream data to HDFS using Python and NoSQL databases such as HBase and Cassandra.
  • Experience building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, coordinating tasks among the team.
  • Worked on Docker containerized services to leverage the infrastructure.
  • Worked on Big Data with AWS cloud services, i.e., EC2, S3, EMR, and DynamoDB.
  • Validated the test data in DB2 tables on Mainframes and on Teradata using SQL queries.
  • Installed, configured, and maintained data pipelines.
  • Designed and developed Informatica Mappings and Sessions, Workflows based on business rules to load data from source Oracle tables to Teradata target tables
  • Developed NiFi workflows to pick up data from the REST API server, the data lake, and the SFTP server and send it to the Kafka broker.
  • Developed a framework in Apache Beam on Dataflow so that other team members can easily add classes related to business logic.
  • Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
  • Automated data processing with Oozie to load data into the Hadoop Distributed File System.
  • Worked on deploying the project on the servers using Jenkins
  • Designed and implemented multiple ETL solutions with various data sources via extensive SQL scripting, ETL tools, Python, shell scripting, and scheduling tools. Performed data profiling and data wrangling of XML, web feeds, and files using Python, Unix, and SQL.
  • Migrated the data from Redshift data warehouse to Snowflake.
  • Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
  • Collected data using Spark Streaming from an AWS S3 bucket in near-real-time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
  • Worked on analysis tools like Tableau for regression analysis, pie charts, and bar graphs.
  • Developed automated regression scripts for validation of the ETL process between multiple databases, such as AWS Redshift, Oracle, MongoDB, T-SQL, and SQL Server, using Python.
  • Designed and implemented Sqoop for the incremental job to read data from DB2 and load it into Hive tables, and connected to Tableau to generate interactive reports using HiveServer2.
  • Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
  • Defined and deployed monitoring, metrics, and logging systems on AWS.
  • Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks (a UDF sketch follows this list).
  • Used Flume, Kafka, and Spark Streaming to ingest real-time or near-real-time data into HDFS.
  • Worked on Big Data integration and analytics based on Hadoop, Solr, Spark, Kafka, Storm, and webMethods.
  • Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
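A minimal sketch of the kind of Airflow DAG used for the relational-database-to-warehouse ETL pipelines described above; the DAG name, schedule, and the body of the task are illustrative assumptions rather than the actual production pipeline.

```python
# Minimal Airflow DAG sketch: a daily extract-and-load task from a relational
# source to the warehouse. DAG id, schedule, and task body are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_load(**context):
    # Placeholder ETL step: the real pipeline would pull from the source
    # database (e.g. via a hook) and write the result to the warehouse.
    print("Extracting and loading for run date", context["ds"])


default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="relational_to_warehouse_etl",   # placeholder DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    etl_task = PythonOperator(
        task_id="extract_and_load",
        python_callable=extract_and_load,
    )
```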
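Below is a minimal Apache Beam sketch, under assumed project, topic, and sink names, of the kind of streaming pipeline referenced above that reads from Google Cloud Pub/Sub and can be launched on Dataflow.

```python
# Minimal Apache Beam sketch: read messages from Pub/Sub, parse JSON, window,
# and count per type. Project, topic, and sink are placeholders; pass
# --runner=DataflowRunner and related flags at launch to run on Dataflow.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.window import FixedWindows


def run():
    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/events")        # placeholder topic
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "Window" >> beam.WindowInto(FixedWindows(60))        # 60-second windows
            | "KeyByType" >> beam.Map(lambda rec: (rec.get("type", "unknown"), 1))
            | "CountPerType" >> beam.CombinePerKey(sum)
            | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
            | "Log" >> beam.Map(print)                             # placeholder sink
        )


if __name__ == "__main__":
    run()
```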
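The sketch below shows, in rough form, the kind of PySpark UDF-based cleaning and conforming described above; the column names, sample rows, and rules are illustrative placeholders only.

```python
# Minimal PySpark sketch: a custom UDF for cleaning/conforming a column,
# followed by a simple aggregation. Columns and rules are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, count
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-cleaning-demo").getOrCreate()

df = spark.createDataFrame(
    [(" AZ ", 100), ("az", 250), (None, 75)],
    ["state", "amount"],
)

@udf(returnType=StringType())
def conform_state(value):
    # Trim whitespace, upper-case, and label missing values.
    if value is None:
        return "UNKNOWN"
    return value.strip().upper()

cleaned = df.withColumn("state", conform_state(col("state")))
cleaned.groupBy("state").agg(count("*").alias("rows")).show()
```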

Environment: Cloudera Manager (CDH5), Hadoop, Hive, S3, Kafka, PySpark, HDFS, NiFi, Pig, Scrum, Git, Sqoop, Oozie, Informatica, Tableau, Snowflake, OLTP, OLAP, SQL Server, Python, Shell Scripting, HBase, Cassandra, XML, Unix.

Confidential, Rensselaer, NY

Big Data Engineer

Responsibilities:

  • Analyzed large data sets to determine the optimal way to aggregate and report on them.
  • Wrote production-level machine learning classification models and ensemble classification models from scratch using Python and PySpark to predict binary values for certain attributes within a given time frame.
  • Built a real-time pipeline for streaming data using Kafka and Spark Streaming.
  • Involved in unit testing the code and provided feedback to the developers. Performed unit testing of the application using NUnit.
  • Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake Schemas.
  • Created and maintained SQL Server scheduled jobs, executing stored procedures to extract data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization.
  • Used SQL Server Integration Services (SSIS) for extracting, transforming, and loading data into the target system from multiple sources.
  • Keen on keeping up with the newer technology stack that Google Cloud Platform (GCP) adds.
  • Writing UNIX shell scripts to automate the jobs and scheduling cron jobs for job automation using commands with Crontab.
  • Developed ELT jobs using Apache Beam to load data into BigQuery tables.
  • Transformed business problems into Big Data solutions and defined the Big Data strategy and roadmap. Installed, configured, and maintained data pipelines.
  • Developed the features, scenarios, and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin, and Ruby.
  • Designing the business requirement collection approach based on the project scope and SDLC methodology.
  • Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform, and load data from different sources like Azure SQL, Blob storage, Azure SQL Data Warehouse, and the write-back tool, and in the reverse direction.
  • Implemented Kafka producer and consumer applications on a Kafka cluster set up with the help of Zookeeper (see the producer/consumer sketch after this list).
  • Built a program with Python and Apache Beam, executed on Cloud Dataflow, to run data validation between raw source files and BigQuery tables (a validation sketch follows this list).
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
  • Used Spring Kafka API calls to process the messages smoothly on Kafka Cluster setup.
  • Involved in all the steps and scope of the project data approach to MDM, have created a Data Dictionary and Mapping from Sources to the Target in MDM Data Model.
  • Migrated on-premise data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Storage (ADLS) using Azure Data Factory (ADF V1/V2).
  • Ingested, transformed, and integrated structured data and delivered it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
  • Monitored cluster health by setting up alerts using Nagios and Ganglia.
  • Experience managing Azure Data Lake Storage (ADLS) and Data Lake Analytics, with an understanding of how to integrate with other Azure services; knowledge of U-SQL.
  • Responsible for wide-ranging data ingestion using Sqoop and HDFS commands. Accumulated partitioned data in various storage formats like text, JSON, Parquet, etc.
  • Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
  • Architect & implement medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Data warehouse experience with star schemas, snowflake schemas, slowly changing dimension (SCD) techniques, etc.
  • Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer
  • Responsible for designing and developing data ingestion from Kroger using Apache NiFi/Kafka.
  • Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, designing and developing POCs using Scala, Spark SQL, and MLlib libraries.
  • Performed all necessary day-to-day Git support for different projects; responsible for the design and maintenance of the Git repositories and the access control strategies.
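As an illustration of the Kafka producer and consumer applications mentioned above, here is a minimal sketch using the kafka-python client; the broker address, topic name, and message contents are assumed placeholders.

```python
# Minimal kafka-python sketch: a JSON producer and a matching consumer.
# Broker address and topic name are placeholders.
import json

from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["broker:9092"]        # placeholder broker list
TOPIC = "claims-events"          # placeholder topic

# Producer: serialize dicts as JSON and publish them to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"member_id": 123, "event": "enrollment"})
producer.flush()

# Consumer: read from the beginning of the topic and deserialize JSON values.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=10000,   # stop iterating after 10s of inactivity
)
for message in consumer:
    print(message.value)
```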
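A rough sketch of the raw-file-versus-BigQuery validation described above, done here directly with pandas and the google-cloud-bigquery client rather than a full Beam/Dataflow job; the file path, project, and table names are hypothetical.

```python
# Minimal validation sketch: compare a raw CSV's row count and numeric column
# sums against a BigQuery table. Paths and table names are placeholders.
import pandas as pd
from google.cloud import bigquery

RAW_FILE = "raw/source_extract.csv"                 # placeholder source file
TABLE = "my-project.analytics.source_extract"       # placeholder BigQuery table

# Load the raw source file.
raw_df = pd.read_csv(RAW_FILE)

# Pull the loaded table from BigQuery.
client = bigquery.Client()
bq_df = client.query(f"SELECT * FROM `{TABLE}`").to_dataframe()

# Simple reconciliation checks: row counts and per-column totals.
checks = {"row_count_match": len(raw_df) == len(bq_df)}
for column in raw_df.select_dtypes("number").columns:
    checks[f"sum_match_{column}"] = (
        abs(raw_df[column].sum() - bq_df[column].sum()) < 1e-6
    )

for name, passed in checks.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```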

Environment: Spark Streaming, Hive, Scala, Hadoop, Kafka, Spark, Sqoop, Docker, Spark SQL, TDD, Pig, NoSQL, Impala, Oozie, HBase, Data Lake, Zookeeper, Snowflake, Azure, Unix/Linux Shell Scripting, Python, PyCharm, Informatica PowerCenter, Linux, Git

Confidential, Atlanta, GA

Data Engineer

Responsibilities:

  • Created various complex SSIS/ETL packages to Extract, Transform and Load data
  • Defined and deployed monitoring, metrics, and logging systems on AWS.
  • Developed code to handle exceptions and push the code into the exception Kafka topic.
  • Was responsible for ETL and data validation using SQL Server Integration Services.
  • Implemented and managed ETL solutions and automated operational processes.
  • Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
  • Used the Oozie scheduler system to automate the pipeline workflow and orchestrate the MapReduce jobs that extract the data in a timely manner.
  • Integrated Kafka with Spark Streaming for real-time data processing.
  • Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
  • Worked on publishing interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.
  • Migrated an entire Oracle database to BigQuery and used Power BI for reporting.
  • Developed SSRS reports, SSIS packages to Extract, Transform and Load data from various source systems
  • Defined facts, dimensions and designed the data marts using the Ralph Kimball's Dimensional Data Mart modeling methodology using Erwin
  • Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get the job done.
  • Developed Python scripts to automate the data sampling process and ensured data integrity by checking for completeness, duplication, accuracy, and consistency.
  • Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad-hoc queries. This allowed for a more reliable and faster reporting interface, giving sub-second query response for basic queries.
  • Applied various machine learning algorithms and statistical modeling techniques, such as decision trees, logistic regression, and Gradient Boosting Machines, to build predictive models using the scikit-learn package in Python (see the sketch after this list).
  • Wrote various data normalization jobs for new data ingested into Redshift.
  • Created ad-hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
  • Analyzed existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler, and Database Engine Tuning Advisor to enhance performance.
  • Created Entity Relationship Diagrams (ERD), Functional diagrams, Data flow diagrams and enforced referential integrity constraints and created logical and physical models using Erwin
  • Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
  • Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
  • Used Git version control to manage the source code and integrating Git with Jenkins to support build automation and integrated with Jira to monitor the commits.
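A minimal sketch of the scikit-learn modeling workflow referenced above, using a synthetic dataset in place of the actual data; the features, labels, and hyperparameters are illustrative assumptions only.

```python
# Minimal scikit-learn sketch: train and evaluate a gradient boosting
# classifier on synthetic data. Real features/labels would come from Redshift.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real modeling dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.05, max_depth=3, random_state=42
)
model.fit(X_train, y_train)

# Evaluate on the held-out set.
scores = model.predict_proba(X_test)[:, 1]
print("Test AUC:", round(roc_auc_score(y_test, scores), 3))
```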

Environment: SQL Server, Erwin, Kafka, Python, MapReduce, Oracle, AWS, Redshift, Informatica, RDS, NoSQL, MySQL, PostgreSQL.

Confidential

Hadoop Developer

Responsibilities:

  • Involved in creating Hive tables, loading the data, and writing Hive queries that run internally as MapReduce jobs. Developed a custom file system plugin for Hadoop so it can access files on the data platform.
  • Imported and exported data between HDFS and Oracle Database using Sqoop.
  • Wrote Hive queries for data analysis to meet the business requirements.
  • Collected JSON data from an HTTP source and developed Spark APIs that help to do inserts and updates in Hive tables.
  • Responsible for Designing Logical and Physical data modeling for various data sources.
  • Performed logical data modeling, physical Data modeling (including reverse engineering) using the Erwin Data modeling tool.
  • Created dimensional model for the reporting system by identifying required dimensions and facts using Erwin.
  • Extensively used event-driven and scheduled AWS Lambda functions to trigger various AWS resources.
  • The plugin allows Hadoop MapReduce programs, HBase, Pig, and Hive to work unmodified and access files directly.
  • Created an ETL pipeline using Spark and Hive to ingest data from multiple sources.
  • Involved in using SAP and transactions in the SAP SD module for handling the client's customers and generating sales reports. Created reports using SQL Server Reporting Services (SSRS) for customized and ad-hoc queries.
  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Installed and configured Pig and wrote Pig Latin scripts.
  • Developed shell scripts for running Hive scripts in Hive and Impala.
  • Wrote MapReduce jobs using Pig Latin. Involved in ETL, data integration, and migration.
  • Imported data using Sqoop to load data from Oracle to HDFS on regular basis.
  • Developed scripts and batch jobs to schedule various Hadoop programs.
  • Created Hive tables and worked on them using HiveQL. Experienced in defining job flows.
  • Designed and implemented a MapReduce-based large-scale parallel relation-learning system.
  • Handled Hive queries using Spark SQL that integrates with the Spark environment (a minimal sketch follows this list).
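Below is a minimal sketch of querying Hive tables through Spark SQL as described above; the database and table names are placeholders, and the session assumes a cluster where Hive support is already configured.

```python
# Minimal sketch: run Hive queries through Spark SQL with Hive support enabled.
# Database and table names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-on-spark-sql-demo")
         .enableHiveSupport()          # assumes hive-site.xml is available to Spark
         .getOrCreate())

# Query an existing Hive table via Spark SQL.
daily_counts = spark.sql("""
    SELECT load_date, COUNT(*) AS row_count
    FROM analytics_db.customer_events      -- placeholder Hive table
    GROUP BY load_date
    ORDER BY load_date
""")

daily_counts.show()

# Persist the result back to Hive as a managed table.
daily_counts.write.mode("overwrite").saveAsTable("analytics_db.daily_event_counts")
```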

Environment: Hadoop, MapReduce, HDFS, Hive, Java, Hadoop distribution of Cloudera, AWS, Pig, HBase, Linux, XML, Eclipse, Oracle, PL/SQL, MongoDB, Toad.

Confidential

Software Engineer

Responsibilities:

  • Used software development best practices for object-oriented design and methodologies throughout the object-oriented development cycle.
  • Created Hive tables and wrote Hive queries using HiveQL
  • Involved in the configuration of Hibernate O/R mapping files.
  • Developed Persistence service layer by Using Hibernate to populate and fetch data from DB.
  • Extensively worked with Hibernate Query Language (HQL) to store and retrieve the data from Oracle database.
  • Implemented server-side programs by using Servlets and JSP.
  • Extracted files from Cassandra through Sqoop, placed them in HDFS, and processed them.
  • Worked on installing and configuring MapReduce and HDFS and developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Wrote Hive UDFs in Java where the functionality is too complex.
  • Used Pig (Pig Latin) scripts for ad-hoc data retrieval
  • Developed application on Struts MVC architecture utilizing Action Classes, Action Forms and validations.
  • Implemented authentication and authorization using Spring Security.
  • Developed shell scripts to run the nightly batch cycle and to set environment variables.
  • Used Maven to build the project, run unit tests and deployed artifacts to Nexus repository.
  • Involved in writing SQL queries and procedures.
  • Used JMS API for asynchronous communication to put the messages in the Message queue.
  • Developed JUnit Test cases for Unit Test cases and as well as System and User test scenarios.
  • Used log4j for logging the information.

Environment: Apache Hadoop, CDH (Cloudera Distribution), Java, HDFS, Hive, Sqoop, Eclipse, JSP, Hibernate, Oracle 9i, JBoss, JMS, Maven
