Sr Big Data Engineer Resume
Seattle, WA
SUMMARY
- Overall 9+ years of experience in the software industry, including 5+ years as an Azure Data Engineer.
- Experience building data pipelines using Azure Data Factory and Azure Databricks, loading data into Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling and granting database access.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Developed highly scalable Spark applications using Spark Core, DataFrames, Spark SQL, and the Spark APIs in Scala; worked on real-time data integration using Kafka, Spark Streaming, and HBase.
- Good understanding of Spark and MPP architectures, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.
- Good understanding of Hadoop and YARN architecture, along with the various Hadoop daemons such as JobTracker, TaskTracker, NameNode, DataNode, and Resource/Cluster Manager, as well as Kafka (distributed stream processing).
- Experience in database design and development with business intelligence using SQL Server 2014/2016, Integration Services (SSIS), DTS packages, SQL Server Analysis Services (SSAS), DAX, OLAP cubes, star schema, and snowflake schema.
- Solid understanding of Hadoop MRv1 and MRv2 (YARN) architectures.
- Strong skills in visualization tools such as Power BI and Excel (formulas, pivot tables, charts, and DAX commands).
- Developed core modules in large cross-platform applications using Java, JSP, Servlets, Hibernate, RESTful web services, JDBC, JavaScript, XML, and HTML.
- Experience in analyzing data using HiveQL and MapReduce programs.
- Experienced in ingesting data into HDFS from relational databases such as MySQL, Oracle, DB2, Teradata, and Postgres using Sqoop.
- Experienced in importing real-time streaming logs and aggregating the data into HDFS using Kafka and Flume.
- Experience in creating Docker containers leveraging existing Linux containers and AMIs, in addition to creating Docker containers from scratch.
- Well versed in various Hadoop distributions, including Cloudera (CDH), Hortonworks (HDP), and Azure HDInsight.
- Extended Hive and Pig core functionality with custom User-Defined Functions (UDFs), User-Defined Table-Generating Functions (UDTFs), and User-Defined Aggregate Functions (UDAFs).
- Experience working on NoSQL Databases like HBase, Cassandra and MongoDB.
- Experience in Python, Scala, shell scripting, and Spark.
- Experience testing MapReduce programs using MRUnit, JUnit, and EasyMock.
- Extensive experience performing ETL on structured and semi-structured data using Pig Latin scripts.
- Experience developing Spark applications using Spark SQL, PySpark, and Delta Lake in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the sketch after this list).
- Experienced in implementing SAN TR migrations, both host-based and array-based.
- Hands-on experience performing host-based online SAN migrations.
- Worked as a Cloud Administrator on Microsoft Azure, configuring virtual machines, storage accounts, and resource groups.
- Experience with MS SQL Server Integration Services (SSIS), T-SQL, stored procedures, and triggers.
- Experience with Azure Data Factory (ADF), Integration Runtime (IR), file system data ingestion, and relational data ingestion.
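The PySpark/Delta Lake work summarized above typically follows an extract-transform-load pattern in Databricks. Below is a minimal sketch of such a pipeline; the paths, table names, and column names are illustrative assumptions rather than details from any specific engagement.

```python
# Minimal PySpark sketch of a Databricks-style Delta Lake pipeline.
# Paths, table names, and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

# Extract: read raw usage events arriving in different file formats
csv_events = spark.read.option("header", True).csv("/mnt/raw/usage_csv/")
json_events = spark.read.json("/mnt/raw/usage_json/")

# Transform: align the two sources and aggregate usage per customer per day
events = (csv_events.select("customer_id", "event_ts", "feature")
          .unionByName(json_events.select("customer_id", "event_ts", "feature")))

daily_usage = (events
               .withColumn("event_date", F.to_date("event_ts"))
               .groupBy("customer_id", "event_date")
               .agg(F.count("*").alias("event_count"),
                    F.countDistinct("feature").alias("distinct_features")))

# Load: write the aggregate as a Delta table (Delta Lake is built into Databricks)
(daily_usage.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("event_date")
    .save("/mnt/curated/daily_usage"))
```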
TECHNICAL SKILLS
Big Data Technologies: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Hue, Ambari, Zookeeper, Kafka, Apache Spark, Spark Streaming, Impala, HBase
Hadoop Distributions: Cloudera, Hortonworks, Apache, AWS EMR, Docker, Databricks
Languages: C, Java, PL/SQL, Python, Pig Latin, Hive QL, Scala, Regular Expressions
IDE & Build Tools, Design: Eclipse, NetBeans, IntelliJ, JIRA, Microsoft Visio, PyCharm
Web Technologies: HTML, CSS, JavaScript, XML, JSP, RESTful, SOAP
Operating Systems: Windows (XP,7,8,10), UNIX, LINUX, Ubuntu, CentOS
Reporting Tools: Tableau, Docker, Power View for Microsoft Excel, Talend, MicroStrategy
Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL Database (HBase, Cassandra, MongoDB), Teradata, IBM DB2
Build Automation tools: SBT, Ant, Maven
Version Control Tools: GIT
Cloud: AWS S3, AWS EMR, AWS EC2, Azure Data Lake, Azure Data Factory, Azure Blob Storage, HDInsight, Azure SQL Server.
PROFESSIONAL EXPERIENCE
Sr Big Data Engineer
Confidential, Seattle, WA
Responsibilities:
- Worked closely with the business analysts to convert the business requirements into technical requirements and prepared low- and high-level documentation.
- Worked on business problems to develop and articulate solutions using Teradata’s UDA Framework and multi-level data architecture.
- Analyzed different big data analytics tools, including Hive, Impala, and Sqoop, for importing data from RDBMSs into HDFS.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS (see the sketch after this list).
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
- Designed the high-level ETL architecture for overall data transfer from OLTP to OLAP systems.
- Improved the performance and optimization of existing algorithms in Hadoop using the Spark context, Spark SQL, and Spark on YARN.
- Experience using Amazon Web Services including Kinesis, Lambda, SQS/SNS, S3, RDS.
- Experience in reading from and writing data to Amazon S3 in Spark Applications.
- Experience selecting and configuring the right Amazon EC2 instances and accessing key AWS services using client tools.
- Involved in creating a data lake by extracting the customer's big data from various data sources into Hadoop HDFS, including data from Excel, flat files, Oracle, SQL Server, MongoDB, Cassandra, HBase, Teradata, Netezza, and server log data.
- Built scalable and robust data pipelines for the Business Partners Analytical Platform to automate their reporting dashboard using Spark SQL, and scheduled the pipelines.
- Created various documents such as the source-to-target data mapping document, unit test cases, and the data migration document.
- Imported data from structured data sources into HDFS using Sqoop incremental imports.
- Performed data synchronization between EC2 and S3, Hive stand-up, and AWS profiling.
- Created Hive tables, partitions and implemented incremental imports to perform ad-hoc queries on structured data.
- Worked with NoSQL databases such as HBase, Cassandra, DynamoDB (AWS), and MongoDB.
- Involved in loading data from the UNIX file system into HDFS using Flume, Kettle, and the HDFS API.
- Implemented end-to-end systems for data analytics and data automation, integrated with custom visualization tools, using R, Hadoop, MongoDB, and Cassandra.
- Created Hive generic UDFs to process business logic with HiveQL and built Hive tables using list and hash partitioning.
- Developed SQL scripts using Spark for handling different data sets and verified their performance against MapReduce jobs.
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python, along with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
- Implemented mapping through Elasticsearch to improve search results and business flow.
- Extensively used Apache Sqoop for efficiently transferring bulk data between Apache Hadoop and relational databases (Oracle, MySQL) for predictive analytics.
- Supported MapReduce programs running on the cluster and wrote MapReduce jobs using the Java API.
- Imported data from mainframe datasets into HDFS using Sqoop; also handled importing data from various sources (Oracle, DB2, Cassandra, and MongoDB) into Hadoop and performed transformations using Hive and MapReduce.
- Created MapReduce jobs running over HDFS for data mining and analysis using R, and loaded and stored data with Pig scripts and R for MapReduce operations.
- Wrote Hive queries for data analysis to meet the business requirements.
- Developed the technical strategy for integrating Spark for both pure streaming and more general data computation needs.
- Utilized Agile Scrum methodology to help manage and organize a team of 4 developers, with regular code review sessions.
- Developed a JDBC connection to get the data from Azure SQL and feed it to a Spark job.
- Wrote scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Optimized the mappings using various optimization techniques and debugged some existing mappings using the Debugger to test and fix them.
- Used an S3 bucket to store the JARs and input datasets, and used DynamoDB to store the processed output from the input data set.
- Updated maps, sessions, and workflows as part of ETL changes, and modified existing ETL code and documented the changes.
- Developed Python, shell/Perl, and PowerShell scripts for automation purposes and component unit testing using the Azure Emulator.
- Developed MapReduce (YARN) jobs for cleaning, accessing, and validating the data, and installed the Oozie workflow engine to run multiple Hive and Pig jobs.
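The Kafka-to-HDFS ingestion referenced above can be expressed concisely with Spark's Structured Streaming API (the DataFrame-based successor to DStream-style Spark Streaming). The sketch below is illustrative only; the broker address, topic name, and HDFS paths are assumptions.

```python
# Minimal PySpark Structured Streaming sketch: Kafka -> HDFS.
# The broker address, topic name, and HDFS paths are illustrative assumptions,
# and the job needs the spark-sql-kafka connector available on the cluster.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Subscribe to a Kafka topic
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "usage-events")
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers key/value as binary; cast the payload to strings for storage
payload = events.select(col("key").cast("string"),
                        col("value").cast("string"),
                        col("timestamp"))

# Append the stream to HDFS as Parquet, with a checkpoint for fault tolerance
query = (payload.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streams/usage-events")
         .option("checkpointLocation", "hdfs:///checkpoints/usage-events")
         .outputMode("append")
         .start())

query.awaitTermination()
```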
Environment: Hadoop, Java, MapReduce, AWS, HDFS, Redshift, Spark, Hive, Pig, Linux, XML, Eclipse, Cloudera CDH4/5, DB2, YARN, SQL Server, Informatica, Oracle 12c, SQL, Scala, Azure, MySQL, R, Teradata, EC2, Flume, Zookeeper, Python, Elasticsearch, DynamoDB, Hortonworks, ETL, AWS ELB, S3, Lambda, Kinesis.
Azure Data Engineer
Confidential, St Louis, MO
Responsibilities:
- Understood requirements, built code, and guided other developers during development activities to produce high-standard, stable code within the limits of Confidential and client processes, standards, and guidelines.
- Developed Informatica mappings based on client requirements and for the analytics team.
- Performed end-to-end system integration testing.
- Involved in functional testing and regression testing.
- Designed AWS architecture, cloud migration, AWS EMR, DynamoDB, Redshift, and event processing using Lambda functions.
- Automated DAG generation based on the source system using dynamic DAGs; data extraction was done using Sqoop and DMS. Migrated 500+ tables from in-house storage to S3.
- Reviewed and wrote SQL scripts to verify data from source systems to targets.
- Worked on transformations to prepare the data required by the analytics team for visualization and business decisions.
- Reviewed plans and provided feedback on gaps, timelines, and execution feasibility as required by the project.
- Participated in KT sessions conducted by the customer and other business teams and provided feedback on requirements.
- Involved in migrating the client's data warehouse architecture from on-premises to the Azure cloud.
- Created pipelines in ADF using linked services to extract, transform, and load data from multiple sources such as Azure SQL, Blob storage, and Azure SQL Data Warehouse.
- Created storage accounts as part of the end-to-end environment for running jobs.
- Implemented Azure Data Factory operations and deployments for moving data from on-premises into the cloud.
- Designed data auditing and data masking for security purposes.
- Monitored end-to-end integration using Azure Monitor.
- Implemented data movement from on-premises to the cloud in Azure.
- Developed batch processing solutions using Data Factory and Azure Databricks (see the sketch after this list).
- Implemented Azure Databricks clusters, notebooks, jobs, and autoscaling.
- Designed data encryption for data at rest and in transit.
- Designed relational and non-relational data stores on Azure.
- Prepared the ETL test strategy, designs, and test plans to execute test cases for ETL and BI systems.
- Created ETL test scenarios, test cases, and plans to execute them.
- Interacted with business users and understood their requirements.
- Good understanding of data warehouse concepts.
- Good exposure to and understanding of the Hadoop ecosystem.
- Proficient in SQL and other relational databases.
- Good exposure to Microsoft Power BI.
- Good understanding and working knowledge of the Python language.
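A minimal sketch of the kind of Databricks batch job described above, reading from Azure Blob storage and loading into Azure SQL over JDBC. The mount point, server, database, table, and secret-scope names are illustrative assumptions.

```python
# Minimal PySpark sketch of a Databricks batch job: Blob storage -> Azure SQL.
# The mount point, server, database, table, and secret-scope names are
# illustrative assumptions; dbutils is provided by the Databricks runtime.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adf-databricks-batch").getOrCreate()

# Read raw CSV files from an Azure Blob storage container mounted in DBFS
raw = (spark.read
       .option("header", True)
       .option("inferSchema", True)
       .csv("/mnt/rawdata/sales/"))

# Basic cleansing and aggregation before loading
curated = (raw
           .dropDuplicates(["order_id"])
           .withColumn("order_date", F.to_date("order_date"))
           .groupBy("order_date", "region")
           .agg(F.sum("amount").alias("total_amount")))

# Load the curated result into an Azure SQL table over JDBC
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=analytics"
(curated.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.daily_sales")
    .option("user", "etl_user")
    .option("password", dbutils.secrets.get("kv-scope", "sql-password"))
    .mode("overwrite")
    .save())
```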
Environment: SQL Database, Azure Data Factory, Python, Pig, Sqoop, Kafka, Apache Cassandra, Oozie, Impala, Cloudera, AWS, AWS EMR, Redshift, Flume, Apache Hadoop, HDFS, PostgreSQL, Hive, MapReduce, Zookeeper, MySQL, Eclipse, DynamoDB, PL/SQL.
Data Engineer
Confidential, Jersey City, NJ
Responsibilities:
- Worked on analyzing the Hadoop cluster and different big data analytics and processing tools, including Sqoop, Hive, Spark, Kafka, and PySpark.
- Worked on the MapR platform team, performance tuning Hive and Spark jobs for all users.
- Used the Hive Tez engine to increase the performance of the applications.
- Worked on incidents created by users for the platform team on Hive and Spark issues by monitoring Hive and Spark logs and fixing them, or by raising MapR cases.
- Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
- Tested cluster performance using the cassandra-stress tool to measure and improve reads/writes.
- Worked on the Hadoop data lake, ingesting data from different sources such as Oracle and Teradata through the Infoworks ingestion tool.
- Worked on Arcadia to create analytical views on top of tables, so that reporting points to the Arcadia view and there are no reporting issues or table locks while a batch is loading.
- Worked on a Python API for converting assigned group-level permissions to table-level permissions using MapR ACEs, by creating a unique role and assigning it through the EDNA UI.
- Queried and analyzed data from Cassandra for quick searching, sorting and grouping through CQL.
- Migrated various Hive UDFs and queries into Spark SQL for faster processing (see the sketch after this list).
- Configured the pipeline to receive real-time data from Apache Kafka and store the stream data in HDFS using Kafka Connect.
- Hands-on experience in Spark using Scala and Python, creating RDDs and applying transformations and actions.
- Extensively performed complex data transformations in Spark using Scala.
- Involved in converting Hive/SQL queries into Spark transformations using Scala.
- Used PySpark and Scala to process the data.
- Used Bitbucket and Git repositories.
- Used text, Avro, ORC, and Parquet file formats for Hive tables.
- Experienced in scheduling jobs using crontab.
- Used Sqoop to import data from Oracle, Teradata to Hadoop.
- Created master job sequences for integration (ETL control) logic to capture job success, failure, error, and audit information for reporting.
- Used the TES scheduler engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Spark, Kafka, and Sqoop.
- Experienced in creating recursive and replicated joins in Hive.
- Experienced in developing scripts for doing transformations using Scala.
- Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using the HBase-Hive integration.
- Experienced in creating shell scripts and automating jobs.
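A minimal PySpark sketch of the Hive-to-Spark SQL migration referenced above: the same aggregation expressed once as a HiveQL query run through Spark SQL and once with the DataFrame API. The database, table, and column names are illustrative assumptions.

```python
# Minimal PySpark sketch: running a former HiveQL aggregation as Spark SQL
# and rewriting it with the DataFrame API. Table/column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark-sql")
         .enableHiveSupport()   # lets Spark read Hive metastore tables
         .getOrCreate())

# Original HiveQL, executed through Spark SQL instead of the Hive engine
top_users = spark.sql("""
    SELECT user_id, COUNT(*) AS event_count
    FROM events_db.web_events
    WHERE event_date >= '2020-01-01'
    GROUP BY user_id
    ORDER BY event_count DESC
    LIMIT 100
""")

# Equivalent DataFrame version of the same query
top_users_df = (spark.table("events_db.web_events")
                .filter(F.col("event_date") >= "2020-01-01")
                .groupBy("user_id")
                .agg(F.count("*").alias("event_count"))
                .orderBy(F.desc("event_count"))
                .limit(100))

# Persist the result as a Parquet-backed table for downstream reporting
top_users_df.write.mode("overwrite").format("parquet") \
    .saveAsTable("events_db.top_users")
```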
Environment: HDFS, Hadoop, Python, Hive, Sqoop, Flume, Spark, Map Reduce, Scala, Oozie, YARN, Tableau, Spark-SQL, Spark-MLlib, Impala, Nagios, UNIX Shell Scripting, Zookeeper, Kafka, Agile Methodology, SBT.
Java Developer
Confidential, NYC, NY
Responsibilities:
- Responsible for requirement gathering and analysis through interaction with end users; designed use-case diagrams, class diagrams, and interaction diagrams using UML models in Rational Rose.
- Designed and developed the application using various design patterns, such as Session Facade, Business Delegate, and Service Locator, and worked with the Maven build tool.
- Involved in developing JSP pages using Struts custom tags, jQuery and Tiles Framework.
- Used JavaScript to perform client-side validations and Struts-Validator Framework for server-side validation.
- Good experience in Mule development.
- Developed web applications with rich Internet application features using Java applets, Silverlight, and JavaFX.
- Involved in creating database SQL and PL/SQL queries and stored procedures.
- Implemented Singleton classes for property loading and static data from DB.
- Debugged and developed applications using Rational Application Developer (RAD).
- Developed a web service to communicate with the database using SOAP.
- Developed DAOs (data access objects) using the Spring Framework and deployed the components to WebSphere Application Server.
- Actively involved in backend tuning of SQL queries and DB scripts; wrote commands using UNIX shell scripts.
- Involved in developing other subsystems' server-side components.
- Used Asynchronous JavaScript and XML (AJAX) for better and faster interactive Front-End.
- Developed unit test cases and used JUnit for unit testing of the application.
- Provided technical support for production environments, resolving issues, analyzing defects, and providing and implementing solutions; resolved high-priority defects per the schedule.
- Involved in the release management process to QA/UAT/Production regions.
- Used Maven to build the application EAR for deployment on WebLogic application servers.
- Developed the project in an Agile environment.
Environment: Java 1.6, Servlets, JSP, Struts 1.2, Git, Bash, IBM Rational Application Developer (RAD), WebLogic, Tomcat, Maven, Ant, Jenkins, WebSphere 6.0, AJAX, Rational ClearCase, Oracle 9i, Log4j.
Data Engineer /ETL Developer
Confidential, Charlotte, NC
Responsibilities:
- Responsibilities included gathering business requirements, developing the strategy for data cleansing and data migration, writing functional and technical specifications, creating source-to-target mappings, designing data profiling and data validation jobs in Informatica, and creating ETL jobs in Informatica.
- Worked on a Hadoop cluster that ranged from 4-8 nodes during the pre-production stage and was sometimes extended up to 24 nodes during production.
- Built APIs that allow customer service representatives to access the data and answer queries.
- Designed changes to transform current Hadoop jobs to HBase.
- Handled defect fixes efficiently and worked with the QA and BA teams for clarifications.
- Responsible for cluster maintenance, monitoring, commissioning and decommissioning data nodes, troubleshooting, and managing and reviewing data backups and log files.
- Worked closely with the ETL SQL Server Integration Services (SSIS) developers to explain the data transformations.
- The new Business Data Warehouse (BDW) improved query/report performance, reduced the time needed to develop reports, and established a self-service reporting model in Cognos for business users.
- Implemented bucketing and partitioning in Hive to assist users with data analysis.
- Used Oozie scripts for deployment of the application and Perforce as the secure versioning software.
- Implemented partitioning, dynamic partitions, and buckets in Hive (see the sketch after this list).
- Developed database management systems for easy access, storage, and retrieval of data.
- Performed DB activities such as indexing, performance tuning, and backup and restore.
- Expertise in writing Hadoop jobs for analyzing data using HiveQL (queries), Pig Latin (data flow language), and custom MapReduce programs in Java.
- Designed and developed data mapping procedures and the ETL data extraction, data analysis, and loading process for integrating data using R programming.
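A minimal sketch of the Hive partitioning, dynamic partitions, and bucketing mentioned above. The DDL and dynamic-partition insert are issued through PySpark's Hive support to keep the example in Python; the original work used Hive directly, and the database, table, and column names are illustrative assumptions.

```python
# Minimal sketch of Hive partitioning, dynamic partitions, and bucketing,
# issued here through PySpark's Hive support; database, table, and column
# names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-bucketing")
         .enableHiveSupport()
         .getOrCreate())

# Partitioned target table in ORC format
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.orders_part (
        order_id BIGINT,
        customer_id BIGINT,
        amount DECIMAL(12,2)
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC
""")

# Dynamic-partition insert from a staging table
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE sales_db.orders_part PARTITION (order_date)
    SELECT order_id, customer_id, amount, order_date
    FROM sales_db.orders_staging
""")

# Bucketing via the DataFrame API: a bucketed table keyed on customer_id
(spark.table("sales_db.orders_staging")
 .write
 .bucketBy(32, "customer_id")
 .sortBy("customer_id")
 .format("parquet")
 .mode("overwrite")
 .saveAsTable("sales_db.orders_bucketed"))
```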
Environment: Erwin, Oracle 10g, SQL Server 2005, BusinessObjects Data Integrator 6.x, VBScript, PL/SQL, Microsoft Project/Office.