Big Data Engineer Resume
Nashville, TN
SUMMARY
- 7 years of IT experience in Business Requirement analysis, design, development and implementation of Data warehousing and Business Intelligence applications using ETL.
- Experience in design, development, and implementation of Big Data applications using Hadoop ecosystem frameworks and tools such as HDFS, MapReduce, YARN, Pig, Hive, Sqoop, Spark, Scala, Storm, HBase, Kafka, Flume, NiFi, Impala, Oozie, Zookeeper, Airflow, etc.
- Strong technical experience across the SDLC, including requirement gathering, design, analysis, coding, testing, documentation, and deployment.
- Worked on several projects (ACA, HCSIS, OMAP, IES, and CWIS) that required conversion from legacy systems to deliver a one-stop solution for public services in diverse areas such as senior benefits, health and financial assistance, child and family welfare, and transport.
- Experience in a Snowflake cloud data warehousing shared-technology environment, providing stable infrastructure and architecture, a secured environment, reusable generic frameworks, robust design, technology expertise, best practices, and automated SCBD (Secured Database Connections, Code Review, Build Process, Deployment Process) utilities.
- Understanding and working knowledge of Informatica CDC (Change Data Capture).
- Good knowledge of Relational Databases like SQL Server, Oracle, DB2.
- Experienced in Logical and Physical Database design & development, Normalization and Data modeling using Erwin and SQL Server Enterprise manager.
- Experience with slowly changing dimension and slowly growing target methodologies.
- Strong experience in migrating other databases to Snowflake.
- Played a key role in migrating Teradata objects into the Snowflake environment.
- Experience with Snowflake multi-cluster warehouses and virtual warehouses.
- Designed dimensional models, data lake architecture, and Data Vault 2.0 on Snowflake, and used the Snowflake logical data warehouse for compute.
- Experience in multidimensional modeling and RDBMS concepts (data marts, OLAP, OLTP) using star and snowflake schemas.
- Good understanding of writing test plans and test cases and of unit, system, integration, and functional testing.
- Strong experience in Informatica Data Quality (IDQ) and PowerCenter, including data cleansing, data profiling, data quality measurement, and data validation processing.
- Wrote UNIX shell scripts with pmcmd commands to automate Informatica workflows.
- Experience with different scheduling tools to meet scheduling requirements.
- Expertise in developing streaming applications in Scala using Kafka and Spark Structured Streaming (a minimal illustrative sketch appears at the end of this summary).
- Expertise in developing and tuning Spark applications using optimization techniques for executor sizing, memory management, garbage collection, and serialization, ensuring optimal application performance by following industry best practices.
- Hands on experience in importing and exporting data into HDFS and Hive using Sqoop.
- Exposure to column-oriented NoSQL databases such as HBase and Cassandra.
- Hands-on experience with the Amazon Web Services (AWS) cloud and its services such as EC2, S3, Athena, RDS, VPC, IAM, Elastic Load Balancing, Glue, Lambda, Redshift, Auto Scaling, CloudFront, CloudWatch, and other services in the AWS family.
- Experience in ingesting, processing, and analyzing structured and unstructured data on Hadoop clusters.
- Worked on microservices in PVS.
- Expertise in developing Spring-based microservices.
- Performed data quality issue analysis using SnowSQL by building analytical warehouses on Snowflake.
- Experienced in performing code reviews and closely involved in smoke testing and retrospective sessions.
- Good exposure to star and snowflake schemas and data modeling across different data warehouse projects.
- Involved in migrating a quality monitoring tool from AWS EC2 to AWS Lambda and built logical datasets to administer quality monitoring on Snowflake warehouses.
- Strong experience in the design and development of relational databases with multiple RDBMSs, including Oracle 10g, MySQL, and MS SQL Server, and in PL/SQL.
- Troubleshot production incidents requiring detailed analysis of issues in web and desktop applications, AutoSys batch jobs, and databases.
- Strong troubleshooting and production support skills, with the ability to interact effectively with end users.
- Experience in working with various SDLC methodologies like Waterfall, Agile Scrum, and TDD for developing and delivering applications.
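The bullets above reference Kafka-based streaming with Spark Structured Streaming in Scala. The sketch below is a minimal, illustrative example of that pattern only; the broker address, topic, event schema, and output paths are hypothetical placeholders, not details from any of the projects listed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object EventStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-structured-streaming-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical event schema; real schemas would come from the source systems.
    val schema = new StructType()
      .add("eventId", StringType)
      .add("eventTime", TimestampType)
      .add("payload", StringType)

    // Read the Kafka topic as a streaming DataFrame and parse the JSON payload.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
      .option("subscribe", "events")                    // placeholder topic
      .load()
      .select(from_json($"value".cast("string"), schema).as("e"))
      .select("e.*")

    // Simple on-the-fly transformation: windowed counts per event id.
    val counts = events
      .withWatermark("eventTime", "10 minutes")
      .groupBy(window($"eventTime", "5 minutes"), $"eventId")
      .count()

    // Write the results to a file sink; paths are placeholders.
    counts.writeStream
      .outputMode("append")
      .format("parquet")
      .option("path", "/data/stream/counts")
      .option("checkpointLocation", "/data/stream/_checkpoints")
      .start()
      .awaitTermination()
  }
}
```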
TECHNICAL SKILLS
Hadoop/Big Data: HDFS, MapReduce, Spark, YARN, Kafka, Pig, Hive, Sqoop, Storm, Flume, Oozie, Impala, HBase, Hue, Zookeeper.
ETL Tools: Informatica 9.5.1/9.6.1/9.8.1, SSIS 2012
Programming Languages: Java, PL/SQL, Pig Latin, Python, R, HiveQL, Scala, SQL
Java/J2EE & Web Technologies: J2EE, EJB, JSF, Servlets, JSP, JSTL, CSS, HTML, XHTML, XML, AngularJS, AJAX, JavaScript, jQuery.
Development Tools: Eclipse, SVN, Git, Ant, Maven, SOAP UI
Databases: Oracle 11g/10g/9i, Teradata, MS SQL, Snowflake (cloud)
NoSQL Databases: Apache HBase, MongoDB
Frameworks: Struts, Hibernate, and Spring MVC.
Distributed platforms: Hortonworks, Cloudera.
Operating Systems: UNIX, Ubuntu Linux, and Windows 2000/XP/Vista/7/8
PROFESSIONAL EXPERIENCE
Confidential, Nashville - TN
Big Data Engineer
Responsibilities:
- Involved in requirements gathering, analysis, design, development, change management and deployment.
- Assisted Business Analysts with drafting requirements and implemented the design and development of various ETL components for multiple applications.
- Monitored and modified daily and on-demand batch jobs that load warehouse tables from various source systems such as Oracle, SQL Server, and Salesforce.
- Fixed sessions that resulted in undesired load data in lower and higher environments for both the data warehouse and data marts.
- Defined match rules and columns for the match and merge process.
- Performed data cleansing and address correction using IDQ.
- Built profiling, cleansing and validation plans using IDQ.
- Enhanced existing mappings whose scope was extended to accommodate new logistics, and performed regression testing before migrating them to higher environments.
- Extracted data from heterogeneous sources and applied complex business logic to network data, normalizing raw data so that BI teams could use it to detect anomalies (see the illustrative sketch after this section's environment list).
- Worked with the Spark ecosystem using Scala and Hive queries on different data formats such as text files.
- Developed Spark code using Scala and Spark-SQL for faster processing and testing.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time.
- Involved in developing real-time streaming applications using Kafka, Hive, PySpark, and Apache Flink on a distributed Hadoop cluster.
- Involved in migrating objects from Teradata to Snowflake and created Snowpipes for continuous data loads.
- Used the Snowflake Time Travel feature to access historical data.
- Performed debugging, troubleshooting, modification, and unit testing of integration solutions.
- Performed unit and integration testing and documented the test strategy and results.
- Worked closely with teams from other enterprise software vendors whose products were being integrated.
- Applied innovative approaches to improve operational processes and performance, ensured data quality, and built and unit tested integration components.
- Utilized Apache Spark with Python to develop and execute Big Data Analytics.
- Developed a common Flink module for serializing and deserializing Avro data by applying a schema.
- Implemented layered architecture for Hadoop to modularize design. Developed framework scripts to enable quick development.
- Used a microservice architecture with Spring Boot-based services interacting through REST to build, test, and deploy identity microservices.
- Developed Spark scripts using Python on AWS EMR for data aggregation, validation, and ad hoc querying.
- Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
- Developed a CI/CD pipeline to automate builds and deployments to the Dev, QA, and production environments.
- Supported production jobs and developed several automated processes to handle errors and notifications. Also tuned slow jobs by improving the design and configuration of PySpark jobs.
- Created standard report Subscriptions and Data Driven Report Subscriptions.
Environment: Hadoop, Informatica 9.8.1/10.2, IDQ, Salesforce, Embarcadero ER Studio, MapReduce, Spark, Spark MLlib, Kafka, Flink, Pig, Hive, AWS, Tableau, SQL, PostgreSQL, Python, PySpark, SQL Server 2012, T-SQL, CI/CD, Git, XML, Splunk, Snowflake (cloud).
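As referenced in the bullet on normalizing heterogeneous network data, the following is a minimal, illustrative Scala/Spark SQL sketch of that style of batch processing; the table names, JDBC connection details, and columns are hypothetical placeholders rather than actual project artifacts.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object NetworkDataNormalizerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-normalization-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Raw feed landed in a Hive staging table; the table name is a placeholder.
    val raw = spark.table("staging.network_events")

    // Reference data pulled from an RDBMS over JDBC; connection details are placeholders.
    val devices = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
      .option("dbtable", "REF.DEVICES")
      .option("user", sys.env.getOrElse("DB_USER", ""))
      .option("password", sys.env.getOrElse("DB_PASS", ""))
      .load()
      .withColumnRenamed("DEVICE_ID", "device_id") // align the JDBC column name with the Hive column

    // Normalize: standardize keys, drop malformed rows, enrich with device metadata.
    val normalized = raw
      .withColumn("device_id", upper(trim(col("device_id"))))
      .filter(col("event_ts").isNotNull)
      .join(broadcast(devices), Seq("device_id"), "left")

    // Publish a curated table for downstream BI and anomaly detection.
    normalized.write.mode("overwrite").saveAsTable("curated.network_events")
  }
}
```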
Confidential, Austin - TX
Sr. Data Engineer
Responsibilities:
- Extracted, transformed, and loaded (ETL) data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and Azure Data Lake Analytics, and processed the data in Azure Databricks.
- Installed and configured Hadoop MapReduce and HDFS and developed multiple MapReduce jobs in Python for data cleaning and preprocessing. Used the Hortonworks Sandbox to monitor and manage the Hadoop cluster.
- Created Azure SQL database, performed monitoring and restoring of Azure SQL database. Performed migration of Microsoft SQL server to Azure SQL database.
- Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity.
- Installed and configured Hive and Apache Pig and wrote queries for data analysis.
- Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs.
- Worked with writing SQL queries and stored procedures with PL/SQL code.
- Gained working knowledge of the Snowflake data warehouse by extracting data from the data lake and sending it through the other integration stages. Used Snowflake to develop and maintain complex SQL queries, views, functions, and reports; ETL pipelines were used with both SQL and NoSQL sources.
- Exported data from HDFS to Teradata using Sqoop on a regular basis.
- Coordinated development work with team members, reviewed ETL jobs, and created scripts for scheduling jobs and implementation; used Airflow for scheduling.
- Used Snowflake expertise to create and maintain tables and views.
- Validated data feeds from the source systems to the Snowflake DW cloud platform.
- Integrated and automated data workloads into the Snowflake warehouse.
- Defined virtual warehouse sizing in Snowflake for different types of workloads.
- Worked on importing and exporting data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.
- Redesigned views in Snowflake to increase performance.
- Optimized existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs; also analyzed SQL scripts and designed solutions implemented with PySpark.
- Imported required tables from RDBMSs into HDFS using Sqoop, and used Apache Spark Streaming and Kafka to stream data into HBase in real time.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra (see the sketch below).
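A minimal sketch of the Kafka-to-Cassandra flow described in the last bullet, shown here with Spark Structured Streaming in Scala and assuming the spark-cassandra-connector is on the classpath; the broker, topic, schema, keyspace, and table names are placeholders, not project details.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object LearnerModelStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-cassandra-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical learner-event schema.
    val schema = new StructType()
      .add("learner_id", StringType)
      .add("course_id", StringType)
      .add("event_ts", TimestampType)

    // Read learner events from Kafka and parse the JSON payload.
    val learners = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
      .option("subscribe", "learner-events")            // placeholder topic
      .load()
      .select(from_json($"value".cast("string"), schema).as("l"))
      .select("l.*")

    // Persist each micro-batch into Cassandra through the connector's batch write path.
    learners.writeStream
      .foreachBatch { (batch: DataFrame, _: Long) =>
        batch.write
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "learning", "table" -> "learner_events")) // placeholders
          .mode("append")
          .save()
      }
      .option("checkpointLocation", "/data/stream/_cassandra_checkpoints")
      .start()
      .awaitTermination()
  }
}
```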
Confidential, Santa Clara - CA
Sr. Data Engineer
Responsibilities:
- Understood business needs, analyzed functional specifications, and mapped them to the design and development of MapReduce programs and algorithms.
- Designed and implemented a MapReduce-based, large-scale parallel relation-learning system.
- Customized Flume interceptors to encrypt and mask sensitive customer data as per requirements.
- Built recommendations using item-based collaborative filtering in Apache Spark.
- Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Built a web portal using JavaScript that makes a REST API call to Elasticsearch and retrieves the row key.
- Used Kibana, an open-source, browser-based analytics and search dashboard for Elasticsearch.
- Used Amazon Web Services (AWS) such as EC2 and S3 for small data sets.
- Imported data from various sources into the Cassandra cluster using Java APIs or Sqoop.
- Developed iterative algorithms using Spark Streaming in Scala for near real-time dashboards.
- Installed and configured Hadoop and the Hadoop stack on a 40-node cluster.
- Involved in customizing the MapReduce partitioner to route key-value pairs from the mapper to reducers in XML format according to requirements.
- Configured Flume for efficiently collecting, aggregating, and moving large amounts of log data.
- Involved in creating Hive tables, loading data into them, and writing Hive queries to analyze the data.
- Implemented AWS services to provide a variety of computing and networking capabilities to meet application needs.
- Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs.
- Designed and built the reporting application, which uses Spark SQL to fetch HBase table data and generate reports.
- Worked on batch processing of data sources using Apache Spark and Elasticsearch.
- Extracted the needed data from the server into HDFS and bulk loaded the cleaned data into HBase.
- Used different file formats such as text files, SequenceFiles, Avro, RCFile (Record Columnar), and ORC.
- Strong experience in implementing data warehouse solutions in Amazon Web Services (AWS) Redshift; worked on various projects to migrate data from on-premises databases to AWS Redshift, RDS, and S3.
- Involved in ETL, data integration, and migration.
- Responsible for creating Hive UDFs that helped spot market trends.
- Optimized Hadoop MapReduce code and Hive/Pig scripts for better scalability, reliability, and performance.
- Stored the analyzed results back into the Cassandra cluster.
- Developed custom aggregate functions using Spark SQL and performed interactive querying (see the sketch after this section).
Environment: Hadoop, MapReduce, Spark, Spark MLlib, Kafka, NiFi, SQL, Pig, Hive, AWS, PostgreSQL, Python, PySpark, Microservices, SQL Server 2012, T-SQL, CI/CD, Git, XML, Tableau.
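As referenced in the bullet on custom aggregate functions, the following is a minimal, illustrative Scala sketch of a Spark SQL typed Aggregator registered for interactive querying; the metric, table, and column names are hypothetical.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// Running average implemented as a custom aggregate: a (sum, count) buffer reduced to a Double.
case class AvgBuffer(sum: Double, count: Long)

object RunningAverage extends Aggregator[Double, AvgBuffer, Double] {
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)
  def reduce(b: AvgBuffer, value: Double): AvgBuffer = AvgBuffer(b.sum + value, b.count + 1)
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = AvgBuffer(b1.sum + b2.sum, b1.count + b2.count)
  def finish(b: AvgBuffer): Double = if (b.count == 0) 0.0 else b.sum / b.count
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object CustomAggregateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("custom-aggregate-sketch").getOrCreate()

    // Register the aggregator so it can be called from SQL for interactive querying.
    spark.udf.register("running_avg", udaf(RunningAverage))

    // Table and column names are placeholders.
    spark.sql(
      "SELECT region, running_avg(latency_ms) AS avg_latency FROM metrics GROUP BY region"
    ).show()
  }
}
```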
Confidential, NJ
Data Engineer
Responsibilities:
- Analyzed functional specifications based on project requirements.
- Ingested data from various data sources into Hadoop HDFS/Hive tables using Sqoop, Flume, and Kafka.
- Extended Hive core functionality by writing custom UDFs using Java.
- Developed Hive queries for user requirements.
- Worked on multiple POCs implementing a data lake for multiple data sources ranging from Teamcenter and SAP to Workday and machine logs.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data (see the log-processing sketch after this section).
- Created and owned the strategic roadmap for data integration across the enterprise.
- Applied a strong understanding of enterprise data integration patterns and techniques.
- Validated data integration by developing and executing test plans.
- Planned, scheduled, and implemented Oracle to MS SQL Server migrations for in-house applications and tools.
- Designed and developed microservices business components using Spring.
- Worked on the Solr search engine to index incident report data and developed dashboards in the Banana reporting tool.
- Integrated Tableau with Hadoop data source for building dashboard to provide various insights on sales of the organization.
- Worked on Spark in building BI reports using Tableau. Tableau was integrated with Spark using Spark-SQL.
- Developed Spark jobs using Scala and Python on top of Yarn/MRv2 for interactive and Batch Analysis.
- Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
- Developed workflows in LiveCompare to analyze SAP data and support reporting.
- Worked on Java development, support, and tools support for in-house applications.
- Participated in daily scrum meetings and iterative development.
- Built search functionality for searching through millions of files for logistics groups.
Environment: Hadoop, Hive, Sqoop, Spark, Kafka, Scala, MS SQL Server PDW, AWS, Java, microservices.
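The bullets above mention a data lake POC over machine logs and Spark development in Scala; the sketch below is a minimal, illustrative batch job for that kind of log parsing. The input path, log layout, regex, and output table are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object MachineLogBatchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("machine-log-batch-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Raw machine logs landed in the data lake; path and log layout are placeholders.
    val raw = spark.read.text("/datalake/raw/machine_logs/")

    // Pull the timestamp, machine id, and status code out of each line with a hypothetical pattern.
    val pattern = """^(\S+ \S+) machine=(\S+) status=(\d+)"""
    val parsed = raw
      .select(
        regexp_extract(col("value"), pattern, 1).as("log_ts"),
        regexp_extract(col("value"), pattern, 2).as("machine_id"),
        regexp_extract(col("value"), pattern, 3).cast("int").as("status")
      )
      .filter(col("machine_id") =!= "")

    // Daily error counts per machine, written to a curated Hive table for reporting.
    parsed
      .withColumn("log_date", to_date(col("log_ts")))
      .groupBy("log_date", "machine_id")
      .agg(count(when(col("status") >= 500, 1)).as("error_count"))
      .write.mode("overwrite").saveAsTable("curated.machine_log_errors")
  }
}
```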
Confidential
Data Engineer
Responsibilities:
- Worked on the Hortonworks HDP 2.5 distribution.
- Responsible for building a scalable distributed data solution using Hadoop.
- Involved in importing data from MS SQL Server, MySQL and Teradata into HDFS using Sqoop.
- Played a key role in dynamic partitioning and bucketing of the data stored in Hive.
- Wrote HiveQL queries integrating different tables to create views that produce the required result sets.
- Collected log data from web servers and integrated it into HDFS using Flume.
- Worked on loading and transforming large sets of structured and unstructured data.
- Used MapReduce programs for data cleaning and transformation and loaded the output into Hive tables in different file formats.
- Created data pipelines for different events to load the data from DynamoDB to AWS S3 bucket and then into HDFS location.
- Involved in loading data into HBase NoSQL database.
- Built, managed, and scheduled Oozie workflows for end-to-end job processing.
- Worked on extending Hive and Pig core functionality by writing custom UDFs using Java.
- Analyzed large volumes of structured data using Spark SQL.
- Migrated HiveQL queries to Spark SQL to improve performance (see the sketch after this section).
Environment: Hortonworks, Hadoop, HDFS, Pig, Sqoop, Hive, Oozie, Zookeeper, NoSQL, HBase, Shell Scripting, Scala, Spark, SparkSQL.
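As referenced in the HiveQL-to-Spark SQL migration bullet, the following is a minimal, illustrative Scala sketch of that migration pattern; the database, table, and column names are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HiveToSparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hiveql-to-sparksql-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // An existing HiveQL query can usually run unchanged on the Spark SQL engine...
    val bySource = spark.sql(
      """SELECT source_system, COUNT(*) AS row_cnt
        |FROM warehouse.transactions
        |WHERE load_date = '2020-01-01'
        |GROUP BY source_system""".stripMargin)

    // ...or be rewritten with the DataFrame API, which makes caching and tuning easier.
    val sameResult = spark.table("warehouse.transactions")
      .filter(col("load_date") === "2020-01-01")
      .groupBy("source_system")
      .agg(count(lit(1)).as("row_cnt"))

    bySource.show()
    sameResult.show()
  }
}
```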