Sr. Big Data Engineer Resume
SUMMARY
- Over 8 years of experience as a Big Data Engineer, Data Engineer, and Data Analyst, including designing, developing, and implementing data models for enterprise-level applications and systems.
- Experience working with NoSQL databases (HBase, Cassandra, and MongoDB), including database performance tuning and data modeling.
- Expertise in writing Hadoop jobs to analyze data using MapReduce, Apache Crunch, Hive, Pig, and Splunk.
- Data integration, ETL, and data quality experience with Talend Data Fabric for Big Data, Pentaho Kettle, and Elasticsearch; Agile experience with Scrumban.
- Experienced in using distributed computing architectures such as AWS products (e.g., EC2, Redshift, EMR, and Elasticsearch), Hadoop, Python, and Spark, with effective use of MapReduce, SQL, and Cassandra to solve big data problems.
- Good experience in working with different ETL tool environments like SSIS, Informatica and reporting tool environments like SQL Server Reporting Services (SSRS), Cognos and Business Objects.
- Knowledge and working experience on big data tools like Hadoop, Azure Data Lake, AWS Redshift.
- Hands-on experience with normalization (1NF, 2NF, 3NF, and BCNF) and denormalization techniques for effective and optimal performance in OLTP and OLAP environments.
- Hands-on experience installing, configuring, and using Apache Hadoop ecosystem components such as the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase, Apache Crunch, ZooKeeper, Sqoop, Hue, Scala, and Chef.
- Experience designing and developing POCs using Scala, Spark SQL, and MLlib, then deploying them on the YARN cluster.
- Experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, generating data visualizations using R, SAS, and Python, and creating dashboards with tools like Tableau.
- Experienced in configuring and administering the Hadoop Cluster using major Hadoop Distributions like Apache Hadoop and Cloudera.
- Expertise in integration of various data sources like RDBMS, Spreadsheets, Text files, JSON and XML files.
- Solid knowledge of Data Marts, Operational Data Store (ODS), OLAP, Dimensional Data Modeling with Ralph Kimball Methodology (Star Schema Modeling, Snow-Flake Modeling for FACT and Dimensions Tables) using Analysis Services.
- Expertise in data architecture, data modeling, data migration, data profiling, data cleansing, transformation, integration, data import, and data export using ETL tools such as Informatica PowerCenter.
- Experience in designing, building, and implementing a complete Hadoop ecosystem comprising MapReduce, HDFS, Hive, Impala, Pig, Sqoop, Oozie, HBase, MongoDB, and Spark.
- Experience with client-server application development using Oracle PL/SQL, SQL*Plus, SQL Developer, TOAD, and SQL*Loader.
- Strong experience architecting highly performant databases using PostgreSQL, PostGIS, MySQL, and Cassandra.
- Extensive experience in using ER modeling tools such as Erwin and ER/Studio, Teradata, BTEQ, MLDM and MDM.
- Experienced in R and Python for statistical computing, with additional experience in Spark MLlib, MATLAB, Excel, Minitab, SPSS, and SAS.
- Experienced in implementing a log producer in Scala that watches application logs, transforms incremental log entries, and sends them to a Kafka and ZooKeeper based log collection platform (a minimal sketch follows this summary).
- Excellent working experience in Scrum / Agile framework and Waterfall project execution methodologies.
- Strong Experience in working with Databases like Teradata and proficiency in writing complex SQL, PL/SQL for creating tables, views, indexes, stored procedures and functions.
- Experience in importing and exporting Terabytes of data between HDFS and Relational Database Systems using Sqoop.
- Performed performance tuning at the source, target, and DataStage job levels using indexes, hints, and partitioning in DB2, Oracle, and DataStage.
- Good experience working with analysis tools like Tableau for regression analysis, pie charts, and bar graphs.
- Experience in Data transformation, Data Mapping from source to target database schemas, Data Cleansing procedures.
- Extensive experience in development of T-SQL, Oracle PL/SQL Scripts, Stored Procedures and Triggers for business logic implementation.
- Expertise in SQL Server Analysis Services (SSAS) and SQL Server Reporting Services (SSRS) tools.
- Involved in writing SQL queries and PL/SQL programs, creating new packages and procedures, and modifying and tuning existing procedures and queries using TOAD.
- Good Understanding and experience in Data Mining Techniques like Classification, Clustering, Regression and Optimization.
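The Scala log producer summarized above could look roughly like the sketch below. This is a minimal illustration only, assuming a simple poll-and-reread loop; the broker address, log path, and topic name are placeholders, not the original implementation.

```scala
import java.util.Properties
import scala.io.Source
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogProducer {
  def main(args: Array[String]): Unit = {
    // Kafka producer configuration; the broker list is a placeholder
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    val logPath = "/var/log/app/application.log" // hypothetical application log
    var linesSent = 0

    // Poll the log file and forward only the incremental lines to the "app-logs" topic
    while (true) {
      val source = Source.fromFile(logPath)
      try {
        val newLines = source.getLines().drop(linesSent).toList
        newLines.foreach { line =>
          // Simple transformation: trim and tag each record with the host name
          val enriched = s"${java.net.InetAddress.getLocalHost.getHostName} ${line.trim}"
          producer.send(new ProducerRecord[String, String]("app-logs", enriched))
        }
        linesSent += newLines.size
      } finally source.close()
      Thread.sleep(5000) // poll interval
    }
  }
}
```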
TECHNICAL SKILLS
Hadoop Ecosystem: MapReduce, HBase 1.2, Hive 2.3, Pig 0.17, Solr 7.2, Flume 1.8, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hue, Cloudera Manager, StreamSets, Neo4j, Hadoop 3.0, Apache NiFi 1.6, Cassandra 3.11
OLAP Tools: Tableau, SAP BO, SSAS, Business Objects, and Crystal Reports 9
Cloud Platform: AWS, Azure, Google Cloud, CloudStack/OpenStack
Programming Languages: SQL, PL/SQL, UNIX shell scripting, Perl, Python, AWK, SED
Databases: Oracle 12c/11g, Teradata R15/R14, MS SQL Server 2016/2014, DB2.
Operating System: Windows 7/8/10, Unix, Sun Solaris
ETL/Data warehouse Tools: Informatica v10, SAP Business Objects Business Intelligence 4.2 Service Pack 03, Talend, Tableau, and Pentaho.
Methodologies: RAD, JAD, RUP, UML, System Development Life Cycle (SDLC), Agile, Waterfall Model.
PROFESSIONAL EXPERIENCE
Sr. Big Data Engineer
Confidential
Responsibilities:
- As a Sr. Big Data Engineer, provided technical expertise and aptitude in Hadoop technologies as they relate to the development of analytics.
- Responsible for the planning and execution of big data analytics, predictive analytics and machine learning initiatives.
- Assisted in leading the plan, building, and running states within the Enterprise Analytics Team.
- Engaged in solving and supporting real business issues with Hadoop Distributed File System (HDFS) and open-source framework knowledge.
- Loaded and transformed large sets of structured, semi structured and unstructured data using Hadoop/Big Data concepts.
- Implemented MapReduce programs to retrieve results from unstructured data set.
- Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.
- Designed and worked on a big data analytics platform for processing customer interface preferences and comments using Hadoop, Hive, Pig, and Cloudera.
- Imported and exported data between Oracle and HDFS/Hive using Sqoop.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Generated ad-hoc Tableau reports based on user requirements.
- Worked on reading multiple data formats on HDFS using Scala.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Installed and configured Pig and wrote Pig Latin scripts.
- Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
- Worked with clients on-site and provided ERP solutions based on SQL on Oracle and Microsoft SQL Server.
- Designed and Developed Tableau reports and dashboards for data visualization using Python.
- Created stored procedures and functions in SQL Server to import data into Elasticsearch, converting relational data into documents.
- Analyzed the SQL scripts and designed the solution to implement using Scala.
- Built data platforms, pipelines, and storage systems using Apache Kafka, Apache Storm, and search technologies such as Elasticsearch.
- Determined which Elasticsearch queries produced the best search experience.
- Created training manuals and conducted enterprise-wide formal training on the ERP system.
- Experienced in implementing POC's to migrate iterative MapReduce programs into Spark transformations using Scala.
- Developed Spark scripts by using Python and Scala shell commands as per the requirement.
- Created Data Map, registration, real time mapping, workflows, restart token and recovery process using Informatica Power Exchange 9.1.
- Experienced with batch processing of data sources using Apache Spark and Elasticsearch.
- Experienced in the AWS cloud environment, including S3 storage and EC2 instances.
- Developed Spark jobs using Scala in test environment for faster data processing and used Spark SQL for querying.
- Worked with different types of Servers integrated with Tableau such as Amazon, Cloudera Hadoop, Oracle, and MySQL.
- Built and produced a REST service for custom search on Elasticsearch.
- Configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS (a minimal sketch follows this role's Environment line).
Environment: Pig 0.17, Hive 2.3, HBase 1.2, Sqoop 1.4, Flume 1.8, Cassandra 3.11, ZooKeeper, AWS, MapReduce, HDFS, Oracle, Cloudera, Scala, Spark 2.3, SQL, Apache Kafka 1.0.1, Apache Storm, Python, Unix, and Solr 7.2
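A minimal sketch of the Kafka-to-HDFS Spark Streaming configuration noted above, assuming Spark 2.3 with the spark-streaming-kafka-0-10 integration; the broker list, consumer group, topic name, and output path are illustrative placeholders.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-hdfs")
    val ssc  = new StreamingContext(conf, Seconds(30)) // 30-second micro-batches

    // Kafka consumer settings; broker list and group id are placeholders
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "hdfs-ingest",
      "auto.offset.reset"  -> "latest"
    )

    // Subscribe to the (hypothetical) "events" topic
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Persist each non-empty batch to a time-stamped HDFS directory
    stream.map(_.value).foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty())
        rdd.saveAsTextFile(s"hdfs:///data/raw/events/batch-${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```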
Confidential - Boston, MA
Sr. Big Data Engineer
Responsibilities:
- As a Sr. Big Data Engineer, provided technical expertise and aptitude in Hadoop technologies as they related to the development of analytics.
- Responsible for the planning and execution of big data analytics, predictive analytics and machine learning initiatives.
- Assisted in leading the plan, building, and running states within the Enterprise Analytics Team.
- Engaged in solving and supporting real business issues with Hadoop Distributed File System (HDFS) and open-source framework knowledge.
- The roadmap included moving the division from a heavily man-hour-intensive, code-based environment into a more design-centric environment leveraging the capabilities of Talend Data Fabric Big Data Edition, Apache NiFi, Kafka, and a few other tools. Installed and configured Hortonworks HDF/NiFi for a POC and later migrated it into production with a link to Azure Data Lake (ADLS).
- Performed detailed analysis of business problems and technical environments and used this data in designing the solution and maintaining the data architecture.
- Designed and developed software applications, testing, and building automation tools.
- Designed efficient and robust Hadoop solutions for performance improvement and end-user experiences.
- Extensively used Pig for data cleansing through Pig Latin scripts and embedded Pig scripts.
- Explored MLlib algorithms in Spark to understand the machine learning functionality that could be applied to the use case.
- In the preprocessing phase of data extraction, used Spark to remove missing data and transform the data to create new features.
- Worked in a Hadoop ecosystem implementation/administration, installing software patches along with system upgrades and configuration.
- Conducted performance tuning of Hadoop clusters while monitoring and managing Hadoop cluster job performance, capacity forecasting, and security.
- Analyzed big data analytics technologies and their applications for business intelligence analyses.
- Designed and led the implementation of core system components: predictive caching with off-heap Chronicle Maps, the Apache Ignite in-memory data fabric, and Kafka.
- Developed analytics enablement layer using ingested data that facilitates faster reporting and dashboards.
- Worked with production support team to provide necessary support for issues with CDH cluster and the data ingestion platform.
- Led the architecture and design of data processing, warehousing, and analytics initiatives.
- Implemented solutions for ingesting data from various sources and processing the Data-at-Rest utilizing Big Data technologies using Hadoop, MapReduce, HBase, Hive and Cloud Architecture.
- Worked on implementation and maintenance of Cloudera Hadoop cluster.
- Created Hive external tables to stage data and then moved the data from staging to the main tables.
- Implemented the big data solution using Hadoop, Hive, and Informatica to pull/load the data into the HDFS system.
- Implemented distributed asynchronous calculation components interfacing with Data Fabric. Coordinated with CE team devising integration and WebLogic stress test environments and mentored RSC team members on framework development/testing.
- Pulled data from the data lake (HDFS) and massaged it with various RDD transformations.
- Active involvement in design, new development and SLA based support tickets of Big Machines applications.
- Developed Scala scripts and UDFs using both DataFrames/Spark SQL and RDDs/MapReduce in Spark for data aggregation and queries, writing data back into the RDBMS through Sqoop (a minimal sketch follows this role's Environment line).
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Developed Oozie workflow jobs to execute Hive, Sqoop, and MapReduce actions.
- Provided thought leadership on the architecture and design of big data analytics solutions for customers, actively driving Proof of Concept (POC) and Proof of Technology (POT) evaluations to implement a big data solution.
- Developed numerous MapReduce jobs in Scala for cleansing and analyzing data in Impala.
- Involved in big data analysis using Pig and user-defined functions (UDFs).
- Created a data pipeline using processor groups and multiple processors in Apache NiFi for flat-file and RDBMS sources as part of a POC on Amazon EC2.
- Built Hadoop solutions for big data problems using MR1 and MR2 in YARN.
- Loaded data from different sources such as HDFS and HBase into Spark RDDs and implemented in-memory data computation to generate the output response.
- Developed complete end-to-end big data processing in the Hadoop ecosystem.
- Built a data lake as a cloud-based solution in AWS using Apache Spark and provided visualization of the ETL orchestration using the CDAP tool.
- Implemented Installation and configuration of multi-node cluster on Cloud using Amazon Web Services (AWS) on EC2.
- Conducted proofs of concept to determine feasibility and evaluate big data products.
- Wrote Hive join queries to fetch information from multiple tables and multiple MapReduce jobs to collect output from Hive.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
- Evaluated Composite for a data fabric / data services virtualization approach and evaluated an industry data model approach.
- Used Hive to analyze data ingested into HBase by using Hive-HBase integration and compute various metrics for reporting on the dashboard.
- Scheduled the Oozie workflow engine to run multiple Hive and Pig jobs.
- Involved in developing the MapReduce framework, writing queries, and scheduling MapReduce jobs.
- Developed the code for importing and exporting data into HDFS and Hive using Sqoop.
- Developed customized classes for serialization and deserialization in Hadoop.
- Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
- Implemented a proof of concept deploying this product in Amazon Web Services AWS.
- Involved in migration of data from existing RDBMS (Oracle and SQL server) to Hadoop using Sqoop for processing data.
- Worked on Hive queries to process key-value pairs and upload the data to the NoSQL database HBase.
Environment: Hadoop 3.0, MapReduce, HBase, Hive 2.3, Informatica, HDFS, Scala 2.12, Spark, Sqoop 1.4, Apache NiFi, AWS, EC2, SQL Server, Oracle 12c, Pig 0.17
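A minimal sketch of the Spark aggregation pattern described above, combining a DataFrame/Spark SQL aggregation with an equivalent RDD-style computation. The input path, column names, and export directory are illustrative assumptions, and the final RDBMS load is assumed to be handled by a separate Sqoop export job as noted in the bullet.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object OrderAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("order-aggregation").getOrCreate()
    import spark.implicits._

    // Hypothetical source: order records already landed on HDFS by the ingestion jobs
    val orders = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/staging/orders")

    // DataFrame/Spark SQL aggregation: daily totals per customer
    val daily = orders
      .groupBy($"customer_id", to_date($"order_ts").as("order_date"))
      .agg(sum($"amount").as("total_amount"), count(lit(1)).as("order_count"))

    // Equivalent RDD-style aggregation, shown only for comparison
    // (assumes the amount column was inferred as a numeric type)
    val rddTotals = orders.select($"customer_id", $"amount").rdd
      .map(r => (r.getString(0), r.getDouble(1)))
      .reduceByKey(_ + _)
    rddTotals.take(5).foreach(println)

    // Write the aggregated result to an HDFS export directory;
    // a separate Sqoop export job then loads it into the RDBMS
    daily.write.mode("overwrite").csv("hdfs:///data/export/daily_totals")

    spark.stop()
  }
}
```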
Confidential - Hartford, CT
Big Data Engineer
Responsibilities:
- Implemented the big data solution using Hadoop, Hive, and Informatica to pull/load the data into the HDFS system.
- Installed and configured Hadoop ecosystem components such as HBase, Flume, Pig, and Sqoop.
- Architected, designed, and developed business applications and data marts for reporting.
- Worked with SMEs, conducted JAD sessions, and documented the requirements using UML and use case diagrams.
- In the Big Data Fabric Hadoop Finance Landing Zone, files are validated, created, and sent to DMS, and the expanded file is sent on to IFSI.
- Followed the SDLC methodology for data warehouse development using Kanbanize.
- Configured Apache Mahout Engine.
- Used Agile (SCRUM) methodologies for Software Development.
- Wrote complex Hive queries to extract data from heterogeneous sources (Data Lake) and persist the data into HDFS.
- Developed Big Data solutions focused on pattern matching and predictive modeling
- The objective of this project was to build a data lake as a cloud-based solution in AWS using Apache Spark.
- Developed the code to perform Data extractions from Oracle Database and load it into AWS platform using AWS Data Pipeline.
- Implemented Installation and configuration of multi-node cluster on Cloud using Amazon Web Services (AWS) on EC2.
- Created Hive external tables to stage data and then moved the data from staging to the main tables.
- Worked in exporting data from Hive tables into Netezza database.
- Pulled data from the data lake (HDFS) and massaged it with various RDD transformations.
- Developed Scala scripts and UDFs using both DataFrames/Spark SQL and RDDs/MapReduce in Spark for data aggregation and queries, writing data back into the RDBMS through Sqoop.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Created a data pipeline using processor groups and multiple processors in Apache NiFi for flat-file and RDBMS sources as part of a POC on Amazon EC2.
- Developed Spark code using Scala and Spark-SQL for faster testing and data processing.
- Built Hadoop solutions for big data problems using MR1 and MR2 in YARN.
- Loaded data from different sources such as HDFS and HBase into Spark RDDs and implemented in-memory data computation to generate the output response.
- Developed complete end-to-end big data processing in the Hadoop ecosystem.
- Used AWS Cloud with Infrastructure Provisioning / Configuration.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
- Involved in different phases of the development life cycle, including analysis, design, coding, unit testing, integration testing, review, and release, as per the business requirements.
- Involved in PL/SQL query optimization to reduce the overall run time of stored procedures.
- Worked on configuring and managing disaster recovery and backup for Cassandra data.
- Utilized Oozie workflows to run Pig and Hive jobs; extracted files from Cassandra through Sqoop, placed them in HDFS, and processed them.
- Continuously tuned Hive UDFs and queries for faster execution by employing partitioning and bucketing.
- Implemented partitioning, dynamic partitions, and buckets in Hive (a minimal sketch follows this role's Environment line).
- Used Flume to collect, aggregate, and store web log data from different sources such as web servers, mobile devices, and network devices, and pushed it to HDFS.
- Supported in setting up QA environment and updating configurations for implementing scripts with Pig, Hive and Sqoop.
Environment: Apache Spark 2.3, Hive 2.3, Informatica, HDFS, MapReduce, Scala, Apache NiFi 1.7, YARN, HBase, PL/SQL, MongoDB, Pig 0.17, Sqoop 1.4, Apache Flume 1.8
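A minimal sketch of the Hive partitioning and bucketing work referenced above, expressed as Spark SQL with Hive support rather than standalone HiveQL to keep one language throughout; the weblogs tables, columns, bucket count, and date filter are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

object PartitionedWeblogs {
  def main(args: Array[String]): Unit = {
    // Hive support lets Spark SQL create and query the partitioned, bucketed warehouse tables
    val spark = SparkSession.builder()
      .appName("partitioned-weblogs")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical target table: partitioned by event date, bucketed by user id
    spark.sql("""
      CREATE TABLE IF NOT EXISTS weblogs_part (
        user_id STRING,
        url     STRING,
        status  INT
      )
      PARTITIONED BY (event_date STRING)
      CLUSTERED BY (user_id) INTO 32 BUCKETS
      STORED AS ORC
    """)

    // Dynamic-partition insert from a (hypothetical) raw staging table
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
      INSERT OVERWRITE TABLE weblogs_part PARTITION (event_date)
      SELECT user_id, url, status, event_date FROM weblogs_raw
    """)

    // Metric for the reporting dashboard: daily error rate, pruned by partition
    val errorRate = spark.sql("""
      SELECT event_date,
             SUM(CASE WHEN status >= 500 THEN 1 ELSE 0 END) / COUNT(*) AS error_rate
      FROM weblogs_part
      WHERE event_date >= '2018-01-01'
      GROUP BY event_date
    """)
    errorRate.show()

    spark.stop()
  }
}
```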