
Sr. Big Data Engineer Resume


Hartford, CT

SUMMARY

  • Over 5 years of professional experience in software systems development and business systems, with a focus on Big Data ecosystem technologies.
  • Experience in data management and implementation of Big Data applications using Spark and Hadoop frameworks.
  • Hands on experience building streaming applications using Spark Streaming and Kafka.
  • Expertise in cleansing data for analysis, performing data quality testing to identify gaps, and liaising with data origination teams.
  • Strong experience with HDFS, MapReduce and Hadoop ecosystem components such as Hive, Pig and Sqoop, as well as NoSQL databases such as MongoDB and Cassandra.
  • Familiarity with Amazon Web Services along with provisioning and maintaining AWS resources such as EMR, S3 buckets, EC2 instances, RDS and others.
  • Hands - on development and implementation experience of Machine learning algorithms in Apache Spark and Hadoop MapReduce.
  • Solid experience in data modeling using design tools such as Erwin, Power Designer and ER Studio, along with database tools.
  • Good knowledge of Amazon Web Services (AWS) concepts such as EMR and EC2, which provide fast and efficient processing for Teradata big data analytics.
  • Strong knowledge of Spark with Scala for handling large-scale and streaming data processing.
  • Experience in working on CQL (Cassandra Query Language), for retrieving the data present in Cassandra cluster by running queries in CQL.
  • Experience in understanding Stored Procedures, Stored Functions, Database Triggers, and Packages using PL/SQL.
  • Extensive experience in advanced SQL Queries and PL/SQL stored procedures.
  • Hands on experience in big data, data visualization, R and Python development, Unix, SQL, GIT/GitHub.
  • Excellent understanding of and working experience with industry-standard methodologies such as the System Development Life Cycle (SDLC).
  • Experience in designing star and snowflake schemas for data warehouse and ODS architectures.
  • Highly skilled in integrating Kafka with Spark Streaming for high-speed data processing (see the sketch following this summary).
  • Expert in building enterprise data warehouses and data warehouse appliances from scratch using both the Kimball and Inmon approaches.
  • Good understanding and hands on experience with AWS S3, EC2 and Redshift.
  • Strong background in data modeling tools such as Erwin, ER/Studio and Power Designer.
  • Expertise in normalization (1NF, 2NF, 3NF and BCNF) and de-normalization techniques for effective and optimum performance in OLTP and OLAP environments.
  • Extensive knowledge in programming with Resilient Distributed Datasets (RDDs).
  • Strong experience in migrating data warehouses and databases into Hadoop/NoSQL platforms.
  • Experience in developing statistical machine learning, text analytics and data mining solutions for various business problems and in generating data visualizations using Python, R and Tableau.
  • Extensive SQL experience in querying, data extraction and data transformations.
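
The Kafka and Spark Streaming integration noted in this summary follows a pattern along these lines. This is a minimal PySpark Structured Streaming sketch only; the broker address, topic name, message schema and paths are illustrative assumptions, not details from a specific engagement.

    # Minimal PySpark sketch of consuming a Kafka topic with Structured Streaming.
    # Requires the spark-sql-kafka connector package on the Spark classpath.
    # Broker, topic, schema and paths below are assumed for illustration only.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-streaming-sketch").getOrCreate()

    schema = StructType([
        StructField("event_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    events = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
        .option("subscribe", "events")                       # assumed topic
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*"))

    query = (events.writeStream
        .format("parquet")
        .option("path", "/data/streams/events")              # assumed output path
        .option("checkpointLocation", "/data/checkpoints/events")
        .start())
    query.awaitTermination()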

TECHNICAL SKILLS

Data Modeling Tools: Erwin Data Modeler, Erwin Model Manager, ER Studio v17, and Power Designer 16.6.

Big Data Tools: Hadoop ecosystem (MapReduce, Spark 2.3, HBase 1.2, Hive 2.3, Pig 0.17, Flume 1.8, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Cloudera Manager, Hadoop 3.0), Neo4j, Apache NiFi 1.6, Cassandra 3.11

Cloud Management: Amazon Web Services (AWS), Amazon Redshift

OLAP Tools: Tableau, SAP BO, SSAS, Business Objects, and Crystal Reports 9

Cloud Platform: AWS, Azure, Google Cloud, Cloud Stack/Open Stack

Programming Languages: SQL, PL/SQL, UNIX shell scripting, Perl, AWK, SED

Databases: Oracle 12c/11g, Teradata R15/R14, MS SQL Server 2016/2014, DB2.

Testing and Defect Tracking Tools: HP/Mercury Quality Center, WinRunner, MS Visio 2016 and Visual SourceSafe

Operating System: Windows 7/8/10, Unix, Sun Solaris

ETL/Data warehouse Tools: Informatica v10, SAP Business Objects Business Intelligence 4.2 Service Pack 03, Talend, Tableau, and Pentaho.

Methodologies: RAD, JAD, RUP, UML, System Development Life Cycle (SDLC), Agile, Waterfall Model.

PROFESSIONAL EXPERIENCE

Confidential - Hartford, CT

Sr. Big Data Engineer

Responsibilities:

  • As a Sr. Big Data Engineer, participated in the requirement gathering phase of the SDLC and helped break the project into modules with the help of my team lead.
  • The objective of this project was to build a data lake as a cloud-based solution in AWS using Apache Spark and to provide visualization of the ETL orchestration using the CDAP tool.
  • Installed and configured a multi-node cluster in the cloud on Amazon Web Services (AWS) EC2.
  • Led the architecture and design of data processing, warehousing and analytics initiatives.
  • Actively involved in design, new development and SLA-based support tickets for Big Machines applications.
  • Responsible for managing data coming from different sources, including its storage and processing in Hue, covering all Hadoop ecosystem components.
  • Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Worked with clients to better understand their reporting and dashboarding needs and presented solutions using a structured Agile project methodology.
  • Involved in data ingestion into HDFS using Sqoop and Flume from variety of sources.
  • Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
  • Implemented partitioning, dynamic partitions and buckets in Hive to improve performance and organize data logically.
  • Installed Hadoop, MapReduce and HDFS, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
  • Created tables in HBase to store variable data formats of PII data coming from different portfolios
  • Worked on Sequence files, RC files, Map side joins, bucketing, partitioning for Hive performance enhancement and storage improvement.
  • Used cloud computing on the multi-node cluster, deployed the Hadoop application on S3 and used Elastic MapReduce (EMR) to run MapReduce jobs.
  • Developed analytics enablement layer using ingested data that facilitates faster reporting and dashboards.
  • Worked with production support team to provide necessary support for issues with CDH cluster and the data ingestion platform.
  • Created Hive External tables to stage data and then move the data from Staging to main tables
  • Implemented the Big Data solution using Hadoop, Hive and Informatica to pull/load the data into HDFS.
  • Pulled data from the data lake (HDFS) and transformed it with various RDD transformations.
  • Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation, queries and writing data back into the RDBMS through Sqoop (a sketch of this pattern follows this list).
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
  • Developed Oozie workflow jobs to execute Hive, Sqoop and MapReduce actions.
  • Developed numerous MapReduce jobs in Scala for data cleansing and for analyzing data in Impala.
  • Created a data pipeline with processor groups and multiple processors using Apache NiFi for flat-file and RDBMS sources as part of a POC on Amazon EC2.
  • Built Hadoop solutions for big data problems using MR1 and MR2 in YARN.
  • Loaded data from different sources such as HDFS and HBase into Spark RDDs and implemented in-memory computation to generate the output response.
  • Developed complete end to end Big-data processing in Hadoop eco-system.
  • Ran proofs-of-concept to determine feasibility and evaluate Big Data products.
  • Used Hive to analyze data ingested into HBase by using Hive-HBase integration and compute various metrics for reporting on the dashboard
  • Developed the code for Importing and exporting data into HDFS and Hive using Sqoop
  • Developed customized classes for serialization and De-serialization in Hadoop.
  • Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
  • Implemented a proof of concept deploying this product in Amazon Web Services (AWS).
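
A hedged sketch of the Spark aggregation-and-export step referenced in the list above. The work itself used Scala, with Sqoop handling the RDBMS write; PySpark with a JDBC write is shown here purely to illustrate the flow, and the paths, table names and connection details are assumptions.

    # Hedged PySpark sketch: aggregate data from the HDFS data lake and export
    # the result to a relational database for BI reporting. All names below are
    # illustrative assumptions; the original implementation was Scala + Sqoop.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("aggregation-sketch").getOrCreate()

    # Read curated data from the data lake (assumed path).
    txns = spark.read.parquet("hdfs:///data/lake/transactions")

    # Compute daily metrics per portfolio.
    daily = (txns
        .groupBy("portfolio_id", F.to_date("event_ts").alias("event_date"))
        .agg(F.count("*").alias("txn_count"), F.sum("amount").alias("total_amount")))

    # Write the aggregates back to the RDBMS (assumed JDBC connection details).
    (daily.write
        .format("jdbc")
        .option("url", "jdbc:oracle:thin:@//dbhost:1521/REPORTS")
        .option("dbtable", "BI.DAILY_PORTFOLIO_METRICS")
        .option("user", "etl_user")
        .option("password", "********")
        .mode("overwrite")
        .save())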

Environment: Hadoop 3.0, Sqoop 1.4, EC2, Agile, HDFS, Apache Flume 1.8, Hive 2.3, HBase, Spark 2.3, Pig 0.17, Apache Kafka 1.1, Elastic Search, MapReduce, UNIX, NoSQL

Confidential - Bellevue, WA

Sr. Data Engineer

Responsibilities:

  • Worked with Business Analysts to understand the user requirements, layout and look of the interactive dashboards to be developed in Tableau.
  • Gathered and documented all business requirements to migrate reports from SAS to a Netezza platform utilizing the MicroStrategy reporting tool.
  • Manipulated, cleansed and processed data using Excel, Access and SQL; responsible for loading, extracting and validating client data.
  • Used Python programs for data manipulation and to automate the generation of reports and dashboards from multiple data sources.
  • Designed and implemented Data Warehouse life cycle and entity-relationship/multidimensional modeling using star schema, snowflake schema
  • Involved extensively in creating Tableau Extracts, Tableau Worksheet, Actions, Tableau Functions, Tableau Connectors (Live and Extract) including drill down and drill up capabilities and Dashboard color coding, formatting and report operations (sorting, filtering, Top-N Analysis, hierarchies).
  • Blended patient information from different sources for research using Tableau and Python.
  • Used Boto3 to integrate Python applications with AWS Redshift, Teradata and S3.
  • Involved in Netezza Administration Activities like backup/restore, performance tuning, and Security configuration.
  • Wrote complex SQL statements to perform high-level and detailed validation for new data and/or architecture changes within the model, comparing Teradata data against Netezza data.
  • Utilized Python libraries such as pandas, NumPy and SciPy for analyzing and manipulating data from AWS Redshift and Teradata.
  • Developed Python programs and batch scripts on Windows to automate ETL processes to AWS Redshift (see the sketch following this list).
  • Managed the Metadata associated with the ETL processes used to populate the Data Warehouse.
  • Created sheet selector to accommodate multiple chart types (Pie, Bar, Line etc) in a single dashboard by using parameters.
  • Published workbooks with user filters so that only the appropriate teams can view them.
  • Worked on SAS Visual Analytics & SAS Web Report Studio for data presentation and reporting.
  • Extensively used SAS/Macros to parameterize the reports so that the user could choose the summary and sub-setting variables to be used from the web application.
  • Created Teradata external loader connections such as MLoad, Upsert, Update and FastLoad while loading data into the target tables in the Teradata database.
  • Resolved the data related issues such as: assessing data quality, testing dashboards, evaluating existing data sources.
  • Created DDL scripts for implementing Data Modeling changes, reviewed SQL queries and involved in Database Design and implementing RDBMS specific features.
  • Created data mapping documents mapping Logical Data Elements to Physical Data Elements and Source Data Elements to Destination Data Elements.
  • Written SQL Scripts and PL/SQL Scripts to extract data from Database to meet business requirements and for Testing Purposes.
  • Designed the ETL process using Informatica to populate the Data Mart from flat files into the Oracle database.
  • Involved in Data analysis, reporting using Tableau and SSRS.
  • Involved in all phases of SDLC using Agile and participated in daily scrum meetings with cross teams
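
A minimal sketch of the Python-driven ETL automation to AWS Redshift referenced in the list above, assuming boto3 for the S3 staging step and psycopg2 for issuing the Redshift COPY. The bucket, table, IAM role and connection details are illustrative assumptions.

    # Hedged sketch of an automated S3-to-Redshift load using boto3 and psycopg2.
    # All names, endpoints and credentials below are assumptions for illustration.
    import boto3
    import psycopg2

    LOCAL_FILE = "daily_extract.csv"           # assumed extract produced upstream
    BUCKET = "analytics-staging"               # assumed S3 bucket
    KEY = "redshift/daily_extract.csv"

    # Stage the extract in S3.
    boto3.client("s3").upload_file(LOCAL_FILE, BUCKET, KEY)

    # Issue a COPY so Redshift loads the staged file in parallel.
    conn = psycopg2.connect(
        host="redshift-cluster.example.com",   # assumed cluster endpoint
        port=5439, dbname="analytics", user="etl_user", password="********")
    with conn, conn.cursor() as cur:
        cur.execute(f"""
            COPY reporting.daily_extract
            FROM 's3://{BUCKET}/{KEY}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'  -- assumed role
            CSV IGNOREHEADER 1;
        """)
    conn.close()

A script along these lines can then be scheduled as a Windows batch job, matching the automation described above.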

Environment: Tableau Server 9.3, Tableau Desktop 9.3, AWS Redshift, Teradata, Python, SQL, PostgreSQL, Linux, Teradata SQL Assistant, EC2, S3, Windows, PL/SQL

Confidential - Boston, MA

Sr. Data Analyst/Data Engineer

Responsibilities:

  • Worked with the analysis teams and management teams and supported them based on their requirements.
  • Involved in extraction, transformation and loading of data directly from different source systems (flat files/Excel/Oracle/SQL/Teradata) using SAS/SQL, SAS/macros.
  • Generated PL/SQL scripts for data manipulation, validation and materialized views for remote instances.
  • Created and modified several database objects such as Tables, Views, Indexes, Constraints, Stored procedures, Packages, Functions and Triggers using SQL and PL/SQL.
  • Created large datasets by combining individual datasets using various inner and outer joins in SAS/SQL and dataset sorting and merging techniques using SAS/Base.
  • Developed live reports in a drill down mode to facilitate usability and enhance user interaction
  • Extensively worked on Shell scripts for running SAS programs in batch mode on UNIX.
  • Wrote Python scripts to parse XML documents and load the data into the database (see the sketch following this list).
  • Used Python to extract weekly information from XML files.
  • Developed Python scripts to clean the raw data.
  • Used the AWS CLI to aggregate cleaned files in Amazon S3 and worked on Amazon EC2 clusters to deploy files into buckets.
  • Used the AWS CLI with IAM roles to load data into the Redshift cluster.
  • Responsible for in-depth data analysis and creation of data extract queries in both Netezza and Teradata databases.
  • Extensive development on the Netezza platform using PL/SQL and advanced SQL.
  • Validated regulatory finance data and created automated adjustments using advanced SAS Macros, PROC SQL, UNIX (Korn Shell) and various reporting procedures.
  • Designed reports in SSRS to create, execute, and deliver tabular reports using shared data source and specified data source. Also, Debugged and deployed reports in SSRS.
  • Optimized the performance of queries with modification in TSQL queries, established joins and created clustered indexes
  • Used Hive, Impala and Sqoop utilities and Oozie workflows for data extraction and data loading.
  • Development of routines to capture and report data quality issues and exceptional scenarios.
  • Creation of Data Mapping document and data flow diagrams.
  • Developed Linux shell scripts using the nzsql/nzload utilities to load data from flat files into the Netezza database.
  • Generated dual-axis bar charts, pie charts and bubble charts with multiple measures, using data blending when merging different sources.
  • Developed dashboards in Tableau Desktop and published them on to Tableau Server which allowed end users to understand the data on the fly with the usage of quick filters for on demand needed information.
  • Created dashboard-style reports using QlikView components such as list boxes, sliders, buttons, charts and bookmarks.
  • Coordinated with data architects and data modelers to create new schemas and views in Netezza to improve report execution times, and worked on creating optimized data mart reports.
  • Worked on QA of the data and on adding data sources, snapshots and caching to the reports.
  • Involved in troubleshooting at database levels, error handling and performance tuning of queries and procedures.
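
A minimal sketch of the XML parsing and database load referenced in the list above. sqlite3 stands in for the project database client here, and the element names and staging table are illustrative assumptions.

    # Hedged sketch: parse records from an XML file and insert them into a
    # staging table. Element names, table and database are assumptions.
    import sqlite3                      # stand-in for the project database client
    import xml.etree.ElementTree as ET

    def load_claims(xml_path: str, db_path: str = "staging.db") -> int:
        """Parse <claim> records from an XML file and load them into a staging table."""
        root = ET.parse(xml_path).getroot()
        rows = [(c.findtext("id"), c.findtext("member"), c.findtext("amount"))
                for c in root.iter("claim")]             # assumed element names

        conn = sqlite3.connect(db_path)
        with conn:
            conn.execute("""CREATE TABLE IF NOT EXISTS claims_stg
                            (claim_id TEXT, member TEXT, amount REAL)""")
            conn.executemany("INSERT INTO claims_stg VALUES (?, ?, ?)", rows)
        conn.close()
        return len(rows)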

Environment: SAS, SQL, Teradata, Oracle, PL/SQL, UNIX, XML, Python, AWS, SSRS, TSQL, Hive, Impala, Sqoop

Confidential

ETL Developer

Responsibilities:

  • Developed various mappings using Mapping Designer and worked with Aggregator, Lookup, Filter, Router, Joiner, Source Qualifier, Expression, Stored Procedure and Sequence Generator transformations.
  • Implemented Slowly Changing Dimensions of Type 1 and Type 2 to store history according to business requirements (a conceptual sketch of the Type 2 logic follows this list).
  • Used Parameter files to pass mapping and session parameters to the session.
  • Tuned the Informatica mappings to reduce the session run time.
  • Developed PL/SQL procedures to update the database and to perform calculations.
  • Worked with SQL*Loader to load data into the warehouse.
  • Contributed to the design and development of Informatica framework model.
  • Wrote UNIX shell scripts to work with flat files, to define parameter files and to create pre and post session commands.
  • Used SAS PROC IMPORT, DATA steps and PROC DOWNLOAD to extract fixed-format flat files and convert them into Teradata tables for business analysis.
  • Helped users by extracting mainframe flat files (fixed-width or CSV) onto the UNIX server and then converting them into Teradata tables using Base SAS programs.
  • Collected multi-column statistics on all non-indexed columns used during join operations and on all columns used in the residual conditions.
  • Generated and implemented MicroStrategy schema objects and application objects by creating facts, attributes, reports, dashboards, filters, metrics and templates using MicroStrategy Desktop.
  • Developed BTEQ scripts to load data from Teradata Staging area to Teradata data mart.
  • Worked extensively on PL/SQL as part of the process to develop several scripts to handle different scenarios.
  • Performed Unit testing and System testing of Informatica mappings.
  • Involved in migrating the mappings and workflows from Development to Testing and then to Production environments.
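
A conceptual sketch of the Slowly Changing Dimension Type 2 logic referenced in the list above. The actual implementation was built with Informatica mappings (Lookup and update-strategy style transformations); pandas is used here only to illustrate the expire-and-insert pattern, and the column names are assumptions.

    # Conceptual pandas sketch of SCD Type 2: expire changed current rows and
    # append new versions. Column names (customer_id, address, start_date,
    # end_date, is_current) are illustrative assumptions.
    import pandas as pd

    def apply_scd2(dim: pd.DataFrame, incoming: pd.DataFrame, today: str) -> pd.DataFrame:
        """Return the dimension with changed rows expired and new versions appended."""
        current = dim[dim["is_current"]]
        merged = current.merge(incoming, on="customer_id", suffixes=("_old", ""))
        changed_ids = merged.loc[merged["address_old"] != merged["address"], "customer_id"]

        # Close out the superseded versions in place.
        dim.loc[dim["customer_id"].isin(changed_ids) & dim["is_current"],
                ["end_date", "is_current"]] = [today, False]

        # Append the new versions as the current records.
        new_rows = incoming[incoming["customer_id"].isin(changed_ids)].copy()
        new_rows["start_date"], new_rows["end_date"], new_rows["is_current"] = today, None, True
        return pd.concat([dim, new_rows], ignore_index=True)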

Environment: Oracle 8i, SQL, PL/SQL, SQL*Plus, HP-UX 10.20, Informatica PowerCenter 7, DB2, Cognos ReportNet 1.1, Windows 2000
