Big Data Engineer Resume

Ashburn, VA

SUMMARY

  • Over 9 years of IT experience in Big Data engineering and support, with experience developing strategic methods for deploying Big Data technologies to efficiently solve Big Data processing requirements.
  • Expertise in creating HDInsight clusters and Storage Accounts with an end-to-end environment for running jobs.
  • Good understanding and exposure to Python programming.
  • Hands-on experience with GCP, BigQuery, GCS buckets and Stackdriver.
  • Hands-on experience developing PowerShell scripts for automation purposes.
  • Strong knowledge of Spark with Scala for handling large-scale data processing in streaming workloads.
  • Hands-on experience developing UDFs, DataFrames and SQL queries in Spark SQL.
  • Experience in integrating Kafka with Spark Streaming for high-speed data processing.
  • Experienced in writing live Real-time Processing using Spark Streaming with Kafka.
  • Experience in designing components using UML: Use Case, Class, Sequence, Deployment and Component diagrams for the requirements.
  • Experience in understanding the security requirements for Hadoop and integrating it with Kerberos authentication and authorization infrastructure.
  • Strong experience designing Big Data pipelines, including data ingestion, data processing (transformations, enrichment and aggregations) and reporting.
  • Experience in using web service technologies: RESTful web services and SOAP UI.
  • Experience in writing build scripts using shell scripts and Maven, and using CI (Continuous Integration) tools like Jenkins.
  • Excellent working experience in Scrum/Agile framework and Waterfall project execution methodologies.
  • Good knowledge of Amazon AWS concepts like EMR and EC2 web services, which provide fast and efficient processing.
  • Strong Experience in migrating data using Sqoop from HDFS to Relational Database Systems and vice-versa.
  • Strong experience in Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export.
  • Experience in importing and exporting terabytes of data between HDFS and Relational Database Systems using Sqoop.
  • Strong experience in developing the workflows using Apache Oozie framework to automate tasks.
  • Experience in working with MapReduce programs using Apache Hadoop to analyze large data sets efficiently.
  • Strong experience in working with Core Hadoop components like HDFS, Yarn and MapReduce.
  • Strong experience in Cloudera Hadoop distribution with Cloudera manager.

TECHNICAL SKILLS

Big Data Technologies: Hadoop 3.3, HDFS, MapReduce, Hive 2.3, Sqoop 1.4, Apache Impala 2.1, Oozie 4.3, Yarn, Apache Flume 1.8, Kafka 1.1, Zookeeper

Cloud Platform: Amazon AWS, EC2, S3, MS Azure, Azure SQL Database, Azure SQL Data Warehouse, Azure Analysis Services, HDInsight, Azure Data Lake, Data Factory, GCP, BigQuery, GCS

NoSQL DB: HBase 2.4/2.3, Cassandra 3.11, CouchDB, Snowflake DB

Programming Language: Scala, Python 3.6, SQL, PL/SQL, Shell Scripting

Hadoop Distributions: Cloudera, Hortonworks, MapR

SDLC Methodologies: Agile, Waterfall

Operating Systems: Windows 10, Linux and Unix

PROFESSIONAL EXPERIENCE

Confidential - Ashburn, VA

Big Data Engineer

Responsibilities:

  • As a Big Data Engineer, assisted in leading the plan, build and run phases within the Enterprise Analytics Team.
  • Led the architecture and design of data processing, warehousing and analytics initiatives.
  • Engaged in solving and supporting real business issues with Hadoop Distributed File System (HDFS) and open-source framework knowledge.
  • Responsible for data governance rules and standards to maintain the consistency of the business element names in the different data layers.
  • Involved in various phases of development; analyzed and developed the system following the Agile Scrum methodology.
  • Performed detailed analysis of business problems and technical environments and used this data in designing the solution and maintaining the data architecture.
  • Implemented the defect tracking process using the JIRA tool by assigning bugs to the development team.
  • Built the data pipelines that will enable faster, better, data-informed decision-making within the business.
  • Set up a Data Lake in Google Cloud using Google Cloud Storage, BigQuery and Bigtable.
  • Developed Spark scripts using Python and Bash shell commands as per the requirements.
  • Worked on a POC to evaluate various cloud offerings, including Google Cloud Platform (GCP).
  • Developed a POC for project migration from an on-premises MapR Hadoop system to GCP.
  • Compared self-hosted Hadoop with GCP's Dataproc, and explored Bigtable (managed HBase) use cases and performance evaluation.
  • Built a program with Python and Apache Beam and executed it in Cloud Dataflow to run data validation between raw source files and BigQuery tables (see the first sketch after this list).
  • Converted PL/SQL code to a BigQuery-Python architecture as well as PySpark in Dataproc.
  • Used REST APIs with Python to ingest data from external sources into BigQuery.
  • Loaded and transformed large sets of structured, semi-structured and unstructured data using Hadoop/Big Data concepts.
  • Performed Data transformations in Hive and used partitions, buckets for performance improvements.
  • Optimized Hive queries to extract the customer information from HDFS.
  • Involved in scheduling Oozie workflow engine to run multiple Hive jobs.
  • Continuously monitored and managed data pipeline (CI/CD) performance alongside applications from a single console in GCP.
  • Wrote a Python program to maintain raw file archival in a GCS bucket (see the second sketch after this list).
  • Designed Google Cloud Dataflow jobs that move data within a 200 PB data lake.
  • Implemented scripts that load Google BigQuery data and run queries to export data.
  • Implemented business logic by writing UDFs and configuring cron jobs.
  • Developed scripts in BigQuery and connected them to reporting tools.
  • Designed and published visually rich and intuitive Tableau dashboards and Crystal Reports for executive decision making.
  • Provided 24/7 on-call production support for various applications.
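
A minimal sketch of the Beam validation flow referenced above, comparing a raw-file row count against a BigQuery table count. The bucket, file path, project and table names are hypothetical placeholders, and the runner would be switched to DataflowRunner for the Cloud Dataflow execution described in the bullet.

# Row-count validation between a raw GCS file and a BigQuery table (sketch).
# All paths, projects and table names below are illustrative assumptions.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

RAW_FILE = "gs://example-raw-bucket/orders/2021-06-01.csv"   # assumed raw source file
BQ_TABLE = "example-project:staging.orders"                  # assumed BigQuery table

def run():
    # DirectRunner for local testing; DataflowRunner would be used for the real job.
    options = PipelineOptions(runner="DirectRunner",
                              temp_location="gs://example-temp-bucket/tmp")
    with beam.Pipeline(options=options) as p:
        raw_count = (
            p
            | "ReadRawFile" >> beam.io.ReadFromText(RAW_FILE, skip_header_lines=1)
            | "CountRawRows" >> beam.combiners.Count.Globally()
        )
        bq_count = (
            p
            | "ReadBigQuery" >> beam.io.ReadFromBigQuery(table=BQ_TABLE)
            | "CountBqRows" >> beam.combiners.Count.Globally()
        )
        # Bring both single-element counts together and emit a pass/fail marker.
        (
            (raw_count, bq_count)
            | "MergeCounts" >> beam.Flatten()
            | "CollectCounts" >> beam.combiners.ToList()
            | "Compare" >> beam.Map(
                lambda counts: "MATCH" if len(set(counts)) == 1 else f"MISMATCH {counts}")
            | "Report" >> beam.Map(print)
        )

if __name__ == "__main__":
    run()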
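
And a sketch of the GCS raw-file archival routine from the same list; the bucket names, prefix and 30-day retention window are illustrative assumptions, and it relies on the standard google-cloud-storage client.

# Move raw files older than a retention window from a landing bucket to an archive bucket (sketch).
from datetime import datetime, timedelta, timezone
from google.cloud import storage

SOURCE_BUCKET = "example-raw-landing"    # assumed landing bucket
ARCHIVE_BUCKET = "example-raw-archive"   # assumed archive bucket
PREFIX = "orders/"                       # assumed folder prefix
RETENTION = timedelta(days=30)           # assumed retention window

def archive_old_raw_files():
    client = storage.Client()
    source = client.bucket(SOURCE_BUCKET)
    archive = client.bucket(ARCHIVE_BUCKET)
    cutoff = datetime.now(timezone.utc) - RETENTION

    for blob in client.list_blobs(SOURCE_BUCKET, prefix=PREFIX):
        # Copy blobs older than the retention window to the archive, then delete the original.
        if blob.time_created < cutoff:
            source.copy_blob(blob, archive, new_name=blob.name)
            blob.delete()
            print(f"archived gs://{SOURCE_BUCKET}/{blob.name}")

if __name__ == "__main__":
    archive_old_raw_files()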

Environment: Hadoop 3.3, Spark 3.1, Python, GCP, Data Lake, GCS, HBase, Oozie, Hive, CI/CD, BigQuery, REST API, Agile Methodology

Confidential - Allentown, PA

Big Data Engineer

Responsibilities:

  • As a Big Data Engineer, assisted in leading the plan, build and run phases within the Enterprise Analytics Team.
  • Provided a summary of the project's goals, the specific expectations of business users from BI, and how they align with the project goals.
  • Involved in the Agile development methodology as an active member in Scrum meetings.
  • Worked in Azure environment for development and deployment of Custom Hadoop Applications.
  • Designed and implemented scalable Cloud Data and Analytical architecture solutions for various public and private cloud platforms using Azure.
  • Analyzed, designed and built modern data solutions using Azure PaaS services to support visualization of data.
  • Used Azure Synapse to manage processing workloads and served data for BI and prediction needs.
  • Worked with Azure Blob and Data Lake storage, loading data into Azure Synapse Analytics (SQL DW).
  • Implemented various Azure platforms such as Azure SQL Database, Azure SQL Data Warehouse, Azure Analysis Services, HDInsight, Azure Data Lake and Data Factory.
  • Undertook data analysis and collaborated with the downstream analytics team to shape the data according to their requirements.
  • Created tables in Snowflake, loading and analyzing data using Spark-Scala scripts.
  • Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL.
  • Created a data pipeline to migrate data from Azure Blob Storage to Snowflake (see the first sketch after this list).
  • Developed and executed data pipeline testing processes and validated business rules and policies.
  • Developed SQL scripts using Spark for handling different data sets and verifying the performance over MapReduce jobs.
  • Involved in converting MapReduce programs into Spark transformations using Spark RDDs with Scala and Python.
  • Extracted and loaded data into Data Lake environment (MS Azure) by using Sqoop which was accessed by business users.
  • Used Windows Azure SQL Reporting Services to create reports with tables, charts and maps.
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala (see the second sketch after this list).
  • Responsible for documenting the process and cleanup of unwanted data.
  • Responsible for Ingestion of Data from Blob to Kusto and maintaining the PPE and PROD pipelines.
  • Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the Cosmos Activity.
  • Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services (VSTS).
  • Implemented Proof of concepts for SOAP & REST APIs.
  • Deployed and tested (CI/CD) our developed code using Visual Studio Team Services (VSTS).
  • Worked on building visuals and dashboards using Power BI reporting tool.
  • Supported Cloud Strategy team to integrate analytical capabilities into an overall cloud architecture and business case development.
  • Conducted code reviews for team members to ensure proper test coverage and consistent code standards.
  • Continuously coordinated with the QA, production support and deployment teams.
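
A minimal sketch of the Blob-to-Snowflake pipeline referenced above, assuming an external stage already defined over the Azure Blob container; the account, warehouse, stage and table names are hypothetical placeholders, and it uses the standard snowflake-connector-python client.

# Python-driven COPY INTO from an Azure Blob external stage into a Snowflake staging table (sketch).
import snowflake.connector

def load_blob_to_snowflake():
    conn = snowflake.connector.connect(
        account="xy12345.east-us-2.azure",   # assumed account locator
        user="ETL_USER",
        password="***",
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="STAGING",
    )
    try:
        cur = conn.cursor()
        # @AZURE_BLOB_STAGE is assumed to point at the Blob container holding the extracts.
        cur.execute("""
            COPY INTO STAGING.ORDERS_RAW
            FROM @AZURE_BLOB_STAGE/orders/
            FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
            ON_ERROR = 'CONTINUE'
        """)
        print(cur.fetchall())   # per-file load results returned by COPY INTO
    finally:
        conn.close()

if __name__ == "__main__":
    load_blob_to_snowflake()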
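
And a sketch of the Kafka-to-HDFS streaming flow from the same list; the item above used Scala, so this is only an illustration in PySpark to keep the examples in one language. The broker, topic and HDFS paths are assumptions, and the spark-sql-kafka package must be on the classpath.

# Spark Structured Streaming from a Kafka topic into Parquet files on HDFS (sketch).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # assumed broker
    .option("subscribe", "clickstream")                  # assumed topic
    .option("startingOffsets", "latest")
    .load()
    # Kafka values arrive as bytes; cast to string and keep the event timestamp.
    .select(col("value").cast("string").alias("payload"), col("timestamp"))
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/clickstream")               # assumed landing path
    .option("checkpointLocation", "hdfs:///checkpoints/clickstream")
    .outputMode("append")
    .start()
)

query.awaitTermination()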

Environment: Spark 3.0, Snowflake DB, SOAP & REST APIs, ADF, Blob Storage, CI/CD, Azure SQL, JSON, Python, ETL, Azure Synapse, MapReduce, Kafka, Sqoop, Azure PaaS, Agile & Scrum Methodology

Confidential - Wilmington, DE

Data Engineer

Responsibilities:

  • As a Data Engineer, reviewed business requirements and developed Big Data solutions focused on pattern matching and predictive modeling.
  • Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
  • Implemented Installation and configuration of multi-node cluster on Cloud using Amazon Web Services (AWS) on EC2.
  • Worked in an Agile development environment and participated in daily scrum and other design-related meetings.
  • Responsible for developing data pipeline with Amazon AWS to extract the data from weblogs and store in HDFS.
  • Responsible for Big data initiatives and engagement including analysis, brainstorming, POC, and architecture.
  • Performed data transformations in Hive and used partitions and buckets for performance improvements.
  • Ingested data into HDFS using Sqoop and scheduled an incremental load to HDFS.
  • Created S3 buckets, managed S3 bucket policies, and utilized S3 and Glacier for storage and backup on AWS.
  • Used Hive to analyze data ingested into HBase through Hive-HBase integration and computed various metrics for reporting on the dashboard.
  • Worked with cloud provisioning team on a capacity planning and sizing of the nodes (Master and Slave) for an AWS EMR Cluster.
  • Developed PySpark scripts for data transfer from S3 to Redshift (see the first sketch after this list).
  • Responsible for creating an instance on Amazon EC2 (AWS) and deployed the application on it.
  • Worked with Amazon EMR to process data directly in S3 and to copy data from S3 to the Hadoop Distributed File System (HDFS) on the EMR cluster, setting up Spark Core for analysis work.
  • Gained exposure to Spark architecture and how RDDs work internally by processing data from local files, HDFS and RDBMS sources, creating RDDs and optimizing them for performance.
  • Developed an AWS Lambda function to invoke a Glue job as soon as a new file is available in the inbound S3 bucket (see the second sketch after this list).
  • Created Spark jobs to apply data cleansing/data validation rules on new source files in the inbound bucket and route rejected records to a reject-data S3 bucket.
  • Transferred the data using Informatica tool from AWS S3 to AWS Redshift.
  • Extensively involved in writing SQL Scripts, functions and packages.
  • Created partitioned tables in Hive, designed a data warehouse using Hive external tables, and created Hive queries for analysis.
  • Estimated and planned development work using Agile software development.
  • Developed all the mappings according to the design document and mapping specs provided, and performed unit testing.
  • Used Sqoop to import data into HDFS and Hive from an Oracle database.
  • Transformed legacy data to be loaded into staging using stored procedures.
  • Act as technical liaison between customer and team on all AWS technical aspects.
  • Pulled data from the data lake (HDFS) and massaged it with various RDD transformations.
  • Used AWS Cloud with Infrastructure Provisioning / Configuration.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
  • Maintained Tableau functional reports based on user requirements.
  • Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
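
A minimal sketch of the S3-to-Redshift transfer referenced above; the bucket, cluster endpoint, table and credentials are hypothetical placeholders, it assumes Parquet input and the Redshift JDBC driver on the classpath, and a dedicated spark-redshift connector would be an equally valid choice.

# PySpark transfer of a curated S3 dataset into a Redshift table via JDBC (sketch).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-to-redshift").getOrCreate()

# Read the curated dataset from S3 (assumes Parquet files under this prefix).
orders = spark.read.parquet("s3a://example-curated/orders/")

# Light transformation before the load: keep only completed orders.
completed = orders.filter(orders.status == "COMPLETED")

(
    completed.write
    .format("jdbc")
    .option("url", "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev")
    .option("dbtable", "public.orders_completed")
    .option("user", "etl_user")
    .option("password", "***")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")   # assumed driver on the classpath
    .mode("append")
    .save()
)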
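
And a sketch of the S3-triggered Lambda from the same list that starts a Glue job when a new file lands in the inbound bucket; the Glue job name and argument key are assumptions.

# Lambda handler invoked by S3 "object created" events; starts a Glue job for each new file (sketch).
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # S3 put events carry the bucket name and object key of the newly arrived file.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        response = glue.start_job_run(
            JobName="inbound-cleansing-job",                       # assumed Glue job name
            Arguments={"--source_path": f"s3://{bucket}/{key}"},   # assumed job argument
        )
        print(f"Started Glue run {response['JobRunId']} for s3://{bucket}/{key}")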

Environment: Spark 2.3, AWS, EC2, S3, Redshift, Hive, Tableau, HDFS, AWS EMR, PySpark, Agile & Scrum meetings

Confidential - Newport Beach, CA

Data Engineer

Responsibilities:

  • As a Data Engineer, worked with the analysis and management teams and supported them based on their requirements.
  • Architected, designed and developed business applications and data marts for reporting.
  • Implemented Installation and configuration of multi-node cluster on Cloud using Amazon Web Services (AWS) on EC2.
  • Developed a reconciliation process to make sure the Elasticsearch index document count matches the source records.
  • Maintained Tableau functional reports based on user requirements.
  • Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
  • Used Agile (SCRUM) methodologies for Software Development.
  • Developed data pipelines to consume data from Enterprise Data Lake (MapR Hadoop distribution - Hive tables/HDFS) for analytics solution.
  • Created Hive External tables to stage data and then move the data from Staging to main tables.
  • Wrote complex Hive queries to extract data from heterogeneous sources (Data Lake) and persist the data into HDFS.
  • Implemented the Big Data solution using Hadoop, Hive and Informatica to pull/load the data into HDFS.
  • Developed incremental and complete load Python processes to ingest data into Elasticsearch from Hive (see the first sketch after this list).
  • Pulled data from the data lake (HDFS) and massaged it with various RDD transformations.
  • Created Oozie workflow and Coordinator jobs to kick off the jobs on time for data availability.
  • Developed REST services to write data into an Elasticsearch index using Python Flask (see the second sketch after this list).
  • Developed complete end-to-end Big Data processing in the Hadoop ecosystem.
  • Used AWS Cloud with Infrastructure Provisioning / Configuration.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
  • Created dashboards for analyzing POS data using Tableau.
  • Developed Tableau visualizations and dashboards using Tableau Desktop.
  • Involved in PL/SQL query optimization to reduce the overall run time of stored procedures.
  • Continuously tuned Hive UDFs for faster queries by employing partitioning and bucketing.
  • Implemented partitioning, dynamic partitions and buckets in Hive.
  • Deployed RMAN to automate backups and maintained scripts in the recovery catalog.
  • Worked on QA of the data and added data sources, snapshots and caching to the reports.
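
A minimal sketch of the incremental Hive-to-Elasticsearch load referenced above, assuming a hypothetical Hive table with a modified_ts watermark column and placeholder hosts and index names; it reads through PyHive and indexes with the elasticsearch-py bulk helper.

# Incremental load of changed Hive rows into an Elasticsearch index (sketch).
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from pyhive import hive

ES_INDEX = "customers"   # assumed index name

def incremental_load(last_watermark):
    es = Elasticsearch(["http://es-node1:9200"])   # assumed Elasticsearch endpoint
    conn = hive.Connection(host="hive-server", port=10000, database="analytics")
    cur = conn.cursor()
    # Pull only rows changed since the previous run; the watermark value is trusted input in this sketch.
    cur.execute(
        "SELECT customer_id, name, segment, modified_ts "
        f"FROM customers WHERE modified_ts > '{last_watermark}'"
    )
    actions = (
        {
            "_index": ES_INDEX,
            "_id": row[0],
            "_source": {"name": row[1], "segment": row[2], "modified_ts": str(row[3])},
        }
        for row in cur.fetchall()
    )
    indexed, _ = bulk(es, actions)
    print(f"indexed {indexed} documents into {ES_INDEX}")

if __name__ == "__main__":
    incremental_load("2020-01-01 00:00:00")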
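
And a sketch of the Flask REST service from the same list that writes documents into an Elasticsearch index; the route, index name and endpoint are hypothetical placeholders.

# Small Flask service that upserts posted JSON documents into an Elasticsearch index (sketch).
from elasticsearch import Elasticsearch
from flask import Flask, jsonify, request

app = Flask(__name__)
es = Elasticsearch(["http://es-node1:9200"])   # assumed Elasticsearch endpoint

@app.route("/documents/<doc_id>", methods=["PUT"])
def upsert_document(doc_id):
    # Index (create or overwrite) the request body under the given document id.
    body = request.get_json(force=True)
    result = es.index(index="customers", id=doc_id, body=body)
    return jsonify({"id": doc_id, "result": result["result"]}), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)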

Environment: AWS, Python, Agile, Hive 2.1, Oracle 12c, Tableau, HDFS, PL/SQL, Sqoop 1.2, Flume 1.6

Confidential

Data Modeler/Data Analyst

Responsibilities:

  • Worked as a Data Modeler/Analyst to generate data models using E/R Studio and developed a relational database system.
  • Followed SDLC Methodology for project development.
  • Interacted with Business Analysts, SMEs and other Data Architects to understand business needs.
  • Created Logical and Physical Data Models for Relational (OLTP) and Star schema Fact and Dimension tables using E/R Studio.
  • Performed GAP analysis to analyze the difference between the system capabilities and business requirements.
  • Involved in Data flow analysis, Data modeling, Physical database design, forms design and development, data conversion, performance analysis and tuning.
  • Created and maintained data model standards, including master data management (MDM) and Involved in extracting the data from various sources.
  • Designed the data marts using the Ralph Kimball's Dimensional Data Mart modeling methodology using E/R Studio.
  • Worked on Master data Management (MDM) Hub and interacted with multiple stakeholders.
  • Proficient in developing Entity-Relationship diagrams and Star/Snowflake schema designs, and expert in modeling transactional databases and data warehouses.
  • Worked on normalization techniques and normalized the data into Third Normal Form (3NF).
  • Worked with normalization and de-normalization concepts and design methodologies such as the Ralph Kimball and Bill Inmon approaches, and implemented Slowly Changing Dimensions.
  • Implemented Forward engineering to create tables, views and SQL scripts and mapping documents.
  • Used reverse engineering to connect to existing database and create graphical representation (E-R diagram).
  • Performance tuning and stress-testing of NoSQL database environments in order to ensure acceptable database performance in production mode.
  • Developed automated data pipelines from various external data sources (web pages, APIs, etc.) to an internal data warehouse (SQL Server) and then exported to reporting tools (see the sketch after this list).
  • Connected to AWS Redshift through Tableau to extract live data for real time analysis.
  • Performed data analysis and data profiling using complex SQL on various sources systems including Oracle.
  • Monitored data quality and maintained data integrity to ensure effective functioning of the department.
  • Managed database design and implemented a comprehensive Star-Schema with shared dimensions.
  • Analyzed the data that was using the maximum number of resources and made changes to the back-end code using PL/SQL stored procedures and triggers.
  • Developed and maintained stored procedures, implemented changes to database design including tables and views.
  • Conducted Design reviews with the business analysts and the developers to create a proof of concept for the reports.
  • Performed detailed data analysis to analyze the duration of claim processes.
  • Created the cubes with Star Schemas using facts and dimensions through SQL Server Analysis Services (SSAS).
  • Deployed SSRS reports to Report Manager and created linked reports, snapshots, and subscriptions for the reports and worked on scheduling of the reports.
  • Generated parameterized queries for generating tabular reports using global variables, expressions, functions, and stored procedures using SSRS.
  • Developed, and scheduled variety of reports like cross-tab, parameterized, drill through and sub reports with SSRS.
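
A minimal sketch of the external-source-to-SQL-Server pipeline referenced above; the endpoint, connection string and table name are hypothetical placeholders, and it pulls JSON with requests, frames it with pandas and appends to the warehouse table through SQLAlchemy/pyodbc.

# Automated pull from an external REST API into a SQL Server staging table (sketch).
import pandas as pd
import requests
from sqlalchemy import create_engine

API_URL = "https://api.example.com/v1/claims"   # assumed external source
ENGINE = create_engine(
    "mssql+pyodbc://etl_user:***@warehouse-host/claims_dw"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

def load_claims():
    # Pull the latest records from the external API.
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    frame = pd.DataFrame(response.json())

    # Append into the staging table that feeds the reporting layer.
    frame.to_sql("stg_claims", ENGINE, schema="dbo", if_exists="append", index=False)
    print(f"loaded {len(frame)} rows into dbo.stg_claims")

if __name__ == "__main__":
    load_claims()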

Environment: E/R Studio 9.2, OLTP, SDLC, MDM, SQL Server 2008, NoSQL, AWS, PL/SQL
