
Senior Big Data Engineer Resume


St. Louis, MO

PROFESSIONAL SUMMARY:

  • 8+ years of experience in software development with deep business acumen and technical expertise in Big Data technologies
  • Expertise in resolving production issues and hands-on experience in all phases of the software development life cycle
  • Expert in analyzing business requirements and contributing to solution creation, design and deployment.
  • Deep hands-on experience designing, developing and deploying end-to-end infrastructure to host business software at scale on AWS and Azure (IaaS and PaaS) as well as on premises.
  • Proficient in big data tools such as Hive and Spark, and in the relational data warehouse tool Teradata.
  • Well versed with big data on AWS cloud services such as EC2, S3, Glue, Athena, DynamoDB and Redshift
  • Excellent hands-on experience in business requirement analysis and in designing, developing, testing and maintaining complete data management and processing systems, process documentation, and ETL technical and design documents.
  • Experience in designing end-to-end scalable architectures to solve business problems using various Azure components such as HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio
  • Experienced in loading data into Hive partitions, creating buckets in Hive, and developing MapReduce jobs to automate data transfer from HBase
  • Aggregated data through Kafka, HDFS, Hive, Scala and Spark Streaming on AWS
  • Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
  • Knowledge of proofs of concept (PoCs) and gap analysis; gathered the necessary data for analysis from different sources and prepared it for exploration using data munging and Teradata.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
  • Experienced in building automated regression scripts in Python for validation of ETL processes across multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
  • Responsible for data engineering functions including, but not limited to, data extraction, transformation, loading and integration in support of enterprise data infrastructure: data warehouses, operational data stores and master data management.
  • Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Good experience in implementing and orchestrating data pipelines using Oozie and Airflow (see the Airflow sketch after this list).
  • Worked with Cloudera and Hortonworks distributions.
  • Expert in developing SSIS/DTS packages to extract, transform and load (ETL) data into data warehouses/data marts from heterogeneous sources.
  • Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.
  • Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files
  • Expertise in Python and Scala, including user-defined functions (UDFs) for Hive and Pig written in Python.
  • Experience in developing MapReduce programs using Apache Hadoop to analyze big data as per requirements.
  • Solid experience and understanding of designing and operationalizing large-scale data and analytics solutions on the Snowflake Data Warehouse.
  • Developed ETL pipelines in and out of the data warehouse using a combination of Python and SnowSQL.
  • Experience in setting up monitoring infrastructure for Hadoop cluster using Nagios and Ganglia.
  • Hands-on experience with Spark MLlib utilities such as classification, regression, clustering, collaborative filtering and dimensionality reduction.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Loaded flat-file data into the staging area using Informatica.
  • Good experience in database design, creating Tables, Views, Stored Procedures, Functions, Triggers and Indexes.
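
Illustrative sketch of the pipeline orchestration referenced above: a minimal Apache Airflow 2.x DAG in Python. The DAG id, task names, paths and Hive table here are hypothetical placeholders rather than artifacts of any engagement described in this resume.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator  # Airflow 2.x import path

    # Minimal daily ingestion DAG: land raw files on HDFS, then register a Hive partition.
    with DAG(
        dag_id="daily_viewership_ingest",      # hypothetical DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        land_raw = BashOperator(
            task_id="land_raw_files",
            bash_command="hdfs dfs -put /staging/viewership/*.csv /data/raw/viewership/",
        )
        add_partition = BashOperator(
            task_id="add_hive_partition",
            bash_command=(
                "hive -e \"ALTER TABLE viewership ADD IF NOT EXISTS "
                "PARTITION (event_date='{{ ds }}')\""
            ),
        )
        land_raw >> add_partition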

TECHNICAL SKILLS:

Languages: Python, R, SQL, COBOL, Java, JavaScript, HTML, CSS

Data Visualization: AWS QuickSight, Power BI, Tableau, Informatica, Spotfire, Cognos, Microsoft Excel, PowerPoint

Hadoop Eco System: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Flume, Apache Storm, Apache Airflow, HBase

Data Analysis: Web Scraping, Data Visualization, Statistical Analysis, Data Mining, Data Warehousing, Data Migration, Database Management

Database: MySQL, SQL Server, Snowflake, Oracle, AWS Redshift

Data Modeling Tools: Erwin Data Modeler, Erwin Model Manager, ER Studio v17, and Power Designer 16.6

Cloud Platform: AWS, Azure, Cloud Stack/Open Stack

Cloud Management: Amazon Web Services (AWS), Amazon Redshift

Testing and Defect Tracking Tools: HP/Mercury Quality Center, WinRunner, MS Visio & Visual SourceSafe

Operating System: Windows, Unix, Sun Solaris

ETL/Data warehouse Tools: Informatica 9.6/9.1, SAP Business Objects XIR3.1/XIR2, Talend, Tableau, and Pentaho.

OLAP Tools: Tableau

PROFESSIONAL EXPERIENCE:

Confidential, St. Louis, MO

Senior Big Data Engineer

Responsibilities:

  • Work with Architects, Stakeholders and Business to design Information Architecture of Smart Data Platform for the Multistate deployment in Kubernetes Cluster.
  • Design/Develop jobs using UNIX/Scala/Informatica/Hive/Spark/Pig/Sqoop/TWS to pull viewership data into the HDFS ecosystem and provide business-ready extracts to downstream users.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
  • Experienced with AWS services to manage applications in the cloud and to create or modify instances.
  • Created Hive tables as per requirements, as internal or external tables, defined with appropriate static or dynamic partitions and bucketing for efficiency.
  • Loaded and transformed large sets of structured and semi-structured data using Hive.
  • Handled billions of log lines coming from several clients and analyzed them using big data technologies such as Hadoop (HDFS), Apache Kafka and Apache Storm.
  • Created data pipelines for ingestion and aggregation of different events, loading consumer response data from AWS S3 buckets into Hive external tables in HDFS to serve as the feed for Tableau dashboards.
  • Developed Python scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, writing data back into the RDBMS through Sqoop.
  • Worked extensively with Spark and MLlib to develop a regression model for cancer data.
  • Hands on design and development of an application using Hive (UDF).
  • Developed simple to complex MapReduce streaming jobs in Python, implemented alongside Hive and Pig.
  • Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for small data set processing and storage, and maintained the Hadoop cluster on AWS EMR.
  • Used the streaming tool Kafka to load data onto the Hadoop file system and move the same data into the Cassandra NoSQL database.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS (see the sketch after this list).
  • Migrated existing MapReduce programs to Spark using Scala and Python.
  • Implemented Spark SQL to connect to Hive, read its data and distribute processing for high scalability.
  • Implemented partitioning, dynamic partitions and buckets in Hive for efficient data access.
  • Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team using Tableau.
  • Consumed XML messages from Kafka and processed the XML files using Spark Streaming to capture UI updates.
  • Experienced in writing live real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
  • Involved in file movements between HDFS and AWS S3 and extensively worked with S3 buckets in AWS.
  • Worked in the AWS environment for development and deployment of custom Hadoop applications.
  • Involved in designing and developing product feature enhancements.
  • Involved in designing and developing enhancements of CSG using AWS APIs.
  • After transformation, the data is moved to the Spark cluster, where it goes live to the application using Spark Streaming and Kafka.
  • Collaborated with the infrastructure, network, database, application and BI teams to ensure data quality and availability.
  • Created and maintained various DevOps-related tools for the team, such as provisioning scripts, deployment tools, and development and staging environments on AWS and Rackspace Cloud.
  • Implemented Composite Server for data virtualization needs and created multiple views for restricted data access using a REST API.
  • Integrated Cassandra as a distributed persistent metadata store to provide metadata resolution for network entities.
  • Involved in implementing and integrating various NoSQL databases such as HBase and Cassandra.
  • Loaded and transformed large sets of structured, semi-structured and unstructured data using Pig; imported and exported data between MySQL, HDFS and NoSQL databases on a regular basis using Sqoop; and designed and developed Pig scripts to process data in batches for trend analysis.
  • Installed and configured OpenShift platform in managing Docker containers and Kubernetes Clusters.
  • Experience in designing and developing POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
  • Created base Hive scripts for analyzing requirements and processing data, designing the cluster to handle huge amounts of data and to cross-examine data loaded via Hive and MapReduce jobs.
  • Designed AWS CloudFormation templates to create VPCs, subnets and NAT to ensure successful deployment of web applications and database templates.
  • Created S3 buckets, managed their policies, and utilized S3 and Glacier for storage and backup on AWS.
  • Wrote MapReduce code in Python to eliminate certain security issues in the data.
  • Synchronized both unstructured and structured data using Pig and Hive from a business perspective.
  • Used Pig Latin on the client-side cluster and HiveQL on the server-side cluster.
  • Imported the complete data set from the RDBMS to the HDFS cluster using Sqoop.
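
Illustrative sketch of the Kafka-to-HDFS streaming ingestion described above, written with PySpark Structured Streaming (a newer API than the DStream-based jobs this section refers to). The broker address, topic name and HDFS paths are hypothetical placeholders, and the job assumes the spark-sql-kafka connector is available on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Read the raw event stream from Kafka (hypothetical broker and topic).
    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "viewership-events")
        .load()
    )

    # Persist the message payloads to HDFS as Parquet (hypothetical paths).
    query = (
        events.select(col("value").cast("string").alias("payload"))
        .writeStream
        .format("parquet")
        .option("path", "hdfs:///data/raw/viewership")
        .option("checkpointLocation", "hdfs:///checkpoints/viewership")
        .start()
    )
    query.awaitTermination()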

Environment: HDFS, Hive, Scala, Sqoop, Spark, Tableau, YARN, Cloudera, SQL, Terraform, Splunk, RDBMS, Elasticsearch, Kerberos, Jira, Confluence, Shell/Perl Scripting, Zookeeper, AWS (EC2, S3, EMR, Redshift, ECS, Glue, VPC, RDS, etc.), Ranger, Git, Kafka, OpenShift, CI/CD (Jenkins), Kubernetes

Confidential, Boise, ID

Big Data Engineer

Responsibilities:

  • Used Sqoop to import data from RDBMS source systems and loaded it into Hive staging and base tables.
  • Implemented logic to reprocess failed messages in Kafka using offset IDs.
  • Worked extensively with Sqoop for importing metadata from Oracle.
  • Planned Azure Storage migration: Blob storage for document and media files, Table storage for structured datasets, Queue storage for reliable messaging in workflow processing, and File storage for shared file data.
  • Worked with the PaaS architect on a complex Azure data center assessment and migration project.
  • Performed several ad-hoc data analyses on the Azure Databricks analytics platform, tracked on a Kanban board.
  • Used Azure reporting services to upload and download reports
  • Handled different file types such as JSON, XML, flat files and CSV, using appropriate SerDes or parsing logic to load them into Hive tables.
  • Implemented software enhancements to port legacy software systems to Spark and Hadoop ecosystems on Azure Cloud.
  • Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the sketch after this list).
  • Translated business requirements into SAS code for use within internal systems and models.
  • Developed multiple Kafka producers and consumers as per the software requirement specifications.
  • Used Kafka for log aggregation: collecting physical log files off servers and placing them in a central location such as HDFS for processing.
  • Built a Hortonworks cluster on Confidential Azure to extract actionable insights from data collected by IoT sensors installed in excavators.
  • Installed a Hortonworks Hadoop cluster on the Confidential Azure cloud in the UK region to satisfy the customer’s data locality needs.
  • Implemented OLAP multidimensional cube functionality using Azure SQL Data Warehouse.
  • Good exposure to Azure Cloud, ADF, ADLS, Azure DevOps (VSTS) and portal services.
  • Created pipelines in ADF using Linked Services, Datasets and Pipelines to extract, transform and load data to and from different sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse and a write-back tool.
  • Implemented test scripts to support test driven development and continuous integration.
  • Used partitioning techniques for faster performance.
  • Analyzed production jobs in case of abends and fixed the issues.
  • Loaded real time data from various data sources into HDFS using Kafka.
  • Developed MapReduce jobs for data cleanup in Python.
  • Prepared Tableau reports and dashboards with calculated fields, parameters, sets, groups and bins, and published them to the server.
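
Illustrative sketch of the kind of PySpark extraction and aggregation job described above, reading JSON and CSV inputs and writing a curated Parquet output. The ADLS account, container names, paths and columns are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

    # Hypothetical raw inputs on Azure Data Lake Storage Gen2.
    clicks = spark.read.json("abfss://raw@datalake.dfs.core.windows.net/clicks/")
    accounts = (
        spark.read.option("header", True)
        .csv("abfss://raw@datalake.dfs.core.windows.net/accounts.csv")
    )

    # Aggregate click events per account per day.
    daily_usage = (
        clicks.join(accounts, "account_id")
        .groupBy("account_id", F.to_date("event_ts").alias("event_date"))
        .agg(F.count("*").alias("events"))
    )

    # Write the curated output back to the lake as Parquet.
    daily_usage.write.mode("overwrite").parquet(
        "abfss://curated@datalake.dfs.core.windows.net/daily_usage/"
    )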

Environment: Sqoop, Hive, Azure, JSON, XML, Kafka, Python, MapReduce, Oracle, Agile Scrum, Snowflake, Pig, Spark, Scala, Azure Databricks, DAX, Azure Synapse Analytics, Azure Data Lake.

Confidential, Greenwich, CT

Big Data Engineer

Responsibilities:

  • Developed Python scripts to automate the data sampling process and ensured data integrity by checking for completeness, duplication, accuracy and consistency.
  • Defined facts and dimensions and designed the data marts using Ralph Kimball's dimensional data mart modeling methodology in Erwin.
  • Worked on big data with AWS cloud services such as EC2, S3, EMR and DynamoDB.
  • Implemented and managed ETL solutions and automated operational processes.
  • Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
  • Integrated Kafka with Spark Streaming for real time data processing
  • Managed security groups on AWS, focusing on high availability, fault tolerance and auto scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
  • Involved in forward engineering of the logical models to generate physical models and data models using Erwin, and in their subsequent deployment to the enterprise data warehouse.
  • Worked on publishing interactive data visualization dashboards, reports and workbooks in Tableau and SAS Visual Analytics.
  • Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce jobs that extract the data in a timely manner.
  • Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
  • Analyzed existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler and Database Engine Tuning Advisor to enhance performance.
  • Used Hive SQL, Presto SQL and Spark SQL for ETL jobs, choosing the right technology to get the job done.
  • Defined and deployed monitoring, metrics, and logging systems on AWS.
  • Developed code to handle exceptions and push the code into the exception Kafka topic.
  • Was responsible for ETL and data validation using SQL Server Integration Services.
  • Created and maintained documents related to business processes, mapping design, data profiles and tools.
  • Applied various machine learning algorithms and statistical models such as decision trees, logistic regression and gradient boosting machines to build predictive models using the scikit-learn package in Python (see the sketch after this list).
  • Designed and built multi-terabyte, end-to-end data warehouse infrastructure from the ground up on Confidential Redshift, handling millions of records every day at large scale.
  • Developed SSRS reports, SSIS packages to Extract, Transform and Load data from various source systems
  • Created Entity Relationship Diagrams (ERD), Functional diagrams, Data flow diagrams and enforced referential integrity constraints and created logical and physical models using Erwin
  • Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad hoc queries, allowing for a more reliable and faster reporting interface with sub-second response for basic queries.
  • Wrote various data normalization jobs for new data ingested into Redshift.
  • Created various complex SSIS/ETL packages to Extract, Transform and Load data
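
Illustrative sketch of a gradient boosting classifier built with scikit-learn, as referenced above. The feature file and column names are hypothetical placeholders, not data from this engagement.

    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Hypothetical feature table exported from the warehouse.
    df = pd.read_csv("features.csv")
    X, y = df.drop(columns=["target"]), df["target"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Fit a gradient boosting machine and report test-set AUC.
    model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)
    model.fit(X_train, y_train)
    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))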

Environment: SQL Server, Erwin, Kafka, Python, MapReduce, Oracle, AWS, Redshift, Informatica, RDS, NoSQL, Snowflake Schema, MySQL, PostgreSQL.

Confidential

Hadoop Developer

Responsibilities:

  • Involved in design and development phases of Software Development Life Cycle (SDLC) using Scrum methodology.
  • Worked on creating and updating AWS Auto Scaling and CloudWatch monitoring via the AWS CLI.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
  • Involved in Requirement gathering, Business Analysis and translated business requirements into Technical design in Hadoop and Big Data.
  • Modelled Hive partitions extensively for data separation and faster data processing, and followed Pig and Hive best practices for tuning.
  • Loaded the aggregated data onto DB2 for reporting on the dashboard.
  • Used Pig as ETL tool to do transformations, event joins, filters and some pre-aggregations before storing the data onto HDFS.
  • Created a customized BI tool for the management team that performs query analytics using HiveQL.
  • Created partitions and buckets based on state for further processing using bucket-based Hive joins.
  • Used Maven extensively for building JAR files of MapReduce programs and deployed them to the cluster.
  • Worked with NoSQL databases such as HBase, Cassandra, DynamoDB (AWS) and MongoDB.
  • Developed a suite of unit test cases for Mapper, Reducer and Driver classes using the MR testing library.
  • Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig and Sqoop.
  • Allotted permissions, policies and roles to users and groups using AWS Identity and Access Management (IAM).
  • Imported and exported data between HDFS and databases using Sqoop.
  • Developed a data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest behavioral data into HDFS for analysis (a simplified Python streaming equivalent is sketched after this list).
  • Implemented optimization and performance tuning in Hive and Pig.
  • Developed job flows in Oozie to automate the workflow for extraction of data from warehouses and weblogs.
  • Designed and implemented a Cassandra NoSQL based database that persists high-volume user profile data.
  • Migrated high-volume OLTP transactions from Oracle to Cassandra
  • Performed data synchronization between EC2 and S3, Hive stand-up, and AWS profiling.
  • Created a data pipeline of MapReduce programs using chained mappers.
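
The MapReduce work in this section was in Java; for consistency with the other examples in this document, here is a simplified Hadoop Streaming equivalent in Python that counts records per state. The tab-separated field layout (state code in the third column) is a hypothetical placeholder.

    # mapper.py: emit (state, 1) for each input record.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 2:
            print(f"{fields[2]}\t1")

    # reducer.py: sum the counts emitted by the mapper, one total per state.
    # (Hadoop Streaming delivers the mapper output sorted by key.)
    import sys

    current_state, count = None, 0
    for line in sys.stdin:
        state, value = line.rstrip("\n").split("\t")
        if state != current_state:
            if current_state is not None:
                print(f"{current_state}\t{count}")
            current_state, count = state, 0
        count += int(value)
    if current_state is not None:
        print(f"{current_state}\t{count}")

A job like this would be submitted with the standard hadoop-streaming JAR, shipping both scripts with -files and passing them as -mapper and -reducer.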

Environment: RHEL, HDFS, MapReduce, AWS, Hive, Pig, Sqoop, Flume, Oozie, Mahout, HBase, Hortonworks Data Platform distribution, Cassandra.

Confidential

Data Engineer

Responsibilities:

  • Responsible for developing, supporting and maintaining the ETL (Extract, Transform and Load) processes using Oracle and Informatica PowerCenter.
  • Developed complex mappings in Informatica to load data from various sources using transformations such as Source Qualifier, connected and unconnected Lookup, Expression, Aggregator, Update Strategy, Joiner, Filter and Router.
  • Developed Mapplets to implement business rules using complex logic
  • Developed Logical and Physical data models that capture current state/future state data elements and data flows using Erwin.
  • Manipulated and summarized data to maximize possible outcomes efficiently
  • Created various Documents such as Source-to-Target Data Mapping Document, and Unit Test Cases Document.
  • Created and scheduled sessions and batches through the Informatica Server Manager; wrote UNIX shell scripts to automate the FTP data transfer process to and from the source systems and to schedule weekly and monthly loads/jobs (see the sketch after this list).
  • Migrated mappings from Development to Testing and from Testing to Production.
  • Performed Unit Testing and tuned for better performance.
  • Designed the dimensional model of the data warehouse and confirmed source data layouts and needs.
  • Extensively used Oracle ETL process for address data cleansing.
  • Extracted Data from various sources like Data Files, different customized tools like Meridian and Oracle.
  • Extensively worked on views, stored procedures, triggers and SQL queries, and on loading data into staging, to enhance and maintain existing functionality.
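
Illustrative sketch in Python of the scripted FTP transfer automation described above (the original automation used UNIX shell scripts). The host, credentials and directories are hypothetical placeholders; in practice the credentials would come from a secure store and scheduling would be handled by cron or the job scheduler.

    import ftplib
    from pathlib import Path

    # Hypothetical connection details for the nightly source-system pull.
    HOST, USER, PASSWORD = "source.example.com", "etl_user", "change-me"
    REMOTE_DIR, LOCAL_DIR = "/outbound/weekly", Path("/data/staging/weekly")

    with ftplib.FTP(HOST, USER, PASSWORD) as ftp:
        ftp.cwd(REMOTE_DIR)
        for name in ftp.nlst():
            target = LOCAL_DIR / name
            with open(target, "wb") as fh:
                ftp.retrbinary(f"RETR {name}", fh.write)
            print(f"downloaded {name} -> {target}")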

Environment: Informatica Power Center, SQL, Oracle 10g, Erwin, Meridian, MS Office, Windows.
