Senior Big Data Engineer Resume

O'Fallon, MO

SUMMARY

  • Data Engineering professional with solid foundational skills and a proven track record of implementation across a variety of data platforms. Self-motivated, with strong personal accountability in both individual and team scenarios.
  • 8+ years of experience in Data Engineering and in Data Pipeline Design, Development and Implementation as a Sr. Data Engineer/Big Data Engineer.
  • Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification and Testing, in both Waterfall and Agile methodologies.
  • Very strong interpersonal skills and the ability to work both independently and within a group; learns quickly and adapts easily to the working environment.
  • Good exposure in interacting with clients and solving application environment issues and can communicate effectively with people at different levels including stakeholders, internal teams and the senior management.
  • Extensively experienced Big Data/Hadoop developer with varying levels of expertise across Big Data/Hadoop ecosystem projects, including Spark Streaming, HDFS, MapReduce, NiFi, Hive, HBase, Storm, Kafka, Flume, Sqoop, ZooKeeper and Oozie.
  • Strong experience in writing scripts using the Python API, PySpark API and Spark API for analyzing data.
  • Experience in setting up monitoring infrastructure for Hadoop cluster using Nagios and Ganglia.
  • Maintained BigQuery, PySpark and Hive code by fixing bugs and providing the enhancements required by business users.
  • Proficient in statistical methodologies including Hypothesis Testing, ANOVA, Time Series, Principal Component Analysis, Factor Analysis, Cluster Analysis and Discriminant Analysis.
  • Expertise in transforming business resources and requirements into manageable data formats and analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
  • Experience with Proofs of Concept (PoCs) and gap analysis; gathered necessary data for analysis from different sources and prepared data for exploration using data munging and Teradata.
  • Extensive experience with real-time streaming technologies: Spark, Storm and Kafka.
  • Good working knowledge of the Amazon Web Services (AWS) cloud platform, including services such as EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS and SES.
  • Well experienced in Normalization and De-Normalization techniques for optimum performance in relational and dimensional database environments.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
  • Expertise in designing complex mappings, in performance tuning, and in slowly changing dimension tables and fact tables.
  • Experience in Microsoft Azure/Cloud Services like SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage, Azure Data Factory
  • Worked extensively with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
  • Experienced in building automated regression scripts in Python to validate ETL processes between multiple databases such as Oracle, SQL Server, Hive and MongoDB.
  • Excellent communication skills. Successfully working in fast-paced multitasking environment both independently and in collaborative team, a self-motivated enthusiastic learner.
  • Skilled in performing data parsing, data ingestion, data manipulation, data architecture, data modelling and data preparation with methods including describe data contents, compute descriptive statistics of data, regex, split and combine, Remap, merge, subset, reindex, melt and reshape.
  • Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Expertise in Python and Scala, and in user-defined functions (UDFs) for Hive and Pig using Python.
  • Experience in developing Map Reduce Programs using Apache Hadoop for analyzing the big data as per the requirement.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Experience in working with NoSQL databases like HBase and Cassandra.
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Proficiency in SQL across several dialects (MySQL, PostgreSQL, Redshift, SQL Server and Oracle).
  • Experience in designing star schema, Snowflake schema for Data Warehouse, ODS architecture.
  • Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
  • Good knowledge of Data Marts, OLAP and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema and Snowflake modeling for fact and dimension tables) using Analysis Services.
  • Ability to work effectively in cross-functional team environments, excellent communication, and interpersonal skills.

TECHNICAL SKILLS

Big Data/Hadoop Technologies: MapReduce, Spark, Spark SQL, Azure, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, YARN, Oozie, ZooKeeper, Hue, Ambari Server

Languages: HTML5, DHTML, WSDL, CSS3, C, C++, XML, R/RStudio, SAS Enterprise Guide, SAS, R (Caret, Weka, ggplot), Perl, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, Ajax, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), JavaScript, Shell Scripting

NO SQL Databases: Cassandra, HBase, MongoDB, MariaDB

Web Design Tools: HTML, CSS, JavaScript, JSP, jQuery, XML

Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans.

Public Cloud: EC2, IAM, S3, Autoscaling, CloudWatch, Route53, EMR, RedShift

Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall

Build Tools: Jenkins, Toad, SQL Loader, PostgreSQL, Talend, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos.

Databases: Microsoft SQL Server 2008, 2010/2012, MySQL 4.x/5.x, Oracle 11g, 12c, DB2, Teradata, Netezza

Operating Systems: All versions of Windows, UNIX, LINUX, Macintosh HD, Sun Solaris

PROFESSIONAL EXPERIENCE

Confidential, O'Fallon, MO

Senior Big Data Engineer

Responsibilities:

  • Designing the business requirement collection approach based on the project scope and SDLC methodology.
  • Designed and implemented multiple ETL solutions with various data sources through extensive SQL scripting, ETL tools, Python, shell scripting and scheduling tools. Performed data profiling and data wrangling of XML and web feeds, and file handling, using Python, Unix and SQL.
  • Loaded data from different sources into a data warehouse to perform data aggregations for business intelligence using Python.
  • Designed and implemented an incremental Sqoop job to read data from DB2 and load it into Hive tables, and connected Tableau through HiveServer2 to generate interactive reports.
  • Used Sqoop to channel data from different sources of HDFS and RDBMS.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation from multiple file formats.
  • Used SSIS to build automated multi-dimensional cubes.
  • Used Spark Streaming with Python to receive real-time data from Kafka and store the stream data in HDFS and in NoSQL databases such as HBase and Cassandra.
  • Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Installing, configuring and maintaining Data Pipelines
  • Transforming business problems into Big Data solutions and define Big Data strategy and Roadmap.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
  • Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks.
  • Writing Pig Scripts to generate Map Reduce jobs and performed ETL procedures on the data in HDFS.
  • Develop solutions to leverage ETL tools and identify opportunities for process improvements using Informatica and Python
  • Prepared and uploaded SSRS reports. Manages database and SSRS permissions.
  • Used Apache NiFi to copy data from the local file system to HDP. Thorough understanding of various modules of AML, including Watch List Filtering, Suspicious Activity Monitoring, CTR, CDD and EDD.
  • Started working with AWS for storage and handling of terabytes of data for customer BI reporting tools.
  • Used SQL Server Management Tool to check the data in the database against the given requirements.
  • Validated the test data in DB2 tables on Mainframes and on Teradata using SQL queries.
  • Created a serverless data ingestion pipeline on AWS using MSK (Kafka) and Lambda functions.
  • Developed Java applications that read data from MSK (Kafka) and write it to DynamoDB.
  • Developed NiFi workflows to pick up data from a REST API server, from the data lake and from an SFTP server, and send it to the Kafka broker.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
  • Automated and scheduled recurring reporting processes using UNIX shell scripting and Teradata utilities such as MLOAD, BTEQ and FastLoad.
  • Implemented Actimize Anti-Money Laundering (AML) system to monitor suspicious transactions and enhance regulatory compliance.
  • Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
  • Used Oozie to automate data loading into the Hadoop Distributed File System.
  • Developed automated regression scripts in Python to validate ETL processes between multiple databases such as AWS Redshift, Oracle, MongoDB, T-SQL and SQL Server.

Environment: Cloudera Manager (CDH5), Hadoop, PySpark, HDFS, NiFi, Pig, Hive, S3, Kafka, Scrum, Git, Sqoop, Oozie, Informatica, Tableau, OLTP, OLAP, HBase, Cassandra, SQL Server, Python, Shell Scripting, XML, Unix.

Confidential, Oldsmar, FL

Sr. Data Engineer

Responsibilities:

  • Transforming business problems into Big Data solutions and define Big Data strategy and Roadmap. Installing, configuring, and maintaining Data Pipelines
  • Developed the features, scenarios and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin and Ruby.
  • Designing the business requirement collection approach based on the project scope and SDLC methodology.
  • Working on tickets opened by users regarding various incidents, requests
  • Created a Lambda Deployment function, and configured it to receive events from S3 buckets
  • Writing UNIX shell scripts to automate the jobs and scheduling cron jobs for job automation using commands with Crontab.
  • Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer
  • Developed mappings using transformations such as Expression, Filter, Joiner and Lookup for better data massaging and to migrate clean and consistent data.
  • Used Apache Spark DataFrames, Spark SQL and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL and MLlib libraries.
  • Wrote real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
  • Created pipelines in ADF using Linked Services, Datasets and Pipelines to extract, transform and load data between sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse and the write-back tool, in both directions.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis. Worked with Data Governance and Data Quality to design various models and processes.
  • Involved in all the steps and scope of the project reference data approach to MDM, have created a Data Dictionary and Mapping from Sources to the Target in MDM Data Model.
  • Experience managing Azure Data Lakes (ADLS) and Data Lake Analytics and an understanding of how to integrate with other Azure Services. Knowledge of USQL
  • Decommissioning nodes and adding nodes in the clusters for maintenance
  • Monitored cluster health by Setting up alerts using Nagios and Ganglia
  • Adding new users and groups of users as per the requests from the client
  • Implemented Big Data Analytics and Advanced Data Science techniques to identify trends, patterns, and discrepancies on petabytes of data by using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, MapReduce, and Azure Machine Learning.
  • Worked on data that combined structured and unstructured data from multiple sources, and automated the cleaning using Python scripts.
  • Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
  • Implement Spark Kafka streaming to pick up the data from Kafka and send to Spark pipeline.
  • Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources
  • Involved in unit testing the code and provided feedback to the developers. Performed unit testing of the application using NUnit.
  • Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake Schemas.
  • Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization.
  • Wrote research reports describing the experiments conducted, results and findings, and made strategic recommendations to technology, product and senior management. Worked closely with regulatory delivery leads to ensure robustness in prop trading control frameworks using Hadoop, Python Jupyter Notebook, Hive and NoSQL.
  • Wrote production level Machine Learning classification models and ensemble classification models from scratch using Python and PySpark to predict binary values for certain attributes in certain time frame.
  • Performed all necessary day-to-day Git support for different projects; responsible for the design and maintenance of the Git repositories and the access control strategies.

Environment: Hadoop, Kafka, Spark, Sqoop, Spark SQL, TDD, Spark Streaming, Hive, Scala, Pig, NoSQL, Impala, Oozie, HBase, Azure, Data Lake, Data Factory, Databricks, ZooKeeper, Python 3.6, Unix/Linux Shell Scripting, PyCharm, Informatica PowerCenter, Code Build, Code Pipeline, EventBridge, Athena

Confidential, Providence, RI

Big Data Engineer

Responsibilities:

  • Implemented Partitioning, Dynamic Partitions, Buckets in HIVE.
  • Develop database management systems for easy access, storage, and retrieval of data.
  • Perform DB activities such as indexing, performance tuning, and backup and restore.
  • Expertise in writing Hadoop jobs for analyzing data using HiveQL (queries), Pig Latin (data flow language) and custom MapReduce programs in Java.
  • Exported Data into Snowflake by creating Staging Tables to load Data of different files from Amazon S3.
  • Expert in creating Hive UDFs using Java to analyze the data efficiently.
  • Achieved near-real-time reporting by adopting an event-based processing approach instead of micro-batching to deal with data coming from Kafka.
  • Developed Spring Boot applications to read data from Kafka in an event-based manner. These applications were developed to run as micro-services that deal with parts
  • Responsible for loading the data from BDW Oracle database, Teradata into HDFS using Sqoop.
  • Implemented AJAX, JSON, and Java script to create interactive web screens.
  • Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB. Involved in loading and transforming large sets of structured, semi-structured and unstructured data, and analyzed them by running Hive queries. Processed image data through the Hadoop distributed system using Map and Reduce, then stored it in HDFS.
  • Created Session Beans and controller Servlets for handling HTTP requests from Talend
  • Performed data visualization and designed dashboards with Tableau, and generated complex reports including charts, summaries and graphs to interpret the findings for the team and stakeholders.
  • Responsibilities include gathering business requirements, developing strategy for data cleansing and data migration, writing functional and technical specifications, creating source to target mapping, designing data profiling and data validation jobs in Informatica, and creating ETL jobs in Informatica.
  • Built APIs that will allow customer service representatives to access the data and answer queries.
  • Extended the functionality of Hive with custom UDFs and UDAFs.
  • The new Business Data Warehouse (BDW) improved query/report performance, reduced the time needed to develop reports and established self-service reporting model in Cognos for business users.
  • Implemented Bucketing and Partitioning using hive to assist the users with data analysis.
  • Used Oozie scripts for deployment of the application and Perforce as the secure versioning software.
  • Connected to AWS Redshift through Tableau to extract live data for real time analysis
  • Wrote documentation for each report including purpose, data source, column mapping, transformation, and user group.
  • Utilized Waterfall methodology for team and project management
  • Used Git for version control with the Data Engineering team and Data Scientist colleagues. Created Tableau dashboards using stacked bars, bar graphs, scatter plots, geographical maps, Gantt charts, etc. using the Show Me functionality. Built dashboards and stories as needed using Tableau Desktop and Tableau Server.
  • Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity. Build an ETL which utilizes spark jar inside which executes the business analytical model.
  • Performed statistical analysis using SQL, Python, R Programming and Excel.
  • Import, clean, filter and analyze data using tools such as SQL, HIVE and PIG.
  • Used Python and SAS to extract, transform and load source data from transaction systems, and generated reports, insights and key conclusions.
  • Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, which allowed end users to understand the data on the fly using quick filters for on-demand information.

Environment: Cloudera CDH4.3, Hadoop, Pig, Hive, Informatica, HBase, MapReduce, HDFS, Sqoop, Impala, SQL, Tableau, Python, SAS, Flume, Oozie, Linux.

Confidential

Data Engineer

Responsibilities:

  • Experience in Big Data Analytics and design in Hadoop ecosystem using MapReduce Programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka
  • Built an Oozie pipeline that performs several actions: moving files, using Sqoop to bring data from the Teradata or SQL source into the Hive staging tables, performing aggregations as per business requirements, and loading the results into the main tables.
  • Ran Apache Hadoop, CDH and MapR distributions, dubbed Elastic MapReduce (EMR), on EC2.
  • Developed Data mapping, Transformation and Cleansing rules for the Data Management involving OLTP and OLAP.
  • Worked on retrieving data from FS to S3 using Spark commands.
  • Built S3 buckets and managed policies for S3 buckets and used S3 bucket and Glacier for storage and backup on AWS
  • Involved in creating UNIX shell Scripting. Defragmentation of tables, partitioning, compressing and indexes for improved performance and efficiency.
  • Developed reusable objects like PL/SQL program units and libraries, database procedures and functions, database triggers to be used by the team and satisfying the business rules.
  • Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources
  • Implemented data ingestion and handled clusters in real-time processing using Kafka.
  • Developed and implemented R and Shiny application which showcases machine learning for business forecasting. Developed predictive models using Python & R to predict customers churn and classification of customers.
  • Performing the forking action whenever there is a scope of parallel process for optimization of data latency.
  • Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
  • Wrote a Pig script that picks up data from one HDFS path, performs aggregation and loads it into another path, from which it is later populated into another domain table. Converted this script into a JAR and passed it as a parameter in the Oozie script.
  • Hands-on experience with Git Bash commands: git pull to pull the code from source and develop it as per the requirements, git add to add files, git commit after the code build, and git push to the pre-prod environment for code review; later used screwdriver.yaml, which builds the code and generates artifacts that are released into production.
  • Created the logical data model from the conceptual model and converted it into the physical database design using Erwin. Involved in transforming data from legacy tables to HDFS and HBase tables using Sqoop.
  • Partner with infrastructure and platform teams to configure, tune tools, automate tasks and guide the evolution of internal big data ecosystem; serve as a bridge between data scientists and infrastructure/platform teams.
  • Performed data analysis using regressions, data cleaning, Excel VLOOKUP, histograms and the TOAD client; presented the analysis and suggested solutions for investors.
  • Rapid model creation in Python using pandas, NumPy, sklearn, and plot.ly for data visualization. These models are then implemented in SAS where they are interfaced with MSSQL databases and scheduled to update on a timely basis.

Environment: MapReduce, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka, JSON, XML, PL/SQL, SQL, HDFS, Unix, Python, PySpark

Confidential

Junior Software Engineer

Responsibilities:

  • Created consumption views on top of metrics to reduce the running time for complex queries.
  • Compared data at the leaf level across various databases whenever data transformation or data loading took place, and analyzed data quality after these loads (checking for data loss or data corruption).
  • As part of data migration, wrote many SQL scripts to detect data mismatches and worked on loading historical data from Teradata SQL to Snowflake.
  • Developed SQL scripts to upload, retrieve, manipulate and handle sensitive data (National Provider Identifier data, i.e. name, address, SSN, phone number) in Teradata, SQL Server Management Studio and Snowflake databases for the project.
  • Implemented Defect Tracking process using JIRA tool by assigning bugs to Development Team
  • Involved in functional, integration, regression, smoke and performance testing. Tested Hadoop MapReduce jobs developed in Python, Pig and Hive.
  • Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
  • Generated Custom SQL to verify the dependency for the daily, Weekly, Monthly jobs.
  • Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
  • Created performance dashboards in Tableau/ Excel / Power point for the key stakeholders
  • Incorporated predictive modeling (rule engine) to evaluate the Customer/Seller health score using python scripts, performed computations and integrated with the Tableau viz.
  • Developed spark code and spark-SQL/streaming for faster testing and processing of data.
  • Analyzed marketing campaigns from various perspectives including CTR, conversion rates, seasonal/geographical trends, search queries, landing page, conversion funnel, quality score, competitors, distribution channel, etc. to achieve maximum ROI for clients.
  • Evaluated the traffic and performance of Daily Deals PLA ads and compared those items with non-daily-deal items to assess the possibility of increasing ROI. Suggested improvements and modified existing BI components (reports, stored procedures).
  • Experienced in working with spark ecosystem using Spark SQL and Scala queries on different formats like text file, CSV file.

Environment: Hadoop, MapReduce, Hive, Apache Spark, Sqoop, Snowflake, Nebula, Teradata, SQL Server, Python, Pig, GitHub, Tableau, MS Excel, MS PowerPoint.
