Sr. Big Data Engineer Resume
Deerfield Beach, FL
SUMMARY
- 8+ years of professional experience in information technology with an expert hand in the areas of BIG DATA, HADOOP, SPARK, HIVE, IMPALA, SQOOP, FLUME, KAFKA, SQL tuning, ETL development, report development, database development, data modeling and strong knowledge of Oracle database architecture.
- Experience in Big Data analytics and data manipulation using Hadoop ecosystem tools: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Kafka, Flume, Sqoop, Oozie, Avro, AWS, Spring Boot, Spark integration with Cassandra, Solr and ZooKeeper.
- Proficiency in multiple databases like MongoDB, Cassandra, MySQL, Oracle, and MS SQL Server. Worked on different file formats like delimited files, Avro, JSON and Parquet. Docker container orchestration using ECS, ALB and Lambda.
- Created Snowflake Schemas by normalizing the dimension tables as appropriate, and creating a Sub Dimension named Demographic as a subset of the Customer Dimension.
- Hands-on experience in test-driven development (TDD), behavior-driven development (BDD) and acceptance test-driven development (ATDD) approaches.
- Experience in developing data pipelines using AWS services including EC2, S3, Redshift, Glue, Lambda functions, Step Functions, CloudWatch, SNS, DynamoDB, SQS.
- Managing databases and Azure Data Platform services (Azure Data Lake (ADLS), Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB), SQL Server, Oracle, data warehouses, etc. Built multiple data lakes.
- Extensive experience in Text Analytics, generating data visualizations using R, Python and creating dashboards using tools like Tableau, Power BI.
- Expertise in Java programming with a good understanding of OOP, I/O, Collections, Exception Handling, Lambda Expressions and Annotations.
- Provided full life cycle support to logical/physical database design, schema management and deployment. Adept at the database deployment phase with strict configuration management and controlled coordination with different teams.
- Utilized Kubernetes and Docker as the runtime environment for the CI/CD system to build, test, and deploy. Experience in creating and running Docker images with multiple microservices.
- Utilized analytical applications like R, SPSS, Rattle and Python to identify trends and relationships between different pieces of data, draw appropriate conclusions and translate analytical findings into risk management and marketing strategies that drive value.
- Extensive hands-on experience in using distributed computing architectures such as AWS products (e.g. EC2, Redshift, EMR, and Elasticsearch), Hadoop, Python, Spark and effective use of Azure SQL Database, MapReduce, Hive, SQL and PySpark to solve big data type problems.
- Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, Kafka.
- Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX and Spark Streaming for processing and transforming complex data using in-memory computing capabilities written in Scala. Worked with Spark to improve efficiency of existing algorithms using Spark Context, Spark SQL, Spark MLlib, Data Frame, Pair RDDs and Spark YARN.
- Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
- Experience in Extraction, Transformation and Loading (ETL) data from various sources into Data Warehouses, as well as data processing like collecting, aggregating and moving data from various sources using Apache Flume, Kafka, Power BI and Microsoft SSIS.
- Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
- Expertise in designing complex mappings, performance tuning, and slowly changing dimension tables and fact tables.
- Extensively worked with Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems including flat files.
- Experienced in building automated regression scripts for validation of ETL processes between multiple databases like Oracle, SQL Server, Hive, and MongoDB using Python.
- Proficiency in SQL across several dialects (commonly MySQL, PostgreSQL, Redshift, SQL Server, and Oracle).
- Excellent communication skills. Successfully working in fast-paced multitasking environment both independently and in collaborative team, a self-motivated enthusiastic learner.
- Developed Spark applications in Python (PySpark) on a distributed environment to load a huge number of CSV files with different schemas into Hive ORC tables.
- Good knowledge of Data Marts, OLAP, Dimensional Data Modeling with Ralph Kimball Methodology (Star Schema Modeling, Snowflake Modeling for Fact and Dimension tables) using Analysis Services.
- Ability to work effectively in cross-functional team environments, excellent communication, and interpersonal skills.
TECHNICAL SKILLS
Big Data Tools: Hadoop Ecosystem, MapReduce, Spark 2.3, Airflow 1.10.8, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: SQL, PL/SQL, and UNIX.
Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
Cloud Platform: AWS, Azure, Google Cloud.
Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena
Databases: Oracle, SQL Server, MySQL, MongoDB, Cassandra, HBase, Teradata R15/R14.
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.
Operating System: Windows, Unix, Sun Solaris
PROFESSIONAL EXPERIENCE
Confidential, Deerfield Beach, FL
Sr. Big Data Engineer
Responsibilities:
- Installing, configuring and maintaining Data Pipelines
- Transforming business problems into Big Data solutions and defining Big Data strategy and roadmap.
- Designing the business requirement collection approach based on the project scope and SDLC methodology.
- Developed a data platform from scratch and took part in the requirement gathering and analysis phase of the project, documenting the business requirements.
- Designed and implemented multiple ETL solutions with various data sources using extensive SQL scripting, ETL tools, Python, shell scripting and scheduling tools. Data profiling and data wrangling of XML, web feeds and file handling using Python, Unix and SQL.
- Loading data from different sources to a data warehouse to perform data aggregations for business intelligence using Python.
- Designed and implemented Sqoop for the incremental job to read data from DB2 and load to Hive tables, and connected to Tableau for generating interactive reports using HiveServer2.
- Used Sqoop to channel data from different sources of HDFS and RDBMS.
- Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
- Authoring Python (PySpark) scripts for custom UDFs for row/column manipulations, merges, aggregations, stacking, data labelling and for all cleaning and conforming tasks.
- Writing Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Develop solutions to leverage ETL tools and identify opportunities for process improvements using Informatica and Python
- Conduct root cause analysis and resolve production problems and data issues
- Performance tuning, code promotion and testing of application changes
- Conduct performance analysis and optimize data processes. Make recommendations for continuous improvement of the data processing environment.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation from multiple file formats.
- Used SSIS to build automated multi-dimensional cubes.
- Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Python and NoSQL databases such as HBase and Cassandra.
- Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Prepared and uploaded SSRS reports. Managed database and SSRS permissions.
- Used Apache NiFi to copy data from local file system to HDP. Thorough understanding of various modules of AML including Watch List Filtering, Suspicious Activity Monitoring, CTR, CDD, and EDD.
- Designed and architected scalable data processing and analytics solutions, including technical feasibility, integration, development for Big Data storage, processing and consumption of Azure data, analytics, big data (Hadoop, Spark), business intelligence (Reporting Services, Power BI), NoSQL, HDInsight, Stream Analytics, Data Factory, Event Hubs and Notification Hubs.
- Used SQL Server Management Tool to check the data in the database against the requirements given.
- Validated the test data in DB2 tables on Mainframes and on Teradata using SQL queries.
- Developed and deployed the outcome using Spark and Scala code in a Hadoop cluster running on GCP.
- Identified and documented Functional/Non-Functional and other related business decisions for implementing Actimize-SAM to comply with AML Regulations.
- Worked with region and country AML Compliance leads to support start-up of compliance-led projects at regional and country levels, including defining the subsequent phases: training, UAT, staff to perform test scripts, data migration, the uplift strategy (updating customer information to bring it up to the new KYC standards), and review of customer documentation.
- Description of end-to-end development of Actimize models for trading compliance solutions of the project bank.
- Automated and scheduled recurring reporting processes using UNIX shell scripting and Teradata utilities such as MLOAD, BTEQ and FastLoad.
- Implemented Actimize Anti-Money Laundering (AML) system to monitor suspicious transactions and enhance regulatory compliance.
- Worked on Dimensional and Relational Data Modelling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modelling using Erwin.
- Automated data processing with Oozie to load data into the Hadoop Distributed File System.
- Developed automated regression scripts for validation of ETL processes between multiple databases like AWS Redshift, Oracle, MongoDB, T-SQL, and SQL Server using Python.
Environment: Cloudera Manager (CDH5), Hadoop, PySpark, HDFS, NiFi, Pig, Hive, AWS, S3, Kafka, Azure, GCP, Scrum, Git, Sqoop, Oozie, Informatica, Tableau, OLTP, OLAP, HBase, Cassandra, SQL Server, Python, Shell Scripting, XML, Unix.
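Illustrative sketch: the ETL validation regression scripts mentioned in this role could take roughly the following shape. This is a minimal example under stated assumptions, not the scripts actually used. It uses in-memory SQLite databases as stand-ins for the real source and target systems, and the `customers` table, its columns, and the helper names (`table_fingerprint`, `compare_tables`, `make_db`) are hypothetical.

```python
import sqlite3


def table_fingerprint(conn, table, key_column):
    """Return (row_count, rows ordered by key) for a table.

    Table and column names are interpolated into the SQL, so they must come
    from trusted configuration, never from user input.
    """
    rows = conn.execute(
        f"SELECT * FROM {table} ORDER BY {key_column}"
    ).fetchall()
    return len(rows), rows


def compare_tables(source_conn, target_conn, table, key_column):
    """Compare one table across two connections and report mismatches."""
    src_count, src_rows = table_fingerprint(source_conn, table, key_column)
    tgt_count, tgt_rows = table_fingerprint(target_conn, table, key_column)
    missing = sorted(set(src_rows) - set(tgt_rows))
    unexpected = sorted(set(tgt_rows) - set(src_rows))
    return {
        "source_count": src_count,
        "target_count": tgt_count,
        "missing_in_target": missing,
        "unexpected_in_target": unexpected,
        "passed": src_count == tgt_count and not missing and not unexpected,
    }


def make_db(rows):
    """Build an in-memory stand-in database with a hypothetical customers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    conn.commit()
    return conn


# The target is deliberately missing one row to show a failing check.
source = make_db([(1, "Ann"), (2, "Bob"), (3, "Cy")])
target = make_db([(1, "Ann"), (2, "Bob")])
report = compare_tables(source, target, "customers", "id")
print(report)
```

Comparing full row sets only suits small tables; for large tables a real script would more likely compare row counts and per-column aggregates or checksums computed inside each database.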
Confidential, Boise, ID
Sr. Data Engineer
Responsibilities:
- Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
- Designed several DAGs (Directed Acyclic Graph) for automating ETL pipelines
- Performed data extraction, transformation, loading, and integration in data warehouse, operational data stores and master data management
- Leveraged cloud and GPU computing technologies for automated machine learning and analytics pipelines, such as AWS, GCP
- Worked on confluence and Jira
- Designed and implemented configurable data delivery pipeline for scheduled updates to customer facing data stores built with Python
- Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling
- Compiled data from various sources to perform complex analysis for actionable results
- Experience in working with different join patterns and implemented both Map and Reduce Side Joins.
- Wrote Flume configuration files for importing streaming log data into HBase.
- Imported several transactional logs from web servers with Flume to ingest the data into HDFS.
- Used Flume and a spool directory for loading the data from the local file system (LFS) to HDFS.
- Strong understanding of AWS components such as EC2 and S3
- Responsible for data services and data movement infrastructures
- Experienced in ETL concepts, building ETL solutions and Data modeling
- Worked on architecting the ETL transformation layers and writing Spark jobs to do the processing.
- Designed and built infrastructure for the Google Cloud environment from scratch
- Experienced in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension)
- Installed and configured Pig and wrote Pig Latin scripts to convert the data from text files to Avro format.
- Created partitioned Hive tables and worked on them using HiveQL.
- Worked on continuous Integration tools Jenkins and automated jar files at end of day.
- Worked with Tableau, integrated Hive with Tableau Desktop reports and published them to Tableau Server.
- Developed MapReduce programs in Java for parsing the raw data and populating staging tables.
- Experience in setting up the whole app stack, and in setting up and debugging Logstash to send Apache logs to AWS Elasticsearch.
- Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster processing of data.
- Extracted, transformed and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Tested Apache Tez for building high performance batch and interactive data processing applications on Pig and Hive jobs.
- Measured Efficiency of Hadoop/Hive environment ensuring SLA is met
- Implemented a Continuous Delivery pipeline with Docker, GitHub and AWS
- Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and other Agile methodologies
- Collaborate with team members and stakeholders in design and development of data environment
- Preparing associated documentation for specifications, requirements, and testing
Environment: Hadoop, Hive, AWS, GCP, BigQuery, HBase, Scala, Flume, Apache Tez, Cloud Shell, Azure Databricks, Docker, Jira, MySQL, Postgres, SQL Server, Python, Spark, Spark SQL
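Illustrative sketch: the DAG-based ETL pipelines mentioned in this role rely on running tasks in dependency order. The example below shows that idea with Python's standard-library `graphlib` rather than Airflow itself, so it is a simplified stand-in and not an Airflow DAG definition. The task names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_join": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform_join"},
    "refresh_reports": {"load_warehouse"},
}


def execution_order(dag):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)
```

A scheduler such as Airflow adds retries, scheduling intervals and monitoring on top of this ordering, and can run independent tasks (here the two extract steps) in parallel.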
Confidential, Bedford, TX
Data Engineer/ Analyst
Responsibilities:
- Responsibilities include gathering business requirements, developing strategy for data cleansing and data migration, writing functional and technical specifications, creating source to target mapping, designing data profiling and data validation jobs in Informatica, and creating ETL jobs in Informatica.
- The new Business Data Warehouse (BDW) improved query/report performance, reduced the time needed to develop reports and established a self-service reporting model in Cognos for business users.
- Implemented Bucketing and Partitioning using Hive to assist the users with data analysis.
- Used Oozie scripts for deployment of the application and Perforce as the secure versioning software.
- Implemented Partitioning, Dynamic Partitions, Buckets in Hive.
- Develop database management systems for easy access, storage, and retrieval of data.
- Perform DB activities such as indexing, performance tuning, and backup and restore.
- Expertise in writing Hadoop Jobs for analysing data using Hive QL (Queries), Pig Latin (Data flow language), and custom MapReduce programs in Java.
- Worked on Hadoop cluster which ranged from 4-8 nodes during pre-production stage and it was sometimes extended up to 24 nodes during production.
- Built APIs that will allow customer service representatives to access the data and answer queries.
- Designed changes to transform current Hadoop jobs to HBase.
- Handled fixing of defects efficiently and worked with the QA and BA team for clarifications.
- Responsible for Cluster maintenance, Monitoring, commissioning and decommissioning Data nodes, Troubleshooting, Manage and review data backups, Manage & review log files.
- Extending the functionality of Hive with custom UDFs and UDAFs.
- Did various performance optimizations like using distributed cache for small datasets, partitioning and bucketing in Hive, and Map Side joins.
- Expert in creating Hive UDFs using Java to analyse the data efficiently.
- Responsible for loading the data from BDW Oracle database, Teradata into HDFS using Sqoop.
- Implemented AJAX, JSON, and JavaScript to create interactive web screens.
- Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB. Involved in loading and transforming large sets of structured, semi-structured and unstructured data and analysed them by running Hive queries. Processed the image data through the Hadoop distributed system using Map and Reduce, then stored it in HDFS.
- Created Session Beans and controller Servlets for handling HTTP requests from Talend
- Performed Data Visualization and designed dashboards with Tableau, and generated complex reports including charts, summaries, and graphs to interpret the findings for the team and stakeholders.
- Wrote documentation for each report including purpose, data source, column mapping, transformation, and user group.
- Utilized Waterfall methodology for team and project management
- Used Git for version control with the Data Engineer team and Data Scientist colleagues. Created Tableau dashboards using stacked bars, bar graphs, scatter plots, geographical maps, Gantt charts, etc. using the Show Me functionality, and built dashboards and stories as needed using Tableau Desktop and Tableau Server.
- Performed statistical analysis using SQL, Python, R Programming and Excel.
- Worked extensively with Excel VBA Macros, Microsoft Access Forms
- Import, clean, filter and analyse data using tools such as SQL, HIVE and PIG.
- Used Python and SAS to extract, transform and load source data from transaction systems, generated reports, insights, and key conclusions.
- Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, which allowed end users to understand the data on the fly with the usage of quick filters for on-demand information.
- Analysed and recommended improvements for better data consistency and efficiency
- Designed and developed data mapping procedures for the ETL data extraction, data analysis and loading process for integrating data using R programming.
- Effectively communicated plans, project status, project risks and project metrics to the project team; planned test strategies in accordance with project scope.
Environment: Cloudera CDH4.3, Hadoop, Pig, Hive, Informatica, HBase, MapReduce, HDFS, Sqoop, Impala, SQL, Tableau, Python, SAS, Flume, JavaScript, Oozie, Linux, NoSQL, MongoDB, Talend, Git.
Confidential
Data Engineer
Responsibilities:
- Experience in Big Data Analytics and design in Hadoop ecosystem using MapReduce Programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka
- Built the Oozie pipeline, which performs several actions: the file move process, Sqooping the data from the source Teradata or SQL and exporting it into the Hive staging tables, performing aggregations as per business requirements, and loading into the main tables.
- Running of Apache Hadoop, CDH and MapR distros, dubbed Elastic MapReduce (EMR), on EC2.
- Performing the forking action whenever there is scope for parallel processing, to optimize data latency.
- Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
- Wrote a Pig script which picks the data from one HDFS path, performs aggregation and loads it into another path, which later populates another domain table. Converted this script into a jar and passed it as a parameter in the Oozie script.
- Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity. Built an ETL which utilizes a Spark jar that executes the business analytical model.
- Hands-on experience with Git Bash commands: git pull to pull the code from source before developing it per the requirements, git add to add files, git commit after the code build, and git push to the pre-prod environment for code review; later used screwdriver.yaml, which builds the code and generates artifacts that are released into production.
- Created logical data model from the conceptual model and converted it into the physical database design using Erwin. Involved in transforming data from legacy tables to HDFS and HBase tables using Sqoop.
- Connected to AWS Redshift through Tableau to extract live data for real time analysis.
- Developed data mapping, transformation and cleansing rules for the Data Management involving OLTP and OLAP.
- Involved in creating UNIX shell scripts. Defragmentation of tables, partitioning, compression and indexing for improved performance and efficiency.
- Developed reusable objects like PL/SQL program units and libraries, database procedures and functions, and database triggers to be used by the team and to satisfy the business rules.
- Used SQL Server Integration Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources
- Developed and implemented R and Shiny application which showcases machine learning for business forecasting. Developed predictive models using Python & R to predict customers churn and classification of customers.
- Partner with infrastructure and platform teams to configure, tune tools, automate tasks and guide the evolution of internal big data ecosystem; serve as a bridge between data scientists and infrastructure/platform teams.
- Implemented Big Data Analytics and Advanced Data Science techniques to identify trends, patterns, and discrepancies on petabytes of data by using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, MapReduce, and Azure Machine Learning.
- Data analysis using regressions, data cleaning, Excel VLOOKUP, histograms and TOAD client, with data representation of the analysis and suggested solutions for investors.
- Rapid model creation in Python using pandas, NumPy, sklearn, and plot.ly for data visualization. These models are then implemented in SAS, where they are interfaced with MSSQL databases and scheduled to update on a timely basis.
Environment: MapReduce, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka, JSON, XML, PL/SQL, SQL, HDFS, Unix, Python, SAS, PySpark, Redshift, Azure, Shell Scripting.
Confidential
Data & Reporting Analyst
Responsibilities:
- Imported Legacy data from SQL Server and Teradata into Amazon S3.
- Created consumption views on top of metrics to reduce the running time for complex queries.
- Exported Data into Snowflake by creating Staging Tables to load Data of different files from Amazon S3.
- Compared the data at leaf level across various databases when data transformation or data loading takes place, analyzing data quality after these loads to look for any data loss or data corruption.
- As a part of Data Migration, wrote many SQL scripts to detect data mismatches and worked on loading the history data from Teradata SQL to Snowflake.
- Developed SQL scripts to upload, retrieve, manipulate and handle sensitive data (National Provider Identifier data, i.e. Name, Address, SSN, Phone No) in Teradata, SQL Server Management Studio and Snowflake databases for the project.
- Retrieved the data from FS to S3 using Spark commands.
- Built S3 buckets and managed policies for S3 buckets and used S3 bucket and Glacier for storage and backup on AWS
- Created performance dashboards in Tableau/Excel/PowerPoint for the key stakeholders
- Incorporated predictive modelling (rule engine) to evaluate the Customer/Seller health score using Python scripts, performed computations and integrated with the Tableau viz.
- Worked with stakeholders to communicate campaign results, strategy, issues or needs.
- Analysed marketing campaigns from various perspectives including CTR, conversion rates, seasonal/geographical trends, search queries, landing page, conversion funnel, quality score, competitors, distribution channel, etc. to achieve maximum ROI for clients.
- Worked with the business to identify the gaps in mobile tracking and come up with solutions.
- Analysed click events of Hybrid landing page which includes bounce rate, conversion rate, Jump back rate, List/Gallery view, etc. and provide valuable information for landing page optimization.
- Evaluated the traffic and performance of Daily Deals PLA ads and compared those items with non-daily deal items to see the possibility of increasing ROI. Suggested improvements and modified existing BI components (Reports, Stored Procedures).
- Understood business requirements to the core and came up with a Test Strategy based on business rules
- Prepared Test Plan to ensure QA and Development phases are in parallel
- Wrote and executed Test Cases and reviewed them with Business & Development Teams.
- Implemented Defect Tracking process using JIRA tool by assigning bugs to Development Team
- Automated Regression tool (Qute) and reduced manual effort and increased team productivity
- Involved in Functional Testing, Integration Testing, Regression Testing, Smoke Testing and Performance Testing. Tested Hadoop MapReduce jobs developed in Python, Pig and Hive.
- Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
- Generated custom SQL to verify the dependencies for the daily, weekly and monthly jobs.
- Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
- Experienced in working with the Spark ecosystem using Spark SQL and Scala queries on different formats like text files and CSV files.
- Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
- Closely involved in scheduling daily and monthly jobs with preconditions/postconditions based on the requirement.
- Monitor the daily, weekly and monthly jobs and provide support in case of failures/issues.
Environment: Hadoop, MapReduce, AWS, Snowflake, AWS S3, GitHub, Service Now, HP Service Manager, Jira, EMR, Nebula, Teradata, SQL Server, Apache Spark, Sqoop.
