- 8 years of professional experience in information technology with an expert hand in the areas of BIG DATA, HADOOP, SPARK, HIVE, IMPALA, SQOOP, FLUME, KAFKA, SQL tuning, ETL development, report development, database development, data modeling and strong knowledge of oracle database architecture.
- Experience inBig Data analytics,Data manipulation, using Hadoop Eco system toolsMap - Reduce, HDFS, Yarn/MRv2, Pig, Hive, HDFS, HBase, Spark, Kafka, Flume, Sqoop, Flume, Oozie, Avro, Sqoop,AWS,Spring Boot, Spark integration with Cassandra, Avro, Solr and Zookeeper.
- Proficiency in multiple databases like MongoDB, Cassandra, MySQL, ORACLE, and MS SQL Server. Worked on different file formats like delimited files, avro, Json and parquet. Docker container orchestration using ECS, ALB and lambda.
- CreatedSnowflake Schemasby normalizing the dimension tables as appropriate, and creating a Sub Dimension named Demographic as a subset to the Customer Dimension.
- Hands on experience in test driven development(TDD),Behavior driven development(BDD)and acceptance test driven development (ATDD)approaches.
- Extensive hands-on experience in using distributed computing architectures such as AWS products (e.g. EC2, Redshift, EMR, and Elastic search), Hadoop, Python, Spark and effective use of Azure SQL Database, MapReduce, Hive, SQL and PySpark to solve big data type problems.
- Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, Kafka.
- Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX and Spark Streaming for processing and transforming complex data using in-memory computing capabilities written in Scala. Worked with Spark to improve efficiency of existing algorithms using Spark Context, Spark SQL, Spark MLlib, Data Frame, Pair RDD's and Spark YARN.
- Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
- Experience in Extraction, Transformation and Loading (ETL) data from various sources into APIes, as well as data processing like collecting, aggregating and moving data from various sources using Apache Flume, Kafka, Power BI and Microsoft SSIS.
- Experience in developing customizedUDF’sin Python to extend Hive and Pig Latin functionality.
- Expertise in designing complex Mappings and have expertise in performance tuning and slowly changing Dimension Tables and Fact tables
- Extensively worked with Teradata utilities Fast export, and Multi Load to export and load data to/from different source systems including flat files.
- Experience in developing data pipelines using AWS services including EC2, S3, Redshift, Glue, Lambda functions, Step functions, Cloud Watch, SNS, Dynamo DB, SQS.
- Managing Database, Azure Data Platform services (Azure Data Lake(ADLS), Data Factory(ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB), SQL Server, Oracle, Data Warehouse etc. Build multiple Data Lakes.
- Extensive experience in Text Analytics, generating data visualizations using R, Python and creating dashboards using tools like Tableau, Power BI.
- Expertise in Java programming and have a good understanding on OOPs, I/O, Collections, Exceptions Handling, Lambda Expressions, Annotations
- Provided full life cycle support to logical/physical database design, schema management and deployment. Adept at database deployment phase with strict configuration management and controlled coordination with different teams.
- Utilized Kubernetes and Docker for the runtime environment for the CI/CD system to build, test, and deploy. Experience in working on creating and running Docker images with multiple micro services.
- Utilized analytical applications like R, SPSS, Rattle and Python to identify trends and relationships between different pieces of data, draw appropriate conclusions and translate analytical findings into risk management and marketing strategies that drive value.
- Experienced in building Automation Regressing Scripts for validation of ETL process between multiple databases like Oracle, SQL Server, Hive, and MongoDB usingPython.
- Proficiency in SQL across several dialects (we commonly write MySQL, PostgreSQL, Redshift, SQL Server, and Oracle)
- Excellent communication skills. Successfully working in fast-paced multitasking environment both independently and in collaborative team, a self-motivated enthusiastic learner.
- Developed spark applications in python (Pyspark) on distributed environment to load huge number of CSV files with different schema in to Hive ORC tables.
- Good knowledge of Data Marts, OLAP, Dimensional Data Modeling with Ralph Kimball Methodology (Star Schema Modeling, Snow-Flake Modeling for FACT and Dimensions Tables) using Analysis Services.
- Ability to work effectively in cross-functional team environments, excellent communication, and interpersonal skills.
Big Data Tools: Hadoop Ecosystem Map Reduce, Spark 2.3, Airflow 1.10.8, Nifi 2, HBase 1.2, Hive 2.3, Pig 0.17 Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
ETL/Data 6a Tools: Informatica 9.6/9.1, and Tableau.
Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
Programming Languages: SQL, PL/SQL, and UNIX.
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
Cloud Platform: AWS, Azure, Databricks, Google Cloud. Databricks.
Cloud Management: Amazon Web Services (AWS)-EC2, EMR, S3, Redshift, EMR, Lambda, Athena
Databases: Oracle, Sql Server, My Sql, MongoDB, Cassandra, Hbase, Teradata R15/R14.
Operating System: Windows, Unix, Sun Solaris
Confidential, Atlanta, GA
Senior Big Data Engineer
- Installing, configuring and maintaining Data Pipelines
- Transforming business problems into Big Data solutions and define Big Data strategy and Roadmap.
- Designing the business requirement collection approach based on the project scope and SDLC methodology.
- Develop a data platform from scratch and took part in requirement gathering and analysis phase of the project in documenting the business requirements.
- Authoring Python (PySpark) Scripts for custom UDF’s for Row/ Column manipulations, merges, aggregations, stacking, data labelling and for all Cleaning and conforming tasks.
- Writing Pig Scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Develop solutions to leverage ETL tools and identify opportunities for process improvements using Informatica and Python
- Conduct root cause analysis and resolve production problems and data issues
- Performance tuning, code promotion and testing of application changes
- Conduct performance analysis and optimize data processes. Make recommendations for continuous improvement of the data processing environment Conduct performance analysis and optimize data processes. Make recommendations for continuous improvement of the data processing environment.
- Developed Spark applications using Pyspark and Spark-SQL for data extraction, transformation and aggregation from multiple file formats.
- Used SSIS to build automated multi-dimensional cubes.
- Used Spark Streaming to receive real time data from the Kafka and store the stream data to HDFS using Python and NoSQL databases such as HBase and Cassandra
- Collected data using Spark Streaming from AWS S3 bucket in near-real-time and performs necessary Transformations and Aggregation on the fly to build the common learner data model and persists the data in HDFS.
- Prepared and uploaded SSRS reports. Manages database and SSRS permissions.
- Used Apache NiFi to copy data from local file system to HDP. Thorough understanding of various modules of AML including Watch List Filtering, Suspicious Activity Monitoring, CTR,CDD, and EDD.
- Designed and architected scalable data processing and analytics solutions, including technical feasibility, integration, development for Big Data storage, processing and consumption of Azure data, analytics, big data (Hadoop, Spark), business intelligence (Reporting Services, Power BI), NoSQL, HDInsight, Stream Analytics, Data Factory, Event Hubs and Notification Hubs.
- Design and implement multiple ETL solutions with various data sources by extensive SQL Scripting, ETL tools, Python, Shell Scripting and scheduling tools. Data profiling and data wrangling of XML, Web feeds and file handling using python, Unix and Sql.
- Loading data from different sources to a data ware house to perform some data aggregations for business Intelligence using python.
- Designed and implemented Sqoop for the incremental job to read data from DB2 and load to Hive tables and connected to Tableau for generating interactive reports using Hive server2.
- Used Sqoop to channel data from different sources of HDFS and RDBMS.
- Files extracted from Hadoop and dropped on daily hourly basis intoS3
- Used SQL Server Management Tool to check the data in the database as compared to the requirement give
- Validated the test data in DB2 tables on Mainframes and on Teradata using SQL queries.
- Develop and deploy the outcome using spark and Scala code in Hadoop cluster running on GCP.
- Identified and documented Functional/Non-Functional and other related business decisions for implementing Actimize-SAM to comply with AML Regulations.
- Work with region and country AML Compliance leads to support start-up of compliance-led projects at regional and country levels. Including defining the subsequent phases training, UAT, staff to perform test scripts, data migration and the uplift strategy (updating of customer information to bring them to the new KYC standards) review of customer documentation.
- Description of End-to-end development of Actimize models for trading compliance solutions of the project bank.
- Automated and scheduled recurring reporting processes using UNIXshellscriptingand Teradata utilities such as MLOAD, BTEQ and Fast Load
- Implemented Actimize Anti-Money Laundering (AML) system to monitor suspicious transactions and enhance regulatory compliance.
- Worked on Dimensional and Relational Data Modelling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modelling using Erwin.
- Automated the data processing with Oozie to automate data loading into the Hadoop Distributed File System.
- Developed Automation Regressing Scripts for validation of ETL process between multiple databases like AWS Redshift, Oracle, MongoDB, T-SQL, and SQL Server usingPython.
Environment: Cloudera Manager (CDH5), Hadoop, Pyspark, HDFS, AWS, S3, Kafka, Azure, GCP, Scrum, Git, Sqoop, Oozie, Pyspark, NiFi, Pig, Hive Informatica, Tableau, Cassandra, Informatica OLTP, OLAP, HBase, SQL Server, Python, Shell Scripting, XML, Unix.
Confidential, Foster City, CA
Sr. Data Engineer
- Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
- Designed several DAGs (Directed Acyclic Graph) for automating ETL pipelines
- Performed data extraction, transformation, loading, and integration in data warehouse, operational data stores and master data management
- Leveraged cloud and GPU computing technologies for automated machine learning and analytics pipelines, such as AWS, GCP
- Worked on confluence and Jira
- Designed and implemented configurable data delivery pipeline for scheduled updates to customer facing data stores built with Python
- Using Flume and Spool directory for loading the data from local system (LFS) to HDFS.
- Strong understanding of AWS components such as EC2 and S3
- Responsible for data services and data movement infrastructures
- Experienced in ETL concepts, building ETL solutions and Data modeling
- Worked on architecting the ETL transformation layers and writing spark jobs to do the processing.
- Designed & build infrastructure for the Google Cloud environment from scratch
- Experienced in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension)
- Installed and configured pig, written Pig Latin scripts to convert the data from Text file to Avro format.
- Created Partitioned Hive tables and worked on them using Hive QL.
- Worked on continuous Integration tools Jenkins and automated jar files at end of day.
- Worked with Tableau and Integrated Hive, Tableau Desktop reports and published to Tableau Server.
- Developed MapReduce programs in Java for parsing the raw data and populating staging Tables.
- Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling
- Compiled data from various sources to perform complex analysis for actionable results
- Experience in working with different join patterns and implemented both Map and Reduce Side Joins.
- Wrote Flume configuration files for importing streaming log data into HBase with Flume.
- Imported several transactional logs from web servers with Flume to ingest the data into HDFS.
- Experience in setting up the whole app stack, setup, and debug log stash to send Apache logs to AWS Elastic search.
- Implemented Spark Scripts using Scala, Spark SQL to access hive tables into Spark for faster processing of data.
- Extract Transform and Load data from Sources Systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL Azure Data Lake Analytics. Data Ingestion to one or more Azure Services - (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processing the data inAzure Databricks.
- Tested Apache Tez for building high performance batch and interactive data processing applications on Pig and Hive jobs.
- Measured Efficiency of Hadoop/Hive environment ensuring SLA is met
- Implemented a Continuous Delivery pipeline with Docker, and Git Hub and AWS
- Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and other Agile methodologies
- Collaborate with team members and stakeholders in design and development of data environment
- Preparing associated documentation for specifications, requirements, and testing
Environment: Hadoop, Hive, AWS, Gcp, Bigquery, Hbase, Scala, Flume, Apache Tez, Cloud Shell, Azure Databricks, Docker, Jira, MySQL, Posgres, Sql Server, Python, Scala, Spark, Hive, Spark-Sql
Confidential, Madison, WI
Data Engineer/ Hadoop Engineer
- Responsibilities include gathering business requirements, developing strategy for data cleansing and data migration, writing functional and technical specifications, creating source to target mapping, designing data profiling and data validation jobs in Informatica, and creating ss jobs in Informatica.
- The new Business Data Warehouse (BDW) improved query/report performance, reduced the time needed to develop reports and established self-service reporting model in Cognos for business users.
- Implemented Bucketing and Partitioning using hive to assist the users with data analysis.
- Used Oozie scripts for deployment of the application and perforce as the secure versioning software.
- Implemented Partitioning, Dynamic Partitions, Buckets in HIVE.
- Designed changes to transform current Hadoop jobs to HBase.
- Handled fixing of defects efficiently and worked with the QA and BA team for clarifications.
- Responsible for Cluster maintenance, Monitoring, commissioning and decommissioning Data nodes, Troubleshooting, Manage and review data backups, Manage & review log files.
- Extending the functionality of Hive with custom UDF s and UDAF's.
- Did various performance optimizations like using distributed cache for small datasets, Partition, Bucketing in the hive and Map Side joins.
- Expert in creating Hive UDFs using Java to analyse the data efficiently.
- Responsible for loading the data from BDW Oracle database, Teradata into HDFS using Sqoop.
- Implemented AJAX, JSON, and Java script to create interactive web screens.
- Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB.Involved in loading and transforming large sets of Structured, Semi-Structured and Unstructured data and analysed them by running Hive queries. Processed the image data through the Hadoop distributed system by using Map and Reduce then stored into HDFS.
- Created Session Beans and controller Servlets for handling HTTP requests from Talend
- Performed Data Visualization and Designed Dashboards with Tableau and generated complex reports including chars, summaries, and graphs to interpret the findings to the team and stakeholders.
- Wrote documentation for each report including purpose, data source, column mapping, transformation, and user group.
- Develop database management systems for easy access, storage, and retrieval of data.
- Perform DB activities such as indexing, performance tuning, and backup and restore.
- Expertise in writing Hadoop Jobs for analysing data using Hive QL (Queries), Pig Latin (Data flow language), and custom MapReduce programs in Java.
- Worked on Hadoop cluster which ranged from 4-8 nodes during pre-production stage and it was sometimes extended up to 24 nodes during production.
- Built APIs that will allow customer service representatives to access the data and answer queries.
- Utilized Waterfall methodology for team and project management
- Used Git for version control with -am and Data Scientists colleagues. Involved in creating CreatedTableaudashboards using stack bars, bar graphs, scattered plots, geographical maps, Gantt charts etc. using show me functionality.Dashboards and stories as needed usingTableauDesktop andTableauServer
- Performed statistical analysis using SQL, Python, R Programming and Excel.
- Worked extensively with Excel VBA Macros, Microsoft Access Forms
- Import, clean, filter and analyse data using tools such as SQL, HIVE and PIG.
- Used Python& SAS to extract, transform & load source data from transaction systems, generated reports, insights, and key conclusions.
- Developed story telling dashboards inTableauDesktop and published them on toTableauServer which allowed end users to understand the data on the fly with the usage of quick filters for on demand needed information.
- Analysed and recommended improvements for better data consistency and efficiency
- Designed and Developeddata mapping procedures ETL-Data Extraction,Data Analysis and Loading process for integratingdata using R programming.
- Effectively Communicated plans, project status, project risks and project metrics to the project team planned test strategies in accordance with project scope.
Environment: Cloudera CDH4.3, Hadoop, Pig, Hive, HDFS, Sqoop, Impala, SQL, Tableau, Python, Informatica, Hbase, MapReduce, SAS, Flume, Java script, Oozie, Linux, No SQL, MongoDB, Talend, Git.
- Experience in Big Data Analytics and design in Hadoop ecosystem using MapReduce Programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka
- Build the Oozie pipeline which performs several actions like file move process, Sqoop the data from the source Teradata or SQL and exports into the hive staging tables and performing aggregations as per business requirements and loading into the main tables.
- Hands on experiences on git bash commands like git pull to pull the code from source and developing it as per the requirements, git add to add files, git commit after the code build and git push to the pre prod environment for the code review and later used screwdriver. yaml which actually build the code, generates artifacts which releases in to production
- Created logical data model from the conceptual model and its conversion into the physical database design using Erwin. Involved in transforming data from legacy tables toHDFS, andHBasetables usingSqoop.
- Connected to AWS Redshift through Tableau to extract live data for real time analysis.
- Developed Data mapping, Transformation and Cleansing rules for the Data Management involving OLTP and OLAP.
- Involved in creating UNIX shell Scripting. Defragmentation of tables, partitioning, compressing and indexes for improved performance and efficiency.
- Developed reusable objects like PL/SQL program units and libraries, database procedures and functions, database triggers to be used by the team and satisfying the business rules.
- Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources
- Developed and implemented R and Shiny application which showcases machine learning for business forecasting. Developed predictive models using Python & R to predict customers churn and classification of customers.
- Partner with infrastructure and platform teams to configure, tune tools, automate tasks and guide the evolution of internal big data ecosystem; serve as a bridge between data scientists and infrastructure/platform teams.
- Running of Apache Hadoop, CDH and Map-R distros, dubbedElastic MapReduce(EMR)on(EC2).
- Performing the forking action whenever there is a scope of parallel process for optimization of data latency.
- Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
- Performed pig script which picks the data from one Hdfs path and performs aggregation and loads into another path which later pulls populates into another domain table. Converted this script into a jar and passed as parameter in Oozie script
- Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity. Build an ETL which utilizes spark jar inside which executes the business analytical model.
- Implemented Big Data Analytics and Advanced Data Science techniques to identify trends, patterns, and discrepancies on petabytes of data by using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, MapReduce, and Azure Machine Learning.
- Data analysis using regressions, data cleaning, excel v-look up, histogramzs and TOAD client and data representation of the analysis and suggested solutions for investors
- Rapid model creation in Python using pandas, NumPy, sklearn, and plot.ly for data visualization. These models are then implemented in SAS where they are interfaced with MSSQL databases and scheduled to update on a timely basis.
Environment: MapReduce, Spark, Hive, Pig, Sqoop, HBase, XML PL/SQL, Sql, HDFS, Unix, Python, SAS, PySpark, Redshift, Azure, Oozie, Impala, Kafka, JSON Shell Scripting.
Data Engineer/ Analyst
- Imported Legacy data from SQL Server and Teradata into Amazon S3.
- Created consumption views on top of metrics to reduce the running time for complex queries.
- Exported Data into Snowflake by creating Staging Tables to load Data of different files from Amazon S3.
- Compare the data in a leaf level process from various databases when data transformation or data loading takes place. I need to analyze and look into the data quality when these types of loads are done (To look for any data loss, data corruption).
- As a part of Data Migration, wrote many SQL Scripts for Mismatch of data and worked on loading the history data from Teradata SQL to snowflake.
- Developed SQL scripts to Upload, Retrieve, Manipulate and handle sensitive data (National Provider Identifier Data I.e. Name, Address, SSN, Phone No) in Teradata, SQL Server Management Studio and Snowflake Databases for the Project
- Evaluated the traffic and performance of Daily deals PLA ads and compare those items with non-daily deal items to see the possibility of increasing ROI. suggested improvements and modify existing BI components (Reports, Stored Procedures)
- Understood Business requirements to the core and Came up with Test Strategy based on Business rules
- Prepared Test Plan to ensure QA and Development phases are in parallel
- Written and executed Test Cases and reviewed with Business & Development Teams.
- Implemented Defect Tracking process using JIRA tool by assigning bugs to Development Team
- Automated Regression tool (Qute) and reduced manual effort and increased team productivity
- Involved in Functional Testing, Integration testing, Regression Testing, Smoke testing and performance Testing. Tested Hadoop MapReduce developed in python, pig, Hive
- Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
- Generated Custom SQL to verify the dependency for the daily, Weekly, Monthly jobs.
- Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
- Experienced in working with spark ecosystem using Spark SQL and Scala queries on different formats like text file, CSV file.
- Developed spark code and spark-SQL/streaming for faster testing and processing of data.
- Worked on to retrieve the data from FS to S3 using spark commands
- Built S3 buckets and managed policies for S3 buckets and used S3 bucket and Glacier for storage and backup on AWS
- Created performance dashboards in Tableau/ Excel / Power point for the key stakeholders
- Incorporated predictive modelling (rule engine) to evaluate the Customer/Seller health score using python scripts, performed computations and integrated with the Tableau viz.
- Worked with stakeholders to communicate campaign results, strategy, issues or needs.
- Analyzed marketing campaigns from various perspectives including CTR, conversion rates, seasonal/geographical trends, search queries, landing page, conversion funnel, quality score, competitors, distribution channel, etc. to achieve maximum ROI for clients.
- Worked with business to identify the gaps in mobile tracking and come up with the solution to solve.
- Analyzed click events of Hybrid landing page which includes bounce rate, conversion rate, Jump back rate, List/Gallery view, etc. and provide valuable information for landing page optimization.
- Closely involved in scheduling Daily, Monthly jobs with Precondition/Post condition based on the requirement.
- Monitor the Daily, Weekly, Monthly jobs and provide support in case of failures/issues.
Environment: Hadoop, MapReduce, AWS, Snowflake, AWS S3, Apache Spark, Sqoop. GitHub, Service Now, HP Service Manager, Jira, EMR, Nebula, Teradata, SQL Server,