- Congenial consultant with teamwork experience in pipeline development and database system design offers efficacy in communication, collaboration, and innovation.
- More than 8 years working as a versatile programmer combining professional experience in data science with academic training in computer science and mathematics to deliver analytical skills in Hadoop big data engineering.
- Around 8+ years of experience in Big Data frameworks.
- Proven success in team leadership, focusing on mentoring team members and managing tasks for efficiency.
- Worked with various stakeholders to gather requirements and create as-is and as-was dashboards.
- Recommended and used various best practices to improve dashboard performance for Tableau server users.
- Expert with the design of custom reports using data extraction and reporting tools, and development of algorithms based on business cases.
- Extensively used Python libraries: PySpark, Pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy, and NumPy.
- Strong fundamentals of the SQL data model.
- Hands-on experience in performance tuning and reporting optimization using methods such as Extracts, Context filters, efficient calculations, Data source filters, Indexing, and Partitioning over SQL.
- Used to working in production environments, managing migrations, installations, and development.
- Knowledge of the Cloudera platform and Apache Hadoop version 0.20.
- Very good exposure in OLAP and OLTP.
- Created dashboards in Tableau using various features of Tableau like Custom-SQL, Multiple Tables, Blending, Extracts, Parameters, Filters, Calculations, Context Filters, Data source filters, Hierarchies, Filter Actions, Maps, etc.
- Modified existing and added new functionalities to Financial and Strategic summary dashboards.
- Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
- Understanding of the Hadoop Architecture and its ecosystem such as HDFS, YARN, MapReduce, Sqoop, Avro, Spark, Hive, HBase, Flume, and Zookeeper
- Experienced in building automated regression scripts for validating ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.
- Strong SQL skills to query data for validation, reporting, and dashboarding.
- Worked with Data Lakes and Big Data ecosystems (Hadoop, Spark, Hortonworks, Cloudera).
- Expert with BI tools like Tableau and PowerBI, data interpretation, modeling, data analysis, and reporting with the ability to assist in directing planning based on insights.
- In-depth understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce concepts, with experience writing MapReduce programs on Apache Hadoop to analyze large datasets efficiently.
- Experienced in working with Amazon Web Services (AWS), using EC2 for computing and S3 as a storage mechanism.
- Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
- Hands-on experience in using other Amazon Web Services such as Auto Scaling, Redshift, and DynamoDB.
- In-depth knowledge of Snowflake databases, schemas, and table structures.
- Track record of results as a project manager in an agile methodology using data-driven analytics.
- Experience in developing MapReduce programs using Apache Hadoop to analyze big data as required.
- Experience with the Hadoop ecosystem, big data tools, and database technologies
- Experience in data manipulation, data analysis, and data visualization of structured data, semi-structured data, and unstructured data
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience in designing and developing Azure Stream Analytics jobs to process real-time data using Azure Event Hubs, Azure IoT Hub, and Service Bus Queue.
- Proficiency in SQL across several dialects (commonly MySQL, PostgreSQL, Redshift, SQL Server, and Oracle).
- Experience in designing star and snowflake schemas for data warehouse and ODS architectures.
- Creative skills in developing elegant solutions to challenges related to pipeline engineering
- Knowledge of the Spark Architecture and programming Spark applications
- Ability to program in various languages such as Python, Java, C++, and Scala
- Experience in Object-oriented programming and functional programming
- Creates Bash scripts to automate software installation, file management, and data pipelines
- Knowledge in data governance, data operations, computer security, and cryptology
- Coding skills with PySpark, Spark Context, and Spark SQL
- Pipeline development skills with Apache Airflow, Kafka, and NiFi
- Experience working with several images and Docker Engine
- Worked with various programming languages using IDEs like Eclipse, NetBeans, and IntelliJ, along with tools such as PuTTY and Git.
Big Data Tools: Hadoop Ecosystem, MapReduce, Spark 2.3, Airflow 1.10.8, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: SQL, PL/SQL, UNIX shell, PySpark
Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
Cloud Platform: AWS, Azure
Cloud Management: Amazon Web Services (AWS): EC2, EMR, S3, Redshift, Lambda, Athena
Databases: Oracle, SQL Server, MySQL, DB2, Teradata, NoSQL, MongoDB, Cassandra
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.
Operating System: Windows, Unix, Sun Solaris
Senior Data Engineer
- Responsible primarily for looking after debit and credit fraud files in this project.
- Installed, configured, and maintained data pipelines through StreamSets.
- Used the CI/CD tool Bamboo as required in the environment, and used it to write BDR.
- Designed and deployed a Spark cluster and different Big Data analytic tools including Spark, Kafka streaming, AWS and HBase with Cloudera Distribution.
- Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
- Integrated Kafka with streaming ETL and performed the required ETL on it to extract meaningful insights.
- Developed application components interacting with HBase.
- Performed optimizations on Spark/Scala.
- Used the Kafka producer app to publish clickstream events into the Kafka topic and later explored the data with Spark SQL.
- Processed raw data at scale, including writing scripts, web scraping, calling APIs, and writing SQL queries.
- Imported streaming logs and aggregated the data into HDFS and MySQL through Kafka.
- Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, PySpark, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Imported data from sources such as HDFS and MySQL through Sqoop and Kafka, bringing streaming logs into Spark RDDs.
- Performed visualization using SQL integrated with Zeppelin on different input data and created rich dashboards.
- Performed transformations, cleaning, and filtering on imported data using Spark SQL and loaded the final data into HDFS and a MySQL database.
- Involved in production support and enhancement development.
Environment: Hadoop, Scala, HDFS, GitLab, Bamboo, Cloudera, Kafka, Spark, AWS, Redshift, Lambda, Snowflake DB, Tableau, Informatica, Python, Hive, PL/SQL, Oracle, T-SQL, SQL Server, Unix, Shell Scripting.
Senior Data Engineer
Confidential, Rochester, MN
- Designed the business requirement collection approach based on the project scope and SDLC methodology.
- Transformed business problems into Big Data solutions and defined the Big Data strategy and roadmap.
- Developed Data Pipeline with Kafka and Spark
- Defined API security key and other necessary credentials to run Kafka architecture.
- Designed a PoC for Confluent Kafka.
- Developed Kafka consumer API in Scala for consuming data from Kafka topics.
- Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labelling, and all cleaning and conforming tasks.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
- Developed Spark applications using Scala and Python, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Developed solutions that leverage ETL tools and identified opportunities for process improvements using Informatica and Python.
- Contributed to designing the data pipeline with Lambda Architecture.
- Created numerous ODI interfaces and loaded data into Snowflake DB. Worked on Amazon Redshift to consolidate all data warehouses into one data warehouse.
- Handled importing of data from various data sources, performed transformations using Hive, and loaded data into S3 data lakes and Snowflake DB.
- Used Sqoop to channel data from different sources of HDFS and RDBMS.
- Created Tables, Stored Procedures, and extracted data using PL/SQL for business users whenever required.
- Used SSIS to build automated multi-dimensional cubes.
- Used Spark Streaming with Python to receive real-time data from Kafka and store the stream data in HDFS and in NoSQL databases such as HBase and Cassandra.
- Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
- Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Migrated an existing on-premises application to AWS.
- Used AWS services like EC2 and S3 for small data sets processing and storage.
- Maintained the Hadoop cluster on AWS EMR.
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
Environment: Hadoop, HBase, Scala, Sqoop, Cloudera, Kafka, Spark, MapReduce, AWS, Redshift, Lambda, Snowflake DB, Tableau, Informatica, Python, Hive, PL/SQL, Oracle, T-SQL, SQL Server, NoSQL, Cassandra.
Confidential, Weehawken, NJ
- Involved in the complete Big Data flow of the application, from upstream data ingestion into HDFS to processing and analyzing the data in HDFS.
- Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis. Worked with Data Governance and Data Quality to design various models and processes.
- Experience installing Apache Kafka.
- Configured documentations for Kafka to operate effectively.
- Created a Producer application that sends API messages over Kafka.
- Involved in all steps and scope of the project's reference data approach to MDM; created a data dictionary and source-to-target mapping in the MDM data model.
- Worked in a large SAP shop with 6000 servers and several petabytes of data.
- Responsible for storage and capacity management. Analyzed existing systems and proposed improvements in processes and systems, including the use of modern scheduling tools like Airflow and the migration of legacy systems into an enterprise data lake built on Azure Cloud.
- Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics).
- Responsible for migrating an application running on-premises to the Azure cloud.
- Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
- Designed and developed a new solution to process the NRT data using Azure Stream Analytics, Azure Event Hub, and Service Bus Queue.
- Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
- Extensively used SQL Server Import and Export Data tool.
- Used SQL Server Integration Services (SSIS) to extract, transform, and load data from multiple sources into the target system.
- Wrote research reports describing the experiments conducted, results, and findings, and made strategic recommendations to technology, product, and senior management. Worked closely with regulatory delivery leads to ensure robustness in prop trading control frameworks using Hadoop, Python Jupyter Notebook, Hive, and NoSQL.
- Wrote production-level machine learning classification models and ensemble classification models from scratch using Python and PySpark to predict binary values for certain attributes within a given time frame.
- Performed all necessary day-to-day Git support for different projects; responsible for the design and maintenance of the Git repositories and the access control strategies.
Environment: Apache Kafka, Hadoop, HDFS, MapReduce, Hive, PySpark, Scala, Azure Databricks, Azure Data Factory (ADF), Azure Service Bus, Azure Event Hub, Azure Synapse Analytics, Zookeeper, Python, Cucumber, Oracle, SQL Server, NoSQL, Jupyter, OLTP, Unix, Shell Scripting, SSIS, Git.
Confidential, NYC, NY
- Created and executed Hadoop Ecosystem installation and document configuration scripts on GCP.
- Transformed batch data from several tables containing tens of thousands of records from SQL Server, MySQL, PostgreSQL, and CSV file datasets into data frames using PySpark.
- Used Hive to implement data warehouse and stored data into HDFS. Stored data into Hadoop clusters which are set up in AWS EMR.
- Processed the image data through the Hadoop distributed system by using Map and Reduce then stored into HDFS.
- Worked on analyzing Hadoop Cluster and different big data analytic tools including Pig, Hive.
- Wrote real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka.
- Worked on data streaming processes with Kafka, Apache Spark, and Hive.
- Worked with various HDFS file formats such as Avro, SequenceFile, and JSON, and with compression formats such as Snappy and bzip2.
- Created Session Beans and controller Servlets for handling HTTP requests from Talend
- Performed Data Preparation by using Pig Latin to get the right data format needed.
- Used Python with pandas, NLTK, and TextBlob, along with Jenkins, to complete the ETL process of clinical data for future NLP analysis.
- Expertise in writing Hadoop jobs for analyzing data using HiveQL (queries), Pig Latin (data flow language), and custom MapReduce programs in Java.
- Developed a PySpark program that writes data frames to HDFS as Avro files.
- Utilized Spark's parallel processing capabilities to ingest data.
- Configured Flume to extract the data from the web server output files to load into HDFS.
- Used Flume to collect, aggregate and store the web log data from different sources like web servers, mobile and network devices and pushed into HDFS.
- Created PySpark code that uses Spark SQL to generate data frames from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format.
- In charge of PySpark code, creating data frames from tables in data service layer and writing them to a Hive data warehouse.
- Moved raw data between different systems using Apache NiFi.
- Involved in loading data from UNIX file system to HDFS using Shell Scripting.
- Used Elasticsearch for indexing/full text searching.
- Coded and developed a custom Java-based Elasticsearch wrapper client using the Jest API.
- Hands-on experience using AWS services such as EC2, S3, Auto Scaling, and DynamoDB, along with MongoDB, NiFi, and Talend.
- Utilized Airflow to schedule, automatically trigger, and execute the data ingestion pipeline.
Environment: Hadoop, Hive, AWS, PySpark, Cloudera, MapReduce, Apache Kafka, Java, Python, Pandas, Pig, Cassandra, Jenkins, Flume, SQL Server, MySQL, PostgreSQL, MongoDB, DynamoDB, Airflow, Unix, Shell Scripting.
- Installed Hadoop, MySQL, PostgreSQL, SQL Server, Sqoop, Hive, and HBase.
- Created .bashrc files and other XML configurations to automate the deployment of Hadoop VMs over AWS EMR.
- Experience creating and organizing HDFS over a staging area.
- Imported Legacy data from SQL Server and Teradata into Amazon S3.
- Exported Data into Snowflake by creating Staging Tables to load Data of different files from Amazon S3.
- As part of data migration, wrote many SQL scripts to detect data mismatches and worked on loading historical data from Teradata SQL to Snowflake.
- Orchestrated the end-to-end infrastructure for disaster recovery, cost saving, and patching purposes using AWS CloudFormation scripts. Utilized AWS Lambda to run code without managing servers, triggered through S3 and SNS.
- Wrote Python code to manipulate and organize data frames so that all attributes in each field were formatted identically.
- Developed SQL scripts to upload, retrieve, manipulate, and handle sensitive data (National Provider Identifier data, i.e., name, address, SSN, phone number) in Teradata, SQL Server Management Studio, and Snowflake databases for the project.
- Troubleshot RSA SSH keys in Linux for authorization purposes.
- Inserted data from multiple CSV files into MySQL, SQL Server, and PostgreSQL using Spark.
- Utilized Sqoop to import structured data from MySQL, SQL Server, PostgreSQL, and a semi-structured CSV file dataset into an HDFS data lake.
- Created a data service layer of internal tables in Hive for data manipulation and organization.
- Inserted data into DSL internal tables from RAW external tables.
- Achieved business intelligence by creating and analyzing an application service layer in Hive containing internal tables of the data which are also integrated with HBase.
Environment: Hadoop, Hive, HBase, MapReduce, Spark, Sqoop, HDFS, AWS, SSIS, Snowflake, Pandas, MySQL, SQL Server, PostgreSQL, Teradata, Java, Unix, Python, Tableau, Oozie, Git.
Associate Data Engineer
- Utilized pandas to create a data frame.
- Coordinated with teams from Amazon to engage my team in the cloud migration of the applications we support.
- Facilitated planning poker sessions and consolidated estimates for deliverables.
- Maintained burndown charts and CFDs, and consistently tracked sprint velocity.
- Facilitated product backlog refinement and backlog prioritization along with the Product Owner
- Wrote Python code to manipulate and organize data frames so that all attributes in each field were formatted identically.
- Utilized Matplotlib to graph the manipulated data frames for further analysis.
- Graphs provided the data visualization needed to obtain information in a simple form.
- Partnered with ETL developers to ensure that data is well cleaned using Pig and that the data warehouse is up to date for reporting purposes.
- Built S3 buckets and managed policies for S3 buckets and used S3 bucket and Glacier for storage and backup on AWS
- Developed workflows in Oozie to manage and schedule jobs on the Hadoop cluster, triggering daily, weekly, and monthly batch cycles.
- Configured Hadoop tools like Hive, Pig, Zookeeper, Flume, Impala and Sqoop.
- Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
- Queried both Managed and External tables created by Hive using Impala.
- Exported manipulated dataframes to Microsoft Excel and utilized its choropleth map feature.
- Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
- Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
- Generated a report on predictive analytics using Python and Tableau, including visualizations of model performance and prediction results.
- Used Git for version control with colleagues.
Environment: Hadoop, Hive, Pig, Zookeeper, Flume, Impala, Sqoop, Pandas, AWS S3 Buckets, Tableau, Oozie, Spark SQL, PostgreSQL, Git.