- 9+ years of industry experience in IT, with a solid understanding of Data Engineering, Analytics, Data Modeling, Data Analysis, and evaluating data sources, and a strong understanding of Big Data technologies (Hadoop framework, HDFS, Hive, Sqoop, Spark, Kafka, Cassandra, Oozie, MapReduce), Data Warehouse/Data Mart design, AWS (S3, Redshift, Athena, RDS, Glue, SNS), ETL, BI, OLAP, OLTP, and client/server applications.
- Excellent understanding of Hadoop architecture and its components, such as HDFS, Job Tracker, Task Tracker, NameNode, DataNode, and the MapReduce programming paradigm.
- Expert in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
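As an illustration of the partitioning and bucketing approach above, a minimal HiveQL sketch (table and column names are hypothetical):

```sql
-- Partitioning by date limits scans to the requested partitions;
-- bucketing on user_id speeds up joins and sampling.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS PARQUET;

-- Partition pruning: only the named partition is read.
SELECT COUNT(*) FROM page_views WHERE view_date = '2020-01-01';
```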
- Experienced in dimensional and relational data modeling using ER/Studio, ERwin, and Sybase PowerDesigner: Star and Snowflake schema modeling, fact and dimension tables, and conceptual, logical, and physical data models.
- Strong experience using Spark Streaming, Spark SQL, and other Spark features such as accumulators, broadcast variables, the different caching levels, and optimization techniques for Spark jobs.
- Good experience working on the AWS Big Data/Hadoop ecosystem in data lake implementations.
- Strong hands-on experience with AWS services, including but not limited to EMR, S3, EC2, Route 53, RDS, ELB, DynamoDB, Glue, SNS, SQS, and CloudFormation, along with hands-on experience with Redshift Spectrum and the AWS Athena query service for reading data from S3.
- Excellent understanding of ecosystems with Cloudera CDH.x, Hortonworks HDP.x, and MapR, and experience executing both real-time and batch jobs over data streams with Spark Streaming.
- Experience with job/workflow scheduling and monitoring tools such as Oozie, AWS Data Pipeline, and Autosys.
- Extensively worked with Spark on clusters using Scala/Python for analytics; installed Spark on top of Hadoop and built advanced analytical applications combining Spark with Hive and SQL/Oracle/Snowflake.
- Good knowledge of and experience with NoSQL databases such as HBase, Cassandra, and MongoDB, and SQL databases such as Teradata, Oracle, PostgreSQL, and SQL Server.
- Extensive experience with normalization (1NF, 2NF, 3NF, and BCNF) and denormalization techniques for improved database performance in OLTP, OLAP, and Data Warehouse/Data Mart environments.
- Experienced in designing standards for using normalized, denormalized, and dimensional data structures, and in defining common design patterns for modeling various types of relationships.
- Experienced in batch processes, import/export, backup, database monitoring tools, and application support; experienced in big data analysis and developing data models using Hive, Pig, MapReduce, and SQL, with strong data architecture skills for designing data-centric solutions.
- Experienced with Teradata SQL queries, Teradata indexes, and utilities such as MultiLoad, TPump, FastLoad, and FastExport.
- Excellent at performing data transfers between SAS and various databases and data file formats such as XLS, CSV, DBF, and MDB.
- Experienced in data scrubbing/cleansing, data quality, data mapping, data profiling, and data validation in ETL.
- Excellent knowledge of the Ralph Kimball and Bill Inmon approaches to data warehousing.
- Excellent experience and knowledge in developing Informatica mappings, mapplets, sessions, workflows, and worklets for data loads from sources such as Oracle, flat files, DB2, and SQL Server.
- Experienced in writing UNIX shell scripts, with hands-on experience scheduling shell scripts using Control-M.
Big Data Technologies: Hadoop, HDFS, Hive, Pig, HBase, Sqoop, Flume, Airflow (DAGs), Kafka, and Spark.
Cloud Technologies: AWS S3, Redshift, RDS, Glue, Lambda, SNS, SQS, Athena, Kinesis.
Analysis and Modeling Tools: ERwin r9.6/r9.5/r9.1, Sybase PowerDesigner, Oracle Designer, BPwin, ER/Studio, MS Access 2000, Star and Snowflake schema modeling, fact and dimension tables, Pivot Tables.
ETL Tools: Informatica, Talend, and SSIS.
Programming Languages: Java, Python, PySpark, SQL, and Scala.
Database Tools: Microsoft SQL Server 2014/2012/2008, Teradata, MS Access, PostgreSQL, Netezza, and Oracle.
Reporting and Visualization Tools: Tableau, QuickSight, and SSRS.
Versioning and CI/CD Tools: Git, Bitbucket, Jenkins, and Docker.
Tools & Software: TOAD, MS Office, BTEQ, Teradata 15/14.1/14/13.1/13, SQL Assistant.
Sr. Data Engineer
Confidential, Reston, VA
- Worked on setting up AWS DMS and SNS for data transfer and replication, and used SQL on the new AWS databases such as Redshift and Relational Database Service (RDS).
- Generated ETL scripts to transform, flatten, and enrich data from source to target using AWS Glue, and created event-driven ETL pipelines with AWS Glue.
- Enabled and configured Hadoop services such as HDFS, YARN, Hive, HBase, Kafka, Sqoop, Zeppelin notebooks, and Spark/Spark2, and analyzed log data with Apache Spark to predict errors.
- Predominantly used Python, AWS (Amazon Web Services), and MySQL, along with NoSQL (MongoDB) databases, to meet end requirements and build scalable real-time systems.
- Designed and implemented the ETL process using Talend Enterprise Big Data Edition to load data from source to target databases.
- Responsible for developing a data pipeline on Amazon AWS to extract data from weblogs and store it in HDFS, and worked extensively with Sqoop to import metadata from Oracle.
- Executed Hive queries on Parquet tables to perform data analysis that met business requirements, and imported and exported data between Oracle/DB2 and HDFS/Hive using Sqoop.
- Used the Spark Streaming APIs to perform the necessary transformations and actions on the fly for building the common learner data model, which gets data from Kafka in near real time and persists it into Cassandra.
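The per-event transformation in a pipeline like the one above can be sketched in plain Python (the event layout and field names here are hypothetical; the real model's columns came from the project's Cassandra schema):

```python
import json

def to_learner_record(raw_msg: bytes) -> dict:
    """Flatten one Kafka event (JSON bytes) into a row for the learner model."""
    event = json.loads(raw_msg)
    return {
        "learner_id": event["user"]["id"],
        "course_id": event["course"],
        "event_type": event.get("type", "unknown"),  # default when the producer omits it
        "ts": event["timestamp"],
    }

msg = b'{"user": {"id": 42}, "course": "math-101", "timestamp": 1700000000}'
print(to_learner_record(msg)["learner_id"])  # 42
```

In the actual job, logic of this shape would run inside a Spark Streaming map over each micro-batch before the write to Cassandra.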
- Designed ER diagrams, logical models (relationships, cardinality, attributes, and candidate keys), and physical data models (capacity planning, object creation, and aggregation strategies) per business requirements, and documented the Data Modeler/Data Analyst and ETL specifications for the data warehouse in ERwin r9.6.
- Implemented Spark jobs using Python and Spark SQL for faster testing and processing of data, and developed Spark scripts using Python shell commands as required.
- Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.
- Created Talend custom components for various use cases, and worked on XML, data quality, processing, and log & error components.
- Developed Spark scripts using Scala shell commands as required, and configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
- Used external loaders such as MultiLoad, TPump, and FastLoad to load data into Teradata 14.1 and Oracle, across analysis, development, testing, implementation, and deployment.
- Performed extraction, transformation, and loading (ETL) of data from Excel, flat files, and Oracle to MS SQL Server using Informatica; implemented schema extraction for Parquet and Avro file formats in Hive; and designed ETL jobs for data processing in Talend Open Studio.
- Implemented partitioning, dynamic partitions, and buckets in Hive, and worked on continuous integration of the application using Jenkins.
- Created notebooks in Databricks to pull data from S3, process it with the transformation rules, and load the data back to the persistence area in S3 in Apache Parquet format.
- Designed the schema and configured and deployed AWS Redshift for optimal storage and fast retrieval of data, and used Spark DataFrames, Spark SQL, and Spark MLlib extensively in development.
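A minimal sketch of the Redshift schema choices this kind of design involves (table and column names are hypothetical):

```sql
-- DISTKEY co-locates rows that join on customer_id on the same slice;
-- SORTKEY lets Redshift skip blocks when queries filter on sale_date.
CREATE TABLE fact_sales (
  sale_id     BIGINT,
  customer_id BIGINT,
  sale_date   DATE,
  amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);
```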
- Performed cluster analysis and PageRank on newswire articles using the Hadoop framework in Python to interpret the importance of keywords connecting documents.
- Created Hive tables, loaded and analyzed data using Hive queries, developed Hive queries to process the data and generate data cubes for visualization, and used reporting tools such as Tableau connected to Hive to generate daily data reports.
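A HiveQL sketch of the data-cube style of query described above (table and columns are hypothetical):

```sql
-- WITH CUBE emits subtotals for every combination of the grouping
-- columns (region, product, both, and the grand total) in one pass.
SELECT region, product, SUM(revenue) AS total_revenue
FROM daily_sales
GROUP BY region, product WITH CUBE;
```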
Environment: Hadoop Framework (Cloudera), Hive, Kafka, Spark, Sqoop, HDFS, Airflow, AWS S3, Redshift, RDS, Athena, Glue, SNS, Lambda, Tableau, Spark SQL, Python, PySpark, SQL, Java, Git, PostgreSQL, Teradata, Databricks, Oracle, Cassandra, ETL (Informatica), ERwin, Data Warehousing, CI/CD (Jenkins, Docker), and Shell Scripting.
Sr. Data Engineer
Confidential, Chicago IL
- Developed Spark RDD transformations, actions, DataFrames, case classes, and Datasets for the required input data, and performed data transformations using Spark Core.
- Created data pipelines for the Kafka cluster and processed the data using Spark Streaming; worked on streaming data to consume data from Kafka topics and load it to the landing area for near-real-time reporting.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala, and in using Sqoop to import and export data between RDBMS and HDFS.
- Documented logical data integration (ETL) strategies for data flows between disparate source/target systems, bringing structured and unstructured data into a common data lake and the enterprise information repositories.
- Migrated an in-house database to the AWS Cloud, and designed, built, and deployed a multitude of applications utilizing the AWS stack (including S3, EC2, RDS, Redshift, and Athena), focusing on high availability and auto-scaling.
- Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for analysis, and used Kafka Streams to configure Spark Streaming to get information and then store it in HDFS.
- Used Talend big data components such as Hadoop and S3 buckets, and AWS services for Redshift.
- Involved in designing data warehouses and data lakes on regular (Oracle, SQL Server), high-performance (Netezza and Teradata), and big data (Hadoop: MongoDB, Hive, Cassandra, and HBase) databases.
- Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
- Parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark, and created Hive DDL on Parquet and Avro data files residing in both HDFS and S3 buckets.
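The flattening step for semi-structured JSON can be sketched in plain Python (in the actual job this ran as a PySpark DataFrame transformation; the sample record is hypothetical):

```python
import json

def flatten(record: dict, parent: str = "") -> dict:
    """Recursively flatten nested JSON into dot-separated column names,
    the shape a flat Parquet schema expects."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

doc = json.loads('{"id": 7, "address": {"city": "Reston", "zip": "20190"}}')
print(flatten(doc))  # {'id': 7, 'address.city': 'Reston', 'address.zip': '20190'}
```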
- Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing, analyzing, and training the classifier using MapReduce, Pig, and Hive jobs.
- Worked on Spark Streaming, which collects data from Kafka in near real time, performs the necessary transformations and aggregations on the fly to build the common learner data model, and persists the data in Cassandra.
- Created AWS Glue job for archiving data from Redshift tables to S3 (online to cold storage) as per data retention requirements.
- Developed ETL mappings for various sources (.TXT, .CSV, .XML) and loaded the data from these sources into relational tables with Talend Enterprise Edition.
- Involved in data modeling using ER/Studio: identified objects and relationships and how they fit together as logical entities, then translated these into a physical design using the ER/Studio forward-engineering tool.
- Updated Python scripts to match training data with our database stored in AWS CloudSearch, so that each document could be assigned a response label for further classification.
- Created monitors, alarms, notifications, and logs for Lambda functions, Glue jobs, and EC2 hosts using CloudWatch, and used AWS Glue for data transformation, validation, and cleansing.
- Worked with cloud-based technologies such as Redshift, S3, and EC2; extracted data from Oracle Financials and the Redshift database; and created Glue jobs in AWS to load incremental data to the S3 staging and persistence areas.
- Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded the data into HDFS.
- Deployed applications using Jenkins, integrating Git version control with it.
- Used the Agile Scrum methodology through the phases of the software development life cycle.
- Improved the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and YARN.
- Scheduled Airflow DAGs to run multiple Hive and Pig jobs, which run independently based on time and data availability, and performed exploratory data analysis and data visualizations using Python and Tableau.
Environment: Hadoop/Big Data Ecosystem (Spark, Kafka, Hive, HDFS, Sqoop, Oozie, Cassandra, MongoDB), AWS (S3, AWS Glue, Redshift, RDS, Lambda, Athena, SNS, SQS, CloudFormation), Oracle, Jenkins, Docker, Git, SQL Server, SQL, Java, PostgreSQL, Python, PySpark, Teradata, Tableau, QuickSight, ER/Studio, Data Warehousing, ETL (Informatica), Talend, Agile.
Sr. Data Modeler/ Data Analyst
Confidential, Newton, MA
- Extensively used ERwin r9.5 to design logical/physical data models and relational database designs, with forward/reverse engineering; published data models to Acrobat files and to the model mart; created ERwin reports in HTML or RTF format as required; created naming-convention files; and coordinated with DBAs to apply data model changes.
- Performed exploratory data analysis (EDA) using Python, and integrated Python with Hadoop MapReduce and Spark.
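A minimal stdlib sketch of the kind of summary statistics the EDA step starts with (in practice this used Python tooling and Tableau; the sample values are hypothetical):

```python
import statistics

def eda_summary(values: list) -> dict:
    """Basic univariate profile: count, central tendency, spread, range."""
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
        "min": min(values),
        "max": max(values),
    }

print(eda_summary([3, 1, 4, 1, 5])["mean"])  # 2.8
```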
- Involved in requirements analysis and ETL design and development for extracting data from source systems such as Teradata 13.1, DB2, Sybase, Oracle 9i, and flat files, and loading it into Netezza.
- Involved in designing Star schemas (identification of facts, measures, and dimensions) and Snowflake schemas for the data warehouse and ODS architecture, using tools such as ERwin.
- Implemented the system architecture for an Amazon EC2-based cloud-hosted solution for the client, and designed the schema and configured and deployed AWS Redshift for optimal storage and fast retrieval of data.
- Documented the whole process of working with Tableau Desktop, installing Tableau Server, and evaluating business requirements.
- Involved in the Ralph Kimball and Bill Inmon methodologies (Star schema, Snowflake schema).
- Coded using Teradata analytical functions and Teradata BTEQ SQL, and wrote UNIX scripts to validate, format, and execute the SQL in a Linux environment.
- Created data models for AWS Redshift and Hive from dimensional data models, and designed and implemented a data lake to consolidate data from multiple sources using Hadoop-stack technologies such as Sqoop and Hive/HQL.
- Worked on importing and cleansing high-volume data from various sources such as Teradata, Oracle, Netezza, flat files, and SQL Server.
- Created various types of reports, such as drill-down and drill-through reports, matrix reports, sub-reports, and charts, using SQL Server Reporting Services (SSRS).
- Implemented naming standards and warehouse metadata for the facts and dimensions of the logical and physical data models.
- Involved in ETL processing using Pig and Hive in AWS EMR and S3, and in data profiling, mapping, and integration from multiple sources to AWS S3.
- Created new database objects like Procedures, Functions, Packages, Triggers, Indexes and Views using T-SQL in SQL Server.
- Created complex stored procedures and PL/SQL blocks with optimum performance using bulk binds (BULK COLLECT and FORALL), inline views, reference cursors, cursor variables, dynamic SQL, VARRAYs, external tables, nested tables, etc.
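A minimal PL/SQL sketch of the bulk-bind pattern above (table and column names are hypothetical): BULK COLLECT fetches the ids in one round trip, and FORALL batches the updates instead of looping row by row.

```sql
DECLARE
  TYPE t_ids IS TABLE OF orders.order_id%TYPE;
  l_ids t_ids;
BEGIN
  -- One fetch instead of a row-by-row cursor loop.
  SELECT order_id BULK COLLECT INTO l_ids
    FROM staging_orders
   WHERE load_date = TRUNC(SYSDATE);

  -- One batched DML call for all collected ids.
  FORALL i IN 1 .. l_ids.COUNT
    UPDATE orders
       SET processed_flag = 'Y'
     WHERE order_id = l_ids(i);

  COMMIT;
END;
/
```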
- Used Sqoop to import/export data between RDBMS and Hive tables, performed incremental imports, and created Sqoop jobs that resume from the last saved value.
- Designed Informatica mappings for error handling, and was involved in preparing the low-level design (LLD) documents for Informatica mappings.
- Designed and developed SQL Server databases, tables, indexes, stored procedures, views, user-defined functions, and other T-SQL statements.
Environment: Netezza, Oracle, Teradata 13.1, T-SQL, SQL Server, DB2, Linux, ERwin r9.5, MDM, PL/SQL, ETL, SSRS, Informatica, Hadoop, Sqoop, Hive, Pig, AWS S3, Redshift, EMR, Shell Scripting, Python, Cognos, Excel, Pivot Tables, Power View, UNIX, etc.