- 8+ years of experience in the IT industry, including 6 years across the SDLC developing with Big Data and the Hadoop ecosystem in the retail and banking sectors.
- Experience designing and developing applications using Big Data technologies: HDFS, MapReduce, Sqoop, Hive, PySpark and Spark SQL, HBase, Python, Snowflake, S3 storage, and Airflow.
- Experience in performance tuning of MapReduce jobs and complex Hive queries.
- Experience building efficient ETL pipelines using Spark in-memory processing, Spark SQL, and Spark Streaming with the Kafka distributed messaging system.
- Understanding of structured data sets, data pipelines, ETL tools, and data reduction, transformation, and aggregation techniques; knowledge of tools such as DBT and DataStage.
- Good understanding of Hadoop distribution platforms (Cloudera, Hortonworks) and cloud platforms such as AWS.
- Good knowledge of job orchestration and coordination tools such as Oozie, ZooKeeper, and Airflow.
- Proficient in using Amazon Web Services utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop/Spark jobs on AWS.
- Wrote PySpark jobs in AWS Glue to merge data from multiple tables and used Crawlers to populate the AWS Glue Data Catalog with metadata table definitions.
- Generated a script in AWS Glue to transfer data and used AWS Glue to run ETL jobs and aggregations in PySpark.
- Knowledge of Amazon EC2, S3, VPC, RDS, Elastic Load Balancing, Auto Scaling, IAM, SQS, SWF, SNS, Security Groups, Lambda, and CloudWatch services.
- Red Hat Enterprise Linux 5
- HDP 2.3
- Map Reduce
- Hive 0.14
- Shell Script
- Python 3.2
- Spark 2.4
- AWS EMR 5.0.0
- Oozie 4.2
- Spark SQL
- PostgreSQL
Confidential, Plano, TX
AWS Data Engineer
- Played a lead role in gathering requirements, analyzing the entire system, and estimating development and testing efforts.
- Involved in designing system components such as Sqoop, Hadoop processing with MapReduce and Hive, Spark, and FTP integration to downstream systems.
- Wrote optimized Hive and Spark queries using techniques such as window functions and tuning Hadoop shuffle and sort parameters.
- Developed ETL pipelines using PySpark with both the DataFrame API and the Spark SQL API.
- Used Spark to perform various transformations and actions, saved the final result data back to HDFS, and loaded it from there into the target Snowflake database.
- Migrated an existing on-premises application to AWS, using services such as EC2 and S3 for processing and storing small data sets; experienced in maintaining the Hadoop cluster on AWS EMR.
- Strong experience with and knowledge of real-time data analytics using Spark Streaming, Kafka, and Flume.
- Configured Spark Streaming to consume ongoing data from Kafka and store the stream data in HDFS.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 and ORC/Parquet/text files into AWS Redshift.
- Used various Spark transformations and actions to cleanse the input data.
- Used Jira for ticketing and tracking issues and Jenkins for continuous integration and continuous deployment.
- Enforced standards and best practices around data catalog and data governance efforts.
- Created DataStage jobs using stages such as Transformer, Aggregator, Sort, Join, Merge, Lookup, Data Set, Funnel, Remove Duplicates, Copy, Modify, Filter, Change Data Capture, Change Apply, Sample, Surrogate Key, Column Generator, Row Generator, etc.
- Expertise in creating, debugging, scheduling, and monitoring Airflow jobs for ETL batch processing that loads data into Snowflake for analytical processes.
- Built ETL pipelines for data ingestion, transformation, and validation on AWS, working with the data steward under data compliance requirements.
- Scheduled all jobs using Airflow scripts written in Python and added various tasks to DAGs, including AWS Lambda tasks.
- Used PySpark to extract, filter, and transform data in data pipelines.
- Skilled in monitoring servers using Nagios, CloudWatch, and the ELK Stack (Elasticsearch, Kibana).
- Used Data Build Tool (DBT) for transformations in the ETL process, along with AWS Lambda and AWS SQS.
- Scheduled all jobs using Airflow scripts written in Python, adding tasks to DAGs and defining dependencies between the tasks.
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Created Unix Shell scripts to automate the data load processes to the target Data Warehouse.
- Responsible for implementing monitoring solutions in Ansible, Terraform, Docker, and Jenkins.
Environment: Red Hat Enterprise Linux 5, HDP 2.3, Hadoop, MapReduce, HDFS, Hive 0.14, Shell Script, Sqoop 1.4.4, Python 3.2, PostgreSQL, Spark 2.4, Airflow, Snowflake.
Confidential, St. Louis, MO
Senior Hadoop Developer / AWS Data Engineer
- Evaluated the suitability of Hadoop and its ecosystem for the project, implementing and validating various proof-of-concept (POC) applications before adopting them as part of the Big Data Hadoop initiative.
- Estimated the software and hardware requirements for the NameNode and DataNodes in the cluster.
- Migrated existing databases from on-premises to AWS Redshift using various AWS services.
- Developed the PySpark code for AWS Glue jobs and for EMR.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.
- Developed Java MapReduce programs to analyze sample log files stored in the cluster.
- Implemented Spark using Python and Spark SQL for faster testing and processing of data.
- Developed Spark scripts using Python on Azure HDInsight for data aggregation and validation, and compared their performance against MapReduce jobs.
- Imported data from MySQL into HDFS using Sqoop on a regular basis.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
- Used IAM to create new accounts, roles, groups, and policies, and developed critical modules such as generating Amazon Resource Names (ARNs) and integration points with S3, DynamoDB, RDS, Lambda, and SQS queues.
- Reviewed explain plans for SQL queries in Snowflake.
- Developed ETL parsing and analytics using Python/Spark to build a structured data model in Elasticsearch for consumption by the API and UI.
- Developed ETL jobs using Spark with Scala to migrate data from Oracle to new Cassandra tables.
- Used Spark with Scala (RDDs, DataFrames, Spark SQL) and the Spark Cassandra Connector API for a few tasks (data migration, business report generation, etc.).
- Created partitions and buckets based on state for further processing with bucket-based Hive joins.
- Created an e-mail notification service that alerts the requesting team upon job completion.
- Implemented security to meet PCI requirements using VPC public/private subnets, Security Groups, NACLs, IAM roles, policies, VPN, WAF, Trusted Advisor, CloudTrail, etc., to pass penetration testing against the infrastructure.
- Defined job workflows according to their dependencies in Oozie.
- Played a key role in productionizing the application after testing by BI analysts.
Environment: MapReduce, Hive, Sqoop 1.4.4, Oozie 4.2, Python, Scala, Spark 2.3, Kafka, Ambari, Cassandra, Linux, AWS EMR, S3, Storm
Confidential, Englewood, CO
AWS Data Engineer
- Wrote Spark applications in Scala to interact with the PostgreSQL database using the Spark SQLContext and accessed Hive tables using HiveContext.
- Involved in designing system components such as the Spark big-data event processing framework, the Kafka distributed messaging system, and the PostgreSQL SQL database.
- Implemented Spark Streaming and Spark SQL using DataFrames.
- Integrated product data feeds from Kafka into the Spark processing system and stored the order details in the PostgreSQL database.
- Created functions and assigned roles in AWS Lambda to run Python scripts, and built Java-based AWS Lambda functions for event-driven processing.
- Created multiple Hive tables and implemented dynamic partitioning and bucketing in Hive for efficient data access.
- Designed tables and columns in Redshift for data distribution across data nodes in the cluster, keeping columnar database design considerations in mind.
- Created, modified, and executed DDL on AWS Redshift and Snowflake tables to load data.
- Created Hive external tables and used custom SerDes based on the structure of the input files so that Hive knows how to load the files into Hive tables.
- Managed large datasets using pandas DataFrames and MySQL.
- Monitored resources and applications using AWS CloudWatch, including creating alarms on metrics for EBS, EC2, ELB, RDS, S3, and SNS, and configured notifications for alarms generated from defined events.
- Monitored system health and logs and responded to any warning or failure conditions.
- Scheduled all jobs using Oozie.
Environment: AWS EMR 5.0.0, EC2, S3, Oozie 4.2, Kafka, Spark, Spark SQL, PostgreSQL, Shell Script, Sqoop 1.4, Scala
Data Engineer with Java
- Participated in designing database schemas to ensure that relationships between data are enforced by tightly bound key constraints.
- Wrote PL/SQL stored procedures, functions, packages, triggers, and views to implement business rules at the application level.
- Extensive experience with Data Definition, Data Manipulation, Data Query, and Transaction Control Languages.
- Gathered requirements by interacting with business users, mapped them to the design, and implemented it following the Agile development methodology.
- Experience installing, upgrading, and configuring Microsoft SQL Server and migrating data from SQL Server 2008 to SQL Server 2012.
- Experience designing and creating tables, views, user-defined data types, indexes, stored procedures, cursors, triggers, and transactions.
- Quick to learn and adapt to new technologies.
- Developed warm-up programs to load profile information for recently logged-in users into the MSSQL database.
- Performed manual testing and used tools such as Splunk and PuTTY to read the application logs for Elasticsearch.
- Configured the data mapping between Oracle and SQL Server and tested performance- and accuracy-related queries under SQL Server.
- Created database connections using the Hibernate SessionFactory and used Hibernate APIs to retrieve and store data with Hibernate transaction control.
- Experienced in writing JUnit test cases.
- Helped create a Splunk dashboard to monitor MDBs modified in the project.
Environment: SQL Developer, Hibernate, RESTful Web Services, Agile Methodology, UNIX, Oracle, Tomcat, Eclipse, Jenkins, CVS, JSON, Oracle PL/SQL