Big Data / AWS Solution Architect / Lead Data Engineer Resume
Hartford, CT
SUMMARY
- 14 years of professional experience in application design and development, including 6+ years in Big Data technologies, with strong proficiency in the Spark framework and a specialty in performance tuning.
- Led a geographically diverse team of data engineers to deliver big data and cloud data product services across lines of business and implemented standards in the data engineering Center of Practice
- Reengineered existing heavy ETL Spark jobs and achieved close to a 5x performance improvement by creating dynamic partition keys on datasets using Spark ML statistical libraries
- Created efficient custom frameworks to convert on-prem Spark jobs to a cyclic micro-batch execution model that runs in parallel at the partition level, solving the shuffle spill problem and using cluster capacity optimally without impacting other applications that share the cluster
- Extensively worked on designing resilient, cost-optimized and operationally excellent solutions in the AWS stack to migrate on-prem Spark applications
- Recently developed and implemented a dynamic data-ingestion framework in Spark / shell script to ingest on-prem Hadoop / Hive datasets into AWS S3
- Recently developed and implemented a dynamic data quality detection framework in Spark that serves as a regression testing framework for the strategic data products, detecting data quality issues at the attribute level across the entire workflow. Also developed multiple accelerators for automation tasks using a combination of statistical techniques and data engineering.
- Collaborated with Data Science community to deliver new underwriting model using cloud resources.
- Developed an unsupervised anomaly detection model with LSTM Deep Neural Networks for earlier detection of data quality issues on the strategic data products ETL workflow
- Implemented real-time anomaly detection on streaming data using SageMaker, together with the Data Science community, for enterprise-level Innovation Jam initiatives
- Collaborated with Data Science Community to perform feature engineering for Image Detection and Image Classification Modelling and Inference
TECHNICAL SKILLS
- EMR, EC2, Glue ETL, Lambda, Redshift, Athena, S3, RDS, Glue Metadata Services, Step Functions, SNS, SQS, IAM, Secrets Manager, KMS, Sagemaker, Textract, Comprehend and CloudWatch Logs / Events
- Databricks
- Spark (Python and Scala), Hive/HiveQL, SQL, Sqoop, Flume, Shell Scripting
- Python - Core Python, Pandas, NumPy, scikit-learn, Flask, Matplotlib, Seaborn, NLTK and various statistical and machine learning packages
- Teradata, Hive, Oracle, SQL Server, RDS, Redshift, Snowflake
PROFESSIONAL EXPERIENCE
Confidential, Hartford, CT
Big Data / AWS Solution Architect / Lead Data Engineer
Responsibilities:
- Key contributor to the development and execution of BI&A and Data Engineering strategy in AWS Stack
- Responsible for architecting and planning the migration of on-prem Spark applications to the AWS and Databricks platforms
- Responsible for the design, development and delivery of core analytic data products to support the BI R&D function, Actuarial, Product Management and business analytic consumers
- Responsible for enabling batch and real time data ingestion patterns of customer data.
- Responsible for design, development and implementation of resilient, cost effective, highly available robust applications in AWS Stack
- Led the solution design and overall ETL design for processing RDF messaging data using Spark
- Built reusable transformation rules and repeatable data conversion models for similar solutions, reducing development effort by 20%.
- Developed a Python framework to parse complex XML and JSON messages for ingestion into the data lake
- Implemented a data lake pipeline that orchestrates almost 20 data sources, applies ETL, and creates a single Hive table with close to 1,200 attributes and more than a billion rows in less than 2 hours.
- Established best practices (standards, principles, guidelines, frameworks, and knowledge management) in the big data space
- Worked on neural networks such as ANNs along with natural language processing algorithms for text mining
- Implemented Random Cut Forest, Isolation Forest and Deep Autoencoder models for anomaly detection in both batch and streaming data to capture outliers and improve the overall quality of strategic data products
- Developed Spark programs for data ingestion/transformation from DB2, Teradata and JSON files.
- Extensively worked on converting SAS modules to Spark/Hive to create a single entity for data scientists’ exploration
- Extensively worked on performance tuning of Spark/Hive components
- Extensively worked on Kafka - Spark Streaming for Real time data pipelines
- Worked with AWS tools such as EMR and Lambda for specific requirements where the source data was placed in S3 by the vendors
- Used Kanban, Git, and GitHub for project management and version control as project lead on Workers Compensation lines’ data products.
- Provided data models and POCs to ingest data from various sources and in various formats.
- Wrote bash scripts for multithreading and automation.
- Deployed and managed cloud infrastructure using Jenkins and Terraform.
- Designed and deployed a dynamic framework to connect to, read and parse MongoDB data and store it in Hive
- Played a technical leadership and mentoring role for onshore and offshore teams
Confidential
Lead Data Engineer / Data Analyst
Responsibilities:
- Extensively worked on solution architecture and design for data lake analytics implementations using Spark and Hive
- Designed and deployed a data lake using Big Data ecosystem tools including Spark, Kafka, Python, NiFi, Hive, Oozie and Sqoop on the Hortonworks distribution.
- Developed Spark code using Scala and Spark SQL on large data sets for both streaming and batch processing.
- Experienced in using various Spark connectors to load and process data between Cassandra, Elasticsearch and Kafka
- Extensively worked on loading data into Hive tables using Spark.
- Extensively worked with Kafka and Spark Streaming on unbounded API data to perform various transformations and joins and load the results into Elasticsearch for low-latency reporting and analytics.
- Leveraged NiFi to configure streaming and batch sources for pipelining into Kafka / HDFS sinks
- Involved in converting Hive/HQL queries into Spark transformations using Spark RDDs in Scala and Python.
- Developed a Sqoop import utility to load data from various RDBMS sources for history loads
- Developed a data pipeline using Flume and Spark to store data in HDFS.
- Good knowledge of Cloudera/Hortonworks distributions and of Amazon Simple Storage Service (Amazon S3), Amazon EC2 and Amazon EMR, with a very good understanding of Microsoft Azure and Google Cloud Dataflow big data and machine learning tools.
- Extensively worked on loading data into Hive tables, raw HDFS storage, Cassandra and Elasticsearch using Spark jobs
- Implemented web log analytics using Splunk, Elasticsearch / Kibana and Grafana.
- Good experience in performing data analytics using Spark with both the Scala and Python APIs.
