Sr. Data Engineer Resume
Framingham, MA
SUMMARY
- Proficient IT professional with 8+ years of industry experience in data modeling, data security, and data architecture, tackling challenging architectural and scalability problems.
- Experienced in designing data pipelines both on-premises and in the cloud, with hands-on experience in Amazon EC2, S3, RDS, Elastic Load Balancing, Auto Scaling, and EMR on AWS, and Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse on Azure.
- Experience in ETL processing, migration, and data processing using AWS services such as EMR, EC2, Athena, Glue, Lambda, S3, Relational Database Service (RDS), and other services in the AWS family.
- Good hands-on knowledge of the Amazon Web Services (AWS) cloud platform, including services like EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), and Redshift.
- Knowledge of persistent storage and supporting services in AWS, including Elastic Block Store (EBS), S3, Glacier, CloudFormation, CloudTrail, OpsWorks, Kinesis, IAM, SQS, SNS, and SES.
- Expertise working with EC2 instances to compute and process large volumes of data across a wide range of applications.
- Skilled in transferring data from source systems to Amazon Redshift and S3 using AWS data pipelines.
- Worked on Spark Databricks clusters on the AWS cloud, estimating cluster size, monitoring, and troubleshooting.
- Used Terraform to automate tasks in the AWS Virtual Private Cloud by configuring settings that interface with the control layer.
- Composed data in real time using Spark Streaming from S3 buckets, performing the necessary transformations and aggregations on the fly to shape the common learner data model and persist the data in HDFS.
- Expertise in implementing and maintaining data pipelines using Apache Airflow.
- Experience with Azure Cloud Services, Azure SQL, Azure Analysis Services, Azure Monitor, and Azure Data Factory.
- Worked closely with the Azure platform; hands-on experience with Azure Data Lake, Blob Storage, Synapse, Storage Explorer, SQL, Azure SQL Database, and Azure SQL Data Warehouse.
- Experience working with Azure services like Stream Analytics and HDInsight.
- Proficient in building data pipelines and loading data using Azure Data Factory, Azure Databricks, and Azure SQL Data Warehouse while controlling access to the databases.
- Well versed in developing data processing jobs to analyze data using MapReduce, Spark, and Hive.
- Excellent understanding of Spark architecture, including Spark Core, Spark SQL, SparkContext, the Spark SQL DataFrame APIs, driver and worker nodes, pair RDDs, stages, executors, and tasks.
- Strong understanding of and experience in Hadoop-led development of enterprise-level solutions utilizing components such as Apache Spark, Sqoop, Hive, HBase, Oozie, Flume, Kafka, and ZooKeeper.
- Developed Spark applications using Spark SQL, PySpark and Delta Lake in Databricks to extract, transform and aggregate multiple file formats for analyzing and transforming the data.
- Expertise in building Spark and PySpark applications for interactive analysis, batch processing, and real-time processing.
- Extensively used the Cloudera platform for Hive data and Spark DataFrame operations for validating the data.
- Worked on importing continuously changing data using Sqoop in incremental lastmodified mode.
- Hands-on programming knowledge of Python (Pandas, NumPy), PL/SQL, Scala, and PySpark.
- Good knowledge of converting Hive/SQL queries into Spark transformations with Datasets and DataFrames (a brief sketch follows this summary).
- In-depth knowledge of Hadoop architecture and its components, including YARN, HDFS, NameNode, DataNode, JobTracker, ApplicationMaster, TaskTracker, and the MapReduce programming paradigm.
- Very strong data development skills with ETL, Oracle SQL, Linux/UNIX, data warehousing and data modeling.
- Designed complex data models for both real-time and offline analytic processing and provided support for data profiling and data quality functions.
- Experienced with SCM and automation tools like Git, Jenkins, and Ansible, and with test-driven code development.
- Hands-on experience building statistical models using tools like Python, PySpark, and SQL; proficient in Python, R, and Linux shell scripting.
- Experienced in data management and data analysis in relational databases, Hadoop, and Spark.
- Expertise in relational database administration, including configuration, implementation, data modeling, maintenance, redundancy/HA, security, troubleshooting and performance tuning, upgrades, database/data/server migrations, and SQL.
- Experience working with Agile Software Development methodologies.
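A minimal PySpark sketch of the Hive/SQL-to-DataFrame conversion referenced above; the table and column names (sales.orders, region, amount, order_date) are hypothetical placeholders, not from any specific engagement.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hive-enabled Spark session (assumes the cluster exposes a Hive metastore)
spark = SparkSession.builder.appName("hive-to-dataframe").enableHiveSupport().getOrCreate()

# Equivalent of:
#   SELECT region, SUM(amount) AS total_amount
#   FROM sales.orders WHERE order_date >= '2020-01-01'
#   GROUP BY region
orders = spark.table("sales.orders")  # hypothetical Hive table
totals = (
    orders
    .filter(F.col("order_date") >= "2020-01-01")
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)
totals.show()
```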
TECHNICAL SKILLS
AWS Big Data Technologies: Amazon AWS (EMR, EC2, RDS, EBS, S3, Athena, Elasticsearch, Lambda)
Azure Big Data Technologies: Azure Data Lake, Azure Databricks, Blob Storage, Synapse, Data Storage Explorer, Azure Active Directory, HDInsight, Cosmos DB
Hadoop Big Data Technologies: Airflow, Hadoop, Spark, Sqoop, Hive, HBase, Oozie, Flume, Kafka, ZooKeeper, Cloudera Manager
ETL Tools: AWS Glue, Azure Data Factory
Databases: MySQL, Teradata, Oracle, MS-SQL SERVER, PostgreSQL, DB2
Version Control: GIT
Database Modelling: Dimension Modelling, ER Modelling, Star Schema Modelling, Snowflake Modelling.
Monitoring and Reporting: Tableau, Power BI, Datadog
Programming Languages: Python, PySpark, Scala, PowerShell, HiveQL
IDE Tools: Eclipse, Jupyter, Anaconda, PyCharm
Others: ADO, Terraform, Docker, Kubernetes, Jenkins, Jira.
PROFESSIONAL EXPERIENCE
Confidential, Framingham MA
Sr. Data Engineer
Responsibilities:
- Designed and developed an enterprise data lake that accommodates various types of data from multiple data sources.
- Worked closely with the Data and Analytics team to architect and build the data lake using AWS services such as EMR, S3, Athena, Glue, Redshift Spectrum, Apache Airflow, and Hive.
- Developed and implemented scalable, secure cloud architecture on AWS, leveraging services like VPC (Virtual Private Cloud) and EC2 Auto Scaling, and built highly scalable, secure, and flexible systems that handled load bursts and evolved quickly during development iterations.
- Designed and implemented a real-time data pipeline to work with semi-structured data, incorporating large volumes of raw records from multiple data sources using Kinesis Data Streams and Kinesis Data Firehose.
- Created an AWS Lambda function to read records from the producer and write them to an Amazon DynamoDB table as they arrive (see the first sketch after this list).
- Worked with EMR to transform and move big data into AWS data stores and databases in S3 and DynamoDB.
- Created and launched AWS EC2 instances to execute jobs on EMR to store the results in S3.
- Designed and set up an enterprise data lake to support multiple areas such as analytics, processing, storage, and reporting of big, rapidly changing data.
- Worked with the Hadoop Distributed File System (HDFS), S3 storage, and big data formats such as Parquet and JSON.
- Optimized queries and tuned performance in AWS Redshift when migrating large datasets.
- Configured AWS Redshift clusters, Redshift Spectrum for querying, and Redshift data sharing for transferring data among clusters.
- Automated ETL processes using Apache Airflow, storing the data in batches in S3 (see the Airflow sketch after this list).
- Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Developed Python code to satisfy requirements and perform data processing and analytics using built-in libraries.
- Used Apache Spark with Python to develop and execute Big Data Analytics applications.
- Created Lambda functions with Boto3 to clean up unused AMIs in all application regions and reduce EC2 costs.
- Worked with Spark to improve the performance and optimization of existing algorithms.
- Processed MySQL and Athena tables to create datasets and drew visual insights using AWS QuickSight.
- Deployed Applications on AWS EC2 instances and configured the storage on S3 buckets.
- Configured S3 buckets with policies to automatically archive the infrequently accessed data to storage classes.
- Analyzed raw files in the S3 data lake using AWS Athena and Glue without loading the data into a database, and implemented AWS Elasticsearch to store large datasets in a single cluster for running large-scale log analytics.
- Implemented IAM roles for various resources like EC2, S3, RDS to communicate with each other.
- Analyzed complex data and identified anomalies, trends and risks to provide insights to improve internal controls.
- Configured computing, security and networking systems within the cloud environment and implemented cloud policies and maintained service availability.
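A minimal sketch of the Kinesis-to-DynamoDB Lambda described above, assuming the function is triggered by a Kinesis stream; the table name and payload fields are hypothetical.

```python
import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("learner_events")  # hypothetical table name


def lambda_handler(event, context):
    """Triggered by a Kinesis stream; writes each record to DynamoDB as it arrives."""
    with table.batch_writer() as writer:
        for record in event["Records"]:
            # Kinesis delivers the payload base64-encoded in record["kinesis"]["data"]
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            # Assumes the payload fields are DynamoDB-compatible types
            writer.put_item(Item=payload)
    return {"written": len(event["Records"])}
```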
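A minimal Apache Airflow sketch of the batch ETL automation described above; the DAG id, schedule, and extraction logic are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_s3(**context):
    # Placeholder: pull the day's batch from the source system and land it in S3,
    # e.g. with boto3's s3.upload_file(local_path, "my-data-lake-bucket", key).
    ...


with DAG(
    dag_id="daily_batch_to_s3",  # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    land_batch = PythonOperator(
        task_id="land_batch_in_s3",
        python_callable=extract_to_s3,
    )
```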
Environment: AWS, Hadoop, AWS Kinesis, Parquet, Avro, JSON, CloudWatch, SNS, AWS Redshift, Apache Airflow, AWS S3, MySQL, AWS EC2, AWS Elasticsearch, AWS Athena, AWS Glue.
Confidential, Connecticut
Azure Data Engineer
Responsibilities:
- Worked on Azure Data Factory to integrate data from both on-prem (MySQL, Cassandra) and cloud (Azure SQL DB, Blob Storage) sources and applied transformations to load the data into Azure Synapse.
- Migrated on-premises Hadoop cluster to Azure storage using Azure Data Factory.
- Built and scheduled pipelines using triggers in Azure Data Factory.
- Built PySpark pipelines to validate the table data from Hive and Oracle.
- Implemented Azure Log Analytics for monitoring the resources and their tasks for better throughput.
- Developed Apache Spark jobs for Data pre-processing and cleansing activities.
- Automated deployment from ACR using ADO YAML pipelines.
- Designed and developed a common architecture to store enterprise data and built a data lake in the Azure cloud.
- Developed Spark applications for data extraction, transformation, and aggregation from multiple systems and stored the data on Azure Data Lake Storage using Azure Databricks notebooks.
- Built data pipelines using Airflow to interact with services like Azure Databricks, Azure Data Lake, Azure Data Factory, and Azure Synapse Analytics.
- Created ADF pipelines using linked services, datasets, and pipelines to extract, transform, and load data from multiple sources such as SQL Server, Blob Storage, Azure SQL, and Azure Synapse Analytics.
- Deployed python libraries on Databricks by setting up new Jobs to install the required libraries with dependencies.
- Managed huge volumes of data for exploring and transforming by creating Python Databricks notebooks.
- Developed Python code for extracting and transferring data from on-premises systems to Azure Data Lake (see the sketch after this list).
- Managed resources and scheduling across the cluster using Azure Kubernetes Service.
- Collaborated with clients and stakeholders to execute the design flow of data migration to Azure including disaster recovery and testing performance.
- Used Terraform to automate tasks in the AWS Virtual Private Cloud by configuring settings that interface with the control layer.
- Used Terraform to manage resource scheduling, disposable environments, and multi-tier applications.
- Automated Data migration and exploration using Azure Analytics and HDInsight to deliver the insights.
- Used Power BI as a front-end BI tool and MS SQL Server as a back-end database to design and develop dashboards, workbooks, and complex aggregate calculations.
- Created complex SQL queries and scripts to extract, aggregate, and validate data from MS SQL, Oracle, and flat files using Informatica, and loaded it into a single data warehouse repository.
- Designed SQL, SSIS, and Python based batch and real-time ETL pipelines to extract data from transactional and operational databases and load the data into target databases/data warehouses.
- Gained experience with analytical reporting and preparing data for Power BI dashboards.
- Collected, scrubbed, and extracted data, generated compliance reports using SSRS, and analyzed and identified market trends to improve product sales.
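A minimal sketch of the on-premises-to-Azure Data Lake transfer described above, assuming the azure-storage-file-datalake and azure-identity SDKs; the account URL, filesystem, and paths are hypothetical.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical storage account and filesystem
ACCOUNT_URL = "https://mydatalakeacct.dfs.core.windows.net"
FILESYSTEM = "raw"

service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=DefaultAzureCredential())
filesystem = service.get_file_system_client(FILESYSTEM)


def upload_file(local_path: str, remote_path: str) -> None:
    """Upload one on-premises file into the data lake, overwriting if it already exists."""
    file_client = filesystem.get_file_client(remote_path)
    with open(local_path, "rb") as handle:
        file_client.upload_data(handle, overwrite=True)


if __name__ == "__main__":
    # Hypothetical source file and landing-zone path
    upload_file("exports/customers.csv", "landing/customers/customers.csv")
```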
Environment: Azure cloud, Azure Data Factory, Apache Spark, YAML, Azure Databricks, Azure Analytics, Azure Synapse, SQL, SSIS, AWS VPC, Azure Kubernetes Service, Python, HDInsight
Confidential
Data Engineer
Responsibilities:
- Organized, configured and scheduled different resources across the cluster using Azure Kubernetes Service and monitored Spark cluster using Log Analytics and Ambari Web UI.
- Worked on transitioning log storage from Cassandra to Azure SQL Data Warehouse and improved performance.
- Engaged deeply in the development of data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL.
- Developed Spark Streaming programs to process real-time data from Kafka and handled the data for transformation (see the Kafka sketch after this list).
- Wrote PowerShell scripts to schedule Hive and Spark jobs from Control-M.
- Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to run streaming analytics in Databricks.
- Developed business aggregations using Spark and DynamoDB for storing aggregated data in JSON format.
- Managed Hadoop jobs with the help of Apache Oozie workflow engine using Oozie scripts and workflow setup.
- Used Apache tooling for streaming data from a PostgreSQL database to the data lake.
- Integrated Kerberos authentication with Hadoop infrastructure for Authentication and authorization management.
- Worked with HBase by using Hive-HBase integration and computed various metrics for reporting on the dashboards.
- Used Spark Scala APIs, Hive data aggregation, and JSON data formatting to develop data pipeline programs for general PySpark processing and visualization.
- Applied a strong understanding of partitioning and bucketing in Hive and designed Hive tables to optimize performance.
- Implemented caching and partitioning techniques in Spark with Scala for quick data access and processing.
- Optimized and tuned Hadoop Environment and modified hardware to meet expected performance thresholds.
- Implemented user-defined functions (UDFs) in PySpark for data loading and transformations (see the UDF sketch after this list).
- Implemented a star schema design for data warehouses, with fact tables referencing dimension tables.
- Extensively used SQL Queries to perform data validations, extractions, transformations and data Loading from various data resources.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Performed analysis on the data loaded into HDFS using Map Reduce jobs.
- Automated the tasks for loading data into HDFS by creating Oozie workflows, and implemented HBase row keys for inserting data into HBase tables using lookup and staging table concepts.
- Implemented Spark jobs in Scala to access the data from Kafka and transformed the data to fit into HBase database.
- Developed skills in cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, site configuration, and capacity planning.
- Created charts and graphs to visualize data analysis results and worked on model testing, validation, and reformulation to support accurate outcome prediction.
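A minimal sketch of the Kafka ingestion described above, shown here with Spark Structured Streaming; the broker, topic, schema, and sink paths are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Hypothetical event schema, topic, and broker list
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "app-events")
    .load()
)

# Parse the Kafka value payload (JSON) and apply a simple transformation
events = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", schema).alias("e"))
    .select("e.*")
    .filter(F.col("event_type").isNotNull())
)

query = (
    events.writeStream.format("delta")  # or "parquet"/"console", depending on the sink available
    .option("checkpointLocation", "/mnt/checkpoints/app-events")
    .option("path", "/mnt/curated/app-events")
    .start()
)
```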
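A minimal sketch of a PySpark user-defined function as referenced above; the column names and sample data are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

# Hypothetical input: raw customer records with inconsistent region codes
df = spark.createDataFrame([("c1", " ne "), ("c2", "SW")], ["customer_id", "region_code"])


@F.udf(returnType=StringType())
def normalize_region(code):
    # Trim whitespace and upper-case the code during loading/transformation
    return code.strip().upper() if code else None


df.withColumn("region_code", normalize_region("region_code")).show()
```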
Environment: Azure Kubernetes, Ambari web UI, Kafka, Spark Streaming, Hive, DynamoDB, Apache Oozie, Scala, SQL, Map Reduce, HDFS, HBase.
Confidential
Software Engineering Analyst
Responsibilities:
- Engaged with different teams to gather requirements and design the ETL migration process from an existing RDBMS to a Hadoop cluster using Sqoop.
- Developed Hive queries for data transformation and data analysis.
- Developed, implemented, and tested Python-based web applications interacting with MySQL.
- Loaded data from Hive tables to MySQL and DB2 using Python scripts (see the sketch after this list).
- Worked with web services and Hibernate in a fast-paced development environment.
- Performed ETL operations using Informatica, PL/SQL, and UNIX shell scripts, and worked with SSIS.
- Developed strong skills in ETL design and development and understood the tradeoffs of various design options on multiple platforms using multiple protocols.
- Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake. Worked with SQL Server Reporting Services (SSRS) and created various types of reports, including parameterized, cascading, conditional, matrix, table, chart, and sub-reports.
- Created connection pools, physical tables, schema folders and catalog folders in the physical layer of the repository.
- Defined static and dynamic repository variables to modify metadata content to adjust to a changing data environment.
- Alleviated network security issues by developing and implementing customized HTML and CSS code.
- Researched Lean Six Sigma principles and implemented an online portal that replaced Excel sheets and optimized the storage of data.
- Gained hands-on experience with the Agile software development process, including analysis and design, and worked with Scrum, Jira, and Confluence.
- Gained experience with the DataStax Spark connector to store data in and retrieve data from a Cassandra database.
- Automated builds and deployments with continuous integration for multiple clients using a variety of languages and tools, primarily Octopus, TFS, and PowerShell.
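A minimal sketch of the Hive-to-MySQL load described above, assuming PyHive and mysql-connector-python; hosts, credentials, and table names are hypothetical.

```python
from pyhive import hive            # assumes PyHive is installed and HiveServer2 is reachable
import mysql.connector             # assumes mysql-connector-python

# Hypothetical hosts, credentials, and table names
hive_conn = hive.Connection(host="hive-server", port=10000, database="analytics")
mysql_conn = mysql.connector.connect(
    host="mysql-server", user="etl_user", password="secret", database="reporting"
)

# Read the aggregated rows from Hive
hive_cur = hive_conn.cursor()
hive_cur.execute("SELECT region, total_amount FROM daily_totals")
rows = hive_cur.fetchall()

# Load the rows into MySQL
mysql_cur = mysql_conn.cursor()
mysql_cur.executemany(
    "INSERT INTO daily_totals (region, total_amount) VALUES (%s, %s)",
    rows,
)
mysql_conn.commit()

hive_conn.close()
mysql_conn.close()
```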
Confidential
Python Developer
Responsibilities:
- Worked on ETL tasks such as pulling data from and pushing data to various servers.
- Developed ETL scripts in Python to read data from database tables and insert or update the resulting data (see the sketch after this list).
- Outlined data management system using MySQL and gained experience in Agile and Scrum process.
- Implemented and tested python-based web applications interacting with MySQL.
- Installed Hadoop, MapReduce, and HDFS on AWS and worked on data cleaning and pre-processing.
- Cleaned and processed third-party spending data into manageable deliverables in specific formats using Excel macros and Python libraries.
- Used Python IDE PyCharm for developing the code and performing the unit test.
- Developed the required XML schema documents and implemented the framework for parsing XML documents.
- Built application and database servers using AWS EC2, created AMIs, and used RDS for Oracle DB.
- Developed skills in AWS (S3, EC2, Lambda), MySQL, Python, Git, Jenkins, Docker, and Kubernetes. Worked with object-oriented Python, Flask, Bootstrap, Linux, and Git.
- Worked on the analysis, design, development, and testing phases of the software development life cycle.
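A minimal sketch of the insert/update ETL script described above, assuming mysql-connector-python; hosts, credentials, and tables are hypothetical.

```python
import mysql.connector  # assumes mysql-connector-python

# Hypothetical source and target connections
source = mysql.connector.connect(host="source-db", user="etl", password="secret", database="ops")
target = mysql.connector.connect(host="target-db", user="etl", password="secret", database="warehouse")

# Pull the latest rows from the source table
src_cur = source.cursor()
src_cur.execute("SELECT id, name, amount FROM orders WHERE updated_at >= CURDATE()")
rows = src_cur.fetchall()

# Insert new rows and update existing ones in the target table
tgt_cur = target.cursor()
tgt_cur.executemany(
    """
    INSERT INTO orders (id, name, amount)
    VALUES (%s, %s, %s)
    ON DUPLICATE KEY UPDATE name = VALUES(name), amount = VALUES(amount)
    """,
    rows,
)
target.commit()

source.close()
target.close()
```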