Senior Data Engineer Resume
SUMMARY
- 8+ years of diverse experience in the data space, including requirements gathering, design, development, and implementation of various applications, as well as data engineering on cloud platforms such as AWS, Azure, GCP, Snowflake, and Databricks. Experienced in big data architecture and ecosystem components.
- Used Spark for data cleansing, data analysis, structured and unstructured data transformations, data modeling, data warehousing, and data visualization with PySpark, Spark SQL, Python, SQL, Airflow, Kafka, SSIS, Sqoop, Oozie, Hive, Tableau, and Power BI.
- Experience with Apache Spark and Apache Hadoop components such as RDDs, DataFrames, Spark Streaming, HDFS, MapReduce, Hive, HBase, Pig, and Sqoop.
- Experienced in streaming real-time data through Apache Kafka, storing it in the NoSQL database Couchbase for analytics, and producing the streamed messages to Solace/Kafka queues for end users to consume in real time.
- Expertise in Azure services including Blob Storage, Virtual Machines, Azure Storage Queues, Azure Database, Azure Data Lake Analytics, Azure Event Hubs, Azure Functions, Azure Data Factory, Azure Synapse Analytics, Azure Cosmos DB, Azure Monitor, Azure Active Directory, Azure Databricks, and Azure HDInsight.
- Experience in configuring Spark Streaming to receive real-time data from Apache Kafka and store the streamed data to HDFS (a minimal sketch follows this list), and expertise in using Spark SQL with various data sources like JSON, Parquet, and Hive.
- Orchestrated Airflow DAGs to migrate on-premises data to the cloud.
- Experienced in optimizing Snowflake virtual warehouses and SQL queries for cost.
- Well versed with Hadoop distribution and ecosystem components like HDFS, YARN, MapReduce, Spark, Sqoop, Hive, and Kafka.
- Expertise in AWS services including S3, EC2, SNS, SQS, RDS, EMR, Kinesis, Lambda, Step Functions, Glue, Redshift, DynamoDB, Elasticsearch, Service Catalog, CloudWatch and IAM.
- Proficient in Spark for processing and manipulating complex data using Spark Core, SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark Streaming.
- Experienced in Jenkins/Ansible automation; automated Spark job deployments using Ansible.
- Well versed in Core Java, Scala, Web Services (SOA), REST Services, JDBC, MySQL, DB2 and Python.
- Expert in installing, configuring, and using ecosystem components such as Spark, Kafka, Couchbase, Hadoop, MapReduce, HDFS, HBase, ZooKeeper, Hive, Sqoop, and Pig.
- Extensively used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data; expertise in using Spark SQL with distributed data formats such as JSON, Parquet, and Avro.
- Expertise in the NoSQL database HBase and plugging it into the Hadoop ecosystem, including Hive and HBase integration.
- Experienced in using Git for version control and experienced with Jenkins.
- Experience working with NoSQL databases and their integration, including DynamoDB, Cosmos DB, MongoDB, Cassandra, and HBase.
- Configured high-concurrency Spark clusters in Databricks to transform and enrich data and surface insights that impact the organization.
- Excellent hands-on experience in Unit testing, Integration Testing and Functionality testing.
- Experienced with version control systems such as Git, GitHub, GitLab, SVN, Bamboo, and Bitbucket to keep code versions and configurations organized. Responsible for defect reporting and tracking in JIRA.
- Expertise in creating interactive dashboards in Power BI, Tableau, QlikView, and TIBCO Spotfire with security features such as row-level security.
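The Spark Streaming bullet above references a Kafka-to-HDFS pipeline; the following is a minimal PySpark Structured Streaming sketch of that pattern, with an assumed broker address, topic name, and HDFS paths used purely for illustration.

```python
# Minimal sketch: consume a Kafka topic with Structured Streaming and land it in HDFS as Parquet.
# Broker, topic, and paths are illustrative placeholders; requires the spark-sql-kafka package.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read the Kafka topic as a streaming DataFrame (key/value arrive as binary).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
    .select(col("value").cast("string").alias("payload"))
)

# Write the stream to HDFS as Parquet; the checkpoint tracks consumed offsets.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/events")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```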
TECHNICAL SKILLS
Big Data: Spark, DataFrames, RDD, Spark-Streaming, Kafka, HDFS, MapReduce, HBase, Sqoop, Hive, Oozie, Druid.
Cloud Services: AWS - EMR, EC2, S3, Redshift, Athena, Lambda Functions, Step Functions, DynamoDB, CloudWatch, CloudTrail, SNS, SQS, Kinesis, QuickSight. Azure - Azure HDInsight, Databricks (ADBX), Data Lake (ADLS), Cosmos DB, DevOps, Azure AD, Blob Storage, Data Factory, Azure Functions, Azure SQL Data Warehouse, Azure SQL Database, Azure Monitor, Azure Stream Analytics, Azure Event Hub.
IDEs & Utilities: PyCharm, Scala, Visual Studio Code, SSMS, Data Studio, Omni-AI, IntelliJ, Eclipse, JCreator, NetBeans.
RDBMS: MySQL, MS SQL Server, MS Access, PostgreSQL, DB2, Oracle.
NoSQL Database: HBase, Couchbase, M7, MongoDB and Cassandra.
Operating Systems: Linux, UNIX, OS/400, MacOS, WINDOWS 98/00/NT/XP.
Programming & Scripting Languages: Python, R, Scala, Java, C++, C, Pig, PL/SQL, Shell scripting.
Web Services/Protocols: TCP/IP, UDP, FTP, HTTP/HTTPS, SOAP, REST, RESTful.
Version control systems: GitHub, SVN, Git, GitLab, Bitbucket, Bamboo.
Build and CI tools: Docker, Kubernetes, Maven, Jenkins.
SDLC/Testing Methodologies: Agile, Waterfall, Scrum.
Hadoop Distributions: Cloudera and Horton Works
Statistical Analysis Skills: A/B Testing, Time Series Analysis, Markov Chains.
Monitoring and Reporting: Tableau, Power BI, Superset.
PROFESSIONAL EXPERIENCE
Confidential
Senior Data Engineer
Responsibilities:
- Implemented the Hadoop Distributed File System (HDFS), AWS S3 storage, and big data formats including Parquet, Avro, and JSON for the enterprise data lake.
- Configured AWS S3 buckets with lifecycle policies for automatic archiving of infrequently accessed data to lower-cost storage classes.
- Created complex DataFrames by reading files from AWS S3 buckets into Omni-AI using PySpark.
- Implemented data quality (DQ)/data validation checks using Spark on incoming messages/beacons and stored the results in Elasticsearch; these DQ results were then displayed in Kibana through the Elasticsearch integration.
- Built a Druid data source and created Superset dashboards to display data consumed in real time.
- Fixed data loss and duplicate data issues from Kafka by implementing Kafka offset management and Couchbase integration, respectively.
- Applied partitioning and bucketing strategies while writing files to improve overall system performance (see the write sketch after this list).
- Built Apache Airflow DAGs that export data to AWS S3 buckets by invoking an AWS Lambda function (see the DAG sketch after this list).
- Analyzed files in the S3 data lake using AWS Athena and AWS Glue without importing the data into a database.
- Implemented AWS Elasticsearch for storing massive datasets in a single cluster for extensive log analysis.
- Automated ETL operations using Apache Airflow, optimized queries and fine-tuned performance in AWS Redshift for large dataset migration.
- Configured AWS Redshift clusters, Redshift Spectrum for querying, and data sharing for transferring data between clusters.
- Implemented CI/CD by automating the spark jobs build and deployment using Jenkins and Ansible.
- Enabled metrics properties in Spark jobs to monitor their behavior; these metrics were written to InfluxDB.
- The data collected in InfluxDB was viewed in Grafana, where the metrics can be queried and visualized easily.
- Worked on analyzing the Hadoop cluster and different big data analytics tools, including Hive and HBase.
- Automated the data pull from Amazon S3 to HDFS, downloading files whenever a new file appears in Amazon S3.
- Worked on ETL migration by developing and deploying AWS Lambda functions to generate a serverless data pipeline that loads into the Glue Data Catalog and can be queried from AWS Athena.
- Highly proficient in developing Lambda functions to automate tasks on AWS using CloudWatch triggers, S3 events, and Kinesis streams.
- Developed Python scripts using Boto3 to supplement the automation provided by Terraform for tasks such as encrypting EBS volumes backing AMIs and scheduling Lambda functions for routine AWS tasks.
- Knowledge of Airflow's built-in features for managing and scaling data pipelines, such as pooling and task prioritization.
- Created ad hoc tables to add schema and structure to data in AWS S3 buckets using Lambda functions; performed data validation, filtering, sorting, and transformations for every data change in a DynamoDB table and loaded the transformed data into a Postgres database.
- Developed scalable data integration pipelines to transfer data from AWS S3 buckets to an AWS Redshift database using Python and AWS Glue.
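A short sketch of the partitioning and bucketing write strategy noted above; the S3 path, table name, bucket count, and columns are assumed placeholders rather than project values.

```python
# Sketch: write a DataFrame partitioned by date and bucketed by a join key.
# Bucketing requires saveAsTable (a Hive-compatible metastore), hence enableHiveSupport().
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partitioned-bucketed-write")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.read.parquet("s3a://example-bucket/raw/events/")  # placeholder input path

# Partition by date for coarse partition pruning; bucket by user_id to reduce shuffle in later joins.
(
    df.write
    .partitionBy("event_date")
    .bucketBy(64, "user_id")
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("analytics.events_bucketed")
)
```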
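And a minimal sketch of the Airflow-to-Lambda export pattern, assuming a hypothetical Lambda function name, region, and daily schedule; a real DAG would also depend on the project's IAM and connection setup.

```python
# Sketch: a daily Airflow DAG that invokes an AWS Lambda export function via boto3.
# Function name, region, and schedule are placeholders.
import json
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def invoke_export_lambda(**context):
    """Invoke the Lambda that exports the day's data to S3."""
    client = boto3.client("lambda", region_name="us-east-1")
    response = client.invoke(
        FunctionName="export-to-s3",  # hypothetical function name
        InvocationType="RequestResponse",
        Payload=json.dumps({"run_date": context["ds"]}),
    )
    if response["StatusCode"] != 200:
        raise RuntimeError("Lambda export failed")


with DAG(
    dag_id="s3_export_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="invoke_export_lambda",
        python_callable=invoke_export_lambda,
    )
```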
ENVIRONMENT: Apache Spark, PySpark, Apache Kafka, Couchbase, Solace, Apache Hadoop, HDFS, Hive, HBase, MapReduce, Pig, Sqoop, Flume, MapR, Oozie, Java, Scala, Spring MVC, Spring WebFlux, Shell Scripting, MySQL, AWS S3, SFTP, AWS Athena, SNS, SQS, Lambda, CloudWatch, DynamoDB, AWS Redshift, Python, AWS Glue, PostgreSQL, EMR, Tableau.
Confidential
Data Engineer
Responsibilities:
- Developed a scalable ETL pipeline using PySpark on Azure Databricks and loaded the enriched data into Azure Data Lake (see the sketch after this list).
- Developed Spark Scala scripts and UDFs to read from Azure Blob Storage and perform transformations on large datasets using Azure Databricks.
- Developed scalable data ingestion pipelines on an Azure HDInsight Spark cluster using Spark SQL; also worked with Cosmos DB.
- Developed a robust ETL pipeline in Azure Data Factory to integrate data from on-premises systems into the cloud and applied transformations using PySpark to load the enriched data into Azure SQL Data Warehouse.
- Configured Spark Streaming to receive real-time data from Apache Flume and store the streamed data to Azure Table storage using Scala; utilized the Spark Streaming API to stream data from various sources and optimized existing Scala code to improve performance.
- Created pipelines in ADF using linked services to extract, transform, and load data from multiple sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory and Azure HDInsight.
- Implemented dimensional data modeling to deliver multi-dimensional star schemas and developed snowflake schemas by normalizing dimension tables as appropriate.
- Designed and implemented new Azure subscriptions, data factories, virtual machines, Azure SQL instances, Azure SQL DW instances, and HDInsight clusters, and installed Data Management Gateways (DMGs) on VMs to connect to on-premises servers.
- Developed Spark DataFrames from various datasets and applied business transformations and data cleansing operations in Azure Databricks.
- Developed Python scripts to build ETL pipelines and Directed Acyclic Graph (DAG) workflows in Airflow and Apache NiFi.
- Designed custom input adapters using Spark and Hive to ingest and analyze data in Airflow, and ingested the enriched data into Snowflake.
- Used the PolyBase technique to load and export data quickly in Azure Synapse Analytics and analyzed data using serverless SQL pools and Spark pools.
- Migrated existing Oozie workflows to Apache Airflow for daily incremental loads pulling data from RDBMS sources.
- Implemented Kafka high-level consumers to get data from Kafka partitions and move it into HDFS.
- Designed Snowpipe for continuous data loading into Snowflake from Azure Data Lake.
- Collected and processed large amounts of log data and staging data in HDFS using Kafka for further analysis.
- Managed resources and scheduling across the cluster using Azure Kubernetes Service (AKS), which was used to create, configure, and manage a cluster of virtual machines; Kubernetes handled the online and batch workloads required to feed analytics and machine learning applications.
- Used Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory for authentication and Apache Ranger for authorization.
- Involved in setting up the CI/CD pipeline using Jenkins, Terraform and Azure DevOps.
- Designed interactive Power BI dashboards and reports with security features such as row-level security, based on business requirements.
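A minimal PySpark sketch of the Databricks ETL pattern from the first bullet above (Blob/ADLS source to enriched Delta output in the data lake); the storage account, containers, and column names are assumed for illustration.

```python
# Sketch: read raw CSV from ADLS, apply cleansing/enrichment, write Delta back to the lake.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adls-etl").getOrCreate()

source_path = "abfss://raw@examplestorage.dfs.core.windows.net/sales/"                 # placeholder
target_path = "abfss://curated@examplestorage.dfs.core.windows.net/sales_enriched/"    # placeholder

raw = spark.read.option("header", "true").csv(source_path)

# Cleansing and enrichment: typed columns, deduplication, derived partitioning column.
enriched = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
       .dropDuplicates(["order_id"])
       .withColumn("order_year", F.year("order_date"))
)

# Land the enriched data in the lake as Delta, partitioned by year.
(
    enriched.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_year")
    .save(target_path)
)
```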
ENVIRONMENT: HDInsight, Snowflake, Data Lake, Databricks, Data Factory, SQL, SQL DB, Data Storage Explorer, Azure Synapse, SSMS, Azure Studio, Terraform, PySpark, Python, Hive, KQL, Apache Airflow, Apache Flume, Delta Lake, Power BI.
Confidential
ETL Developer
Responsibilities:
- Used Azure Logic Apps, Azure Data Factory, and PowerShell for moving or retrieving data from various sources.
- Expertise in migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlled and provided database access and migrated on-premises databases to Azure Data Lake Store via Azure Data Factory (ADF).
- Created pipelines in Azure utilizing ADF to retrieve data from several source databases (Informix, Sybase, etc.) using various ADF activities (Move & Transform, Copy, Filter, ForEach, Databricks, etc.).
- Created a linked service to import data from the Caesar's SFTP location into Azure Data Lake.
- Created and configured Kafka producers to gather data from various servers and broadcast it to topics.
- Configured and created new application connections for better access to the MySQL database and developed sophisticated SQL and PL/SQL stored procedures, functions, and sequences.
- Developed robust Azure Data Factory pipelines to extract data from on-premises source systems into Azure Data Lake Storage and implemented copy behaviors such as flatten hierarchy, preserve hierarchy, and merge files.
- Developed Azure Data Factory (ADF) pipelines with Integration Runtime (IR) for file system data ingestion and relational data ingestion.
- Created Azure Key Vault as a central repository for maintaining secrets and referenced those secrets in Azure Data Factory and Azure Databricks notebooks; configured Logic Apps to handle email notifications to end users.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
- Developed MapReduce jobs that extract, transform, and aggregate data from a variety of file formats, including JSON, Avro, and other compressed formats, and processed Avro and Parquet files with map-side joins.
- Proactive in improving development, testing, continuous integration, and production deployment processes.
- Optimized and improved the efficiency of Spark jobs using techniques such as broadcast joins and dynamic partition pruning (see the sketch after this list).
- Worked on migrating SQL databases to Azure Data Lake, Blob Storage, and Azure SQL Database; invoked Azure Functions for data augmentation and stored the results in Blob Storage.
- Developed Azure Databricks job workflows that extract data from SQL Server and upload the files to SFTP using PySpark.
- Developed robust data pipelines using Azure Data Factory and built a custom alerts platform for monitoring.
- Worked on Spark Streaming configuration to receive real-time data from Apache Flume and store the streamed data to Azure Table storage using Python; Delta Lake was used for storage and all types of processing and analytics.
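A short sketch of the Spark tuning techniques referenced above (an explicit broadcast join plus dynamic partition pruning); table paths and column names are illustrative placeholders.

```python
# Sketch: broadcast a small dimension table and rely on dynamic partition pruning
# to skip partitions of the large fact table at runtime.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("spark-join-tuning").getOrCreate()

# Dynamic partition pruning is on by default in Spark 3.x; shown here for clarity.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

fact = spark.read.parquet("/mnt/datalake/fact_sales")   # large table, partitioned by region_id
dim = spark.read.parquet("/mnt/datalake/dim_region")    # small, filtered dimension table

# Broadcasting the dimension avoids shuffling the large fact table for the join.
joined = fact.join(broadcast(dim), on="region_id", how="inner")

joined.groupBy("region_name").count().show()
```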
ENVIRONMENT: Azure cloud platform, specifically utilizing services such as Azure Data Factory (ADF), Azure Data Lake Analytics, Azure SQL Database, Databricks, Azure SQL Data Warehouse, Azure Data Lake Storage (ADLS), Blob Storage, and Spark applications.
Confidential
Data Analyst
Responsibilities:
- Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.
- Expertise in converting HiveQL into Spark transformations using Spark RDD and Scala programming.
- Worked with PySpark to leverage Spark libraries through Python scripting for data analysis.
- Used Kibana, an open-source browser-based analytics and search dashboard for Elasticsearch.
- Performed importing data from various sources to the Cassandra cluster using Java APIs or Sqoop.
- Designed and built the Reporting Application, which uses the Spark SQL to fetch and generate reports on HBase table data.
- Worked on batch processing of data sources using Apache Spark and Elasticsearch.
- Worked on UNIX shell scripting to split groups of files into smaller files and to automate file transfers.
- Interacted with key users and assisted them with various data issues, understood data needs and assisted them with Data analysis.
- Used the Pandas API to organize data in time-series and tabular formats for easy timestamp-based manipulation and retrieval (see the sketch after this list).
- Developed and executed various MySQL database queries from Python using the Python MySQL connector and the MySQLdb package.
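A small pandas sketch of the time-series handling described above; the file name and columns are assumed for illustration.

```python
# Sketch: load tabular data, index it by timestamp, and retrieve/aggregate by time.
import pandas as pd

# Parse the timestamp column while loading the tabular data.
df = pd.read_csv("events.csv", parse_dates=["event_time"])

# A sorted DatetimeIndex makes time-based slicing and resampling straightforward.
ts = df.set_index("event_time").sort_index()

january = ts.loc["2023-01"]                           # all rows from January 2023
daily_counts = ts.resample("D")["event_id"].count()   # daily event counts

print(daily_counts.head())
```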
ENVIRONMENT: PySpark, Spark, Spark SQL, MySQL, Cassandra, Snowflake, MongoDB, Flume, VSTS, Azure HDInsight, Databricks, Data Lake, Cosmos DB, DevOps, Azure AD, Blob Storage, Data Factory, Azure Synapse Analytics, Git, Scala, Hadoop 2.x (HDFS, MapReduce, YARN), Airflow, Hive, Sqoop, HBase, Power BI.