Azure Data Engineer Resume

Jacksonville, FL

SUMMARY

  • Overall 10 years of professional experience, including 6 years as a Big Data and SQL consultant working with Hadoop ecosystem components for ingestion, data modeling, querying, processing, storage, analysis, and data integration, and implementing enterprise-level Big Data systems.
  • A data science enthusiast with strong problem-solving, debugging, and analytical capabilities who actively engages in understanding and delivering business requirements.
  • Experience in building data pipelines using Azure Data Factory and Azure Databricks, loading data into Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling and granting database access.
  • Good experience with Azure services such as HDInsight, Stream Analytics, Active Directory, Blob Storage, Cosmos DB, and Storage Explorer.
  • Extensive experience in Text Analytics, developing different Statistical Machine Learning solutions to various business problems and generating data visualizations using Python and R.
  • Expertise in scripting with PySpark, Java, Scala, and Spark SQL to develop applications for interactive analysis, batch processing, and stream processing.
  • Working knowledge of Azure cloud components (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, Cosmos DB).
  • Extensive knowledge of all phases of data acquisition; data warehousing (gathering requirements, design, development, implementation, testing, and documentation); data modeling (analysis using Star Schema and Snowflake for fact and dimension tables); and data processing and transformation (mapping, cleansing, monitoring, debugging, performance tuning, and troubleshooting Hadoop clusters).
  • Proficient at using Spark APIs for streaming real-time data, staging, cleansing, applying transformations, and preparing data for machine learning needs (a brief PySpark streaming sketch follows this list).
  • Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on Amazon Web Services (AWS).
  • Capable of using Amazon S3 to support data transfer over SSL, with data encrypted automatically once uploaded. Skilled in using Amazon Redshift to perform large-scale database migrations.
  • Strong knowledge of working with Amazon EC2 to provide a complete solution for computing, query processing, and storage across a wide range of applications.
  • Strong working experience with SQL and NoSQL databases (Cosmos DB, MongoDB, HBase, Cassandra), including data modeling, tuning, disaster recovery, backups, and building data pipelines.
  • Worked on ETL migration services by creating and deploying AWS Lambda functions.
  • Experience in designing interactive dashboards, reports, ad-hoc analyses, and visualizations using Tableau, Power BI, Arcadia, and Matplotlib.
  • Skilled in using Kerberos, Azure AD, Sentry, and Ranger for maintaining authentication and authorization; hands-on experience with visualization tools such as Tableau and Power BI.
  • Involved in migrating legacy applications to cloud platforms using DevOps tools such as GitHub, Jenkins, JIRA, Docker, and Slack.
  • Designed, built, deployed and maintained a large-scale data ETL pipeline on AWS using PySpark/Glue.
  • Experience importing and exporting data using Sqoop between HDFS and relational database systems.
  • Leveraged different file formats, including Parquet, Avro, ORC, and flat files.
  • Sound knowledge in developing highly scalable and resilient RESTful APIs, ETL solutions, and third-party integrations as part of an enterprise site platform using Informatica.
  • Extensive experience developing Bash, T-SQL, and PL/SQL scripts.
  • Capable of working with SDLC, Agile, and Waterfall methodologies.
  • Excellent communication, interpersonal, and problem-solving skills; a team player with the ability to quickly adapt to new environments and technologies.
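
A minimal PySpark Structured Streaming sketch of the kind of staging and cleansing described above. It is illustrative only: the input path, schema, column names, and output locations are assumptions, not details taken from any project listed here.

    # Read a JSON event stream, cleanse it, and stage it as Parquet for ML use.
    # Paths, schema, and column names are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_timestamp, trim

    spark = SparkSession.builder.appName("streaming-cleanse").getOrCreate()

    # Read newline-delimited JSON events as they land in a staging folder.
    raw = (spark.readStream
           .format("json")
           .schema("event_id STRING, event_ts STRING, amount DOUBLE")
           .load("/staging/events/"))

    # Basic cleansing: drop malformed rows, normalize types, trim strings.
    clean = (raw
             .dropna(subset=["event_id"])
             .withColumn("event_ts", to_timestamp(col("event_ts")))
             .withColumn("event_id", trim(col("event_id")))
             .filter(col("amount") > 0))

    # Stage the cleansed stream as Parquet for downstream feature engineering.
    query = (clean.writeStream
             .format("parquet")
             .option("path", "/curated/events/")
             .option("checkpointLocation", "/checkpoints/events/")
             .outputMode("append")
             .start())
    query.awaitTermination()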

TECHNICAL SKILLS

Big Data Technologies: HDFS, Yarn, MapReduce, Spark, HBase, Cassandra, Kafka, Kafka Connect, Hive, Airflow, StreamSets, Sqoop, Pig, Oozie, Zookeeper, Flume, Elasticsearch, Scala.

Hadoop Distributions / Cloud Platforms: Apache Hadoop 2.x/1.x, Cloudera, HDP, Microsoft Azure (HDInsight, Databricks, Data Lake, Delta Lake, Blob Storage, Data Factory, SQL DB, SQL DW, Cosmos DB, Azure DevOps, Active Directory), Amazon Web Services (EMR, EC2, EBS, RDS, S3, Glue, Elasticsearch, Lambda, Kinesis, DynamoDB, ECS).

Operating Systems: Linux (Ubuntu, CentOS, RedHat), Windows (XP/7/8/10).

Programming Languages: Python, Scala, Java, R, Shell Scripting, HiveQL.

Databases: MySQL, Oracle, Teradata, MS SQL SERVER, PostgreSQL, DB2.

NoSQL Database: Cassandra, MongoDB, Redis.

Reporting Tools/ETL Tools: Informatica, Talend, SSIS, SSRS, SSAS, ER Studio, Tableau, Power BI, Arcadia.

Development Tools: Spring, J2EE, JDBC, Okta, Postman, Swagger, Angular, Mockito, Flask, Hibernate, Maven, Tomcat, JavaScript, Node.js, HTML, CSS.

Others: Machine Learning, NLP, Terraform, Docker, Kubernetes, Jira, Git.

PROFESSIONAL EXPERIENCE

Confidential

Azure Data Engineer

Responsibilities:

  • Proficient in working with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, SQL DWH, and Data Storage Explorer).
  • Worked on Azure Data Factory to integrate data from both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations before loading into Azure Synapse.
  • Experience in designing Azure cloud architecture and implementation plans for hosting complex application workloads on MS Azure. Involved in writing PySpark functions to mine data and provide real-time insights and reports.
  • Ingested data into Azure Blob Storage and processed it using Databricks. Involved in writing PySpark scripts and UDFs to perform transformations on large datasets.
  • Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and Azure cloud.
  • Designed and deployed data pipelines using Data Lake, Databricks, and Apache Airflow.
  • Configured Spark Streaming to receive real-time data from Apache Flume and store the stream data in Azure Table storage using Python, with Delta Lake used to store the data and support all types of processing and analytics.
  • Utilized Spark Streaming API to stream data from various sources.
  • Involved in using Spark Data Frames to create various Datasets and applied business transformations and data cleansing operations using Databricks Notebooks.
  • Worked with AWS Glue jobs: a Glue crawler catalogs data landing in S3 into the Glue Data Catalog, the data is read into dynamic frames through the Glue context, and Spark transformations and actions are applied based on business requirements (a brief Glue job sketch follows this list).
  • Monitored the Spark cluster using Log Analytics and the Ambari Web UI. Transitioned log storage from Cassandra to Azure SQL Data Warehouse and improved query performance.
  • Efficient in writing Python scripts to build ETL pipelines and Directed Acyclic Graph (DAG) workflows using Airflow and Apache NiFi; tasks are distributed to Celery workers to manage communication between multiple services (a minimal Airflow DAG sketch also follows this list).
  • Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest and analyze data (MS SQL, MongoDB) into HDFS. Loaded data from web servers and Teradata using Sqoop, Flume, and the Spark Streaming API.
  • Optimized existing Python code and improved the cluster performance.
  • Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also worked with Cosmos DB (SQL API and Mongo API).
  • Migrated workflows from Oozie to Apache Airflow and was involved in developing both Oozie and Airflow jobs.
  • Proficient in utilizing data for interactive Power BI dashboards and reporting purposes based on business requirements.
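
A minimal AWS Glue job sketch in PySpark illustrating the crawler-catalog-to-DynamicFrame flow described above. The database, table, and S3 paths are hypothetical placeholders, and it assumes a crawler has already catalogued the S3 data.

    # Read crawled S3 data from the Glue Data Catalog, transform it with Spark,
    # and write the curated result back to S3. Names are illustrative only.
    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # DynamicFrame backed by the table the crawler created from S3.
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db", table_name="events")

    # Drop to a Spark DataFrame for ordinary transformations and actions.
    df = dyf.toDF().dropDuplicates().filter("event_ts IS NOT NULL")

    # Convert back to a DynamicFrame and write Parquet output to S3.
    curated = DynamicFrame.fromDF(df, glue_context, "curated")
    glue_context.write_dynamic_frame.from_options(
        frame=curated,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/"},
        format="parquet")
    job.commit()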
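
A minimal Airflow DAG sketch showing how Python tasks can be chained into a DAG; with the CeleryExecutor configured in airflow.cfg, these tasks are dispatched to Celery workers. The DAG id, schedule, and task bodies are illustrative assumptions.

    # Two Python tasks chained into a daily ETL DAG (Airflow 2.x style).
    # Task bodies are placeholders for real extract/load logic.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull data from the source system")

    def load():
        print("write transformed data to the warehouse")

    with DAG(
        dag_id="example_etl",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task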

Environment: HDInsight, Data Lake, Databricks, Data Factory, SQL, SQL DB, Data Storage Explorer, Azure Synapse, Cassandra, PySpark, Python, Hive, Apache Airflow, Apache Flume, Delta Lake, Ambari Web UI, Postman, Oozie, Power BI.

Confidential, Jacksonville, FL

Sr. Data Engineer

Responsibilities:

  • Installed and configured Apache Hadoop big data components such as HDFS, MapReduce, YARN, Hive, HBase, Sqoop, Pig, Ambari, and NiFi.
  • Migrated from JMS (Solace) to Apache Kafka and used Zookeeper to manage synchronization, serialization, and coordination across the cluster.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Ingested a huge volume and variety of data from disparate source systems into Azure Data Lake Storage Gen2 using Azure Data Factory V2 and Azure cluster services.
  • Developed ADF pipelines to load data from on-prem sources into Azure cloud storage and databases.
  • Extensively worked with SparkContext, Spark SQL, RDD transformations and actions, and DataFrames.
  • Created pipelines, data flows and complex data transformations and manipulations using ADF and PySpark with Databricks.
  • Created and provisioned multiple Databricks clusters for batch and continuous streaming data processing and installed the required libraries on the clusters.
  • Handled customer requests for SQL objects, schedule and business-logic changes, and ad-hoc queries, and analyzed and resolved data sync issues.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks (a brief PySpark read/write sketch follows this list).
  • Developed custom ETL solutions, batch processing and real-time data ingestion pipeline to move data in and out of Hadoop using PySpark and shell scripting.
  • Developed PySpark notebook to perform data cleaning and transformation on various tables.
  • Implemented end-to-end data pipeline using FTP Adaptor, Spark, and Hive.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark SQL and Python.
  • Handled importing other enterprise data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and then loaded the data into HBase tables.
  • Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Created several Databricks Spark jobs in PySpark to perform table-to-table operations.
  • Created linked services to land data from an SFTP location into Azure Data Lake.
  • Perform ongoing monitoring, automation, and refinement of data engineering solutions.
  • Experienced working with both Agile and Waterfall methods in a fast-paced environment.
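
A minimal Databricks PySpark sketch of the ingest-and-process pattern described above: read raw files from ADLS Gen2, apply a simple transformation, and write to an Azure SQL database over JDBC. The storage account, container, table, and credentials are hypothetical placeholders (secrets would normally come from a key vault).

    # Read raw CSVs from ADLS Gen2, clean them, and write to Azure SQL via JDBC.
    # Storage account, container, JDBC URL, and table names are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()  # provided by the Databricks runtime

    raw_path = "abfss://raw@examplestorage.dfs.core.windows.net/sales/"
    df = spark.read.option("header", "true").csv(raw_path)

    # Example transformation: keep valid rows and cast the amount column.
    clean = (df.filter(col("order_id").isNotNull())
               .withColumn("amount", col("amount").cast("double")))

    jdbc_url = ("jdbc:sqlserver://example-server.database.windows.net:1433;"
                "database=exampledb")
    (clean.write
          .format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", "dbo.sales_clean")
          .option("user", "etl_user")            # placeholder credentials
          .option("password", "<from-key-vault>")
          .mode("append")
          .save())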

Environment: Apache Kafka, Zookeeper, Azure Data Storage Services, Data Lake Gen2, Data Factory V2, Spark, Databricks, Python, Hive, HDFS, Flask, HBase, Azure, PySpark, Azure Function Apps, Blob Storage, SQL Server, Spark SQL.

Confidential, Chicago, IL

Data Engineer

Responsibilities:

  • Developed an end-to-end ETL data pipeline that takes data from the source systems and loads it into the RDBMS using Spark; experience with the Hadoop framework, HDFS, and MapReduce processing implementation.
  • Developed an ETL pipeline to extract archived logs from disparate sources and store them in an AWS S3 data lake for further processing using PySpark. Used cron schedulers for weekly automation.
  • Migrated from JMS (Solace) to Kafka and used Zookeeper to manage synchronization, serialization, and coordination.
  • Extensively used AWS Athena to query structured data in S3, feed it into other systems such as Redshift, and generate reports (a brief boto3 Athena sketch follows this list).
  • Worked with the new Business Data Warehouse (BDW); improved query/report performance, reduced the time needed to develop reports, and established a self-service reporting model in Cognos for business users.
  • Involved in moving data from various DB2 tables to AWS S3 buckets using the Sqoop process.
  • Gathered data and performed analytics using AWS stack (EMR, EC2, S3, RDS, Lambda, Redshift).
  • Involved heavily in setting up the CI/CD pipeline using Jenkins, Terraform and AWS.
  • Implemented Spark Java UDFs to handle data quality, filtering, and data validation checks.
  • Involved in importing data from various data sources into HDFS using Sqoop, applying transformations using Hive and Apache Spark, and then loading the data into Hive tables or AWS S3 buckets.
  • Created S3 buckets, managed their policies, and utilized S3 and Glacier for storage and backup on AWS.
  • Developed Bash scripts to get log files from FTP server and executed Hive jobs to parse them.
  • Created External tables, optimized Hive queries and improved the cluster performance by 35%.
  • Used StreamSets for analytics and was involved in debugging and optimizing data pipelines, collecting logs and metrics from various application APIs.
  • Managed and deployed configurations for the entire datacenter infrastructure using Terraform.
  • Coordinated with the Kafka team and built an on-premises data pipeline. Supported Kafka integrations, performed performance tuning, and identified bottlenecks to improve performance and throughput.
  • Used Kerberos for authentication and Apache Sentry for authorization.
  • Enhanced scripts of existing Python modules. Worked on writing APIs to load the processed data into HBase tables.
  • Used Git for version control and interacted with the onsite team on deliverables.
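
A minimal boto3 sketch of running an Athena query over data in S3 and collecting the result, as referenced above. The region, database, query, and output bucket are hypothetical placeholders.

    # Run an Athena query against S3-backed tables and poll until it finishes.
    # Database, query, and output location are illustrative placeholders.
    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    response = athena.start_query_execution(
        QueryString="SELECT event_date, COUNT(*) AS events FROM logs GROUP BY event_date",
        QueryExecutionContext={"Database": "analytics_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    query_id = response["QueryExecutionId"]

    # Wait for Athena to reach a terminal state.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state == "SUCCEEDED":
        result = athena.get_query_results(QueryExecutionId=query_id)
        for row in result["ResultSet"]["Rows"]:
            print([field.get("VarCharValue") for field in row["Data"]])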

Environment: PySpark, AWS S3, ETL data pipelines, Zookeeper, AWS EMR, AWS Athena, EC2, RDS, AWS Lambda, Redshift, Terraform, Jenkins, Spark Java, Bash scripts, Java, Hive, StreamSets, Kafka, Kerberos, Apache Sentry, Python, HBase, Git.

Confidential, Bentonville, AR

Hadoop Developer

Responsibilities:

  • Responsible for building scalable distributed data solutions in a Hadoop cluster environment with the Hortonworks distribution.
  • Designed, developed, and tested Extract, Transform, Load (ETL) applications with different types of sources.
  • Worked on building end to end data pipelines on Hadoop Data Platforms.
  • Involved in converting HiveQL into Spark transformations using Spark RDDs and Scala programming.
  • Converted raw data to serialized formats such as Avro and Parquet to reduce data processing time and increase data transfer efficiency over the network (a brief PySpark conversion sketch follows this list).
  • Created files and tuned SQL queries in Hive using HUE. Implemented MapReduce jobs in Hive by querying the available data.
  • Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Created User-Defined Functions (UDFs) and User-Defined Aggregate Functions (UDAFs) in Pig and Hive.
  • Worked on building custom ETL workflows using Spark/Hive to perform data cleaning and mapping.
  • Supported the cluster and topics via Kafka Manager; handled CloudFormation scripting, security, and resource automation. Worked with PySpark, using Spark libraries through Python scripting for data analysis.
  • Worked on normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
  • Implemented custom Kafka encoders for custom input formats to load data into Kafka partitions.
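
A minimal PySpark sketch of converting raw delimited files to a columnar format such as Parquet, as mentioned above. The original work used Scala and Hive; the paths, delimiter, and partition column here are hypothetical placeholders.

    # Convert raw pipe-delimited text to Parquet to cut processing time and
    # network transfer. Paths and columns are illustrative assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("raw-to-parquet").getOrCreate()

    raw = (spark.read
           .option("header", "true")
           .option("delimiter", "|")
           .csv("hdfs:///data/raw/transactions/"))

    # Columnar Parquet with the default snappy compression; partitioning by a
    # date column keeps downstream scans selective.
    (raw.write
        .mode("overwrite")
        .partitionBy("txn_date")
        .parquet("hdfs:///data/curated/transactions/"))

    # Writing Avro is analogous via .format("avro") once the spark-avro
    # package is on the cluster classpath.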

Environment: Python, Scala, HDFS, Hive, HBase, ETL, HQL, Spark, Flume, Kafka, MapReduce, Zookeeper, Unix, Web Services.

Confidential

Jr. Software Engineer

Responsibilities:

  • Involved in designing and implementing the User Interface for the General Information pages and Administrator functionality.
  • Involved in writing JavaScript functions for front-end validations.
  • Developed a responsive web application for the backend system using AngularJS with HTML5 and CSS3.
  • Extensively used jQuery selectors, events, Traversal, and jQuery AJAX with JSON Objects.
  • Used PL/SQL stored procedures, triggers for handling database processing.
  • Wrote Java classes to test the UI and web services through JUnit.

Environment: JavaScript, AngularJS, HTML, CSS, Java, JDK 1.5, jQuery, AJAX, SQL, JUnit.
