Sr. Azure Data Engineer Resume

New York, NY

SUMMARY

  • 8+ years of experience in IT, including Big Data technologies, the Hadoop ecosystem, data warehousing, and SQL-related technologies across the Retail, Manufacturing, Financial, and Communication sectors. 6 years of experience in Big Data analytics using various Hadoop ecosystem tools and the Spark framework, currently working extensively with Spark and Spark Streaming using Scala as the main programming language. 5+ years of experience in BI application design with MicroStrategy, Tableau, and Power BI.
  • Proficient in Hive, Oracle, SQL Server, SQL, PL/SQL, T-SQL and in managing very large databases
  • Experience writing in-house UNIX shell scripts for Hadoop and Big Data development
  • Good experience working with Azure Blob and Data Lake Storage and loading data into Azure Synapse Analytics (SQL DW).
  • Hands-on experience with Python programming and PySpark implementations on AWS EMR, building data pipeline infrastructure to support deployments of machine learning models, and performing data analysis and cleansing with statistical models, with extensive use of Python, Pandas, NumPy, visualization using Matplotlib and Seaborn, and scikit-learn and XGBoost packages for predictions.
  • Proficient in writing complex Spark (PySpark) user-defined functions (UDFs), Spark SQL, and HiveQL (a brief PySpark UDF sketch follows this summary).
  • Experience working with Azure services like Data Lake, Data Lake Analytics, SQL Database, Synapse, Databricks, Data Factory, Logic Apps, and SQL Data Warehouse, and GCP services like BigQuery, Dataproc, Pub/Sub, etc.
  • Experience in developing data pipelines using Pig, Sqoop, and Flume to extract data from web logs and store it in HDFS, and accomplished in developing Pig Latin scripts and using HiveQL for data analytics
  • Extensively dealt with Spark Streaming and Apache Kafka to fetch live stream data
  • Experience in converting Hive/SQL queries into Spark transformations using Java and experience in ETL development using Kafka, Flume and Sqoop
  • Experience working on NoSQL databases including HBase, Cassandra and MongoDB and experience using Sqoop to import data into HDFS from RDBMS and vice-versa
  • Extensive experience working with NoSQL databases and their integration, including DynamoDB, Cosmos DB, MongoDB, Cassandra, and HBase.
  • Developed Spark scripts using the Scala shell as per requirements
  • Good experience in writing Sqoop queries for transferring bulk data between Apache Hadoop and structured data stores
  • Substantial experience writing MapReduce jobs in Java and working with Pig, Flume, ZooKeeper, Hive, and Storm
  • Extensive knowledge in using Elasticsearch and Kibana.
  • Experience in developing Spark applications using PySpark and Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats (structured/unstructured) to analyze and transform the data and uncover insights into customer usage patterns.
  • Experience in data warehousing, data marts, and data wrangling using Azure Synapse Analytics
  • Worked with container-based technologies like Docker, Kubernetes, and OpenShift.
  • Created multiple MapReduce jobs using the Java API, Pig, and Hive for data extraction
  • Experienced in Extract, Transform, and Load (ETL) processing of large datasets of different forms, including structured, semi-structured, and unstructured data
  • Experience in understanding business requirements for analysis, database design, and development of applications
  • Worked with Kafka tools like Kafka migration, MirrorMaker, and Consumer offset checker
  • Experience with real-time data ingestion using Kafka.
  • Experience with CI/CD pipelines using Jenkins, Bitbucket, GitHub, etc.
  • Enthusiastic learner and excellent problem solver
  • Strong expertise in troubleshooting and performance fine-tuning Spark, Map Reduce and Hive applications
  • Hands on experience in using different file formats and compression techniques in Hadoop like Snappy, Lzo, Lz4, Bzip2, and Gzip
  • Extensive experience in Text Analytics, developing different Statistical Machine Learning solutions to various business problems and generating data visualizations using Python and R.
  • Good experience working with the Amazon EMR framework for processing data on EMR and EC2 instances
  • Hands-on experience in developing Spark applications using RDD transformations, Spark Core, Spark Streaming, and Spark SQL
  • Created AWS VPC networks for the installed instances and configured security groups and Elastic IPs accordingly
  • Experience in automated scripts using UNIX shell scripting to perform database activities
  • Developed AWS CloudFormation templates to create custom-sized VPCs, subnets, EC2 instances, ELBs, and security groups
  • Experience with the AWS Command Line Interface and PowerShell for automating administrative tasks. Defined AWS security groups, which acted as virtual firewalls controlling the traffic reaching one or more AWS EC2 and Lambda instances.
  • Extensive experience in developing applications that perform data processing tasks using Teradata, Oracle, SQL Server, and MySQL databases
  • Worked on data warehousing and ETL tools like Informatica, Tableau, and Qlik Replicate.
  • Experience in understanding the security requirements for Hadoop and integrating with Kerberos authentication and authorization infrastructure
  • Acquaintance with Agile and Waterfall methodologies. Responsible for handling several client-facing meetings with strong communication skills
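
The summary above references complex PySpark UDFs used alongside Spark SQL. The snippet below is a minimal, hedged sketch of that pattern rather than code from any listed engagement; the sample "orders" data, the amount_band function, and the 500 threshold are illustrative assumptions.

    # Minimal sketch: a PySpark UDF registered for both DataFrame and Spark SQL use.
    # The sample data, column names, and threshold are assumptions for illustration.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

    df = spark.createDataFrame(
        [(1, 120.0), (2, 45.5), (3, 980.0)], ["order_id", "amount"]
    )

    # Categorize order amounts into bands.
    def amount_band(amount):
        return "high" if amount >= 500 else "standard"

    amount_band_udf = udf(amount_band, StringType())               # DataFrame API
    spark.udf.register("amount_band", amount_band, StringType())   # Spark SQL

    df.withColumn("band", amount_band_udf(col("amount"))).show()

    df.createOrReplaceTempView("orders")
    spark.sql("SELECT order_id, amount_band(amount) AS band FROM orders").show()

In practice such Python UDFs are kept small and replaced with built-in Spark functions where possible, since Python UDFs serialize rows out of the JVM.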

TECHNICAL SKILLS

Big Data Technologies: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala

Hadoop Distribution: Cloudera, Hortonworks, Apache, AWS

Languages: Java, SQL, PL/SQL, Python, Pig Latin, HiveQL, Scala, Regular Expressions

Web Technologies: HTML, CSS, JavaScript, XML, JSP, Restful, SOAP

Operating Systems: Windows (XP/7/8/10), UNIX, LINUX, UBUNTU, CENTOS.

Portals/Application servers: WebLogic, WebSphere Application server, WebSphere Portal server, JBOSS, TOMCAT

Build Automation tools: SBT, Ant, Maven

Version Control: GIT

IDE & Build Tools, Design: Eclipse, Visual Studio, JUnit, IntelliJ, PyCharm.

Databases: MS SQL Server 2016/2014/2012, Azure SQL DB, Azure Synapse, MS Excel, MS Access, Oracle 11g/12c, Cosmos DB

PROFESSIONAL EXPERIENCE

Sr. Azure Data Engineer

Confidential, New York, NY

Responsibilities:

  • Involved in the complete Big Data flow of the application, from data ingestion from upstream systems into HDFS to processing and analyzing the data in HDFS
  • Working knowledge on Azure cloud components (Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, CosmosDB).
  • Experience in analyzing data from Azure data storages using Databricks for deriving insights using Spark cluster capabilities.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Databricks, PySpark, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Developed scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL activity.
  • Developed Spark APIs to import data into HDFS from Teradata and created Hive tables
  • Developed Sqoop jobs to import data in Avro file format from Oracle databases and created Hive tables on top of it
  • Involved in configuring the Elasticsearch, Logstash & Kibana (ELK) stack and in Elasticsearch performance tuning and optimization.
  • Converted JSON feature data for the Elastic Stack, feeding it through Logstash into Kibana.
  • Configured Spark Streaming to receive real-time data from Apache Flume and store the streaming data to Azure Table Storage using Scala; Data Lake is used to store and run all types of processing and analytics.
  • Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression and then loaded data into the Parquet Hive tables from Avro Hive tables.
  • Involved in developing data ingestion pipelines on Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also Worked with Cosmos DB (SQL API and Mongo API).
  • Built a learner data model that gets data from Kafka in real time and persists it to Cassandra.
  • Hands-on programming experience in languages like Java and Scala
  • Improved the performance of queries against tables in the enterprise data warehouse in Azure Synapse Analytics by using table partitions
  • Involved in running all the Hive scripts through Hive, Impala, and Hive on Spark, and some through Spark SQL
  • Expertise in managing logs through Kafka with Logstash
  • Provided architecture and design as the product was migrated to Scala and the Play framework.
  • Involved in performance tuning of Hive from design, storage, and query perspectives
  • Developed a Flume ETL job for handling data from an HTTP source with HDFS as the sink.
  • Collected JSON data from the HTTP source and developed Spark APIs that help do inserts and updates in Hive tables
  • Good understanding and knowledge of NoSQL databases like MongoDB, HBase, and Cassandra; worked on HBase to load and retrieve data for real-time processing using a REST API. Worked with Terraform to create stacks in AWS from scratch and updated the Terraform configuration regularly per the organization's requirements.
  • Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB and deployed AWS Lambda code from Amazon S3 buckets. Created a Lambda deployment function, configured it to receive events from our S3 buckets, and provisioned Lambda functions to create a Logstash pipeline for centralized logging.
  • Developed Spark scripts to import large files from Amazon S3 buckets
  • Developed Spark Core and Spark SQL scripts using Scala for faster data processing.
  • Developed Kafka consumer APIs in Scala for consuming data from Kafka topics
  • Involved in designing and developing tables in HBase and storing aggregated data from Hive tables
  • Integrated Hive and Tableau Desktop reports and published them to Tableau Server
  • Developed shell scripts for running Hive scripts in Hive and Impala
  • Virtualized the servers using Docker for the test and dev environment needs, along with configuration automation using Docker containers.
  • Used Azure Databricks, Azure Storage accounts, etc., for source stream extraction, cleansing, consumption, and publishing across multiple user bases.
  • Extracted data from HDFS using Hive and Presto, performed data analysis using Spark with Scala and PySpark, performed feature selection, and created nonparametric models in Spark.
  • Transformed and copied data from JSON files stored in Data Lake Storage into an Azure Synapse Analytics table using Azure Databricks
  • Monitored the Spark cluster using Log Analytics and the Ambari Web UI. Transitioned log storage from Cassandra to Azure SQL Data Warehouse and improved query performance.
  • Responsible for managing data coming from different sources and for implementing MongoDB to store and analyze unstructured data
  • Worked on Hadoop ecosystem in PySpark on Amazon EMR and Databricks.
  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources like Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and back again via a write-back tool.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing and transforming the data to uncover insights into customer usage patterns.
  • Wrote UDFs in Scala and PySpark to meet specific business requirements.
  • Created Databricks notebooks using Python (PySpark), Scala, and Spark SQL for transforming the data stored in Azure Data Lake Storage Gen2 from the raw zone to the stage and curated zones (see the raw-to-stage sketch after these responsibilities).
  • Performed data profiling and transformation on the raw data using Python.
  • Used Talend for big data integration using Spark and Hadoop
  • Coordinated with the DBA on database builds and table normalizations and de-normalizations
  • Worked with Azure Blob and Data Lake Storage and loaded data into Azure Synapse Analytics (SQL DW).
  • Worked on the creation of custom Docker container images, tagging and pushing the images, and Docker consoles for maintaining the application life cycle.
  • Involved in big data requirement analysis and developing and designing solutions for ETL platforms.
  • Deployed web applications into different application servers using Jenkins and implemented Automated Application Deployment using Ansible.
  • Worked on Ansible Playbooks with Ansible roles. Created inventory in Ansible for automating the continuous deployment. Configured the servers, deployed software, and orchestrated continuous deployments or zero downtime rolling updates.
  • Experience building microservices and deploying them into Kubernetes cluster as well as Docker Swarm.
  • Orchestrated a number of Sqoop and Hive scripts using Oozie workflows and scheduled them using the Oozie coordinator
  • Administered all requests, analyzed issues, and provided efficient resolutions
  • Designed all program specifications and performed required tests
  • Implemented RDD/Dataset/DataFrame transformations in Scala through SparkContext and HiveContext
  • Used Jira for bug tracking and Bitbucket to check in and check out code changes
  • Designed column families in Cassandra, ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra as per the business requirement.
  • Experienced in version control tools like Git and ticket tracking platforms like Jira.
  • Expert at handling unit testing using JUnit 4, JUnit 5, and Mockito
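
The raw-to-stage Databricks transformation mentioned above can be illustrated with the minimal PySpark sketch below. It is not the actual pipeline: the abfss:// paths, the customer_id key, and the load_date column are hypothetical, and storage credentials are assumed to be configured outside the snippet.

    # Minimal raw-to-stage sketch for a Databricks notebook; paths and columns are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, current_timestamp, to_date

    spark = SparkSession.builder.getOrCreate()  # provided as `spark` in Databricks

    raw_path = "abfss://raw@examplelake.dfs.core.windows.net/customer/"      # assumed
    stage_path = "abfss://stage@examplelake.dfs.core.windows.net/customer/"  # assumed

    raw_df = spark.read.json(raw_path)

    stage_df = (
        raw_df
        .dropDuplicates(["customer_id"])                               # hypothetical key column
        .withColumn("load_date", to_date(col("load_date"), "yyyy-MM-dd"))
        .withColumn("ingested_at", current_timestamp())
    )

    # Land the cleansed data in the stage zone as Snappy-compressed Parquet,
    # partitioned by load date for the downstream curated zone.
    stage_df.write.mode("overwrite").partitionBy("load_date").parquet(stage_path)

Partitioning the stage output by load date keeps downstream curated loads incremental and cheap to query.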

Environment: HDFS, Yarn, MapReduce, Hive, Sqoop, Flume, Oozie, HBase, Kafka, Impala, Spark SQL, Spark Streaming, Eclipse, Informatica, Oracle, AWS, Teradata, CI/CD, PL/SQL, UNIX Shell Scripting, Cloudera.

Sr. Azure Data Engineer

Confidential, Smithfield, RI

Responsibilities:

  • Experience in designing and developing POCs in Spark using Scala to compare the performance of Spark with MapReduce and Hive
  • Hands-on experience with Azure cloud services: Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
  • Involved in creating Hive tables and loading and analyzing data using Hive queries
  • Designed and developed custom Hive UDFs
  • Also worked with Cosmos DB (SQL API and Mongo API).
  • Used the JSON and XML SerDes for serialization and deserialization to load JSON and XML data into Hive tables
  • Involved in migrating ETL processes from Oracle to Hive to test easier data manipulation.
  • Implemented reprocessing of failed messages in Kafka using the offset ID
  • Used HiveQL to analyze the partitioned and bucketed data and executed Hive queries on Parquet tables stored in Hive to perform data analysis meeting the business specification logic
  • Developed a Spark job in Java which indexes data into ElasticSearch from external Hive tables which are in HDFS.
  • Developed code in Java which creates mappings in Elasticsearch even before data is indexed into it.
  • Wrote Hive queries on the analyzed data for aggregation and reporting
  • Developed Sqoop jobs to load data from RDBMS to external systems like HDFS and Hive
  • Worked on converting dynamic XML data for injection into HDFS
  • Responsible for managing data coming from different sources and for implementing MongoDB to store and analyze unstructured data
  • Transformed and copied data from JSON files stored in Data Lake Storage into an Azure Synapse Analytics table using Azure Databricks
  • Implemented various Hive queries for analytics and called them from a Java client engine to run on different nodes. Worked on writing APIs to load the processed data to HBase tables.
  • Used Azure Databricks, Azure Storage accounts, etc., for source stream extraction, cleansing, consumption, and publishing across multiple user bases.
  • Was involved in writing PySpark user-defined functions (UDFs) for various use cases and applied business logic wherever necessary in the ETL process
  • Wrote Spark SQL and Spark scripts (PySpark) in the Databricks environment to validate the monthly account-level customer data stored in S3 (a validation sketch follows these responsibilities)
  • Designed tables for quick searching, sorting, and grouping using the Cassandra Query Language.
  • Used Terraform to reliably version and create infrastructure on Azure. Created resources, using Azure Terraform modules, and automated infrastructure management.
  • Involved in developing data ingestion pipelines on Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also Worked with Cosmos DB (SQL API and Mongo API).
  • Worked with Terraform Templates to automate the Azure IaaS virtual machines using terraform modules and deployed virtual machine scale sets in production environment.
  • Used Azure Kubernetes Service to deploy a managed Kubernetes cluster in Azure and created an AKS cluster in the Azure portal, with the Azure CLI, also used template driven deployment options such as Resource Manager templates and Terraform.
  • Used the Spark Data Cassandra Connector to load data to and from Cassandra.
  • Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster processing of data
  • Loading data from UNIX file system to HDFS
  • Responsible for converting row-like regular hive external tables into columnar snappy compressed parquet tables with key-value pairs
  • Configured Spark Streaming to receive real-time data from Apache Flume and store the streaming data to Azure Table Storage using Scala; Data Lake is used to store and run all types of processing and analytics.
  • Loaded the data into Spark RDDs and performed in-memory data computation to generate the output response
  • Used several RDD transformations to filter the data ingested into Spark SQL
  • Used HiveContext and SQLContext to integrate Hive metastore and SparkSQL for optimum performance.
  • Used Control-M to schedule the jobs daily and validated the jobs
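
The monthly account-level validation in Databricks referenced above can be sketched roughly as follows. The s3a:// path and the account_id, month, and balance columns are assumptions for illustration only.

    # Minimal Spark SQL validation sketch over assumed monthly account Parquet files in S3.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("account-validation").getOrCreate()

    accounts = spark.read.parquet("s3a://example-bucket/monthly-accounts/")  # assumed path
    accounts.createOrReplaceTempView("monthly_accounts")

    # Flag months with duplicate accounts or null balances before loading downstream.
    issues = spark.sql("""
        SELECT month,
               COUNT(*)                                          AS row_count,
               COUNT(DISTINCT account_id)                        AS distinct_accounts,
               SUM(CASE WHEN balance IS NULL THEN 1 ELSE 0 END)  AS null_balances
        FROM monthly_accounts
        GROUP BY month
        HAVING COUNT(*) <> COUNT(DISTINCT account_id)
            OR SUM(CASE WHEN balance IS NULL THEN 1 ELSE 0 END) > 0
    """)

    issues.show(truncate=False)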

Environment: Spark SQL, HDFS, Hive, Pig, Apache Sqoop, Java (JDK SE 6, 7), Scala, Shell scripting, Linux, MySQL, Oracle Enterprise DB, PostgreSQL, AWS, IntelliJ, CI/CD, Oracle, Subversion, Control-M, Teradata, and Agile methodologies.

Azure Data Engineer

Confidential, Charlotte, NC

Responsibilities:

  • Experienced in upgrading the Cloudera Hadoop cluster from 5.3.8 to 5.8.0 and from 5.8.0 to 5.8.2.
  • Hands-on experience with all Hadoop ecosystem components (HDFS, YARN, MapReduce, Hive, Flume, Oozie, Zookeeper, Impala, HBase, and Sqoop) through Cloudera Manager
  • Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis
  • Designed and implemented streaming solutions using Kafka or Azure Stream Analytics
  • Designed and implemented database solutions in Azure SQL Data Warehouse, Azure SQL
  • Created Oozie workflow in process to automate the spark application.
  • Worked on snowflaking the dimensions to remove redundancy.
  • Used Kibana an open-source plugin for Elasticsearch (ELK) in analytics and Data visualization.
  • Built models using Python and PySpark to predict probability of attendance for various campaigns and events
  • Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store, HDInsight
  • Worked on complex SnowSQL and Python queries in Snowflake.
  • Created various types of data visualizations using Python.
  • Used Kafka for real time data ingestion and processing.
  • Worked on Azure AD Connect to sync on-premises AD user data, groups, and organizations to Azure AD, troubleshot Azure services sync with on-premises AD, and resynced using the Azure tools.
  • Managed Git repositories for branching, merging, and tagging and developed Groovy scripts for automation purposes. Extended the generic process by attaching the Jenkins job webhook to all the current Java and Scala-based projects in GitHub.
  • Transformed and copied data from JSON files stored in Data Lake Storage into an Azure Synapse Analytics table using Azure Databricks
  • Experience with Jenkins for continuous integration and deployment into Tomcat servers, and worked on setting up Jenkins slaves for end-to-end automation.
  • Developed parser and loader MapReduce applications to retrieve data from HDFS and store it in HBase and Hive
  • Experienced in setting up Multi-hop, Fan-in, and Fan-out workflow in Flume
  • Imported several transactional logs from web servers with Flume to ingest the data into HDFS
  • Implemented custom serializers and interceptors in Flume to mask confidential data and filter unwanted records from the event payload
  • Good experience in writing Spark applications using Scala and Java; used the Scala toolset to develop Scala projects and executed them using spark-submit
  • Configured Spark Streaming to receive real-time data from Apache Flume and store the streaming data to Azure Table Storage using Scala; Data Lake is used to store and run all types of processing and analytics
  • Responsible for ingesting large volumes of IoT data into Kafka workflows for daily incremental loads, getting data from RDBMS and NoSQL sources (MongoDB, MS SQL).
  • Used the Spark Data Cassandra Connector to load data to and from Cassandra.
  • Experienced in creating data models for clients' transactional logs and analyzed the data from Cassandra.
  • Wrote Kafka producers to stream data from external REST APIs to Kafka topics (a producer sketch follows these responsibilities)
  • Worked with teams to use KSQL for real-time analytics
  • Used Flume to collect data from a variety of sources like logs, JMS, directories, etc.
  • Worked with multiplexing, replicating and consolidation in Flume
  • Experience in Cloudera Hadoop upgrades and patches and installation of ecosystem products through Cloudera Manager, along with Cloudera Manager upgrades.
  • Used Oozie operational services for batch processing and scheduling workflows dynamically
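
The Kafka producers that streamed data from external REST APIs, as referenced above, can be sketched as below using the kafka-python client. The endpoint URL, topic name, brokers, and 30-second poll interval are assumptions, not details of the actual feed.

    # Minimal sketch: poll a REST API and publish JSON records to a Kafka topic.
    import json
    import time

    import requests
    from kafka import KafkaProducer  # kafka-python

    producer = KafkaProducer(
        bootstrap_servers=["localhost:9092"],                     # assumed brokers
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    API_URL = "https://api.example.com/v1/events"   # hypothetical endpoint
    TOPIC = "iot-events"                            # hypothetical topic

    while True:
        response = requests.get(API_URL, timeout=10)
        response.raise_for_status()
        for event in response.json():
            producer.send(TOPIC, value=event)
        producer.flush()   # make sure the batch is delivered before sleeping
        time.sleep(30)     # poll interval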

Environment: Hadoop, HDFS, Hive, MapReduce, Impala, MySQL, Oracle, Scala, Sqoop, Spark, SQL, Talend, Yarn, Pig, Oozie, Linux-Ubuntu, Tableau, Maven, Jenkins, Java (JDK 1.6), CI/CD, Cloudera, JUnit, Agile methodologies

Hadoop Developer/Data Engineer

Confidential

Responsibilities:

  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data
  • Responsible for managing data coming from different sources and for implementing MongoDB to store and analyze unstructured data
  • Supported MapReduce programs running on the cluster and was involved in loading data from the UNIX file system to HDFS
  • Transformed and copied data from JSON files stored in Data Lake Storage into an Azure Synapse Analytics table using Azure Databricks
  • Installed and configured Hive and wrote Hive UDFs
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs (a Hive table sketch follows these responsibilities)
  • Created HBase tables to store variable data formats of PII data coming from different portfolios
  • Implemented best income logic using Pig scripts
  • Tested the cluster Performance using Cassandra-stress tool to measure and improve the Read/Writes
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team
  • Built a learner data model that gets data from Kafka in real time and persists it to Cassandra.
  • Supported setting up the QA environment and updating configurations for implementing scripts with Pig and Sqoop
  • Involved in a POC for migrating ETLs from Hive to Spark in a Spark on YARN environment
  • Actively participating in the code reviews, meetings and solving any technical issues
  • Continuous monitoring and managing theHadoop cluster using ClouderaManager
  • Used the Hibernate ORM framework with the Spring framework for data persistence and transaction management and was involved in building templates and screens in HTML and JavaScript
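
The Hive table work referenced above (create, load, query) ran through Hive with MapReduce as the execution engine; for consistency with the other sketches it is shown here through PySpark's Hive support. The web_logs table, its columns, and the partition value are illustrative assumptions.

    # Minimal sketch: create and query a partitioned Hive table via a Hive-enabled Spark session.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-sketch")
        .enableHiveSupport()   # assumes an available Hive metastore
        .getOrCreate()
    )

    spark.sql("""
        CREATE TABLE IF NOT EXISTS web_logs (
            user_id STRING,
            url     STRING,
            ts      TIMESTAMP
        )
        PARTITIONED BY (dt STRING)
        STORED AS PARQUET
    """)

    # Aggregate page hits per user for a given partition.
    spark.sql("""
        SELECT user_id, COUNT(*) AS hits
        FROM web_logs
        WHERE dt = '2023-01-01'
        GROUP BY user_id
        ORDER BY hits DESC
    """).show()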

Environment: Hadoop, HDFS, MapReduce, Pig, Sqoop, UNIX, HBase, Java, JavaScript, HTML.

Data Analyst/Data Engineer

Confidential

Responsibilities:

  • Analyzed different user requirements and came up with specifications for the various database applications
  • Planned and implemented capacity expansion in order to ensure that the company's databases are scalable
  • Diagnosing and resolving database access and checking on performance issues
  • Designed and developed specific databases for collection, tracking and reporting of data
  • Designed, coded, tested, and debugged custom queries using Microsoft T-SQL and SQL Reporting Services
  • Experienced in creating data models for clients' transactional logs and analyzed the data from Cassandra.
  • Conducted research to collect and assemble data for databases; was responsible for the design and development of relational databases for collecting data
  • Built data inputs and designed data collection screens; managed database design and maintenance, administration, and security for the company
  • Used Informatica PowerCenter to create mappings, sessions, and workflows for populating the data into dimension, fact, and lookup tables simultaneously from different source systems (SQL Server, Oracle, flat files)
  • Created mappings using various transformations like Source Qualifier, Aggregator, Expression, Filter, Router, Joiner, Stored Procedure, Lookup, Update Strategy, Sequence Generator, and Normalizer
  • Worked extensively with SSIS to import, export, and transform data between systems, and used T-SQL for querying the SQL Server database for data validation and data conditioning (a validation-query sketch follows these responsibilities)
  • Discussed intelligence and information requirements with internal and external personnel
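
The T-SQL data-validation work referenced above can be sketched from Python using pyodbc, shown below for consistency with the other sketches; in this role the queries were run directly against SQL Server. The server, database, table, and validation rules are assumptions.

    # Minimal sketch: run a T-SQL validation query against SQL Server via pyodbc.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=example-server;DATABASE=ExampleDW;"   # assumed server and database
        "Trusted_Connection=yes;"
    )

    # Flag staging rows that violate basic conditioning rules before they reach reporting tables.
    validation_sql = """
        SELECT CustomerID, OrderDate, Amount
        FROM dbo.StagingOrders
        WHERE Amount < 0
           OR OrderDate IS NULL
    """

    with conn:
        cursor = conn.cursor()
        for row in cursor.execute(validation_sql):
            print(row.CustomerID, row.OrderDate, row.Amount)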

Environment: Windows Server, Microsoft SQL Server, Informatica, Query Analyzer, Enterprise Manager, Import and Export, SQL Profiler.
