Sr. Azure Data Engineer Resume
Charlotte, NC
SUMMARY
- 9+ years of professional IT experience across the Big Data ecosystem and Python technologies.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Expertise in Hadoop architecture and its components, including HDFS, YARN, Hive, Pig, High Availability, JobTracker, TaskTracker, NameNode, DataNode, Apache Cassandra, and the MapReduce programming paradigm.
- Experience with Hadoop clusters on Cloudera CDH and Hortonworks HDP.
- Worked on Airflow 1.8 (Python 2) and Airflow 1.9 (Python 3) for orchestration; familiar with building custom Airflow operators and orchestrating workflows with dependencies spanning multiple clouds.
- Experience in using various tools like Sqoop, Flume, Kafka, and Pig to ingest structured, semi-structured and unstructured data into the cluster.
- Experience using Azure SQL Database, Azure Data Lake, Azure ML, Azure Data Factory, Azure Functions, Databricks, and HDInsight.
- Hands-on experience with version control using Bitbucket.
- Good understanding of cloud technologies such as GCP and Azure.
- Proficient with the Apache Spark ecosystem, including Spark Core and Spark Streaming, using Scala and Python.
- Developed highly optimized Spark applications to perform data cleansing, validation, transformation, and summarization activities according to requirements (a minimal PySpark sketch follows this summary).
- Hands-on experience with Spark architecture and its APIs, including Spark SQL, DataFrames, and Datasets.
- Extensive experience migrating on-premises Hadoop platforms to cloud solutions on AWS and Azure.
- 3+ years of experience writing Python ETL frameworks and PySpark jobs to process large volumes of data daily.
- Strong experience implementing data models and loading unstructured data using HBase, DynamoDB, and Cassandra.
- Hands-on experience with application deployment using CI/CD pipelines.
- Able to work across both GCP and Azure clouds concurrently.
- Experience in implementing Spark using Scala and Spark SQL for faster processing of data.
- Strong experience extracting and loading data from different sources using complex business logic in Hive, and building ETL pipelines that process terabytes of data daily.
- Experienced in transporting and processing real-time event streams using Kafka and Spark Streaming.
- Hands on experience with importing and exporting data from Relational databases to HDFS, Hive and HBase using Sqoop.
- Experienced in processing real-time data using Kafka 0.10.1 producers and stream processors; implemented stream processing with Kinesis, landing data into an S3 data lake.
- Designed and developed Spark pipelines to ingest real-time, event-based data from Kafka and other message queue systems, and processed large volumes with Spark batch jobs into the Hive data warehouse.
- Experienced in creating and analyzing Software Requirement Specifications (SRS) and Functional Specification Documents (FSD).
- Worked with the Twitter Bootstrap framework to design single-page applications.
- Strong working knowledge of developing cross-browser-compatible (IE, Firefox, Safari, Chrome, etc.) dynamic web applications.
- Experienced in developing unit test cases using JUnit, Mockito, and ScalaTest.
- Capable of organizing, coordinating and managing multiple tasks simultaneously.
- Excellent communication and inter-personal skills, self-motivated, organized and detail-oriented, able to work well under deadlines in a changing environment and perform multiple tasks effectively and concurrently.
- Strong analytical skills with ability to quickly understand client’s business needs. Involved in meetings to gather information and requirements from the clients.
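A minimal PySpark sketch of the cleansing, validation, and summarization pattern described above; the source path, column names, and output path are hypothetical placeholders rather than project code.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("cleanse_and_summarize").getOrCreate()

    # Hypothetical source path; real jobs read from the project's landing zone.
    orders = spark.read.parquet("hdfs:///data/raw/orders")

    cleansed = (orders
                .dropDuplicates(["order_id"])                       # de-duplicate on the business key
                .filter(F.col("amount") > 0)                        # basic validation rule
                .withColumn("order_date", F.to_date("order_ts")))   # normalize the event timestamp

    daily_summary = (cleansed
                     .groupBy("order_date")
                     .agg(F.sum("amount").alias("total_amount"),
                          F.count("*").alias("order_count")))

    daily_summary.write.mode("overwrite").parquet("hdfs:///data/curated/daily_orders")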
TECHNICAL SKILLS
Hadoop: Hadoop, MapReduce, Hive, Pig, Impala, Sqoop, HDFS, HBase, Oozie, Spark, PySpark, Scala and MongoDB
Cloud Technologies: Azure Analysis Services, Azure SQL Server, DynamoDB, Step Functions, Glue, Athena, CloudWatch, Azure Data Factory, Azure Data Lake, Azure Functions, Azure SQL Data Warehouse, Databricks and HDInsight
DBMS: Amazon Redshift, PostgreSQL, Oracle 9i, SQL Server, IBM DB2 and Teradata
ETL Tools: DataStage, Talend and Ab Initio
Deployment Tools: Git, Jenkins, Terraform and CloudFormation
Programming Languages: Python, Scala, PL/SQL
Scripting: Unix Shell and Bash scripting
PROFESSIONAL EXPERIENCE
Confidential, Charlotte, NC
Sr. Azure Data Engineer
Responsibilities:
- Analyzed, designed, and built modern data solutions using Azure PaaS services to support visualization of data. Understood the current production state of the application and determined the impact of new implementations on existing business processes.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Experienced in writing Spark Applications in Scala and Python.
- Built data pipelines in Airflow on GCP for ETL jobs using various Airflow operators.
- Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
- Analyzed large, critical datasets using HDFS, MapReduce, Kafka, Spark, HBase, Hive, and Hive UDFs.
- Used the Kafka consumer API in Scala to consume data from Kafka topics.
- Developed Spark applications in a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables (see the PySpark sketch at the end of this list).
- Designed SSIS packages to transfer data from flat files and Excel into SQL Server using Business Intelligence Development Studio.
- Developed a POC for project migration from an on-premises Hadoop (MapR) system to GCP.
- Analyzed SQL scripts and designed the solution to be implemented using PySpark.
- Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
- Coordinated with the team and developed a framework to generate daily ad hoc reports and extracts from enterprise data in BigQuery.
- Developed custom aggregate functions using Spark SQL and performed interactive querying.
- Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
- Developed Python code to gather data from HBase (Cornerstone) and designed the solution to be implemented using PySpark.
- Ran data formatting scripts in Java and created terabyte-scale CSV files to be consumed by Hadoop MapReduce jobs.
- Implemented a Kafka model that pulls the latest records into Hive external tables.
- Loaded all datasets into Hive and Cassandra from source CSV files using Spark/PySpark.
- Developed Cloud Functions in Python to process JSON files from the source and load them into BigQuery.
- Involved in ETL, data integration, and migration by writing Pig scripts.
- Experience in moving data between GCP and Azure using Azure Data Factory.
- Exported the analyzed data to Teradata using Sqoop for visualization and to generate reports for the BI team.
- Migrated computational code from HQL to PySpark.
- Imported data into HDFS from various SQL databases and files using Sqoop, and from streaming systems into the big data lake using Storm.
- Worked on downloading BigQuery data into pandas and Spark DataFrames for advanced ETL capabilities.
- Completed data extraction, aggregation, and analysis in HDFS using PySpark and stored the required data in Hive.
- Exposure to using Apache Kafka to develop data pipelines for logs as a stream of messages with producers and consumers.
- Sound knowledge in programming Spark using Scala.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Populated HDFS and HBase with huge amounts of data using Apache Kafka.
- Experienced in working with various data sources such as Teradata and Oracle. Successfully loaded files from Teradata into HDFS and loaded data from HDFS into Hive and Impala.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs that run independently based on time and data availability.
- Involved in implementing security on a Hortonworks Hadoop cluster with Kerberos, working with the operations team to move from a non-secured cluster to a secured cluster.
- Created Build and Release for multiple projects in production environment using Visual Studio Team Services.
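A minimal sketch of the CSV-to-Hive ORC load pattern referenced in the list above, reconciling differing file schemas against a target column list. The paths, column names, and Hive table name are hypothetical placeholders, not the project's actual code.

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("csv_to_hive_orc")     # hypothetical application name
             .enableHiveSupport()            # required to write managed Hive tables
             .getOrCreate())

    # Hypothetical landing path; headers give column names and types are inferred.
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///data/landing/*.csv"))

    # Reconcile differing schemas: keep target columns, fill missing ones with nulls.
    target_cols = ["id", "event_ts", "payload"]      # hypothetical target schema
    aligned = df.select([
        F.col(c).cast("string") if c in df.columns else F.lit(None).cast("string").alias(c)
        for c in target_cols
    ])

    # Append into the Hive-managed ORC table (hypothetical database.table name).
    (aligned.write
            .mode("append")
            .format("orc")
            .saveAsTable("analytics.events"))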
Environment: Python, Hadoop, Hive, GCP, BigQuery, Pig, Sqoop, Scala, Azure Databricks, Kafka, Flume, HBase, PySpark, AWS, Hortonworks, Oracle 10g/11g/12c, Teradata, Cassandra, HDFS, Data Lake, Spark, MapReduce, Ambari, Cloudera, Tableau, Snappy, Zookeeper, NoSQL, Shell Scripting, Ubuntu, Solr.
Confidential, NYC, NY
Azure Data Engineer
Responsibilities:
- Developed real-time data processing applications using Scala and Python and implemented Apache Spark Streaming from streaming sources such as Kafka and JMS.
- Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
- Design & implement migration strategies for traditional systems on Azure.
- Knowledge on PySpark and used Hive to analyze sensor data and cluster users based on their behavior in the events.
- Created BigQuery authorized views for row level security or exposing the data to other teams.
- Experienced in writing live Real-time Processing and core jobs using Spark Streaming with Kafka as a data pipe-line system.
- Creating Azure Databricks notebooks using SQL, Python and automated notebooks using jobs.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL.
- Experience building and architecting multiple data pipelines and end-to-end ETL/ELT processes for data ingestion and transformation in GCP, coordinating tasks among the team.
- Built a program with Python and Apache Beam and executed it in Cloud Dataflow to run data validation between raw source files and BigQuery tables.
- Pulled data into Power BI from various sources such as SQL Server, Excel, Oracle, and Azure.
- Proposed architectures considering cost/spend in Azure and developed recommendations to right-size the data infrastructure.
- Developed PySpark programs and created the data frames and worked on transformations.
- Involved in loading data from Linux file systems, servers, and web services using Kafka producers and partitions.
- Applied custom Kafka encoders for custom input formats to load data into Kafka partitions.
- Implemented a POC with Hadoop, extracting data into HDFS with Spark.
- Used Spark SQL with Scala for creating data frames and performed transformations on data frames.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Developed code to read the data stream from Kafka and send it to the respective bolts through their streams.
- Creating Spark clusters and configuring high concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
- Developed Spark applications using Scala for easy Hadoop transitions.
- Implemented applications with Scala along with the Akka and Play frameworks.
- Optimized the code using PySpark for better performance.
- Used the Cloud Shell SDK in GCP to configure the Dataproc, Cloud Storage, and BigQuery services.
- Worked on Spark streaming using Apache Kafka for real time data processing.
- Experienced in optimizing Hive queries, joins to handle different data sets.
- Involved in ETL, data integration, and migration by writing Pig scripts.
- Developed MapReduce programs to cleanse the data in HDFS obtained from heterogeneous data sources.
- Worked hands-on with NoSQL databases like MongoDB for POC purposes, storing images and URIs.
- Designed and implemented MongoDB and associated RESTful web service.
- Involved in writing test cases and implementing test classes using MRUnit and mocking frameworks.
- Developed Shell, Perl and Python scripts to automate and provide Control flow to Pig scripts.
- Used the Talend tool to create workflows for processing data from multiple source systems.
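The streaming work above used Spark Streaming with Kafka; the sketch below shows the equivalent ingest with the Structured Streaming API in PySpark, kept in Python for consistency with the other examples. The broker list, topic name, and HDFS paths are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka_stream_ingest").getOrCreate()

    # Subscribe to the Kafka topic as a streaming DataFrame (brokers/topic are placeholders).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
              .option("subscribe", "events")
              .option("startingOffsets", "latest")
              .load())

    # Kafka delivers keys and values as bytes; cast to strings for downstream parsing.
    parsed = events.selectExpr("CAST(key AS STRING) AS key",
                               "CAST(value AS STRING) AS value",
                               "timestamp")

    # Land the stream as Parquet; the checkpoint location keeps the sink fault-tolerant.
    query = (parsed.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/stream/events")
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .start())

    query.awaitTermination()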
Environment: Apache Hadoop 0.20.203, GCP, BigQuery, Kafka, Scala, Cloudera Manager (CDH4), HDFS, Eclipse, Hive, Pig, Sqoop, Oozie, SQL, Oracle 11g.
Confidential
Big Data Engineer
Responsibilities:
- Worked directly with the Big Data Architecture Team which created the foundation of this Enterprise Analytics initiative in a Hadoop-based Data Lake.
- Designed, set up, maintained, and administered Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.
- Developed and deployed the solution using Spark and Scala code on a Hadoop cluster running on GCP.
- Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
- Extensively used SSIS transformations such as Lookup, Derived Column, Data Conversion, Aggregate, Conditional Split, SQL Task, Script Task, and Send Mail Task.
- Built data pipelines in Airflow on GCP for ETL jobs using various Airflow operators.
- Migrated data using Azure Database Migration Service (DMS).
- Migrated SQL Server and Oracle databases to the Microsoft Azure cloud.
- Monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver across all environments.
- Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python (see the Dataflow sketch at the end of this list).
- Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, HBase, Zookeeper and Sqoop.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
- Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
- Upgraded the Hadoop cluster from CDH4.7 to CDH5.2 and worked on installing cluster, commissioning & decommissioning of Data Nodes, Name Node recovery, capacity planning, and slots configuration.
- Developed Spark scripts to import large files from Amazon S3 buckets and imported data from sources like HDFS/HBase into Spark RDDs.
- Built a configurable Scala/Spark framework to connect common data sources such as MySQL, SQL Server, and BigQuery and load the data into BigQuery.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala, and Python.
- Involved in migration of ETL processes from Oracle to Hive to test the easy data manipulation and worked on importing and exporting data from Oracle and DB2 into HDFS and HIVE using Sqoop.
- Worked on installing Cloudera Manager and CDH, installing the JCE policy files, creating a Kerberos principal for the Cloudera Manager server, and enabling Kerberos using the wizard.
- Developed Spark jobs using Scala and Python on top of Yarn/MRv2 for interactive and Batch Analysis.
- Created 25+ Linux Bash scripts for users, groups, data distribution, capacity planning, and system monitoring.
- Installed the OS and administered the Hadoop stack on the CDH5 (with YARN) Cloudera distribution, including configuration management, monitoring, debugging, and performance tuning.
- Developed and analyzed SQL scripts and designed the solution to be implemented using PySpark.
- Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data.
- Supported MapReduce programs and distributed applications running on the Hadoop cluster, and scripted Hadoop package installation and configuration to support fully automated deployments.
- Migrated an existing on-premises application to AWS, used AWS services like EC2 and S3 for processing and storing large datasets, and worked with Elastic MapReduce to set up the Hadoop environment on AWS EC2 instances.
- Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
- Worked with and learned a great deal from AWS cloud services like EC2, S3, EBS, RDS, and VPC.
- Worked in an Azure SQL Database environment.
- Monitored and maintained the Hadoop cluster, including adding and removing nodes, using tools like Nagios, Ganglia, and Cloudera Manager.
- Worked on Hive to expose data for further analysis and to transform files from different analytical formats into text files.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark).
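A minimal Apache Beam sketch of the Pub/Sub-to-BigQuery Dataflow pattern referenced in the list above; the project, subscription, and table names are hypothetical, and the runner/project flags would be supplied on the command line when submitting to Dataflow.

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def run():
        # streaming=True handles the unbounded Pub/Sub source.
        options = PipelineOptions(streaming=True)
        with beam.Pipeline(options=options) as p:
            (p
             | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                   subscription="projects/my-project/subscriptions/events-sub")  # hypothetical
             | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
             | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                   "my-project:analytics.events",                                # hypothetical table
                   write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                   create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

    if __name__ == "__main__":
        run()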
Environment: Hadoop, MapReduce, Python, Hive, Pig, Sqoop, Spark, Spark Streaming, Spark SQL, Scala, PySpark, MapR, Oozie, Flume, HBase, Nagios, Ganglia, Hue, Hortonworks, Cloudera Manager, Zookeeper, Cloudera, Oracle, Kerberos and Red Hat 6.5
Confidential
Software Developer
Responsibilities:
- Developed multiple MapReduce jobs in Java for data cleaning and preprocessing (an illustrative mapper sketch follows this list) and assisted with data capacity planning and node forecasting.
- Involved in the design and ongoing operation of several Hadoop clusters; configured and deployed the Hive metastore using MySQL and the Thrift server.
- Implemented and operated on-premises Hadoop clusters from the hardware to the application layer including compute and storage.
- Uploaded and processed more than 30 terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop and Flume.
- Designed custom deployment and configuration automation systems to allow for hands-off management of clusters via Cobbler, FUNC, and Puppet.
- Prepared complete documentation of the knowledge transferred about the Phase-II Talend job design and goals, and documented the support and maintenance procedures to be followed in Talend.
- Deployed the company's first Hadoop cluster running Cloudera's CDH2 to a 44-node cluster storing 160TB and connecting via 1 GB Ethernet.
- Debugged and solved major issues with Cloudera Manager by interacting with the Cloudera team.
- Modified reports and Talend ETL jobs based on the feedback from QA testers and Users in development and staging environments.
- Handled importing enterprise data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the data into HBase tables.
- Participated in Hadoop development Scrum and installed and configured Cognos 8.4/10 and Talend ETL on single- and multi-server environments.
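The data-cleaning jobs above were written in Java MapReduce; purely as an illustration, and to keep the examples in Python, the sketch below shows a comparable filtering pass as a Hadoop Streaming mapper with a hypothetical tab-delimited, five-field record layout.

    #!/usr/bin/env python
    # Illustrative Hadoop Streaming mapper for record cleaning; the actual jobs were
    # Java MapReduce, and the tab-delimited five-field layout here is hypothetical.
    import sys

    EXPECTED_FIELDS = 5

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        # Drop malformed or key-less records before downstream Hive/HBase loads.
        if len(fields) != EXPECTED_FIELDS or not fields[0]:
            continue
        # Trim whitespace and re-emit the cleaned record.
        print("\t".join(f.strip() for f in fields))

A mapper like this would be submitted with the hadoop-streaming JAR using the -input, -output, and -mapper options, with no reducer needed for a pure filtering pass.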
Environment: Apache Hadoop, Cloudera, Pig, Hive, Talend, Map-reduce, Sqoop, UNIX, Cassandra, LINUX, Oracle 11gR2, UNIX Shell Scripting, Kerberos