Azure Big Data Engineer Resume
Texas
SUMMARY
- Experienced Data Engineer with 7+ years of progressive experience in the analysis, design, development, and implementation of end-to-end production data pipelines.
- Expert in providing ETL solutions for a wide range of business models.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience in writing Spark transformations and actions using Spark SQL in Scala.
- Handled the Confidential data cube using the Spark framework, writing Spark SQL queries in Scala to improve data-processing efficiency and reporting query response time.
- Good experience in writing Spark applications using Scala.
- Developed Spark code in Scala on Databricks.
- Implemented large Lambda architectures using Azure Data Platform capabilities such as Azure Data Lake, Azure Data Factory, HDInsight, Azure SQL Server, Azure ML, and Power BI.
- Experience in the development and design of scalable systems using Hadoop technologies in various environments, mostly using Python. Extensive experience analyzing data using Hadoop ecosystem components including HDFS, MapReduce, Hive, and Pig.
- In-depth knowledge of Apache Hadoop architecture (1.x and 2.x) and Apache Spark 1.x architecture.
- Experience building data pipelines using Azure Databricks and PySpark.
- Developed SQL/HSQL queries inside Azure Databricks.
- Solid understanding of Hadoop security requirements.
- Developed Spark applications using PySpark and Scala in the Azure Databricks environment.
- Used Azure Pipelines for CI/CD and production DevOps.
- Strong skills in visualization tools: Power BI and Confidential Excel (formulas, pivot tables, charts, and DAX commands).
- Proficient in SQL, PL/SQL, and Python coding.
- Experience developing on-premises and real-time processes.
- Developed DAGs using Airflow to manage real-time data pipelines (see the Airflow sketch after this summary).
- Excellent understanding of enterprise data warehouse best practices; involved in full life-cycle development of data warehousing solutions.
- Expertise in DBMS concepts.
- Involved in building data models and dimensional modeling with 3NF, star, and snowflake schemas for OLAP and operational data store (ODS) applications.
- Highly experienced in importing and exporting data between HDFS and relational database management systems using Sqoop.
- Skilled in designing and implementing ETL architecture for cost-effective and efficient environments.
- Optimized and tuned ETL processes and SQL queries for better performance.
- Performed complex data analysis and provided critical reports to support various departments.
- Worked with business intelligence tools like Business Objects and data visualization tools like Tableau.
- Extensive shell/Python scripting experience for scheduling and process automation.
- Good exposure to Development, Testing, Implementation, Documentation and Production support.
- Develop effective working relationships with client teams to understand and support requirements, develop tactical and strategic plans to implement technology solutions, and effectively manage client expectations.
- An excellent team member with the ability to work independently, strong interpersonal and communication skills, a solid work ethic, and a high level of motivation.
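A minimal Airflow sketch illustrating the kind of DAG referenced above; the DAG id, schedule, and task commands are illustrative placeholders rather than an actual production pipeline.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Default task settings; retry values are illustrative.
default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

# Hypothetical ingest pipeline: extract -> transform (Spark job) -> load.
with DAG(
    dag_id="ingest_pipeline",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'pull files from source'")
    transform = BashOperator(task_id="transform", bash_command="echo 'spark-submit transform job'")
    load = BashOperator(task_id="load", bash_command="echo 'load curated data into the warehouse'")

    extract >> transform >> load
```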
TECHNICAL SKILLS
Languages: Scala, SQL, UNIX shell script, JDBC, Python, Spark, PySpark
Hadoop Ecosystem: HDFS, MapReduce, YARN, Hive, Pig, HBase, Kafka, Zookeeper, Sqoop, Oozie, DataStax & Apache Cassandra, Drill, Flume, Spark, NiFi
Cloud: Azure Databricks, Azure Blob storage, Azure Virtual machine
Web Technologies: JDBC, HTML5, DHTML, XML, CSS3, Web Services, WSDL.
Databases: Snowflake (cloud), Teradata, IBM DB2, Oracle, SQL Server, MySQL, NoSQL, Cassandra.
Data Warehousing: Informatica Data Quality/Big Data (IDQ), Pentaho, ETL Development, Amazon Redshift.
Version Control Tools: SVN, GitHub, Bitbucket.
BI Tools: Power BI, Tableau
Operating Systems: Windows, Linux, Unix, macOS.
PROFESSIONAL EXPERIENCE
Confidential - Texas
Azure Big Data Engineer
Responsibilities:
- Performed data profiling to learn about behavior across various features such as traffic pattern, location, date, and time.
- Developed data lakes using Azure Data Lake Storage and Blob Storage.
- Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
- Worked extensively on Azure Data Factory to create batch pipelines.
- Developed Spark applications using Scala and Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the PySpark sketch at the end of this section).
- Worked on designing and developing the real-time tax computation engine using Oracle, StreamSets, NiFi, Spark Structured Streaming, and MemSQL.
- Developed Spark code in Scala in the IntelliJ IDE using SBT.
- Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from sources such as Azure SQL Database, Blob Storage, and Azure SQL Data Warehouse, and to write data back.
- Implemented PySpark, utilizing DataFrames and the Spark SQL API for faster data processing.
- Worked on PySpark data sources, PySpark DataFrames, Spark SQL, and streaming using Scala.
- Worked extensively with Azure components such as Databricks, Virtual Machines, and Blob Storage.
- Experience in developing PySpark applications using Scala SBT.
- Performed a POC comparing Change Data Capture (CDC) times for Oracle data across Striim, StreamSets, and Dbvisit.
- Expertise in using different file formats such as text files, CSV, Parquet, and JSON.
- Experience writing custom compute functions using Spark SQL and performing interactive querying.
- Responsible for masking and encrypting sensitive data on the fly.
- Responsible for creating multiple applications for reading data from different Oracle instances into NiFi topics.
- Responsible for creating and maintaining DAGs using Apache Airflow.
- Responsible for setting up a MemSQL cluster on an Azure Virtual Machine instance.
- Experience in real-time data streaming using PySpark with NiFi.
- Responsible for creating a NiFi cluster using multiple brokers.
- Experience working with Vagrant boxes to set up local NiFi and StreamSets pipelines.
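A minimal PySpark sketch of the extract-transform-aggregate pattern described in the Databricks bullets above; the storage paths, column names, and aggregation are hypothetical placeholders, not the project's actual code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession already exists; getOrCreate() simply reuses it.
spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

# Hypothetical ADLS Gen2 paths; real container and storage-account names differ.
raw_path = "abfss://raw@<storage-account>.dfs.core.windows.net/usage/"
curated_path = "abfss://curated@<storage-account>.dfs.core.windows.net/usage_daily/"

# Read semi-structured JSON events (CSV and Parquet readers follow the same pattern).
events = spark.read.json(raw_path)

# Aggregate usage per customer per day with the DataFrame / Spark SQL API.
daily_usage = (
    events
    .withColumn("usage_date", F.to_date("event_timestamp"))  # hypothetical column
    .groupBy("customer_id", "usage_date")                    # hypothetical columns
    .agg(
        F.count("*").alias("events"),
        F.sum("bytes_used").alias("total_bytes"),
    )
)

# Persist the curated result as partitioned Parquet for downstream reporting.
daily_usage.write.mode("overwrite").partitionBy("usage_date").parquet(curated_path)
```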
Environment: Azure Databricks, Azure Data Lakes, Azure Data Factory, Spark 2.2, Scala, Linux, Apache NiFi 1.0, Striim, StreamSets, PySpark, Spark SQL, Spark Structured Streaming, IntelliJ, SBT, Git.
Confidential, Brooklyn, NY
Data Engineer
Responsibilities:
- Prepared the ETL design document covering the database structure, change data capture, error handling, and restart and refresh strategies.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
- Worked with different data feeds such as JSON, CSV, XML, and DAT, and implemented the data lake concept.
- Developed Informatica design mappings using various transformations.
- Used Azure Databricks for Hadoop and big data transformations.
- Used Azure Blob Storage for raw file storage and Azure Virtual Machines for streaming data.
- Used Azure Functions to perform data validation, filtering, sorting, and other transformations for every data change in a DynamoDB table and to load the transformed data into another data store.
- Programmed ETL functions between Oracle and Amazon Redshift.
- Maintained end-to-end ownership of data analysis, framework development, implementation, and communication for a range of customer analytics projects.
- Good exposure to the IRI end-to-end analytics service engine and the new big data platform (Hadoop loader framework, big data Spark framework, etc.).
- Used a Kafka producer to ingest raw data into Kafka topics and ran the Spark Streaming app to process clickstream events (see the streaming sketch at the end of this section).
- Performed data analysis and predictive data modeling.
- Explored clickstream event data with Spark SQL.
- Architected and delivered a hands-on production implementation of the big data MapR Hadoop solution for digital media marketing, using telecom data, shipment data, point-of-sale (POS) data, and exposure and advertising data related to consumer product goods.
- Used Spark SQL, as part of the Apache Spark big data framework, to process structured shipment, POS, consumer, household, individual digital impression, and household TV impression data.
- Created DataFrames from different data sources such as existing RDDs, structured data files, JSON datasets, Hive tables, and external databases.
- Loaded terabytes of raw data at different levels into Spark RDDs for computation to generate the output response.
- Led a major new initiative focused on media analytics and forecasting to deliver the sales lift associated with customer marketing campaign initiatives.
- Responsibilities included platform specification, redesign of load processes, and projections of future platform growth.
- Coordinated deployments to the QA and PROD environments.
- Used Python to automate Hive jobs and to read configuration files.
- Used Spark for fast data processing, working with both the Spark shell and a Spark standalone cluster.
- Used Hive to analyze partitioned data and compute various metrics for reporting.
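A minimal sketch of the Kafka-to-Spark clickstream processing referenced above, written with Structured Streaming; the broker address, topic name, and event schema are assumptions, and the job requires the spark-sql-kafka connector on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("clickstream-processing").getOrCreate()

# Hypothetical clickstream event schema.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("event_time", TimestampType()),
])

# Read raw events from a Kafka topic (broker and topic names are placeholders).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Parse the JSON payload and count page views in 5-minute windows.
events = raw.select(from_json(col("value").cast("string"), event_schema).alias("e")).select("e.*")
page_views = events.groupBy(window(col("event_time"), "5 minutes"), col("page")).count()

# Console sink for illustration; a real job would write to a table, topic, or store.
query = page_views.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```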
Environment: MapReduce, HDFS, Hive, Python, Scala, Kafka, Spark, Spark SQL, Oracle, Informatica 9.6, SQL, MapR, Sqoop, Zookeeper, Azure Blob Storage, Azure Databricks, Azure Virtual Machine, Data Pipeline, Jenkins, Git, JIRA, Unix/Linux, Agile Methodology, Scrum.
Confidential
Big Data Analyst
Responsibilities:
- Understood business needs, analyzed functional specifications, and mapped them to the design and development of MapReduce programs and algorithms.
- Designed and implemented a MapReduce-based large-scale parallel relation-learning system.
- Customized Flume interceptors to encrypt and mask sensitive customer data as per requirements.
- Built recommendations using item-based collaborative filtering in Apache Spark.
- Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Built a web portal using JavaScript that makes a REST API call to Elasticsearch and retrieves the row key.
- Used Kibana, an open-source browser-based analytics and search dashboard for Elasticsearch.
- Imported data from various sources into the Cassandra cluster using Java APIs or Sqoop.
- Developed iterative algorithms using Spark Streaming in Scala for near real-time dashboards.
- Installed and configured Hadoop and the Hadoop stack on a 40-node cluster.
- Involved in customizing the MapReduce partitioner to route key-value pairs from mappers to reducers in XML format according to requirements.
- Configured Flume for efficiently collecting, aggregating, and moving large amounts of log data.
- Involved in creating Hive tables, loading data into them, and writing Hive queries to analyze the data.
- Implemented AWS services to provide a variety of computing and networking capabilities to meet application needs.
- Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs.
- Designed and built the reporting application, which uses Spark SQL to fetch and generate reports on HBase table data.
- Worked on batch processing of data sources using Apache Spark and Elasticsearch.
- Extracted the needed data from the server into HDFS and bulk loaded the cleaned data into HBase.
- Used different file formats such as text files, SequenceFiles, Avro, Record Columnar (RC), and ORC.
- Strong experience implementing data warehouse solutions in Amazon Web Services (AWS) Redshift; worked on various projects to migrate data from on-premises databases to AWS Redshift, RDS, and S3.
- Involved in ETL, data integration, and migration.
- Responsible for creating Hive UDFs that helped spot market trends.
- Optimized Hadoop MapReduce code and Hive/Pig scripts for better scalability, reliability, and performance.
- Experience storing the analyzed results back into the Cassandra cluster.
- Developed custom aggregate functions using Spark SQL and performed interactive querying (see the sketch below).
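A minimal sketch of a custom aggregate function of the kind mentioned in the last bullet, written as a PySpark grouped-aggregate pandas UDF (requires PyArrow); the data and column names are illustrative assumptions.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("custom-aggregate").getOrCreate()

# Grouped-aggregate pandas UDF: receives every value in a group as a pandas Series
# and returns one scalar (here, the price spread within the group).
@pandas_udf("double")
def price_spread(prices: pd.Series) -> float:
    return float(prices.max() - prices.min())

# Illustrative rows standing in for the market data analyzed on the project.
quotes = spark.createDataFrame(
    [("AAA", 10.0), ("AAA", 12.5), ("BBB", 30.0), ("BBB", 27.0)],
    ["symbol", "price"],
)

# Use the custom aggregate through the DataFrame API for interactive querying.
quotes.groupBy("symbol").agg(price_spread("price").alias("spread")).show()
```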
Environment: HDFS, MapReduce, Cloudera, HBase, Hive, Pig, Elasticsearch, Kibana, Sqoop, Spark, Cassandra, Scala, Flume, Oozie, Zookeeper, Maven, Linux, UNIX Shell Scripting.