Sr. Cloud Data Engineer Resume
Dallas, TX
SUMMARY
- 8+ years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation. Good knowledge of extracting models and trends from raw data in collaboration with the data science team.
- Well versed with Big Data on AWS cloud services, i.e., EC2, S3, Glue, Athena, DynamoDB, and Redshift.
- Experience in job/workflow scheduling and monitoring tools like Oozie, AWS Data Pipeline, and Autosys.
- Hands-on experience in Azure Cloud Services (PaaS & IaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
- Data warehousing experience in Business Intelligence technologies and databases, with extensive knowledge of data analysis, T-SQL queries, ETL & ELT processes, Reporting Services (SSRS, Power BI), and Analysis Services using SQL Server (SSIS, SSRS, SSAS) and SQL Server Agent.
- Good experience designing cloud-based solutions in Azure by creating Azure SQL databases, setting up Elastic Pool jobs, and designing tabular models in Azure Analysis Services.
- Extensive experience creating pipeline jobs and schedule triggers using Azure Data Factory.
- Extensive knowledge in Data Mapping, Data Integration, Information Gathering, Data Cleansing, Data Manipulation, Performance Tuning, and data governance.
- Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX, and Spark Streaming for processing and transforming complex data using in-memory computing capabilities written in Scala. Worked with Spark to improve the efficiency of existing algorithms using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Experience integrating various data sources like Oracle SE2, SQL Server, flat files, and unstructured files into a data warehouse.
- Experience in Extraction, Transformation, and Loading (ETL) of data from various sources into data warehouses, as well as data processing such as collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, Power BI, and Microsoft SSIS.
- Strong experience in migrating other databases to Snowflake.
- Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
- Hands-on experience with Hadoop architecture and various components such as HDFS, Job Tracker, Task Tracker, NameNode, DataNode, and Hadoop MapReduce programming.
- Comprehensive experience in developing simple to complex MapReduce and streaming jobs using Scala and Java for data cleansing, filtering, and data aggregation, along with detailed knowledge of the MapReduce framework.
- Used IDEs like Eclipse, IntelliJ IDEA, PyCharm, Notepad++, and Visual Studio for development.
- Seasoned practice in Machine Learning algorithms and Predictive Modeling such as Linear Regression, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, KNN, Neural Networks, and K-means Clustering.
- Ample knowledge of data architecture including data ingestion pipeline design, Hadoop/Spark architecture, data modeling, data mining, machine learning and advanced data processing.
- Experience working with NoSQL databases like Cassandra and HBase and developed real-time read/write access to very large datasets via HBase.
- Developed Spark Applications that can handle data from various RDBMS (MySQL, Oracle Database) and Streaming sources.
- Proficient SQL experience in querying, data extraction/transformations and developing queries for a wide range of applications.
- Experience working with all major Hadoop distributions like Cloudera (CDH), Hortonworks (HDP), and AWS EMR.
- Experience in analyzing data using HiveQL, Pig, HBase and custom MapReduce programs in Java 8.
- Experience working with GitHub/Git 2.12 source and version control systems.
- Expert in writing pseudocode and SQL queries and in optimizing queries in Oracle, SQL Server, and Teradata. Good understanding of views, synonyms, indexes, partitioning, database joins, statistics, and optimization.
- Experience in designing star schema and snowflake schema for data warehouse and ODS architecture.
- Extract, transform, and load data from different formats like JSON and databases, and expose it for ad-hoc/interactive queries using Spark SQL (see the Spark SQL sketch after this list).
- Performed the migration of Hive and MapReduce jobs from on-premises MapR to the AWS cloud using EMR and Qubole.
- Developed highly scalable Spark applications using Spark Core, DataFrames, Spark SQL, and Spark Streaming APIs in Scala.
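A minimal PySpark sketch of the Spark SQL pattern referenced above (loading JSON and exposing it for ad-hoc queries); the bucket path and column names (region, amount) are hypothetical placeholders, not taken from any specific project.
```python
# Minimal sketch: load JSON data, register it as a temp view, and expose it
# for ad-hoc Spark SQL queries. Path and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adhoc-json-queries").getOrCreate()

# Read semi-structured JSON into a DataFrame (schema inferred)
orders = spark.read.json("s3a://example-bucket/raw/orders/*.json")

# Register the DataFrame so it can be queried with plain SQL
orders.createOrReplaceTempView("orders")

# Example ad-hoc aggregation over the registered view
spark.sql("""
    SELECT region, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM orders
    GROUP BY region
    ORDER BY total_amount DESC
""").show()
```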
TECHNICAL SKILLS
Big Data Technologies: Hadoop, MapReduce, Hive, Pig, HBase, Flume, Sqoop, Kafka, Oozie.
Programming Languages: Python, Java and Scala.
Spark components: RDD, Spark SQL (Data Frames and Dataset), and Spark Streaming.
Cloud Infrastructure: AWS CloudFormation, S3, Redshift, Athena, Glue.
Cloud Technologies: MS Azure, Amazon Web Services (AWS).
Databases: Oracle, Teradata, MySQL, SQL Server.
Scripting and Query Languages: Shell scripting, SQL.
Version Control: Git, Bitbucket.
Build Tools: Maven, Gradle, Ant.
Reporting: Tableau, Power BI.
PROFESSIONAL EXPERIENCE
Sr. Cloud Data Engineer
Confidential, Dallas, TX
Responsibilities:
- Involved in designing and deploying multi-tier applications using AWS services such as EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, and IAM, focusing on high availability, fault tolerance, and auto-scaling in AWS CloudFormation.
- Supported continuous storage in AWS using Elastic Block Storage, S3, and Glacier. Created volumes and configured snapshots for EC2 instances.
- Was responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark (see the Boto3/Glue sketch after this list).
- Coordinated with the team to develop a framework for generating daily ad-hoc reports and extracts from enterprise data, automated using Oozie.
- Created monitors, alarms, notifications, and logs for Lambda functions, Glue jobs, and EC2 hosts using CloudWatch.
- Used AWS Glue for data transformation, validation, and cleansing.
- Used Python Boto3 to configure AWS services such as Glue, EC2, and S3.
- Worked on ETL migration services by developing and deploying AWS Lambda functions to generate a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena.
- Programmed in Hive, Spark SQL, Java, C#, and Python to streamline incoming data, build data pipelines that surface useful insights, and orchestrate pipelines.
- Designed and developed applications on the data lake to transform data according to business users' needs for analytics.
- Working on Snowflake modeling; highly proficient in data warehousing techniques for data cleansing, Slowly Changing Dimensions, surrogate key assignment, and change data capture.
- Consulting on Snowflake data platform solution architecture, design, development, and deployment, focused on bringing a data-driven culture across the enterprise.
- Developed Talend MDM jobs to populate claims data into the data warehouse (star schema, snowflake schema, hybrid schema).
- Well versed with Snowflake features like clustering, time travel, cloning, logical data warehouse, caching etc.
- Developing Spark scripts and UDFs using both the Spark DSL and Spark SQL for data aggregation and querying, and writing data back into the RDBMS through Sqoop (see the PySpark UDF sketch after this list).
- Worked closely with the business to translate business requirements into technical requirements as part of design reviews and daily project scrums, and wrote custom MapReduce programs with custom input formats.
- Created data pipelines for different events to load data from DynamoDB to an AWS S3 bucket and then into an HDFS location.
- Developed Hive queries to pre-process the data required for running the business process.
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios
- Implemented a generalized solution model using AWS SageMaker.
- Solid enterprise working knowledge of Scala fundamentals and programming best practices.
- Development and maintenance of Scala applications executed on the Cloudera platform.
- Expert in implementing advanced procedures like text analytics and processing using in-memory computing capabilities such as Apache Spark, written in Scala.
- Developed customized UDFs and UDAFs in Scala to extend Pig and Hive core functionality.
- Experience in using Avro, Parquet and JSON file formats, developed UDFs in Hive
- Used Talend for Big data Integration using Spark and Hadoop.
- Aggregated daily sales team updates to send report to executives and to organize jobs running on Spark clusters
- Used PolyBase for ETL/ELT processes with Azure Data Warehouse to keep data in Blob Storage with almost no limitation on data volume.
- Worked with Bitbucket and Jira to deploy projects into production environments.
- Used deep learning frameworks like MXNet, Caffe2, TensorFlow, Theano, CNTK, and Keras to help clients build deep learning models.
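A hedged sketch of the Lambda-plus-Glue pattern mentioned above (on-demand tables over S3 files, configured through Boto3); the crawler and job names are hypothetical, and the handler assumes an S3 object-created trigger.
```python
# Minimal sketch, assuming an S3-triggered Lambda that refreshes the Glue
# Data Catalog whenever new files land. Crawler/job names are hypothetical.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Each record corresponds to an object created in the raw-data bucket
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object arrived: s3://{bucket}/{key}")

    # Re-run the crawler so Glue/Athena tables reflect the new S3 files
    glue.start_crawler(Name="raw-data-crawler")

    # Optionally kick off a PySpark Glue job to transform the new data
    glue.start_job_run(JobName="raw-to-curated-etl")

    return {"status": "triggered"}
```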
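A small PySpark sketch of the UDF-plus-Spark-SQL aggregation style noted above; the table, column names, and normalization rule are illustrative assumptions rather than project specifics.
```python
# Minimal sketch: register a Python UDF and use it from both the DataFrame
# DSL and Spark SQL for aggregation. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-aggregation").getOrCreate()

# Illustrative UDF that normalizes free-form region codes
@F.udf(returnType=StringType())
def normalize_region(code):
    return (code or "UNKNOWN").strip().upper()

sales = spark.read.parquet("hdfs:///data/sales")  # hypothetical path

# DataFrame DSL aggregation using the UDF
by_region = (sales
             .withColumn("region", normalize_region(F.col("region_code")))
             .groupBy("region")
             .agg(F.sum("amount").alias("total_amount")))
by_region.show()

# The same UDF made available to Spark SQL
spark.udf.register("normalize_region", normalize_region)
sales.createOrReplaceTempView("sales")
spark.sql("""
    SELECT normalize_region(region_code) AS region, SUM(amount) AS total_amount
    FROM sales GROUP BY normalize_region(region_code)
""").show()
```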
Cloud Data Engineer
Confidential, Columbus, OH
Responsibilities:
- Worked on migration of data from on-prem SQL Server to cloud databases (Azure Synapse Analytics (DW) & Azure SQL DB).
- Designed and configured Azure cloud relational servers and databases, analyzing current and future business requirements.
- Implemented Apache Airflow for authoring, scheduling, and monitoring Data Pipelines
- Designed several DAGs (Directed Acyclic Graphs) for automating ETL pipelines (see the Airflow sketch after this list).
- Performed data extraction, transformation, loading, and integration in data warehouse, operational data stores and master data management
- Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Used Delta Lake for time travel, as data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments (see the Delta Lake sketch after this list).
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Used Databricks to integrate easily with the whole Microsoft stack.
- Responsible for data services and data movement infrastructures
- Experienced in ETL concepts, building ETL solutions and Data modeling
- Worked on architecting the ETL transformation layers and writing Spark jobs to do the processing.
- Loaded application analytics data into the data warehouse at regular time intervals.
- Worked on Confluence and Jira.
- Designed and implemented configurable data delivery pipeline for scheduled updates to customer facing data stores built with Python
- Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling
- Compiled data from various sources to perform complex analysis for actionable results
- Measured Efficiency of Hadoop/Hive environment ensuring SLA is met
- Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes.
- Built performant, scalable ETL processes to load, cleanse and validate data
- Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and other Agile methodologies.
- Collaborate with team members and stakeholders in design and development of data environment
- Preparing associated documentation for specifications, requirements, and testing
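A minimal Airflow sketch of the kind of DAG referenced above for automating an ETL pipeline; the DAG id, schedule, and task callables are hypothetical stand-ins, not an actual production pipeline.
```python
# Minimal sketch, assuming Airflow 2.x: a daily extract -> transform -> load DAG.
# Task logic is stubbed; names and schedule are illustrative.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")

def transform():
    print("clean and reshape the extracted data")

def load():
    print("write curated data to the warehouse")

with DAG(
    dag_id="example_etl_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load
```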
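A short Databricks/PySpark sketch of the Delta Lake time-travel idea mentioned above; the table path, version number, and timestamp are assumptions for illustration.
```python
# Minimal sketch, assuming Delta Lake is available (as in Databricks).
# Path, version, and timestamp are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-time-travel").getOrCreate()

path = "/mnt/datalake/curated/customers"  # hypothetical Delta table location

# Current state of the table
current_df = spark.read.format("delta").load(path)

# Time travel: read the table as of an earlier version, or as of a timestamp
v3_df = spark.read.format("delta").option("versionAsOf", 3).load(path)
old_df = (spark.read.format("delta")
          .option("timestampAsOf", "2023-01-01")
          .load(path))

# Data versioning supports rollbacks, audit trails, and reproducible ML runs
print(current_df.count(), v3_df.count(), old_df.count())
```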
Data Engineer
Confidential, New York, NY
Responsibilities:
- Worked and learned a great deal from AWS Cloud services like EC2 and S3.
- Implemented a Continuous Delivery pipeline with Docker and GitHub.
- Worked with Google Cloud Functions in Python to load data into BigQuery for CSV files on arrival in a GCS bucket (see the Cloud Function sketch after this list).
- Devised simple and complex SQL scripts to check and validate Dataflow in various applications.
- Performed Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export through Python.
- Developed Spark code using Scala and Spark-SQL for faster processing and testing
- Developed Spark applications using Scala for easy Hadoop transitions.
- Performed data engineering functions: data extract, transformation, loading, and integration in support of enterprise data infrastructures - data warehouse, operational data stores and master data management
- Responsible for data services and data movement infrastructures; good experience with ETL concepts, building ETL solutions, and data modeling.
- Architected several DAGs (Directed Acyclic Graph) for automating ETL pipelines
- Hands-on experience architecting the ETL transformation layers and writing Spark jobs to do the processing.
- Gathered and processed raw data at scale (including writing scripts, web scraping, calling APIs, writing SQL queries, and writing applications).
- Devised PL/SQL Stored Procedures, Functions, Triggers, Views and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
- Developed logistic regression models (Python) to predict subscription response rates based on customer variables like past transactions, responses to prior mailings, promotions, demographics, interests, and hobbies (see the logistic regression sketch after this list).
- Developed near real-time data pipelines using Spark.
- Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
- Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling
- Worked on Confluence and Jira; skilled in data visualization libraries like Matplotlib and Seaborn.
- Experience implementing machine learning back-end pipelines with Pandas and NumPy.
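A hedged sketch of the Cloud Function pattern referenced above (load a CSV into BigQuery when it arrives in a GCS bucket); the dataset and table names are hypothetical.
```python
# Minimal sketch, assuming a GCS-triggered Cloud Function (Python runtime)
# and the google-cloud-bigquery client. Dataset/table names are hypothetical.
from google.cloud import bigquery

def load_csv_to_bq(event, context):
    """Triggered when an object is finalized in the GCS bucket."""
    if not event["name"].endswith(".csv"):
        return  # ignore non-CSV objects

    uri = f"gs://{event['bucket']}/{event['name']}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,              # infer schema from the file
        write_disposition="WRITE_APPEND",
    )

    load_job = client.load_table_from_uri(
        uri, "analytics_dataset.raw_events", job_config=job_config
    )
    load_job.result()  # wait for the load to complete
    print(f"Loaded {uri} into analytics_dataset.raw_events")
```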
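A small scikit-learn sketch of the logistic regression modeling described above; the CSV path, feature names, and label are illustrative assumptions.
```python
# Minimal sketch, assuming pandas/scikit-learn and a hypothetical customer
# extract with a binary 'responded' label and a few illustrative features.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

customers = pd.read_csv("customers.csv")  # hypothetical extract

features = ["past_transactions", "prior_mailing_responses", "promotions", "age"]
X = customers[features]
y = customers["responded"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predicted probability of responding to the subscription offer
probs = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, probs))
```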
Hadoop Developer
Confidential
Responsibilities:
- Participated in all the phases of the Software Development Life Cycle (SDLC), which includes development, testing, implementation, and maintenance.
- Installed and configured Hadoop MapReduce, HDFS, Developed multiple MapReduce jobs for data cleaning and preprocessing
- Involved in importing data from MySQL to HDFS using Sqoop.
- Worked on different file formats like sequence files, XML files, and map files using MapReduce programs.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
- Experience in managing and reviewing Hadoop log files.
- Used different PySpark APIs to perform the necessary transformations and actions on data that arrived in batches from different sources (see the Kafka/PySpark sketch after this list).
- Performed various parsing techniques using Spark APIs to cleanse the data from Kafka.
- Experienced in working with Spark SQL on different file formats like Avro and Parquet.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Implemented test scripts to support test driven development and continuous integration.
- Created Pig Latin scripts to sort, group, join, and filter the enterprise-wide data.
- Worked on tuning the performance of Pig queries.
- Experience in writing custom UDFs for Hive and Pig to extend their functionality.
- Installed Oozie workflow engine to run multiple MapReduce jobs.
- Provided cluster coordination services through ZooKeeper.
- Worked on data profiling & various data quality rules development using Informatica Data Quality.
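A hedged PySpark sketch of the batch cleansing pattern mentioned above (parsing records pulled from Kafka); the broker address, topic name, message schema, and output path are hypothetical.
```python
# Minimal sketch, assuming the spark-sql-kafka connector is on the classpath.
# Broker, topic, schema, and sink path are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-batch-cleanse").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

# Batch (non-streaming) read of a Kafka topic
raw = (spark.read.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .option("startingOffsets", "earliest")
       .load())

# Parse the JSON payload and drop malformed or empty records
events = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*")
          .dropna(subset=["event_id"]))

events.write.mode("overwrite").parquet("hdfs:///data/clean/events")
```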
Data Analyst
Confidential
Responsibilities:
- Involved in designing physical and logical data model using ERwin Data modeling tool.
- Designed the relational data model for the operational data store and staging areas; designed dimension and fact tables for data marts.
- Extensively used ERwin data modeler to design Logical/Physical Data Models, relational database design.
- Created stored procedures, database triggers, functions, and packages to manipulate the database and apply the business logic according to the user's specifications.
- Created Triggers, Views, Synonyms and Roles to maintain integrity plan and database security.
- Created database links to connect to other servers and access the required information.
- Integrity constraints, database triggers and indexes were planned and created to maintain data integrity and to facilitate better performance.
- Used Advanced Queuing for exchanging messages and communicating between different modules.
- Performed system analysis and design for enhancements; tested forms, reports, and user interaction.