Sr. Data Engineer Resume
Phoenix, AZ
SUMMARY
- Over 8 years of experience with the Big Data ecosystem across financial, healthcare, and industrial business-processing domains.
- Experience in Data Warehouse Development, Enhancement, Migration, Maintenance and Production support projects.
- Experience troubleshooting jobs and addressing production issues such as data issues, environment issues, performance tuning, and enhancements.
- Expertise in writing Spark RDD transformations, actions, DataFrames, and case classes for the required input data, and in performing data transformations using Spark Core (see the PySpark sketch after this summary).
- Experience using Hive Query Language (HQL) to extract data from Hadoop.
- Experience using Sqoop to import data into HDFS from RDBMS.
- Experienced in using Spark to improve the performance and optimization of existing Hadoop algorithms, working with SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Experience working with NoSQL databases like HBase, MongoDB.
- Experience in executing batch jobs over data streams using Spark Streaming.
- Experience developing Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation across multiple file formats.
- Worked extensively on data modeling concepts and their implementation in dimensional modeling, using ETL processes for data warehouses.
- Familiar with highly scalable parallel processing techniques, running parallel jobs across multi-node configurations.
- Reliable data engineer keen to help companies collect, collate, and exploit digital assets. Skilled in administering Azure services, including Azure Databricks, Azure relational and non-relational databases, Azure Data Factory, and related cloud services.
- Practiced at cleansing and organizing data into new, more functional formats to drive increased efficiency and enhanced returns on investment.
- Dynamic Database Engineer devoted to maintaining reliable computer systems for uninterrupted workflows.
- Delivers up-to-date methods to increase database stability and lower the likelihood of security breaches and data corruption.
- Offers detailed training and reference materials to teach best practices for system navigation and minor troubleshooting.
- Background includes data mining, warehousing, and analytics. Proficient in machine and deep learning. Quality-driven and hardworking, with excellent communication and project management skills.
- Experience working in an onsite-offshore model, effectively coordinating tasks between onsite and offshore teams.
- Experience in UNIX scripting as per business requirements.
- Exposure to scheduling tools such as Control-M and Autosys.
- Experience working in Waterfall and Agile Scrum project methodologies.
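The following is a minimal PySpark sketch of the kind of RDD transformations, actions, and DataFrame conversions referenced above; the file path and column names are hypothetical placeholders rather than details from an actual project.

```python
# Minimal PySpark sketch of RDD transformations/actions and DataFrame conversion.
# The HDFS path and column names are illustrative placeholders.
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("summary-example").getOrCreate()
sc = spark.sparkContext

# RDD transformations (map, filter) followed by an action (count).
lines = sc.textFile("hdfs:///data/transactions.csv")          # hypothetical path
records = lines.map(lambda l: l.split(",")) \
               .filter(lambda cols: len(cols) == 3)
print("valid rows:", records.count())

# Convert the RDD to a DataFrame and aggregate with the DataFrame API.
df = records.map(lambda c: Row(account=c[0], amount=float(c[2]))).toDF()
df.groupBy("account").sum("amount").show()
```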
TECHNICAL SKILLS
Programming Languages: Java, Scala, Python, HTML5, CSS3, C, C++, R, Linux shell.
Databases: DB2, Oracle, SQL Server, Teradata, Netezza, YellowBrick, Cassandra, HBase, MongoDB.
Big Data Ecosystem: Hadoop, MapReduce, HDFS, Hive, Sqoop, Spark, Spark Streaming, Kafka, HBase, YARN, Oozie, ZooKeeper, Hue.
Cloud (AWS): EC2, IAM, S3, Auto Scaling, CloudWatch, Route 53, EMR, Redshift.
Job Schedulers: Control-M, Autosys, Zookeeper
Version control: TortoiseSVN 1.8.7, AccuRev, GitHub.
PROFESSIONAL EXPERIENCE
Confidential, Phoenix, AZ
Sr. Data Engineer
Responsibilities:
- Migrated an entire Oracle database to BigQuery and used Power BI for reporting.
- Built data pipelines in Airflow on GCP for ETL jobs using various Airflow operators.
- Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
- Experience in moving data between GCP and Azure using Azure Data Factory.
- Experience building Power BI reports on Azure Analysis Services for better performance.
- Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
- Coordinated with the team and developed a framework to generate daily ad hoc reports and extracts of enterprise data from BigQuery.
- Designed and coordinated with the data science team in implementing advanced analytical models in the Hadoop cluster over large datasets.
- Wrote Hive SQL scripts to create complex tables with performance features such as partitioning, clustering, and skew handling.
- Downloaded BigQuery data into pandas and Spark DataFrames for advanced ETL capabilities (see the sketch after this section).
- Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing analysis of BigQuery usage.
- Created a POC utilizing ML models and Cloud ML for table quality analysis in the batch process.
- Created an ETL pipeline using Spark and Hive to ingest data from multiple sources.
- Carried out data transformation and cleansing using SQL queries, Python, and PySpark.
- Expert knowledge of Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get the job done.
- Implemented and managed ETL solutions and automated operational processes.
- Responsible for ETL and data validation using SQL Server Integration Services (SSIS).
- Built dashboards in Tableau with ODBC connections to different sources such as BigQuery and the Presto SQL engine.
- Developed stored procedures in MS SQL to fetch data from different servers over FTP and processed the files to update the tables.
- Developed a Tableau report that tracks the dashboards published to Tableau Server, which helps identify potential future clients in the organization.
- Involved in creating Oozie workflows and coordinator jobs that kick off jobs based on time and data availability.
- Knowledge of Cloud Dataflow and Apache Beam.
- Good knowledge of using Cloud Shell for various tasks and for deploying services.
- Created BigQuery authorized views for row-level security and for exposing data to other teams.
- Expertise in designing and deploying Hadoop clusters and various big data analytics tools, including Pig, Hive, Sqoop, and Apache Spark, with the Cloudera distribution.
Environment: GCP, BigQuery, Dataproc, Oozie, PySpark, Hive SQL, Sqoop.
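A minimal sketch of the BigQuery-to-DataFrame work noted above, assuming the google-cloud-bigquery client library and the spark-bigquery connector are available on the cluster; the project, dataset, and table names are hypothetical.

```python
# Minimal sketch: pull BigQuery data into pandas and into a Spark DataFrame.
# Project, dataset, and table names are hypothetical placeholders.
from google.cloud import bigquery
from pyspark.sql import SparkSession

# BigQuery -> pandas via the client library.
bq = bigquery.Client(project="my-project")                     # hypothetical project
sql = """
SELECT account_id, SUM(amount) AS total
FROM `my-project.sales.orders`
GROUP BY account_id
"""
pdf = bq.query(sql).to_dataframe()

# BigQuery -> Spark via the spark-bigquery connector (assumed to be on the classpath).
spark = SparkSession.builder.appName("bq-extract").getOrCreate()
sdf = (spark.read.format("bigquery")
       .option("table", "my-project.sales.orders")             # hypothetical table
       .load())
sdf.createOrReplaceTempView("orders")
spark.sql("SELECT account_id, COUNT(*) AS cnt FROM orders GROUP BY account_id").show()
```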
Confidential, Atlanta, GA
Sr. Data Engineer
Responsibilities:
- Analyzed, designed, and built modern data solutions using Azure PaaS services to support data visualization. Understood the current production state of the application and determined the impact of new implementations on existing business processes.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Generated detailed studies on potential third-party data handling solutions, verifying compliance with internal needs and stakeholder requirements.
- Collaborated on ETL (Extract, Transform, and Load) tasks, maintaining data integrity and verifying pipeline stability.
- Performed large-scale data conversions for integration into HDInsight.
- Designed and implemented effective database solutions (Azure blob storage) to store and retrieve data.
- Designed advanced analytics ranging from descriptive to predictive models to machine learning techniques.
- Monitored incoming data analytics requests and distributed results to support IoT Hub and Stream Analytics.
- Prepared documentation and analytic reports, delivering summarized results, analysis and conclusions to stakeholders.
- Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back in the reverse direction.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and appropriate memory configuration.
- Wrote UDFs in Scala and PySpark to meet specific business requirements (see the sketch after this section).
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL activity.
- Hands-on experience developing SQL scripts for automation purposes.
Environment: Azure Data Lake, Azure Storage, Azure DW, Azure SQL, Control-M, HiveQL, PySpark, Scala, Sqoop.
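A minimal sketch of the kind of PySpark UDF mentioned above; the masking rule and column names are hypothetical examples rather than the actual business requirement.

```python
# Minimal sketch of a PySpark UDF applying a business-specific rule.
# The masking rule and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

@udf(returnType=StringType())
def mask_account(account_id):
    # Keep the last 4 characters, mask the rest.
    if account_id is None:
        return None
    return "*" * max(len(account_id) - 4, 0) + account_id[-4:]

df = spark.createDataFrame([("1234567890", 120.5), ("987654", 33.0)],
                           ["account_id", "amount"])
df.withColumn("masked_id", mask_account(col("account_id"))).show()
```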
Confidential
Sr. Data Engineer
Responsibilities:
- Designed and set up an enterprise data lake to support various use cases, including analytics, processing, storage, and reporting of voluminous, rapidly changing data.
- Responsible for maintaining quality reference data in the source by performing operations such as cleaning and transformation and by ensuring integrity in a relational environment, working closely with stakeholders and the solution architect.
- Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
- Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Hive, Pig, and MapReduce cluster access for new users.
- Performed end-to-end architecture and implementation assessment of various AWS services such as Amazon EMR, Redshift, and S3.
- Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item so it can be suggested automatically, using Kinesis Data Firehose and an S3 data lake.
- Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
- Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes to schema RDDs.
- Imported data from different sources such as HDFS and HBase into Spark RDDs and performed computations using PySpark to generate the output response.
- Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce the cost of EC2 resources (see the sketch after this section).
- Migrated an existing on-premises application to AWS, used AWS services such as EC2 and S3 for small-data-set processing and storage, and maintained the Hadoop cluster on AWS EMR.
- Strong experience and knowledge of real-time data analytics using Spark Streaming, Kafka, and Flume; configured Spark Streaming to receive ongoing information from Kafka and store the streamed data in HDFS.
- Worked on building ETL pipelines for data ingestion, transformation, and validation on AWS, working along with the data steward under data-compliance requirements.
- Scheduled all jobs using Airflow scripts written in Python, adding different tasks to DAGs and Lambda.
- Used PySpark to extract, filter, and transform data in data pipelines.
- Developed Kibana dashboards based on Logstash data and integrated different source and target systems into Elasticsearch for near-real-time log analysis and end-to-end transaction monitoring.
- Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker-related tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
- Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
- Created Unix Shell scripts to automate the data load processes to the target Data Warehouse.
Environment: AWS S3, AWS Glue, AWS Redshift, Control-M, HiveQL, Zookeeper, Oozie, Spark, Scala, Sqoop.
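A minimal sketch of the Boto3-based Lambda described above for deregistering unused AMIs; the region list, owner filter, and 90-day cutoff are assumptions, and a production version would also verify that no running instances or launch configurations still reference each AMI.

```python
# Minimal sketch of a Lambda handler that deregisters old AMIs with Boto3.
# Regions, owner filter, and age cutoff are hypothetical assumptions.
import boto3
from datetime import datetime, timedelta, timezone

REGIONS = ["us-east-1", "us-west-2"]              # hypothetical application regions
CUTOFF = datetime.now(timezone.utc) - timedelta(days=90)

def lambda_handler(event, context):
    deregistered = []
    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)
        images = ec2.describe_images(Owners=["self"])["Images"]
        for image in images:
            created = datetime.strptime(image["CreationDate"], "%Y-%m-%dT%H:%M:%S.%fZ")
            if created.replace(tzinfo=timezone.utc) < CUTOFF:
                ec2.deregister_image(ImageId=image["ImageId"])
                deregistered.append(image["ImageId"])
    return {"deregistered": deregistered}
```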
Confidential
Data Engineer
Responsibilities:
- Played a lead role in gathering requirements, analyzing the entire system, and providing estimates for development and testing efforts.
- Involved in designing different components of the system, including Sqoop, Hadoop processing with MapReduce and Hive, Spark, and FTP integration to downstream systems.
- Wrote optimized Hive and Spark queries, for example using window functions and customizing Hadoop shuffle and sort parameters.
- Developed ETL jobs using PySpark, using both the DataFrame API and the Spark SQL API.
- Performed various transformations and actions in Spark; the resulting data was saved back to HDFS and from there to the target database.
- Migrated an existing on-premises application to AWS.
- Used AWS services such as EC2 and S3 for small-data-set processing and storage; maintained the Hadoop cluster on AWS EMR.
- Strong experience and knowledge of real-time data analytics using Spark Streaming, Kafka, and Flume.
- Configured Spark Streaming to receive ongoing information from Kafka and store the streamed data in HDFS.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
- Expertise in creating, debugging, scheduling, and monitoring Airflow jobs for ETL batch processing, loading data into Snowflake for analytical processes.
- Worked on building ETL pipelines for data ingestion, transformation, and validation on AWS, working along with the data steward under data-compliance requirements.
- Scheduled all jobs using Airflow scripts written in Python, adding different tasks to DAGs and Lambda.
- Used PySpark to extract, filter, and transform data in data pipelines.
- Skilled in monitoring servers using Nagios, CloudWatch, and the ELK stack (Elasticsearch, Logstash, Kibana).
- Used dbt (data build tool) for transformations in the ETL process, along with AWS Lambda and AWS SQS. Scheduled all jobs using Airflow scripts in Python, adding different tasks to DAGs and defining dependencies between the tasks (see the sketch after this section).
- Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Created Unix Shell scripts to automate the data load processes to the target Data Warehouse.
Environment: AWS S3, AWS Glue, AWS Redshift, Control-M, HiveQL, Zookeeper, Oozie, Spark, Scala, Sqoop.
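A minimal sketch of the Airflow scheduling described above, showing tasks added to a DAG with explicit dependencies; the DAG id, schedule, and callables are hypothetical, and the import paths assume Airflow 2.x.

```python
# Minimal sketch of an Airflow DAG with Python tasks and explicit dependencies.
# DAG id, schedule, and task bodies are hypothetical placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**_):
    print("pull source files to the staging area")

def transform(**_):
    print("run the PySpark transformation job")

def load(**_):
    print("load curated data into the target warehouse")

with DAG(
    dag_id="daily_campaign_etl",                 # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies between the tasks: extract -> transform -> load.
    t_extract >> t_transform >> t_load
```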