Data Engineer Resume
TX
SUMMARY
- IT professional with 7+ years of experience and expertise across the Big Data ecosystem: data acquisition, ingestion, modeling, storage, analysis, integration, and processing.
- Hands-on experience with Azure applications (PaaS & IaaS): Azure Synapse Analytics, SQL Azure, Azure Data Lake, Azure Data Factory, Azure Analysis Services, Azure Databricks, Azure monitoring, and Key Vault.
- Experience in controlling and granting database access and migrating on-premises databases to Azure Data Lake stores using Azure Data Factory.
- Built business cases for other teams in the organization to utilize the platform for additional features outside the IT Service Management processes.
- Extensive experience extracting and loading data from relational databases such as Teradata, Oracle and DB2 into Azure Data Lake storage via Azure Data Factory.
- Extensive knowledge of data analysis, T-SQL queries, ETL processes, Reporting Services (using SSRS, Power BI), and Analysis Services using SQL Server 2017 SSIS, SSRS, SSAS, and SQL Server Agent.
- Hands-on experience with Amazon Web Services (AWS) Cloud services such as EC2, VPC, S3, IAM, EBS, RDS, ELB, Route 53, OpsWorks, DynamoDB, Auto Scaling, CloudFront, CloudTrail, CloudWatch, CloudFormation, Elastic Beanstalk, AWS SNS, AWS SQS, AWS SES, AWS SWF, and AWS Direct Connect.
- Strong working experience with SQL and NoSQL databases (Cosmos DB, MongoDB, HBase, Cassandra), data modeling, tuning, disaster recovery, backup and creating data pipelines.
- Worked on distributed frameworks such as Apache Spark and Presto in Amazon EMR and Redshift, and interacted with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
- Experience building data pipelines in Python/PySpark/Hive SQL/Presto/BigQuery and building Python DAGs in Apache Airflow.
- Experience working with Python libraries such as NumPy, pandas, and Boto3, and querying Impala from Python.
- Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design, and implementing RDBMS-specific features.
- Experienced in developing production-ready Spark applications using Spark Components such as Spark SQL, MLlib, GraphX, DataFrames, Datasets, Spark-ML, and Spark Streaming.
- Expertise in deploying Kubernetes clusters in a cloud environment using CloudFormation templates and PowerShell scripting.
- Experience installing, configuring, and maintaining Apache Hadoop clusters for application development, along with Hadoop tools such as Sqoop, Hive, Pig, HBase, Kafka, Hue, Oozie, Spark, Scala, and Python.
- Experience in designing and developing applications in Spark using Python to compare the performance of Spark with Hive.
- Good working experience with Hive and HBase/MapR-DB integration.
- Experience converting Hive/SQL queries into Spark transformations using Spark DataFrames and Python (see the sketch below).
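
A minimal sketch of the Hive-to-DataFrame conversion pattern referenced above, assuming a hypothetical `sales` Hive table with `region`, `amount`, and `year` columns:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = (SparkSession.builder.appName("hive-to-df")
         .enableHiveSupport().getOrCreate())

# Original Hive query (hypothetical table/columns):
#   SELECT region, SUM(amount) AS total
#   FROM sales WHERE year = 2020 GROUP BY region
sales = spark.table("sales")

# Equivalent DataFrame transformation chain
totals = (sales
          .filter(F.col("year") == 2020)
          .groupBy("region")
          .agg(F.sum("amount").alias("total")))
totals.show()
```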
TECHNICAL SKILLS
Hadoop/Big Data Technologies: Hadoop, MapReduce, Oozie, Hive, Sqoop, Spark, NiFi, Impala, ZooKeeper, Cloudera Manager, Airflow.
NoSQL Databases: HBase, DynamoDB.
Monitoring and Reporting: Power BI, Tableau, custom shell scripts.
Hadoop Distributions: Hortonworks, Cloudera.
Database Connectivity: JDBC
Build Tools: Maven
Programming & Scripting: Python, Scala, SQL, Shell Scripting.
Databases: Oracle, MySQL, Teradata
Version Control: Git, Bitbucket
IDE Tools: Eclipse, Jupyter, PyCharm.
Operating Systems: Linux, Unix, Ubuntu, CentOS, Windows
Cloud: AWS, Azure.
Development Methods: Agile, Waterfall
PROFESSIONAL EXPERIENCE
Confidential - TX
Data Engineer
Responsibilities:
- Worked extensively on data transfer to AWS S3 and used AWS Redshift for cloud data storage.
- Handled data extraction and data ingestion from different data sources into S3 by creating ETL pipelines using Spark.
- Extensively worked with PySpark/Spark SQL for data cleansing and generating DataFrames and RDDs.
- Experienced in working with Spark SQL on different file formats like XML, JSON, and Parquet.
- Involved in the design and analysis of issues, providing solutions and workarounds to users and end clients.
- Handled formats such as SequenceFiles, XML files, and MapFiles using MapReduce programs.
- Experienced in transforming batch data from different sources using various PySpark APIs.
- Implemented Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), with Amazon Simple Storage Service (S3) and AWS Redshift for storage.
- Designed and built data processing applications using Spark on an AWS EMR cluster that consume data from AWS S3 buckets, apply the necessary transformations, and store the curated, business-ready data sets in a Snowflake analytical environment (see the PySpark sketch after this list).
- Experienced in maintaining a Hadoop cluster on AWS EMR.
- Used Spark to build tables that require multiple computations and non-equi joins.
- Modeled Hive partitions extensively for faster data processing.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Implemented various UDFs in Python as per requirements.
- Used Bitbucket to collaborate with other team members.
- Proficient with container systems such as Docker and container orchestration tools such as Amazon EC2 Container Service (ECS), as well as Terraform.
- Created a data pipeline utilizing processor groups and numerous processors in Apache NiFi for flat-file and RDBMS sources as part of a proof of concept (POC) on Amazon EC2.
- Extensively used AWS Athena to query structured data in S3 and feed systems such as Redshift to generate reports.
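
A minimal sketch of the S3-to-curated-zone Spark ETL described in this role, assuming hypothetical bucket names, an `event_id`/`event_ts` schema, and an EMR instance profile that supplies S3 credentials:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("s3-etl").getOrCreate()

# Hypothetical source bucket; on EMR, s3:// paths resolve through EMRFS.
raw = spark.read.json("s3://example-raw-bucket/events/")

curated = (raw
           .dropDuplicates(["event_id"])                     # basic cleansing
           .withColumn("event_date", F.to_date("event_ts"))  # derive partition column
           .filter(F.col("event_type").isNotNull()))

# Land partitioned Parquet back in S3 for downstream consumers
# (e.g., an Athena external table or a Snowflake external stage/COPY).
(curated.write.mode("overwrite")
        .partitionBy("event_date")
        .parquet("s3://example-curated-bucket/events/"))
```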
Confidential, Baltimore
Data Engineer
Responsibilities:
- Developed design patterns for feeding data into the data lake from a variety of sources and standardizing it to enable enterprise-level benchmarking and comparison.
- Implemented data standards and maintained data quality and master data management, with knowledge of data sources.
- Expert in developing Databricks notebooks for extracting data from various source systems such as DB2 and Teradata, and performing data cleansing, wrangling, ETL processing, and loading to Azure SQL DB.
- Carried out ETL operations in Azure Data Factory by connecting to various relational database source systems via JDBC connectors.
- Configured data pipelines using Azure Data Factory and built a custom alerts platform for monitoring. Ingested data in mini-batches and performed RDD transformations using Spark Streaming analytics in Azure Databricks.
- Developed Spark applications in Python using the PySpark API.
- Experienced in Spark SQL for data extraction, transformation, and aggregation across multiple file formats.
- Validated Databricks output by developing Python scripts and automated the process using ADF.
- Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.
- Used PySpark to read and write data formats such as JSON, Delta, and Parquet files from different sources (see the Databricks sketch after this list).
- Monitored SQL scripts using PySpark SQL and modified them for improved performance.
- Developed Spark code in Python 3 using PySpark/Spark SQL for faster testing and processing of data.
- Managed secret keys through Azure Key Vault and configured APIs to access the Key Vault through an authentication process.
- Built Power BI reports and developed SQL queries with stored procedures, common table expressions (CTEs), and temporary tables.
- Deployed Azure IaaS (Infrastructure as a service) virtual machines (VMs) and Cloud services (PaaS role instances) into secure VNets and subnets.
- Used SSIS (SQL Server Integration Services) to load data into multidimensional cubes.
- Managed development of ETL processes independently.
- Built ETL data pipelines to input data from Blob storage to Azure Data Lake Gen2 using Azure Data Factory (ADF).
- Designed and developed user interfaces and customization of Reports using Tableau.
- Designed cubes for data visualization, mobile/web presentation with parameterization and cascading.
- Involved in creation of CI/CD pipelines.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed data in Azure Databricks.
- Developed mappings and sessions based on business user requirements and business rules to load data from source flat files and RDBMS tables into target tables.
- Processed structured and semi-structured data in clusters using Spark SQL and the DataFrames API.
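
A minimal Databricks notebook sketch combining the Key Vault and ADLS ingestion bullets above; the secret scope, storage account, container, and column names are hypothetical, and `spark`/`dbutils` are the objects Databricks predefines in a notebook:

```python
from pyspark.sql import functions as F

# Fetch the storage key from a Key Vault-backed secret scope (hypothetical names)
storage_key = dbutils.secrets.get(scope="kv-backed-scope", key="adls-account-key")
spark.conf.set("fs.azure.account.key.exampleacct.dfs.core.windows.net", storage_key)

# Read raw JSON from ADLS Gen2
df = spark.read.json("abfss://raw@exampleacct.dfs.core.windows.net/db2_extracts/")

# Light cleansing before landing the data as Delta
clean = (df.dropna(subset=["customer_id"])
           .withColumn("load_ts", F.current_timestamp()))

(clean.write.format("delta").mode("append")
      .save("abfss://curated@exampleacct.dfs.core.windows.net/customers/"))
```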
Confidential - Auburn Hills, MI
Data Engineer
Responsibilities:
- Developed PySpark scripts as required by the project, such as for data validation, metadata template generation, and mailing alerts.
- Provided knowledge transfer (KT) to the new team and helped the team resolve errors to ensure data ingestion completes successfully.
- Designed and implemented Spark jobs to load data into and out of Hive tables; handled cluster management to keep the resources available on the Cloudera platform stable.
- Onboarded new applications to the Hadoop cluster.
- Developed automation scripts as per requirements and worked on change management for production changes.
- Loaded data from multiple data sources (SQL, DB2, and Oracle) into HDFS using Sqoop and stored it in Hive tables (see the sketch after this list).
- Extracted data from Teradata into HDFS using Sqoop.
- Used Sqoop to export the analyzed patterns back to Teradata.
- Expert in developing SSIS packages to ETL data into a heterogeneous data warehouse.
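
A minimal PySpark sketch of the RDBMS-to-Hive loading pattern in this role (shown with a JDBC read rather than Sqoop); the connection details and table names are hypothetical, and the DB2 JDBC driver jar is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("db2-to-hive")
         .enableHiveSupport().getOrCreate())

# Hypothetical DB2 connection; credentials would come from a secure store.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:db2://db2-host.example.com:50000/SALESDB")
          .option("driver", "com.ibm.db2.jcc.DB2Driver")
          .option("dbtable", "APP.ORDERS")
          .option("user", "etl_user")
          .option("password", "********")
          .option("fetchsize", "10000")  # stream rows instead of buffering them all
          .load())

# Land the extract in a Hive table as Parquet
orders.write.mode("overwrite").format("parquet").saveAsTable("staging.orders")
```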
Confidential
Data Engineer
Responsibilities:
- Experienced in Hadoop ecosystem, with an emphasis on big data solutions.
- Researched and resolved issues regarding integrity of data flow into databases.
- Identified and documented detailed business rules and use cases based on requirements analysis.
- Upheld security and confidentiality of documents and data within area of responsibility.
- Good understanding of Talend solution design.
- Developed automation scripts using Python, PySpark, and shell commands.
- Designed and developed complex data pipelines and maintained data quality to support a rapidly growing business.
- Demonstrated a POC on Impala performance on CDH and CDP by developing a Python script.
- Efficient in building Hive, Impala, and YARN pool monitoring scripts (a minimal sketch follows).
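
A minimal sketch of a YARN monitoring script like those described above, polling the ResourceManager REST API; the host, port, and alert threshold are hypothetical:

```python
#!/usr/bin/env python3
"""Report YARN cluster memory usage via the ResourceManager REST API."""
import requests

RM = "http://yarn-rm.example.com:8088"  # hypothetical ResourceManager address

def cluster_memory_usage():
    # /ws/v1/cluster/metrics is a standard RM endpoint; per-queue (pool) detail
    # lives at /ws/v1/cluster/scheduler, whose JSON shape depends on the scheduler.
    m = requests.get(f"{RM}/ws/v1/cluster/metrics", timeout=10).json()["clusterMetrics"]
    return m["allocatedMB"], m["totalMB"], m["appsRunning"]

if __name__ == "__main__":
    used, total, running = cluster_memory_usage()
    pct = 100.0 * used / total if total else 0.0
    print(f"apps running: {running}, memory: {used}/{total} MB ({pct:.1f}%)")
    if pct > 90:  # hypothetical alert threshold
        print("WARNING: cluster memory usage above 90%")
```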
