Sr. Data Engineer Resume
San Jose, CA
SUMMARY
- 7+ years of IT experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience in all phases of diverse technology projects specializing in Data Science and Machine Learning.
- Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Extensive experience in IT data analytics projects; hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer.
- Good understanding of Big Data Hadoop and YARN architecture, along with various Hadoop daemons such as JobTracker, TaskTracker, NameNode, DataNode, and Resource/Cluster Manager, as well as Kafka (distributed stream processing).
- Experience in database design and development with Business Intelligence using Integration Services (SSIS), DTS packages, SQL Server Analysis Services (SSAS), DAX, OLAP cubes, Star Schema, and Snowflake Schema.
- Expertise in various phases of the project life cycle (Design, Analysis, Implementation, and Testing).
- Proficiency in managing software development sprint, test, and delivery cycles for development teams.
- Expertise in working with Star Schema and Snowflake Schema methodologies.
- Experience in Bash and Python scripting with a focus on DevOps tools, CI/CD, AWS cloud architecture, and hands-on engineering.
- Excellent understanding of Enterprise Data Warehouse best practices; involved in full life-cycle data warehouse development.
- Expertise in data models and dimensional modeling with 3NF, Star, and Snowflake schemas for OLAP and Operational Data Store (ODS) applications.
- Extensive Shell/Python scripting experience for Scheduling and Process Automation.
- Good exposure to Development, Testing, Implementation, Documentation and Production support.
- Expert in providing ETL solutions for any type of business model.
- Hands-on experience analyzing data using Hadoop ecosystem components including HDFS, MapReduce, Hive, and Pig.
- Extensive experience working with Informatica PowerCenter.
TECHNICAL SKILLS
Programming Languages: Python and basics of R and Java programming
Packages: ggplot2, caret, dplyr, RWeka, gmodels, RCurl, twitteR, NLP, Reshape2, rjson, plyr, pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, Beautiful Soup, Rpy2.
Reporting Tools: Tableau, SAS BI, Microsoft Power BI
Databases: SQL, Hive, Spark SQL, MySQL
Big Data Technologies: Spark and Hadoop
ETL Tools: SSIS
Operating Systems: Windows, Linux/Unix
PROFESSIONAL EXPERIENCE
Sr. Data Engineer
Confidential, San Jose, CA
Responsibilities:
- Involved in converting Hive/SQL queries into Spark transformations using Spark Data Frames and Scala.
- Developed shell scripts for handling the movement of data files between HDFS and local file system.
- Developed ETLs for source data extraction, transformation, and population processes.
- Involved in unit testing, integration testing, and regression testing on a regular basis to improve application performance.
- Developed shell scripts for scheduling/running ETL jobs using pmcmd command.
- Involved in the analysis, specification, design, implementation, and testing phases of the Software Development Life Cycle (SDLC) and used Agile methodology (Scrum) for application development.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Implemented data aggregations using Scala on Spark.
- Involved in relational and dimensional data modeling to create logical and physical database designs and ER diagrams using data modeling tools such as Erwin.
- Developed multiple POCs using Spark with Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive.
- Involved in the process of designing Google Cloud Architecture.
- Developed multiple POCs using PySpark (Python) and deployed machine learning models on the Yarn cluster.
- Developed POCs on Spark Streaming APIs to perform the necessary transformations and actions for building the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra.
- Implemented static partitioning, dynamic partitioning, and bucketing in Hive using internal and external tables.
- Developed Data governance methods for handling the different domains of Customer data.
- Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems, as well as RDBMS and NoSQL data stores, for data access and analysis.
- Implemented the MVC design pattern for the web application, built on MVC5 with Razor syntax as the view engine and C# for the back end.
- Built data pipelines in Airflow on GCP for ETL-related jobs using various Airflow operators (a minimal sketch follows this role's Environment line).
- Created a Git repository and added the project to GitHub; created reports in Looker based on Snowflake connections.
- Delivered the content using HTML5, CSS3, JS, and deployed the application to production using Linux operating system and Apache web server.
- Engineered multiple connectors on IBM Streams for processing Site Catalyst (Online Interaction) data.
- Worked on Hive and BigQuery (BQ) to export data for further analysis and to transform files from different analytical formats to text files.
- Worked with PowerShell and UNIX scripts for file transfer, emailing and other file related tasks.
Environment: GCP, Java, BigQuery, GCS Bucket, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, bq command-line utility, Dataproc, Cloud SQL, MySQL, PostgreSQL, SQL Server, Python, Scala, Spark, Hive, Spark SQL
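A minimal sketch of the kind of Airflow ETL pipeline on GCP described in the bullets above; the DAG id, bucket, dataset, table names, and schedule are illustrative placeholders rather than details of the actual project.

```python
# Hypothetical GCS-to-BigQuery ETL DAG; all names below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="daily_sales_etl",                # placeholder DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Load the day's raw CSV files from a GCS bucket into a BigQuery staging table.
    load_raw = GCSToBigQueryOperator(
        task_id="load_raw_to_staging",
        bucket="example-raw-bucket",         # placeholder bucket
        source_objects=["sales/{{ ds }}/*.csv"],
        destination_project_dataset_table="analytics.staging_sales",
        source_format="CSV",
        write_disposition="WRITE_TRUNCATE",
    )

    # Aggregate the staging table into a reporting table with a BigQuery SQL job.
    aggregate = BigQueryInsertJobOperator(
        task_id="aggregate_sales",
        configuration={
            "query": {
                "query": (
                    "SELECT region, SUM(amount) AS total_amount "
                    "FROM analytics.staging_sales GROUP BY region"
                ),
                "destinationTable": {
                    "projectId": "example-project",   # placeholder project
                    "datasetId": "analytics",
                    "tableId": "sales_by_region",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    load_raw >> aggregate
```

On Cloud Composer a DAG like this runs on the daily schedule, with the load and aggregation steps tracked and retried as separate tasks.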
Data Engineer
Confidential, Austin TX
Responsibilities:
- Designed and developed custom data integration pipelines on Facebook's big data stack, including Python, YAML, Hive, Vertica, and Dataswarm.
- Designed and developed a custom aggregation framework for reporting and analytics in Hive, Presto, and Vertica.
- Developed ETL mappings and workflows using Informatica and Dataswarm.
- Developed business logic and implemented functionality in C#.
- Interacted with Business Analysts and Data Modelers to define mapping documents and the design process for different sources and targets.
- Designed and developed Spark Scala code for fast processing of Hive queries.
- Developed HIVE scripts to transfer data from and to HDFS.
- Involved in different phases of building the Data Marts like analyzing business requirements, ETL process design, performance enhancement, go-live activities and maintenance.
- Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/Map operations.
- Developed a POC to execute machine learning models on the Databricks platform using the Spark ML library.
- Performed data accuracy, data analysis, and data quality checks before and after loading the data (a minimal sketch follows this role's Environment line).
- Migrated servers, databases, and applications from on-premises to AWS and Azure; processed schema-oriented and non-schema-oriented data using Scala and Spark.
- Worked with Teradata, SQL Server, Apache Spark, and Sqoop; migrated PostgreSQL and MySQL databases from on-premises to AWS EC2 and RDS environments.
- Worked on Business Intelligence standardization to create database layers with user-friendly views in Vertica that can be used for development of various Tableau reports/ dashboards.
- Used Informatica to extract, transform, and load data from SQL Server to Oracle databases.
- Created dynamic BI reports/dashboards for production support in Excel, PowerPoint, Power BI, Tableau, MySQL Server, and PHP.
- Worked on complex information models, logical relationships, and data structures from MySQL, Oracle, and Hive/Presto.
Environment: Hadoop, AWS, Java, MapReduce, HDFS, Hive, Sqoop, Spring Boot, Cassandra, Dataswarm, Data Lake, Oozie, Kafka, Spark, Scala, Azure, GitHub, Docker, Talend Big Data Integration.
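A minimal PySpark sketch of the pre- and post-load data quality checks mentioned above; the paths, table name, and key column are hypothetical placeholders.

```python
# Hypothetical pre-/post-load data-quality comparison; names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq_checks").enableHiveSupport().getOrCreate()

def quality_profile(df, key_cols):
    """Collect simple quality metrics: row count, duplicate keys, and null counts."""
    return {
        "row_count": df.count(),
        "duplicate_keys": df.count() - df.dropDuplicates(key_cols).count(),
        "null_counts": df.select(
            [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
        ).first().asDict(),
    }

# Compare the source extract with the loaded target table (placeholder names).
source_df = spark.read.parquet("/data/staging/orders")
target_df = spark.table("analytics.orders")

source_profile = quality_profile(source_df, key_cols=["order_id"])
target_profile = quality_profile(target_df, key_cols=["order_id"])

# Fail the run if the loaded row count drifts from the source extract.
if source_profile["row_count"] != target_profile["row_count"]:
    raise ValueError(
        f"Row count mismatch: source={source_profile['row_count']} "
        f"target={target_profile['row_count']}"
    )
```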
Data Engineer
Confidential, Tampa, FL
Responsibilities:
- Developed more efficient data collection procedures, with automated data cleaning and transformation before inserting the data into the database.
- Developed a NiFi Workflow to pick up the data from Data Lake as well as from server and send that to Kafka broker.
- Involved in integrating Hive queries into the Spark environment using Spark SQL.
- Developed a Spark job in Java that indexes data into Elasticsearch from external Hive tables stored in HDFS.
- Designed and developed code, scripts, and data pipelines that leverage structured and unstructured data.
- Developed automated regression scripts in Python to validate ETL processes across multiple databases such as Redshift, MongoDB, T-SQL, and SQL Server.
- Worked closely with the technical team on special projects requiring data-driven Python visualizations to integrate into their applications
- Designed different A/B test experiments for validating business solutions based on past performance to make new recommendations.
- Worked extensively in Python to extract, compile, and analyze data and to build effective visualizations reflecting the insights obtained for senior executives.
- Implemented Spark SQL with various data sources such as JSON, Parquet, ORC, and Hive (a minimal sketch follows this role's Environment line). Worked closely with leadership to accommodate sales and business data-related information requirements.
- Worked closely with the analytics and sales team when undertaking a new project right from determining the problem definition, through data acquisition, exploration, and visualization, to evaluating metrics for the same.
- Worked closely with data scientists to assist on feature engineering, model training frameworks, and model deployments at scale.
Environment: Apache Hadoop, HDFS, Hive, Java, Sqoop, Spark, VMG (OATH), Teradata, MySQL, Apache Oozie, SFTP, IBM Streams, Python, Scala.
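A minimal PySpark sketch of combining JSON, Parquet, ORC, and Hive sources through Spark SQL, as described above; the paths, table names, and columns are hypothetical, and the three file-based sources are assumed to share a compatible schema.

```python
# Hypothetical multi-format Spark SQL job; all paths and names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("multi_format_sources")
    .enableHiveSupport()                         # required to read Hive tables
    .getOrCreate()
)

# Read the same logical entity from three storage formats and combine them.
events_json = spark.read.json("/data/raw/events")
events_parquet = spark.read.parquet("/data/curated/events")
events_orc = spark.read.orc("/data/archive/events")
all_events = events_json.unionByName(events_parquet).unionByName(events_orc)

# Register views so the file-based data can be joined to a Hive table in plain SQL.
all_events.createOrReplaceTempView("events")
spark.table("sales.customers").createOrReplaceTempView("customers")

daily_summary = spark.sql("""
    SELECT c.region, to_date(e.event_ts) AS event_date, COUNT(*) AS events
    FROM events e
    JOIN customers c ON e.customer_id = c.customer_id
    GROUP BY c.region, to_date(e.event_ts)
""")

daily_summary.write.mode("overwrite").parquet("/data/reports/daily_summary")
```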
Data Analyst
Confidential, Atlanta GA
Responsibilities:
- Understood the business requirements and prepared the design documents.
- Understood the source data and its granularity.
- Determined and documented the source systems and the current and historical data to be used in the extraction.
- Integrated different source systems and loaded the data into data warehouse tables.
- Applied Slowly Changing Dimensions (Type 1 and Type 2) effectively to handle delta loads (a minimal sketch follows this role's Environment line).
- Worked on complex scenarios using only Informatica without writing SQL queries.
- Automated many DataStage jobs using UNIX scripting to reduce the time spent on monitoring for support.
- Interacted with the onshore team on daily calls.
- Completed unit testing and followed the code review checklist.
- Prepared various mappings to load the data into different stages like Landing, Staging and Target tables.
Environment: MapReduce, Hadoop, Hive, Shell Scripting, Apache Spark, Scala, Snowflake, IntelliJ, PyCharm, SQL, Oozie.
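A minimal PySpark sketch of the Type 2 Slowly Changing Dimension handling described above; the dimension and delta table names, the customer_id key, and the address attribute are hypothetical, and inserts of brand-new customers are omitted for brevity.

```python
# Hypothetical SCD Type 2 handling for changed records; names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd_type2").enableHiveSupport().getOrCreate()

dim = spark.table("dw.customer_dim")             # existing dimension
delta = spark.table("staging.customer_delta")    # incoming changed records
                                                 # (assumed to share the dimension's business columns)

current = dim.filter("is_current = true")
history = dim.filter("is_current = false")

# Keys whose tracked attribute differs between the dimension and the delta.
changed_keys = (
    current.join(delta, "customer_id")
           .filter(current["address"] != delta["address"])
           .select("customer_id")
           .distinct()
)

# Close out the current versions of the changed keys.
closed = (
    current.join(changed_keys, "customer_id", "left_semi")
           .withColumn("is_current", F.lit(False))
           .withColumn("end_date", F.current_date())
)

unchanged_current = current.join(changed_keys, "customer_id", "left_anti")

# Incoming rows for the changed keys become the new current versions.
new_versions = (
    delta.join(changed_keys, "customer_id", "left_semi")
         .withColumn("is_current", F.lit(True))
         .withColumn("start_date", F.current_date())
         .withColumn("end_date", F.lit(None).cast("date"))
)

result = history.unionByName(unchanged_current).unionByName(closed).unionByName(new_versions)
result.write.mode("overwrite").saveAsTable("dw.customer_dim_scd2")
```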
Software Developer
Confidential, Milpitas, CA
Responsibilities:
- Involved extensively in issue resolution, including but not limited to debugging of reporting-related issues, dashboard design issues, archiving, and performance monitoring issues.
- Involved in developing Pig Scripts for change data capture and delta record processing between newly arrived data and already existing data in HDFS.
- Built the web application using Python, Django, Flask, WSGI, Redis, WebSockets, and AWS (see the sketch at the end of this section).
- Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
- Created shell scripts for handling file uploads to the HDFS directory and for submitting Spark jobs.
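A minimal sketch of a Flask endpoint backed by Redis, in the spirit of the web-application work described above; the route, key name, and connection settings are illustrative placeholders.

```python
# Hypothetical Flask + Redis service; endpoint and key names are placeholders.
from flask import Flask, jsonify
import redis

app = Flask(__name__)
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

@app.route("/visits")
def visits():
    # Increment a simple page-visit counter stored in Redis.
    count = cache.incr("visit_count")
    return jsonify({"visits": count})

if __name__ == "__main__":
    # A production deployment would sit behind a WSGI server (e.g., gunicorn)
    # rather than the Flask development server.
    app.run(host="0.0.0.0", port=8000)
```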