Big Data Engineer Resume
Pleasanton, CA
OBJECTIVE:
Big Data Engineer with 5+ years of experience developing high-performance PySpark jobs for data migration, Business Intelligence (BI) KPIs, and Data Warehousing. Seeking a position in an organization that offers ample scope to apply my knowledge and technical skills and to be part of a team that works energetically toward the continued growth of the organization.
PROFESSIONAL SUMMARY:
- Experience as a Big Data Engineer across domains including Retail/Fashion, Healthcare, and Travel.
- Solid understanding of the architecture and workings of the Hadoop framework, including the Hadoop Distributed File System (HDFS) and ecosystem components such as Pig, Hive, Sqoop, and PySpark.
- Extensive hands-on experience with Pig, Hive, and PySpark.
- Experience with AWS services including S3, Glue, Lambda, and Redshift.
- Experience with Microsoft Azure components such as Event Hubs, Stream Analytics, Azure SQL Data Warehouse (ADW), HDInsight clusters, and Azure Data Factory.
- Strong working experience in extracting, wrangling, ingesting, processing, storing, querying, and analyzing structured and semi-structured data.
- Experience in developing ETL frameworks using Talend for extracting and processing data.
- Well versed in creating pipelines in Azure Data Factory using activities such as Move & Transform, Copy, Filter, ForEach, and Databricks.
- Strong experience with SQL Server and T-SQL, constructing joins, user-defined functions, stored procedures, views, and indexes, managing user profiles, and enforcing data integrity.
- Knowledge of Star Schema and Snowflake modeling, fact and dimension tables, and physical and logical data modeling.
- Certified in Introduction to Programming Using Python from Microsoft.
- Experience in all phases of the software development life cycle (SDLC), from requirements gathering and design to development and application support.
- Involved in client support and day-to-day user interaction, such as working on Remedy tickets and resolving issues.
- Configured version control systems such as Git for integrating code during application development.
- Experience with issue tracking systems such as Jira.
TECHNICAL SKILLS:
Programming Languages: T-SQL, PL/SQL, Python, Core Java, Shell scripting
Databases: SQL Server, MySQL, Oracle Database, Teradata
Hadoop Ecosystem: HDFS, MapReduce, YARN, Hive, Pig, HBase, Impala, Sqoop, Flume, Spark
Cloud Technologies: Microsoft Azure (Blob Storage, ADF, Event Hubs), AWS (S3, Glue, Redshift)
ETL Frameworks: Talend, SSIS
Visualization Tools: Microsoft Power BI, Tableau, Seaborn, matplotlib, ggplot2, plotly
Development Environments /Cloud: AWS, Microsoft Azure, Anaconda Spyder, PyCharm, Jupyter Notebook
PROFESSIONAL EXPERIENCE:
Confidential, Pleasanton, CA
Big Data Engineer
Environment: PySpark, AWS (S3, Glue, Lambda, Redshift, Cloud9, Athena, CloudFormation, CodeCommit), SQL, Hive
Responsibilities:
- Collaborating with the business to understand processes and gather data requirements for processing raw files from S3.
- Developing Spark programs using the Python API to import data from S3 into Glue DynamicFrames and converting them to DataFrames to perform transformations and actions (see the sketch after this list).
- Designed and developed an ETL process in AWS Glue to migrate FlightAware usage data from the S3 data source to Redshift.
- Developed a data quality metrics script to monitor the quality of the daily partition files processed into the target tables.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames in Python.
- Developed PySpark scripts to merge static and dynamic files and cleanse the data.
- Created a CloudFormation stack to replicate the workflow in the production environment.
- Creating stacks using CloudFormation to define datasets in the AWS Glue Data Catalog.
- Used the Parquet data format to store data in HDFS.
- Responsible for data analysis and cleansing using Spark SQL queries.
- Monitoring CloudWatch logs and automating email alerts to customers with the status of each job.
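A minimal sketch of this Glue flow, assuming hypothetical names for the catalog database, table, Redshift connection, target table, and staging bucket (usage_db, flight_usage_raw, redshift_conn, analytics.flight_usage, s3://staging-bucket/): read an S3-backed catalog table into a DynamicFrame, convert it to a DataFrame for transformations, and load the result into Redshift.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Standard Glue job boilerplate: resolve the job name and initialize the job.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw S3 partition files registered in the Glue Data Catalog
# (database and table names are hypothetical).
raw_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="usage_db", table_name="flight_usage_raw"
)

# Convert to a Spark DataFrame to apply transformations and basic cleansing.
clean_df = (
    raw_dyf.toDF()
    .dropDuplicates()
    .filter(F.col("event_ts").isNotNull())
    .withColumn("load_date", F.current_date())
)

# Convert back to a DynamicFrame and write to Redshift through a catalog
# JDBC connection (connection, target table, and staging path are hypothetical).
out_dyf = DynamicFrame.fromDF(clean_df, glue_context, "flight_usage_clean")
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=out_dyf,
    catalog_connection="redshift_conn",
    connection_options={"dbtable": "analytics.flight_usage", "database": "dev"},
    redshift_tmp_dir="s3://staging-bucket/redshift-tmp/",
)
job.commit()
```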
Confidential, San Francisco, CA
Big Data Engineer
Environment: IBM Netezza, SQL, Linux, Data Warehouse, Talend, PySpark, AWS (S3, Glue, Lambda, Redshift)
Responsibilities:
- Involved in modeling the Clinical/Claim/Member transformation data mart.
- Built and maintained complex SQL scripts, indexes, views, and queries for data analysis and extraction.
- Worked on data mapping, data cleansing, program development for loads, and verification of converted data against legacy data.
- Formulated various data mappings to meet business requirements.
- Integrated data from different sources in the staging area using pre-processing steps such as data merging, data cleansing, and data aggregation.
- Responsible for performance tuning of SQL queries and supporting data loads.
- Worked with different data formats such as Parquet, Avro, SequenceFile, MapFile, and XML.
- Created data pipelines using Talend to extract data from multiple databases and migrate it to an AWS S3 bucket.
- Stored data in the S3 bucket on a daily basis and loaded it into Redshift using Glue (a sketch of the S3 landing step follows this list).
- Used AWS Glue for transformations and AWS Lambda to automate the process.
- Analyzed and executed test cases for various phases of testing, including integration and regression.
- Incrementally built out the data to support evolving data needs by adopting Agile methodology.
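A minimal PySpark sketch of the daily S3 landing step, using a hypothetical bucket and paths and assuming the S3A connector is configured on the cluster: the extracted data is written as Parquet partitioned by load date so the downstream Glue/Redshift load can pick up one partition per day.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims_daily_landing").getOrCreate()

# Read the daily extract produced upstream (path and schema are hypothetical).
claims = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://staging-bucket/claims/daily_extract/")
)

# Land the data as Parquet, partitioned by load date, for the Glue -> Redshift load.
(
    claims
    .withColumn("load_date", F.current_date())
    .write.mode("append")
    .partitionBy("load_date")
    .parquet("s3a://staging-bucket/claims/parquet/")
)
```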
Confidential, Berkeley, California
Big Data Engineer
Environment: Python, HDFS, PySpark, Hive, RabbitMQ, SQL, Azure (ADF, Blob Storage, SQL Data Warehouse), MSSQL, Teradata, Power BI.
Responsibilities:
- Collaborated with data architects to understand and gather the data requirements for the project model per the business needs.
- Developed a Python script to call the RabbitMQ API, extract data in JSON format, and load it into a Spark RDD.
- Developed Spark programs using PySpark to handle streaming data and load it into Azure Event Hubs.
- Developed Talend ingestion frameworks to move data between Teradata, MSSQL, HDFS, Azure Blob Storage, and Azure SQL Data Warehouse.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using Azure Data Factory, ingesting data into Azure Blob Storage and processing it in Azure Databricks.
- Responsible for estimating the cluster size and for monitoring and troubleshooting the Azure Databricks Spark cluster.
- Loaded data into Azure SQL Database using Azure Databricks.
- Processed real-time structured data using Stream Analytics into ADW to merge it with the data ingested from Hadoop into ADW.
- Ingested structured data from MySQL and SQL Server into HDFS as incremental imports using Talend jobs scheduled to run periodically.
- Responsible for creating Hive tables, loading the structured data resulting from MapReduce jobs into those tables, and writing Hive queries to further analyze the logs to identify issues and behavioral patterns.
- Tuned the performance of Hive data analysis by partitioning and clustering the data by date and location (see the sketch after this list).
- Developed job workflows in CAWA to automate the tasks of loading data into HDFS.
- Responsible for processing and deriving the data needed to build the customer-defined metrics displayed in Power BI.
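A minimal sketch of the Hive layout behind that tuning, with hypothetical database, table, and column names: a table partitioned by date and bucketed (clustered) by location, so queries that filter on a single day and group by location prune partitions instead of scanning the full table.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive_log_analysis")
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical database and table; partitioned by date, bucketed by location.
spark.sql("CREATE DATABASE IF NOT EXISTS logs")
spark.sql("""
    CREATE TABLE IF NOT EXISTS logs.user_events (
        user_id    STRING,
        location   STRING,
        event_type STRING,
        event_ts   TIMESTAMP
    )
    PARTITIONED BY (event_date DATE)
    CLUSTERED BY (location) INTO 32 BUCKETS
    STORED AS PARQUET
""")

# Example analysis query: the partition filter limits the scan to one day.
daily_by_location = spark.sql("""
    SELECT location, event_type, COUNT(*) AS events
    FROM logs.user_events
    WHERE event_date = DATE '2019-06-01'
    GROUP BY location, event_type
""")
daily_by_location.show()
```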
Confidential, Fremont, California
Data Engineer
Environment: SQL, MSSQL, Stored procedures, SSIS, HDFS, PySpark, Hive
Responsibilities:
- Worked closely with the business on gathering user requirements, analysis, design, and functional and technical specifications, and wrote SQL scripts for data transfer/migration.
- Worked with the ETL team to document the transformation rules for data migration from OLTP to the warehouse environment for reporting purposes.
- Extensively analyzed SQL stored procedures and enhanced them accordingly.
- Used SSIS packages to populate data from Excel into the database, using Lookup, Derived Column, and Conditional Split transformations to shape the required data.
- Created stored procedures, user-defined functions, views, and T-SQL scripts for complex business logic.
- Optimized the database by creating clustered and non-clustered indexes and indexed views.
- Used aggregation strategies to aggregate data and to sort and join tables.
- Extensively used joins and sub-queries in complex queries involving multiple tables from different databases; analyzed data and re-mapped fields based on business requirements.
- Used SQL Profiler to identify slow-running queries and optimized them for better performance.
- Updated data in tables based on user requirements using temp tables, CTEs, and joins.
- Worked with business analysts to provide solutions for technical difficulties and business-related issues.
- Developed PySpark scripts that read MSSQL tables and push the data into Hive tables on the big data platform (a sketch follows this list).
- Created views for the reporting team.
- Analyzed the requirements and framed the business logic for the ETL process.
- Created SQL jobs, monitored them regularly, and updated the business on successful migrations.
- Followed the Scrum development methodology.
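A minimal sketch of the MSSQL-to-Hive step, with hypothetical server, database, table, and credential names, and assuming the SQL Server JDBC driver is on the Spark classpath: read the source table over JDBC and persist it as a Hive table.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("mssql_to_hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the source table from SQL Server over JDBC
# (host, database, table, and credentials are hypothetical).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://sql-host:1433;databaseName=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", "change_me")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

# Persist into a Hive table for downstream reporting queries.
orders.write.mode("overwrite").saveAsTable("reporting_orders")
```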
Confidential
Data Analyst
Environment: SQL, ETL, Tableau
Responsibilities:
- Analyzed data and reported findings, producing complete reports on building capacity, inventory control metrics, error reporting, and performance management controls.
- Wrote SQL queries to perform data analysis and data modeling, and prepared data mapping documents to explain the transformation rules from source to target tables (a sketch follows this list).
- Created and supported a data management workflow spanning data collection, storage, and analysis through training and validation.
- Worked with the ETL team to document the transformation rules for data migration from OLTP to the warehouse environment for reporting purposes.
- Developed data migration and cleansing rules for the integration architecture.
- Developed trend lines, funnel charts, donut charts, heat maps, tree maps, drill-down reports, and 100% stacked bar charts in Tableau.
- Involved in publishing, scheduling, and subscriptions on Tableau Server, and in creating and managing users, groups, and sites on Tableau Server.
- Developed data collection processes and data management systems and maintained data integrity.
- Designed queries, compiled data, and generated reports in Excel and Tableau.
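A minimal sketch of the kind of validation query used alongside the mapping documents, with a hypothetical ODBC DSN, credentials, and table names: compare source and target row counts after a load.

```python
import pandas as pd
import pyodbc

# DSN, credentials, and table names are hypothetical.
conn = pyodbc.connect("DSN=warehouse;UID=analyst;PWD=change_me")

# Compare row counts between the staging source and the warehouse target table.
query = """
    SELECT 'source' AS side, COUNT(*) AS row_count FROM staging.inventory_src
    UNION ALL
    SELECT 'target' AS side, COUNT(*) AS row_count FROM dw.inventory_fact
"""
counts = pd.read_sql(query, conn)
print(counts)
```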