Data Engineer Resume
OH
SUMMARY
- Around 6 years of experience working with big data, primarily using the Hadoop framework and PySpark for data analysis, transformation, deployment, and ingestion. Knowledgeable about AWS Data Pipeline, data structures, and processing systems; use PySpark, Python, SQL, and Hive for data mining, cleaning, and munging.
- Extensive experience optimizing Spark performance with concepts such as persist, cache, broadcast, and efficient joins (a minimal sketch follows this summary).
- Experience improving performance and optimizing existing algorithms in Hadoop by working with Spark SQL, SparkContext, pair RDDs, DataFrames, YARN, and in-memory processing features such as Spark transformations and Spark Streaming.
- Hands-on experience writing PySpark scripts to process streaming data from data lakes using Spark Streaming, and building PySpark pipelines to process big data.
- Experience handling different file formats such as CSV, XML, log, ORC, Avro, Parquet, SequenceFile, MapFile, and RCFile.
- Experience working with Spark and Python on a cluster for analytics: installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle/Snowflake.
- Experience with the Hive data warehouse tool, including the creation of tables, partitioning and bucketing of data, and the development and optimization of HiveQL queries.
- Hands-on experience using AWS Kinesis, Lambda, and DynamoDB to implement real-time data streaming pipelines, and deploying AWS Lambda code from AWS S3 buckets.
- Extensive experience with data wrangling and numerical computation tools such as Pandas and NumPy.
- Experience creating, dropping, and altering tables at run time without blocking updates and queries, using Spark and Hive.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
- Experienced with cloud platforms such as AWS (Amazon Web Services), Azure, and Databricks (on both Azure and AWS).
- Data modeling experience with the Star schema, Snowflake schema, and transactional modeling.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Orchestration experience using Azure Data Factory and Airflow on multiple cloud platforms, with a solid understanding of how to leverage Airflow operators.
- Scheduled Airflow DAGs to run multiple Hive jobs, which run independently based on time and data availability.
- Hands-on experience with data analytics services such as Athena, Glue Data Catalog, and QuickSight.
- Addressed complex POCs from the technical side according to business requirements.
- Active Agile team player in production support, hotfix deployment, code reviews, system design and review, test cases, sprint planning, and demos.
- Effectively communicate with business units and stakeholders and provide strategic solutions according to the client's requirements.
- Well versed in Agile with Scrum, the Waterfall model, and Test-Driven Development (TDD) methodologies.
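Illustrative PySpark sketch of the join-optimization techniques summarized above (broadcast joins plus caching). The table names and S3 paths are hypothetical placeholders, not taken from any specific project.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-optimization").getOrCreate()

# Hypothetical inputs: a large fact table and a small dimension table.
orders = spark.read.parquet("s3://example-bucket/orders/")
customers = spark.read.parquet("s3://example-bucket/customers/")

# Cache the large DataFrame because it is reused by more than one action.
orders.cache()

# Broadcasting the small table avoids shuffling the large one across the cluster.
enriched = orders.join(broadcast(customers), on="customer_id", how="left")

enriched.write.mode("overwrite").parquet("s3://example-bucket/enriched_orders/")
```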
TECHNICAL SKILLS
Programming Languages: Python, SQL, PLSQL
Big data tools: Spark, Hive, Sqoop, Kafka, YARN, HBase
Databases: Oracle, Teradata, SQL Data Warehouse, Azure, Databricks
BI Visualization Tools: Tableau, Power BI, Birst BI
ETL Tools: ADF, Informatica, SSIS
Cloud platforms: Azure, AWS
Scheduling Tools: Airflow, Oozie
IDEs: PyCharm, Jupyter, IntelliJ, Visual Studio
PROFESSIONAL EXPERIENCE
Confidential, OH
Data Engineer
Responsibilities:
- Created AWS S3 buckets and managed policies for S3 buckets and Glacier for storage and backup on AWS.
- Collected data from edge device databases, exported it in CSV format, and stored it in AWS S3 buckets.
- Using PySpark, created data processing tasks such as reading data from external sources, merging the obtained data, performing data enrichment, and loading it into data warehouses.
- Using PySpark, performed transformations and actions on imported data from AWS S3.
- Created Lambda functions to create ad-hoc tables in S3 to add schema and structure to data, and performed data validation, filtering, sorting, and transformations for every data change in a DynamoDB table before loading the transformed data into PostgreSQL.
- Worked on Python API calls and landed data in S3 from external sources.
- Scheduled Airflow DAGs to export data to AWS S3 buckets by triggering an AWS Lambda function (see the sketch at the end of this section).
- Developed robust and scalable data integration pipelines to transfer data from an S3 bucket to a Redshift database using Python and AWS Glue.
- Developed Spark-based real-time data ingestion and real-time analytics, as well as AWS Lambda functions to power the system's real-time monitoring dashboards.
- Worked with the Snowflake cloud data warehouse and AWS S3 buckets to integrate data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
- Implemented a data warehouse solution consisting of ETLs and on-premises-to-cloud migration; good expertise building and deploying batch and streaming data pipelines in cloud environments.
- Used Tableau for visualization charts and regularly communicated findings to Product Owners.
- Worked on Tableau visualization charts and daily status dashboards.
- Demonstrated good communication skills and story narratives during Sprint demos to leadership and stakeholders.
Environment: PySpark, Hive, Python 3, AWS, S3, Airflow, SQL, Excel, Spark SQL, Redshift, ETL/ELT, AWS Glue, Tableau
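A condensed sketch of the Airflow-to-Lambda export pattern described in this role: a daily DAG that invokes an AWS Lambda function through boto3 to land data in S3. The DAG id, function name, region, and payload are illustrative assumptions.

```python
import json
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def trigger_export_lambda(**context):
    """Invoke a (hypothetical) export Lambda that writes data to an S3 bucket."""
    client = boto3.client("lambda", region_name="us-east-1")
    client.invoke(
        FunctionName="export-to-s3",        # assumed Lambda function name
        InvocationType="Event",             # asynchronous; the Lambda performs the export
        Payload=json.dumps({"run_date": context["ds"]}),
    )


with DAG(
    dag_id="daily_s3_export",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    export_task = PythonOperator(
        task_id="trigger_export_lambda",
        python_callable=trigger_export_lambda,
    )
```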
Confidential
Data Engineer
Responsibilities:
- Worked with extensive data sets in Big Data to uncover patterns, surface problems, and unleash value for the enterprise.
- Worked with internal and external data sources to improve data accuracy and coverage, and generated recommendations on the process flow to accomplish the goal.
- Developed PySpark Scripts to process streaming data from data lakes using Spark Streaming.
- Using PySpark, created data processing pipelines that read data from external sources, merge the obtained data, perform data enrichment, and load it into data warehouses; used PySpark User Defined Functions to extend DataFrame capabilities.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory and Spark SQL; ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
- Consumed real-time events from Kafka streams and persisted them to HDFS in Parquet format (see the sketch at the end of this section).
- Using the Spark DataFrame API, performed transformations, cleaning, and filtering on the imported data before loading it into Hive.
- Implemented and created Hive scripts for transformations such as aggregation, evaluation, and filtering.
- Developed sophisticated Hive queries to extract key performance indicators by joining various tables, and performed data analysis on Hive tables using HiveQL.
- Worked extensively on Hive to analyze the data and create reports for data quality.
- Wrote Hive queries for data analysis to meet business requirements, and designed and developed User Defined Functions (UDFs).
- Involved in creating Hive tables (managed and external) and loading and analyzing data using Hive queries.
- Improved Spark job performance by using broadcast joins and reducing the amount of shuffling.
- Handled production incidents assigned to our workgroup promptly, fixed bugs or routed them to the respective teams, and met the SLAs.
Environment: SparkSQL, PySpark, Python, SQL, Kafka, Hive, Hadoop, HDFS, Tableau, MapReduce, Sqoop, Azure
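A minimal sketch of the Kafka-to-HDFS flow referenced above, written with Spark Structured Streaming as one possible implementation; broker address, topic, and HDFS paths are placeholders.

```python
from pyspark.sql import SparkSession

# Requires the Spark Kafka connector package on the classpath (spark-sql-kafka).
spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read events from a Kafka topic (placeholder broker and topic names).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast to strings before persisting.
decoded = events.selectExpr(
    "CAST(key AS STRING) AS key",
    "CAST(value AS STRING) AS value",
    "timestamp",
)

# Persist micro-batches to HDFS as Parquet, with a checkpoint for fault tolerance.
query = (
    decoded.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/events")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```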
Confidential
MS SQL/MSBI Developer
Responsibilities:
- Analyzed business requirements, facilitating the planning and implementation phases of the OLAP model in team meetings.
- Participated in Team meetings to ensure a mutual understanding with business, development and test teams.
- Encapsulated frequently executed SQL statements into stored procedures to reduce query execution times.
- Designed SSIS packages to extract, transfer, and load (ETL) existing data into SQL Server from different environments for the SSAS cubes (OLAP).
- Created SSIS packages to implement error/failure handling with event handlers, row redirects, and logging.
- Used SQL Server Reporting Services (SSRS) to create and format cross-tab, conditional, drill-down, Top N, summary, form, OLAP, sub-report, ad-hoc, parameterized, interactive, and custom reports.
- Managed packages in the SSISDB catalog with environments; automated deployment and execution with SQL Agent jobs.
- Involved in the design of a data warehouse using the Star Schema methodology and converted data from various sources to SQL tables.
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets using Power BI.
- Designed complex, data-intensive reports in Power BI utilizing various graph features such as gauges and funnels.
Environment: Power BI, Birst BI, ETL, SQL, SSIS.
Confidential
Software Engineer
Responsibilities:
- Reviewed system requirements and attended requirements meetings with analysts and users.
- Involved in the project life cycle from documentation to unit testing, with development as the priority.
- Actively involved in testing after creating reports.
- Resolved reporting issues by identifying whether each was a report-related or a source-related issue.
- Applied hotfixes in production environments.
- Interacted with clients to make enhancements to the reports.
- Published dashboards on Power BI Service.
- Worked on extracts and scheduled them on Power BI Service.
- Managed access to reports, dashboards, and data for individual users using roles.
- Worked on issues within the given SLA time and ensured the SLA was not breached.
- Provided development support for System Testing, Product Testing, User Acceptance Testing, Data Conversion Testing, Load Testing, and Production.
Environment: Birst BI, Microsoft Power BI, Amazon Redshift, Agile methodologies.