Data Engineer/Big Data Developer Resume
SUMMARY
- 8+ years of IT experience with multinational clients, including 4+ years with Big Data ecosystem components such as Apache Spark, Hadoop, Hive, HBase, SQL, and Sqoop.
- Hands-on experience with the fundamental building blocks of Spark (RDDs) and the operations used to implement business logic, such as transformations, actions, and functions performed on RDDs.
- Experience handling tickets/incidents under stringent SLAs and providing 24x7 support for critical production environments.
- Support includes file system management and monitoring, cluster monitoring and management, and automating/scripting backups and restores.
- Performed root cause analysis and identified and implemented corrective and preventive measures.
- Documented standards, processes, and procedures relating to best practices, issues, and resolutions.
- Experienced in real-time Big Data solutions using HBase, handling billions of records.
- Experienced in importing and exporting data between RDBMS and HDFS using Sqoop.
- Experience with Azure transformation projects and Azure architecture decision making; architected and implemented ETL and data movement solutions using Azure Data Factory (ADF) and SSIS.
- Good working knowledge of processing batch applications.
- Experienced in writing MapReduce programs and UDFs in Hive.
- In-depth understanding of DataFrames and Datasets in Spark SQL.
- Exposure to Apache Kafka for building data pipelines that carry logs as streams of messages using producers and consumers (a Kafka sketch follows this summary).
- Hands-on experience with GCP (Google Cloud Platform) and good knowledge of GCP storage services.
- Good experience with Hive partitioning and bucketing, performing different types of joins on Hive tables, and implementing Hive SerDes such as JSON and Avro.
- Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Good knowledge of Data Warehousing, ETL development, Distributed Computing and large-scale data processing.
- Experience in Agile methodology using the JIRA Scrum tool.
- Collaborated with the infrastructure, network, database, application, and BI teams to ensure data quality and availability.
- Strong knowledge of Software Development Life Cycle and expertise in detailed design documentation.
- Excellent communication skills; able to perform at a high level and meet deadlines.
- Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables (a PySpark sketch follows this summary).
- Worked on reading and writing multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark.
- Analyzed SQL scripts and re-implemented them using PySpark SQL for faster performance.
- Optimized Hive queries using best practices and the right parameters, along with technologies such as Hadoop, YARN, Python, and PySpark.
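The CSV-to-ORC loading pattern mentioned above can be sketched as follows. This is a minimal, illustrative PySpark sketch only: the HDFS paths, feed names, and the analytics.events_orc table are hypothetical placeholders, not actual project objects.

```python
# Minimal PySpark sketch: load CSV feeds that may have differing schemas,
# align them to a common set of columns, and write them to a Hive ORC table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("csv_to_hive_orc")
    .enableHiveSupport()          # needed so saveAsTable targets the Hive metastore
    .getOrCreate()
)

# Read each CSV feed separately because the schemas differ (assumed layout).
df_a = spark.read.option("header", True).csv("hdfs:///landing/feed_a/")
df_b = spark.read.option("header", True).csv("hdfs:///landing/feed_b/")

# Align both frames on the union of their columns, filling gaps with nulls.
all_cols = sorted(set(df_a.columns) | set(df_b.columns))

def align(df):
    for c in [c for c in all_cols if c not in df.columns]:
        df = df.withColumn(c, F.lit(None).cast("string"))
    return df.select(all_cols)

combined = align(df_a).unionByName(align(df_b))

# Write as ORC into a Hive table (overwrite for a full reload).
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
combined.write.mode("overwrite").format("orc").saveAsTable("analytics.events_orc")
```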
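As a companion to the Kafka bullet above, the following kafka-python sketch shows a producer publishing log records and a consumer reading them back. The broker address and the app-logs topic are assumed names used only for illustration.

```python
# Minimal kafka-python sketch: publish log lines as messages and consume them
# on the other side of the pipeline.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="broker1:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("app-logs", {"level": "INFO", "msg": "user login"})
producer.flush()

consumer = KafkaConsumer(
    "app-logs",
    bootstrap_servers="broker1:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(record.value)   # downstream processing would go here
    break                 # stop after one message in this sketch
```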
TECHNICAL SKILLS
Big Data: Hadoop, Hive, Apache Spark, Spark SQL, Zookeeper, Data Factory, Sqoop, Hue, HBase, MySQL, ThoughtSpot, SQL, Apache Kafka, BigID, Automic
Data Warehousing: Informatica PowerCenter 9.x/8.x/7.x, Informatica Cloud, Talend Open Studio & Integration Suite, Azure SQL Analytics
Languages: Python, Shell Scripting, Selenium
Database: MySQL, Oracle, Microsoft SQL Server
IDE / Testing Tools: Eclipse, IntelliJ IDEA
Operating System: Windows, UNIX, Linux
Cloud Computing: AWS, Azure, Rackspace, OpenStack
SDLC Methodologies: Agile/Scrum, Waterfall
Scripting Languages: Unix, Python, Windows PowerShell
RDBMS Utility: Toad, SQL Plus, SQL Loader
PROFESSIONAL EXPERIENCE
Confidential
Data Engineer/Big Data Developer
Responsibilities:
- Worked with Hive components to improve performance and optimization in Hadoop.
- Developed custom aggregate functions using Spark SQL and performed interactive querying
- Applied a solid understanding of partitioning and bucketing concepts in Hive, and designed both managed and external tables in Hive to optimize performance (a Hive DDL sketch follows this role).
- Created Hive external tables, views, and scripts for transformations such as filtering, aggregation, and partitioning of tables.
- Responsible for monitoring production status, ensuring the ETL process works as expected, and handling customer communication around production issues.
- In-depth knowledge of hardware sizing, capacity management, cluster management and maintenance, performance monitoring, and configuration.
- Respond to system generated alerts/escalations relating to any failures on application platform.
- Design and implement database solutions in Azure SQL Data Warehouse, Azure SQL
- Architect & implement medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Handled importing data from various data sources, performed transformations using Hive, and loaded data between Teradata and HDFS.
- Migrated on-premises Informatica ETL processes to the AWS cloud and Snowflake.
- Expert in writing business-analytics scripts using HiveQL.
- Worked with Python to process log files and produce cleaner, optimized results.
- Developed custom aggregate functions using Python dictionaries.
- Used Python regular-expression functions such as findall, search, split, and sub (a short re-module sketch follows this role).
- Worked on automation using Selenium with Python
- Followed Agile methodology and Scrum meetings to track, optimize, and tailor features to customer needs.
- Gained good business knowledge across different categories of products and designs.
- Involved in developing ThoughtSpot reports and automated workflows to load data.
- Implemented continuous integration and deployment (CI/CD) through Jenkins for Hadoop jobs.
Environment: HDFS, Hive, Sqoop, SQL, Zookeeper, Hortonworks, Hue, LINUX, Big Data, UNIX Shell Scripting, Spark, Putty, ThoughtSpot, Aorta framework
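The Hive partitioning work noted in this role can be sketched through Spark's Hive support as below. The database, table, columns, staging table, and HDFS location are all hypothetical; bucketing is mentioned only in a comment since loading bucketed Hive tables is usually handled through Hive itself.

```python
# Minimal sketch of a partitioned Hive external table created and loaded via Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive_ddl").enableHiveSupport().getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS sales")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC
    LOCATION 'hdfs:///warehouse/external/sales/orders'
""")
# Bucketing would add a CLUSTERED BY (customer_id) INTO 16 BUCKETS clause,
# typically when the table is created and loaded through Hive itself.

# Static-partition load from an assumed staging table; dynamic partitioning is
# the usual choice when many dates arrive at once.
spark.sql("""
    INSERT OVERWRITE TABLE sales.orders PARTITION (order_date = '2021-01-01')
    SELECT order_id, customer_id, amount
    FROM sales.orders_staging
    WHERE order_date = '2021-01-01'
""")
```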
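A quick illustration of the re-module calls referenced in this role; the log-line format is invented for the example.

```python
# Sketch of findall, search, split, and sub against a sample log line.
import re

line = "2021-01-01 12:00:03 ERROR payment-svc timeout after 30s"

timestamps = re.findall(r"\d{2}:\d{2}:\d{2}", line)    # ['12:00:03']
level      = re.search(r"\b(INFO|WARN|ERROR)\b", line)  # match object or None
tokens     = re.split(r"\s+", line)                     # whitespace-separated fields
masked     = re.sub(r"\d", "#", line)                   # replace every digit

print(timestamps, level.group(1) if level else None, tokens[:3], masked)
```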
Confidential
Big Data developer
Responsibilities:
- Involved in requirements gathering in coordination with business analysts.
- Responsible for creating technical documents such as High-Level Design and Low-Level Design specifications.
- Installed and configured Cloudera Manager for easy management of existing Hadoop cluster
- Configured various property files like core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, and hadoop-env.sh based on job requirements.
- Used Sqoop to transfer data between RDBMS and HDFS.
- Involved in ETL architecture enhancements to increase performance using the query optimizer.
- Worked with business functional lead to review and finalize requirements and data profiling analysis.
- Implemented complex Spark programs to perform joins across different tables (a join sketch follows this role).
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
- Created ETL metadata reports using SSRS; reports included execution times for SSIS packages and failure reports with error descriptions.
- Responsible for creating tables based on business requirements
- Produced data visualizations and generated reports for clear results.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data in various formats such as text, XML, and JSON.
- Utilized Agile Scrum methodology to help manage and organize the project with the professor, along with regular code review sessions.
Environment: Hadoop HDFS, Apache Spark, Spark Core, Spark SQL, Scala, JDK 1.8, CDH 5, Sqoop, MySQL, CentOS Linux
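To illustrate the join work in this role, here is a minimal PySpark sketch over two hypothetical Hive tables (sales.orders and sales.customers, with an assumed customer_name column).

```python
# Minimal sketch of joining a fact table with a small dimension table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("order_joins").enableHiveSupport().getOrCreate()

orders    = spark.table("sales.orders")
customers = spark.table("sales.customers")   # small dimension table (assumed)

# Inner join on the key, broadcasting the smaller side to avoid a shuffle.
enriched = (
    orders.join(broadcast(customers), on="customer_id", how="inner")
          .select("order_id", "customer_id", "customer_name", "amount")
)

# Left anti join: orders whose customer is missing from the dimension table.
orphans = orders.join(customers, on="customer_id", how="left_anti")

enriched.show(5)
orphans.show(5)
```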
Confidential
SQL Developer
Responsibilities:
- Involved in database design to create new databases that maintained referential integrity.
- Generated database SQL scripts and deployed databases including installation and configuration.
- Created indexed views and appropriate indexes to reduce the running time of complex queries (an indexed-view sketch follows this role).
- Participated in designing a data warehouse to store all information from OLTP to Staging and Staging to Enterprise Data warehouse to do better analysis.
- Used basic Script Task for renaming the file names, storing Row Count Values into User Variables.
- Actively participated in designing the SSIS Package Model for initial load and incremental load for extracting, transforming, and loading data from and to various RDBMS and Non-RDBMS sources and destinations.
- Analyzed data from different sources using a Hadoop-based Big Data solution, implementing Azure Data Factory, Azure Data Lake, Azure Data Lake Analytics, HDInsight, Hive, and Sqoop.
- Migrated on-premises data (SQL Server) to Azure Data Lake Store (ADLS) using Azure Data Factory.
- Documentation of all the processes involved in maintaining the database for future reference.
- Performed unit tests and was involved in deploying database objects to test/production environments.
- Involved in creating ER diagrams, mapping the data into database objects, and designing the database and tables.
- Created dashboards using data blending across different databases and tables to meet business requirements.
- Experience in creating Indexes for faster performance and views for controlling user access to data.
- Performed unit testing, provided bug fixes and deployment support.
Environment: SQL Server 2014, Visual Studio 2014, TFS, SSRS, SSIS, Waterfall methodology
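The indexed-view bullet in this role can be sketched as below. The connection string, dbo.Orders table, and column names are assumptions, and Amount is assumed NOT NULL, which SQL Server requires for SUM inside an indexed view.

```python
# Minimal pyodbc sketch: create a schema-bound view plus a unique clustered
# index on it, i.e. an indexed view that pre-aggregates order totals.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sqlhost;DATABASE=Sales;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# Indexed views require SCHEMABINDING, two-part object names, and COUNT_BIG(*)
# when the view contains GROUP BY; Amount is assumed NOT NULL.
cur.execute("""
    CREATE VIEW dbo.vw_DailySales WITH SCHEMABINDING AS
    SELECT OrderDate,
           SUM(Amount)  AS TotalAmount,
           COUNT_BIG(*) AS RowCnt
    FROM dbo.Orders
    GROUP BY OrderDate
""")

# The unique clustered index is what materializes the view.
cur.execute("""
    CREATE UNIQUE CLUSTERED INDEX IX_vw_DailySales
    ON dbo.vw_DailySales (OrderDate)
""")

conn.close()
```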