Sr. AWS Big Data Engineer Resume
DE
PROFESSIONAL SUMMARY:
- 7+ years of overall experience as a Big Data Engineer, Data Analyst, and ETL Developer, comprising the design, development, and implementation of data models for enterprise-level applications.
- Good knowledge of technologies for systems that process massive amounts of data in highly distributed mode on the Cloudera and Hortonworks Hadoop distributions and Amazon AWS.
- Hands-on experience using Hadoop ecosystem components such as Hadoop, Hive, Pig, Sqoop, HBase, Cassandra, Spark, Spark Streaming, Spark SQL, Oozie, ZooKeeper, Kafka, Flume, the MapReduce framework, YARN, Scala, and Hue.
- Extensive experience working with and integrating NoSQL databases such as DynamoDB, Cosmos DB, MongoDB, Cassandra, and HBase.
- Good knowledge of Spark architecture and components; efficient in working with Spark Core, Spark SQL, and Spark Streaming, with expertise in building PySpark and Spark/Scala applications for interactive analysis, batch processing, and stream processing.
- Experience configuring Spark Streaming to receive real-time data from Apache Kafka and persist the stream to HDFS (a minimal sketch appears after this summary), and expertise in using Spark SQL with data sources such as JSON, Parquet, and Hive.
- Extensively used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data and to run the required validations on the data.
- Proficient in Python scripting; worked with statistical functions in NumPy, visualization with Matplotlib, and Pandas for organizing data.
- Involved in loading structured and semi-structured data into Spark clusters using the Spark SQL and DataFrame APIs.
- Wrote complex HiveQL queries to extract the required data from Hive tables and developed Hive user-defined functions (UDFs) as needed.
- Excellent knowledge of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance (see the partitioned-table sketch after this summary).
- Proficient in converting Hive/SQL queries into Spark transformations using DataFrames and Datasets.
- Good understanding and knowledge of NoSQL databases like MongoDB, HBase and Cassandra.
- Worked on HBase to load and retrieve data for real-time processing using the REST API.
- Knowledge of job workflow scheduling and coordination tools/services such as Oozie, ZooKeeper, Airflow, and Apache NiFi.
- Experienced in designing time-driven and data-driven automated workflows using Oozie.
- Strong knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions and data warehouse tools for reporting and data analysis.
- Experience configuring ZooKeeper to coordinate servers in a cluster and maintain the data consistency that downstream decision-making depends on.
- Experienced in working with Amazon Web Services (AWS) using EC2 for computing and S3 as storage.
- Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on Amazon Web Services (AWS).
- Strong knowledge in working with Amazon EC2 to provide a complete solution for computing, query processing, and storage across a wide range of applications.
- Capable of using Amazon S3 with data transfer over SSL, where data is encrypted automatically once it is uploaded.
- Skilled in using Amazon Redshift to perform large scale database migrations.
- Ingested data into Snowflake cloud data warehouse using Snowpipe.
- Extensive experience with micro-batching to ingest millions of files into Snowflake as they arrive in the staging area.
- Developed Impala scripts for extracting, transforming, and loading data into the data warehouse.
- Extensive knowledge of working with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL DB, DWH, and Data Storage Explorer).
- Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL, and implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
- Skilled in using Kerberos, Azure AD, Sentry, and Ranger for maintaining authentication and authorization.
- Hands-on experience with visualization tools such as Tableau and Power BI.
- Experience importing and exporting data using Sqoop between HDFS and relational database systems.
- Designed UNIX shell scripts for automating deployments and other routine tasks.
- Extensive experience in development of Bash scripting, T-SQL, and PL/SQL scripts.
- Proficient in relational databases like Oracle, MySQL and SQL Server.
- Knowledge of integrated development environments such as Eclipse, NetBeans, IntelliJ, and STS.
- Capable of working within the SDLC using both Agile and Waterfall methodologies.
- Proficient in understanding and applying SAFe Agile practices, including TDD, BDD, continuous integration, pairing, iterative development, and retrospectives.
- Excellent communication, interpersonal, and problem-solving skills; a team player with the ability to quickly adapt to new environments and technologies.
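The following is a minimal sketch of the Kafka-to-HDFS streaming pattern referenced in this summary, written with the PySpark Structured Streaming API; the broker address, topic name, and output paths are illustrative placeholders rather than details from any specific engagement.

```python
# Minimal PySpark Structured Streaming sketch: Kafka -> HDFS (Parquet).
# Requires the Spark Kafka connector package on the classpath.
# Broker, topic, and path names below are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-to-hdfs")
         .getOrCreate())

# Read the raw Kafka stream; key/value arrive as binary and are cast to strings.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
          .option("subscribe", "events")                        # placeholder topic
          .option("startingOffsets", "latest")
          .load()
          .select(col("key").cast("string"), col("value").cast("string")))

# Persist the stream to HDFS as Parquet, with checkpointing for fault tolerance.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")                 # placeholder path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```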
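Below is a brief sketch of the partitioned external Hive table design mentioned above, issued through Spark SQL; the table name, columns, and paths are hypothetical examples.

```python
# Sketch: partitioned external Hive table plus a dynamic-partition load.
# Table, column, and path names are hypothetical examples.
# (Hive bucketing via CLUSTERED BY is omitted here because Spark's writer
# handles Hive-compatible bucketing differently from native Hive.)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# External table partitioned by load date.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_ext (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///warehouse/sales_ext'
""")

# Dynamic-partition insert from a staging table.
spark.sql("""
    INSERT OVERWRITE TABLE sales_ext PARTITION (load_date)
    SELECT order_id, customer_id, amount, load_date
    FROM sales_staging
""")
```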
TECHNICAL SKILLS:
Hadoop/Big Data Technologies: Hadoop, MapReduce, Sqoop, Hive, Oozie, Spark, ZooKeeper, Cloudera Manager, Kafka, Flume
ETL Tools: Informatica
NoSQL Databases: HBase, Cassandra, DynamoDB, MongoDB
Monitoring and Reporting: Tableau, custom shell scripts
Hadoop Distributions: Hortonworks, Cloudera
Build Tools: Maven
Programming & Scripting: Python, Scala, Java, SQL, Shell Scripting, C, C++
Databases: Oracle, MySQL, Teradata
Version Control: Git
Cloud Computing: AWS, Azure
Operating Systems: Linux, Unix, Mac OS X, CentOS, Windows 10/8/7
PROFESSIONAL EXPERIENCE:
Confidential, DE
Sr AWS Big Data Engineer
Responsibilities:
- Extensive experience working with the AWS cloud platform (EC2, S3, EMR, Redshift, Lambda, and Glue).
- Working knowledge of Spark RDDs, the DataFrame API, the Dataset API, the Data Source API, Spark SQL, and Spark Streaming.
- Developed Spark applications in Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop.
- Used SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions for building a common learner data model that receives data from Kafka in real time and persists it to Cassandra.
- Developed a Kafka consumer API in Python for consuming data from Kafka topics (see the consumer sketch following this list).
- Consumed Extensible Markup Language (XML) messages via Kafka and processed them using Spark Streaming to capture user interface (UI) updates.
- Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files (a flattening sketch follows this list).
- Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
- Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for data sets processing and storage.
- Experienced in maintaining the Hadoop cluster on AWS EMR.
- Loaded data into S3 buckets using AWS Glue and PySpark. Involved in filtering data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables.
- Configured Snowpipe to pull data from S3 buckets into Snowflake tables.
- Stored incoming data in the Snowflake staging area.
- Created numerous ODI interfaces to load data into Snowflake, and worked on Amazon Redshift to consolidate multiple data warehouses into one.
- Good understanding of Cassandra architecture, replication strategies, gossip, snitches, etc.
- Designed column families in Cassandra, ingested data from RDBMS sources, performed data transformations, and exported the transformed data to Cassandra as per business requirements.
- Used the Spark-Cassandra Connector to load data to and from Cassandra.
- Configured Kafka from scratch, including managers and brokers.
- Experienced in creating data models for clients' transactional logs and analyzing data from Cassandra tables for quick searching, sorting, and grouping using the Cassandra Query Language.
- Tested cluster performance using the cassandra-stress tool to measure and improve read/write throughput.
- Used HiveQL to analyze partitioned and bucketed data and executed Hive queries on Parquet tables stored in Hive to perform data analysis that meets the business specification logic.
- Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for data analysis and engineering.
- Worked on implementing Kafka security and improving its performance.
- Experience using Avro, Parquet, RCFile, and JSON file formats; developed UDFs in Hive.
- Developed custom UDFs in Python and used them for sorting and preparing the data.
- Worked on custom loaders and storage classes in Pig to handle data formats such as JSON, XML, and CSV, and generated bags for processing in Pig.
- Developed Sqoop and Kafka jobs to load data from RDBMS and external systems into HDFS and Hive.
- Developed Oozie coordinators to schedule Hive scripts and create data pipelines.
- Wrote several MapReduce jobs using PySpark and NumPy, and used Jenkins for continuous integration.
- Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Hive, Pig, and MapReduce access for new users.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
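A small sketch of the Python Kafka consumer pattern noted in the list above, assuming the kafka-python client; the topic, broker, and consumer group names are hypothetical.

```python
# Sketch of a Python Kafka consumer using the kafka-python client.
# Topic, broker, and group names are hypothetical; processing is reduced to a print.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-events",                              # hypothetical topic name
    bootstrap_servers=["broker1:9092"],         # hypothetical broker
    group_id="etl-consumers",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Poll messages and hand each record to downstream processing.
for message in consumer:
    record = message.value
    print(message.topic, message.partition, message.offset, record)
```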
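And a sketch of the JSON-flattening preprocessing job described above, using Spark DataFrames; the input path and the nested field names are assumptions made for illustration.

```python
# Sketch of a JSON-flattening preprocessing job with Spark DataFrames.
# The input path and nested field names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("flatten-json").getOrCreate()

# Read newline-delimited JSON documents.
raw = spark.read.json("hdfs:///landing/learner/*.json")      # placeholder path

# Flatten nested structs and arrays into a flat, columnar layout.
flat = (raw
        .withColumn("course", explode(col("enrollments")))    # hypothetical array field
        .select(
            col("id").alias("learner_id"),
            col("profile.name").alias("name"),                 # hypothetical struct field
            col("course.course_id"),
            col("course.score"),
        ))

# Write the flattened records out as a delimited flat file.
flat.write.mode("overwrite").option("header", True).csv("hdfs:///curated/learner_flat")
```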
Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, PySpark, shell scripting, Linux, MySQL, Oracle Enterprise DB, Solr, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, SOAP, Cassandra, and Agile methodologies.
Confidential, FL
Azure Big Data Engineer
Responsibilities:
- Proficient in working with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL DB, DWH, and Data Storage Explorer).
- Designed and deployed data pipelines using Data Lake, Databricks, and Apache Airflow, enabling other teams to work with more complex scenarios and machine learning solutions.
- Worked on Azure Data Factory to integrate data from both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations before loading into Azure Synapse.
- Involved in developing Spark Scala functions for mining data to provide real-time insights and reports.
- Configured Spark Streaming to receive real-time data from Apache Flume and store the stream data to Azure Table storage using Scala.
- Used Data Lake to store data and perform all types of processing and analytics.
- Ingested data into Azure Blob Storage and processed it using Databricks. Involved in writing Spark Scala scripts and UDFs to perform transformations on large datasets.
- Utilized the Spark Streaming API to stream data from various sources. Optimized existing Scala code and improved cluster performance.
- Used Spark DataFrames to create various datasets and applied business transformations and data cleansing operations using Databricks notebooks.
- Efficient in writing Python scripts to build ETL pipelines and directed acyclic graph (DAG) workflows using Airflow and Apache NiFi (see the DAG sketch after this list).
- Distributed tasks across Celery workers to manage communication between multiple services.
- Monitored the Spark cluster using Log Analytics and the Ambari Web UI. Transitioned log storage from Cassandra to Azure SQL Data Warehouse and improved query performance.
- Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also worked with Cosmos DB (SQL API and Mongo API).
- Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest data (Snowflake, MS SQL, MongoDB) into HDFS for analysis.
- Loaded data from Web servers and Teradata using Sqoop, Flume and Spark Streaming API.
- Used Flume sink to write directly to indexers deployed on cluster, allowing indexing during ingestion.
- Migrated from Oozie to Apache Airflow. Involved in developing Oozie and Airflow workflows for daily incremental loads, pulling data from source databases (MongoDB, MS SQL).
- Managed resources and scheduling across the cluster using Azure Kubernetes Service (AKS), which was used to create, configure, and manage a cluster of virtual machines.
- Extensively used Kubernetes to handle the online and batch workloads that feed analytics and machine learning applications.
- Used Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory for authentication and Apache Ranger for authorization.
- Experience tuning Spark applications (batch interval time, level of parallelism, memory settings) to improve processing time and efficiency.
- Used Scala for its concurrency support, which plays a key role in parallelizing the processing of large datasets.
- Developed MapReduce jobs in Scala, which compiles to JVM bytecode, for data processing.
- Proficient in utilizing data for interactive Power BI dashboards and reporting purposes based on business requirements.
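A minimal Airflow DAG sketch for the daily incremental-load workflows described above; the DAG id, task callables, and schedule are illustrative assumptions rather than the actual pipeline.

```python
# Minimal Airflow (2.x) DAG sketch for a daily incremental load.
# DAG id, schedule, and task bodies are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_incremental(**context):
    # Placeholder: pull the latest delta from the source system for the run date.
    print("extracting incremental data for", context["ds"])


def load_to_warehouse(**context):
    # Placeholder: write the transformed delta to the target warehouse.
    print("loading data for", context["ds"])


with DAG(
    dag_id="daily_incremental_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_incremental)
    load = PythonOperator(task_id="load", python_callable=load_to_warehouse)

    # Run the load only after the extract succeeds.
    extract >> load
```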
Environment: Azure HDInsight, Databricks (ADBX), Data Lake (ADLS), Cosmos DB, MySQL, Snowflake, MongoDB, Teradata, Ambari, Flume, VSTS, Tableau, Power BI, Azure DevOps, Ranger, Azure AD, Git, Blob Storage, Data Factory, Data Storage Explorer, Scala, Hadoop 2.x (HDFS, MapReduce, YARN), Spark v2.0.2, Airflow, Hive, Sqoop, HBase.
Confidential
Data Analyst
Responsibilities:
- Experience working on projects involving machine learning, big data, data visualization, R and Python development, Unix, and SQL.
- Performed exploratory data analysis using NumPy, matplotlib and pandas.
- Expertise in quantitative analysis, data mining, and the presentation of data to see beyond the numbers and understand trends and insights.
- Experience analyzing data with the help of Python libraries including Pandas, NumPy, SciPy and Matplotlib.
- Configured AWS Identity and Access Management (IAM) Groups and Users for improved login authentication.
- Conducted systems design, feasibility, and cost studies and recommended cost-effective cloud solutions such as Amazon Web Services (AWS).
- Created complex SQL queries and scripts to extract and aggregate data and validate its accuracy; gathered business requirements and translated them into clear, concise specifications and queries.
- Prepared high-level analysis reports with Excel and Tableau and provided feedback on data quality, including identification of billing patterns and outliers.
- Experience working with maps, density maps, tree maps, heat maps, Pareto charts, bubble charts, bullet charts, pie charts, bar charts, and line charts.
- Worked with Tableau sorting and filtering, including basic sorting, basic filters, quick filters, context filters, condition filters, top filters, and filter operations.
- Identified and documented data quality limitations that jeopardize the work of internal and external data analysts; wrote standard SQL queries to perform data validation, created Excel summary reports (pivot tables and charts), and gathered analytical data to develop functional requirements using data modeling and ETL tools.
- Read data from different sources such as CSV files, Excel, HTML pages, and SQL, performed data analysis, and wrote the results back to CSV, Excel, or a database.
- Experience using lambda functions with filter(), map(), and reduce() on pandas DataFrames to perform various operations (see the pandas sketch after this list).
- Used the pandas API for analyzing time series and created a regression test framework for new code.
- Developed and handled business logic through backend Python code.
- Created templates for page rendering and Django views for the business logic.
- Worked on the Django REST Framework and integrated new and existing API endpoints.
- Utilized PyUnit for unit testing of the application.
- Performed data analysis using Google APIs and created visualizations such as pie charts and waterfall charts, displayed in the web application.
- Extensive knowledge of loading data into charts using Python code.
- Using Highcharts, passed data and created interactive JavaScript charts for the web application.
- Extensive knowledge of Python libraries such as os, pickle, NumPy, and SciPy.
- Used Bitbucket for version control and coordination with the team.
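A short sketch of the lambda-based filter()/map()/reduce() usage with pandas mentioned above; the input file and column names are hypothetical.

```python
# Sketch of lambda-style filter/map/reduce operations on a pandas DataFrame.
# The CSV path and column names are hypothetical.
from functools import reduce

import pandas as pd

df = pd.read_csv("billing.csv")          # hypothetical input file

# filter: keep only rows above a billing threshold.
high_value = df[df["amount"].map(lambda x: x > 100)]

# map: derive a normalized category column with a lambda.
high_value = high_value.assign(
    category=high_value["plan"].map(lambda p: p.strip().lower())
)

# reduce: fold the amounts into a single running total.
total = reduce(lambda acc, amount: acc + amount, high_value["amount"], 0.0)

print(high_value.head())
print("total high-value billing:", total)
```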
Environment: Python, PyQuery, HTML5, CSS3, Apache Spark, Django, SQL, UNIX, Linux, Windows, Oracle, NoSQL, PostgreSQL, Python libraries (PySpark, NumPy), AWS, and Bitbucket.
Confidential
ETL Developer
Responsibilities:
- Extensively used the Informatica PowerCenter client tools: Designer, Workflow Manager, Workflow Monitor, and Repository Manager.
- Extracted data from various heterogeneous sources such as Oracle and flat files.
- Developed complex mappings using the Informatica PowerCenter tool.
- Extracted data from Oracle, flat files, and Excel files and applied Joiner, Expression, Aggregator, Lookup, Stored Procedure, Filter, Router, and Update Strategy transformations to load data into the target systems.
- Created Sessions, Tasks, Workflows and Worklets using Workflow manager.
- Worked with the data modeler in developing star schemas.
- Developed workflow dependencies in Informatica using Event Wait and Command tasks.
- Involved in analyzing the existence of source feeds in the existing CSDR database.
- Handled a high volume of day-to-day Informatica workflow migrations.
- Reviewed Informatica ETL design documents and worked closely with development to ensure correct standards were followed.
- Created new repositories from scratch and performed backup and restore.
- Experience working with groups, roles, and privileges and assigning them to each user group.
- Knowledge of code change migration from Dev to QA and from QA to Production.
- Worked on SQL queries against the repository DB to find deviations from the company's ETL standards for objects created by users, such as sources, targets, transformations, log files, mappings, sessions, and workflows.
- Used pre-session and post-session tasks to send e-mail to various business users through Workflow Manager.
- Leveraged existing PL/SQL scripts for daily ETL operations.
- Ensured that all support requests were properly approved, documented, and communicated using the MQC tool; documented common issues and resolution procedures.
- Extensively involved in enhancing and managing Unix shell scripts.
- Involved in converting business requirements into technical design documents.
- Documented the macro logic and worked closely with the Business Analyst to prepare the BRD.
- Involved in requirement gathering for procuring new source feeds.
- Involved in setting up SFTP with internal bank management.
- Built UNIX scripts for cleaning up the source files.
- Involved in loading all sample source data using SQL*Loader and scripts.
- Created Informatica workflows to load the source data into CSDR.
- Involved in creating various UNIX scripts used during the ETL load process.
- Periodically cleaned up Informatica repositories.
- Monitored the daily load and handed over the stats to the QA team.
Environment: Informatica, Load Runner 8.x, HP QC 10/11, Toad, SQL, PL/SQL.
Confidential
Java/J2EE Developer
Responsibilities:
- Involved in Java, web services and Hibernate in a fast-paced development environment.
- Followed Agile methodology, interacted directly with the client on features, implemented optimal solutions, and tailored the application to customer needs.
- Involved in design and implementation of web tier using Servlets and JSP.
- Experience using Apache POI for reading Excel files.
- Developed the user interface using JSP and JavaScript to view all online trading transactions.
- Designed and developed Data Access Objects (DAO) to access the database.
- Extensively used the DAO factory and value object design patterns to organize and integrate the Java objects.
- Coded JavaServer Pages for the dynamic front-end content that uses Servlets and EJBs.
- Worked on HTML pages using CSS for static content generation with JavaScript for validations.
- Involved in using JDBC API to connect to the database and carry out database operations.
- Used JSP and JSTL Tag Libraries for developing User Interface components.
- Good knowledge in Performing Code Reviews.
- Knowledge of Spring Batch, which provides functions for processing large volumes of records, including job processing statistics, job restart, skip, and resource management.
- Designed and developed a web-based application using HTML5, CSS, JavaScript, AJAX, and the JSP framework.
- Performed unit testing, system testing and integration testing.
Environment: Java, SQL, Hibernate, Eclipse, Apache POI, CSS, JDK 5.0, J2EE, Servlets, JSP, Spring, HTML, JavaScript (Prototype), XML, JSTL, XPath, jQuery.