Sr. AWS Big Data Engineer Resume
Foster City, CA
SUMMARY
- 6 years of total IT experience as a Sr. Big Data Engineer, with expertise in Big Data services across Healthcare, Marketing, Finance, and Retail.
- Expertise in the Big Data/Hadoop ecosystem, including Apache Hive, Spark, MapReduce, Apache Kafka, Sqoop, ZooKeeper, HDFS, and YARN.
- Experience working with Python libraries such as NumPy, SciPy, Requests, ReportLab, PyTables, cv2, httplib, urllib, Beautiful Soup, and pandas throughout the development life cycle.
- Processed semi-structured data (CSV, XML, and JSON) in Hive/Spark using Python.
- Hands-on knowledge of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.
- Experience configuring Spark for batch and real-time data processing using HDFS and the in-memory Spark DataFrame API.
- Experience with different Hadoop distributions such as Cloudera CDH, Amazon Elastic MapReduce, and Hortonworks Data Platform.
- Extensive use of the Spark DataFrame API on the Cloudera platform, performing analytics via Hive Query Language and DataFrame operations for data manipulation.
- Able to query data on HDFS using HiveQL for ad-hoc extraction, analysis, and debugging, and wrote Hive User Defined Functions (UDFs) as required.
- Wrote complex HiveQL queries to extract data from Hive tables streamed to HDFS; expertise in using Spark SQL to parse data from sources such as JSON, Parquet, and Hive tables on HDFS.
- Skilled in applying partitioning and bucketing techniques on managed and external tables to optimize performance.
- Strong command of Hadoop and YARN architecture, along with Hadoop daemons such as JobTracker, TaskTracker, NameNode, DataNode, and the cluster manager.
- Extensive experience in working with AWS cloud platform (EC2, S3, EMR, Redshift, Lambda and Glue).
- Extensive experience working with AWS Cloud services and AWS SDKs for services such as AWS API Gateway, Lambda, S3, IAM, and EC2.
- Good understanding of NoSQL databases such as MongoDB, HBase, DynamoDB, and Apache Cassandra.
- Worked with MongoDB, Cassandra, and DynamoDB from Python by installing and configuring the required open-source packages.
- Experienced with version control systems such as Git, GitHub, CVS, and SVN to keep code versions and configurations organized.
- Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, Azure Data Catalog, HDInsight, Azure SQL Server, Azure ML and Power BI.
- Worked with Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlled and granted database access; and migrated on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Loaded data from web servers and Teradata using Sqoop, Flume, and the Spark Streaming API; extracted tables and exported data from Teradata through Sqoop into Cassandra.
- Loaded data from the Informatica server into HDFS on the EMR service using Sqoop.
- Proficient in working with Azure cloud platform (DataLake, DataBricks, HDInsight, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH and Data Storage Explorer).
- Knowledgeable about job workflow scheduling and coordination tools/services such as Oozie, ZooKeeper, Airflow, and Apache NiFi.
- Performed univariate, bivariate, and multivariate statistical analysis, hypothesis testing, and exploratory data analysis using pandas, NumPy, Matplotlib, and seaborn, collaborating with data scientists.
- Used Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory for authentication and Apache Ranger for authorization.
- Efficient in writing Python scripts to build ETL pipelines and Directed Acyclic Graph (DAG) workflows using Airflow and Apache NiFi; distributed tasks across Celery workers to manage communication between multiple services (a minimal Airflow sketch follows this summary).
- Experience in writing SQL queries, Stored Procedures, functions, packages, tables, views, triggers on relational databases like Oracle, DB2, MySQL, PostgreSQL, and MS SQL Server.
- Hands-on experience with visualization tools such as Power BI (Microsoft Certified, DA-100).
- Proficient in managing the entire data science project life cycle and actively involved in all phases, including data acquisition, data cleaning, feature scaling, feature engineering, and machine learning modeling (regression models, decision trees, Naive Bayes, neural networks, random forest, gradient boosting, SVM, KNN, clustering).
- Experience in Applied Statistics, Exploratory Data Analysis and Visualization using matplotlib, Tableau, Power BI, Google Analytics.
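As a concrete illustration of the Airflow-based ETL work mentioned above, below is a minimal sketch of a Python DAG wiring extract, transform, and load tasks together (assuming Airflow 2.x). The DAG id, schedule, and task callables are hypothetical placeholders, not an actual production pipeline.

```python
# Minimal sketch: a small Airflow DAG wiring extract -> transform -> load tasks.
# DAG id, schedule, and task bodies are hypothetical placeholders.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    # Placeholder: pull source records (e.g. from an API or database).
    return [{"id": 1, "value": 10}]

def transform(**context):
    # Placeholder: reshape/clean the extracted records pulled from XCom.
    rows = context["ti"].xcom_pull(task_ids="extract")
    return [{**r, "value": r["value"] * 2} for r in rows]

def load(**context):
    # Placeholder: write transformed rows to the warehouse.
    rows = context["ti"].xcom_pull(task_ids="transform")
    print(f"loading {len(rows)} rows")

default_args = {"retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="example_etl",              # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3
```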
TECHNICAL SKILLS
Hadoop/Big Data Technologies: Hadoop, MapReduce, Sqoop, Hive, Oozie, Spark, ZooKeeper, Cloudera Manager, Kafka, Flume, Airflow
ETL Tools: Informatica, Teradata
NoSQL Databases: HBase, Cassandra, DynamoDB, MongoDB
Monitoring and Reporting: Power BI (Microsoft Certified)
Hadoop Distributions: Hortonworks, Cloudera
Build Tools: Maven
Programming & Scripting: Python, Scala, Java, SQL, Shell Scripting, C, C++
Databases: PostgreSQL, MySQL, Teradata, Oracle
Operating Systems: Linux, Unix, Mac OS-X, CentOS, Windows 10, Windows 8, Windows 7
Cloud Technologies: AWS, Snowflake, Azure (Azure Data Lake, Azure Data Factory, Azure Databricks, Azure SQL Database, Azure SQL Data Warehouse)
AWS Services: Amazon EC2, Amazon S3, Amazon SimpleDB, Amazon MQ, Amazon ECS, AWS Lambda, Amazon SageMaker, Amazon RDS, Elastic Load Balancing, Elasticsearch, Amazon SQS, AWS Identity and Access Management (IAM), Amazon CloudWatch, Amazon EBS, AWS CloudFormation
Version Control: Git, GitHub
Database Modelling: ER modelling, dimensional modelling, star schema modelling, snowflake schema modelling
Machine Learning: Regression (Linear and Logistic), Decision trees, Random Forest, SVM, KNN, PCA.
PROFESSIONAL EXPERIENCE
Confidential, Foster City, CA
Sr. AWS Big Data Engineer
Responsibilities:
- Developed Spark applications using Python and implemented an Apache Spark data processing project to process data from different relational databases and streaming sources.
- Created AWS data pipelines using various AWS resources, including API Gateway to receive responses from AWS Lambda and Lambda functions to retrieve data from Snowflake and convert the responses into JSON, with Snowflake, DynamoDB, AWS Lambda, and AWS S3 as the backing services.
- Optimized and increased the performance of algorithms in Hadoop using Spark.
- Performed transformations and actions on the fly using Spark Streaming APIs.
- Wrote real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system (see the streaming sketch at the end of this role).
- Set up continuous integration with Jenkins for MapReduce-style jobs written with PySpark and NumPy.
- Created data models for client transactional logs and analyzed the data in Cassandra.
- Used Cassandra Query Language (CQL) for quick searching, sorting, and grouping of table data.
- Aggregated web log data from multiple servers using Apache Kafka and made it available to downstream systems for data analysis and machine learning use cases.
- Analyzed the partitioned and bucketed data using HiveQL and executed Hive queries on Parquet tables
- Configured Kafka from scratch, including managers and brokers.
- Consolidated multiple data warehouses into a single data warehouse using Amazon Redshift.
- Used the Snowflake staging area to store incoming data.
- Extensive knowledge of the DataFrame, Dataset, and Data Source APIs, Spark SQL, and Spark Streaming.
- Consumed Extensible Markup Language (XML) messages from Kafka and processed the XML files using Spark Streaming to capture user interface (UI) updates.
- Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files (see the flattening sketch at the end of this role).
- Worked with Kerberos authentication principals for secure network communication; loaded data from RDBMS and external systems into HDFS and Hive using custom-developed Sqoop and Kafka jobs.
- Scheduled Hive scripts to create data pipelines using custom-developed Oozie coordinators.
- Implemented Kafka Security and boosted its performance
- Working knowledge of Parquet, RCFile, Avro, and JSON file formats, as well as UDF development in Hive.
- Worked extensively on custom UDF development using Python and used the UDFs for data preparation and sorting.
- Consumed data from Kafka topics using a custom-developed Kafka consumer API.
- Worked with the cassandra-stress tool to test cluster performance and improve reads/writes where needed.
- Loaded DStream data into Spark RDDs and performed in-memory computation to generate output responses.
- Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for data sets processing and storage.
- Developed automation shell scripts on Linux for processing manufacturing (spreadsheet) data from the Blob storage API, bringing turnaround time from 20 minutes down to under 10 seconds.
- Loaded data into S3 buckets using AWS Glue and PySpark. Involved in filtering data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables.
- Configured Snowpipe to pull data from S3 buckets into Snowflake tables.
- Extensive knowledge of Cassandra architecture, replication strategies, gossip, and snitches.
- Designed column families in Cassandra, ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra per business requirements.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, PySpark, Shell scripting, Linux, MySQL, Oracle Enterprise DB, Solr, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, SOAP, Cassandra and Agile methodologies.
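To illustrate the Kafka-to-Spark streaming work referenced in this role, here is a minimal PySpark Structured Streaming sketch. The broker address, topic name, message schema, and S3 paths are hypothetical placeholders, and the job assumes the spark-sql-kafka connector package is available on the cluster.

```python
# Minimal sketch: consuming a Kafka topic with Spark Structured Streaming (PySpark).
# Broker address, topic name, schema, and output paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-weblog-stream").getOrCreate()

# Expected shape of each JSON message on the topic (assumed schema).
schema = StructType([
    StructField("user_id", StringType()),
    StructField("url", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the Kafka topic as a streaming DataFrame.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
       .option("subscribe", "web-logs")                       # hypothetical topic
       .load())

# Kafka delivers the payload as bytes; cast to string and parse the JSON body.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*"))

# Write the parsed events to Parquet with checkpointing for fault tolerance.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/web-logs/")         # hypothetical path
         .option("checkpointLocation", "s3a://example-bucket/chk/")
         .start())
query.awaitTermination()
```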
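The JSON-flattening preprocessing job mentioned in this role could look roughly like the following Spark DataFrame sketch; the input path and the nested field names (line_items, member, and so on) are assumed purely for illustration.

```python
# Minimal sketch: flattening nested JSON documents with Spark DataFrames.
# File paths, column names, and nested structure are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("flatten-json").getOrCreate()

df = spark.read.json("s3a://example-bucket/raw/claims.json")  # hypothetical input

# Pull nested struct fields up to top-level columns and explode an array field
# so each array element becomes its own row.
flat = (df
        .withColumn("item", explode(col("line_items")))       # assumed array column
        .select(
            col("claim_id"),
            col("member.id").alias("member_id"),               # assumed nested struct
            col("member.state").alias("member_state"),
            col("item.code").alias("item_code"),
            col("item.amount").alias("item_amount"),
        ))

# Persist the flattened records as a CSV "flat file" extract.
flat.coalesce(1).write.mode("overwrite").csv(
    "s3a://example-bucket/flat/claims_csv/", header=True)
```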
Confidential | Atlanta, GA
Azure Big Data Engineer
Responsibilities:
- Developed Spark applications using PySpark and Spark SQL for data transformation and aggregation across multiple file formats, analyzing customer usage patterns for customer segmentation (see the sketch at the end of this role).
- Extracted, transformed, and loaded data from source systems using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics.
- Ingested data into multiple Azure services such as Azure Data Lake, Azure Storage, Azure SQL, and Azure SQL Data Warehouse.
- Developed modern data solutions with Azure PaaS services for data visualization, analyzing the current production state of the application and forecasting the effect of the latest implementation on current business processes.
- Tuned the performance of existing Spark applications by setting the right batch interval, tuning memory, and choosing the correct level of parallelism.
- Created ETL pipelines in ADF to move data across a variety of sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back flows in the reverse direction.
- Estimated cluster size and monitored and troubleshot the Spark Databricks cluster.
- Retrieved analytics data from different data feeds using REST APIs.
- Worked extensively on the Azure cloud platform, including HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer.
- Designed data pipelines using Data Lake, Databricks, and Apache Airflow to integrate data from both on-premises (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources, applied the required transformations, and loaded the results back to Azure Synapse.
- Mined data to provide real-time insights and reports using Spark Scala functions.
- Worked on receiving real-time data from Apache Flume, which involved configuring Spark Streaming and storing the stream data to Azure Table storage using Scala.
- Used Spark DataFrames to create various datasets and applied business transformations and data cleansing operations in Databricks notebooks.
- Improved query performance by transitioning log storage from Cassandra to Azure SQL Data Warehouse.
- Built data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL; used Spark, Hive, and Sqoop with custom-built input adapters to ingest and analyze data from Snowflake, MS SQL, and MongoDB into HDFS.
- Used Sqoop, Flume, and the Spark Streaming API to load data from web servers and Teradata; built workflows for daily incremental loads from source databases (MongoDB, MS SQL).
- Extensively used Kubernetes to handle the online and batch workloads required to feed analytics and machine learning applications.
- Utilized data for interactive Power BI dashboards and reporting purposes based on business requirements.
- Used Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory for authentication and Apache Ranger for authorization.
Environment: Azure HDInsight, Databricks (ADBX), Data Lake (ADLS), Cosmos DB, MySQL, Snowflake, MongoDB, Teradata, Ambari, Flume, VSTS, Tableau, Power BI, Azure DevOps, Ranger, Azure AD, Git, Blob Storage, Data Factory, Data Storage Explorer, Scala, Hadoop 2.x (HDFS, MapReduce, YARN), Spark v2.0.2, Airflow, Hive, Sqoop, HBase
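As a rough illustration of the usage-pattern aggregation described in this role, here is a minimal PySpark sketch of the kind of job that might run in a Databricks notebook; the ADLS paths, column names, and segment thresholds are hypothetical assumptions, not the actual production logic.

```python
# Minimal sketch: PySpark aggregation over customer usage events for segmentation.
# Storage paths, column names, and thresholds are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("usage-segmentation").getOrCreate()

# Usage events landed in the data lake by the ingestion pipeline (assumed path/schema).
events = spark.read.parquet("abfss://curated@examplelake.dfs.core.windows.net/usage/")

# Per-customer usage profile: event counts, total duration, and distinct features used.
profile = (events
           .groupBy("customer_id")
           .agg(F.count(F.lit(1)).alias("event_count"),
                F.sum("duration_sec").alias("total_duration_sec"),
                F.countDistinct("feature").alias("distinct_features")))

# Simple rule-based segment label derived from the profile (illustrative thresholds).
segmented = profile.withColumn(
    "segment",
    F.when(F.col("event_count") >= 1000, "power")
     .when(F.col("event_count") >= 100, "regular")
     .otherwise("light"))

# Write the segments back to the lake for downstream reporting (e.g. Power BI).
segmented.write.mode("overwrite").parquet(
    "abfss://curated@examplelake.dfs.core.windows.net/usage_segments/")
```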
Confidential
Data Engineering Analyst
Responsibilities:
- The Enterprise Insurance data warehouse is a conversion project that migrates existing data marts into one integrated location to gain the advantage of a corporate-wide data warehouse. It involves rewriting and developing existing data marts and adding new subject areas to them, giving business users a single platform for queries across subject areas using one OLAP tool (Cognos).
- Created the mapping design document for transferring data from source systems to the data warehouse and built an ETL pipeline that simplified analysts' work and reduced patients' treatment expenses by up to 40%.
- Developed Informatica mappings, sessions, worklets, and workflows.
- Wrote Shell scripts to monitor load on database and Perl scripts to format data extracted from data warehouse based on user requirements.
- Designed, developed, and delivered jobs and transformations over the data to enrich it and progressively promote it for consumption in the delta lake layers.
- Managed multiple small projects with a team of 5 members, planned and scheduled project milestones, and tracked project deliverables.
- Performed network traffic analysis using data mining, the Hadoop ecosystem (MapReduce, HDFS, Hive), and visualization tools over raw packet data, network flows, and Intrusion Detection System (IDS) output (see the sketch at the end of this role).
- Analyzed the company’s expenses on software tools and came up with a strategy to reduce those expenses by 30%.
Environment: Python, R, AWS EMR, Apache Spark, Hadoop ecosystem (MapReduce, HDFS, Hive), Scala, LogRhythm, OpenVAS, Informatica, Ubuntu.
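As an illustration of the Hadoop-based network traffic analysis mentioned in this role, below is a minimal PySpark sketch that summarizes flow records stored in a Hive table; the database, table, and column names are hypothetical placeholders.

```python
# Minimal sketch: summarizing network flow records with Spark SQL over a Hive table.
# Database, table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("netflow-summary")
         .enableHiveSupport()   # read the flow table registered in the Hive metastore
         .getOrCreate())

# Top talkers by bytes transferred over the previous day
# (assumed columns: src_ip, bytes, event_date).
top_talkers = spark.sql("""
    SELECT src_ip, SUM(bytes) AS total_bytes, COUNT(*) AS flow_count
    FROM security.netflow
    WHERE event_date = date_sub(current_date(), 1)
    GROUP BY src_ip
    ORDER BY total_bytes DESC
    LIMIT 20
""")

top_talkers.show(truncate=False)
```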
Confidential
ETL Developer
Responsibilities:
- Analysed, designed, and developed databases using ER diagrams, normalization, and relational database concepts.
- Involved in design, development, and testing of the system.
- Developed SQL Server stored procedures, tuned SQL queries (using indexes and execution plan).
- Developed user defined functions and created views.
- Created triggers to maintain the referential integrity.
- Implemented exception handling.
- Worked on client requirement and wrote complex SQL queries to generate crystal reports.
- Created and automated the regular jobs.
- Tuned and optimized SQL queries using execution plan and profiler.
- Developed the controller component with Servlets and action classes.
- Developed business components (model components) using Enterprise JavaBeans (EJB).
- Established schedule and resource requirements by planning, analyzing and documenting development effort to include timelines, risks, test requirements and performance targets.
- Analysed system requirements and prepared system design document.
- Developed dynamic user interface with HTML and JavaScript using JSP and Servlet technology.
- Used JMS elements for sending and receiving messages.
- Created and executed test plans using Quality Center / TestDirector.
- Mapped requirements to test cases in Quality Center.
- Supported system test and user acceptance test.
- Rebuilt indexes and tables as a part of performance tuning exercise.
- Involved in performing database backup and recovery.
- Worked on documentation using MS Word.
Environment: MS SQL Server, SSRS, SSIS, SSAS, DB2, HTML, XML, JSP, Servlet, JavaScript, EJB, JMS, MS Excel, MS Word.