Data Engineer / AWS Data Engineer Resume
Westlake, TX
SUMMARY
- Dynamic and motivated IT professional with 6+ years of experience as a Big Data Engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem, Big Data analytics, cloud, data warehouse/data mart, data visualization, data engineering, reporting, and data quality solutions.
- In-depth knowledge of Hadoop architecture and its components such as YARN, HDFS, NameNode, DataNode, JobTracker, ApplicationMaster, ResourceManager, and TaskTracker, as well as the MapReduce programming paradigm.
- Extensive experience in Hadoop-led development of enterprise-level solutions utilizing components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, ZooKeeper, and YARN.
- Profound experience in performing data ingestion and data processing (transformations, enrichment, and aggregations). Strong knowledge of distributed systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
- Expert in designing ETL data flows by creating mappings/workflows to extract data from SQL Server, and in data migration and transformation from Oracle/Access/Excel sheets using SQL Server SSIS.
- Expert in designing parallel jobs using stages such as Join, Merge, Lookup, Remove Duplicates, Filter, Dataset, Lookup File Set, Complex Flat File, Modify, Aggregator, and XML.
- Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR, and other services in the AWS family.
- Experienced in using Spark to improve the performance and optimization of existing Hadoop algorithms with SparkContext, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and pair RDDs; worked extensively with PySpark and Scala.
- Handled ingestion of data from different sources into HDFS using Sqoop and Flume, and performed transformations using Hive and MapReduce before loading the results back into HDFS. Managed Sqoop jobs with incremental loads to populate Hive external tables. Experienced in importing streaming data into HDFS using Flume sources and sinks, and transforming the data with Flume interceptors.
- Experience with the Oozie workflow scheduler to manage Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows.
- Experience with partitioning and bucketing concepts in Hive; designed both managed and external tables.
- Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie. Experienced with the most common Airflow operators - Python Operator, Bash Operator, Google Cloud Storage Download Operator, Google Cloud Storage Object Sensor, and GoogleCloudStorageToS3Operator (a minimal DAG sketch follows this list).
- Hands-on experience handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL. Created Java applications to handle data in MongoDB and HBase. Used Phoenix to create a SQL layer on HBase.
- Experience in designing and creating RDBMS tables, views, user-defined data types, indexes, stored procedures, cursors, triggers, and transactions.
- Experience managing Azure Data Lake Storage (ADLS) and Data Lake Analytics, and an understanding of how to integrate them with other Azure services.
- Experience with GCP compute services such as App Engine and Cloud Functions.
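A minimal sketch of the kind of Airflow DAG referenced above, using PythonOperator and BashOperator; the DAG id, schedule, callable, and script path are illustrative assumptions rather than details from an actual project.

```python
# Minimal Airflow DAG sketch; DAG id, schedule, and task details are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_partition(**context):
    # Placeholder callable: log the execution date for the partition being ingested.
    print(f"Ingesting partition for {context['ds']}")


with DAG(
    dag_id="daily_ingest_example",          # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=extract_partition)
    load_hive = BashOperator(
        task_id="load_hive",
        bash_command="hive -f /opt/etl/load_partition.hql",  # hypothetical script path
    )
    ingest >> load_hive
```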
TECHNICAL SKILLS
Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, GCP, HBase, Kafka, Oozie, Apache Spark, Zookeeper, NiFi, Amazon Web Services, Customer 360.
Machine Learning: Decision Tree, LDA, Linear and Logistic Regression, Random Forest, Clustering: K-NN, K-Means, Neural Networks, ANN & RNN, PCA, SVM, NLP, Deep learning.
Python Libraries: NLP, pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, Beautiful Soup, PySpark
Operating System: Linux (Centos, Ubuntu), Windows (XP/7/8/10)
Languages: Java, Shell scripting, Pig Latin, Scala, Python, R, C++
Databases: SQL Server, MySQL, Teradata, DB2, Oracle, Databricks
NoSQL: HBase, Cassandra and MongoDB
Hadoop Technologies and Distributions: Apache Hadoop, Cloudera CDH 5.13, MapR, PySpark
Visualization/Reporting: Power BI, Tableau, ggplot2, matplotlib
Versioning Tools: SVN, Git, GitHub
PROFESSIONAL EXPERIENCE
Confidential, Westlake, TX
Data Engineer/ AWS Data Engineer
Responsibilities:
- Designed and set up an enterprise data lake to support various use cases, including analytics, processing, storage, and reporting of voluminous, rapidly changing data.
- Responsible for maintaining quality reference data in source systems by performing operations such as cleaning and transformation, and for ensuring integrity in a relational environment, working closely with stakeholders and the solution architect.
- Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
- Set up Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Hive, Pig, and MapReduce access for new users.
- Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift, and S3.
- Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item, enabling automatic suggestions, using Kinesis Data Firehose and an S3 data lake.
- Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
- Used the Spark SQL interface for Scala and Python, which automatically converts case-class RDDs into schema RDDs.
- Imported data from different sources such as HDFS and HBase into Spark RDDs and performed computations using PySpark to generate the output response.
- Created Lambda functions with Boto3 to deregister unused AMIs across all application regions to reduce EC2 costs (a minimal sketch follows this list).
- Worked with AWS services such as SNS to send automated emails and messages using Boto3 after the nightly run.
- Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS) packages.
- Coded Teradata BTEQ scripts to load and transform data and to fix defects such as SCD Type 2 date chaining and duplicate cleanup.
- Developed a reusable framework, to be leveraged for future migrations, that automates ETL from RDBMS systems to the data lake utilizing Spark Data Sources and Hive data objects.
- Conducted data blending and data preparation using Alteryx and SQL for Tableau consumption, and published data sources to Tableau Server.
- Developed Kibana dashboards based on Logstash data and integrated different source and target systems into Elasticsearch for near-real-time log analysis and end-to-end transaction monitoring.
- Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
- Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with tasks running on Amazon SageMaker.
- Monitored containers on AWS EC2 machines using the Datadog API, and ingested and enriched data into the internal cache system.
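A minimal Boto3 sketch of the AMI-cleanup Lambda described in this list; the region list, ownership filter, and in-use check are assumptions for illustration, not the original implementation.

```python
# Minimal sketch of a Lambda handler that deregisters AMIs not referenced by any
# running instance; regions and retention logic are hypothetical.
import boto3

REGIONS = ["us-east-1", "us-west-2"]          # hypothetical application regions


def lambda_handler(event, context):
    deregistered = []
    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)
        # AMIs owned by this account in the current region.
        images = ec2.describe_images(Owners=["self"])["Images"]
        # AMIs currently referenced by EC2 instances in this region.
        in_use = {
            inst.get("ImageId")
            for page in ec2.get_paginator("describe_instances").paginate()
            for res in page["Reservations"]
            for inst in res["Instances"]
        }
        for image in images:
            if image["ImageId"] not in in_use:
                ec2.deregister_image(ImageId=image["ImageId"])
                deregistered.append(image["ImageId"])
    return {"deregistered": deregistered}
```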
Confidential, Watertown, PA
Azure Data Engineer
Responsibilities:
- Worked with Azure Data Factory to integrate both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) data and applied transformations before loading it into Azure Synapse.
- Managed, configured, and scheduled resources across the cluster using Azure Kubernetes Service.
- Monitored the Spark cluster using Log Analytics and the Ambari Web UI. Transitioned log storage from Cassandra to Azure SQL Data Warehouse and improved query performance.
- Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also worked with Cosmos DB (SQL API and Mongo API).
- Developed dashboards and visualizations to help business users analyze data and to provide data insights to upper management, with a focus on Microsoft products such as SQL Server Reporting Services (SSRS) and Power BI.
- Knowledge of U-SQL and how it can be used for data transformation as part of a cloud data integration strategy.
- Performed the migration of large data sets to Databricks (Spark): created and administered clusters, loaded data, configured data pipelines, and loaded data from ADLS Gen2 into Databricks using ADF pipelines.
- Created a linked service to land data from an SFTP location into Azure Data Lake.
- Created various pipelines to load data from Azure Data Lake into a staging SQL DB and then into Azure SQL DB.
- Created Databricks notebooks to streamline and curate data for various business use cases, and mounted Blob Storage on Databricks.
- Developed streaming pipelines using Apache Spark with Python.
- Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
- Utilized Azure Logic Apps to build workflows that schedule and automate batch jobs by integrating apps, ADF pipelines, and other services such as HTTP requests and email triggers.
- Worked extensively on Azure Data Factory, including data transformations, integration runtimes, Azure Key Vault, triggers, and migrating data factory pipelines to higher environments using ARM templates.
- Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL.
- Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to deliver streaming analytics in Databricks.
- Wrote UDFs in Scala and PySpark to meet specific business requirements (see the PySpark sketch after this list).
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
- Hands-on experience developing SQL scripts for automation purposes.
- Created builds and releases for multiple projects (modules) in the production environment using Visual Studio Team Services (VSTS).
- Designed and developed a new solution to process near-real-time (NRT) data using Azure Stream Analytics, Azure Event Hubs, and Service Bus queues.
- Worked with complex SQL views, stored procedures, triggers, and packages in large databases across various servers.
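A minimal PySpark sketch of the kind of UDF-based curation referenced in this list; the column name and masking rule are illustrative assumptions.

```python
# Minimal PySpark UDF sketch for curating data in a Databricks-style notebook.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("curation_udf_example").getOrCreate()


@udf(returnType=StringType())
def mask_email(email):
    # Keep the domain, mask the local part (illustrative business rule).
    if email is None or "@" not in email:
        return email
    local, domain = email.split("@", 1)
    return f"{local[0]}***@{domain}"


df = spark.createDataFrame([("jdoe@example.com",), ("a@b.io",)], ["email"])
df.withColumn("email_masked", mask_email("email")).show(truncate=False)
```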
Confidential, Atlanta, GA
Big Data Developer
Responsibilities:
- Understood business needs, analyzed functional specifications, and mapped them to the design and development of MapReduce programs and algorithms.
- Developed a data pipeline using Kafka and Storm to store data in HDFS.
- Customized Flume interceptors to encrypt and mask customer-sensitive data as per requirements.
- Built recommendations using item-based collaborative filtering in Apache Spark.
- Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Used Kibana, an open-source browser-based analytics and search dashboard for Elasticsearch.
- Developed iterative algorithms using Spark Streaming in Scala for near-real-time dashboards.
- Involved in customizing the MapReduce partitioner to route key-value pairs from mappers to reducers in XML format according to requirements.
- Configured Flume for efficiently collecting, aggregating, and moving large amounts of log data.
- Involved in creating Hive tables, loading data into them, and writing Hive queries to analyze the data.
- Experienced in migrating HiveQL to Impala to minimize query response time.
- Used Java 8 streams and lambda expressions to increase performance.
- Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs.
- Designed and built the reporting application, which uses Spark SQL to fetch data and generate reports on HBase table data.
- Experience building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinating tasks among the team.
- Worked on batch processing of data sources using Apache Spark and Elasticsearch.
- Extracted the needed data from the server into HDFS and bulk-loaded the cleaned data into HBase.
- Used different file formats such as text files, SequenceFiles, Avro, Record Columnar File (RCFile), and ORC.
- Designed and developed ETL packages using SQL Server Integration Services (SSIS) to load the data from SQL server, XML files to SQL Server database through custom C# script tasks.
- Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
- Designed and documented the error-handling strategy in the ETL load process. Prepared the complete ETL specification document for all the ETL flows.
- Used Cassandra CQL with Java APIs to retrieve data from Cassandra tables, and stored the analyzed results back into the Cassandra cluster.
- Used REST APIs with Python to ingest data from external sites into BigQuery.
- Used Google Cloud Functions with Python to load data into BigQuery on arrival of CSV files in a GCS bucket (a minimal sketch follows this list).
- Implemented CRUD operations involving lists, sets, and maps in DataStax Cassandra.
- Worked with ETL tools such as SSIS for data collection and updating longitudinal databases, and SPSS for data analysis and modeling.
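A minimal sketch of a GCS-triggered Cloud Function that loads an arriving CSV into BigQuery, as referenced in this list; the dataset and table names are placeholders and schema autodetection is an assumption.

```python
# Minimal Cloud Function sketch: load a newly arrived CSV from GCS into BigQuery.
from google.cloud import bigquery


def load_csv_to_bq(event, context):
    """Background Cloud Function triggered by a GCS object-finalize event."""
    bucket = event["bucket"]
    name = event["name"]
    if not name.endswith(".csv"):
        return

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,  # infer schema; an explicit schema is preferable in practice
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    uri = f"gs://{bucket}/{name}"
    load_job = client.load_table_from_uri(
        uri, "analytics_dataset.landing_table", job_config=job_config  # hypothetical target
    )
    load_job.result()  # wait for the load job to complete
```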
Confidential, Blue Ash, OH
Big Data Engineer
Responsibilities:
- Evaluated client needs and translated business requirements into functional specifications, thereby onboarding clients onto the Hadoop ecosystem.
- Installed application on AWS EC2 instances and configured the storage on S3 buckets.
- Stored data in AWS S3 (similar to HDFS) and ran EMR programs on the stored data.
- Used the AWS CLI to suspend an AWS Lambda function and to automate backups of ephemeral data stores to S3 buckets and EBS.
- Developed Java MapReduce programs for the analysis of sample log files stored in the cluster.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.
- Good understanding of NoSQL databases and hands-on experience writing applications against HBase, Cassandra, and MongoDB.
- Worked on designing MapReduce and YARN flows, writing MapReduce scripts, performance tuning, and debugging.
- Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds (see the sketch after this list).
- Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java for event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
- Worked with various HDFS file formats such as Parquet and JSON for serializing and deserializing data.
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
- Worked on setting up and configuring AWS EMR clusters and used Amazon IAM to grant users fine-grained access to AWS resources.
- Very good implementation experience with object-oriented concepts, multithreading, and Java/Scala.
- Experienced with Scala and Spark, improving the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, pair RDDs, and Spark on YARN.
- Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Experience using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
- Developed custom Kafka producers and consumers for publishing to and subscribing to different Kafka topics.
- Migrated MapReduce jobs to Spark jobs to achieve better performance.
- Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.
- Wrote MapReduce programs and Hive UDFs in Java.
- Extracted and updated the data into HDFS using Sqoop import and export.
- Developed Hive UDFs to incorporate external business logic into Hive scripts and developed dataset-join scripts using Hive join operations.
- Used IAM to detect and stop risky identity behaviors using rules, machine learning, and other statistical algorithms
- Responsible to manage data coming from different sources through Kafka.
- Worked with Spark to improve performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, PySpark, Impala, Tealeaf, pair RDDs, NiFi, DevOps, and Spark on YARN.
- Developed a Spark job in Java that indexes data into Elasticsearch from external Hive tables stored in HDFS.
- Implemented data quality checks in the ETL tool Talend and have good knowledge of data warehousing.
- Wrote and implemented custom UDFs in Pig for data filtering.
- Used the Spark DataFrame API in Scala for analyzing data.
- Good experience using relational databases: Oracle, MySQL, SQL Server, and PostgreSQL.
- Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.
- Developed end-to-end data processing pipelines that begin with receiving data via the distributed messaging system Kafka and persist it into Cassandra.
- Designed data models for data-intensive AWS Lambda applications aimed at complex analysis, creating analytical reports for end-to-end traceability, lineage, and definition of key business elements from Aurora.
- Worked on AWS Lambda functions in Python that invoke Python scripts to perform various transformations and analytics on large data sets in EMR clusters.
- Developed Apache Spark applications for data processing from various streaming sources.
- Responsible for developing a data pipeline using Spark, Scala, and Apache Kafka to ingest data from the CSL source and store it in a protected HDFS folder.
- Implemented a cluster for the NoSQL tool HBase as part of a POC to address HBase limitations.
- Implemented many Kafka ingestion jobs to consume data for real-time and batch processing.
- Responsible for developing a data pipeline with Amazon AWS to extract data from weblogs and store it in HDFS, and worked extensively with Sqoop to import metadata from Oracle.
- Strong knowledge of the architecture and components of Tealeaf, and efficient in working with Spark Core and Spark SQL. Designed and developed RDD seeds using Scala and Cascading. Streamed data to Spark Streaming using Kafka.
- Exposure to Spark, Spark Streaming, Spark MLlib, Snowflake, and Scala, and to creating DataFrames handled in Spark with Scala.
- Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
- Developed a NiFi workflow to pick up data from an SFTP server and send it to a Kafka broker.
- Used Hue for running Hive queries. Created day-wise partitions in Hive to improve performance.
- Developed Oozie workflows to run multiple Hive, Pig, Tealeaf, MongoDB, Git, Sqoop, and Spark jobs.
- Worked on auto scaling the instances to design cost effective, fault tolerant and highly reliable systems.
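A minimal kafka-python sketch of the producer/consumer pattern referenced in this list; the broker address, topic name, payload, and 10-second cadence shown here are placeholders for illustration.

```python
# Minimal Kafka producer/consumer sketch; broker, topic, and payload are hypothetical.
import json
import time

from kafka import KafkaConsumer, KafkaProducer

BROKERS = ["localhost:9092"]      # hypothetical broker list
TOPIC = "events"                  # hypothetical topic

producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a record roughly every 10 seconds, mirroring the scheduled producer above.
for i in range(3):
    producer.send(TOPIC, {"event_id": i, "ts": time.time()})
    producer.flush()
    time.sleep(10)

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,     # stop iterating after 5s of inactivity
)
for message in consumer:
    print(message.value)
```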
Environment: Hadoop (HDFS, MapReduce), Scala, YARN, IAM, PostgreSQL, Spark, Impala, MongoDB, Java, Pig, DevOps, HBase, Oozie, Hue, Sqoop, Flume, Oracle, NiFi, Git, AWS services (Lambda, EMR, Auto Scaling).