Sr. Data Engineer/AWS Developer Resume
Columbia, SC
SUMMARY
- Over 7 years of software experience in Big Data, Data Analytics, and cloud migration.
- Experienced in cloud environments such as Amazon Web Services (AWS), Microsoft Azure, and GCP.
- Hands-on experience with Microsoft Azure components such as Databricks, Data Lake, Storage Explorer, Stream Analytics, Data Factory, Azure SQL Database, and Cosmos DB.
- Highly experienced in extracting, transforming, and loading data from source systems into Azure data storage services using a combination of Azure Data Factory, T-SQL, SSIS, and Azure Data Lake Analytics.
- Expertise in distributed computing on AWS services such as EMR, EC2, S3, Redshift, and Elasticsearch; migrated raw data into Amazon S3 and performed refined data processing.
- Developed data pipelines on AWS to extract data from weblogs and store it in HDFS.
- Worked on extensive migration of Hadoop and Spark clusters to AWS and Azure.
- Proficient in GCP Dataproc, Cloud Functions, GCS, and BigQuery.
- Good knowledge of GCP service accounts, billing projects, authorized views, datasets, GCS buckets, and gsutil commands.
- Expertise in installing, configuring, and using Big Data ecosystem components such as Hadoop Distributed File System (HDFS), MapReduce, YARN, Spark, NiFi, Pig, Hive, Flume, HBase, Oozie, ZooKeeper, and Sqoop.
- Expertise in administering Hadoop clusters using Hadoop distributions such as Apache Hadoop and Cloudera.
- Good experience in creating real-time data streaming solutions using Apache Spark Core, Spark SQL and DataFrames, Kafka, Spark Streaming, and Apache Storm.
- Strong expertise in coding MapReduce programs in Python for analyzing Big Data.
- Proficient in Splunk administration and development, including dashboards, forms, SPL searches, reports, and views.
- Expertise in installation, configuration, migration, troubleshooting, and maintenance of Splunk; passionate about machine data and operational intelligence.
- Working knowledge of NoSQL technologies such as HBase, Cassandra, and MongoDB.
- Extensive working knowledge of ETL tools such as Talend, Informatica, and Stitch, and reporting services such as SQL Server Reporting Services (SSRS).
- Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems.
- Experienced in migrating HiveQL into Impala to minimize query response time.
- Proficient at using Spark APIs to explore, cleanse, aggregate, transform and store machine sensor data.
- Experienced in creating and performing operations on DataFrames using Python.
- Knowledge of trade-offs among PaaS, SaaS, and IaaS cloud-based solutions.
- Outstanding expertise in job workflow scheduling tools such as Oozie and coordination services such as ZooKeeper.
- Imported and exported data between different data sources and HDFS using Sqoop, and performed transformations using Hive and MapReduce.
- Experienced in creating tasks, sessions, and workflows using Alteryx.
- Experience in creating interactive dashboards and creative visualizations using tools such as Tableau and Power BI.
- Extensive skills in Linux and UNIX shell commands.
- Good experience with project execution methodologies such as Scrum, Agile, and Waterfall.
TECHNICAL SKILLS:
Big Data Eco-system: HDFS, MapReduce, Spark, YARN, Hive, Pig, HBase, Sqoop, Flume, Kafka, Oozie, ZooKeeper, Impala
Hadoop Technologies: Apache Hadoop 2.x/1.x, Cloudera CDH4/CDH5, Hortonworks
Programming Languages: Python, Scala, Shell Scripting, HiveQL
Operating Systems: Windows, Linux (Ubuntu, Centos)
NoSQL Databases: HBase, Cassandra, MongoDB
Database: RDBMS, MySQL, Teradata, DB2, Oracle
Container/Cluster Managers: Docker, Kubernetes
BI Tool: Tableau, Power BI
Cloud Environment: AWS (Amazon Web Services), Azure, GCP
Web Development: HTML, CSS, JavaScript
IDE Tools: Eclipse, Jupyter, Anaconda, PyCharm
Development Methodologies: Agile, Waterfall
PROFESSIONAL EXPERIENCE
Confidential, Columbia, SC
Sr. Data Engineer/AWS Developer
Responsibilities:
- Processed big data over Hadoop clusters of virtual servers on Amazon Simple Storage Service (S3).
- Transferred data from AWS S3 to Redshift using Informatica tools.
- Performed AWS data migrations across multiple database platforms, including local SQL Server to Amazon RDS and EMR Hive, and organized and evaluated Hadoop log files in AWS S3.
- Created and promoted multi-server AWS environments using Amazon EC2, EMR, EBS, and Redshift, and deployed the Big Data Hadoop application on the AWS cloud.
- Used Amazon Elastic Compute Cloud (EC2) for computational tasks and Simple Storage Service (S3) as a storage mechanism.
- Provided support for AWS cloud infrastructure automation with tools including Gradle, Chef, Nexus, and Docker, and monitoring tools such as Splunk and CloudWatch.
- Helped the team architect a state-of-the-art data lake on AWS using EMR, Data Pipeline, Spark, NiFi, and Kafka.
- Implemented dynamic DAGs in Apache Airflow and utilized various AWS and GCP operators as part of ETL (see the first sketch at the end of this list).
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 text files into AWS Redshift.
- Imported metadata into Hive using Python and migrated existing tables and applications to work on the AWS cloud (S3).
- Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark (see the second sketch at the end of this list).
- Implemented serverless architecture using AWS Lambda with Amazon S3 and Amazon DynamoDB.
- Scheduled clusters with CloudWatch and created Lambda functions to generate operational alerts for various workflows.
- Created and managed Splunk DB Connect identities, database connections, database inputs, outputs, lookups, and access controls.
- Installed, configured, and upgraded Splunk and tested it across different environments.
- Worked with product teams to design integrations using Talend ETL, Python, and Spark, improving the performance of the enterprise data warehouse.
- Integrated MapReduce with HBase to bulk-import data into HBase using MapReduce programs.
- Developed numerous MapReduce jobs for data cleansing and analyzed data in Impala.
- Designed appropriate partitioning/bucketing schemas in Hive for efficient data access during analysis, designed a data warehouse using Hive external tables, and created Hive queries for analysis.
- Configured the Hive metastore with MySQL to store metadata for Hive tables and used Hive to analyze data ingested into HBase via Hive-HBase integration.
- Migrated an existing feed from Hive to Spark to reduce the latency of existing HiveQL feeds.
- Developed Oozie Workflows for daily incremental loads to get data from Teradata and import into Hive tables.
- Validated, manipulated, and performed exploratory data analysis using Python and its data-specific libraries Pandas and PySpark, interpreting and extracting meaningful insights from data sets consisting of millions of records.
- Responsible for maintaining Alteryx workflow connections with upstream sources and the visuals of downstream KPIs.
- Retrieved data from the Hadoop cluster by developing a pipeline using Hive (HQL) and SQL to retrieve data from an Oracle database, and used Extract, Transform, and Load (ETL) for data transformation.
- Worked with Flume to build a fault-tolerant data ingestion pipeline for transporting streaming data into HDFS.
- Applied advanced Spark techniques such as text analytics using in-memory processing.
- Utilized the Spring framework for dependency injection and integrated it with Hibernate.
- Implemented multiple Spark batch jobs using Spark SQL, performed transformations using multiple APIs, and updated master data in the Cassandra database per business requirements.
- Developed data models and data migration strategies utilizing concepts of snowflake schema.
- Involved in data pre-processing and cleaning to perform feature engineering and impute missing values in the dataset using Python.
- Created and deployed a PySpark application that computed an acceptance score for all content using an algorithm and stored the data in Elasticsearch for the content management team to consume.
- Used Tableau to present the results as dashboards for data science, marketing, and other engineering teams.
- Generated data cubes using Hive and MapReduce on the provisioned Hadoop cluster in AWS.
- Performance-tuned Tableau dashboards and reports built on large data sources.
- Involved in Agile methodologies, daily Scrum meetings, and sprint planning, with strong experience in the SDLC.
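A minimal sketch of the dynamic-DAG pattern referenced above, using only core Apache Airflow 2.x APIs; the pipeline names and the load_to_s3 callable are hypothetical placeholders, not the project's actual operators.

```python
# Illustrative only: dynamic DAG generation with core Airflow APIs.
# Pipeline names and the load callable are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

PIPELINES = ["adobe_clickstream", "campaign_feeds"]  # hypothetical sources

def load_to_s3(source: str, **context) -> None:
    """Placeholder for the real extract-and-load logic (e.g. boto3/Glue calls)."""
    print(f"Extracting {source} and landing it in S3 for {context['ds']}")

for source in PIPELINES:
    dag_id = f"etl_{source}"
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="extract_and_load",
            python_callable=load_to_s3,
            op_kwargs={"source": source},
        )
    # Register each generated DAG in the module namespace so the scheduler discovers it.
    globals()[dag_id] = dag
```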
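A hedged PySpark sketch of the aggregation/consolidation style of job described above; the S3 bucket, input format, and column names are hypothetical and would differ in the actual Glue job.

```python
# Illustrative PySpark aggregation sketch; paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adobe-aggregation").getOrCreate()

# Read raw events landed in S3 (hypothetical bucket/prefix).
events = spark.read.json("s3://example-bucket/raw/adobe/")

# Consolidate to one row per visitor per day with simple engagement metrics.
daily = (
    events
    .withColumn("event_date", F.to_date("event_timestamp"))
    .groupBy("visitor_id", "event_date")
    .agg(
        F.count("*").alias("event_count"),
        F.countDistinct("page_url").alias("unique_pages"),
    )
)

# Write the refined dataset back to S3 as partitioned Parquet for downstream loads.
daily.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-bucket/refined/adobe_daily/"
)
```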
Environment: AWS, Hadoop, HDFS, MapReduce, Apache Spark, Spark SQL, Spark Streaming, Airflow, Hive, Oozie, Splunk, Sqoop, Kafka, Flume, Nifi, Zookeeper, Informatica, Databricks, MongoDB, Python, Linux, Snowflake, Tableau
Confidential, Mountain View, CA
Staff Data Engineer/Azure Developer
Responsibilities:
- Implemented OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse.
- Developed pipelines in ADF using datasets to extract, transform, and load data from multiple sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse.
- Performed data migrations from on-premises systems to Azure using Azure Data Factory and Azure Data Lake.
- Worked on Microsoft Azure toolsets including Azure Data Factory pipelines, Azure Databricks, and Azure Data Lake Storage.
- Built data pipelines for ETL-related jobs using different Airflow operators.
- Transferred data using Azure Data Factory.
- Used Kafka and Spark Streaming for data ingestion and cluster handling in real-time processing (see the sketch at the end of this list).
- Developed flow XML files in Apache NiFi, a dataflow automation tool, to ingest data into HDFS.
- Involved in designing snowflake schemas for the data warehouse and ODS architecture using data modeling tools such as Erwin.
- Designed a batch audit process in shell script to monitor each ETL job and report its status, including table name, start and finish times, number of rows loaded, and job status.
- Expertise in Splunk Enterprise architecture, including search heads, indexers, deployment server, deployer, license master, and heavy/universal forwarders.
- Experience with Splunk Enterprise deployments and enabling continuous integration as part of configuration management.
- Excellent knowledge of data ingestion projects that inject data into the data lake from multiple source systems using Talend Big Data.
- Actively monitored jobs through alerting tools, analyzed logs, and escalated critical issues to higher-level teams.
- Maintained and updated the metadata repository with details on the nature and usage of applications and data transformations to facilitate impact analysis.
- Developed integration checks around the PySpark framework for processing large datasets.
- Overwrote Hive data with HBase data daily to keep data fresh, and used Sqoop to load data from DB2 into the HBase environment.
- Managed Hadoop metadata by extracting and maintaining metadata from Hive tables with HiveQL.
- Imported metadata into Hive and migrated existing tables and applications to work on Hive and Spark.
- Improved performance and optimized existing Hadoop algorithms using Spark context, Spark SQL, DataFrames, RDDs, and Spark on YARN.
- Developed Python scripts to automate the data ingestion pipeline for multiple data sources and deployed Apache NiFi.
- Developed workflows using the Apache Oozie framework for task automation.
- Developed a data pipeline using Flume, Sqoop, and Pig to extract data from weblogs and store it in HDFS.
- Collected and aggregated large amounts of log data using Flume and staged it in HDFS for further analysis.
- Developed a Spark job which indexes data into ElasticSearch from external Hive tables which are in HDFS.
- Worked with data monitoring tools such as Splunk and Datadog to analyze, monitor, and visualize generated data in real time.
- Experienced in Tableau and Power BI, publishing visualizations, dashboards, and workbooks from Tableau Desktop to servers, and building reports using SSRS.
- Extensively used Agile methodology to implement data models to organizational standards.
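A minimal sketch of real-time ingestion from Kafka, written here with Spark Structured Streaming in the spirit of the Kafka/Spark Streaming bullet above; the broker address, topic name, and output paths are hypothetical.

```python
# Illustrative only: Kafka ingestion with Spark Structured Streaming.
# Broker address, topic name, and output paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Subscribe to a Kafka topic; requires the spark-sql-kafka package on the classpath.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the value to string for downstream parsing.
events = raw.select(
    F.col("key").cast("string"),
    F.col("value").cast("string").alias("payload"),
    "timestamp",
)

# Land micro-batches as Parquet (e.g., on HDFS or a cloud store) with checkpointing.
query = (
    events.writeStream.format("parquet")
    .option("path", "/data/landing/events/")
    .option("checkpointLocation", "/data/checkpoints/events/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```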
Environment: Hadoop, HDFS, MapReduce, PySpark, Spark SQL, ETL, Hive, Pig, Oozie, Databricks, Sqoop, Splunk, Azure, Airflow, Star Schema, Python, NiFi, Tableau, Power BI.
Confidential, Newark, CA
Data Engineer/GCP Developer
Responsibilities:
- Migrated on-premises ETL to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, Cloud Dataflow, Cloud Composer, Cloud Functions, and Cloud Pub/Sub.
- Migrated data from an on-premises SQL database to Google Cloud using a Python application and PySpark, and designed an optimized database architecture.
- Deep understanding of moving data into GCP using Sqoop, custom hooks for MySQL, and Cloud Data Fusion for moving data from Teradata to GCS.
- Worked with Kubernetes in GCP, created new monitoring techniques using the Stackdriver log router, and designed reports in Data Studio.
- Good experience in identifying production bugs in the data using Stackdriver logs in GCP.
- Worked on a project to migrate data from different sources (Teradata, Hadoop, DB2) to Google Cloud Platform (GCP) using the UDP framework, transforming the data with Spark Scala scripts.
- Experience in the design, development, and implementation of Big Data applications using Hadoop ecosystem frameworks and tools such as Hive, Spark, HBase, Kafka, Flume, NiFi, and Airflow.
- Migrated ETL code from Talend to Informatica; involved in development, testing, and post-production support for the entire migration project.
- Created shell scripts to push data loads from various sources on the edge nodes into HDFS.
- Managed workflows and scheduling for complex MapReduce jobs using Apache Oozie.
- Created Hive tables and wrote Hive queries for data analysis to meet business requirements, and used Sqoop to import and export data from Oracle and MySQL.
- Developed applications using Scala and PySpark for interactive analysis, batch processing, and stream processing.
- Developed quality-check modules in PySpark and SQL to validate data in the data lake, and automated triggering of the modules before data is ingested.
- Created Splunk apps for Enterprise Security to identify and address emerging security threats through continuous monitoring, alerting, and analytics.
- Used HiveQL to analyze data and create summarized data for consumption in Power BI.
- Wrote object-oriented Python code with an emphasis on quality, logging, monitoring, debugging, and code optimization.
- Created a Python Lambda script for executing EMR jobs (see the sketch at the end of this list).
- Created presentations for data reporting by using pivot tables, VLOOKUP and other advanced Excel functions.
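A hedged sketch of a Python Lambda handler that submits a step to a running EMR cluster with boto3, along the lines of the Lambda/EMR bullet above; the cluster ID, script location, and step arguments are hypothetical placeholders.

```python
# Illustrative AWS Lambda handler that submits a Spark step to an existing EMR cluster.
# The cluster ID, script path, and arguments are hypothetical placeholders.
import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    response = emr.add_job_flow_steps(
        JobFlowId=event.get("cluster_id", "j-EXAMPLE1234567"),  # hypothetical cluster ID
        Steps=[
            {
                "Name": "nightly-spark-job",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    # command-runner.jar lets EMR run spark-submit as a step.
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--deploy-mode", "cluster",
                        "s3://example-bucket/jobs/nightly_job.py",  # hypothetical script
                    ],
                },
            }
        ],
    )
    # Return the submitted step IDs so callers can track progress.
    return {"step_ids": response["StepIds"]}
```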
Environment: Hadoop, GCP, HDFS, PySpark, Scala, Spark SQL, ETL, Pig, Hive, Sqoop, Oozie, Kafka, Databricks, Python, Power BI.
Confidential
Data Analyst
Responsibilities:
- Analyzed and reported on customer transactional and analytical data to meet business objectives.
- Worked on the entire CRISP-DM life cycle and was actively involved in all phases of the project life cycle, including data acquisition, data cleaning, and data engineering.
- Improved weekly and monthly reports for the marketing and finance departments using Teradata SQL.
- Designed the high-level ETL architecture for data transfer from OLTP to OLAP systems with the help of SSIS.
- Extracted data from SQL Server using Talend to load it into a single data warehouse repository.
- Wrote SQL queries using joins, grouping, nested sub-queries, and aggregation to retrieve data from various relational customer databases (see the sketch at the end of this list).
- Optimized the data environment for efficient access to data marts and implemented efficient data extraction routines for data delivery.
- Created VLOOKUP functions in MS Excel for searching data in large spreadsheets.
- Developed ad-hoc reports with VLOOKUPs, pivot tables, and macros in Excel and recommended solutions to drive business decision-making.
- Configured and monitored resource utilization throughout the cluster using Cloudera Manager, Navigator, and Search.
- Used Apache Flume to collect and aggregate large volumes of log data and stage it in HDFS for further analysis.
- Designed and developed data ingestion using Apache NiFi and Kafka.
- Developed UNIX shell scripts for creating reports from Hive data.
- Experienced in identifying production bugs in the data using Stackdriver logs in GCP.
- Involved in troubleshooting and performance tuning of reports and resolving issues within Tableau Server and reports.
- Wrote complex SQL, PL/SQL, procedures, functions, and packages to validate data and support the testing process.
- Used parameters and variables for storing expressions to avoid dimension/measure redundancy and for improving performance.
- Performed SAS programming using PROC SQL (joins/unions), PROC APPEND, PROC DATASETS, and PROC TRANSPOSE.
- Created worksheet reports, converted them into interactive dashboards using Tableau Desktop, and provided them to business users, project managers, and end users.
- Deployed data visualizations and analyzed data by developing Tableau dashboards that empower business users to make decisions.
- Utilized Git for version control of all code and resources.
- Extensively used JIRA for executing the test cases, defect tracking and test management.
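A small, self-contained sketch of the join / grouping / nested sub-query style of SQL described above, run here against an in-memory SQLite database with made-up customer and order tables; the actual work targeted relational customer databases such as Teradata and SQL Server.

```python
# Illustrative only: joins, grouping, a nested sub-query, and aggregation,
# demonstrated against an in-memory SQLite database with made-up tables and rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West'), (3, 'East');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 75.0), (12, 2, 500.0), (13, 3, 20.0);
    """
)

# Total spend per region, restricted (via a nested sub-query) to customers
# whose individual total exceeds the overall average order amount.
query = """
SELECT c.region,
       COUNT(DISTINCT c.customer_id) AS customers,
       SUM(o.amount)                 AS total_spend
FROM customers AS c
JOIN orders    AS o ON o.customer_id = c.customer_id
WHERE c.customer_id IN (
    SELECT customer_id
    FROM orders
    GROUP BY customer_id
    HAVING SUM(amount) > (SELECT AVG(amount) FROM orders)
)
GROUP BY c.region
ORDER BY total_spend DESC;
"""

for row in conn.execute(query):
    print(row)
```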
Environment: GCP, SQL, PL/SQL, Tableau, Apache Nifi, Kafka, Apache Flume, Hive, HDFS, Cloudera, GitHub, Agile Methodology, ETL, MS Excel
Confidential
SQL Developer
Responsibilities:
- Responsible for creating complex Stored Procedures, SSIS packages, triggers, cursors, tables, views and other SQL joins and statements for applications.
- Responsible for developing processes, automation of maintenance jobs, tuning SQL Server, locks and indexes configurations, administering SQL Server security, SQL Server automatic e-mail notification and SQL Server backup strategy and automation.
- Configured SSIS packages using the Package Configuration Wizard to allow packages to run in different environments.
- Designed and implemented SQL server objects such as Tables, Indexes, Views, Stored Procedures and Functions in Transact-SQL.
- Optimized the performance of queries with improvement in T-SQL queries and removed unneeded columns, eliminated redundant and inconsistent data, established joins and created indexes.
- Developed SSIS packages using a Foreach Loop container in the Control Flow to process all Excel files within a folder, a File System Task to move files into an archive after processing, and an Execute SQL Task to insert transaction log data into the SQL table.
- Involved in gathering requirements, performing source system analysis and development of ETL jobs to populate data from the transactional data source to the target Data warehouse.
- Developed advanced correlated and un-correlated sub-queries in T-SQL to develop complex reports.
- Developed multi-dimensional cubes and dimensions using SQL Server Analysis Services (SSAS).
- Handled performance tuning by creating partitions on tables, applying strong analytical and troubleshooting skills to provide quick solutions in globally distributed, large-scale production environments.
- Improved the performance of the Stored procedures by using SQL profiler, performance monitor, Execution plan, and Index tuning advisor.
- Developed many Tabular Reports, Matrix Reports, cascading parameterized Drill down, drop down Reports and Charts using SQL Server Reporting Services (SSRS 2012).
Environment: SQL, Tableau, ETL