Data Engineer Resume
Atlanta, GA
PROFESSIONAL SUMMARY:
- Accomplished IT professional with 6 years of experience, specializing in Big Data systems, data acquisition, ingestion, modeling, storage, analysis, integration, and processing.
- A dedicated engineer with strong problem-solving, troubleshooting, and analytical abilities who actively participates in understanding and delivering business requirements.
- 4+ years of industry experience in Big Data analytics and data manipulation using Hadoop ecosystem tools: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Kafka, Flume, Sqoop, Oozie, Avro, AWS, Spark integration with Cassandra, and Zookeeper.
- Collaborated closely with business stakeholders, production support, and engineering teams to dive deep into data, drive effective decision making, and support analytics initiatives.
- Strong Hadoop and platform support experience with major Hadoop distributions: Cloudera, Hortonworks, Amazon EMR, and Azure HDInsight.
- Excellent knowledge of Hadoop architecture and its core concepts: distributed file systems, parallel processing, high availability, fault tolerance, and scalability.
- Extensive hands-on experience with Big Data frameworks including Hadoop (HDFS, MapReduce, YARN), Spark, Kafka, Hive, Impala, HBase, Sqoop, Pig, Airflow, Oozie, Zookeeper, Ambari, Flume, and NiFi.
- Proficient at writing MapReduce jobs and UDFs to gather, analyze, transform, and deliver data according to business requirements.
- Hands-on experience building real-time data streaming solutions using Apache Spark, Spark SQL and DataFrames, Kafka, Spark Streaming, and Apache Storm.
- Proficient in building PySpark and Scala applications for interactive analysis, batch processing, and stream processing (an illustrative streaming sketch follows this summary).
- Strong working experience with SQL and NoSQL databases, data modeling, and data pipelines. Involved in end-to-end development and automation of ETL pipelines using SQL and Python.
- Significant knowledge of AWS cloud services (EMR, EC2, RDS, EBS, S3, Lambda, Glue, Elasticsearch, Kinesis, SQS, DynamoDB, Redshift, and ECS).
- Experience with Azure cloud components (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, and Cosmos DB).
- Experienced in writing Spark scripts in Python, Scala, and SQL for development and analysis.
- Worked with various ingestion services for batch and real-time data processing using Spark Streaming, Confluent Kafka, Storm, Flume, and Sqoop.
- Broad development experience with Bash scripting, T-SQL, and PL/SQL scripts.
- Extensive experience using Sqoop to ingest data from relational databases: Oracle, MS SQL Server, Teradata, PostgreSQL, and MySQL.
- Experience with NoSQL databases such as HBase, Cassandra, MongoDB, DynamoDB, and Cosmos DB, and their integration with Hadoop clusters.
- Leveraged diverse file formats like Parquet, Avro, ORC and Flat files.
- Knowledge of developing highly scalable and robust RESTful APIs, ETL solutions, and third-party product integrations as part of an enterprise platform.
- Deep experience building ETL pipelines between several source systems and the Enterprise Data Warehouse using Informatica PowerCenter, SSIS, SSAS, and SSRS.
- Experience in all stages of data warehouse development: requirements gathering, design, development, implementation, testing, and documentation.
- Solid knowledge of dimensional data modeling with star and snowflake schemas for fact and dimension tables using Analysis Services.
- Worked in both Agile and Waterfall environments, using Git and SVN version control systems.
- Experience designing interactive dashboards and reports and performing analysis and visualization using Tableau, Power BI, Arcadia, and Matplotlib.
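A minimal PySpark Structured Streaming sketch of the kind of Kafka-to-storage pipeline summarized above; the broker address, topic, event schema, and output paths are hypothetical placeholders rather than details from any specific engagement.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Hypothetical event schema used only for this illustration
schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read a stream from Kafka (broker and topic are placeholders)
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .load())

# Parse the JSON payload and keep only valid records
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*")
          .filter(col("event_id").isNotNull()))

# Write the parsed stream to Parquet with checkpointing (paths are placeholders)
query = (events.writeStream.format("parquet")
         .option("path", "/data/events")
         .option("checkpointLocation", "/checkpoints/events")
         .outputMode("append")
         .start())
query.awaitTermination()
```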
TECHNICAL SKILLS:
Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala
Hadoop Distribution: Cloudera CDH, Hortonworks HDP, Apache, AWS
Machine Learning Classification Algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbour (KNN), Principal Component Analysis
Languages: Shell scripting, SQL, PL/SQL, Python, R, PySpark, Pig, Hive QL, Scala, Regular Expressions
Web Technologies: HTML, JavaScript, RESTful, SOAP
Operating Systems: Windows (XP/7/8/10), Mac OS, UNIX, Linux, Ubuntu, CentOS
Version Control: Git, GitHub
IDE & Tools, Design: Eclipse, Visual Studio, NetBeans, JUnit, CI/CD, SQL Developer, MySQL Workbench, Tableau
Databases: Oracle, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL Database (HBase, MongoDB).
Cloud Technologies: MS Azure, Amazon Web Services (AWS)
Data Engineer/Big Data Tools / Cloud / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, MLlib, Oozie, Zookeeper, AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Salesforce, Google Shell, Linux, Bash Shell, Unix, Tableau, Power BI, SAS, Crystal Reports, Dashboard Design
PROFESSIONAL EXPERIENCE:
Confidential, Atlanta, GA
Data Engineer
Responsibilities:
- Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.
- Strong experience leading multiple Azure Big Data and data transformation implementations in the Banking and Financial Services, High Tech, and Utilities industries.
- Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, HDInsight, Azure SQL Server, Azure ML and Power BI.
- Prepared the complete data mapping for all the migrated jobs using SSIS.
- Designed SSIS packages to transfer data from flat files to SQL Server using Business Intelligence Development Studio.
- Extensively used SSIS transformations and tasks such as Lookup, Derived Column, Data Conversion, Aggregate, Conditional Split, SQL Task, Script Task, and Send Mail Task.
- Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using SQL activities.
- Developed Spark applications using Scala and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Performed data analysis and collaborated with the downstream analytics team to shape the data according to their requirements.
- Experienced in performance tuning of Spark applications: setting the right batch interval, choosing the correct level of parallelism, and memory tuning.
- Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services (VSTS).
- Designed and Developed Real time Stream processing Application using Spark, Kafka, Scala, and Hive to perform Streaming ETL and apply Machine Learning.
- Created partitioned and bucketed Hive tables in Parquet format with Snappy compression, then loaded data into the Parquet Hive tables from Avro Hive tables (an illustrative sketch follows this role's environment list).
- Involved in running Hive scripts through Hive, Impala, and Hive on Spark, and some through Spark SQL.
- Used Azure Kubernetes Service (AKS) to deploy a managed Kubernetes cluster in Azure, built AKS clusters through the Azure portal and with the Azure CLI, and used template-driven deployment options such as Azure Resource Manager templates and Terraform.
- Used Kubernetes to deploy, scale, load balance, and manage Docker containers with multiple namespaced versions.
- Designed strategies for optimizing all aspects of the continuous integration, release, and deployment processes using containerization and virtualization techniques such as Docker and Kubernetes. Built Docker containers for microservices projects and deployed them to Dev.
- Collected JSON data from an HTTP source and developed Spark APIs that help perform inserts and updates in Hive tables.
- Used Azure Data Factory, the SQL API, and the MongoDB API to integrate data from MongoDB, MS SQL, and cloud stores (Blob Storage, Azure SQL DB, Cosmos DB).
- Responsible for resolving the issues and troubleshooting related to performance of Hadoop cluster.
- Involved in designing and developing tables in HBase and storing aggregated data from Hive Table.
- Used Jira for bug tracking and Bitbucket to check in and check out code changes.
- Utilized machine learning algorithms such as linear regression, multivariate regression, PCA, K-means, and KNN for data analysis.
- Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, designing and developing POCs using Scala, Spark SQL, and MLlib.
- Performed day-to-day Git support for different projects and was responsible for maintaining Git repositories and access control strategies.
Environment: Hadoop 2.x, Hive v2.3.1, Spark v2.1.3, Databricks, Lambda, Glue, Azure, ADF, Blob, cosmos DB, Python, PySpark, Java, Scala, SQL, Sqoop v1.4.6, Kafka, Airflow v1.9.0, Oozie, HBase, Oracle, Teradata, Cassandra, MLlib, Tableau, Maven, Git, Jira.
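A minimal PySpark sketch of the kind of partitioned, bucketed, Snappy-compressed Parquet table load described in this role, using Spark's native bucketing via saveAsTable; the database, table, and column names are hypothetical, and the Avro-backed source table is assumed to already exist in the metastore.

```python
from pyspark.sql import SparkSession

# Hive support lets saveAsTable register the table in the Hive metastore
spark = (SparkSession.builder
         .appName("avro-to-parquet-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read the existing Avro-backed Hive table (hypothetical name)
usage = spark.table("analytics.usage_avro")

# Write a partitioned, bucketed Parquet table with Snappy compression
(usage.write
      .mode("overwrite")
      .format("parquet")
      .option("compression", "snappy")
      .partitionBy("event_date")        # partition column (hypothetical)
      .bucketBy(16, "customer_id")      # bucket column and count (hypothetical)
      .sortBy("customer_id")
      .saveAsTable("analytics.usage_parquet"))
```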
Confidential
Data Engineer/Big Data Developer
Responsibilities:
- Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
- Implemented a proof of concept deploying the product on an AWS S3 bucket and Snowflake.
- Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehouse, and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful, valuable information for better decision-making.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation, queries, and writing back into S3 buckets.
- Experience in data cleansing and data mining.
- Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
- Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
- Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, used the Spark engine and Spark SQL for data analysis, and provided the results to data scientists for further analysis.
- Involved in migrating data from an on-premises Cloudera cluster to an AWS EMR cluster deployed on EC2 instances, and developed an ETL pipeline to extract logs, store them in an AWS S3 data lake, and process them further using PySpark.
- Prepared scripts in Python and Scala to automate the ingestion process from various sources such as APIs, AWS S3, Teradata, and Snowflake.
- Designed and developed Spark workflows in Scala to pull data from AWS S3 buckets and Snowflake and apply transformations to it.
- Implemented Spark RDD transformations to map business analysis logic and applied actions on top of the transformations.
- Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
- Created Python scripts to read CSV, JSON, and Parquet files from S3 buckets and load the data into AWS S3, DynamoDB, and Snowflake.
- Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 buckets, or to HTTP requests via Amazon API Gateway (a Lambda handler sketch appears after this role's environment list).
- Migrated data from AWS S3 bucket to Snowflake by writing custom read/write snowflake utility function using Scala.
- Worked on Snowflake schemas and data warehousing; processed batch and streaming data load pipelines using Snowpipe and Matillion from the data lake Confidential AWS S3 bucket.
- Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse, and created DAGs to run Airflow workflows.
- Troubleshoot and maintain ETL/ELT jobs running using Matillion.
- Created DAGs using the EmailOperator, BashOperator, and Spark Livy operator to execute jobs on an EC2 instance (an Airflow DAG sketch appears after this role's environment list).
- Deployed the code to EMR via CI/CD using Jenkins.
- Extensively used Code cloud for code check-in and checkout for version control.
Environment: Agile Scrum, MapReduce, Snowflake, Pig, Spark, Scala, Hive, Kafka, Python, Airflow, JSON, Parquet, CSV, Code cloud, AWS.
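A minimal Airflow DAG sketch of the kind of BashOperator/EmailOperator workflow described in this role, assuming Airflow 2.x import paths; the DAG name, schedule, spark-submit path, and email address are hypothetical placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.email import EmailOperator

default_args = {
    "owner": "data-engineering",          # hypothetical owner
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_s3_to_snowflake",       # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Submit the Spark job on the EC2 edge node (command path is a placeholder)
    run_spark_load = BashOperator(
        task_id="run_spark_load",
        bash_command="spark-submit /opt/jobs/s3_to_snowflake.py",
    )

    # Notify the team once the load finishes (address is a placeholder)
    notify = EmailOperator(
        task_id="notify_team",
        to="data-team@example.com",
        subject="Daily S3-to-Snowflake load complete",
        html_content="The daily load finished successfully.",
    )

    run_spark_load >> notify
```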
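A minimal sketch of an AWS Lambda handler of the kind described in this role, triggered by S3 object-created events and recording object metadata in DynamoDB; the table name and item attributes are hypothetical.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ingested_files")  # hypothetical tracking table


def lambda_handler(event, context):
    """Record each newly created S3 object in a DynamoDB tracking table."""
    records = event.get("Records", [])
    for record in records:
        s3_info = record["s3"]
        # Hypothetical item layout: object key, bucket, and size in bytes
        table.put_item(Item={
            "object_key": s3_info["object"]["key"],
            "bucket": s3_info["bucket"]["name"],
            "size_bytes": s3_info["object"].get("size", 0),
        })
    return {"statusCode": 200, "body": f"processed {len(records)} records"}
```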
Confidential
Big Data Developer
Responsibilities:
- Worked with the Hortonworks distribution. Installed, configured, and maintained a Hadoop cluster based on business requirements.
- Involved in end-to-end implementation of ETL pipelines using Python and SQL for high-volume data analysis; also reviewed use cases before onboarding data to HDFS.
- Responsible for loading, managing, and reviewing terabytes of log files using the Ambari web UI.
- Used Sqoop to move data between relational databases and HDFS. Ingested data from MS SQL, Teradata, and Cassandra databases.
- Performed ad hoc queries using Hive joins and bucketing techniques for faster data access.
- Used NiFi to automate data flow between disparate systems. Designed dataflow models and target tables to obtain relevant metrics from different sources.
- Developed Bash scripts to fetch log files from an FTP server and executed Hive jobs to parse them.
- Implemented various Hive queries for analytics. Created external tables, optimized Hive queries, and improved cluster performance by 30% (an illustrative sketch appears at the end of this role).
- Enhanced existing Python modules. Worked on writing APIs to load the processed data into HBase tables.
- Migrated ETL tasks to Pig scripts to apply joins, aggregations, and transformations.
- Worked with Jenkins build and continuous integration tools. Involved in writing Groovy scripts to automate the Jenkins pipeline's integration and delivery service.
- Used Power BI as a front-end BI tool and MS SQL Server as a back-end database to plan and create dashboards, workbooks, and complex aggregate calculations.
- Used Jenkins for CI/CD and SVN for version control.
- Used Informatica as an ETL tool to create source/target definitions, mappings and sessions to extract, transform and load data into staging tables from various sources.
- Designed and Developed Informatica processes to extract data from internal check issue systems.
- Used Informatica PowerExchange to extract data from one of the EIC's operational systems, Datacom.
- Extensive experience building and publishing customized interactive reports and dashboards and scheduling reports using Tableau Desktop and Tableau Server.
- Extensive experience in Tableau Administration Tool, Tableau Interactive Dashboards, Tableau suite.
- Developed Tableau visualizations and dashboards using Tableau Desktop and published the same on Tableau Server.
- Used Informatica PowerCenter for ETL: extracting, transforming, and loading data from heterogeneous source systems into the target database.
- Analyzed data stored in S3 buckets using SQL and PySpark, stored the processed data in Redshift, and validated data sets by implementing Spark components.
- Worked as an ETL developer and Tableau developer, heavily involved in designing, developing, and debugging ETL mappings using the Informatica Designer tool, and created advanced chart types, visualizations, and complex calculations to manipulate data using Tableau Desktop.
- Used the Custom SQL feature in Tableau Desktop to create complex, performance-optimized dashboards.
- Connected Tableau to various databases and performed live data connections and automatic query updates on data refresh.
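A minimal PySpark sketch of the kind of external, partitioned Hive table and partition-pruned query described in this role; the database, table, column names, and HDFS location are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-external-table-sketch")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS logs")

# External table over HDFS log data, partitioned by date (all names hypothetical)
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS logs.web_events (
        user_id STRING,
        url     STRING,
        status  INT
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/logs/web_events'
""")

# Register partitions already present at the table location
spark.sql("MSCK REPAIR TABLE logs.web_events")

# Partition pruning limits the scan to a single day of data
daily_errors = spark.sql("""
    SELECT status, COUNT(*) AS hits
    FROM logs.web_events
    WHERE event_date = '2020-01-01' AND status >= 500
    GROUP BY status
""")
daily_errors.show()
```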