Sr. Data Engineer Resume
Plano, TX
SUMMARY
- Data Engineering professional with 6+ years of experience, specializing in the Big Data ecosystem: data acquisition, ingestion, modeling, storage, analysis, integration, and processing.
- Experienced in designing and implementing data engineering pipelines and analyzing data using Hadoop ecosystem tools like HDFS, MapReduce, Yarn, Spark, Sqoop, Hive, Pig, Flume, Kafka, Impala, Oozie, Zookeeper and HBase.
- Collaborated closely with business, product, production support, and engineering teams on a regular basis to dive deep into data, enable effective decision making, and support analytics platforms.
- Experience working on AWS platforms (EMR, EC2, RDS, EBS, S3, IAM, Lambda, Glue, Elasticsearch, CloudWatch, SQS, DynamoDB, Redshift, API Gateway, Athena).
- Implemented multiple applications to consume and transport data from S3 to Redshift, hosted and maintained on EC2.
- Worked on ETL migration by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Data Catalog and can be queried from Athena.
- Experienced in building PySpark and Spark-Scala applications for interactive analysis, data analysis, batch processing, and stream processing.
- Experience working on Azure cloud components (Databricks, Data Lake, Stream Analytics, Data Factory, Storage Explorer, SQL DB, Cosmos DB).
- Developed Apache Spark jobs using Scala and Python for faster data processing, and used Spark Core and Spark SQL libraries for querying (a minimal illustrative sketch follows this summary).
- Experienced in testing and standardizing data to meet business standards using Execute SQL Task, Conditional Split, Data Conversion, and Derived Column transformations in different environments.
- Proficient at using Spark APIs for streaming real-time data, staging, cleansing, applying transformations, and preparing data for machine learning needs.
- Expertise in the Spark ecosystem and architecture, developing production-ready Spark applications utilizing Spark Core, Spark Streaming, Spark SQL, DataFrames, Datasets, Spark ML, and Talend.
- Profound experience in creating real-time data streaming solutions using Apache Spark/Spark Streaming and Kafka.
- Worked on NoSQL databases on Cloudera and Hortonworks Hadoop distributions, as well as various ETL tools, Cassandra, and various Confidential IaaS/PaaS services.
- Experience with the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
- Created Spark scripts in both Python and Scala for development and data analysis
- Expertise with Docker and Kubernetes on multiple cloud providers; helped developers build and containerize their application CI/CD pipelines and deploy them to the cloud.
- Sound knowledge and hands-on experience with MapR, Ansible, Presto, Storm, StreamSets, Star and Snowflake schemas, ER modeling, and Talend.
- Experience importing and exporting data between databases such as MS SQL Server, Oracle, Cassandra, Teradata, and PostgreSQL and HDFS using Sqoop and Talend.
- Leveraged different file formats including Parquet, Avro, ORC, and flat files.
- Experience in analyzing data from multiple sources and creating reports with interactive dashboards using Power BI, Tableau, and Matplotlib.
- Expertise in ticketing & Project Management tools like JIRA, Azure DevOps, Bugzilla and ServiceNow.
- Experience in software methodologies such as Agile and Waterfall.
- Excellent communication, interpersonal and analytical skills. Also, a highly motivated team player with the ability to work independently.
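Illustrative example referenced above - a minimal PySpark batch job of the kind described in the Spark bullets. This is a sketch only; the bucket names, paths, and column names are hypothetical placeholders, not taken from any specific project below.

```python
# Minimal PySpark batch job: read raw events, cleanse, and write Parquet.
# Bucket, prefix, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-events-cleansing").getOrCreate()

raw = spark.read.json("s3://example-raw-bucket/events/")  # hypothetical path

cleansed = (
    raw
    .dropDuplicates(["event_id"])                          # hypothetical key column
    .filter(F.col("event_ts").isNotNull())
    .withColumn("event_date", F.to_date("event_ts"))
    .withColumn("amount", F.col("amount").cast("double"))
)

(cleansed
    .write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-curated-bucket/events/"))       # hypothetical path
```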
TECHNICAL SKILLS
Big Data Ecosystem: HDFS, MapReduce, Yarn, Spark, Kafka, Airflow, Hive, Impala, StreamSets, Sqoop, HBase, Flume, Ambari, Oozie, Zookeeper, NIFI, Sentry, Ranger.
Hadoop Distributions & Cloud Platforms: Apache Hadoop 1.x, Cloudera CDP, Hortonworks HDP, AWS (EMR, EC2, EBS, RDS, S3, Athena, Glue, Elasticsearch, Lambda, API Gateway, DynamoDB, Redshift, ECS, QuickSight), Azure (Databricks, Data Lake, Data Factory (ADF), SQL DB, Cosmos DB, Azure AD).
Programming Languages: Python, Scala, Java, HiveQL.
NoSQL Databases: MongoDB, HBase, Apache Cassandra, Redis.
Database Modeling: Dimension Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling.
Version Control: Git, SVN, Bitbucket
ETL/BI: Snowflake, Informatica, SSIS, SSRS, SSAS, Tableau, JIRA.
Reporting & Visualization: Tableau 9.x/10.x, Matplotlib, Power BI.
Web Development: JavaScript, Node.js, HTML, CSS, Spring, J2EE, JDBC, Okta, Postman, Hibernate, Maven, WebSphere.
Operating systems: Linux (Ubuntu, Centos, RedHat), Windows (XP/7/8/10), MacOS.
Others: StreamSets, Terraform, Docker, Kubernetes, Jenkins, Ansible, Splunk, Jira.
PROFESSIONAL EXPERIENCE
Confidential, Plano, TX
Sr. Data Engineer
Responsibilities:
- Helped in building a data pipeline and performed analytics using AWS stack (EMR, EC2, S3, RDS, Lambda, Athena, Glue, Redshift and ECS).
- Worked with the Data Science team running machine learning models on a Spark EMR cluster and delivered data as per business requirements.
- Utilized Spark's in-memory capabilities to handle large datasets stored in the S3 data lake.
- Developed Spark SQL and PySpark jobs to perform data cleansing, validation, and transformations.
- Extracted and enriched multiple HBase tables using joins in Spark SQL, and converted Hive queries into Spark transformations.
- Automated the process of transforming and ingesting terabytes of monthly data in Parquet format using Kafka, S3, Lambda and Airflow.
- Created workflows using Airflow to automate the extraction of weblogs into the S3 data lake (a minimal illustrative sketch follows this role's technology list).
- Worked with the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
- Created external and permanent tables in Snowflake over the data in AWS.
- Loaded data into S3 buckets using AWS Glue and PySpark; filtered data stored in S3 buckets using Elasticsearch and loaded it into Hive external tables.
- Fetched live streaming data from Oracle database using Spark Streaming and Kafka.
- Co-ordinated with Kafka team to build an on-premises data pipeline using Kafka and Spark Streaming using the feed from API Gateway REST service.
- Performed ETL operations using Python, SparkSQL, S3 and Redshift on terabytes of data to obtain customer insights.
- Performed interactive analytics such as cleansing, validation, and quality checks on data stored in S3 buckets using AWS Athena.
- Worked with the data science team on preprocessing and feature engineering, and assisted with machine learning algorithms running in production.
- Resolved several tickets generated when issues arose in production pipelines.
- Created analytical reporting and facilitated data for QuickSight and Tableau dashboards.
- Used Git for version control and Jira for project management and to track issues and bugs.
Technologies: AWS, EC2, S3, Athena, Lambda, Glue, Elasticsearch, RDS, DynamoDB, Redshift, ECS, Hadoop 2.x (HDFS, MapReduce, Yarn), Hive v2.3.1, Spark v2.1.3, Python, SQL, Sqoop v1.4.6, Kafka v2.1.0, Airflow v1.9.0, HBase, Oracle, Cassandra, Quicksight, Tableau, Docker, Maven, Git, Jira.
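Illustrative sketch referenced above - a minimal Airflow 1.x-style DAG for landing weblogs in S3. The DAG id, bucket, and extraction logic are hypothetical placeholders; the import path and provide_context flag assume the Airflow v1.9 listed in this stack.

```python
# Minimal Airflow 1.x-style DAG: extract weblogs and land them in S3 daily.
# DAG id, bucket name, and extraction logic are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path


def extract_weblogs(**context):
    """Hypothetical callable: pull the previous day's weblogs and upload to S3."""
    execution_date = context["ds"]
    # ... fetch logs for execution_date and upload to s3://example-datalake/weblogs/ ...
    print("extracted weblogs for {}".format(execution_date))


default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

dag = DAG(
    dag_id="weblogs_to_s3",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)

extract_task = PythonOperator(
    task_id="extract_weblogs",
    python_callable=extract_weblogs,
    provide_context=True,  # required in Airflow 1.x to pass the execution context
    dag=dag,
)
```

In practice the extraction callable would be replaced by the project's own weblog pull and S3 upload logic.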
Confidential, Trenton, NJ
Cloud Data Engineer
Responsibilities:
- Planned and designed a data warehouse in a star schema; designed the table structures and documented them.
- Handled importing data from various sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and moved data between MySQL and HDFS using Sqoop.
- Designed and implemented an end-to-end big data platform on a Teradata appliance.
- Performed ETL from multiple sources such as Kafka, NiFi, Teradata, and DB2 using Spark on Hadoop.
- Worked on Apache Spark, utilizing the Spark SQL and Spark Streaming components to support intraday and real-time data processing.
- Monitored the Spark cluster and its availability using Log Analytics and the Apache Ambari Web UI in Azure HDInsight.
- Performed data ingestion into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and cloud migration, processing the data in Azure Databricks.
- Transferred data from Teradata to a Hadoop cluster using TDCH/FastExport and Apache NiFi.
- Developed Python, PySpark, and Bash scripts to transform and load data across on-premises and cloud platforms.
- Implemented a NiFi workflow to pick up data from a REST API server, the data lake, and an SFTP server and send it to a Kafka broker.
- Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL.
- Installed and configured Apache Airflow for workflow management and created workflows in Python.
- Wrote UDFs in PySpark on Hadoop to perform transformations and loads (a minimal illustrative sketch follows this role's technology list).
- Used NiFi to load data into HDFS as ORC files.
- Created TDCH scripts and Apache NiFi flows to load data from mainframe DB2 into the Hadoop cluster.
- Performed source analysis, tracing data back to its origins through Teradata, DB2, etc.
- Created pipelines in Azure Data Factory using linked services, datasets, and pipelines to extract, transform, and load data from sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
- Implemented a Continuous Integration and Continuous Delivery process using GitLab along with Python and shell scripts to automate routine jobs, including synchronizing installers, configuration modules, packages, and requirements for the applications.
- Worked on Informatica PowerCenter tools: Designer, Repository Manager, Workflow Manager, and Workflow Monitor.
- Implemented a batch process for heavy-volume data loading using the Apache NiFi dataflow framework, following an Agile development methodology.
- Created Snowpipe for continuous data loading from staged data residing on cloud gateway servers.
- Developed an automated process for code builds and deployments using Jenkins, Ant, Maven, Sonatype, and shell scripts.
- Installed and configured Docker and Kubernetes for container orchestration.
- Developed an automation system using PowerShell scripts and JSON templates to remediate Azure services.
Technologies: Apache Spark, Hadoop, PySpark, HDFS, Cloudera, Kafka, Snowflake, Docker, Jenkins, Ant, Maven, Kubernetes, Azure, Databricks, Data Lake, Data Factory, NiFi, JSON, Teradata, DB2, SQL Server, MongoDB, Shell Scripting.
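Illustrative sketch referenced above - a minimal PySpark UDF of the kind used for transformations during loads. The paths, column names, and normalization mapping are hypothetical placeholders.

```python
# Minimal PySpark UDF sketch: normalize a free-text country column during a load.
# Paths, column names, and the mapping below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-transform").getOrCreate()

# Hypothetical mapping used to standardize free-text country values.
COUNTRY_MAP = {"us": "United States", "usa": "United States", "uk": "United Kingdom"}


def normalize_country(raw):
    if raw is None:
        return None
    key = raw.strip().lower()
    return COUNTRY_MAP.get(key, raw.strip().title())


normalize_country_udf = udf(normalize_country, StringType())

df = spark.read.parquet("/data/raw/customers")  # hypothetical HDFS path
df = df.withColumn("country", normalize_country_udf(col("country")))
df.write.mode("overwrite").parquet("/data/curated/customers")
```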
Confidential, Portland, OR
AWS Data Engineer
Responsibilities:
- Experienced in using distributed computing architectures such as AWS (EC2, Redshift, EMR, Elasticsearch), Hadoop, Spark, and Python, with effective use of MapReduce, SQL, and Cassandra to solve big data problems.
- Worked with Spark to improve performance and optimize existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, RDDs, and Spark on YARN.
- Developed and implemented various Spark jobs using AWS EMR to perform big data operations in AWS.
- Migrated the AWS EMR-based data lake and data ingestion system onto the Snowflake cloud database.
- Installed the application on AWS EC2 instances and configured the storage on S3 buckets.
- Utilized Spark's in-memory capabilities to handle large datasets stored in the S3 data lake.
- Worked on data ingestion, performing cleansing and transformations with AWS Lambda, AWS Glue, and Step Functions, and loaded the results into S3 buckets.
- Developed and executed a migration strategy to move Data Warehouse from an Oracle platform to AWS Redshift.
- Worked with engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
- Loaded data from web servers using Flume and Spark Streaming API.
- Worked on Spark, improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Created, dropped, and altered tables at runtime without blocking updates and queries, using Spark and Hive.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python.
- Created Hive tables, loaded data, and wrote Hive UDFs, including UDFs for rating aggregation.
- Developed Impala queries for faster querying and performed data transformations on Hive tables.
- Deployed the cluster through the YARN scheduler with auto-scalable sizing.
- Queried data using Spark SQL on top of PySpark jobs to perform data cleansing, validation, and transformations, and executed the programs using the Python API.
- Used Apache Kafka to import real-time network log data into HDFS; integrated Kafka with Spark for real-time data processing and deployed on the YARN cluster (a minimal illustrative sketch follows this role's technology list).
- Coordinated a data pipeline using Kafka and Spark Streaming fed from an API Gateway REST service.
- Involved in CI/CD process using Jenkins and GIT.
- Developed various Python scripts to find vulnerabilities in SQL queries through SQL injection checks and analysis.
- Created and maintained Teradata tables, views, macros, triggers, and stored procedures.
- Performed interactive analytics such as cleansing, validation, and quality checks on data stored in S3 buckets using AWS Athena.
- Migrated tables from RDBMS into Hive using Sqoop and later generated data visualizations using Tableau.
- Experience with analytical reporting and facilitating data for QuickSight and Tableau dashboards.
- Used Git for version control and Jira for project management and to track issues and bugs
Technologies: AWS EC2, S3, Athena, Lambda, Glue, Elasticsearch, RDS, DynamoDB, Redshift, ECS, Hadoop 2.x (HDFS, MapReduce, Yarn), Hive v2.3.1, Spark v2.1.3, Python, SQL, Sqoop v1.4.6, Kafka v2.1.0, Airflow v1.9.0, HBase, Cassandra, Oracle, Teradata, MS SQL Server, Agile, Unix, Informatica, Talend, Tableau
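Illustrative sketch referenced above - a minimal Kafka-to-HDFS ingestion job, shown here with Spark Structured Streaming rather than the DStream API. Broker addresses, topic, and paths are hypothetical, and it assumes the spark-sql-kafka connector is available on the cluster.

```python
# Minimal Spark Structured Streaming sketch: read network log events from Kafka
# and append them to HDFS as Parquet. Brokers, topic, and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-log-ingest").getOrCreate()

logs = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # hypothetical brokers
    .option("subscribe", "network-logs")                             # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka values arrive as bytes; cast to string and add an ingest timestamp.
parsed = (
    logs.selectExpr("CAST(value AS STRING) AS raw_log")
        .withColumn("ingest_ts", F.current_timestamp())
)

query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/network_logs")             # hypothetical path
    .option("checkpointLocation", "hdfs:///checkpoints/network_logs")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```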
Confidential
Data Analyst
Responsibilities:
- Collaborated with different teams throughout the software development life cycle as per business requirements.
- Managed data or databases that support performance improvement activities.
- Developed data collection systems, data analytics and other strategies that optimize statistical efficiency and quality.
- Planned various test case scenarios to detect bugs, classified errors based on severity and priority, and informed the development team.
- Skilled in formulating database designs, data models, data mining approaches, and segmentation techniques.
- Designed SQL, SSIS, and Python-based batch and real-time ETL pipelines to extract data from transactional and operational databases and load it into data warehouses (a minimal illustrative sketch follows this role's technology list).
- Created complex SQL queries and scripts to extract, aggregate, and validate data from MS SQL, Oracle, and flat files using Informatica, and loaded it into a single data warehouse repository.
- Gathered and extracted data, generated reports using Tableau and SSRS, and identified trends in the analyzed data.
- Prepared test case modules for system and integration testing to assist in tracking defects.
- Conducted Quality inspections on products and parts.
- Worked with various teams such as software development, system integration to identify defects in the software.
- Performed Unit Testing, Regression Testing, and Integration Testing.
- Performed regression testing after bug fixes to verify that no new bugs were introduced in the product.
- Extracted, transformed, and analyzed measures from multiple sources to generate reports, dashboards, and analytical solutions for increasing sales.
Technologies: Python, MS SQL SERVER, T-SQL, SSIS, SSRS, SQL Server Management Studio, Oracle, Excel, Tableau, Informatica.
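Illustrative sketch referenced above - a minimal Python batch extract-and-load step against SQL Server. Server, database, table, and column names are hypothetical, and pyodbc is an assumed library, not named in the stack above.

```python
# Minimal batch ETL sketch: pull yesterday's orders from a transactional SQL Server
# database and stage them into a warehouse table. Servers, databases, tables, and
# columns are hypothetical placeholders; pyodbc is an assumed dependency.
import pyodbc

SOURCE_DSN = ("DRIVER={ODBC Driver 17 for SQL Server};"
              "SERVER=src-host;DATABASE=sales;Trusted_Connection=yes")
TARGET_DSN = ("DRIVER={ODBC Driver 17 for SQL Server};"
              "SERVER=dw-host;DATABASE=warehouse;Trusted_Connection=yes")

EXTRACT_SQL = """
    SELECT order_id, customer_id, order_date, amount
    FROM dbo.Orders
    WHERE order_date = CAST(GETDATE() - 1 AS DATE)
"""

INSERT_SQL = ("INSERT INTO staging.Orders (order_id, customer_id, order_date, amount) "
              "VALUES (?, ?, ?, ?)")

with pyodbc.connect(SOURCE_DSN) as src, pyodbc.connect(TARGET_DSN) as tgt:
    rows = src.cursor().execute(EXTRACT_SQL).fetchall()
    # Basic validation before load: skip rows missing the business key.
    rows = [r for r in rows if r.order_id is not None]
    if rows:
        cur = tgt.cursor()
        cur.fast_executemany = True
        cur.executemany(INSERT_SQL, rows)
        tgt.commit()
```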