Sr. Data Engineer Resume
Irving, TX
SUMMARY
- Accomplished IT professional with 7+ years of experience, specializing in the Big Data ecosystem - Data Acquisition, Ingestion, Modeling, Storage, Analysis, Integration, and Data Processing.
- A Data Science enthusiast with strong problem-solving, debugging, and analytical capabilities, who actively engages in understanding and delivering business requirements.
- Collaborated closely with business, product, production support, and engineering teams on a regular basis to dive deep into data, drive effective decision making, and support analytics platforms.
- Strong Hadoop and platform support experience with the entire suite of tools and services in major Hadoop distributions - Cloudera, Amazon EMR, Azure HDInsight, and Hortonworks.
- Extensive working experience with Big data ecosystem - Hadoop (HDFS, MapReduce, Yarn), Spark, Kafka, Hive, Impala, HBase, Sqoop, Pig, Airflow, Oozie, Zookeeper, Ambari, Flume, Nifi.
- Sound experience with AWS cloud services (EMR, EC2, RDS, EBS, S3, Kinesis, Lambda, Glue, Athena, Elasticsearch, SQS, DynamoDB, Redshift, ECS).
- Working knowledge of Azure cloud components (HDInsight, Databricks, DataLake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, CosmosDB).
- Excellent knowledge of Hadoop cluster architecture and its key concepts - Distributed file systems, Parallel processing, High availability, Fault tolerance and Scalability.
- Obtained and processed data from Enterprise applications, Clickstream events, API gateways, Application logs and database updates.
- Proficient at writing MapReduce jobs and UDFs to gather, analyze, transform, and deliver data as per business requirements.
- Expertise in building PySpark and Spark-Scala applications for interactive analysis, batch processing, and stream processing.
- Acquired profound knowledge in developing production ready Spark applications utilizing Spark Core, Spark Streaming, Spark SQL, DataFrames, Datasets and Spark-ML.
- Experienced in writing Spark scripts in Python, Scala, Java and SQL for development and analysis.
- Proficient at using Spark APIs for streaming real-time data, staging, cleansing, applying transformations, and preparing data for machine learning needs.
- Worked with various streaming ingest services such as Kafka, Kinesis, Flume, and JMS.
- Involved in end-to-end implementation of Enterprise Data Lakes with batch and real-time processing using Spark Streaming, Kafka, Flume, and Sqoop.
- Extensive experience developing Bash, T-SQL, and PL/SQL scripts.
- Extensive experience using Sqoop to ingest data from RDBMS - Oracle, MS SQL Server, Teradata, PostgreSQL, and MySQL.
- Experience working with NoSQL databases and their integration with Hadoop clusters - HBase, Cassandra, MongoDB, DynamoDB, and CosmosDB.
- Leveraged different file formats - Parquet, Avro, ORC, and flat files. Used Snappy and GZIP compression codecs to optimize storage and processing.
- Sound knowledge in developing highly scalable and resilient RESTful APIs, ETL solutions, and third-party platform integrations as part of the Enterprise Site platform.
- Sound experience in building production ETL pipelines between several source systems and Enterprise Data Warehouse by leveraging Informatica PowerCenter, SSIS, SSAS and SSRS.
- Experience in all phases of Data Warehouse development like requirements gathering, design, development, implementation, testing, and documentation.
- Solid knowledge of dimensional data modeling with Star and Snowflake schemas for fact and dimension tables using Analysis Services.
- Extensive knowledge in Data transformations, Mapping, Cleansing, Monitoring, Debugging, performance tuning and troubleshooting Hadoop clusters.
- Hands-on working experience with RESTful APIs, API lifecycle management, and consuming RESTful services.
- Worked in both Agile and Waterfall environments. Used Git and SVN version control systems.
- Experience with Jira and HP ALM for project management and the Control-M scheduling tool. Also used Cron and Autosys to schedule jobs.
- Experience in designing interactive dashboards and reports, and performing ad-hoc analysis and visualizations using Tableau, Power BI, Arcadia, and Matplotlib.
- Used Kerberos, Azure AD, Sentry, and Ranger for maintaining security.
- Involved in developing an identity and access management application with Spring, Java, and a microservices SOA architecture. Also involved in API development.
- Sound knowledge and hands-on experience with NLP, MapR, IBM InfoSphere suite, Storm, Flink, Talend, ER Studio, and Ansible.
- Works successfully in fast-paced environments, both independently and collaboratively. Expertise in complex troubleshooting, root-cause analysis, and solution development.
TECHNICAL SKILLS
Big Data Ecosystem: HDFS, MapReduce, Yarn, Spark, Kafka, Kafka Connect, Airflow, Hive, Impala, StreamSets, Sqoop, HBase, Flume, Pig, Ambari, Oozie, Zookeeper, Nifi, Sentry, Ranger.
Programming Languages: Python, Scala, Java, R, Shell Scripting, Pig Latin, HiveQL.
NoSQL Database: Cassandra, MongoDB, Redis, Neo4j.
Database: MySQL, Teradata, Oracle, MS SQL SERVER, PostgreSQL, DB2.
Version Control: Git, SVN, Bitbucket
ETL/BI: Snowflake, Informatica, Talend, SSIS, SSRS, SSAS, ER Studio, Tableau, Power BI, Arcadia.
Web Development: JavaScript, Node.js, HTML, CSS, Spring, J2EE, JDBC, Okta, Postman, Angular, JFrog, Mockito, Flask, Hibernate, Maven, Tomcat, WebSphere.
Operating systems: Linux (Ubuntu, Centos, RedHat), Windows (XP/7/8/10)
Others: Machine learning, NLP, StreamSets, Spring Boot, Jupyter Notebook, Terraform, Docker, Kubernetes, Jenkins, Ansible, Splunk, Jira.
PROFESSIONAL EXPERIENCE
Sr. Data Engineer
Confidential, Irving, TX
Responsibilities:
- Involved in building a data pipeline and performed analytics using AWS stack (EMR, EC2, S3, RDS, Lambda, Kinesis, Athena, Glue, SQS, Redshift, and ECS).
- Worked with the Data Science team running machine learning models on a Spark EMR cluster and delivered data as per business requirements.
- Automated the process of transforming and ingesting terabytes of monthly data in Parquet format using Kinesis, S3, Lambda and Airflow.
- Loaded data into S3 buckets using AWS Glue and PySpark. Involved in filtering data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables. Utilized Spark's in-memory capabilities to handle large datasets on the S3 data lake (see the PySpark sketch after this list).
- Developed Spark jobs on Databricks to perform tasks like data cleansing, data validation, standardization, and then applied transformations as per the use cases.
- Migrated Java analytical applications into Scala. Used Scala where performance and logic are critical.
- Created workflows using Airflow to automate the process of extracting weblogs into the S3 data lake.
- Involved in developing batch and stream processing applications that require functional pipelining using Spark Scala and Streaming API.
- Involved in extracting and enriching multiple Cassandra tables using joins in Spark SQL. Also converted Hive queries into Spark transformations.
- Hands-on experience on API design and development using Spring Boot for Data movement across different systems.
- Fetched live data from an Oracle database with Spark Streaming and Amazon Kinesis, using the feed from an API Gateway REST service.
- Performed ETL operations using Python, SparkSQL, S3 and Redshift on terabytes of data to obtain customer insights.
- Performed interactive analytics like cleansing, validation, and quality checks on data stored in S3 buckets using AWS Athena.
- Involved in writing Python scripts to automate ETL pipelines and DAG workflows using Airflow (a minimal DAG sketch follows this list). Managed communication between multiple services by distributing tasks on Celery workers.
- Integrated applications using Apache Tomcat servers on EC2 instances and automated data pipelines into AWS using Jenkins, Git, Maven, and Artifactory.
- Involved in writing unit tests, worked along with the DevOps team installing libraries and Jenkins agents, and productionized ETL jobs and microservices.
- Involved in developing a custom-built Rest API to support real time customer analytics for data scientists and applications.
- Managed and deployed configurations for the entire datacenter infrastructure using Terraform.
- Experience with analytical reporting and facilitating data for Quicksight and Tableau dashboards.
- Used Git for version control and Jira for project management, tracking issues and bugs.
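A minimal PySpark sketch of the S3-to-Parquet ingestion pattern described above; bucket names, paths, and column names are hypothetical placeholders rather than actual project values:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("monthly-parquet-ingest")
         .getOrCreate())

# Read raw monthly event files landed in S3 (e.g. via Kinesis/Lambda).
raw = spark.read.json("s3://example-raw-bucket/events/2020/01/")

# Basic cleansing and standardization before persisting.
cleaned = (raw
           .dropDuplicates(["event_id"])
           .withColumn("event_ts", F.to_timestamp("event_time"))
           .withColumn("event_date", F.to_date("event_ts"))
           .filter(F.col("event_ts").isNotNull()))

# Write to the curated zone as Parquet, partitioned for Hive/Athena queries.
(cleaned.write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("s3://example-curated-bucket/events/"))
```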
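A minimal Airflow DAG sketch for the daily ETL automation described above (Airflow 1.x import paths); the DAG id, task names, and callables are placeholders:

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract_weblogs(**context):
    # Placeholder: pull web logs and stage them in S3.
    pass

def transform_to_parquet(**context):
    # Placeholder: trigger the Spark/Glue transformation job.
    pass

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="weblog_ingest_example",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_weblogs",
                             python_callable=extract_weblogs,
                             provide_context=True)
    transform = PythonOperator(task_id="transform_to_parquet",
                               python_callable=transform_to_parquet,
                               provide_context=True)
    # Run the extraction before the transformation step.
    extract >> transform
```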
Environment: AWS, EC2, S3, Athena, Lambda, Glue, Elasticsearch, RDS, DynamoDB, Redshift, ECS, Hadoop 2.x, Hive v2.3.1, Spark v2.1.3, Databricks, Python, Java, Scala, SQL, Sqoop v1.4.6, Kafka, Airflow v1.9.0, HBase, Oracle, Cassandra, MLlib, Quicksight, Tableau, Maven, Git, Jira.
Data Engineer
Confidential, Wayne, PA
Responsibilities:
- Experience in working with Azure cloud platform (HDInsight, Databricks, DataLake, Blob Storage, Data Factory, Synapse, SQL DB, SQL DWH and Data Storage Explorer).
- Involved in building an Enterprise DataLake using Data Factory and Blob storage, enabling other teams to work with more complex scenarios and ML solutions.
- Used Azure Data Factory with the SQL API and Mongo API to integrate data from MongoDB, MS SQL, and cloud sources (Blob, Azure SQL DB).
- Developed Spark Scala scripts for mining data and performed transformations on large datasets to provide real time insights and reports.
- Supported analytical platform, handled data quality, and improved the performance using Scala’s higher order functions, lambda expressions, pattern matching and collections.
- Implemented scalable microservices with Scala and Akka to handle concurrency and high traffic. Optimized existing Scala code and improved the cluster performance.
- Performed data cleansing and applied transformations using Databricks and Spark for data analysis (see the PySpark sketch after this list).
- Used Azure Synapse to manage processing workloads and served data for BI and prediction needs.
- Designed and automated Custom-built input adapters using Spark, Sqoop and Airflow to ingest and analyze data from RDBMS to Azure Datalake.
- Reduced access time by refactoring data models, optimizing queries, and implementing a Redis cache to support Snowflake.
- Involved in developing automated workflows for daily incremental loads, moved data from RDBMS to Data Lake.
- Monitored Spark cluster using Log Analytics and Ambari Web UI. Transitioned log storage from MS SQL to CosmosDB and improved the query performance.
- Created Automated ETL jobs in Talend and pushed the data to Azure SQL data warehouse.
- Managed resources and scheduling across the cluster using Azure Kubernetes Service.
- Used Azure DevOps for CI/CD, debugging and monitoring jobs and applications. Used Azure Active Directory and Ranger for security.
- Worked with the data science team on preprocessing and feature engineering, and supported machine learning algorithms running in production.
- Fine-tuned parameters of Spark NLP applications like batch interval time, level of parallelism, memory tuning to improve the processing time and efficiency.
- Facilitated data for interactive Power BI dashboards and reporting purposes.
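A minimal PySpark sketch of the Databricks cleansing and standardization step described above; the storage account, container, and column names are illustrative placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("databricks-cleansing").getOrCreate()

# Read raw CSV landed in Blob storage / Data Lake (path is illustrative).
raw = (spark.read
       .option("header", "true")
       .csv("wasbs://raw@exampleaccount.blob.core.windows.net/orders/"))

# Validate, standardize, and drop malformed or duplicate rows.
cleaned = (raw
           .withColumn("order_amount", F.col("order_amount").cast("double"))
           .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
           .filter(F.col("order_id").isNotNull())
           .dropDuplicates(["order_id"]))

# Persist the curated output for downstream Synapse / Power BI consumption.
cleaned.write.mode("overwrite").parquet(
    "wasbs://curated@exampleaccount.blob.core.windows.net/orders/")
```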
Environment: Azure (HDInsight, Databricks, DataLake, Blob Storage, DataFactory, SQL DB, SQL DWH, AD, AKS), Scala, Python, Hadoop 2.x, Spark v2.0.2, NLP, Airflow v1.8.2, Hive v2.0.1, Sqoop v1.4.6, HBase, Oozie, Talend, CosmosDB, MS SQL, MongoDB, Ambari, PowerBI, Azure DevOps, Ranger, Git.
Big Data Developer
Confidential
Responsibilities:
- Responsible for building scalable and distributed data solutions using Cloudera CDH .
- Involved in migrating large amounts of data from an on-prem Cloudera cluster to an Amazon EMR cluster running on EC2 instances.
- Gathered data and performed analytics using the AWS stack (EMR, EC2, S3, RDS, Lambda, Redshift).
- Developed an ETL pipeline to extract archived logs from disparate sources and store them in the S3 data lake. Used Cron and AutoSys schedulers for weekly automation.
- Implemented Spark Scala UDFs to handle data quality and to filter and validate data sets. Also involved in converting Java analytical applications to Scala.
- Involved in converting Java MapReduce jobs to Scala UDFs and improved the performance.
- Analyzed and optimized pertinent data stored in Snowflake using PySpark and Spark SQL (a minimal connector sketch follows this list).
- Worked with Impala for massive parallel processing of queries for ad-hoc analysis. Designed and developed complex queries using Hive and Impala for a logistics application.
- Developed Sqoop jobs for data ingestion and incremental data loads from RDBMS to Snowflake.
- Created Bash scripts to add dynamic partitions to Hive staging tables. Responsible for loading bulk amount of data into HBase using MapReduce jobs.
- Loaded data from web servers using Flume and Spark Streaming API. Used flume sink to write directly to indexers deployed on cluster, allowing indexing during ingestion.
- Involved in creating broker topics, producers, and consumers to monitor, process and archive live streaming data.
- Coordinated with the Kafka team and built an on-premises data pipeline. Supported Kafka integrations, performance tuning, and identified bottlenecks to improve performance and throughput.
- Used StreamSets for analytics and Involved in debugging and optimizing data pipelines collecting logs and metrics from various application APIs.
- Involved in creating database schemas and objects like tables, views, stored procedures, triggers, packages, and functions to provide structure and maintain data efficiently.
- Developed Oozie workflows for scheduling and orchestrating the ETL process.
- Managed and deployed configurations for the entire datacenter infrastructure using Terraform.
- Used Cloudera Hue and Zeppelin notebooks to interact with HDFS cluster. Used Cloudera Manager, Search and Navigator to configure and monitor resource utilization across the cluster.
- Used Arcadia to connect with Impala, and designed interactive dashboards and reports for the BI team.
- Presented creative business insights with KPI reports and delivered actionable insights by identifying significant and correlated variables.
- Involved in setting up CI/CD pipelines using Jenkins. Involved in writing Groovy scripts to automate the Jenkins pipeline's integration and delivery service.
- Also worked on resolving several tickets generated when issues arose in production pipelines.
- Used Kerberos for authentication and Apache Sentry for authorization.
- Used Git for version control and interacted with the onsite team for deliverables.
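A minimal sketch of reading Snowflake data through the Spark-Snowflake connector for ad-hoc analysis, as described above; the connection values and table/column names are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("snowflake-analysis").getOrCreate()

# Connector options (placeholder credentials and warehouse).
sf_options = {
    "sfURL": "example_account.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "********",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
}

# Load a logistics table from Snowflake into a Spark DataFrame.
shipments = (spark.read
             .format("net.snowflake.spark.snowflake")
             .options(**sf_options)
             .option("dbtable", "SHIPMENTS")
             .load())

# Example ad-hoc aggregation with Spark SQL functions.
(shipments.groupBy("ORIGIN_HUB")
 .agg(F.avg("TRANSIT_DAYS").alias("avg_transit_days"))
 .show())
```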
Environment: Cloudera CDH4, Hue, Zeppelin, Hadoop 2.x, AWS (EMR, EC2, S3, Lambda, RDS, DynamoDB), Snowflake, Spark v1.6.0, Python, Scala, Sqoop v1.4.5, Hive v1.2.1, Impala, HBase, Kafka v0.9.0.0, Spring Boot, Talend, Flume, StreamSets, Oozie v4.1.0, Terraform, Zookeeper, Druid, PostgreSQL, Jenkins, Arcadia, Sentry, Kerberos.
Big Data Developer
Confidential
Responsibilities:
- Worked with Hortonworks distribution. Installed, configured, and maintained a Hadoop cluster based on the business requirements.
- Experience with Apache big data components like HDFS, MapReduce, YARN, Hive, HBase, Sqoop, Pig, Ambari, and Nifi.
- Involved in end-to-end implementation of ETL pipelines using Python and SQL for high-volume analytics; also reviewed use cases before onboarding to HDFS.
- Responsible for loading, managing, and reviewing terabytes of log files using the Ambari web UI.
- Involved in writing rack topology scripts and Java MapReduce programs to parse raw data.
- Migrated from JMS Solace to Apache Kafka; used Zookeeper to manage synchronization, serialization, and coordination across the cluster.
- Used Sqoop to migrate data between traditional RDBMS and HDFS. Ingested data from MS SQL, Teradata, and Cassandra databases.
- Identified required tables and views and exported them into Hive. Performed ad-hoc queries using Hive joins, partitioning, and bucketing techniques for faster data access.
- Used Nifi to automate the data flow between disparate systems. Designed dataflow models and complicated target tables to obtain relevant metrics from various sources.
- Developed Bash scripts to get log files from FTP server and executed Hive jobs to parse them.
- Implemented various Hive queries for analytics. Created external tables, optimized Hive queries, and improved the cluster performance by 30%.
- Performed data analysis using HiveQL, Pig Latin, and custom MapReduce programs in Java.
- Enhanced scripts of existing Python modules. Worked on writing APIs to load the processed data to HBase tables (a minimal sketch follows this list).
- Migrated ETL jobs to Pig scripts to apply joins, aggregations, and transformations.
- Used Power BI as a front-end BI tool and MS SQL Server as a back-end database to design and develop dashboards, workbooks, and complex aggregate calculations.
- Troubleshot defects by identifying root causes and fixed them during the QA phase.
- Used Jenkins for CI/CD and SVN for version control.
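A minimal sketch of loading processed rows into HBase through the Thrift gateway with happybase, as referenced above; the host, table, row keys, and column family names are placeholders:

```python
import happybase

# Connect to the HBase Thrift server (placeholder host/port).
connection = happybase.Connection("hbase-thrift.example.com", port=9090)
table = connection.table("web_events")

processed_rows = [
    {"row_key": "user123|2018-05-01", "cf:page": "/home", "cf:duration": "42"},
]

# Batch writes reduce round trips to the Thrift server.
with table.batch(batch_size=1000) as batch:
    for row in processed_rows:
        key = row.pop("row_key")
        batch.put(key.encode("utf-8"),
                  {k.encode("utf-8"): v.encode("utf-8") for k, v in row.items()})

connection.close()
```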
Environment: Hortonworks 2.0, Hadoop, Hive v1.0.0, HBase, Sqoop v1.4.4, Pig v0.12.0, Zookeeper, Kafka v0.8.1, Nifi, Ambari, Python, SQL, Java, Teradata, MS SQL, Cassandra, Power BI, Jenkins, SVN, Jira.
Data Analyst
Confidential
Responsibilities:
- Played a key role in gathering business requirements, system and design requirements, gap analysis, use case diagrams and flow charts.
- Performed ETL operations with Informatica PowerCenter - data extraction, staging, applying transformations, and loading into target data stores.
- Parsed complex files using Informatica data transformations (Normalizer, Lookup, Source Qualifier, Expression, Aggregator, Sorter, Rank, and Joiner) and loaded them into databases.
- Created complex SQL queries and scripts to extract, aggregate and validate data from MS SQL, Oracle, and flat files using Informatica and loaded into a single data warehouse repository.
- Involved in creating database objects like tables, views, stored procedures, triggers, packages, and functions using T-SQL to provide structure and maintain data efficiently.
- Designed SQL, SSIS, and Python based batch and real-time ETL pipelines to extract data from transactional and operational databases and load the data into target databases/data warehouses.
- Involved in writing Python scripts to extract data from different APIs (see the sketch after this list).
- Responsible for collecting, scrubbing, and extracting data, generated compliance reports using SSRS, analyzed and identified market trends to improve product sales.
- Performed data profiling, answered complex business questions by providing data to business users.
- Generated DDL and created the tables and views in the corresponding architectural layers.
- Extracted, transformed, and analyzed measures/indicators from multiple sources to generate reports, dashboards, and analytical solutions.
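A minimal sketch of the API-extraction pattern described above: pull JSON from a REST endpoint and stage it in SQL Server. The URL, token, table name, and connection string are placeholders, not actual project values:

```python
import pandas as pd
import requests
from sqlalchemy import create_engine

API_URL = "https://api.example.com/v1/transactions"
HEADERS = {"Authorization": "Bearer <token>"}

# Pull one page of records from the REST API.
response = requests.get(API_URL, headers=HEADERS, timeout=30)
response.raise_for_status()
records = response.json()["data"]

# Flatten into a DataFrame and load into a SQL Server staging table
# for downstream SSIS / T-SQL processing.
df = pd.json_normalize(records)
engine = create_engine(
    "mssql+pyodbc://etl_user:password@sql-server/StagingDB"
    "?driver=ODBC+Driver+17+for+SQL+Server")
df.to_sql("stg_transactions", engine, if_exists="append", index=False)
```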
Environment: Python, Informatica v9.x, MS SQL SERVER, T-SQL, SSIS, SSRS, SQL Server Management Studio, Oracle, Excel.
