Data Engineer Resume
Fort Worth, TX
SUMMARY
- 8+ years of hands-on professional experience as a Big Data Developer with expertise in Python, Spark, the Hadoop ecosystem, and cloud services.
- Experienced in the development, implementation, deployment, and maintenance of complete end-to-end Hadoop-based data analytics solutions using HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, HBase, NiFi, and Ambari.
- Extensive experience in developing applications that perform Data Processing tasks using Teradata, Oracle, SQL Server, and MySQL databases.
- Experience in extracting, transforming, and loading (ETL) data from various sources into data warehouses, as well as in collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, and Power BI.
- Acquired profound knowledge in developing production-ready Spark applications utilizing Spark Core, Spark Streaming, Spark SQL, Data Frames, Datasets, and Spark-ML.
- Profound experience in creating real-time data streaming solutions using PySpark/Spark Streaming and Kafka (a minimal sketch appears at the end of this summary).
- Worked on NoSQL databases including HBase, Cassandra, and MongoDB.
- Strong Hadoop and platform support experience with the entire suite of tools and services in major Hadoop Distributions - Cloudera, Amazon EMR, Azure HDInsight, and Hortonworks.
- Proficient in data visualization libraries such as Matplotlib, Seaborn, and ggplot2, and in PySpark.
- In-depth Knowledge of Hadoop Architecture and its components such as HDFS, Yarn, Resource Manager, Node Manager, Job History Server, Job Tracker, Task Tracker, Name Node, Data Node, and MapReduce.
- Experience developing iterative algorithms using Spark Streaming in Scala and Python to build near real-time dashboards.
- Experienced in loading data into Hive partitions, creating buckets in Hive, and developing MapReduce jobs to automate the transfer of data from HBase.
- Expertise in working with AWS cloud services like EMR, S3, Redshift, Lambda, DynamoDB, RDS, SNS, SQS, Glue, Data Pipeline, and Athena for big data development.
- Worked with various file formats such as CSV, JSON, XML, ORC, Avro, and Parquet.
- Worked on data processing, transformations, and actions in Spark using Python (PySpark).
- Experienced in orchestrating, scheduling, and monitoring jobs with tools like Oozie and Airflow.
- Extensive experience utilizing Sqoop to ingest data from RDBMSs such as Oracle, MS SQL Server, Teradata, PostgreSQL, and MySQL.
- Worked with various ingestion services for batch and real-time data handling using Spark Streaming, Kafka, Storm, Flume, and Sqoop.
- Expertise in writing DDL and DML scripts in SQL and HQL for analytics applications on RDBMS.
- Expertise in Python and shell scripting; experienced in writing Spark scripts in Python, Scala, and SQL for development and data analysis.
- Proficient in building PySpark and Scala applications for interactive analysis, batch processing, and stream processing.
- Involved in all the phases of the Software Development Life Cycle (Requirements Analysis, Design, Development, Testing, Deployment, and Support) and Agile methodologies.
- Strong understanding of data modeling and ETL processes in data warehouse environments such as star schema and snowflake schema.
- Developed mappings in Informatica to load the data including facts and dimensions from various sources into the Data Warehouse using different transformations like Source Qualifier, JAVA, Expression, Lookup, Aggregate, Update Strategy, and Joiner.
- Strong working knowledge across the technology stack including ETL, data analysis, data cleansing, data matching, data quality, audit, and design.
- Experienced with continuous integration and build tools such as Jenkins, and with Git and SVN for version control.
- Utilized Kubernetes and Docker as the runtime environment for the CI/CD system to build, test, and deploy applications.
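A minimal sketch of the kind of PySpark Structured Streaming pipeline referenced above, consuming JSON events from Kafka and landing them as Parquet for near real-time dashboards; the broker, topic, schema, and sink paths are hypothetical placeholders, and the job assumes the spark-sql-kafka connector is on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("kafka-streaming-sketch").getOrCreate()

    # Schema of the incoming JSON events (hypothetical).
    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    # Read a stream of raw Kafka records; broker and topic names are placeholders.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "transactions")
           .load())

    # Parse the JSON payload and write the stream out as Parquet for downstream dashboards.
    parsed = (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
                 .select("e.*"))

    query = (parsed.writeStream
             .format("parquet")
             .option("path", "s3a://example-bucket/streams/transactions/")
             .option("checkpointLocation", "s3a://example-bucket/checkpoints/transactions/")
             .outputMode("append")
             .start())
    query.awaitTermination()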
TECHNICAL SKILLS
Big Data Eco System: HDFS, Spark, MapReduce, Hive, Pig, Sqoop, Flume, HBase, Kafka Connect, Impala, StreamSets, Oozie, Airflow, Zookeeper, Amazon Web Services.
Hadoop Distributions: Apache Hadoop 1.x/2.x, Cloudera CDP, Hortonworks HDP
Languages: Python, Scala, Java, Pig Latin, HiveQL, Shell Scripting.
Software Methodologies: Agile, Waterfall (SDLC).
Databases: MySQL, Oracle, DB2, PostgreSQL, DynamoDB, MS SQL Server, Snowflake.
NoSQL: HBase, MongoDB, Cassandra.
ETL/BI: Power BI, Tableau, Informatica.
Version control: GIT, SVN, Bitbucket.
Operating Systems: Windows (XP/7/8/10), Linux (Unix, Ubuntu), Mac OS.
Cloud Technologies: Amazon Web Services (EC2, S3, SQS, SNS, Lambda, EMR, CodeBuild, CloudWatch), Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, Cosmos DB, Azure DevOps, Active Directory).
PROFESSIONAL EXPERIENCE
Confidential, Fort Worth, TX
Data Engineer
Responsibilities:
- Migrated terabytes of data from the data warehouse into the cloud environment in incremental loads; built data pipelines with Airflow to schedule PySpark jobs for the incremental loads and used Flume for weblog server data. Created Airflow scheduling scripts in Python (a sketch of such a DAG follows this list).
- Developed Spark jobs on Databricks to perform tasks like data cleansing, data validation, and standardization, and then applied transformations as per the use cases.
- Created and provisioned different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters.
- Created several Databricks Spark jobs with PySpark to perform table-to-table operations. Developed a high-speed business intelligence layer on the Hadoop platform with PySpark and Python.
- Performed data cleansing and applied transformations using Databricks and Spark data analysis. Developed Spark applications in PySpark and Scala to perform cleansing, transformation, and enrichment of the data.
- Utilized Spark-Scala API to implement batch processing of jobs. Developed Spark-Streaming applications to consume the data from Snowflake and to insert the processed streams to DynamoDB.
- Developed ingestion modules using AWS Step functions, AWS Glue and Python modules.
- Experienced with installation of AWS CLI to control various AWS services through SHELL/BASH scripting.
- Utilized Spark in-memory capabilities to handle large datasets.
- Experienced with event-driven and scheduled AWS Lambda functions to trigger events across a variety of AWS resources using boto3 modules.
- Used broadcast variables in Spark, effective and efficient joins, transformations, and other capabilities for data processing.
- Developed custom aggregate functions using Spark SQL and performed interactive querying. Fine-tuned Spark applications/jobs to improve efficiency and overall processing time for the pipelines.
- Created Hive tables and loaded and analyzed data using Hive scripts.
- Created partitioned and bucketed Hive tables in the Parquet file format (for efficient storage) with Snappy compression, then loaded data into the Parquet Hive tables from Avro Hive tables.
- Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro files, JSON files, and XML files. Mastered the use of columnar file formats like ORC and Parquet.
- Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to carry out streaming analytics in Databricks.
- Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest and analyze data (Snowflake, MS SQL, MongoDB) into HDFS.
- Developed automated job flows in Oozie that run MapReduce jobs internally, executed daily and on demand.
- Extracted Tables and exported data from Teradata through Sqoop and placed them in Cassandra. Created Spark JDBC APIs for importing/exporting data from Snowflake to S3 and vice versa. Experienced in working with EMR cluster and S3 in AWS cloud.
- Automated the data pipeline to ETL all the Datasets along with full loads and incremental loads of data.
- Utilized AWS services like EMR, S3, the Glue metastore, and Athena extensively for building the data applications.
- Involved in creating Hive external tables to perform ETL on data that is produced on a daily basis.
- Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer. Validated the data being ingested into Hive for further filtering and cleansing.
- Used Jenkins pipelines to drive all microservice builds out to the Docker registry and then deployed them to Kubernetes; created and managed Pods using Kubernetes.
- Performed all necessary day-to-day Git support for different projects; responsible for the design and maintenance of the Git repositories and the access control strategies.
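A minimal sketch of an Airflow DAG of the kind described above, scheduling a daily PySpark incremental load; it assumes Airflow 2.x with the Apache Spark provider installed, and the DAG id, script path, and connection id are hypothetical placeholders.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    default_args = {
        "owner": "data-engineering",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
    }

    with DAG(
        dag_id="incremental_warehouse_load",      # hypothetical DAG name
        default_args=default_args,
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        incremental_load = SparkSubmitOperator(
            task_id="pyspark_incremental_load",
            application="/opt/jobs/incremental_load.py",   # hypothetical PySpark script path
            conn_id="spark_default",
            application_args=["--load-date", "{{ ds }}"],  # pass the execution date to the job
        )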
Environment: Python, SQL, Oracle, Hive, Scala, Power BI, Docker, MongoDB, Kubernetes, SQS, PySpark, Kafka, Data Warehouse, Big Data, MS SQL, Hadoop, Airflow, Spark, EMR, EC2, S3, Git, Lambda, Glue, ETL, Databricks, Snowflake, AWS Data Pipeline.
Confidential, Chicago, IL
Data Engineer/ Data Scientist
Responsibilities:
- Performed data analysis and developed analytic solutions; investigated data to discover correlations and trends and to explain them. Used the Python Pandas library to import data from databases through different APIs.
- Worked with data engineers and data architects to define back-end requirements for data products (aggregations, materialized views, tables, and visualization in Python).
- Developed frameworks and processes to analyze unstructured information. Assisted in Azure Power BI architecture design.
- Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression, and k-means.
- Implemented statistical and deep learning models (logistic regression, XGBoost, random forest, SVM, RNN, CNN).
- Created Python and Bash tools to increase the efficiency of the retail management application system and operations: data conversion scripts, AMQP/RabbitMQ, REST, JSON, and CRUD scripts for API integration.
- Developed various MySQL database queries from Python using the Python MySQL Connector and MySQL database packages.
- Hands-on experience with AWS IAM services, including creating roles, users, and groups; experienced in implementing MFA to provide strong security for the AWS account and its resources.
- Created an AWS Lambda deployment function and configured it to receive events from an S3 bucket.
- Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as AWS S3 and Amazon DynamoDB.
- Designed and developed Oracle PL/SQL and shell scripts for data import/export, data conversion, and data cleansing; responsible for importing data from PostgreSQL into HDFS and Hive using the Sqoop tool.
- Experienced in migrating HiveQL into Impala to minimize query response time.
- Hands-on experience in installing, configuring, and using Hadoop ecosystem components like Hadoop MapReduce, HDFS, HBase, Hive, Spark, Sqoop, Pig, Zookeeper, and Flume.
- Handled data warehouse and business intelligence architecture design and development. Designed the ETL process from various sources into Hadoop/HDFS for analysis and further processing of data modules.
- Designed and created Azure Data Factory (ADF) pipelines extensively for ingesting data from relational and non-relational source systems to meet business functional requirements.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Spark SQL, and Azure Data Lake Analytics.
- Created and provisioned numerous Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters.
- Developed ADF pipelines to load data from on-prem systems to Azure cloud storage and databases.
- Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
- Worked extensively with SparkContext, Spark SQL, RDD transformations, actions, and DataFrames. Developed custom ETL solutions, batch processing, and real-time data ingestion pipelines to move data in and out of Hadoop using PySpark and shell scripting.
- Ingested a huge volume and variety of data from disparate source systems into Azure Data Lake Gen2 using Azure Data Factory V2 and Azure cluster services.
- Created Spark RDDs from data files and then performed transformations and actions to other RDDs.
- Created Hive tables with dynamic and static partitioning, including buckets, for efficiency; also created external tables in Hive for staging purposes.
- Loaded Hive tables with data, wrote Hive queries that run on MapReduce, and created a customized BI tool for management teams that performs query analytics using HiveQL.
- Wrote UDFs in Scala and PySpark to meet specific business requirements (a PySpark sketch follows this list). Experienced in developing Spark applications using Spark SQL in EMR for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Utilized Spark in-memory capabilities to handle large datasets. Used broadcast variables in Spark, effective and efficient joins, transformations, and other capabilities for data processing.
- Experienced in working with EMR clusters and S3 in the AWS cloud; created Hive tables and loaded and analyzed data using Hive scripts. Implemented partitioning (both dynamic and static) and bucketing in Hive. Involved in continuous integration of applications using Jenkins.
- Led the installation, integration, and configuration of Jenkins CI/CD, including installation of Jenkins plugins.
- Implemented a CI/CD pipeline with Docker, Jenkins, and GitHub by virtualizing the servers with Docker for the Dev and Test environments and automating builds through containerization.
- Installed, configured, and administered the Jenkins CI tool using Chef on AWS EC2 instances.
- Performed Code Reviews and was responsible for Design, Code, and Test signoff.
- Worked on designing and developing the real-time tax computation engine using Oracle, StreamSets, Kafka, and Spark Structured Streaming.
- Validated data transformations and performed End-to-End data validations for ETL workflows loading data from XMLs to EDW.
- Extensively utilized Informatica to create a complete ETL process and load data into the database which was to be used by Reporting Services.
- Created Tidal Job events to schedule the ETL extract workflows and to modify the tier point notifications.
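A minimal sketch of the kind of PySpark UDF mentioned above, registered and applied to a DataFrame column; the column names, paths, and normalization rule are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

    @udf(returnType=StringType())
    def normalize_state(value):
        # Trim and uppercase a state code; return None for missing values.
        return value.strip().upper() if value else None

    # Hypothetical input and output locations.
    customers = spark.read.parquet("s3://example-bucket/customers/")
    cleaned = customers.withColumn("state_code", normalize_state(col("state")))
    cleaned.write.mode("overwrite").parquet("s3://example-bucket/customers_clean/")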
Environment: Python, SQL, Oracle, Hive, Scala, Power BI, Python Visualization, Docker, MongoDB, Kubernetes, PySpark, SNS, Kafka, Data Warehouse, Sqoop, Pig, Zookeeper, Flume, Hadoop, Airflow, Spark, EMR, EC2, S3, Git, GCP, Lambda, Glue, ETL, Databricks, Snowflake, AWS Data Pipeline.
Confidential
Data Engineer
Responsibilities:
- Developed MapReduce jobs in both Pig and Hive for data cleaning and pre-processing.
- Imported legacy data from SQL Server and Teradata into Amazon S3 and created consumption views on top of metrics to reduce the running time of complex queries.
- Exported data into Snowflake by creating staging tables to load data from different files in Amazon S3; compared the data in a leaf-level process across various databases whenever data transformation or loading took place.
- Created Snowpipe for continuous data loading from staged data residing on cloud gateway servers.
- Used the Snowflake Time Travel feature to access historical data.
- Heavily involved in testing Snowflake to determine the best possible way to use the cloud resources.
- Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, bq command-line utilities, Dataproc, and Stackdriver.
- Used Cloud Functions with Python to load data into BigQuery for CSV files arriving in a GCS bucket (a sketch follows this list).
- Submitted Spark jobs using gsutil and spark-submit and executed them on the Dataproc cluster.
- Wrote a Python program to maintain raw file archival in a GCS bucket.
- Analyzed various types of raw files such as JSON, CSV, and XML with Python using Pandas, NumPy, etc.; analyzed data and provided insights using Python Pandas.
- Developed Spark code and Spark SQL/Streaming for faster testing and processing of data. Closely involved in scheduling daily and monthly jobs with preconditions/postconditions based on the requirements.
- Worked on Azure Data Factory to integrate data from both on-prem and cloud sources (Azure SQL DB) and applied transformations to load the results back to Azure Synapse.
- Managed, configured, and scheduled resources across the cluster using Azure Kubernetes Service.
- Monitor the Daily, Weekly, and Monthly jobs and provide support in case of failures/issues.
- Worked on analyzing Hadoop clusters and different big data analytic tools, worked with various HDFS file formats like Avro, Sequence File, and JSON.
- Working experience with data streaming processes using Kafka, Apache Spark, and Hive; developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
- Developed Spark scripts using Scala shell commands as per the requirements, imported data from Cassandra databases, and stored it in AWS.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs; used the AWS CLI for data transfers to and from Amazon S3 buckets.
- Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets; implemented workflows using the Apache Oozie framework to automate tasks.
- Implemented Spark RDD transformations and actions to carry out the business analysis; developed Spark scripts using Scala shell commands as per the requirements.
- Interacted with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.
- Worked on ingesting high volumes of tuning events generated by client set-top boxes from Elasticsearch in batch mode and from Amazon Kinesis Streaming in real time via Kafka brokers into the Enterprise Data Lake using Python and NiFi.
- Devised PL/SQL Stored Procedures, Functions, Triggers, Views, and packages. Made use of Indexing, Aggregation, and Materialized views to optimize query performance.
- Developed data pipelines with NiFi, which can consume any format of real-time data from a Kafka topic and push this data into the enterprise Hadoop environment.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the data received from Kafka.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
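A minimal sketch of the kind of Cloud Function described above, loading a newly arrived CSV from a GCS bucket into BigQuery on an object-finalize event; the project, dataset, and table names are hypothetical placeholders.

    from google.cloud import bigquery

    def load_csv_to_bigquery(event, context):
        """Background Cloud Function triggered by a GCS event carrying bucket and file name."""
        client = bigquery.Client()
        uri = f"gs://{event['bucket']}/{event['name']}"
        table_id = "my-project.raw_zone.daily_events"  # hypothetical destination table

        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        )
        load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
        load_job.result()  # block until the load job completes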
Environment: GCP, BigQuery, GCS Bucket, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, bq Command-Line Utilities, Dataproc, HDFS, MapReduce, Snowflake, Pig, NiFi, Hive, Kafka, Spark, PL/SQL, AWS, S3 Buckets, EMR, Scala, SQL Server, Cassandra, Oozie.
Confidential
Big Data Engineer
Responsibilities:
- Developed a Spark Streaming model that takes transactional data as input from multiple sources, creates micro-batches, and later processes them against an already-trained fraud detection model, separating out error records.
- Extensive knowledge in Data transformations, Mapping, Cleansing, Monitoring, Debugging, performance tuning, and troubleshooting Hadoop clusters.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
- Ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Developed DDL and DML scripts in SQL and HQL for creating tables and analyzing the data in RDBMS and Hive.
- Used Sqoop to import and export data from HDFS to RDBMS and vice-versa, Created Hive tables, and was involved in data loading and writing Hive UDFs.
- Exported the analyzed data to the relational database MySQL using Sqoop for visualization and generating reports.
- Loaded the flat file data using Informatica to the staging area. Researched and recommended a suitable technology stack for Hadoop migration considering current enterprise architecture.
- Worked on ETL process to clean and load large data extracted from several websites (JSON/ CSV files) to the SQL server.
- Performed data profiling, data pipelining, and data mining; validated and analyzed data (exploratory/statistical analysis) and generated reports.
- Responsible for building scalable distributed data solutions using Hadoop; selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
- Collecting and aggregating large amounts of log data and staging data in HDFS for further analysis. Used Sqoop to transfer data between relational databases and Hadoop.
- Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
- Queried both managed and external tables created by Hive using Impala. Developed different kinds of custom filters and handled pre-defined filters on HBase data using the API.
- Handled importing data from different data sources into HDFS using Sqoop, performed transformations using Hive, and then loaded the data into HDFS.
- Analyzed data stored in S3 buckets using SQL and PySpark, stored the processed data in Redshift, and validated the data sets by implementing Spark components (a sketch follows this list).
- Worked as an ETL and Tableau developer, heavily involved in designing, developing, and debugging ETL mappings using the Informatica Designer tool, and created advanced chart types, visualizations, and complex calculations to manipulate data using Tableau Desktop.
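A minimal sketch of the kind of PySpark job mentioned above, reading raw CSV data from S3, aggregating it, and writing the result to Redshift over JDBC; the bucket, table, column names, and connection details are hypothetical placeholders, and it assumes the Redshift JDBC driver is available on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("s3-to-redshift-sketch").getOrCreate()

    # Read raw event data from a hypothetical S3 location.
    raw = spark.read.option("header", "true").csv("s3a://example-bucket/raw/events/")

    # Aggregate successful events by date before loading into Redshift.
    aggregated = (raw.filter(col("status") == "OK")
                     .groupBy("event_date")
                     .count())

    (aggregated.write
        .format("jdbc")
        .option("url", "jdbc:redshift://example-cluster.redshift.amazonaws.com:5439/analytics")
        .option("dbtable", "public.daily_event_counts")
        .option("user", "etl_user")
        .option("password", "****")  # in practice, pull credentials from a secrets manager
        .option("driver", "com.amazon.redshift.jdbc42.Driver")
        .mode("append")
        .save())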
Environment: Spark, Hive, Python, HDFS, Sqoop, Tableau, HBase, Scala, AWS, Azure Data Lake, Azure Data Factory, Azure Storage, Azure SQL, MySQL, Impala, S3, EC2, Redshift, Informatica.