
Sr. Data Engineer Resume


Plano, TX

SUMMARY

  • Data Engineer with over 9 years of experience in data warehousing, data engineering, feature engineering, big data, ETL/ELT, and Business Intelligence. As a big data architect and engineer, specializes in AWS and Azure frameworks, Cloudera, the Hadoop ecosystem, Spark/PySpark/Scala, Databricks, Hive, Redshift, Snowflake, relational databases, tools like Tableau, Airflow, DBT, and Presto/Athena, and data DevOps frameworks/pipelines, with strong programming/scripting skills in Python.
  • Over 8 years of experience as a Data Engineer, with deep expertise in statistical data analysis: transforming business requirements into analytical models, designing algorithms, and building strategic solutions that scale across massive volumes of data.
  • Expert experience across various database platforms: Oracle, MongoDB, Redshift, SQL Server, PostgreSQL.
  • Strong experience in full System Development Life Cycle (Analysis, Design, Development, Testing, Deployment and Support) in waterfall and Agile methodologies.
  • Effective team player with strong communication and interpersonal skills and a strong ability to adapt to and learn new technologies and business lines rapidly.
  • Experienced in writing complex SQL queries, stored procedures, triggers, joins, and subqueries.
  • Good understanding and knowledge of NoSQL databases like MongoDB, HBase and Cassandra.
  • Experience implementing machine learning algorithms (regression models, decision trees, Naive Bayes, neural networks, random forest, gradient boosting, SVM, KNN, clustering).
  • Good knowledge of the architecture and components of Spark; efficient in working with Spark Core, Spark SQL, and Spark Streaming, with expertise in building PySpark and Spark/Scala applications for interactive analysis, batch processing, and stream processing.
  • Data streaming from various cloud (AWS, Azure) and on-premises sources using Spark and Flume.
  • Experience in Amazon AWS, Google Cloud Platform and Microsoft Azure cloud services.
  • Azure cloud experience using Azure Data Lake, Azure Data Factory, Azure Machine Learning, and Azure Databricks.
  • AWS cloud experience using EC2, S3, EMR, RDS, Redshift, SageMaker, and Glue.
  • Adept in statistical programming languages like Python and R including Big-Data technologies like Hadoop, HDFS, Spark and Hive.
  • Developed mappings in Informatica to load data from various sources into the data warehouse using transformations such as Source Qualifier, Java, Expression, Lookup, Aggregator, Update Strategy, and Joiner.
  • Good knowledge of tools such as Snowflake, SSIS, SSAS, and SSRS for designing data warehousing applications.
  • Experience in data mining, including predictive behavior analysis, Optimization and Customer Segmentation analysis using SAS and SQL.
  • Experience in Applied Statistics, Exploratory Data Analysis and Visualization using matplotlib, Tableau, Power BI, Google Analytics.
  • Extensive experience with version control systems such as Git, SVN, and Bitbucket.
  • Experience in database design, entity relationships and database analysis, programming SQL, stored procedures PL/SQL, packages, and triggers in Oracle.
  • Excellent working experience in Scrum / Agile framework and Waterfall project execution methodologies.
  • Superior communication skills, strong decision making and organizational skills along with outstanding analytical and problem-solving skills to undertake challenging jobs. Able to work well independently and in a team by helping to troubleshoot technology and business-related problems.

TECHNICAL SKILLS

Hadoop Distributions: Cloudera, AWS EMR and Azure Data Factory.

Languages: Scala, Python, SQL, HiveQL, KSQL.

IDE Tools: Eclipse, IntelliJ, PyCharm.

Cloud platform: AWS, Azure

AWS Services: VPC, IAM, S3, Elastic Beanstalk, CloudFront, Redshift, Lambda, Kinesis, DynamoDB, Direct Connect, Storage Gateway, EKS, DMS, SMS, SNS, and SWF

Reporting and ETL Tools: Tableau, Power BI, Talend, AWS Glue.

Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL Databases (HBase, Cassandra, MongoDB)

Big Data Technologies: Hadoop, HDFS, Hive, Pig, Oozie, Sqoop, Spark, Impala, Zookeeper, Flume, Airflow, Informatica, Snowflake, Databricks, Kafka, Cloudera

Machine Learning and Statistics: Regression, Random Forest, Clustering, Time-Series Forecasting, Hypothesis Testing, Exploratory Data Analysis

Containerization: Docker, Kubernetes

CI/CD Tools: Jenkins, Bamboo, GitLab CI, uDeploy, Travis CI, Octopus

Operating Systems: UNIX, LINUX, Ubuntu, CentOS.

Other Software: Control M, Eclipse, PyCharm, Jupyter, Apache, Jira, Putty, Advanced Excel

Frameworks: Django, Flask, WebApp2

PROFESSIONAL EXPERIENCE

Confidential, Plano, TX

Sr. Data Engineer

Responsibilities:

  • Worked on creating data pipelines for gathering, cleaning, and optimizing data using Hive and Spark.
  • Responsible for building scalable distributed data solutions in an Amazon EMR cluster environment.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka and persisted it into HDFS (a minimal sketch of this pattern is shown after this list).
  • Designed and developed data integration programs in a Hadoop environment with Kafka and the Cassandra data store for data access and analysis.
  • Imported and exported data between relational data sources such as SQL Server and HDFS using Sqoop.
  • Developed Spark SQL scripts using Python for faster data processing.
  • Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets.
  • Worked with huge datasets stored in AWS S3 buckets and used Spark DataFrames to perform preprocessing in Glue.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift (see the Glue sketch after this job's environment list).
  • Updated Python scripts to match training data with our database stored in AWS Cloud Search, so that we would be able to assign each document a response label for further classification.
  • Developed spark workflows using Scala to pull the data from AWS and apply transformations to the data.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
  • Worked on creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie.
  • Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Worked on Migrating MapReduce programs into Spark transformations using Spark and Scala.
  • Extracted data from HDFS using Hive and performed data analysis using Spark with Scala, PySpark, and Redshift for feature selection, and created nonparametric models in Spark.
  • Worked on RDS databases like MySQL Server and NoSQL databases like MongoDB and HBase.
  • Designed and implemented end-to-end systems for Data Analytics and Automation, integrating custom visualization tools using R, Tableau.
  • Developed Tableau visualizations and dashboards using Tableau Desktop.
  • Handling the day-to-day issues and fine tuning the applications for enhanced performance.
  • Collaborated with team members and stakeholders in the design and development of the data environment.
  • Implemented a one-time data migration of multi-state-level data from SQL Server to Snowflake using SnowSQL.
  • Good experience in migrating other databases to Snowflake.
  • In-depth knowledge of Snowflake databases, schemas, and table structures.
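A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS pattern referenced in the list above. The broker address, topic name, schema, and output paths are illustrative placeholders, not values from the original project; the job also assumes the spark-sql-kafka connector package is supplied at submit time.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Submit with: spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark-version> kafka_to_hdfs.py
spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Hypothetical event schema used only for illustration
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

# Read the raw byte stream from a Kafka topic (placeholder broker and topic)
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .load())

# Parse the Kafka value payload from JSON into typed columns
parsed = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Persist the transformed stream to HDFS as Parquet (placeholder paths)
query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .outputMode("append")
         .start())
query.awaitTermination()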

Environment: Python, SQL, Hive, Spark, AWS, Hadoop, NoSQL, Cassandra, SQL Server, HDFS, PySpark, Tableau, MongoDB, PostgreSQL, Redshift, HBase, Sqoop, Airflow, Oozie.
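Referring to the AWS Glue bullet in the responsibilities above, the following is a minimal Glue ETL job sketch for loading cataloged S3 campaign files into Redshift. The catalog database, table, column mappings, Glue connection, and target table are assumptions made for illustration, not the original project's configuration.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

# Standard Glue job bootstrap; TempDir is the staging area the Redshift load uses
args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read campaign files crawled from S3 into the Glue Data Catalog (placeholder names)
src = glueContext.create_dynamic_frame.from_catalog(
    database="campaign_db", table_name="raw_campaigns")

# Rename/cast columns to match the Redshift target (placeholder mappings)
mapped = ApplyMapping.apply(
    frame=src,
    mappings=[("campaign_id", "string", "campaign_id", "string"),
              ("spend", "double", "spend", "double"),
              ("event_date", "string", "event_date", "date")])

# Load into Redshift through a cataloged JDBC connection (placeholder connection/table)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "analytics.campaigns", "database": "dw"},
    redshift_tmp_dir=args["TempDir"])

job.commit()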

Confidential, Wilmington, DE

Big Data Engineer

Responsibilities:

  • Developed cloud architecture and strategy for hosting complicated app workloads on Microsoft Azure.
  • Designed and implemented data pipelines on the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
  • Created a customized ETL solution, batch processing, and a real-time data ingestion pipeline using PySpark and shell scripting to move data into and out of the Hadoop cluster.
  • Combined on-premises data (MySQL, Cassandra) with cloud data (Blob Storage, Azure SQL DB) using Azure Data Factory and applied transformations to load the results back into Azure Synapse.
  • Built and published Docker container images using Azure Container Registry and deployed them into Azure Kubernetes Service (AKS).
  • Imported metadata into Hive and migrated existing tables and applications to Hive and Azure.
  • Created complex data conversions and manipulations by combining ADF and Scala.
  • To meet business functional needs, configured Azure Data Factory (ADF) to ingest data from various sources such as relational and non-relational databases.
  • Built DAGs in Apache Airflow to schedule ETL processes and used additional Airflow features such as pools, executors, and multi-node capability to optimize workflows (a minimal DAG sketch is shown after this list).
  • Improved Airflow performance by identifying and applying optimal configurations.
  • Configured Spark Streaming to receive real-time data from Apache Flume and store it in an Azure Table using Scala; used Data Lake to store and process all forms of data.
  • Applied various aggregations provided by the Spark framework in the transformation layer using Apache Spark RDDs, DataFrame APIs, and Spark SQL.
  • Provided real-time insights and reports by mining data with Spark Scala functions; optimized existing Scala code and improved cluster performance.
  • Utilized Spark Context, SparkSQL, and Spark Streaming to process large datasets.
  • Performed ETL operations using Python, SparkSQL, S3, and Redshift on terabytes of data to obtain customer insights.
  • Transitioned log storage from Cassandra to Azure SQL Data Warehouse, which improved query performance.
  • Using Spark, Hive, and Sqoop, created custom input adapters to ingest data for analytics from multiple sources (Snowflake, MS SQL, MongoDB) into HDFS.
  • Automated jobs and data pipelines using AWS Step Functions and AWS Lambda, and configured various performance metrics using Amazon CloudWatch.
  • Involved in writing Python scripts to automate ETL pipeline and DAG workflows using Airflow.
  • Created workflows using Airflow to automate the process of extracting weblogs into the S3 Data Lake.
  • Integrated applications using Apache tomcat servers on EC2 instances and automated data pipelines into AWS using Jenkins.
  • Experience with Apache Hadoop big data components such as HDFS, MapReduce, YARN, Hive, HBase, and Sqoop.
  • Used Git for version control and Jira for project management, tracking issues and bugs.
  • Used Azure Kubernetes Service to manage resources and scheduling across the cluster.
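A minimal Apache Airflow DAG sketch of the ETL scheduling pattern referenced in the list above. The DAG id, schedule, callable, and spark-submit command are placeholders rather than the actual production workflows (Airflow 2.x operator imports assumed).

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_to_lake(**context):
    # Placeholder extract step: pull source data and land it in the data lake
    print("extracting data for", context["ds"])


with DAG(
    dag_id="daily_etl",                 # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_to_lake)

    transform = BashOperator(
        task_id="transform",
        # Placeholder PySpark job; {{ ds }} passes the logical run date
        bash_command="spark-submit /opt/jobs/transform.py {{ ds }}",
    )

    # Run the extract, then the transform
    extract >> transform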

Environment: Spark, Hadoop, AWS (Lambda, Glue, EMR), NoSQL, Python, HDFS, Amazon Elastic Compute Cloud (EC2), Amazon Simple Storage Service (S3), CloudWatch Triggers (SQS, EventBridge, SNS), REST, ETL, DynamoDB, Redshift, JSON, Tableau, JIRA, Jenkins, Git, Maven, Bash

Confidential, Boston, MA

Data Engineer

Responsibilities:

  • Involved in developing batch and stream processing applications that require functional pipelining using Spark APIs.
  • Developed streaming applications using PySpark and Kafka to read messages from Amazon AWS Kafka queues and write the JSON data to AWS S3 buckets.
  • Developed streaming applications using PySpark to read from Kafka and persist the data to NoSQL databases such as HBase and Cassandra.
  • Developed tools using Python, Shell scripting, XML to automate some of the menial tasks.
  • Developed analytical components using Scala, Spark, and Spark Streaming.
  • Developed streaming and batch processing applications using PySpark to ingest data from various sources into the HDFS data lake.
  • Developed back-end web services using Python and the Django REST framework (see the sketch after this job's environment list).
  • Developed and implemented HQL scripts to create partitioned and bucketed tables in Hive for optimized data access.
  • Implemented PySpark scripts using Spark SQL to access Hive tables in Spark for faster data processing (a minimal sketch of this pattern is shown after this list).
  • Implemented microservices in Scala along with Apache Kafka.
  • Extracted real-time data feeds using Kafka, processed them with Spark Streaming as Resilient Distributed Datasets (RDDs) and DataFrames, and saved them in Parquet format in HDFS and NoSQL databases.
  • Developed a high-speed BI layer on the Hadoop platform with Apache Spark, Java, and Python.
  • Wrote Hive UDFs to implement custom aggregation functions in Hive.
  • Worked extensively with Sqoop for importing and exporting data between HDFS and relational database systems/mainframes.
  • Processed schema-oriented and non-schema-oriented data using Scala and Spark.
  • Developed DDL and DML scripts in SQL and HQL for analytics applications in RDBMS and Hive.
  • Used the Oozie scheduler to automate pipeline workflows and orchestrate the MapReduce extraction jobs, and used ZooKeeper to provide coordination services to the cluster.
  • Experience in Designing, Architecting, and implementing scalable cloud-based web applications using AWS.
  • Designed and developed ETL processes in AWS Glue to migrate data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
  • Used Apache Airflow to build data pipelines, using operators such as the BashOperator, Hadoop operators, Python callables, and branching operators.
  • Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems.
  • Developed and implemented Apache NiFi across various environments and wrote QA scripts in Python for tracking files.
  • Experience moving raw data between different systems using Apache NiFi. Utilized multiple Informatica transformations (Source Qualifier, Lookup, Router, Update Strategy) to create SCD-type mappings that capture changes in loan-related data in a timely manner.
  • Set up and built AWS infrastructure with various resources (EC2, S3, Auto Scaling, CloudWatch, and RDS) in CloudFormation using JSON templates.
  • Involved in designing and deploying a multitude of applications utilizing almost all of the AWS stack (including EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling in AWS CloudFormation.
  • Set up full CI/CD pipelines so that each commit a developer makes goes through the standard software lifecycle and is thoroughly tested before reaching production.
  • Helped individual teams set up their repositories in Bitbucket, maintain their code, and set up jobs that make use of the CI/CD environment.
  • Worked through the full System Development Life Cycle (Analysis, Design, Development, Testing, Deployment, and Support) using Agile methodologies.
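A minimal PySpark sketch of the Spark SQL/Hive access pattern referenced in the list above: query an existing Hive table through Spark SQL, then persist the result as a partitioned, bucketed Hive table for optimized access. The database, table, and column names are illustrative assumptions.

from pyspark.sql import SparkSession

# Hive support lets Spark SQL read and write tables in the Hive metastore
spark = (SparkSession.builder
         .appName("hive-access")
         .enableHiveSupport()
         .getOrCreate())

# Pull today's slice of an existing Hive table through Spark SQL (placeholder table)
daily = spark.sql("""
    SELECT customer_id, txn_date, amount
    FROM sales_db.transactions
    WHERE txn_date = current_date()
""")

# Persist as a partitioned, bucketed managed table (placeholder target table)
(daily.write
 .mode("append")
 .partitionBy("txn_date")
 .bucketBy(16, "customer_id")
 .sortBy("customer_id")
 .saveAsTable("sales_db.transactions_curated"))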

Environment: Hadoop, HDFS, Hive, Sqoop, HBase, Oozie, YARN, NiFi, Cassandra, Zookeeper, Spark, Kafka, Oracle, MySQL, Shell Script, AWS, EC2, Git (source control), AWS Redshift, AWS Glue.
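Referring to the Django REST framework bullet in the responsibilities above, a minimal read-only API sketch; the model, serializer, and route names are hypothetical and only illustrate the pattern (in a real project the model, serializer, viewset, and URL wiring would live in their usual Django modules).

from django.db import models
from rest_framework import routers, serializers, viewsets


class PipelineRun(models.Model):
    # Hypothetical model tracking pipeline executions
    name = models.CharField(max_length=100)
    status = models.CharField(max_length=20)
    started_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        app_label = "pipelines"  # placeholder app


class PipelineRunSerializer(serializers.ModelSerializer):
    class Meta:
        model = PipelineRun
        fields = ["id", "name", "status", "started_at"]


class PipelineRunViewSet(viewsets.ReadOnlyModelViewSet):
    queryset = PipelineRun.objects.all()
    serializer_class = PipelineRunSerializer


# urls.py: expose the endpoint at /runs/
router = routers.DefaultRouter()
router.register(r"runs", PipelineRunViewSet)
urlpatterns = router.urls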

Confidential

Bigdata Developer

Responsibilities:

  • Developed Apache Spark applications using Spark for data processing from various streaming sources.
  • Used Spark DataFrame operations to perform required validations on the data and to perform analytics on the Hive data (a minimal sketch is shown after this list).
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, and processed the data as DataFrames.
  • Worked on Kafka and REST APIs to collect and load data onto the Hadoop file system; also used Sqoop to load data from relational databases.
  • Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to HBase.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Used Spark SQL on DataFrames to access Hive tables in Spark for faster data processing.
  • Created various Hive external tables and staging tables and joined the tables as per requirements; implemented static partitioning, dynamic partitioning, and bucketing.
  • Worked with Spark to improve performance and optimize the existing algorithms in Hadoop using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
  • Implemented Amazon EMR for processing big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
  • Installed application on AWS EC2 instances and configured the storage on S3 buckets.
  • Imported data from different sources such as AWS S3 and the local file system into Spark.
  • Used AWS S3 and the local hard disk as the underlying file system for Hadoop in place of HDFS.
  • Stored data in AWS S3 (HDFS-like) and ran EMR programs on the stored data.
  • Exported the analyzed data into the database using Sqoop to generate reports for the BI team.
  • Experience building batch, real-time, and streaming analytics pipelines with data from event data streams, NoSQL stores, and APIs.
  • Used HUE for running Hive queries. Created partitions by day using Hive to improve performance.
  • Developed Oozie workflows to run multiple Hive, Pig, Sqoop, and Spark jobs.
  • Experience with various data modeling concepts such as star schema and snowflake schema in the project.
  • Designed, configured, and managed public/private cloud infrastructure utilizing AWS.
  • Worked on auto-scaling instances to design cost-effective, fault-tolerant, and highly reliable systems.
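A minimal PySpark sketch of the DataFrame validation and analytics pattern referenced in the list above: load raw records from S3, keep only rows that pass basic checks, and aggregate for reporting. The bucket paths, column names, and validation rule are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-validation").getOrCreate()

# Read raw JSON events landed in S3 (placeholder bucket/path)
raw = spark.read.json("s3a://raw-bucket/events/")

# Basic validations: required key present and amount within a sane range
valid = raw.filter(
    F.col("event_id").isNotNull()
    & F.col("amount").between(0, 1_000_000)
)

# Simple analytic rollup on the validated records
summary = (valid.groupBy("event_type")
           .agg(F.count("*").alias("event_count"),
                F.sum("amount").alias("total_amount")))

# Persist curated output back to S3 as Parquet (placeholder bucket/path)
summary.write.mode("overwrite").parquet("s3a://curated-bucket/event_summary/")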

Environment: Hadoop, Spark, AWS, UNIX Shell Scripting, Sqoop, HDFS, Pig, Hive, Oozie, Java, Oracle, Git, NiFi, Python, MongoDB.

Confidential

Software Engineer

Responsibilities:

  • Hands-on experience loading data from the UNIX file system to HDFS; also performed parallel data transfers on the cluster using DistCp.
  • Used Sqoop to import and export data into the Hadoop Distributed File System for further processing.
  • Implemented Flume scripts to load streamed data into HDFS.
  • Involved in creating mappings and loading data into target tables as required, applying the appropriate logic and transformations with a specific focus on source data validation.
  • Automated processes in the Cloudera environment and built Oozie workflows.
  • Developed Hive internal and external tables and operated on them using HiveQL.
  • Wrote Hive UDFs to make functions reusable across different models.
  • Created Hive tables based on business requirements; used Hive queries to analyze large datasets.
  • Handled data from different datasets, joining and preprocessing them using Pig join operations.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying (a minimal sketch is shown after this list).
  • Implemented partitioning, dynamic partitions, and buckets in Hive to improve performance and help organize data in a logical fashion.
  • Extensive working knowledge of partitioned table, UDFs, performance tuning, compression-related properties in Hive.
  • Worked on QA support activities, test data creation, and unit testing; reviewed and managed Hadoop log files.
  • Participated in daily Scrum meetings and gave daily status reports.
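A minimal PySpark sketch of a custom aggregate of the kind referenced in the list above, implemented as a pandas grouped-aggregate UDF (Spark 3.x type-hint style, which requires pyarrow). The table, columns, and the trimmed-mean logic are illustrative assumptions.

import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = (SparkSession.builder
         .appName("custom-aggregate")
         .enableHiveSupport()
         .getOrCreate())


@pandas_udf("double")
def trimmed_mean(amounts: pd.Series) -> float:
    # Custom aggregate: mean of the values between the 5th and 95th percentiles
    lo, hi = amounts.quantile([0.05, 0.95])
    return float(amounts[(amounts >= lo) & (amounts <= hi)].mean())


# Interactive query: apply the custom aggregate per category (placeholder table/columns)
result = (spark.table("sales_db.orders")
          .groupBy("category")
          .agg(trimmed_mean("amount").alias("typical_amount")))
result.show()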

Environment: Hadoop, HDFS, Java, MapReduce, Hive, Sqoop, Spark SQL, HQL, Oozie, Git, Oracle, Pig, Cloudera, UNIX, Agile.
