Sr. Big Data Engineer Resume

San Antonio, TX

SUMMARY

  • Senior Data Engineer with 7 years of professional experience in the development, delivery, and maintenance of data warehouse and big data applications across multiple domains.
  • Hands-on experience with ETL/ELT pipelines, data modeling, data lake architecture, data warehousing, and database design in distributed environments using big data tools on cloud-based platforms.
  • Expertise in building big data pipelines with tools like Hadoop and Spark and in creating optimized ETL pipelines on AWS, Azure, and GCP to support machine learning and business intelligence applications.
  • Hands-on experience with Hadoop ecosystem components such as Hive, Pig, Sqoop, HBase, Cassandra, Spark, Spark Streaming, Spark SQL, Oozie, ZooKeeper, Kafka, Flume, the MapReduce framework, YARN, Scala, and Hue.
  • Extensively worked on AWS services like EC2, S3, EMR, RDS (Aurora), SageMaker, Athena, Glue Data Catalog, Redshift, DynamoDB, ElastiCache (Memcached & Redis), QuickSight, and other services in the AWS family.
  • Hands-on experience in Azure development; worked on Azure web applications, App Services, Azure Storage, Azure SQL Database, virtual machines, the fabric controller, and Azure AD.
  • Extensive knowledge of the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
  • Hands-on experience with Unified Data Analytics on Databricks: the Databricks workspace user interface, managing Databricks notebooks, Delta Lake with Python, and Delta Lake with Spark SQL.
  • Expertise in Spark architecture on Databricks and Structured Streaming; set up AWS and Microsoft Azure with Databricks, Databricks workspaces for business analytics, cluster management in Databricks, and management of the machine learning lifecycle.
  • Experience in building and architecting multiple data pipelines, including ETL and ELT processes for data ingestion and transformation in GCP using BigQuery, Dataproc, Cloud SQL, and Datastore.
  • Extensive experience in working with NoSQL databases and their integration: DynamoDB, Cosmos DB, MongoDB, Cassandra, and HBase.
  • Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining, data acquisition, data preparation, data manipulation, feature engineering, machine learning, validation, visualization, and reporting solutions that scale across massive volumes of structured and unstructured data.
  • Excellent knowledge of the architecture and components of Spark; efficient in working with Spark Core, Spark SQL, and Spark Streaming, with expertise in building PySpark and Spark-Scala applications for interactive analysis, batch processing, and stream processing.
  • Extensively used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data and used DataFrame operations to perform the required validations on the data.
  • Proficient in Python scripting and developed various internal packages to process big data.
  • Developed various shell and Python scripts to automate Spark jobs and Hive scripts; proficient in converting Hive/SQL queries into Spark transformations using DataFrames and Datasets (see the sketch following this summary).
  • Excellent knowledge of partitioning and bucketing concepts in Hive.
  • Experience in importing and exporting the data using Sqoop from HDFS to Relational Database Systems and from Relational Database Systems to HDFS.
  • Created workflows using scheduling and locking tools/services such as Airflow, AWS Step Functions, Oozie, ZooKeeper, and Apache NiFi.
  • Experience in parsing data from S3 via Python API calls through Amazon API Gateway, generating batch sources for processing.
  • Extensive experience with micro-batching to ingest millions of files into Snowflake as they arrive in the staging area.
  • Integrated various jQuery plugins (token-input, moment, validator, dropdown, carousel, d3.js, etc.) into various Backbone modules.
  • Hands-on Experience in using Visualization tools like Tableau, Power BI.
  • Good experience with Python web frameworks such as Django, Flask, and Pyramid.
  • Experience in working with GIT, Bitbucket Version Control System.
  • Utilized Kubernetes and Docker as the runtime environment for the CI/CD system to build, test, and deploy.
  • Designed UNIX Shell Scripting for automating deployments and other routine tasks.
  • Extensive experience in the development of Bash scripting, T-SQL, and PL/SQL scripts.
  • Capable of working with SDLC, Agile, and Waterfall Methodologies.
  • Wrote Kusto (KQL) and Cosmos (SCOPE) queries and published the data to Power BI.
  • Highly experienced with source code management tools like Git, Subversion, and Perforce.
  • Installed both Cloudera (CDH4) and Hortonworks (HDP1.3-2.1) Hadoop clusters on EC2, Ubuntu 12.04, CentOS 6.5 on platforms ranging from 10-100 nodes.
  • Excellent communication skills, interpersonal skills, problem-solving skills, and being a team player. Ability to quickly adapt to new environments and technologies.
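
A minimal PySpark sketch of the Hive/SQL-to-DataFrame conversion mentioned above (see the bullet on converting Hive/SQL queries); the sales table and its columns are hypothetical placeholders, not drawn from any actual engagement.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("hive-to-dataframe-sketch")
             .enableHiveSupport()   # lets Spark read Hive tables directly
             .getOrCreate())

    # DataFrame equivalent of:
    #   SELECT region, SUM(amount) AS total_amount
    #   FROM sales WHERE year = 2021 GROUP BY region
    sales_df = spark.table("sales")             # hypothetical Hive table

    totals_df = (sales_df
                 .filter(F.col("year") == 2021)
                 .groupBy("region")
                 .agg(F.sum("amount").alias("total_amount")))

    totals_df.show()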

TECHNICAL SKILLS

Hadoop/Big Data Ecosystem: Apache Spark, HDFS, Map Reduce, HIVE, Sqoop, Flume, Kafka, IntelliJ

Programming & Scripting: Python, Scala, Java, JavaScript, SQL, Shell Scripting, C, C++

NoSQL Databases: MongoDB, HBase, Cassandra, DynamoDB

Databases: Oracle 11g/10g, DB2, MS-SQL Server, MySQL, Teradata, PostgreSQL

Web Technologies: HTML, XML, JDBC, JSP, CSS, JavaScript, SOAP

Tools Used: Eclipse, PuTTY, WinSCP, NetBeans, QlikView, Power BI, Tableau

Operating Systems: Ubuntu (Linux), Mac OS-X, CentOS, Windows 10, 8, Red Hat

Cloud Computing: AWS, Azure

Distributed platforms: Hortonworks, Cloudera, MapR

Version Control: Git, GitHub, SVN, CVS

Methodologies: Agile/ Scrum, Rational Unified Process and Waterfall

PROFESSIONAL EXPERIENCE

Confidential, San Antonio TX

Sr. Big Data Engineer

Responsibilities:

  • Extensively worked on AWS services like S3, EC2, EMR, Redshift, Athena, Glue, DynamoDB, RDS, and IAM, along with big data components like Spark RDDs, the DataFrame, Dataset, and Data Source APIs, Spark SQL, Spark Streaming, SQL, and MongoDB.
  • Created an end-to-end data pipeline covering data ingestion, curation, and provisioning using AWS cloud services.
  • Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from both batch and streaming sources.
  • Ingested data into S3 buckets from different sources including MySQL, Oracle, MongoDB, and SFTP.
  • Created an EMR cluster on EC2 instances, developed PySpark applications to perform data transformations on it, and stored the results in Redshift.
  • Used Kusto Explorer to write Kusto queries that are used in Power BI reports.
  • Configured AWS Identity and Access Management (IAM) Groups and Users for improved login authentication.
  • Used AWS Lambda to automate reading files of various formats (Parquet, DAT, JSON, Avro, CSV) from AWS S3 into AWS RDS.
  • Automated Python scripts using AWS Lambda to convert files from JSON to Parquet format.
  • Configured an S3 event to trigger a Lambda function that automatically converts files from CSV to JSON and loads them into DynamoDB (a sketch of this pattern follows this list).
  • Created AWS data pipelines using resources including AWS API Gateway and AWS Lambda to retrieve data from Snowflake and return the response in JSON format.
  • Created an event-driven AWS Glue ETL pipeline using a Lambda function that reads data from an S3 bucket and stores it in Redshift on a daily basis.
  • Migrated an existing on-premises application to AWS and used AWS services like EC2 and S3 for data processing and storage.
  • Developed Python scripts using the Boto3 library to configure AWS Glue, EC2, S3, and DynamoDB.
  • Tuned Spark applications by setting the batch interval time, the correct level of parallelism, and memory usage.
  • Built data visualizations to monitor file server load, web server speeds, data processing, and more using D3.js, Backbone, and jQuery.
  • Used the D3.js charting library to develop a reusable charting Backbone module.
  • Used the Spark Streaming APIs to perform transformations and actions on data coming from Kafka in real time and persist it to AWS S3.
  • Developed a Kafka consumer API in Python for consuming data from Kafka topics.
  • Used Apache Kafka to aggregate web log data from multiple servers and make it available in downstream systems for data analysis and engineering roles.
  • Extracted data from HDFS using Hive, performed data analysis using PySpark and Redshift for feature selection, and created nonparametric models in Spark.
  • Configured Snowpipe to pull data from S3 buckets into Snowflake tables and stored incoming data in the Snowflake staging area.
  • Created, modified, and executed DDL on AWS Redshift and Snowflake tables to load data.
  • Used AWS Redshift, S3, Spectrum, and Athena services to query large amounts of data stored on S3 to create a Virtual Data Lake without having to go through the ETL process.
  • Scheduled an end-to-end data pipeline including ingestion, curation, and provision using Apache Airflow.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
  • Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, and handled structured data using Spark SQL.
  • Hands-on experience with Apache Superset for indexing and load-balanced querying to search for specific data in larger datasets.
  • Used the Apache Superset job management scheduler to execute the workflow.
  • Imported and exported data jobs to copy data to and from HDFS using Sqoop, and developed Spark code using SQL for faster testing and processing of data.
  • Developed Sqoop and Kafka Jobs to load data from RDBMS into HDFS and HIVE.
  • Worked on application development, especially in the Linux environment, and familiar with its commands; used the Jenkins continuous integration tool for project deployment and deployed the project to Jenkins using the Git version control system.
  • Developed entire frontend and backend modules using Python on the Django web framework and created the user interface (UI) using JavaScript, Bootstrap, and HTML5/CSS with Cassandra and MySQL.
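
A minimal sketch of the S3-event-driven Lambda described above (CSV to JSON to DynamoDB), assuming a hypothetical DynamoDB table named "records" whose partition key appears as a CSV column; a production version would add error handling and type conversion.

    import csv
    import io
    import urllib.parse

    import boto3

    s3 = boto3.client("s3")
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("records")   # hypothetical table name

    def lambda_handler(event, context):
        # The S3 PutObject event carries the bucket and key of the uploaded CSV file.
        s3_info = event["Records"][0]["s3"]
        bucket = s3_info["bucket"]["name"]
        key = urllib.parse.unquote_plus(s3_info["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Each CSV row becomes one JSON-style item in DynamoDB.
        rows = list(csv.DictReader(io.StringIO(body)))
        with table.batch_writer() as writer:
            for row in rows:
                writer.put_item(Item=dict(row))

        return {"statusCode": 200, "rowsLoaded": len(rows)}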

Environment: Spark, Spark Streaming, PySpark, Spark SQL, AWS EMR, S3, EC2, Redshift, Athena, Lambda, Glue, DynamoDB, MapR, Snowflake, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, Scala, Shell scripting, Linux, MySQL, NoSQL, SOLR, Jenkins, Oracle, Git, Airflow, Tableau, Power BI, SOAP, Cassandra.

Confidential, Plano TX

Big Data Engineer

Responsibilities:

  • Worked on Azure cloud platform services like HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Ingested data to one or more Azure services (Azure Data Lake, Azure Blob Storage, Azure SQL DW) and processed the data in Azure Databricks.
  • Developed data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL and persisted the data into Azure Synapse Analytics.
  • Worked on Azure Data Factory to integrate data of both on-prem (MySQL, PostgreSQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) and applied transformations to load back to Azure Synapse.
  • Worked on migration of data from on-prem SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
  • Used Spark DataFrames to create various Datasets and applied business transformations and data cleansing operations using DataBricks Notebooks.
  • Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
  • Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to perform streaming analytics in Databricks (see the sketch after this list).
  • Efficient in writing Python scripts to build ETL pipelines and Directed Acyclic Graph (DAG) workflows using Apache Airflow and Apache NiFi.
  • Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest data (from Snowflake, MS SQL, and MongoDB) into HDFS for analysis.
  • Created data pipeline for different events in Azure Blob storage into Hive external tables and used various Hive optimization techniques like partitioning, bucketing, and Mapjoin.
  • Extracted tables and exported data from Teradata through Sqoop and placed them in Cassandra.
  • Used Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory for authentication, and Apache Ranger for authorization.
  • Implemented a continuous delivery pipeline with Docker, Jenkins, GitHub, and AWS AMIs; whenever a new GitHub branch is started, Jenkins, the continuous integration server, automatically attempts to build a new Docker container from it.
  • Extensively used Kubernetes to handle the online and batch workloads that feed analytics and machine learning applications.
  • Managed resources and scheduling across the cluster using Azure Kubernetes Service (AKS); AKS was used to create, configure, and manage a cluster of virtual machines.
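
A minimal PySpark Structured Streaming sketch of the mini-batch pattern described above (see the Spark Streaming bullet); the paths, schema, and column names are hypothetical placeholders and the write assumes a Databricks environment with Delta Lake available.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("event-stream-minibatch").getOrCreate()

    # Read newly arriving JSON event files from a (hypothetical) landing path.
    events = (spark.readStream
              .format("json")
              .schema("event_id STRING, event_type STRING, amount DOUBLE, event_ts TIMESTAMP")
              .load("/mnt/landing/events/"))

    # Basic cleansing/transformation applied to each mini-batch.
    cleaned = (events
               .filter(F.col("event_id").isNotNull())
               .withColumn("event_date", F.to_date("event_ts")))

    # Persist the stream to a Delta table, with a checkpoint for fault tolerance.
    query = (cleaned.writeStream
             .format("delta")
             .outputMode("append")
             .option("checkpointLocation", "/mnt/checkpoints/events/")
             .start("/mnt/delta/events/"))

    query.awaitTermination()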

Environment: Azure Data Factory, Blob Storage, Synapse, Azure SQL, Azure HDInsight, Databricks, DataLakeGen2, CosmosDB, AKS, Docker, Jenkins, AD, MySQL, PostgreSQL, Snowflake, MongoDB, Cassandra, Teradata, Python, Scala, Hadoop 2.x (HDFS, MapReduce, Yarn), Spark, RDD, PySpark, Airflow, DAG, Hive, Sqoop, HBase, Tableau, PowerBI.

Confidential, Bridgeport, CT.

BI Data Engineer

Responsibilities:

  • Involved in designing and deploying multi-tier applications using AWS services like EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, and IAM, focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation, and developed machine learning models.
  • Supported continuous storage in AWS using Elastic Block Storage, S3, Glacier.
  • Used Data Frame API in Scala for converting the distributed collection of data organized into named columns, developing predictive analytics using Apache Spark Scala APIs.
  • Used Scala for its concurrency support, which played a key role in parallel processing of the large data sets.
  • Developed MapReduce jobs in Scala, compiling the program code into JVM bytecode for data processing.
  • Automated the process of extracting the various files like flat/excel files from various sources like FTP and SFTP.
  • Developed Scala scripts using both DataFrames/SQL/Datasets and RDDs/MapReduce in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
  • Developed Hive queries to pre-process the data required for running the business process.
  • Created HBase tables to load large sets of structured, semi-structured, and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
  • Implemented various machine learning models using AWS SageMaker.
  • Used Pandas, NumPy, seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python to develop various machine learning algorithms, and utilized algorithms such as multivariate regression, Random Forests, K-means, and KNN for data analysis (an illustrative sketch follows this list).
  • Extensive expertise using the core Spark APIs and processing data on an EMR cluster.
  • Worked on ETL Migration services by developing and deploying AWS Lambda functions for generating a serverless data pipeline that can be written to Glue Catalog and can be queried from Athena.
  • Programmed in Hive, Spark SQL, Java, C#, and Python to streamline the incoming data and build the data pipelines to get the useful insights and orchestrated the pipelines.
  • Worked on ETL pipeline to source these tables and to deliver this calculated ratio data from AWS to Datamart (SQL Server) & Credit Edge server.
  • Experience in using and tuning relational databases (Microsoft SQL Server, AWS RDS) and columnar databases (Amazon Redshift, Microsoft SQL Data Warehouse).
  • Worked in the development of applications especially in the UNIX environment and was familiar with all its commands.
  • Administered and monitored a multi-data-center Cassandra cluster based on an understanding of the Cassandra architecture.
  • Created an automated archive process to remove unused tables to ensure optimal database speed. Implemented a third-party data transformation process using Redshift, Lambda, S3, Kinesis, and EDI Exchange software, reducing integration time by a factor of 10.
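
A minimal scikit-learn sketch of the model-building step mentioned above (the Random Forest/K-means/KNN bullet); the file, feature, and label names are hypothetical placeholders for a curated extract.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Hypothetical curated dataset exported by the pipeline (e.g., an S3 extract).
    df = pd.read_csv("portfolio_features.csv")      # placeholder file name
    X = df.drop(columns=["default_flag"])           # placeholder feature columns
    y = df["default_flag"]                          # placeholder label column

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))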

Environment: EC2, Route 53, S3, RDS, EMR, Redshift, Glue, DynamoDB, Lambda, SNS, SQS, IAM, Spark, Machine Learning, Python, Scala, Hive, HBase, MySQL, Unix.

Confidential, Philadelphia PA

Data Engineering Analyst

Responsibilities:

  • Experienced working with big data, data visualization, Python development, SQL, and Unix.
  • Expertise in quantitative analysis, data mining, and the presentation of data to see beyond the numbers and understand trends and insights.
  • Handled high volume of day-to-day Informatica workflow migrations.
  • Reviewed Informatica ETL design documents and worked closely with development to ensure correct standards were followed.
  • Designed and implemented complex ETL data processes using Informatica PowerCenter and advanced SQL queries (analytical functions).
  • Created Informatica mappings using various transformations like Joiner, Aggregator, Expression, Filter, and Update Strategy.
  • Created complex SQL queries and scripts to extract and aggregate data and validate its accuracy; gathered business requirements and translated them into clear, concise specifications and queries.
  • Prepared high-level analysis reports using Excel and Tableau and provided feedback on data quality, including identification of billing patterns and outliers.
  • Identified and documented data quality limitations that affected internal and external data analysts; wrote standard SQL queries to perform data validation, created Excel summary reports (pivot tables and charts), and gathered analytical data to develop functional requirements using data modeling and ETL tools.
  • Performed ETL data cleansing, integration, and transformation using Hive and PySpark; responsible for managing data from disparate sources.
  • Used Spark optimization techniques such as cache/refresh tables, broadcast variables, coalesce/repartitioning, increasing memory overhead limits, handling parallelism, and modifying the Spark default configuration variables for performance tuning (see the sketch after this list).
  • Read data from different sources such as CSV files, Excel, HTML pages, and SQL, performed data analysis, and wrote results back to data sources such as CSV files, Excel, and databases.
  • Developed and handled business logic through backend Python code.
  • Created templates for page rendering and Django views for the business logic.
  • Worked on Django REST framework and integrated new and existing API endpoints.
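
A minimal PySpark sketch of the optimization techniques listed above (caching, broadcast joins, coalesce/repartition, and overriding default configuration); the table paths and join key are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = (SparkSession.builder
             .appName("spark-tuning-sketch")
             # Example of overriding a default config value for shuffle parallelism.
             .config("spark.sql.shuffle.partitions", "200")
             .getOrCreate())

    claims = spark.read.parquet("/data/claims/")        # large fact table (placeholder path)
    providers = spark.read.parquet("/data/providers/")  # small dimension table (placeholder path)

    # Cache a DataFrame that is reused by several downstream aggregations.
    claims = claims.cache()

    # Broadcast the small dimension table to avoid a shuffle-heavy join.
    joined = claims.join(broadcast(providers), on="provider_id", how="left")

    # Reduce the number of output files before writing the curated result.
    (joined.coalesce(32)
           .write.mode("overwrite")
           .parquet("/data/curated/claims_enriched/"))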

Environment: Python, SQL, ETL, Tableau, Informatica, Spark, HTML, Django.

Confidential

BI Developer

Responsibilities:

  • Developed and maintained ETL packages to extract data from various sources and was responsible for debugging and upgrading several ETL structures as per the requirements.
  • Developed SSIS packages for different tasks like data cleansing and standardizing, sorting, conditional split, back up and involved in scheduling of SSIS packages.
  • Involved in Data Extraction, Transformation and Loading (ETL) between Homogenous and Heterogeneous systems using SQL tools (SSIS, Bulk Insert).
  • Involved in error handling of SSIS packages by evaluating error logs.
  • Successfully extracted, transformed, and loaded data into the data warehouse.
  • Created reports using SSRS from OLTP and OLAP data sources and deployed them on the report server.
  • Created tabular, matrix, chart, drill-down, parameterized, and cascaded reports, as well as dashboards and scorecards, in SSRS according to business requirements.
  • Created reports by dragging data from cubes and wrote MDX scripts.
  • Used SSIS 2008 to create ETL packages (.dtsx files) to validate, extract, transform, and load data into data warehouse and data mart databases, and processed SSAS cubes to store data in OLAP databases.
  • Experience in managing and automating Control flow, Data flow, Events, and Logging programmatically using Microsoft .NET framework for SSIS packages.
  • Used SSIS and T-SQL stored procedures to transfer data from OLTP databases to the staging area and finally into the data warehouse (an illustrative sketch follows this list).
  • Used various SSIS tasks such as conditional split, derived column, and lookup for data scrubbing and data validation checks during staging, before loading the data into the data warehouse.
  • Created SQL Server configurations for SSIS packages; experienced in creating jobs, alerts, and SQL Mail agents and in scheduling SSIS packages.
  • Responsible for creating reports based on the requirements using SSRS and defined the report layout and identified datasets for the report generation.
  • Created different types of reports, including financial, crosstab, conditional, drill-down, sub-reports, parameterized, and ad hoc reports for existing databases.
  • Designed and created financial report templates, bar graphs, and pie charts based on the financial data, and scheduled the monthly/weekly/daily reports to run automatically.
  • Developed, implemented, and maintained various database objects such as stored procedures, triggers, functions, indexes, and views.
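
For illustration only, a minimal Python/pyodbc sketch of the staging-to-warehouse hand-off that the SSIS packages and T-SQL stored procedures performed; the server, database, and procedure names are hypothetical placeholders, not the actual objects.

    import pyodbc

    # Hypothetical connection details for the SQL Server data warehouse.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=dw-sql-server;DATABASE=SalesDW;Trusted_Connection=yes;"
    )

    try:
        cursor = conn.cursor()
        # Call a (hypothetical) stored procedure that validates and moves rows
        # from the staging schema into the warehouse fact table for one batch date.
        cursor.execute("EXEC staging.usp_LoadFactSales @BatchDate = ?", "2021-06-30")
        conn.commit()
    finally:
        conn.close()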

Environment: SSIS, SSRS, SSAS, ETL, .NET, OLAP, OLTP, T-SQL, SQL.
