
Data Engineer Resume


Nashville, Tennessee

SUMMARY

  • Accomplished IT professional with 8 years of experience, specializing in Big Data systems, data acquisition, ingestion, modeling, storage, analysis, integration, and processing.
  • A dedicated engineer with strong problem-solving, troubleshooting, and analytical skills who actively participates in understanding and delivering business requirements.
  • 4+ years of industrial experience in Big Data analytics and data manipulation using Hadoop ecosystem tools: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Kafka, Flume, Sqoop, Oozie, Avro, AWS, Spark integration with Cassandra, and ZooKeeper.
  • Experience with Django and Flask, high-level Python web frameworks, as well as star-schema data modeling.
  • Experienced in WAMP (Windows, Apache, MySQL, Python) and LAMP (Linux, Apache, MySQL, Python) architectures.
  • Strong knowledge of Hadoop architecture and its core concepts: distributed storage, parallel processing, high availability, fault tolerance, and scalability.
  • Extensive hands-on experience with Big Data frameworks such as Hadoop (HDFS, MapReduce, YARN), Spark, Kafka, Hive, Impala, HBase, Sqoop, Pig, Airflow, Oozie, ZooKeeper, Ambari, Flume, and NiFi.
  • Proficient at writing MapReduce jobs and UDFs to gather, analyze, transform, and deliver data according to business requirements.
  • Hands-on experience building real-time data streaming solutions using Apache Spark, Spark SQL, DataFrames, Kafka, Spark Streaming, and Apache Storm.
  • Proficient in building PySpark and Scala applications for interactive analysis, batch processing, and stream processing.
  • Strong working experience with SQL and NoSQL databases, data modeling, and data pipelines; involved in end-to-end development and automation of ETL pipelines using SQL and Python.
  • Significant experience with AWS cloud services (EMR, EC2, RDS, EBS, S3, Lambda, Glue, Elasticsearch, Kinesis, SQS, DynamoDB, Redshift, and ECS).
  • Experience with Azure cloud components (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, and Cosmos DB).
  • Experienced in writing Spark scripts in Python, Scala, and SQL for development and data analysis.
  • Good experience designing fact tables and star/snowflake schemas, and in design, coding, debugging, reporting, and data analysis with PostgreSQL and Cassandra, using Python libraries to speed up development.
  • Strong knowledge of object-oriented design and programming (OOP) concepts, with hands-on experience applying them in Python.
  • Extensive experience using Sqoop to ingest data from RDBMSs: Oracle, MS SQL Server, Teradata, PostgreSQL, and MySQL.
  • Experience with NoSQL databases (HBase, Cassandra, MongoDB, DynamoDB, and Cosmos DB) and their integration with Hadoop clusters.
  • Implemented detailed systems and services monitoring using Nagios, Zabbix, and AWS CloudWatch.
  • Knowledge of developing highly scalable and resilient RESTful APIs, ETL solutions, and third-party product integrations as part of an enterprise site platform.
  • Deep experience building ETL pipelines between several source systems and an enterprise data warehouse using Informatica PowerCenter, SSIS, SSAS, and SSRS.
  • Experience in all stages of data warehouse development: requirements gathering, design, development, implementation, testing, and documentation.
  • Experienced in R and Python for statistical computing, with additional experience in Spark MLlib, MATLAB, Excel, Minitab, SPSS, and SAS.
  • Experience designing interactive dashboards and reports and performing analysis and visualization using Tableau, Power BI, Arcadia, and Matplotlib.

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, AWS Glue, Zookeeper, Kafka, DataStax bulk loader, Apache Spark, Spark Streaming, HBase, Impala, Cloudera clusters

Machine Learning Classification Algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Principal Component Analysis

Languages: Python, R, PySpark, Shell scripting, SQL, PL/SQL, Java, Scala

Databases: Oracle, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL Database (HBase, MongoDB).

Cloud Technologies: MS Azure, Amazon Web Services (AWS)

IDEs & Tools: Eclipse, Visual Studio, NetBeans, JUnit, CI/CD, SQL Developer, MySQL Workbench, Tableau

PROFESSIONAL EXPERIENCE

Confidential, Nashville, Tennessee

Data Engineer

Responsibilities:

  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.
  • Strong experience leading multiple Azure Big Data and data transformation implementations in the banking and financial services, high-tech, and utilities industries.
  • Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, Azure Data Catalog, HDInsight, Azure SQL Server, Azure ML and Power BI.
  • Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
  • Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity.
  • Developed Spark applications using Scala and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Managed a hosted Kubernetes environment, making it quick and easy to deploy and manage containerized applications without container-orchestration expertise.
  • Undertook data analysis and collaborated with the downstream analytics team to shape the data according to their requirements.
  • Used Azure Event Grid, a managed event-routing service, to manage events across many different Azure services and applications.
  • Used Azure Service Bus to decouple applications and services from each other, providing benefits such as load-balancing work across competing workers.
  • Leveraged Delta Lake for scalable metadata handling and unified streaming and batch processing.
  • Used Delta Lake time travel, as data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
  • Used Delta Lake merge, update, and delete operations to enable complex use cases (see the Delta Lake sketch after this list).
  • Used Azure Databricks as a fast, easy, and collaborative Spark-based platform on Azure.
  • Used Databricks to integrate easily with the wider Microsoft stack.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Used Azure Data Catalog to organize data assets and get more value from existing investments.
  • Used Azure Synapse to bring these worlds together with a unified experience to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs.
  • Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and appropriate memory settings.
  • Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services (VSTS).
  • Designed and Developed Real time Stream processing Application using Spark, Kafka, Scala, and Hive to perform Streaming ETL and apply Machine Learning.
  • Created partitioned and bucketed Hive tables in Parquet format with Snappy compression and loaded data into them from Avro-based Hive tables.
  • Ran Hive scripts through Hive, Impala, and Hive on Spark, and some through Spark SQL.
  • Used Azure Data Factory with the SQL and MongoDB APIs to integrate data from MongoDB, MS SQL, and cloud stores (Blob Storage, Azure SQL DB, Cosmos DB).
  • Responsible for resolving the issues and troubleshooting related to performance of Hadoop cluster.
  • Involved in designing and developing tables in HBase and storing aggregated data from Hive Table.
  • Used Jira for bug tracking and Bit Bucket to check-in and checkout code changes.
  • Analyzed existing SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables (see the PySpark sketch after this list).
  • Worked on reading and writing multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark.
  • Provided guidance to the development team working on PySpark as an ETL platform.
  • Utilized machine learning algorithms such as linear regression, multivariate regression, PCA, K-means, & KNN for data analysis.
  • Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, designing and developing POCs using Scala, Spark SQL, and the MLlib libraries.
  • Performed all necessary day-to-day Git support for different projects; responsible for maintenance of the Git repositories and the access-control strategies.
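A minimal, illustrative sketch of the Delta Lake operations referenced above (time travel and merge/upsert). It assumes a Databricks runtime or the delta-spark package; the table paths and column names are hypothetical placeholders, not project specifics.

```python
# Illustrative Delta Lake sketch (PySpark): time travel and MERGE.
# Paths, tables, and columns below are placeholders.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta_demo").getOrCreate()

# Time travel: read an older snapshot of the table by version (or timestamp)
# to support rollbacks, audits, and reproducible ML training sets.
snapshot_v3 = (spark.read
               .format("delta")
               .option("versionAsOf", 3)
               .load("/delta/customer_usage"))

# Upsert: merge the latest batch into the Delta table.
target = DeltaTable.forPath(spark, "/delta/customer_usage")
updates = spark.read.parquet("/staging/usage_batch")

(target.alias("t")
       .merge(updates.alias("u"), "t.customer_id = u.customer_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```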
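A minimal PySpark sketch of the kind of job described above: loading CSV drops with differing schemas and writing them into a partitioned Hive ORC table. The landing path, column names, and table name are hypothetical placeholders.

```python
# Illustrative PySpark sketch: CSV landing files -> partitioned Hive ORC table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("csv_to_hive_orc")
         .enableHiveSupport()          # allows saveAsTable against the Hive metastore
         .getOrCreate())

# Read the CSV drop with header/schema inference; schema drift is handled by
# selecting only the columns common to every file.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("/data/landing/usage/*.csv"))       # hypothetical landing path

common_cols = ["customer_id", "event_type", "event_ts"]
cleaned = (raw.select(*common_cols)
              .withColumn("event_date", F.to_date("event_ts")))

# Write as ORC, partitioned by date, into a Hive-managed table.
(cleaned.write
        .mode("append")
        .format("orc")
        .partitionBy("event_date")
        .saveAsTable("analytics.usage_events"))
```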

Environment: Hadoop 2.x, Hive v2.3.1, Spark v2.1.3, Databricks, Lambda, Glue, Azure Event Grid, Azure Synapse Analytics, Azure Data Catalog, Service Bus, ADF, Delta Lake, Blob Storage, Cosmos DB, Python, PySpark, Java, Scala, SQL, Sqoop v1.4.6, Kafka, Airflow v1.9.0, Oozie, HBase, Oracle, Teradata, Cassandra, MLlib, Tableau, Maven, Git, Jira.

Confidential

Data engineer/Big data developer

Responsibilities:

  • Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehousing, and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful and valuable information for better decision-making.
  • Implemented a proof of concept deploying the product on AWS S3 and Snowflake.
  • Developed Scala scripts and UDFs using both DataFrames/Spark SQL and RDDs in Spark for data aggregation and queries, writing results back into S3 buckets.
  • Migrated the existing architecture to Amazon Web Services, using Kinesis, Redshift, AWS Lambda, CloudWatch metrics, and Amazon Athena queries over alerts landing in S3 buckets, and compared alert generation between the Kafka and Kinesis clusters.
  • Prepared scripts in Python and Scala to automate the ingestion process from various sources such as APIs, AWS S3, Teradata, and Snowflake.
  • Involved in end-to-end implementation of ETL pipelines using Python and SQL for high-volume data analysis; also reviewed use cases before onboarding data to HDFS.
  • Designed and developed Spark workflows using Scala to pull data from AWS S3 buckets and Snowflake and apply transformations to it.
  • Implemented Spark RDD transformations to map business logic and applied actions on top of those transformations.
  • Wrote AWS Lambda functions in Python that invoke Python scripts to perform various transformations and analytics on large data sets.
  • Created and ran jobs on AWS to extract, transform, and load data into AWS Redshift using AWS Glue, with S3 for data storage and AWS Lambda to trigger the jobs (see the Lambda sketch after this list).
  • Worked on structured and semi-structured data ingestion and processing on AWS using S3 and Python, and migrated on-premises big data workloads to AWS.
  • Created Python scripts to read CSV, JSON, and Parquet files from S3 buckets and load them into AWS S3, DynamoDB, and Snowflake (see the Snowflake sketch after this list).
  • Experienced in developing Web Services with Python programming language.
  • Involved in migrating objects from Teradata to Snowflake; scheduled Snowflake jobs using NiFi to ping Snowflake and keep the client session alive.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Migrated data from AWS S3 bucket to Snowflake by writing custom read/write snowflake utility function using Scala.
  • Created and managed CloudFormation templates for provisioning and configuring AWS resources.
  • Experience with AWS CloudWatch, including creating and managing alarms and metrics to monitor and troubleshoot issues with AWS resources.
  • Wrote scripts and indexing strategy for a migration to Confidential Redshift from SQL Server and MySQL databases.
  • Experience with DataStax Enterprise (DSE), including data modeling, data loading, and performance tuning.
  • Experience loading large datasets into DataStax using the bulk loader tool, including its usage and configuration options.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially implemented in Python (PySpark).
  • Deployed the code to EMR via CI/CD using Jenkins.
  • Extensively used Code Cloud for code check-ins and checkouts for version control.
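An illustrative AWS Lambda handler of the triggering pattern described above: when new objects land in S3, the function starts a Glue job that loads Redshift. The Glue job name, bucket layout, and argument key are hypothetical placeholders.

```python
# Illustrative Lambda handler (Python/boto3): start a Glue job on S3 put events.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # S3 put events arrive as a list of records; pass each new key to Glue.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        glue.start_job_run(
            JobName="load_to_redshift",          # hypothetical Glue job name
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
    return {"status": "started"}
```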
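A minimal PySpark sketch of loading S3 data into Snowflake, assuming the spark-snowflake connector is on the classpath. The bucket path, table, and connection values are placeholders; in practice credentials would come from a secrets manager.

```python
# Illustrative PySpark sketch: read Parquet from S3, write to Snowflake via
# the spark-snowflake connector. All connection values are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3_to_snowflake").getOrCreate()

df = spark.read.parquet("s3a://example-bucket/curated/orders/")  # hypothetical path

sf_options = {
    "sfURL": "example_account.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "********",            # use a secrets manager in practice
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
}

(df.write
   .format("net.snowflake.spark.snowflake")
   .options(**sf_options)
   .option("dbtable", "ORDERS")
   .mode("append")
   .save())
```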

Environment: AWS Glue, AWS S3, AWS Redshift, AWS EMR, AWS RDS, CloudFormation, CloudWatch, SSRS, SSIS, DynamoDB, MapReduce, Snowflake, Pig, Spark, Scala, Hive, Kafka, Airflow, Python, JSON, Parquet, CSV, Code cloud.

Confidential, Illinois

Big Data Engineer

Responsibilities:

  • Responsible for the planning and execution of big data analytics, predictive analytics and machine learning initiatives.
  • Implemented a proof of concept deploying this product on Amazon Web Services (AWS).
  • Assisted in leading the plan, build, and run states within the Enterprise Analytics Team.
  • Engaged in solving and supporting real business issues using knowledge of the Hadoop Distributed File System and open-source frameworks.
  • Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
  • Performed detailed analysis of business problems and technical environments and used this analysis to design solutions and maintain the data architecture.
  • Designed and developed software applications, testing, and building automation tools.
  • Designed efficient and robust Hadoop solutions for performance improvement and end-user experiences.
  • Worked in a Hadoop ecosystem implementation/administration, installing software patches along with system upgrades and configuration.
  • Conducted performance tuning of Hadoop clusters while monitoring and managing Hadoop cluster job performance, capacity forecasting, and security.
  • Automated data movement using Python scripts; involved in splitting, validating, and processing files.
  • Built data platforms, pipelines, and storage systems using Apache Kafka, Apache Storm, and search technologies such as Elasticsearch.
  • Led architecture and design of data processing, warehousing, and analytics initiatives.
  • Implemented solutions for ingesting data from various sources and processing the Data-at-Rest utilizing Big Data technologies using Hadoop, MapReduce, HBase, Hive and Cloud Architecture.
  • Worked on implementation and maintenance of Cloudera Hadoop cluster.
  • Created Hive external tables to stage data and then moved the data from staging into the main tables (see the Hive staging sketch after this list).
  • Implemented the Big Data solution using Hadoop, Hive, and Informatica to pull/load the data into the HDFS system.
  • Pulled data from the data lake (HDFS) and shaped it with various RDD transformations.
  • Actively involved in design, new development, and SLA-based support tickets for BigMachines applications.
  • Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, writing data back into RDBMSs through Sqoop.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
  • Developed Oozie workflow jobs to execute hive, Sqoop and MapReduce actions.
  • Wrote Pig UDFs in Python and used sampling of large data sets.
  • Provided thought leadership for architecture and design of Big Data analytics solutions for customers, actively driving Proof of Concept (POC) and Proof of Technology (POT) evaluations to implement Big Data solutions.
  • Developed numerous MapReduce jobs in Scala for data cleansing and analyzed data in Impala.
  • Created data pipelines using processor groups and multiple processors in Apache NiFi for flat-file and RDBMS sources as part of a POC on Amazon EC2.
  • Built Hadoop solutions for big data problems using MR1 and MR2 on YARN.
  • Loaded data from different sources such as HDFS and HBase into Spark RDDs and implemented in-memory computation to generate the output (see the RDD sketch after this list).
  • Developed complete end to end Big-data processing in Hadoop eco-system.
  • The objective of this project was to build a data lake as a cloud-based solution in AWS using Apache Spark and to provide visualization of the ETL orchestration using the CDAP tool.
  • Installed and configured a multi-node cluster in the cloud using Amazon Web Services (AWS) EC2.
  • Ran proofs of concept to determine feasibility and evaluate Big Data products.
  • Wrote Hive join queries to fetch information from multiple tables and wrote multiple MapReduce jobs to collect output from Hive.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
  • Used Hive to analyze data ingested into HBase by using Hive-HBase integration and compute various metrics for reporting on the dashboard.
  • Developed Shell, Perl and Python scripts to automate and provide Control flow to Pig scripts.
  • Involved in developing the MapReduce framework, writing queries, and scheduling MapReduce jobs.
  • Developed code for importing and exporting data into HDFS and Hive using Sqoop.
  • Developed customized classes for serialization and deserialization in Hadoop.
  • Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
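An illustrative sketch of the staging pattern mentioned above: an external Hive table over raw landing files, then an insert into a partitioned main table. Table, column, and path names are placeholders, and the HiveQL is issued through spark.sql so the example stays in Python; it assumes the main table dw.orders already exists as a partitioned Hive table.

```python
# Illustrative Hive staging sketch via PySpark with Hive support enabled.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive_staging_load")
         .enableHiveSupport()
         .getOrCreate())

# Allow dynamic partition inserts into the main table.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# External staging table pointing at the landing directory in HDFS.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS stg.orders_raw (
        order_id STRING,
        amount   DOUBLE,
        order_dt STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/landing/orders'
""")

# Move validated rows from staging into the partitioned main table.
spark.sql("""
    INSERT INTO TABLE dw.orders PARTITION (order_dt)
    SELECT order_id, amount, order_dt
    FROM stg.orders_raw
    WHERE amount IS NOT NULL
""")
```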
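A minimal PySpark RDD sketch of the in-memory computation described above: load delimited records from HDFS, cache them, and aggregate. The HDFS path and record layout are hypothetical.

```python
# Illustrative PySpark RDD sketch: HDFS text files -> cached RDD -> aggregation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd_aggregation").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/events/part-*")     # hypothetical HDFS path

# Parse "user_id,event_type,bytes" records and keep them in memory for reuse.
parsed = (lines.map(lambda line: line.split(","))
               .filter(lambda fields: len(fields) == 3)
               .cache())

# Total bytes per user, computed entirely from the cached RDD.
bytes_per_user = (parsed.map(lambda fields: (fields[0], int(fields[2])))
                        .reduceByKey(lambda a, b: a + b))

for user, total in bytes_per_user.take(10):
    print(user, total)
```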

Environment: Spark, Airflow, Machine learning, AWS, MS Azure, Cassandra, Avro, HDFS, GitHub, Hive, Pig, Linux, Python (Scikit-Learn/SciPy/NumPy/Pandas), SAS, SPSS, MySQL, Bitbucket, Eclipse, XML, PL/SQL, SQL connector, JSON, Tableau, Jenkins.

Confidential

Python Developer

Responsibilities:

  • Developed views and templates with Python and Django's view controller and templating language to create a user-friendly, high-level interface.
  • Used the Django framework to develop the application and built all database-mapping classes using Django models (see the Django sketch after this list).
  • Built an Interface between Django and Salesforce with REST API.
  • Used NumPy and Pandas in python for Data Manipulation.
  • Worked on infrastructure with Docker containerization.
  • Used Kubernetes to orchestrate the deployment, scaling, and management of Docker containers.
  • Used GitHub for Python source-code version control and Jenkins for automating Docker container builds.
  • Worked on microservices deployments on AWS ECS and EC2 instances.
  • Refactored existing batch jobs and migrated existing legacy extracts from Informatica to Python based micro services and deployed in AWS with minimal downtime.
  • Created AWS Security Groups for deploying and configuring AWS EC2 instances.
  • Added support for Amazon AWS S3 and RDS to host files and the database into Amazon Cloud.
  • Extensively used EC2, Auto Scaling, and load balancing; containerized various applications and built and migrated them onto Elastic Beanstalk for better deployments.
  • Utilized PyUnit, the Python unit-testing framework, and used pytest for all Python applications (see the pytest sketch after this list).
  • Enhanced the contents of existing Python modules and wrote APIs to load the processed data into HBase.
  • Used Custom SQL feature on Tableau Desktop to create very complex and performance optimized dashboards.
  • Connected Tableau to various databases and set up live data connections and automatic query updates on data refresh.
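A minimal Django sketch of the model/view pattern described above; the model fields, template path, and names are hypothetical placeholders rather than the application's actual schema.

```python
# Illustrative Django sketch: an ORM model and a simple view.

# models.py
from django.db import models

class Customer(models.Model):
    name = models.CharField(max_length=100)
    email = models.EmailField(unique=True)
    created_at = models.DateTimeField(auto_now_add=True)

# views.py
from django.shortcuts import render
from .models import Customer

def customer_list(request):
    # Query via the ORM and hand the results to a template for rendering.
    customers = Customer.objects.order_by("-created_at")
    return render(request, "customers/list.html", {"customers": customers})
```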
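A small pytest sketch of the unit-testing style mentioned above, against a hypothetical helper function (not part of the original codebase).

```python
# Illustrative pytest sketch: unit tests for a small, hypothetical helper.
import pytest

def normalize_email(value: str) -> str:
    """Lower-case and strip an e-mail address; raise on empty input."""
    if not value or not value.strip():
        raise ValueError("empty e-mail")
    return value.strip().lower()

def test_normalize_email_strips_and_lowercases():
    assert normalize_email("  User@Example.COM ") == "user@example.com"

def test_normalize_email_rejects_empty():
    with pytest.raises(ValueError):
        normalize_email("   ")
```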

Environment: Python, SQL, SQL Server, SSRS, PL/SQL, T-SQL, Tableau, MLlib, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, OLAP, MariaDB, SAP CRM, HDFS, SVM, JSON, Tableau, XML, AWS.
