
Big Data Engineer Resume


Plano, TX

SUMMARY

  • 8+ years of experience specializing in the Big Data ecosystem: Data Acquisition, Ingestion, Modeling, Storage, Analysis, Integration, and Data Processing.
  • Good knowledge of technologies for systems that process massive amounts of data in highly distributed mode on Cloudera and Hortonworks Hadoop distributions, GCP, and Amazon AWS.
  • Implemented frameworks for Data Quality Analysis, Data Governance, Data Trending, Data Validation, and Data Profiling using technologies such as Big Data, DataStage, Spark, Python, and Mainframe, with databases including Netezza, DB2, Hive, and Snowflake.
  • Experience in optimizing Hive SQL queries and Spark Jobs.
  • Built the infrastructure required for optimal extraction, transformation, and loading (ETL) of data from a wide variety of sources such as Salesforce, SQL Server, Oracle, and SAP using Azure, Spark, Python, Hive, Kafka, and other Big Data technologies.
  • Worked closely with the engineering, production support, and business product teams to support analytics platforms, dig deep into data, and drive effective decisions.
  • Thorough grasp of the Hadoop architecture and its components, including HDFS, Job Tracker, Task Tracker, NameNode, DataNode, MapReduce, and Spark.
  • Good knowledge of importing and exporting data between relational database systems and HDFS using Sqoop.
  • Used Python with Snowflake to create ETL pipelines into and out of the data warehouse, and wrote SQL queries against Snowflake using SnowSQL (see the connector sketch after this list).
  • Hands-on experience with Unified Data Analytics on Databricks: the Databricks Workspace UI, managing Databricks notebooks, and Delta Lake with Python and Spark SQL.
  • Good understanding of Spark architecture on Databricks, including Structured Streaming.
  • Set up Databricks on AWS and Microsoft Azure, configured the Databricks Workspace for business analytics, managed clusters in Databricks, and managed the machine learning lifecycle.
  • Strong Python scripter with experience using NumPy for statistics, Matplotlib for data visualization, and Pandas for data management. Loaded structured and semi-structured data into Spark clusters using the Spark SQL and DataFrame APIs.
  • Strong working knowledge of data modeling, data pipelines, and SQL and NoSQL databases; used Python and SQL to create and automate end-to-end ETL pipelines.
  • Knowledgeable in Snowflake features including Snowpipe, SnowSQL, cloud services, data sharing, reader accounts, sizing virtual warehouses in accordance with company SLAs, clustering keys, and materialized views.
  • Knowledge of setting up Snowflake AAD/SSO, row-level security, custom roles and hierarchies, RBAC permissions, and data masking.
  • Experience in designing and implementing a large-scale production data solution based on a Snowflake data warehouse.
  • Sound AWS cloud experience (EMR, EC2, RDS, EBS, S3, Lambda, Glue, Elasticsearch, Kinesis, SQS, DynamoDB, Redshift, ECS).
  • Capable of using AWS services such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on Amazon Web Services (AWS).
  • Working familiarity with Azure cloud components (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, Cosmos DB).
  • Experienced in building Spark scripts for development and analysis in Python, Scala, and SQL.
  • Proficient in converting Hive/SQL queries into Spark transformations using DataFrames and Datasets (see the PySpark sketch after this list).
  • Worked on analyzing Cloudera and Hortonworks Hadoop clusters and various big data analytics tools, including Pig, Hive, and Sqoop.
  • Monitored workload, job performance, and capacity planning using Cloudera Manager.
  • Extensive knowledge of text analytics, creating data visualizations in Python and R, and developing diverse statistical machine learning solutions to business problems.
  • Excellent comprehension of and expertise with NoSQL databases such as DynamoDB, MongoDB, HBase, and Cassandra; loaded and retrieved data from HBase via the REST API for real-time processing.
  • Experience integrating data with tools such as Informatica and SSIS, ETL data loading with SQL Server, and creating task-specific SSIS and DTS packages.
  • Created databases using T-SQL, including queries, sub-queries, ranking functions, derived tables, common table expressions, stored procedures, views, user-defined functions, constraints, and database triggers.
  • Knowledge of tools and services for job workflow scheduling and locking, such as Oozie, Zookeeper, Airflow, and Apache NiFi.
  • Hands-on Spark (Java/Python/Scala) engineer with at least three years of Python development, knowledge of AWS, experience with distributed systems, and a framework mindset.
  • Utilized Spark Streaming, Confluent Kafka, Storm, Flume, and Sqoop to work with a variety of streaming ingest services for batch and real-time processing.
  • Experience using Tableau, Power BI, Arcadia, and Matplotlib to create interactive dashboards, reports, ad hoc analyses, and visualizations.
  • Involved in development methodologies such as SDLC, Agile, and Scrum.
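
As a brief illustration of the Python-with-Snowflake ETL work mentioned above, the following is a minimal sketch using the snowflake-connector-python package; the account, credentials, stage, and table names are placeholders rather than actual project values.

# Minimal sketch: load staged files into Snowflake from Python and verify the row count.
# Connection parameters and object names are illustrative placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.us-east-1",   # placeholder account locator
    user="ETL_USER",
    password="********",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="STAGING",
)

try:
    cur = conn.cursor()
    # The same COPY INTO statement could be run interactively via SnowSQL
    cur.execute(
        "COPY INTO STAGING.ORDERS FROM @ORDERS_STAGE "
        "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
    )
    cur.execute("SELECT COUNT(*) FROM STAGING.ORDERS")
    print("rows loaded:", cur.fetchone()[0])
finally:
    conn.close()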
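
For the Hive-to-Spark conversion noted above, a minimal sketch of the pattern is shown below; the table and column names (sales, region, amount) are illustrative only.

# Minimal sketch: the same aggregation expressed as HiveQL via spark.sql and as
# DataFrame transformations.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("hive-to-dataframe")
    .enableHiveSupport()
    .getOrCreate()
)

# Original HiveQL:
#   SELECT region, SUM(amount) AS total_amount
#   FROM sales WHERE amount > 0 GROUP BY region
sql_result = spark.sql(
    "SELECT region, SUM(amount) AS total_amount FROM sales WHERE amount > 0 GROUP BY region"
)

# Equivalent DataFrame transformations
df_result = (
    spark.table("sales")
    .filter(F.col("amount") > 0)
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)

df_result.show()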

TECHNICAL SKILLS

Hadoop/Big Data Technologies: Hortonworks, MapReduce, Hive, Spark, Zookeeper, Cloudera, Oozie, Kafka, Flume, Hadoop, Sqoop, Redshift, Snowflake, Databricks

Hadoop Distributions: Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP, Amazon EMR (EMR, EC2, EBS, RDS, S3, Glue, Elasticsearch, Lambda, Kinesis, SQS, DynamoDB, Redshift, ECS), Azure HDInsight (Databricks, DataLake, Blob Storage, Data Factory, SQL DB, SQL DWH, CosmosDB, Azure DevOps, Active Directory).

ETL Tools: Informatica, SSAS, SSRS, SSIS

NoSQL Databases: HBase, Cassandra, DynamoDB, MongoDB.

Monitoring and Reporting: Power BI, Tableau

Application Servers: Apache, JDBC, ODBC

Build Tools: Maven, Jenkins, Apache, Gradle

Project Management: MS Project, PowerPoint, Excel, Jira

Programming & Scripting: Python, Scala, SQL, R

Databases: Oracle, MySQL, MongoDB, DynamoDB, MS SQL Server, MS Access, PostgreSQL, DB2.

Version Control: GIT, SVN

IDE Tools: RStudio, Eclipse, Jupyter, IntelliJ, PyCharm

Operating Systems: Linux, Unix, Mac OS-X, Windows 10, Windows 8, Windows 7

Cloud Computing: AWS, Azure

Cluster Managers: Docker, Kubernetes, Jira, Slack

Development Methodologies: Agile, Waterfall, Scrum

PROFESSIONAL EXPERIENCE

Big Data Engineer

Confidential, Plano TX

Responsibilities:

  • Extensive expertise using mapping data flows, scheduling triggers, and constructing pipeline jobs.
  • Developed solutions to process data into HDFS, analyzed the data using MapReduce, Pig, and Hive, and produced summary results from Hadoop for downstream systems.
  • Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
  • Designed and developed ETL integration patterns using Python on Spark.
  • Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
  • Used Kettle extensively to import data from various systems/sources such as MySQL into HDFS. Performed optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
  • Created Hive tables, then applied HiveQL on those tables for data validation. Moved the data from Hive tables into MongoDB collections.
  • Used Zookeeper for various types of centralized configurations.
  • Imported data from various systems/sources such as MySQL into HDFS.
  • Knowledge of cloud strategy, solution design, migration, network, storage, and visualization.
  • In-depth understanding of the AWS cloud, including EC2, IAM (Identity and Access Management), Lambda, VPC, S3, and Amazon Elastic MapReduce (EMR).
  • Experience setting up elastic pool databases and scheduling elastic jobs to run T-SQL commands.
  • Used Spark for log analytics and faster query response to analyze valuable information and cut costs.
  • Worked with Azure Synapse Analytics (SQL DW).
  • Worked with the new AWS Fargate API, which is similar to the ECS RunTask API.
  • Created Spark applications that use Spark and Spark SQL to extract, transform, and aggregate data from numerous file formats, analyzing the data to gain insight into client usage patterns (see the sketch after this list).
  • Performed tuning, testing, integration, troubleshooting, implementation, and maintenance for web, client/server, and data warehouse applications.
  • Worked with assorted technologies such as AWS, Oracle, Kubernetes, Apache NiFi, Spark 1/2, PL/SQL, Ansible Tower, and HDP/CDH.
  • Performed data manipulation tasks within a Spark session using the Spark DataFrame API.
  • Implementation expertise with Apache Spark for sophisticated, general-purpose distributed computing systems.
  • Solid grasp of the internals of Apache Spark.
  • Worked on the Snowflake data warehouse and performed complex operations using SnowSQL.
  • Knowledge of the Snowflake architecture and data model for developing warehouses and pipelines.
  • Knowledge of AWS CodeCommit, CodeBuild, CodeDeploy, CodePipeline, Jenkins, Bitbucket Pipelines, and Elastic Beanstalk for constructing CI/CD pipelines.
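
As a sketch of the multi-format Spark applications described above, the example below reads CSV and Parquet inputs and aggregates client usage; the S3 paths and column names are placeholders, and unionByName assumes both sources share the same schema.

# Minimal sketch: extract events from two file formats, union them, and
# aggregate usage per client for downstream analysis.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("usage-patterns").getOrCreate()

# Extract: load the same event schema from two different formats
csv_events = spark.read.option("header", True).csv("s3://example-bucket/events/csv/")
parquet_events = spark.read.parquet("s3://example-bucket/events/parquet/")

# Transform: union the sources by column name and aggregate usage by client
events = csv_events.unionByName(parquet_events)
usage = (
    events.groupBy("client_id")
    .agg(
        F.count("*").alias("event_count"),
        F.countDistinct("session_id").alias("sessions"),
    )
)

# Load: write the summary for downstream consumers
usage.write.mode("overwrite").parquet("s3://example-bucket/summaries/client_usage/")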

Environment: Python, PySpark, PyCharm, Oracle, SQL Server, Power BI, GIT, DataLake, Cloudera CDH 5.9.16, Hive, Impala, Kubernetes, Flume, Apache NiFi, Sqoop, Oozie, Snowflake.

Data Engineer

Confidential

Responsibilities:

  • Deep expertise in using parallel processing methods and distributed computing systems to effectively handle big data.
  • Designed and developed ETL integration patterns using Python on Spark. Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
  • Created a PySpark framework to bring data from DB2 to Amazon S3 (see the sketch after this list).
  • Translated business requirements into maintainable software components and assessed their technical and business impact.
  • Provided guidance to the development team working on PySpark as the ETL platform and ensured that quality standards were defined and met.
  • Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
  • Designed and implemented data pipelines to move enormous amounts of data from different data sources into Azure Data Lake and Azure SQL Data Warehouse.
  • Created an object-oriented Python project that collects data from several APIs and is scheduled using SQL Server Agent, using classes to keep the code simple to maintain and debug.
  • Proficiency in Databricks, Azure SQL DB, and Azure Synapse SQL query optimization.
  • Extensive exposure to numerous Hive concepts, including SerDes, built-in UDFs, bucketing, join optimizations, and custom UDFs.
  • Skilled in retrieving and manipulating data from Azure SQL Server, as well as working with files on Data Lake, Blob Storage, and SFTP/FTP servers.
  • Worked with U-SQL, Spark SQL, and Azure Data Lake Analytics; ingested data into one or more Azure services (Azure Data Lake).
  • Created automated ETL pipelines to import structured and unstructured data for data analysis.
  • Used Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics) to extract, transform, and load (ETL) data from source systems to Azure data storage services.
  • Investigated Spark to enhance and optimize existing Hadoop workloads using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Detailed exposure to Azure tools, including Azure Data Lake, Azure Databricks, Azure Data Factory, HDInsight, Azure SQL Server, and Azure DevOps.
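
A minimal sketch of the DB2-to-S3 PySpark pattern referenced above is shown below; the JDBC URL, credentials, table, bucket, and partition column are placeholders, and the DB2 JDBC driver jar is assumed to be available on the Spark classpath.

# Minimal sketch: read a DB2 table over JDBC and land it in S3 as Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db2-to-s3").getOrCreate()

db2_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:db2://db2-host:50000/SAMPLEDB")   # placeholder host/database
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("dbtable", "SCHEMA1.CUSTOMERS")                # placeholder table
    .option("user", "etl_user")
    .option("password", "********")
    .option("fetchsize", "10000")                          # larger fetches for bulk extract
    .load()
)

# Land the extract in S3 as partitioned Parquet for downstream consumers
(
    db2_df.write.mode("overwrite")
    .partitionBy("load_date")                              # assumes a load_date column exists
    .parquet("s3a://example-bucket/raw/db2/customers/")
)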

Environment: Python, PySpark, Azure SQL, DataLake, Databricks, Metastore, Spark, YARN, Redshift, Kafka, HBase, Zookeeper.

Senior Data Analyst

Confidential

Responsibilities:

  • Expert in writing SQL queries and query optimization for Teradata, Oracle, and SQL Server 2008.
  • Good working understanding of testing procedures, disciplines, tasks, resources, and scheduling in the context of the Software Development Life Cycle (SDLC).
  • Outstanding skills in data analysis, validation, cleaning, and verification, as well as in spotting data inconsistencies.
  • Analyzed and profiled data using complex SQL on a variety of sources, including Oracle and Teradata (see the profiling sketch after this list).
  • Excellent knowledge of Teradata SQL queries, indexes, and utilities such as MLoad, TPump, FastLoad, and FastExport.
  • Strong background in data analysis based on business demands utilizing MS Access and Excel.
  • Excellent understanding of Perl and Unix.
  • Experience with VBA macros and Excel pivot tables in a variety of business contexts.
  • Extensive experience using multiple ETL tools, such as Ab Initio and Informatica PowerCenter, for data analysis, migration, cleaning, transformation, integration, import, and export; proficiency in testing and writing SQL and PL/SQL statements, including stored procedures, functions, triggers, and packages.
  • Knowledge of setting up Korn jobs for Informatica sessions and automating and scheduling Informatica jobs using UNIX shell scripting.
  • Excellent experience in the analysis of large volumes of data in the finance, healthcare, and retail industries.
  • Proven expertise in building reports using SAP Business Objects and Webi for various data suppliers.
  • Excellent knowledge of how to create the necessary project documentation, track project progress, and keep all project stakeholders regularly informed.
  • Vast expertise and practical understanding of creating tables, reports, graphs, and listings utilizing a variety of resources.
  • Excellent data mining skills, including the ability to query and mine large databases to find transition patterns and examine financial data.
  • Knowledge of validating BI reports produced by several BI tools such as Cognos and Business Objects.
  • Outstanding proficiency in writing DML statements.
  • Extensive knowledge of Informatica 8.6.1/8.1 (PowerCenter/PowerMart) ETL testing (Designer, Workflow Manager, Workflow Monitor, and Server Manager).
  • Solid experience working in onsite/offshore environments, able to comprehend and develop functional requirements while collaborating with clients, with solid knowledge of requirements analysis.
  • Created test artifacts from the requirements documents.
  • Outstanding at producing a variety of project artifacts, including specification documents, data mapping, and data analysis documents.
  • A superb team player and technically proficient individual with the ability to collaborate with business users, project managers, team leads, architects, and peers to create a positive project environment.
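
A minimal sketch of the SQL-based profiling approach referenced above, using pandas with SQLAlchemy; the connection string, table, and columns are placeholders, and the Oracle driver shown is only one possible choice (Teradata works the same way with its own dialect).

# Minimal sketch: pull a slice of data with SQL and compute basic profiling metrics.
import pandas as pd
from sqlalchemy import create_engine

# Placeholder DSN; swap the dialect/driver for the target database
engine = create_engine("oracle+cx_oracle://scott:tiger@db-host:1521/?service_name=ORCLPDB")

query = """
    SELECT customer_id, account_status, open_date, balance
    FROM customer_accounts
    WHERE open_date >= DATE '2020-01-01'
"""
df = pd.read_sql(query, engine)

# Basic profiling: row counts, null rates, duplicate keys, and value distribution
print("rows:", len(df))
print(df.isna().mean().rename("null_rate"))
print("duplicate customer_ids:", df["customer_id"].duplicated().sum())
print(df["balance"].describe())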

Environment: Power BI, ETL Testing, Webi, Mload, Tpump, FastExport, Perl, Unix, SQL, Oracle, SDLC.

Data Analyst

Confidential

Responsibilities:

  • Created database apps and reports in MS Access and Excel that support admission logs and productivity tracking across all sites.
  • Automated repetitive payroll and accounting tasks with Python, saving 3 hours of manual work a week.
  • Analyzed a case study and increased revenue by 10% using point-of-sale data.
  • Made data backups and helped with data-related issues.
  • Supported transferring data from Access databases to Excel reports.
  • Handled all project management and employee training.
  • Used Python and R to graphically examine the datasets and gain an understanding of the data's nature.
  • Used Python's Pandas, NumPy, TensorFlow, and Matplotlib to build a variety of machine learning models.
  • Applied several statistical modeling techniques and machine learning algorithms, such as Linear and Logistic Regression, Decision Trees, Random Forests, Clustering, and Support Vector Machines.
  • Crafted complex SQL queries using joins and subqueries and performance-tuned SQL queries for data analysis and profiling.
  • Used Python and R libraries such as NumPy, Matplotlib, and Pandas to create a variety of graphical capacity planning reports (see the sketch after this list).
  • Collected data inputs from internal systems and maintained databases in MySQL and PostgreSQL.
  • Used SQL and Excel to slice and dice data for data preparation and cleaning.
  • During a POC, created Tableau scorecards, dashboards, and Gantt charts using stacked bars, bar graphs, scatter plots, and geographic maps.
  • Used data research, discovery, and mapping tools to examine every piece of data from a variety of sources.
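
A minimal sketch of the graphical capacity planning reports referenced above, using Pandas and Matplotlib; the input file and column names are illustrative placeholders.

# Minimal sketch: roll daily utilization up to monthly averages per resource
# and plot the trend against a capacity threshold.
import pandas as pd
import matplotlib.pyplot as plt

# Daily utilization export, e.g. from an internal monitoring system
usage = pd.read_csv("daily_utilization.csv", parse_dates=["date"])

# Aggregate to monthly averages per resource
monthly = (
    usage.set_index("date")
    .groupby("resource")["utilization_pct"]
    .resample("M")
    .mean()
    .unstack("resource")
)

ax = monthly.plot(figsize=(10, 5), marker="o", title="Monthly Average Utilization by Resource")
ax.set_ylabel("Utilization (%)")
ax.axhline(80, linestyle="--", color="red", label="capacity threshold")
ax.legend()
plt.tight_layout()
plt.savefig("capacity_planning_report.png")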

Environment: MS Access, Excel, SQL queries, Regression, Random Forest, NumPy, Matplotlib, Pandas, Decision Tree, TensorFlow, Tableau.
