
Big Data Engineer Resume

Bellevue, WA

SUMMARY

  • Experience in AWS CloudFront, including creating and managing distributions to provide access to an S3 bucket or an HTTP server running on EC2 instances.
  • Experience with big data platforms including Hadoop, Microsoft Azure Data Lake, Azure Data Factory, Azure Databricks, Azure Blob Storage, and graph databases.
  • Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis and design.
  • Highly experienced in Bash scripting and Linux and Hadoop command-line tools.
  • Expertise in transforming business requirements into analytical models, designing algorithms, building models, developing data mining and reporting solutions that scale across a massive volume of structured and unstructured data.
  • Skilled in configuring Zookeeper to coordinate servers in clusters & maintain consistency.
  • Experience in analyzing data using HiveQL, Pig Latin, HBase and custom Map Reduce programs in Java.
  • Java development skills using J2EE, spring, J2SE, Servlets, JUnit, MRUnit, JSP, JDBC.
  • Knowledge of implementing big data workloads in Azure Databricks for processing and managing the Hadoop framework.
  • Hands-on experience implementing LDA, Naive Bayes, and QDA, and skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, Neural Networks, Principal Component Analysis, Boosting, and Natural Language Processing.
  • Worked with and extracted data from database sources such as Teradata, Oracle, SQL Server, and DB2; regularly used JIRA and other internal issue trackers during project development.
  • Experience in optimizing MapReduce algorithms using mappers, reducers, combiners, and partitioners to deliver the best results for large datasets.
  • Expertise in loading data from different data sources into HDFS using Sqoop and loading it into partitioned Hive tables (a minimal sketch follows this list).
  • Extensive experience in data visualization, including producing tables, graphs, and listings using tools such as Tableau and Power BI.
  • Knowledge of implementing big data workloads on Amazon Elastic MapReduce (EMR), which runs the Hadoop framework on dynamically scalable Amazon EC2 instances.
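
A minimal PySpark sketch of the partitioned-Hive load pattern mentioned above; the Sqoop import itself is a separate command-line step, and the staging path, database, table, and partition column here are hypothetical.

```python
from pyspark.sql import SparkSession

# Hive support lets saveAsTable register the partitioned table in the metastore.
spark = (SparkSession.builder
         .appName("load-partitioned-hive")
         .enableHiveSupport()
         .getOrCreate())

# Data previously landed in HDFS by the ingestion (Sqoop) step -- hypothetical path.
staged = spark.read.parquet("hdfs:///staging/orders")

(staged.write
    .mode("append")
    .partitionBy("order_date")                 # one partition directory per day
    .saveAsTable("sales.orders_partitioned"))  # hypothetical target table
```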

TECHNICAL SKILLS

Languages: Scala, Java, SAS, Python, SQL, Shell scripting, R (packages: stats, zoo, Matrix, data.table, openssl), Spark 2.x (Spark SQL, Spark Streaming, PySpark), Hadoop, MapReduce, HDFS, Maven, Eclipse, Anaconda, Jupyter Notebook

NoSQL Databases: HBase, MongoDB

Operating Systems: Linux, Unix, Windows

BI Tools: Tableau, Power BI

Algorithms: Linear Regression, Logistic Regression, Lasso Regression, Ridge Regression, Generalized Linear Models, Random Forest, Boosting, XGBoost, LightGBM, KNN, SVM, QDA, LDA, K-Means Clustering, Neural Networks, NLP, AI, Boxplots

Big Data: Hadoop, HDFS, Hive, HBase, Impala, Cloudera Hue, Hortonworks, Spark, Scala, Sqoop, Talend, SSIS, Azure Data Factory, Azure Data Lake, AWS, EMR, S3, Oozie, Kafka, Apache Airflow, Shell scripts, Python scripts, Unix, Linux, PuTTY, WinSCP

ETL: SSIS, Azure Data Factory, Talend

Database Design Tools and Data Modeling: Star Schema/Snowflake Schema modeling, fact & dimension tables, physical & logical data modeling, normalization and de-normalization techniques, Teradata, Oracle, SQL Server

PROFESSIONAL EXPERIENCE

Confidential, Bellevue, WA

Big Data Engineer

Responsibilities:

  • Created data models for data analysis and extraction, writing complex SQL queries in Oracle, PostgreSQL, MySQL, and Microsoft SQL Server.
  • Used the Spark application master to monitor Spark jobs and capture their logs.
  • Implemented Spark using PySpark and Spark SQL for faster testing and processing of data.
  • Used the Spark API over Hortonworks Hadoop YARN to perform data analysis in Hive.
  • Used SQL Server and SSIS extensively to load and transform data.
  • Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster processing of data; supported the design of efficient and robust ETL (extract, transform, load) workflows over large datasets and the creation of big data warehouses used for reporting and analysis by data scientists.
  • Used Azure Data Lake to store, retrieve, and share large quantities of data in Azure; read from and wrote to the Data Lake from Apache Hadoop, Apache Spark, and Apache Hive. Used PCA for dimensionality reduction and built K-means clusterings.
  • Used Apache Flume and Apache Sqoop to load both structured and unstructured streaming data into HDFS, Hive, and HBase.
  • Built Java APIs for retrieval and analysis on NoSQL databases such as HBase.
  • Used Spark stream processing to bring data in-memory and implemented RDD transformations and actions to process it in units.
  • Created Hive tables and implemented partitioning, dynamic partitions, and bucketing, and created external tables to optimize performance.
  • Used Spark and Spark SQL to read Parquet data and create tables in Hive using the Scala API.
  • Developed Spark ML pipelines that feed data for automated training and testing of models.
  • Implemented MapReduce programs to handle unstructured data such as XML and JSON files, and sequence files for log files.
  • Developed Spark SQL jobs to load tables into HDFS and run select queries on top of them.
  • Performed aggregation over large amounts of log data collected using Apache Flume and staged in HDFS for further analysis.
  • Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
  • Migrated from Flume to Spark for real-time data and developed a Spark Streaming application in Java to consume data from Kafka and push it into Hive (a minimal sketch follows this list).
  • Used Maven to build and deploy Spring Boot microservices to the internal enterprise Docker registry; worked on Kubernetes and on Jenkins CI/CD pipelines.
  • Used AWS services like EC2 and S3 for small data sets processing and storage.
  • Provisioned Cloudera Director AWS instances and added the Cloudera Manager repository to scale up the Hadoop cluster in AWS.
  • Developed an AWS Lambda function to invoke a Glue job as soon as a new file lands in the inbound S3 bucket (see the second sketch after this list).
  • Monitored workload, job performance and capacity planning using Cloudera Manager.
  • Used HiveQL for data analysis, creating tables and importing structured data into them for reporting.
  • Used Pig to validate the data ingested with Sqoop and Flume, pushing the cleansed data set into HBase.
  • Automated data workflows with Oozie to enable speedy reviews and first-mover advantages.
  • Delivered fraud dashboards, trends, and plots on fraud data.
  • Studied fraud cases and identified process gaps to prevent losses.
  • Identified fraud patterns and rebuilt the machine learning models for fraud alarms.
  • Used Clustering and Statistical plots to analyze data.
  • Designed, developed, and implemented performant ETL pipelines using the Python API of Apache Spark (PySpark) on Azure Databricks.
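
A hedged sketch of the Kafka-to-Spark Streaming path described above; the production application was written in Java, so this PySpark Structured Streaming version is illustrative only, and the broker, topic, and paths are hypothetical (it also assumes the spark-sql-kafka package is available).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hive").getOrCreate()

# Read the topic as an unbounded stream; Spark handles micro-batching internally.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
          .option("subscribe", "fraud-events")                 # hypothetical topic
          .load()
          .select(col("value").cast("string").alias("payload")))

# Land each micro-batch as Parquet under the warehouse path, with checkpointing so the
# job can resume from its last committed Kafka offsets.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///warehouse/fraud_events")
         .option("checkpointLocation", "hdfs:///checkpoints/fraud_events")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```

And a minimal sketch of the Lambda-to-Glue trigger, assuming an S3 event notification is configured on the inbound bucket; the Glue job name and argument key are hypothetical.

```python
import boto3  # AWS SDK for Python, available in the Lambda runtime

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Each record describes an object that just landed in the inbound bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Start the Glue job, passing the new object's location as a job argument.
        glue.start_job_run(
            JobName="inbound-file-etl",                          # hypothetical job name
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
```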

Confidential, Atlanta, GA

Data Engineer

Responsibilities:

  • Connected to warehouse and supply chain databases such as OMS, TMS, WMS, PKMS, and YARDVIEW from Hadoop, Azure Databricks, and Azure Data Factory.
  • Experienced with the Hadoop ecosystem and the Spark framework (YARN, HDFS, Spark, Scala, PySpark, Python).
  • Extensively involved in writing SQL queries (subqueries, nested queries, views, join conditions, removal of duplicates) in Impala/Hive, Oracle, and Spark SQL, and T-SQL in Microsoft SQL Server.
  • Wrote complex queries, stored procedures, and triggers, and integrated data with SSIS in SQL Server.
  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write back to them.
  • Read a variety of databases from Azure Databricks over JDBC connections using Scala, Python, and PySpark and saved the data in ADL (a minimal sketch follows this list).
  • Read, cleaned, filtered, and preprocessed data, removed outliers, and built subsets; fit Linear Regression and Random Forest Regression models and selected the better model on the basis of R-squared and accuracy (see the model-selection sketch after this list).
  • Improved estimated customer delivery date (ECDD) accuracy to 80% using AI/ML models, which increased customer satisfaction by 20% and reduced customer calls by 10%.
  • Ingested batch files and tables from supply chain warehouse databases (OMS, TMS, WMS, PKMS, YARDVIEW, STELLA, BOLD360), hosted mostly in Oracle and DB2; converted them into Parquet and Delta files, then transformed and loaded them into Azure Data Lake (ADL) by reading from Azure Databricks with Scala and PySpark, using each database's server and host information, credentials, and JDBC driver JAR files.
  • Used various Spark transformations and actions to cleanse the input data.
  • Configured the Hadoop cluster with a NameNode and worker nodes and formatted HDFS.
  • Optimized performance using hyperparameter tuning, debugging, parameter fitting, and troubleshooting of models, and automated the processes.
  • Analyzed data and performed data preparation by applying the historical model to the dataset in Azure ML.
  • Performed data cleaning, applying backward- and forward-filling methods on the dataset to handle missing values.
  • Performed data transformation to rescale and normalize variables.
  • Planned, developed, and applied leading-edge analytic and quantitative tools and modelling techniques to help clients gain insights and improve decision-making.
  • Developed simple to complex MapReduce jobs in Java, implemented with Hive and Pig.
  • Utilized Spark, Scala, Java, Hadoop, HQL, VQL, Oozie, PySpark, Data Lake, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, Microsoft Azure, and Python, along with a broad variety of machine learning methods including regression and dimensionality reduction.
  • Applied various machine learning algorithms and statistical modelling techniques such as regression models, text analytics, natural language processing (NLP), supervised and unsupervised learning, and Random Forest regression.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Worked on Spark SQL queries and DataFrames; imported data from data sources, performed transformations and read/write operations, and saved the results to output directories in HDFS or ADL.
  • Successfully connected to different data sources using SSH, SFTP from Hadoop cluster and Azure data factory.
  • Used Git/GitHub as a repository tool, merged the Hadoop code in Git branch.
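
A minimal PySpark sketch of the Databricks JDBC ingestion described above; the hostname, credentials, source table, and lake path are hypothetical placeholders (writing Delta assumes a Databricks or Delta-enabled runtime).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oms-ingest").getOrCreate()

# Pull a warehouse table over JDBC (hypothetical Oracle source); in Databricks the
# password would normally come from a secret scope rather than a literal string.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//oms-db.example.com:1521/OMS")
          .option("dbtable", "OMS.ORDER_HEADER")
          .option("user", "svc_ingest")
          .option("password", "<secret>")
          .option("driver", "oracle.jdbc.driver.OracleDriver")
          .load())

# Land the table in the lake as Delta (hypothetical ADLS container and path).
(orders.write
    .format("delta")
    .mode("overwrite")
    .save("abfss://raw@datalake.dfs.core.windows.net/oms/order_header"))
```

And a hedged sketch of the regression model-selection step, comparing R-squared on a held-out split; the input file, feature columns, and target are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical preprocessed extract with a delivery-days target column.
df = pd.read_csv("ecdd_features.csv")
X, y = df.drop(columns=["delivery_days"]), df["delivery_days"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
}

# Fit each candidate and report held-out R-squared; the better score wins.
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, r2_score(y_test, model.predict(X_test)))
```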

Confidential, Moline, IL

Data Engineer

Responsibilities:

  • Interacted with Data Modelers and Business Analysts to understand the requirements and the impact of the ETL on the business.
  • Worked on extracting, transforming, and loading (ETL) data from Excel and flat files to MS SQL Server using DTS and SSIS services.
  • Created packages in SSIS with error handling.
  • Worked with different methods of logging in SSIS.
  • Extensively worked on fact and Slowly Changing Dimension (SCD) tables.
  • Loaded historical data into the Enterprise Data Warehouse using full and incremental loads (see the upsert sketch after this list).
  • Involved in building Data Marts and multi-dimensional models like Star Schema and Snowflake schema.
  • Filtered data from the transient stage into the EDW using complex T-SQL statements in Execute SQL tasks and in transformations, and implemented various constraints and triggers for data consistency and integrity.
  • Wrote T-SQL scripts, dynamic SQL, complex stored procedures, functions, triggers, and SQLCMD scripts.
  • Used data conversion tasks in SSIS to load data from flat files into SQL Server databases.
  • Performed MS SQL Server configuration, performance tuning, client-server connectivity, query optimization, and database maintenance plans; performed database transfers, query tune-ups, integrity verification, data cleansing, analysis, and interpretation.
  • Constructed OLTP and OLAP Databases.
  • Created complex SSAS cubes with multiple fact measure groups and multiple dimension hierarchies based on OLAP reporting needs.
  • Performed ETL, data profiling, data quality checks, and clean-ups for SSIS packages.
  • Created and managed schema objects such as tables, views, indexes, stored procedures, and triggers, and maintained referential integrity.
  • Created packages using SSIS for data extraction from Flat Files, Excel Files, and OLEDB to SQL Server.
  • Developed complex SSRS reports using multiple data providers, Global Variables, Expressions, user defined objects, aggregate aware objects, charts, and synchronized queries.
  • Designed and developed OLAP Cubes and Dimensions using SQL Server Analysis Services (SSAS).
  • Worked with project teams to upgrade SQL Server 2005 to SQL Server 2008.
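
The incremental-load (upsert) pattern above was implemented in SSIS and T-SQL; as an illustration only, this pyodbc sketch shows the same MERGE-based upsert from a staging table into the warehouse, with hypothetical server, database, and column names.

```python
import pyodbc

# Hypothetical connection to the EDW instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sqlprod01;DATABASE=EDW;Trusted_Connection=yes;"
)

# Upsert the staged rows: update existing keys, insert new ones.
merge_sql = """
MERGE dbo.DimCustomer AS tgt
USING stage.Customer AS src
    ON tgt.CustomerKey = src.CustomerKey
WHEN MATCHED THEN
    UPDATE SET tgt.Name = src.Name, tgt.City = src.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerKey, Name, City)
    VALUES (src.CustomerKey, src.Name, src.City);
"""

with conn:                      # the connection context manager commits on success
    conn.execute(merge_sql)
```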

Confidential, Deerfield, IL

Data Engineer

Responsibilities:

  • Developed, reviewed and updated architecture and process documentation, server diagrams, requisition documents and other technical documents.
  • Followed Agile methodology, attended daily Scrum meetings and Sprint planning meetings.
  • Responsible for cluster maintenance, commissioning data nodes, cluster monitoring, troubleshooting, and managing and reviewing data backups and Hadoop log files.
  • Implemented MapReduce programs to handle unstructured data like XML & JSON files and sequence files for log files.
  • Wrote Spark RDD transformations, actions, DataFrames, and case classes for the required input data and performed data transformations using Spark Core.
  • Integrated Hive queries into the Spark environment using Spark SQL and PySpark.
  • Developed Spark SQL jobs to load tables into HDFS and run select queries on top of them.
  • Performed aggregation over large amounts of log data collected using Apache Flume and staged in HDFS for further analysis.
  • Implemented Spark RDDs in Scala and Python.
  • Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files (a minimal sketch follows this list).
  • Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
  • Involved in creating Hive Tables, loading with data and writing Hive queries which will invoke and run MapReduce jobs in the backend.
  • Scheduled periodic jobs which range from updates on MapReduce jobs to creating Ad-hoc jobs for the business users.
  • Configured Kafka for efficiently collecting, aggregating and moving large amounts of click stream data from many different sources to HDFS.
  • Developed REST-based microservices supporting both XML and JSON to perform tasks such as demand response management.
  • Implemented AWS solutions using EC2, S3, RDS, EMR, EBS, Elastic Load Balancer, Auto scaling groups and EC2 instances.
  • Used AWS services like EC2 and S3 for small data sets processing and storage.
  • Provisioned Cloudera Director AWS instances and added the Cloudera Manager repository to scale up the Hadoop cluster in AWS.
  • Monitored workload, job performance and capacity planning using Cloudera Manager.
  • Used HiveQL for data analysis, creating tables and importing structured data into them for reporting.
  • Used Pig to validate the data ingested with Sqoop and Flume, pushing the cleansed data set into HBase.
  • Responsible for creating mappings and workflows to extract and load data from relational databases, flat file sources, and legacy systems using Talend.
  • Designed and developed ETL jobs using Talend Integration Suite (Talend 5.2.2).
  • Used Spark SQL to load JSON data, create SchemaRDDs, and load them into Hive tables, and handled structured data with Spark SQL.
  • Used Spark SQL to load data into Hive tables and wrote queries to fetch data from them.
  • Developed Spark programs using the PySpark and Scala APIs and performed transformations and actions on RDDs.
  • Designed and implemented Spark jobs to support distributed data processing.
  • Wrote Spark applications in Scala and Python (PySpark).
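
A minimal PySpark sketch of the JSON-flattening preprocessing job described above; the input path, nested field names, and target table are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = (SparkSession.builder
         .appName("flatten-json")
         .enableHiveSupport()
         .getOrCreate())

# Each document may hold an array of items; explode turns it into one row per item.
raw = spark.read.json("hdfs:///data/raw/events/*.json")

flat = (raw
        .select("event_id", "event_ts", explode("items").alias("item"))
        .select("event_id", "event_ts",
                col("item.sku").alias("sku"),
                col("item.qty").alias("qty")))

flat.write.mode("overwrite").saveAsTable("analytics.events_flat")
```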

Confidential, Sacramento, CA.

Hadoop Developer

Responsibilities:

  • Participated in software development life cycle from requirement gathering to product delivery.
  • Developed Hadoop MapReduce jobs in Java for batch processing to search and match scores.
  • Loaded data into the Hadoop Distributed File System (HDFS) and used Pig to preprocess it.
  • Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
  • Performed analysis on the unused user navigation data by loading it into HDFS and writing MapReduce jobs.
  • Integrated Oozie with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box such as Map-Reduce, Pig, Hive, Sqoop, Flume.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Wrote optimized Pig scripts and was involved in developing and testing Pig Latin scripts.
  • Working knowledge of writing Pig Load and Store functions.
  • Developed job flows to automate the workflow for PIG and HIVE jobs.
  • Created final tables in Parquet format and used Impala to create and manage the Parquet tables.
  • Implemented data ingestion and cluster handling for real-time processing using Apache Kafka.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying of the log data (an illustrative sketch follows this list).
  • Worked on multithreaded middleware using socket programming to introduce a whole set of new business rules, applying OOP design principles.
  • Involved in implementing Java multithreading concepts.
  • Developed several REST web services supporting both XML and JSON to perform tasks such as demand response management.
  • Implemented logging using Log4j and internal logging APIs, and used Git as the version control tool.
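
The log-parsing jobs above were written in HiveQL; as an illustration only, here is a PySpark equivalent that structures raw log lines into a queryable table, with an assumed log line format and hypothetical paths and table name.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.appName("parse-logs").enableHiveSupport().getOrCreate()

# One string column named "value" per raw log line.
logs = spark.read.text("hdfs:///logs/app/*.log")

# Assumed line layout: "2015-06-01 12:00:03 INFO  Request served in 42 ms"
pattern = r"^(\S+ \S+) (\w+)\s+(.*)$"
parsed = logs.select(
    regexp_extract("value", pattern, 1).alias("event_ts"),
    regexp_extract("value", pattern, 2).alias("level"),
    regexp_extract("value", pattern, 3).alias("message"),
)

parsed.write.mode("overwrite").saveAsTable("ops.app_logs")
```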
