
Senior Big Data Engineer Resume


Detroit, Michigan

SUMMARY

  • 8 years of experience as a Sr. Data Engineer and Hadoop Developer, covering big data and Hadoop technologies, Spark, Scala, Python, machine learning algorithms, and data pipeline design, development, deployment, and implementation.
  • Worked on Databricks Unified Data Analytics, the Databricks Workspace user interface, managing Databricks notebooks, and Delta Lake with both Python and Spark SQL (a minimal Delta Lake sketch follows this list).
  • Thorough knowledge of Databricks, Spark architecture, and Structured Streaming, including setting up Databricks on AWS and Microsoft Azure, using the Databricks workspace for business analytics, managing clusters, and managing the machine learning lifecycle.
  • Experienced in database administration and Azure data platform services (Azure Data Lake Storage (ADLS), Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB), as well as SQL Server, Oracle, and data warehouses; built numerous data lakes.
  • Built, trained, and deployed machine learning models quickly using AWS SageMaker.
  • Worked with AWS services such as EC2, S3, Redshift, Glue, Lambda functions, Step functions, CloudWatch, SNS, DynamoDB, and SQS to build data pipelines.
  • Expertise with a variety of databases, including MongoDB, Cassandra, MySQL, ORACLE, and MS SQL Server.
  • Built triggers and tables and implemented stored procedures, functions, views, user profiles, data dictionaries, and data integrity constraints using T-SQL.
  • Developed Snowflake Schemas by normalizing dimension data as needed and constructing a Demographic Sub Dimension as a subset of the Customer Dimension.
  • Practical knowledge of test-driven development (TDD), behavior-driven development (BDD), and acceptance test-driven development (ATDD).
  • Hands-on experience with Snowflake Database, including creating and materializing views in Snowflake.
  • Extensive expertise in text analytics, data visualizations using R and Python, and dashboard creation using Tableau and PowerBI.
  • Supported logical/physical database design, schema management, and deployment over the whole life cycle.
  • Familiar with current software development methodologies such as Agile, Scrum, Test Driven Development (TDD), and Continuous Integration/Continuous Deployment (CI/CD).
  • Built, tested, and deployed applications using Kubernetes and Docker as the CI/CD runtime environment; working knowledge of building and running Docker images containing multiple microservices.
  • Extensive hands-on expertise with distributed computing architectures to tackle large data challenges, including AWS products (e.g., EC2, Redshift, EMR, and Elasticsearch), Hadoop, Python, Spark, and efficient use of Azure SQL Database, MapReduce, Hive, SQL, and PySpark.
  • Proficient in converting business resources and requirements into manageable data formats and analytical models, as well as devising algorithms, building models, and implementing data mining and reporting solutions that scale across large volumes of structured and unstructured data.
  • Expert in system analysis, E-R/Dimensional Data Modeling, Database Design, and RDBMS implementation.
  • Python development experience writing custom UDFs to extend Hive and Pig Latin functionality.
  • Experience building sophisticated mappings, tuning performance, and incrementally updating dimension tables and fact tables.
  • Python experience creating automated regression scripts for ETL process validation across databases such as Oracle, SQL Server, Hive, and MongoDB.
  • SQL proficiency in a variety of dialects (MySQL, PostgreSQL, Redshift, SQL Server, and Oracle).
  • Experience with API management technologies such as AWS API Gateway, RESTful APIs, Route 53, AWS Lambda, and web services.
  • Outstanding communication abilities; a self-motivated, eager learner who has worked successfully both independently and in collaborative teams in fast-paced, multitasking workplaces.
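The Delta Lake work noted in the Databricks bullet above can be illustrated with a minimal PySpark upsert sketch. This is an assumed, illustrative example rather than project code: the table path, the column names (id, amount, updated_at), and the session configuration are hypothetical placeholders, and it presumes the delta-spark package is available (on Databricks, Delta support is built into the runtime and the two config lines are unnecessary).

```python
# Minimal Delta Lake upsert sketch with PySpark and the delta-spark package.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("delta-upsert-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Incoming batch of changed rows (stand-in for an ingested source).
updates = spark.createDataFrame(
    [(1, 120.0, "2021-06-01"), (2, 75.5, "2021-06-01")],
    ["id", "amount", "updated_at"],
)

target_path = "/mnt/datalake/orders_delta"  # hypothetical table location

if DeltaTable.isDeltaTable(spark, target_path):
    # Upsert: update matching ids, insert new ones.
    (DeltaTable.forPath(spark, target_path).alias("t")
        .merge(updates.alias("u"), "t.id = u.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
else:
    # First load: materialize the Delta table from the initial batch.
    updates.write.format("delta").save(target_path)
```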

TECHNICAL SKILLS

Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, HBase, YARN, Kafka, Flume, Sqoop, Impala, Oozie, Zookeeper, Spark, Ambari, Elasticsearch, MongoDB, Cassandra, Avro, Parquet, Snappy, AWS.

Machine Learning Classification Algorithms: Support Vector Machine (SVM), Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbors (KNN), Gradient Boosting Classifier.

Cloud Technologies: AWS and Azure.

IDEs: IntelliJ, Eclipse, Spyder, Jupyter, NetBeans.

Ensemble and Stacking: Averaged Ensembles, Weighted Averaging, Base Learning, Meta Learning, Majority Voting, Stacked Ensembles, AutoML (Scikit-Learn, MLjar).

Databases & Warehouses: Oracle 11g/10g/9i, MySQL, DB2, MS-SQL Server, HBASE, NoSQL, SQL Server, MS Access, Teradata

Programming/Query Languages: SQL, T-SQL, PL/SQL, Python (Pandas, NumPy, SciPy, Scikit-Learn, Seaborn, Matplotlib, NLTK), NoSQL, PySpark, SAS, Linux shell scripting, Scala.

Version Controllers: GIT, SVN, Bitbucket.

ETL Tools: Informatica, OWB, Talend

Operating Systems: UNIX, LINUX, Mac OS, Windows.

Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR, Amazon EMR

PROFESSIONAL EXPERIENCE

Confidential - Detroit, Michigan

Senior Big Data Engineer

Responsibilities:

  • Created scalable and dependable ETL solutions that efficiently combine massive and complex data from several platforms.
  • Automated operations using Amazon Web Services (AWS), including EC2, S3, CloudFront, DynamoDB, Lambda, Elastic File System, RDS, VPC, Direct Connect, Route 53, CloudWatch, CloudTrail, CloudFormation, and IAM.
  • Designed and deployed multi-tier applications on AWS with CloudFormation, utilizing AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM) with an emphasis on high availability, fault tolerance, and auto-scaling.
  • Used Elastic Block Store, S3, and Glacier in AWS for persistent storage; created volumes and configured snapshots for EC2 instances.
  • Ingested data from RESTful APIs, databases, and CSV files.
  • Migrated data pipelines from Cloudera Hadoop clusters to AWS EMR clusters.
  • Experience designing and deploying Hadoop clusters and other Big Data analytic tools, including Pig, Hive, HBase, Sqoop, Kafka, and Spark, on the Cloudera distribution.
  • Developed an Apache Kafka-Spark Streaming integration to ingest data from external REST APIs and apply custom transformations.
  • Used Kafka and Spark Streaming to extract real-time feeds, transform them to RDDs, analyze the data in DataFrames, and save it in Parquet format in HDFS (a streaming sketch follows this list).
  • Created several Kafka producers and consumers from the ground up to meet the requirements.
  • Experience tuning Spark job performance.
  • Optimized the performance of Spark jobs by caching intermediate results and taking full advantage of the cluster environment.
  • Developed Spark scripts based on the requirements using Scala shell commands.
  • Created Oozie workflows for scheduling numerous Hive and Pig jobs.
  • Assisted with the execution of Hadoop streaming jobs that processed terabytes of text data; file formats included text, SequenceFile, Avro, ORC, and Parquet.
  • Used Amazon EMR for Big Data processing in a Hadoop Cluster of virtual machines on Amazon's EC2 and S3 services.
  • Built, trained, and deployed machine learning models using AWS SageMaker.
  • Assisted in the creation of Docker images and the execution of tasks on a Kubernetes cluster.
  • Implemented generalized solution models in AWS SageMaker.
  • Extensive experience with Spark's core APIs and data processing on an EMR cluster.
  • Developed and deployed AWS Lambda functions to build a serverless data pipeline whose output is registered in the Glue Catalog and can be queried from Athena for ETL migration services.
  • Used S3 buckets and Glacier for storage and backup on AWS, and configured lifecycle rules for S3 buckets.
  • Managed IAM users by creating new users, granting them restricted access based on their needs, and assigning roles and policies to them.
  • Used Snowflake's time travel functionality to retrieve previous data and participated in Snowflake testing to determine the optimum method to use cloud resources.
  • Assisted with the creation of Delta Lake tables and the execution of merging scripts to handle upserts.
  • Implemented scripts to auto-refresh external tables in Snowflake.
  • Used Python, Spark, and Spark Streaming to create an analytical component.
  • Serve as a technical contact between the customer and the team for all AWS-related issues.
  • Worked with Spark RDDs and Scala to turn Hive/SQL queries into Spark transformations.
  • Experience with Hadoop streaming jobs processing gigabytes of XML and JSON data.
  • Used the Spark API over Hadoop YARN as the execution engine for Hive data analysis.
  • Installed the YARN Capacity Scheduler in a variety of environments and fine-tuned configurations based on application-specific workloads.
  • Using Jenkins, Maven, and GIT, configured a Continuous Integration system to run automated test suites at predetermined intervals.
  • Working knowledge of Agile and Scrum methodologies; designed, implemented, and maintained continuous build and integration environments.
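The Kafka and Spark Streaming bullets above can be sketched end to end with Spark Structured Streaming (a later API than the RDD-based DStreams named in the bullets, used here because it keeps the example short). The broker address, topic name, event schema, and HDFS paths are hypothetical, and the spark-sql-kafka connector is assumed to be supplied at spark-submit time.

```python
# Sketch of a Kafka -> Spark -> Parquet-on-HDFS flow with Structured Streaming.
# Broker, topic, schema, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-parquet-sketch").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("payload", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
    .option("subscribe", "rest_api_events")             # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the value as bytes; decode and parse the JSON payload.
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
       .select("e.*")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/events/parquet")           # hypothetical sink
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```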

Environment: Hadoop, HDFS, Hive, Spark, Cloudera, AWS EC2, AWS S3, AWS EMR, Sqoop, Kafka, Yarn, Shell Scripting, Scala, Pig, Cassandra, Oozie, Agile methods, MySQL

Confidential - Boston, MA

Sr. Data Engineer

Responsibilities:

  • Hands-on experience in Azure Cloud Services (PaaS & IaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
  • Worked on creating tabular models on Azure analysis services for meeting business reporting requirements.
  • Good experience working with Azure Blob and Data Lake storage and loading data into Azure Synapse Analytics (SQL DW).
  • Extracted, transformed, and loaded data from source systems to Azure Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL DB, Azure SQL DW) and processed the data in Azure Databricks.
  • Experience working with the Snowflake data warehouse.
  • Moved data from Azure Blob Storage to the Snowflake database.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
  • Designed custom Spark REPL application to handle similar datasets.
  • Used Hadoop scripts for HDFS (Hadoop Distributed File System) data loading and manipulation.
  • Performed Hive test queries on local sample files and HDFS files.
  • Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
  • Worked on analyzing the Hadoop cluster and different Big Data analytic tools, including Pig, Hive, HBase, Spark, and Sqoop.
  • Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization, and user report generation.
  • Developed Spark applications using Scala and Python, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Developed Spark programs using Scala and Java APIs and performed transformations and actions on RDDs.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python.
  • Developed ETL processes using Spark, Scala, Hive, and HBase.
  • Analyzed user request patterns and implemented various performance optimization measures including implementing partitions and buckets in HiveQL.
  • Assigned name to each of the columns using case class option in Scala.
  • Assisted in loading large sets of data (structured, semi-structured, and unstructured) into HDFS.
  • Developed Spark SQL to load tables into HDFS to run select queries on top.
  • Used AWS SageMaker to quickly build, train, and deploy machine learning models.
  • Developed analytical component using Scala, Spark, and Spark Stream.
  • Used visualization tools such as Power View for Excel and Tableau for visualizing and generating reports.
  • Perform validation and verify software at all testing phases which includes Functional Testing, System Integration Testing, End to End Testing, Regression Testing, Sanity Testing, User Acceptance Testing, Smoke Testing, Disaster Recovery Testing, Production Acceptance Testing and Pre-prod Testing phases.
  • Good experience logging defects in Jira and Azure DevOps.
  • Analyzed data profiling results and performed various transformations.
  • Hands-on experience creating reference tables using both the Informatica Analyst and Informatica Developer tools.
  • Wrote Python scripts to parse JSON documents and load the data into the database (a minimal sketch follows this list).
  • Generated various graphical capacity planning reports using Python packages such as NumPy and Matplotlib.
  • Hands-on experience with Snowflake utilities, SnowSQL, Snowpipe, and big data modeling techniques using Python.
  • Built ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL; wrote SQL queries against Snowflake.
  • Used Python APIs to extract daily data from multiple vendors.
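A minimal sketch of the "parse JSON documents and load the data into the database" bullet above, pointed at Snowflake via the snowflake-connector-python package. The connection parameters, target table, and JSON layout are hypothetical placeholders; a production load would more likely stage the files and use COPY INTO, but a parameterized insert keeps the sketch short.

```python
# Sketch: parse JSON documents and load rows into Snowflake using
# snowflake-connector-python. Credentials, table, and JSON layout are
# hypothetical placeholders.
import json
import snowflake.connector

def load_vendor_file(path: str) -> None:
    with open(path) as fh:
        records = json.load(fh)  # assumes a JSON array of objects

    rows = [(r["vendor_id"], r["metric"], r["value"]) for r in records]

    conn = snowflake.connector.connect(
        user="ETL_USER",              # hypothetical credentials and account
        password="********",
        account="xy12345.us-east-1",
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="STAGING",
    )
    try:
        cur = conn.cursor()
        cur.executemany(
            "INSERT INTO VENDOR_METRICS (VENDOR_ID, METRIC, VALUE) "
            "VALUES (%s, %s, %s)",
            rows,
        )
        conn.commit()
    finally:
        conn.close()
```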

Environment: Azure, ADF, Azure Databricks, Snowflake, Spark, Hadoop, Hive, Oozie, Java, Linux, Oracle 11g, MySQL, IDQ Informatica Tool 10.0, IDQ Informatica Developer Tool 9.6.1 HF3.

Confidential, Coppell, Texas

Data Engineer

Responsibilities:

  • Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Used Sqoop to import and export data between Oracle/PostgreSQL and HDFS for analysis.
  • Migrated Existing MapReduce programs to Spark Models using Python.
  • Migrating the data from Data Lake (hive) into S3 Bucket.
  • Done data validation between data present in Data Lake and S3 bucket.
  • Used Spark Data Frame API over Cloudera platform to perform analytics on hive data.
  • Designed batch processing jobs using Apache Spark to increase speeds by ten-fold compared to that of MR jobs.
  • Used Kafka for real time data ingestion.
  • Created different Kafka topics for reading the data.
  • Read data from different Kafka topics and ran Spark Structured Streaming jobs on an AWS EMR cluster.
  • Moved data from S3 buckets to the Snowflake data warehouse for generating reports.
  • Created database objects such as stored procedures, UDFs, triggers, indexes, and views using T-SQL in both OLTP and relational data warehouse environments in support of ETL.
  • Developed complex ETL Packages using SQL Server 2008 Integration Services to load data from various sources like Oracle/SQL Server/DB2 to Staging Database and then to Data Warehouse.
  • Created report models from cubes as well as relational data warehouse to create ad-hoc reports and chart reports
  • Migrated an existing on-premises application to AWS.
  • Developed PIG Latin scripts to extract the data from the web server output files and to load into HDFS.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Created many Spark UDFs and UDAFs in Hive for functions that were not preexisting in Hive and Spark SQL.
  • Loaded data into Amazon Redshift and used AWS CloudWatch to collect and monitor AWS RDS instances within Confidential.
  • Developed Kafka producers and consumers using the Python API, writing Avro schemas.
  • Developed and executed a migration strategy to move Data Warehouse from an Oracle platform to AWS Redshift.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Implemented different performance optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins (see the sketch after this list).
  • Good knowledge of Spark platform parameters such as memory, cores, and executors.
  • Used ZooKeeper in the cluster to provide concurrent access to Hive tables with shared and exclusive locking.
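The partitioning, bucketing, and map-side join optimizations in the bullet above can be expressed with the PySpark DataFrame writer, as in this hedged sketch. The table names, column names, and S3 source path are hypothetical, and the target Hive database is assumed to already exist.

```python
# Sketch of partitioned/bucketed Hive table layout and a broadcast join
# written with PySpark. Names and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder.appName("hive-layout-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

orders = spark.read.parquet("s3://example-bucket/raw/orders/")  # hypothetical source

(
    orders.write.mode("overwrite")
    .format("parquet")
    .partitionBy("order_date")          # prune partitions on date filters
    .bucketBy(32, "customer_id")        # co-locate rows for joins on customer_id
    .sortBy("customer_id")
    .saveAsTable("analytics.orders_bucketed")  # hypothetical Hive table
)

# A broadcast (map-side) join against a small dimension table,
# as in the "map side joins" bullet.
dim = spark.table("analytics.customer_dim")
joined = orders.join(broadcast(dim), "customer_id")
```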

Environment: Linux, Apache Hadoop Framework, HDFS, YARN, HIVE, HBASE, AWS (S3, EMR), Scala, Spark, SQOOP, MS SQL Server 2014, Teradata, ETL, Tableau (Desktop 9.x/Server 9.x), Python 3.x (Scikit-Learn/SciPy/NumPy/Pandas), AWS Redshift, Spark (PySpark, MLlib, Spark SQL).

Confidential, Charlotte, NC

Hadoop Developer

Responsibilities:

  • Experience in Big Data Analytics and design in Hadoop ecosystem using MapReduce Programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, Kafka
  • Installed and configured Hadoop MapReduce, HDFS and developed multiple MapReduce jobs in Java for data cleansing and preprocessing.
  • Involved in loading data from UNIX file system to HDFS.
  • Installed and configured Hive and written Hive UDFs. Importing and exporting data into HDFS and Hive using Sqoop
  • Worked hands on with ETL process using Informatica.
  • Handled importing of data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
  • Extracted the data from Teradata into HDFS using Sqoop.
  • Loaded data into Hive tables from Hadoop Distributed File System (HDFS) to provide SQL-like access on Hadoop data.
  • Strong exposure to UNIX scripting and good hands-on shell scripting experience.
  • Wrote Python scripts to process semi-structured data in formats like JSON (a short sketch follows this list).
  • Involved in loading and transforming large sets of structured, semi-structured, and unstructured data.
  • Troubleshot and located bugs in Hadoop applications, working with the testing team to clear them all.
  • Analyzed the data by performing Hive queries and running Pig scripts to know user behavior.
  • Exported the patterns analyzed back into Teradata using Sqoop.
  • Partner with technical and non-technical resources across the business to leverage their support and integrate our efforts.
  • Partner with infrastructure and platform teams to configure, tune tools, automate tasks and guide the evolution of internal big data ecosystem; serve as a bridge between data scientists and infrastructure/platform teams.
  • Implemented Big Data Analytics and Advanced Data Science techniques to identify trends, patterns, and discrepancies on petabytes of data by using Azure Databricks, Hive, Hadoop, Python, PySpark, Spark SQL, MapReduce, and Azure Machine Learning.
  • Performed data analysis using regressions, data cleaning, Excel VLOOKUP, histograms, and the TOAD client; presented the analysis and suggested solutions for investors.
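A short, assumed sketch of the kind of Python script mentioned above for processing semi-structured JSON before it lands in HDFS/Hive. The field list and the nested meta.source attribute are hypothetical.

```python
# Sketch: flatten newline-delimited JSON into tab-separated records so they
# can be loaded into HDFS/Hive. Field names and structure are hypothetical.
import json
import sys

FIELDS = ["user_id", "event", "ts"]

def flatten(line: str) -> str:
    doc = json.loads(line)
    # Top-level fields, with empty strings when a field is missing.
    row = [str(doc.get(f, "")) for f in FIELDS]
    # Pull one nested attribute up into its own column.
    row.append(str(doc.get("meta", {}).get("source", "")))
    return "\t".join(row)

if __name__ == "__main__":
    for line in sys.stdin:
        line = line.strip()
        if line:
            print(flatten(line))
```

Run as a filter it could feed HDFS directly, e.g. cat events.json | python flatten_events.py | hdfs dfs -put - /data/events/events.tsv (assuming the HDFS client's "-" stdin source is available); the exact load path would depend on the cluster.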

Environment: Hadoop, MapReduce, HDFS, UNIX, Hive, Sqoop, Cassandra, ETL, Oozie, Big Data ecosystems, Pig, Cloudera, Python, Informatica Cloud Services, Salesforce, UNIX scripts, flat files, XML files

Confidential

Data Analyst

Responsibilities:

  • Documented program creation, logic, testing, and implementation, as well as application integration and code, in detail.
  • Using MS Visio, gathered requirements and generated Use Cases, Use Case Diagrams, and Activity Diagrams
  • Performed Gap Analysis to ensure that the current system infrastructure was compatible with the new business needs.
  • Wrote technical papers, gathered technical requirements, and assembled them to aid the design system.
  • Performed unit testing, system testing, and system integrated testing.
  • Tested complex ETL mappings and sessions that load data from source flat files and RDBMS tables to target tables based on business user needs and business rules.
  • Responsible for various data mapping activities from source systems to Teradata.
  • Experience creating UNIX scripts for file transfer and file manipulation.
  • Create high level and low-level design documents and review the Business requirement documents.
  • Coordinated with the onshore coordinators for daily updates and new project requirements.
  • Worked closely with the QA team and developers to clarify and understand functionality, resolve issues, and provide feedback to pin down bugs.
  • Performed user acceptance testing (UAT), unit testing, and documentation.
  • Maintained a close and strong working relationship with teammates and management staff to achieve expected results for the project team.
  • Created performance dashboards in Tableau, Excel, and PowerPoint for key stakeholders.
  • Developed functional specifications and testing requirements using Test Director to conform to user needs.

Environment: SQL, SQL Server 2012, MS Office, MS Visio, Jupyter, R 3.1.2, Python, SSRS, SSIS, SSAS, MongoDB, HBase, HDFS, Hive, Pig, SQL Server Management Studio, Business Intelligence Development Studio.
