Sr. Big Data Engineer Resume

CA

PROFESSIONAL SUMMARY

  • 8 years of technical IT experience in all phases of the Software Development Life Cycle (SDLC), with skills in data analysis, design, development, testing, and deployment of software systems.
  • More than 5 years of industry experience in Big Data analytics and data manipulation using Hadoop ecosystem tools: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Kafka, Flume, Sqoop, Oozie, Avro, AWS, Spark integration with Cassandra, and Zookeeper.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analysing and transforming the data to uncover insights into customer usage patterns.
  • Hands-on experience with Hadoop ecosystem components such as Spark, SQL, Hive, Pig, Sqoop, Flume, Zookeeper/Kafka, HBase, and MapReduce.
  • Experience converting SQL queries into Spark transformations using Spark RDDs and Scala; performed map-side joins on RDDs.
  • Experience importing and exporting data using Sqoop between HDFS and relational database systems (RDBMS) or non-relational database systems.
  • Experience developing data pipelines using AWS services including EC2, S3, Redshift, Glue, Lambda functions, Step Functions, CloudWatch, SNS, DynamoDB, and SQS.
  • Experience developing, supporting, and maintaining ETL (Extract, Transform, and Load) processes using Talend Integration Suite.
  • Proficient in multiple databases, including MongoDB, Cassandra, MySQL, Oracle, and MS SQL Server.
  • Excellent programming skills at a higher level of abstraction using Scala, Java, and Python.
  • Experience with Hive partitioning and bucketing, performing joins on Hive tables, and implementing Hive SerDes.
  • Worked with different file formats such as delimited files, Avro, JSON, and Parquet.
  • Experience developing Kafka producers and consumers for streaming millions of events per second.
  • Hands-on experience designing and developing Spark applications using Scala and PySpark to compare the performance of Spark with Hive and SQL/Oracle.
  • Experience manipulating and analysing large datasets and finding patterns and insights within structured and unstructured data.
  • Solid experience working with CSV, text, sequential, Avro, Parquet, ORC, and JSON data formats.
  • Made wide use of Teradata features such as BTEQ, FastLoad, MultiLoad, SQL Assistant, and DDL and DML commands; very good understanding of Teradata UPI and NUPI, secondary indexes, and join indexes.
  • Experience in writing complex SQL queries, creating reports and dashboards.
  • Ability to tune Big Data solutions to improve performance and end-user experience.
  • Working experience building RESTful web services and RESTful APIs.
  • Managed multiple tasks and worked under tight deadlines in a fast-paced environment.
  • Excellent analytical and communication skills, which help in understanding business logic and building good relationships between stakeholders and team members.
  • Strong communication and analytical skills; good team player and quick learner; organized and self-motivated.

TECHNICAL SKILL-SET

Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala

Hadoop Distribution: Cloudera CDH, Hortonworks HDP, Apache, AWS

Machine Learning Classification Algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbour (KNN), Principal Component Analysis

Languages: Shell scripting, SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, Regular Expressions

Web Technologies: HTML, JavaScript, RESTful, SOAP

Operating Systems: Windows (98/2000/XP/7/8/10), Mac OS, UNIX, Linux, Ubuntu, CentOS

Version Control: Git, GitHub

IDE & Tools, Design: Eclipse, Visual Studio, NetBeans, JUnit, CI/CD, SQL Developer, MySQL, Workbench, Tableau

Databases: Oracle, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL databases (HBase, MongoDB)

Cloud Technologies: MS Azure, Amazon Web Services (AWS)

Data Engineer/Big Data Tools / Cloud / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, MLlib, Oozie, Zookeeper, AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Salesforce, Google Shell, Linux, Bash Shell, Unix, Tableau, Power BI, SAS, Crystal Reports, Dashboard Design.

PROFESSIONAL EXPERIENCE

Confidential, CA

Sr. Big Data Engineer

Roles & Responsibilities

  • Worked as a Data Engineer to review business requirements and compose source-to-target data mapping documents.
  • Advised the business on Spark SQL best practices while making sure the solution met business needs.
  • Involved in the preparation, distribution, and collaborative review of client-specific quality documentation on Big Data and Spark developments, along with regular monitoring to reflect modifications or enhancements made in Confidential Schedulers.
  • Participated in requirements-gathering sessions with business analysts and product owners.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 and ORC/Parquet/text files into AWS Redshift.
  • Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
  • Involved in Big Data requirement analysis and in designing and developing solutions for ETL and Business Intelligence platforms.
  • Designed dimensional data models using Star and Snowflake schemas.
  • Consumed XML messages using Kafka and processed the XML files using Spark Streaming to capture UI updates.
  • Migrated data into the RV Data Pipeline using Databricks, Spark SQL, and Scala.
  • Used Databricks to encrypt data with server-side encryption.
  • Used Delta Lake, an open-source data storage layer that brings reliability to data lakes.
  • Used Databricks, which combines data science, data engineering, and production workflows.
  • Used Delta Lake to illustrate machine learning life cycles.
  • Worked in a Snowflake environment to remove redundancy and loaded real-time data from various data sources into HDFS using Kafka.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
  • Designed and built an ETL pipeline in Python to fetch data from the Redshift data warehouse and applications.
  • Created PySpark DataFrames to bring data from DB2 to Amazon S3.
  • Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
  • Optimized Hive queries using best practices and the right parameters, with technologies such as Hadoop, YARN, Python, and PySpark.
  • Used Delta Lake to provide ACID transactions on Spark.
  • Configured Databricks access to the managed S3 bucket with a cross-account AWS Identity and Access Management (IAM) role.
  • Accessed Hive tables using the Spark Hive context (Spark SQL) and used Scala for interactive operations.
  • Implemented a fully operational, production-grade, large-scale data solution on Snowflake Data Warehouse.
  • Worked on structured and semi-structured data ingestion and processing on AWS using S3 and Python, and migrated on-premises big data workloads to AWS.
  • Wrote Python scripts to parse XML documents and load the data into a database.
  • Performed QA on the data and added data sources, snapshots, and caching to the report.
  • Involved in SQL development, unit testing, and performance tuning, and ensured testing issues were resolved based on defect reports.
  • Developed Spark code using Scala and Spark SQL for faster testing and data processing.
  • Involved in preparing SQL and PL/SQL coding conventions and standards.
  • Involved in data mapping specifications to create and execute detailed system test plans.

ENVIRONMENT: Agile, ODS, OLTP, ETL, HDFS, Kafka, AWS, S3, Python, K-means, XML, SQL, Talend, Redshift, Glue, Delta Lake, Lambda, MS SQL, Cosmos DB, MongoDB, Ambari, Power BI, Azure DevOps, Ranger, Git, Spark, Hive, Scala, PySpark.

Confidential, Greenwood Village, CO

Senior Big Data Developer

Roles & Responsibilities

  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data to and from sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse, and the write-back tool.
  • Strong experience leading multiple Azure Big Data and data transformation implementations in the Banking and Financial Services, High Tech, and Utilities industries.
  • Implemented large Lambda architectures using Azure data platform capabilities such as Azure Data Lake, Azure Data Factory, Azure Data Catalog, HDInsight, Azure SQL Server, Azure ML, and Power BI.
  • Designed end-to-end scalable architecture to solve business problems using various Azure components such as HDInsight, Data Factory, Data Lake, Storage, and Machine Learning Studio.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL Activity.
  • Developed Spark applications using Scala and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analysing and transforming the data to uncover insights into customer usage patterns.
  • Managed a hosted Kubernetes environment, making it quick and easy to deploy and manage containerized applications without container orchestration expertise.
  • Undertook data analysis and collaborated with the downstream analytics team to shape the data according to their requirements.
  • Used Azure Event Grid, a managed event service, to easily manage events across many different Azure services and applications.
  • Used Service Bus to decouple applications and services from each other, providing benefits such as load-balancing work across competing workers.
  • Used Delta Lake for scalable metadata handling and unified streaming and batch processing.
  • Used Delta Lake time travel, whose data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
  • Used Delta Lake merge, update, and delete operations to enable complex use cases.
  • Used Azure Databricks as a fast, easy, and collaborative Spark-based platform on Azure.
  • Used Databricks to integrate easily with the whole Microsoft stack.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Used Azure Data Catalog, which helps organize data and get more value from existing investments.
  • Used Azure Synapse for a unified experience to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs.
  • Experienced in performance tuning of Spark applications, including setting the right batch interval, the correct level of parallelism, and memory tuning.
  • Created builds and releases for multiple projects (modules) in the production environment using Visual Studio Team Services (VSTS).
  • Designed and developed a real-time stream processing application using Spark, Kafka, Scala, and Hive to perform streaming ETL and apply machine learning.
  • Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression, then loaded data into the Parquet Hive tables from Avro Hive tables.
  • Involved in running all the Hive scripts through Hive, Impala, and Hive on Spark, and some through Spark SQL.
  • Collected JSON data from an HTTP source and developed Spark APIs that help perform inserts and updates in Hive tables.
  • Used Azure Data Factory, the SQL API, and the MongoDB API to integrate data from MongoDB, MS SQL, and the cloud (Blob, Azure SQL DB, Cosmos DB).
  • Responsible for troubleshooting and resolving issues related to Hadoop cluster performance.
  • Involved in designing and developing tables in HBase and storing aggregated data from Hive tables.
  • Used Jira for bug tracking and Bitbucket to check in and check out code changes.
  • Analysed SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Developed Spark applications in Python (PySpark) in a distributed environment to load a huge number of CSV files with different schemas into Hive ORC tables.
  • Worked on reading and writing multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark.
  • Provided guidance to the development team working on PySpark as an ETL platform.
  • Utilized machine learning algorithms such as linear regression, multivariate regression, PCA, K-means, and KNN for data analysis.
  • Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL, and MLlib libraries.
  • Performed all necessary day-to-day Git support for different projects; responsible for maintenance of the Git repositories and the access control strategies.

ENVIRONMENT: Hadoop 2.x, Hive v2.3.1, Spark v2.1.3, Databricks, Lambda, Glue, Azure data grid, Azure Synapse Analytics, Azure Data Catalog, Service Bus, ADF, Delta Lake, Blob, Cosmos DB, Python, PySpark, Java, Scala, SQL, Sqoop v1.4.6, Kafka, Airflow v1.9.0, Oozie, HBase, Oracle, Teradata, Cassandra, MLlib, Tableau, Maven, Git, Jira.

Confidential, Negaunee, MI

Data Engineer

Roles & Responsibilities:

  • Experience in Big Data analytics and design in the Hadoop ecosystem using MapReduce programming, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, and Kafka.
  • Performed Hive tuning techniques such as partitioning, bucketing, and memory optimization.
  • Worked with different file formats such as Parquet, ORC, JSON, and text files.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark).
  • Used Spark SQL to load data and created a schema RDD on top of it, which loads into Hive tables, and handled structured data using Spark SQL.
  • Worked on analysing the Hadoop cluster using different big data analytics tools including Flume, Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Spark, and Kafka.
  • As a Big Data Developer, implemented solutions for ingesting data from various sources and processing data at rest using Big Data technologies such as Hadoop, MapReduce frameworks, MongoDB, Hive, Oozie, Flume, Sqoop, and Talend.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, YARN, and PySpark.
  • Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs, such as Java MapReduce, Hive, Pig, and Sqoop.
  • Used the Databricks platform, which follows best practices for securing network access to cloud applications.
  • Hands-on experience with Git Bash commands: git pull to pull the code from source and develop it per the requirements, git add to add files, git commit after the code build, and git push to the pre-prod environment for code review; later used screwdriver.yaml, which builds the code and generates artifacts that are released into production.
  • Performed data validation by comparing record-wise counts between the source and destination.
  • Worked on the data support team, handling bug fixes, schedule changes, memory tuning, schema changes, and loading of historic data.
  • Worked on implementing checkpoints such as the Hive count check, Sqoop records check, done-file create check, done-file check, and touch-file lookup.
  • Worked with both Agile and Kanban methodologies.

ENVIRONMENT: Hadoop, MapReduce, HDFS, Hive, Cassandra, Sqoop, Oozie, SQL, Kafka, Spark, Scala, Java, GitHub, Talend Big Data Integration, Impala.

Confidential

Hadoop Developer

Responsibilities

  • Involved in the complete software development life cycle (SDLC) to develop the application.
  • Worked with different source data file formats such as JSON, CSV, and ORC.
  • Worked on Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark SQL and Scala.
  • Created Hive managed and external tables based on the requirements.
  • Developed Spark jobs to clean data obtained from various feeds to make it suitable for ingestion into Hive tables for analysis.
  • Worked with clusters created in a single VPC that Databricks creates and configures.
  • Analysed the SQL scripts and designed the solution for implementation using PySpark.
  • Experience importing data from various sources such as MySQL and Netezza using Sqoop and SFTP; performed transformations using Hive and Pig and loaded data back into HDFS.
  • Imported and exported data between environments such as MySQL and HDFS and deployed into production.
  • Worked on partitioning and bucketing in Hive tables and set tuning parameters to improve performance.
  • Involved in developing Impala scripts for ad hoc queries.
  • Experience with Oozie workflow scheduler templates to manage various jobs such as Sqoop, MR, Pig, Hive, and shell scripts.
  • Involved in importing and exporting data from HBase using Spark.
  • Involved in a POC for migrating ETLs from Hive to Spark in a Spark on YARN environment.
  • Utilized Hive tables and HQL queries for daily and weekly reports. Worked with complex data types in Hive such as Structs and Maps.
  • Created Cassandra tables to load large sets of structured, semi-structured, and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
  • Collaborated with the infrastructure, network, database, application, and BI teams to ensure data quality and availability.
  • Designed ETL processes using Informatica to load data from flat files, Oracle, and Excel files into the target Oracle Data Warehouse database.
  • Worked on the Hortonworks distribution; responsible for data ingestion, data cleansing, data standardization, and data transformation.
  • Worked with external vendors or partners to onboard external data into Target S3 buckets.
  • Worked on Oozie to develop workflows to automate the ETL data pipeline.
  • Imported data from various sources into Spark RDDs for analysis.
  • Configured Oozie workflows to run multiple Hive jobs that run independently based on time and data availability.
  • Supported code/design analysis, strategy development, and project planning.
  • Assisted with data capacity planning and node forecasting.

ENVIRONMENT: Hadoop, Kafka, Spark, Sqoop, Spark SQL, PySpark, Spark Streaming, Hortonworks, MapReduce, Hive, Scala, Pig, NoSQL, Impala, Oozie, HBase, Zookeeper, Oracle, MySQL, Netezza, Kubernetes, Docker, CI/CD, and UNIX shell scripting.

Confidential

Data Analyst

Responsibilities

  • Understood the data visualization requirements of the business users.
  • Wrote SQL queries to extract data from the sales data marts per the requirements.
  • Developed Tableau data visualizations using scatter plots, geographic maps, pie charts, bar charts, and density charts.
  • Designed and deployed rich graphic visualizations with drill-down and drop-down menu options, parameterized using Tableau.
  • Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
  • Explored traffic data from databases, connected it with transaction data, and presented and wrote a report for every campaign, providing suggestions for future promotions.
  • Extracted data using SQL queries and transferred it to Microsoft Excel and Python for further analysis.
  • Performed data cleaning, merging, and exporting of the dataset in Tableau Prep.
  • Applied data processing and cleaning techniques to reduce text noise and dimensionality to improve the analysis.

ENVIRONMENT: Python, Informatica v9.x, MS SQL Server, T-SQL, SSIS, SSRS, SQL Server Management Studio, Oracle, Excel, Tableau.
