
Sr. Big Data Engineer Resume


TX

SUMMARY

  • Around 7 years of technical experience in analysis, design and development with Big Data technologies such as Spark, MapReduce, Hive, Kafka and HDFS, using programming languages including Python, Scala and Java.
  • Strong experience in Software Development Life Cycle (SDLC) including Requirements Analysis, Design Specification and Testing as per Cycle in both Waterfall and Agile methodologies.
  • Data Engineering professional with solid foundational skills and a proven track record of implementations across a variety of data platforms.
  • Strong experience in writing scripts using the Python API, PySpark API and Spark API for analyzing the data.
  • Extensively used Python libraries including PySpark, Pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy and Beautiful Soup.
  • Worked in developing a Nifi flow prototype for data ingestion in HDFS.
  • Expertise in Python and Scala; developed user-defined functions (UDFs) for Hive and Pig using Python.
  • Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto HDFS.
  • Experience in developing MapReduce programs using Apache Hadoop for analyzing big data as per the requirements.
  • Hands-on experience in GCP, BigQuery, GCS buckets and Stackdriver.
  • Experience in Google Cloud components, Google container builders, GCP client libraries and Cloud SDKs.
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Experience working with NoSQL databases like Cassandra, HBase and MongoDB.
  • Good working knowledge of the Amazon Web Services (AWS) Cloud Platform, including services like EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS and SES.
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
  • Experience in GCP Dataproc, GCS, Cloud Functions and BigQuery.
  • Worked with Cloudera and Hortonworks distributions.
  • Extensive experience working with Spark, performing ETL using Spark SQL and Spark Core and real-time data processing using Spark Streaming.
  • Strong experience working with various file formats like Avro, Parquet, ORC, JSON and CSV.
  • Experience in developing customized UDF's in Python to extend Hive and Pig Latin functionality.
  • Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
  • Experienced in building automated regression scripts for validation of ETL processes between multiple databases like Oracle, SQL Server, Hive and MongoDB using Python.
  • Proficiency in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server and Oracle.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Azure Databricks and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Worked extensively on Sqoop for performing both batch loads as well as incremental loads from relational databases.
  • Experience in designing star and snowflake schemas for Data Warehouse and ODS architectures.
  • Hands on expertise in writing different RDD (Resilient Distributed Datasets) transformations and actions using Scala, Python.
  • Proficient SQL experience in data extraction, querying and developing queries for a wide range of applications.
  • Experience working with GitHub, Jenkins, and Maven.
  • Performed importing and exporting data into HDFS and Hive using Sqoop.
  • Strong experience in analyzing large data sets by writing PySpark scripts and Hive queries.
  • Highly motivated self-learner with a positive attitude and a willingness to learn new concepts and accept challenges.
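As a small illustration of the Python UDF work listed above, below is a minimal sketch of the kind of cleaning function that could be registered as a Hive UDF through PySpark. The function and field names are hypothetical, and the PySpark registration is shown only in a comment so the sketch stays self-contained:

```python
import re

def normalize_phone(raw):
    """Strip punctuation and a leading US country code from a phone string.

    In PySpark this could be registered for use in Hive queries, e.g.:
        spark.udf.register("normalize_phone", normalize_phone)
    (registration call shown for illustration only)
    """
    if raw is None:
        return None
    digits = re.sub(r"\D", "", raw)          # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                   # assumption: 10-digit US numbers
    return digits if len(digits) == 10 else None

print(normalize_phone("+1 (512) 555-0142"))  # 5125550142
print(normalize_phone("bad-number"))         # None
```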

TECHNICAL SKILLS

BigData Ecosystem: Hive, Spark, MapReduce, Hadoop, Yarn, HDFS, Hue, Impala, HBase, Oozie, Sqoop, Pig, Flume, Airflow

Programming Languages: Python, Scala, Shell Scripting and Java

Methodologies: Agile/Scrum development, Waterfall model, RAD

Build and CI/CD: Maven, Docker, Jenkins, GitLab

Cloud Management: GCP, Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena; Microsoft Azure

Databases: MySQL, Oracle, Teradata

NO SQL Databases: Cassandra, MongoDB and HBase

IDE and ETL Tools: IntelliJ, Eclipse, Informatica 9.6/9.1, Tableau prep.

Operating System: Windows, Unix, Sun Solaris

PROFESSIONAL EXPERIENCE

Confidential

Sr. Big Data Engineer

Responsibilities:

  • Responsible for ingesting large volumes of user behavioral data and customer profile data into the analytics data store.
  • Developed PySpark and Scala based Spark applications for performing data cleaning, event enrichment, data aggregation, de-normalization and data preparation needed for machine learning and reporting teams to consume.
  • Wrote Spark Streaming applications to consume the data from Kafka topics and write the processed streams to HBase and MongoDB.
  • Worked on POC to check various cloud offerings including Google Cloud Platform (GCP).
  • Developed a POC for project migration from on prem Hadoop MapR system to GCP.
  • Compared self-hosted Hadoop with GCP's Dataproc, and explored Bigtable (managed HBase) use cases and performance.
  • Used the Cloud Shell SDK in GCP to configure the services Dataproc, Storage, BigQuery and DQF.
  • Worked on fine-tuning Spark applications to improve the overall processing time for the pipelines.
  • Wrote Kafka producers to stream the data from external REST APIs to Kafka topics.
  • Experienced in handling large datasets using Spark in Memory capabilities, using broadcasts variables in Spark, effective & efficient Joins, transformations, and other capabilities.
  • Good experience with continuous Integration of application using Bamboo.
  • Worked extensively with Sqoop for importing data from Oracle.
  • Created private cloud using Kubernetes that supports DEV, TEST, and PROD environments.
  • Wrote HBase bulk-load jobs to load processed data into HBase tables by converting it to HFiles.
  • Designed and customized data models for a Data Warehouse supporting data from multiple sources in real time.
  • Experience working with EMR clusters in the AWS cloud and with S3, Redshift and Snowflake.
  • Wrote Glue jobs to migrate data from HDFS to S3 data lake.
  • Involved in creating Hive tables, loading, and analysing data using hive scripts.
  • Implemented Partitioning, Dynamic Partitions, Buckets in Hive.
  • Documented operational problems by following standards and procedures using JIRA.
  • Experience in developing Spark applications using Spark SQL in Databricks for data extraction.
  • Used Reporting tools like Tableau to connect with Impala for generating daily reports of data.
  • Collaborated with the infrastructure, network, database, application and BA teams to ensure data quality and availability.
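The enrichment and aggregation work described above can be sketched in plain Python. In the actual pipelines this logic would run in PySpark/Scala Spark; the record shapes and field names here are hypothetical:

```python
from collections import defaultdict

def enrich_and_aggregate(events, profiles):
    """Join behavioral events with customer profiles, then count events per
    segment. Mirrors the Spark flow: enrich (join) -> clean -> aggregate."""
    counts = defaultdict(int)
    for ev in events:
        profile = profiles.get(ev["user_id"])
        if profile is None:      # data cleaning: drop events with no profile
            continue
        counts[profile["segment"]] += 1
    return dict(counts)

events = [
    {"user_id": "u1", "action": "click"},
    {"user_id": "u2", "action": "view"},
    {"user_id": "u9", "action": "click"},  # no profile -> dropped
]
profiles = {"u1": {"segment": "premium"}, "u2": {"segment": "free"}}
print(enrich_and_aggregate(events, profiles))  # {'premium': 1, 'free': 1}
```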

Environment: Hadoop, Spark, Scala, GCP, Python, Hive, HBase, MongoDB, Sqoop, Oozie, Kafka, Snowflake, Amazon EMR, Glue, YARN, JIRA, amazon S3, Shell Scripting, SBT, GITHUB, Maven.

Confidential, TX

Big Data Engineer

Responsibilities:

  • Worked on developing ETL processes (DataStage Open Studio) to load data from multiple data sources to HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.
  • Worked collaboratively to manage build outs of large data clusters and real time streaming with Spark.
  • Developed ETL data pipelines using Spark, Spark streaming and Scala.
  • Responsible for loading Data pipelines from web servers using Sqoop, Kafka and Spark Streaming API.
  • Experience working with the Snowflake data warehouse.
  • Created Databricks notebooks using SQL and Python and automated the notebooks using jobs.
  • Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL database for huge volume of data.
  • Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, Azure Data Catalog, HDInsight, Azure SQL Server, Azure ML and Power BI.
  • Using Azure Databricks, created Spark clusters and configured high-concurrency clusters to speed up the preparation of high-quality data.
  • Used Azure Databricks for fast, easy, and collaborative spark-based platform on Azure.
  • Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
  • Developed various UDFs in MapReduce and Python for Pig and Hive.
  • Defined job flows and developed simple to complex MapReduce jobs as per the requirements.
  • Developed Pig UDFs for manipulating the data according to business requirements and worked on developing custom Pig loaders.
  • Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
  • Designed and developed Apache NiFi jobs to get the files from transaction systems into the data lake raw zone.
  • Continuously monitor and manage data pipeline (CI/CD) performance alongside applications from a single console with GCP.
  • Developed Pig Latin scripts for the analysis of semi-structured data.
  • Experienced in Databricks platform where it follows best practices for securing network access to cloud applications.
  • Used Hive and created Hive tables and involved in data loading and writing Hive UDFs.
  • Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
  • Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Used Azure Data Factory with the SQL API and MongoDB API to integrate data from MongoDB, MS SQL and cloud stores (Blob, Azure SQL DB, Cosmos DB).
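A rough pure-Python analogue of the Spark Streaming micro-batch pattern used in the pipelines above (Kafka and Spark themselves omitted; keys and batch contents are hypothetical):

```python
def process_micro_batch(batch, running_totals):
    """Update running per-key totals from one micro-batch of (key, value)
    pairs, analogous to maintaining state across Spark Streaming batches."""
    for key, value in batch:
        running_totals[key] = running_totals.get(key, 0) + value
    return running_totals

# Two simulated micro-batches arriving from a stream
totals = {}
for micro_batch in [[("sales", 10), ("returns", 1)], [("sales", 5)]]:
    process_micro_batch(micro_batch, totals)
print(totals)  # {'sales': 15, 'returns': 1}
```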

Environment: Spark, Spark Streaming, Apache Kafka, Apache NiFi, Hive, Tez, Azure, Azure Databricks, Azure data grid, Azure Synapse analytics, Azure data catalog, ETL, PIG, PySpark, UNIX, Linux, Tableau, Teradata, Pig, Snowflake, Sqoop, Hue, Oozie, Java, Scala, Python, GIT, GIT HUB

Confidential, Lake Success, NY

Data Engineer

Responsibilities:

  • Responsible for building an Enterprise Data Lake to bring ML ecosystem capabilities to production and make it readily consumable for data scientists and business users.
  • Processing and transforming the data using AWS EMR to assist the Data Science team as per business requirements.
  • Developing Spark applications for cleaning and validation of the ingested data in the AWS cloud.
  • Working on fine-tuning Spark applications to improve the overall processing time for the pipelines.
  • Implement simple to complex transformation on Streaming Data and Datasets.
  • Work on analyzing Hadoop clusters and different big data analytic tools including Hive, Spark, Python, Sqoop, Flume and Oozie.
  • Use Spark Streaming to stream data from external sources using the Kafka service; responsible for migrating the code base from the Cloudera Platform to Amazon EMR and evaluating Amazon ecosystem components like Redshift and DynamoDB.
  • Perform configuration, deployment, and support of cloud services in Amazon Web Services (AWS).
  • Designing and building multi-terabyte, full end-to-end Data Warehouse infrastructure from the ground up on Confidential Redshift.
  • Design, develop and test ETL processes in AWS Glue to migrate campaign data from external sources like S3 (ORC/Parquet/text files) into AWS Redshift.
  • Migrate an existing on-premises application to AWS.
  • Build and configure a virtual data center in the Amazon Web Services cloud to support Enterprise Data Warehouse hosting, including Virtual Private Cloud, Security Groups and Elastic Load Balancer.
  • Implement data ingestion and handling clusters in real time processing using Kafka.
  • Develop Spark programs using Scala and Java APIs and perform transformations and actions on RDDs.
  • Develop Spark applications to filter JSON source data in AWS S3, store it into HDFS with partitions and use Spark to extract the schema of the JSON files.
  • Develop Terraform scripts to create AWS resources such as EC2, Auto Scaling Groups, ELB, S3, SNS and CloudWatch alarms.
  • Developed various kinds of mappings with collection of sources, targets and transformations using Informatica Designer.
  • Develop Spark programs with PySpark, applying principles of functional programming to process complex unstructured and structured data sets; processed the data with Spark from the Hadoop Distributed File System (HDFS).
  • Implement serverless architecture using AWS Lambda with Amazon S3 and Amazon DynamoDB.
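The JSON schema-extraction step above was done with Spark's schema inference in the actual pipeline; a simplified plain-Python sketch of the idea follows (field names are hypothetical):

```python
import json

def infer_schema(json_lines):
    """Infer a flat field -> type-name mapping from JSON-lines records,
    a simplified stand-in for Spark's spark.read.json(...).schema."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            # first type seen for a field wins, as a simplifying assumption
            schema.setdefault(field, type(value).__name__)
    return schema

records = ['{"id": 1, "city": "Austin"}', '{"id": 2, "active": true}']
print(infer_schema(records))  # {'id': 'int', 'city': 'str', 'active': 'bool'}
```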

Environment: Apache Spark, Scala, Java, PySpark, Hive, HDFS, Hortonworks, Apache HBase, AWS EMR, EC2, AWS S3, AWS Redshift, Redshift Spectrum, RDS, Lambda, Informatica Center, Maven, Oozie, Apache NiFi, CI/CD Jenkins, Tableau, IntelliJ, JIRA, Python and UNIX Shell Scripting

Confidential

Data Engineer

Responsibilities:

  • Worked on development of data ingestion pipelines using the Talend ETL tool and bash scripting, with big data technologies including but not limited to Hive, Impala, Spark and Kafka.
  • Experience in developing scalable & secure data pipelines for large datasets.
  • Gathered requirements for ingestion of new data sources, including life cycle, data quality checks, transformations and metadata enrichment.
  • Importing data from MS SQL server and Teradata into HDFS using Sqoop.
  • Supported data quality management by implementing proper data quality checks in data pipelines.
  • Enhancing Data Ingestion Framework by creating more robust and secure data pipelines.
  • Implemented data streaming capability using Kafka and Talend for multiple data sources.
  • Responsible for maintaining and handling data inbound and outbound requests through big data platform.
  • Working knowledge of cluster security components like Kerberos, Sentry and SSL/TLS.
  • Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
  • Involved in the development of agile, iterative and proven data modeling patterns that provide flexibility.
  • Created Oozie workflows to automate and productionize the data pipelines.
  • Troubleshot users' analysis bugs (JIRA and IRIS tickets).
  • Involved in developing Spark applications to perform ELT-style operations on the data.
  • Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
  • Worked on analyzing and resolving production job failures in several scenarios.
  • Implemented UNIX scripts to define the use-case workflow, process the data files and automate the jobs.
  • Knowledge of implementing JILs to automate the jobs in the production cluster.
  • Involved in creating Hive external tables to perform ETL on data that is produced on a daily basis.
  • Utilized Hive partitioning and bucketing and performed various kinds of joins on Hive tables.
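A minimal sketch of the kind of per-batch data-quality check described above, written in plain Python (the threshold, field names and sample rows are hypothetical; in the real pipelines such checks ran inside the ingestion framework):

```python
def run_quality_checks(rows, required_fields, max_null_ratio=0.1):
    """Return (passed, worst_null_ratio) for a batch of dict rows:
    fail if any required field is null/missing too often."""
    if not rows:
        return False, 1.0                 # an empty batch fails outright
    worst = 0.0
    for field in required_fields:
        nulls = sum(1 for r in rows if r.get(field) is None)
        worst = max(worst, nulls / len(rows))
    return worst <= max_null_ratio, worst

rows = [{"id": 1, "amt": 10.0}, {"id": 2, "amt": None}]
print(run_quality_checks(rows, ["id", "amt"], max_null_ratio=0.5))  # (True, 0.5)
```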

Environment: Spark, HDFS, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile Methodology, Teradata.

Confidential

Data Analyst

Responsibilities:

  • Involved in designing physical and logical data model using ERwin Data modeling tool.
  • Designed the relational data model for the operational data store and staging areas; designed dimension and fact tables for data marts.
  • Extensively used the ERwin data modeler to design logical/physical data models and for relational database design, alongside DataStage.
  • Created stored procedures, database triggers, functions and packages to manipulate the database and apply the business logic according to the users' specifications.
  • Created Triggers, Views, Synonyms and Roles to maintain integrity plan and database security.
  • Created database links to connect to other servers and access the required information.
  • Integrity constraints, database triggers and indexes were planned and created to maintain data integrity and to facilitate better performance.
  • Used Advanced Querying for exchanging messages and communicating between different modules.
  • System analysis and design for enhancements; testing of forms, reports and user interaction.
  • Developed dashboards for internal executives, board members and county commissioners that measured and reported on key performance indicators.
  • Utilized Excel functionality to gather, compile and analyze data from pivot tables and created graphs/charts. Provided analysis of any claims data discrepancies in reports or dashboards.
  • Key team member responsible for requirements gathering, design, testing, validation and approval; sole analyst charged with leading the corporate efforts to achieve CQL (Council on Quality and Leadership) accreditation.
  • Developed an advanced excel spreadsheet for caseworkers to capture data from consumers.
