
Sr. Big Data/data Engineer Resume


Dallas, Texas

SUMMARY

  • Around 8 years of professional experience involving project development, implementation, deployment, and maintenance using Big Data technologies, designing and implementing complete end-to-end Hadoop-based data analytical solutions using HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, and HBase.
  • Extensive experience in IT data analytics projects; hands-on experience in migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer.
  • 3+ years of experience using Talend Data Integration/Big Data Integration (6.1/5.x) / Talend Data Quality.
  • Experience with Talend DI installation, administration, and development for data warehouse and application integration.
  • Experienced in using debug mode of Talend to debug a job to fix errors.
  • Experience in working with different Hadoop distributions like CDH and Hortonworks.
  • Experienced with Spark, improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Proficient experience in various Big Data technologies like Hadoop, Apache NiFi, Hive Query Language, HBase NoSQL database, Sqoop, Spark, Scala, Oozie, and Pig, as well as Oracle Database and Unix shell scripting.
  • Experience in change implementation, monitoring, and troubleshooting of AWS Snowflake databases and cluster-related issues.
  • 3 years of experience as an ETL/Snowflake developer.
  • Works on loading data into Snowflake DB in the cloud from various sources.
  • Implemented Enterprise Data Lakes using Apache NiFi.
  • Developed and designed microservices components for the business using Spring Boot.
  • Experience developing Pig Latin and HiveQL scripts for data analysis and ETL purposes, and extended the default functionality by writing User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) for custom data-specific processing.
  • Strong knowledge of the architecture of distributed systems and parallel processing, with in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
  • Good experience in creating data ingestion Pipelines, Data Transformations, Data Management, Data Governance, and real time streaming at an enterprise level.
  • Hands-on expertise with AWS databases such as RDS (Aurora), Redshift, DynamoDB, and ElastiCache (Memcached & Redis).
  • Involved in designing and deploying multi-tier applications using AWS services (EC2, Route53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
  • Experience in using SDLC methodologies like Waterfall, Agile Scrum for design and development.
  • Expert in working with the Hive data warehouse tool: creating tables, distributing data by implementing partitioning and bucketing, and writing and optimizing HiveQL queries (a minimal PySpark sketch of a partitioned, bucketed Hive table appears after this list).
  • Experience in implementing Azure data solutions: provisioning storage accounts, Azure Data Factory, SQL Server, SQL Databases, SQL Data Warehouse, Azure Databricks, and Azure Cosmos DB.
  • In-depth understanding of Hadoop architecture and its various components such as Resource Manager, Application Master, Name Node, Data Node, HBase design principles, etc.
  • Experience in various Cloudera distributions (CDH4/CDH5).
  • Experience developing iterative algorithms using Spark Streaming in Scala and Python to build near-real-time dashboards.
  • Experience with migrating data to and from RDBMS and unstructured sources into HDFS using Sqoop.
  • Excellent programming skills at a higher level of abstraction using Scala, AWS, and Python.
  • Experience in job workflow scheduling and monitoring tools like Oozie and good knowledge on Zookeeper.
  • Profound understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
  • Experience in writing real time query processing using Cloudera Impala.
  • Strong working experience in planning and carrying out Teradata system extraction using Informatica, loading processes and data warehousing, large-scale database management, and reengineering.
  • Highly experienced in creating complex Informatica mappings and workflows working with major transformations.
  • In-depth understanding of Apache Spark job execution components like DAG, lineage graph, DAGScheduler, TaskScheduler, and stages; worked on NoSQL databases including HBase and MongoDB.
  • Experienced with performing CRUD operations using the HBase Java Client API and Solr API.
  • Good experience in working with cloud environments like Amazon Web Services (AWS) EC2 and S3.
  • Experience in implementing Continuous Delivery pipelines with Maven, Ant, Jenkins, and AWS.
  • Good understanding of Spark architecture with Databricks and Structured Streaming: setting up AWS and Microsoft Azure with Databricks, Databricks Workspace for business analytics, managing clusters in Databricks, and managing the machine learning lifecycle.
  • Experience in creating Docker containers leveraging existing Linux containers and AMIs, in addition to creating Docker containers from scratch.
  • Managed Docker orchestration and Docker containerization using Kubernetes.
  • Experience writing shell scripts in Linux and integrating them with other solutions.
  • Strong experience in working with databases like Oracle 10g, DB2, SQL Server 2008, and MySQL, and proficiency in writing complex SQL queries.
  • Experience in using PL/SQL to write Stored Procedures, Functions and Triggers.
  • Experience in automation and building CI/CD pipelines using Jenkins and Chef.
  • Experience with agile methodologies (Scrum).
  • Experience with Snowflake Multi-Cluster Warehouses.
  • Understanding of Snowflake cloud technology.
  • Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
  • Keen interest in the newer technology stack that Google Cloud Platform (GCP) keeps adding.
  • Research and resolve issues with regard to Scrum/Kanban methodologies and process improvement.
  • Strong knowledge of data warehousing implementation concepts in Redshift; has done a POC with Matillion and Redshift for DW implementation.
  • Very good exposure and rich experience in ETL tools like Alteryx, Matillion, and SSIS.
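
The Hive partitioning and bucketing item above can be illustrated with a minimal PySpark sketch. It assumes a configured Hive metastore; the table and column names (sales_raw, sales_by_day, event_date, customer_id) are hypothetical placeholders, not taken from any specific project.

    # Minimal sketch: create a partitioned, bucketed Hive table and query it
    # with partition pruning. Table and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-partition-bucket-demo")
        .enableHiveSupport()        # requires a Hive metastore to be configured
        .getOrCreate()
    )

    # Managed table, partitioned by date and bucketed by customer_id
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_by_day (
            order_id    BIGINT,
            customer_id BIGINT,
            amount      DOUBLE
        )
        PARTITIONED BY (event_date STRING)
        CLUSTERED BY (customer_id) INTO 16 BUCKETS
        STORED AS ORC
    """)

    # Populate partitions dynamically from a staging table
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT OVERWRITE TABLE sales_by_day PARTITION (event_date)
        SELECT order_id, customer_id, amount, event_date
        FROM sales_raw
    """)

    # Filtering on the partition column lets Hive scan only one partition
    spark.sql("""
        SELECT customer_id, SUM(amount) AS total
        FROM sales_by_day
        WHERE event_date = '2021-01-01'
        GROUP BY customer_id
    """).show()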

TECHNICAL SKILLS

Hadoop Technologies: HDFS, MapReduce, YARN, Hive, Pig, HBase, Impala, ZooKeeper, Sqoop, Oozie, Apache Cassandra, Flume, Spark, AWS EC2

Java Technologies: J2EE, JSP, JSTL, EJB, JDBC, JMS, JNDI, JAXB, JAX-WS, JAX-RPC, SOAP, WSDL

Web Technologies: HTML, CSS, JavaScript, AJAX, Servlets, JSP, DOM, XML, XSLT

Languages: C, Java, SQL, PL/SQL, Scala, Shell Scripts

Operating Systems: Linux, UNIX, Windows

Databases: NoSQL, Oracle, DB2, MySQL, SQL Server, MS Access, HBase

Application Servers: WebLogic, WebSphere, Apache Tomcat, JBOSS

IDEs: Eclipse, NetBeans, JDeveloper, IntelliJ IDEA

Version Control: CVS, SVN, Git

Reporting Tools: Jaspersoft, Qlik Sense, Tableau, JUnit

PROFESSIONAL EXPERIENCE

Confidential -Dallas, Texas

Sr. Big Data/Data Engineer

Responsibilities:

  • Developed data pipelines using Spark and PySpark.
  • Analyzed SQL scripts and designed solutions to implement them using PySpark.
  • Developed data processing tasks using PySpark such as reading data from external sources, merging data, performing data enrichment, and loading into target data destinations.
  • Used Pandas, NumPy, Spark in Python for developing Data Pipelines.
  • Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
  • Part of the team conducting logical data analysis and data modeling JAD sessions; communicated data-related standards.
  • Worked collaboratively to manage build-outs of large data clusters and real-time streaming with Spark.
  • Implemented the Kafka-to-Hive streaming process flow and batch loading of data into MariaDB using Apache NiFi.
  • Involved in the analysis, design, development, and testing phases of the Software Development Lifecycle (SDLC).
  • Involved in SDLC requirements gathering, analysis, design, development, and testing of the application, developed using the Agile/Scrum methodology.
  • Implemented end-to-end data flow using Apache NiFi.
  • Responsible for loading data pipelines from web servers using Kafka and the Spark Streaming API (see the Kafka/Spark Streaming sketch after this list).
  • Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
  • Created Talend jobs to copy files from one server to another and utilized Talend FTP components.
  • Analyzed the source data to assess data quality using Talend Data Quality.
  • Developed Talend jobs to populate the claims data to the data warehouse - star schema.
  • Used Talend to extract, transform, and load data into the Netezza data warehouse from various sources like Oracle and flat files.
  • Developed batch scripts to fetch the data from AWS S3 storage and perform the required transformations in Scala using the Spark framework.
  • Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
  • Data Processing: Processed data using Map Reduce and Yarn. Worked on Kafka as a proof of concept for log processing.
  • Evaluated Snowflake design considerations for any change in the application.
  • Built the logical and physical data models for Snowflake as per the required changes.
  • Defined virtual warehouse sizing for Snowflake for different types of workloads.
  • Consulted on Snowflake Data Platform solution architecture, design, development, and deployment focused on bringing a data-driven culture across the enterprise.
  • Monitored the Hive metastore and the cluster nodes with the help of Hue.
  • Created Data Pipeline using Processor Groups and multiple processors using Apache NiFi for Flat File, RDBMS as part of a POC using Amazon EC2.
  • Created AWS EC2 instances and used JIT Servers.
  • Extensively worked on various GCP infrastructure design and implementation strategies and experienced in Designing, architecting, and implementing scalable cloud-based web applications using AWS and GCP.
  • Developed various UDFs in Map-Reduce and Python for Pig and Hive.
  • Data integrity checks were handled using Hive queries, Hadoop, and Spark.
  • Built Hadoop solutions for big data problems using MR1 and MR2 with data in S3.
  • Moved data from the S3 bucket to the Snowflake data warehouse for generating reports (see the Snowflake COPY sketch after this list).
  • Bulk loading and unloading data into Snowflake tables using COPY command.
  • Created data warehouses, databases, schemas, and tables, and wrote SQL queries against Snowflake.
  • Validated the data feed from the source systems to the Snowflake DW cloud platform.
  • Integrated and automated data workloads to Snowflake Warehouse.
  • Ensured ETL/ELT jobs succeeded and loaded data successfully into the Snowflake DB.
  • Worked on performing transformations and actions on RDDs and Spark Streaming data with Scala.
  • Implemented machine learning algorithms using Spark with Python.
  • Defined job flows and developed simple to complex MapReduce jobs as per the requirements.
  • Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms.
  • Developed Pig UDFs for manipulating the data according to business requirements and worked on developing custom Pig loaders.
  • Used different AWS Data Migration Services and the Schema Conversion Tool along with the Matillion ETL tool.
  • Responsible for handling streaming data from web server console logs.
  • Installed Oozie workflow engine to run multiple Hive and Pig Jobs.
  • Developed Pig Latin scripts for the analysis of semi-structured data.
  • Used Hive and created Hive Tables and involved in data loading and writing Hive UDFs.
  • Used Sqoop to import data into HDFS and Hive from other data systems.
  • Installed and configured Apache Hadoop to test the maintenance of log files in the Hadoop cluster.
  • Installed and configured Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
  • Worked on developing ETL processes (Data Stage Open Studio) to load data from multiple data sources to HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.
  • Involved in NoSQL database design, integration, and implementation.
  • Loaded data into NoSQL database HBase.
  • Developed Kafka producer and consumers, HBase clients, Spark, and Hadoop MapReduce jobs along wif components on HDFS, Hive.
  • Very good understanding of Partitions, bucketing concepts in Hive and designed both Managed and External tables in Hive to optimize performance.
  • Extensive experience in Amazon Web Services (AWS) cloud services such as EC2, VPC, S3, IAM, EBS, RDS, ELB, Route53, OpsWorks, DynamoDB, Auto Scaling, CloudFront, CloudTrail, CloudWatch, CloudFormation, Elastic Beanstalk, AWS SNS, AWS SQS, AWS SES, AWS SWF, and AWS Direct Connect.
  • Leveraged Spark as ETL tool for building Data pipelines on various cloud platforms like AWS EMRs, Azure HD Insights and MapR CLDB architectures.
  • Experience with designing, building, and operating virtualized solutions using private, hybrid, and public cloud technologies.
  • Created automation to provision infrastructure for Kafka clusters (different instances per component in the cluster) using Terraform, creating multiple EC2 instances and attaching ephemeral or EBS volumes per instance type across different availability zones and multiple regions in AWS.
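
The Kafka and Spark Streaming item above can be illustrated with a minimal PySpark Structured Streaming sketch. The broker address, topic name, event schema, and S3 paths are hypothetical placeholders, and the write target (Parquet on S3) is an assumption rather than the exact pipeline described above.

    # Minimal sketch: consume web-server events from Kafka with Spark Structured
    # Streaming and append them to Parquet files. All names/paths are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

    event_schema = StructType([
        StructField("url", StringType()),
        StructField("user_id", StringType()),
        StructField("ts", TimestampType()),
    ])

    raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
        .option("subscribe", "weblogs")                       # placeholder topic
        .option("startingOffsets", "latest")
        .load()
    )

    # Kafka delivers key/value as binary; parse the JSON payload into columns
    events = (
        raw.selectExpr("CAST(value AS STRING) AS json")
           .select(from_json(col("json"), event_schema).alias("e"))
           .select("e.*")
    )

    query = (
        events.writeStream
        .format("parquet")
        .option("path", "s3a://example-bucket/weblogs/")                # placeholder
        .option("checkpointLocation", "s3a://example-bucket/chk/weblogs/")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()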
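The S3-to-Snowflake loading items above can be illustrated with a hedged sketch using the snowflake-connector-python package and the COPY command. The connection parameters, the external stage @claims_stage, and the target table RAW.CLAIMS (assumed to hold a single VARIANT column for nested JSON) are hypothetical placeholders.

    # Minimal sketch: bulk-load JSON files from an S3 external stage into a
    # Snowflake table with COPY INTO. All identifiers are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",      # placeholder
        user="etl_user",           # placeholder
        password="********",       # read from a secret store in practice
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="RAW",
    )

    try:
        cur = conn.cursor()
        # @claims_stage is assumed to be an external stage pointing at the S3 bucket
        cur.execute("""
            COPY INTO RAW.CLAIMS
            FROM @claims_stage/claims/2021/
            FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE)
            ON_ERROR = 'ABORT_STATEMENT'
        """)
        print(cur.fetchall())      # COPY returns one status row per loaded file
    finally:
        conn.close()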

Environment: Spark, Spark Streaming, Apache Kafka, Apache NiFi, Hive, Tez, AWS, ETL, Pig, UNIX, Linux, Tableau, Teradata, Sqoop, Hue, Oozie, Java, Scala, Python, Git, Alteryx, Matillion, SSIS, Azure Databricks, Talend, Snowflake, GCP.

Confidential -Scottsdale, AZ

Data Engineer

Responsibilities:

  • Experienced in development using Cloudera Distribution System.
  • Experience in designing, architecting, and implementing scalable cloud-based web applications using AWS and GCP.
  • Able to work in parallel across both GCP and Azure clouds coherently.
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
  • Experience in moving data between GCP and Azure using Azure Data Factory.
  • Design and develop ETL integration patterns using Python on Spark.
  • Optimized the PySpark jobs to run on secured clusters for faster data processing.
  • Developed Spark scripts using Python and Scala shell commands as per the requirements.
  • Used Python for SQL/CRUD operations in DB, file extraction/transformation/generation.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load a huge number of CSV files with different schemas into Hive ORC tables (a minimal PySpark sketch appears after this list).
  • Designed and developed Apache NiFi jobs to get the files from transaction systems into the data lake raw zone.
  • Analyzed the user requirements and implemented the use cases using Apache NiFi.
  • Proficient working experience on big data tools like Hadoop, Azure Data Lake, and AWS Redshift.
  • Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
  • Excellent working knowledge of data modeling tools like Erwin, Power Designer, and ER Studio.
  • As a Hadoop Developer, managed the data pipelines and the data lake.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
  • Designed custom Spark REPL application to handle similar datasets.
  • Used Hadoop scripts for HDFS (Hadoop Distributed File System) data loading and manipulation.
  • Performed Hive test queries on local sample files and HDFS files.
  • Used AWS services like EC2 and S3 for small data sets.
  • Managed user and AWS access using AWS IAM and KMS.
  • Deployed microservices into AWS - EC2.
  • Implemented data movement from on-premises to the cloud in Azure.
  • Designed relational and non-relational data stores on Azure.
  • Strong working knowledge on Kubernetes and Docker.
  • Developed the application on the Eclipse IDE.
  • Developed Hive queries to analyze data and generate results.
  • Used Spark Streaming to divide streaming data into batches as an input to Spark Engine for batch processing.
  • Worked on analyzing the Hadoop cluster and different big data analytic tools including Pig, Hive, HBase, Spark, and Sqoop.
  • Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization, and user report generation.
  • Used Scala to write code for all Spark use cases.
  • Analyzed user request patterns and implemented various performance optimization measures including implementing partitions and buckets in HiveQL.
  • Assigned names to the columns using the case class option in Scala.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark).
  • Involved in converting the HQL scripts into Spark transformations using Spark RDDs with the support of Python and Scala.
  • Developed multiple Spark SQL jobs for data cleaning.
  • Created Hive tables and worked on them using Hive QL.
  • Assisted in loading large sets of data (Structure, Semi Structured, and Unstructured) to HDFS.
  • Developed Spark SQL to load tables into HDFS to run select queries on top.
  • Developed analytical component using Scala, Spark, and Spark Stream.
  • Responsible for estimating the cluster size and monitoring and troubleshooting the Spark Databricks cluster.
  • Used visualization tools such as Power View for Excel and Tableau for visualizing and generating reports.
  • Worked on the NoSQL databases HBase and MongoDB.
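
The PySpark item on loading CSV files with different schemas into Hive ORC tables can be illustrated with a minimal sketch. It assumes Hive support is enabled and that column types across files are union-compatible; the paths, target column list, and table name are hypothetical placeholders.

    # Minimal sketch: align CSV drops with differing schemas to one column list
    # and append them to a Hive-managed ORC table. Names/paths are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = (
        SparkSession.builder
        .appName("csv-to-hive-orc")
        .enableHiveSupport()
        .getOrCreate()
    )

    TARGET_COLUMNS = ["order_id", "customer_id", "amount", "source_path"]

    def load_csv(path):
        """Read one CSV drop, add any missing target columns, and align the order."""
        df = spark.read.option("header", "true").option("inferSchema", "true").csv(path)
        for c in TARGET_COLUMNS:
            if c not in df.columns:
                df = df.withColumn(c, lit(None))
        return df.withColumn("source_path", lit(path)).select(*TARGET_COLUMNS)

    # hypothetical landing-zone folders whose files carry slightly different schemas
    paths = ["hdfs:///landing/orders_v1/", "hdfs:///landing/orders_v2/"]
    frames = [load_csv(p) for p in paths]

    combined = frames[0]
    for df in frames[1:]:
        combined = combined.unionByName(df)

    # Append into a Hive ORC table (created on first write)
    combined.write.mode("append").format("orc").saveAsTable("raw.orders")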

Environment: Hadoop, Hive, Oozie, Java, Linux, Maven, Apache NiFi, Oracle 11g/10g, ZooKeeper, MySQL, Spark, Databricks (cluster management), Azure, GCP

Confidential - Irvine, CA

Big Data Engineer

Responsibilities:

  • Used Sqoop to import and export data between Oracle/PostgreSQL and HDFS for use in the analysis.
  • Migrated Existing MapReduce programs to Spark Models using Python.
  • Migrated the data from the data lake (Hive) into an S3 bucket.
  • Performed data validation between the data present in the data lake and the S3 bucket.
  • Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data (see the DataFrame analytics sketch after this list).
  • Designed batch processing jobs using Apache Spark to increase speeds by ten-fold compared to that of MR jobs.
  • Used Kafka for real time data ingestion.
  • Created different topics for reading the data in Kafka.
  • Read data from different topics in Kafka.
  • Involved in converting the HQL scripts into Spark transformations using Spark RDDs with the support of Python and Scala.
  • Implemented Azure Data Factory operations and deployment into Azure for moving data from on-premises into cloud.
  • Wrote Hive queries for data analysis to meet the business requirements.
  • Migrated an existing on-premises application to AWS.
  • Used the AWS cloud with infrastructure provisioning/configuration.
  • Developed Pig Latin scripts to extract the data from the web server output files and load it into HDFS.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Created many Spark UDFs and UDAFs in Hive for functions that were not preexisting in Hive and Spark SQL.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Implemented different performance optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, doing map-side joins, etc.
  • Good knowledge of Spark platform parameters like memory, cores, and executors.
  • Provided concurrent access for Hive tables with shared and exclusive locking by using the ZooKeeper implementation in the cluster.
  • Configured the monitoring solutions for the project using Datadog for infrastructure and ELK for application logging.
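
The Spark DataFrame analytics item above can be illustrated with a minimal PySpark sketch that computes reporting metrics from a partitioned Hive table. The table name, partition column, and metric columns are hypothetical placeholders.

    # Minimal sketch: DataFrame analytics over a partitioned Hive table.
    # Table and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("hive-analytics-demo")
        .enableHiveSupport()
        .getOrCreate()
    )

    orders = spark.table("analytics.orders")   # assumed partitioned by event_date

    # Filtering on the partition column limits the scan to the needed partitions
    daily_metrics = (
        orders
        .where(F.col("event_date").between("2021-01-01", "2021-01-31"))
        .groupBy("event_date", "region")
        .agg(
            F.countDistinct("customer_id").alias("active_customers"),
            F.sum("amount").alias("revenue"),
            F.avg("amount").alias("avg_order_value"),
        )
        .orderBy("event_date", "region")
    )

    daily_metrics.show(truncate=False)

    # Optionally persist the metrics for downstream reporting tools
    daily_metrics.write.mode("overwrite").saveAsTable("analytics.daily_region_metrics")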

Environment: Linux, Apache Hadoop Framework, HDFS, YARN, Hive, HBase, AWS (S3, EMR), Scala, Spark, Sqoop.

Confidential

Data Analyst

Responsibilities:

  • Documented the complete process flow to describe program development, logic, testing, implementation, application integration, and coding.
  • Recommended structural changes and enhancements to systems and Databases.
  • Conducted design reviews and technical reviews with other project stakeholders.
  • Was a part of the complete life cycle of the project, from requirements to production support.
  • Created test plan documents for all back-end database modules.
  • Used MS Excel, MS Access, and SQL to write and run various queries.
  • Worked extensively on creating tables, views, and SQL queries in MS SQL Server.
  • Worked with internal architects, assisting in the development of current- and target-state data architectures.
  • Coordinated with the business users to provide an appropriate, effective, and efficient way to design the new reporting needs based on the users and the existing functionality.
  • Remained knowledgeable in all areas of business operations to identify systems needs and requirements.
