Big Data Developer Resume
MN
SUMMARY
- Around 7 years of professional experience as a Big Data Engineer designing, developing, and implementing data pipelines and data lake requirements across multiple enterprises using the Big Data technology stack, Python, PL/SQL, Java, SQL, REST APIs, and the AWS cloud platform.
- Experience in designing and implementing end-to-end Big Data ecosystems using HDFS, Kafka, MapReduce, Python, Spark, Pig, Hive, Sqoop, HBase, Oozie, Airflow, and Zookeeper.
- Hands-on experience designing and implementing large-scale data lakes, pipelines, and efficient ETL (extract, transform, load) workflows to organize, collect, and standardize data that generates insights and addresses reporting needs.
- Experience working with both Streaming and Batch data processing using multiple technologies.
- Hands-on experience with Spark, Databricks, AWS EMR, AWS Glue and Delta Lake.
- Hands-on experience developing data pipelines using Spark components such as Spark SQL and Spark Streaming.
- Hands-on experience building, scheduling, and monitoring workflows using Apache Airflow with Python (a minimal DAG sketch follows this summary).
- Worked on AWS components and services, particularly Elastic MapReduce (EMR), Elastic Compute Cloud (EC2), Simple Storage Service (S3), Redshift, Athena, and Lambda functions.
- Hands-on experience with Kafka streaming using KTables, GlobalKTables, and KStreams, and deploying these on Confluent and Apache Kafka environments.
- Hands-on experience importing and exporting data using Sqoop from HDFS to relational database systems and vice versa.
- Developed Python code to gather data from HBase and designed the solution for implementation using PySpark.
- Developed and deployed various Lambda functions in AWS with built-in AWS Lambda libraries, and deployed Lambda functions in Scala with custom libraries.
- Experienced with AWS CloudFormation templates for creating IAM roles and deploying the total architecture end to end (creation of EC2 instances and their infrastructure).
- Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR and other services of the AWS family.
- Experienced in working with the Spark ecosystem using PySpark, Spark SQL, and Hive queries on different data file formats such as Parquet, ORC, SequenceFile, Text, and CSV.
- Expertise in AWS DNS services through Route 53, including Simple, Weighted, Latency-based, Failover, and Geolocation routing policies.
- Experienced in building Snowpipe pipelines and migrating Teradata objects into the Snowflake environment.
- In-depth knowledge of Data Sharing in Snowflake and experienced with Snowflake database, schema, and table structures.
- Hands-on experience interacting with REST APIs built on a microservices architecture to retrieve data from different sources.
- Experience implementing CI/CD pipelines for DevOps: source code management with Git, unit testing, and build and deployment scripts.
- Hands-on experience working with DevOps tools such as Jenkins, Docker, Kubernetes, GoCD, and the AutoSys scheduler.
- Expertise with RDBMSs such as Oracle and MySQL, writing complex SQL queries, stored procedures, and triggers.
- Implemented applications working with MongoDB, HBase, Cassandra and Redis.
- Experienced in Software methodologies such as Agile and SAFe, sprint planning, attending daily standups and story grooming.
- Worked through the complete Software Development Life Cycle in an Agile model.
- Strong problem-solving skills with an ability to isolate, deconstruct and resolve complex data challenges.
- Experienced in configuring and administering the Hadoop Cluster using major Hadoop Distributions like Apache Hadoop and Cloudera.
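Code sketch: as a hedged illustration of the Airflow-with-Python experience noted above, below is a minimal daily DAG. The DAG id, task names, and the extract/load callables are hypothetical placeholders, not code from any of the listed projects.

```python
# Minimal Airflow DAG sketch (hypothetical pipeline and task names).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Placeholder: pull a day's worth of records from a source system.
    print("extracting data for", context["ds"])


def load(**context):
    # Placeholder: write the transformed records to the target store.
    print("loading data for", context["ds"])


default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_ingest_pipeline",      # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task            # run extract before load
```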
TECHNICAL SKILLS
Big Data Stack: Hadoop, HDFS, MapReduce, Spark, Pig, Hive, Sqoop, HBase, Oozie, Flume, Storm, Kafka, Cloudera, and Zookeeper.
Programming Languages: Python, Java, SQL, Scala.
Databases: Oracle, MySQL, Cassandra, HBase, MongoDB.
Frameworks: Spring, Spring Boot, Hibernate.
CI/CD & DevOps: Go CD, Jenkins, Docker, Kubernetes.
Report & Development Tools: Postman, Eclipse, IntelliJ IDEA, Visual Studio Code, Jupyter Notebook, Tableau, Power BI.
Cloud: AWS EMR, EC2, S3, Databricks, Athena, AWS Glue.
Repositories: GitHub, SVN.
Build Tools: Maven, Gradle.
Operating Systems: Windows, Linux, macOS.
Cloud Data Warehouse: Snowflake, Redshift.
PROFESSIONAL EXPERIENCE
Confidential, MN
Big Data Developer
Responsibilities:
- Developed real-time processing jobs using Spark Streaming with Kafka and Scala and persisted the data into Cassandra.
- Used Python, PySpark, and Spark for data ingestion.
- Developed Spark jobs in Scala and Python for the project in the Hadoop/Hive environment.
- Developed ETL data pipelines using Hadoop big data tools: HDFS, Hive, Presto, Sqoop, Spark, Elasticsearch, and Kafka.
- Involved in gathering requirements from the client and estimating the timeline for developing complex Hive queries for a logistics application.
- Developed data pipeline using Spark, Sqoop, Hive to ingest data and customer histories into HDFS for analysis.
- Experience in designing and developing POCs in Spark using Scala and Python to compare the performance of Spark with Hive and SQL/Oracle.
- Implemented MapReduce jobs in Hive by querying the available data, and designed the ETL process by creating a high-level design document covering the logical data flows, source data extraction, database staging and extract creation, source archival, job scheduling, and error handling.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Developed and designed a system to collect data from multiple portals using Kafka and then process it using Spark.
- Responsible for designing data pipelines using ETL for effective data ingestion from existing data management platforms to enterprise Data Lake.
- Developed and executed interface test scenarios and test scripts for complex business rules using available ETL tools.
- Uploaded clickstream data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
- Used Oozie to orchestrate the MapReduce jobs that extract the data in a timely manner.
- Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
- Set up Build Infrastructure with Jenkins and Subversion server in AWS.
- Experience in Amazon Cloud (EC2) Hosting and AWS Administration and configuring IAM.
- Transferred data from AWS S3 to AWS Redshift using Informatica.
- Experience integrating Jenkins with Docker containers using the CloudBees Docker and Kubernetes pipeline plugins, and provisioned EC2 instances using the Amazon EC2 plugin.
- Created S3 buckets, managed their policies, and used S3 and Glacier for storage and backup on AWS.
- Deployed Kubernetes on AWS: set up the cluster and replication and deployed multiple containers in a pod.
- Worked with DevOps practices using AWS, Elastic Beanstalk, and Docker with Kubernetes.
- Triggered Lambda functions when data landed in S3 and sent notifications to teams using SNS (a minimal sketch follows this position).
- Fetched data and generated monthly reports, visualizing them using Tableau.
Environment: Hadoop, Java/J2EE, Spark, HDFS, Hive, HBase, Big Data, Airflow, Sqoop, Kafka, Zookeeper, Cassandra, Python, Scala, Linux, NoSQL, MySQL, PySpark, SQL Server, AWS, Kubernetes, Docker.
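Code sketch: a minimal, hedged illustration of the S3-triggered Lambda notification pattern described in this position, written in Python with boto3. The SNS topic ARN and message format are assumptions, not project artifacts.

```python
# Minimal AWS Lambda sketch: notify a team via SNS when a new object lands in S3.
# The topic ARN below is a hypothetical placeholder.
import json

import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:data-landing-alerts"  # placeholder


def lambda_handler(event, context):
    # S3 event notifications deliver one or more records per invocation.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="New file landed in S3",
            Message=json.dumps({"bucket": bucket, "key": key}),
        )

    return {"statusCode": 200}
```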
Confidential
Sr Data Engineer
Responsibilities:
- Evaluated the suitability of Hadoop and its ecosystem for the project, implementing and validating various proof-of-concept (POC) applications to benefit from the Big Data Hadoop initiative.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple Spark jobs in Python and Scala for data cleaning and preprocessing.
- Wrote complex SQL queries and PL/SQL stored procedures and converted them into ETL tasks.
- Created and maintained documents related to business processes, mapping design, data profiles and tools.
- Wrote MapReduce code to process and parse data from various sources and stored the parsed data in HBase and Hive using HBase-Hive integration.
- Worked on transformations of data in AWS S3 according to business requirements.
- Planned, deployed, monitored, and maintained AWS cloud infrastructure consisting of multiple EC2 nodes and VMs as required by the environment.
- Experienced in managing and reviewing Hadoop log files.
- Developed Hive queries and UDFs to analyze/transform the data.
- Developed Hive scripts for implementing control tables logic in HDFS.
- Designed and Implemented Partitioning (Static, Dynamic), Buckets in Hive.
- Worked on writing complex Hive queries and Spark scripts.
- Moved data between Oracle and HDFS using Sqoop to supply data to business users.
- Designed and constructed AWS data pipelines using various AWS resources, including AWS API Gateway to receive responses from AWS Lambda, Lambda functions that retrieve data from Snowflake and convert the response into JSON format, DynamoDB, and AWS S3.
- Wrote Spark jobs using RDDs, pair RDDs, transformations and actions, and DataFrames to transform data from relational datasets (a minimal sketch follows this position).
- Integrated data quality plans as a part of ETL processes.
- Used the Oozie scheduler to automate the pipeline workflow and orchestrate the Spark jobs that extract the data in a timely manner.
- Involved in knowledge sharing sessions with teams.
- Implemented test scripts to support test driven development and continuous integration.
- Worked with loading and transforming large sets of structured, semi-structured, and unstructured data.
- Worked with Spark using Scala and Spark SQL for faster testing and processing of data.
- Used Hue browser for interacting with Hadoop components.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Worked on the Hortonworks distribution of Hadoop and used Ambari to monitor cluster health.
Environment: Hadoop, Hive, Scala, GitHub, Spark, Java/J2EE, Tableau, Sqoop, HDP, Python, Shell Scripting, AWS, Linux, Ambari.
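Code sketch: a minimal, hedged PySpark illustration of the Spark DataFrame transformations and dynamic Hive partitioning described in this position. The JDBC source, table names, and column names are hypothetical assumptions, not project artifacts.

```python
# Minimal PySpark sketch: read a relational table, transform it, and write
# to a dynamically partitioned Hive table. Table/column names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("relational-to-hive")
    .enableHiveSupport()
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .getOrCreate()
)

# Assumed JDBC source; connection details are placeholders.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
    .option("dbtable", "SALES.ORDERS")
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)

# DataFrame transformations: derive a partition column and aggregate.
daily_totals = (
    orders
    .withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date", "region")
    .agg(F.sum("amount").alias("total_amount"))
)

# Dynamic partitioning by order_date in the target Hive table.
(
    daily_totals.write
    .mode("overwrite")
    .partitionBy("order_date")
    .saveAsTable("analytics.daily_order_totals")
)
```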
Confidential
Sr Data Engineer
Responsibilities:
- Worked on analyzing the Hadoop cluster and different big data analytical and processing tools, including Sqoop, Hive, Spark, Kafka, and PySpark.
- Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS.
- Worked on MapR platform team for performance tuning of Hive and Spark jobs of all users.
- Utilized Hive on the Tez engine to improve application performance.
- Worked on incidents created by users for the platform team on Hive and Spark issues by monitoring Hive and Spark logs and either fixing them or raising MapR support cases.
- Worked on Hadoop Data Lake for ingesting data from different sources such as Oracle and Teradata through Sqoop ingestion.
- Involved in migrating data from an on-prem Cloudera cluster to AWS EC2 instances deployed on an EMR cluster, and developed an ETL pipeline to extract logs, store them in the AWS S3 data lake, and process them further using PySpark.
- Prepared scripts to automate the ingestion process using Python and Scala as needed from various sources such as APIs, AWS S3, Teradata, and Snowflake.
- Designed and developed Spark workflows using Scala to pull data from AWS S3 buckets and Snowflake and apply transformations on it.
- Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
- Created Python scripts to read CSV, JSON, and Parquet files from S3 buckets and load the data into AWS S3, DynamoDB, and Snowflake (a minimal sketch follows this position).
- Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 buckets, or to HTTP requests using Amazon API Gateway.
- Utilized Kubernetes and Docker as the runtime environment for the CI/CD system to build and test, and deployed with Octopus Deploy.
- Configured and maintained a CI/CD Jenkins master-slave setup by enabling password-less SSH login between the server and nodes to manage and distribute the build workload evenly across the nodes.
- Worked on a Python API for converting assigned group-level permissions to table-level permissions using MapR ACEs by creating a unique role.
- Queried and analyzed data from Cassandra for quick searching, sorting and grouping through CQL.
- Migrated various Hive UDFs and queries into Spark SQL for faster requests.
- Configured Kafka Connect to receive real-time data from Apache Kafka and store the stream data in HDFS.
Environment: AWS Redshift, Lambda, EC2, EMR, S3, Glue, HDFS, Hadoop, Python, Hive, Sqoop, Spark, MapReduce, Scala, PySpark, Oracle, Oozie, YARN, Tableau, Spark SQL, Spark MLlib, Kafka, Arcadia, VMs, PaaS, VNets, SQL Database, CI/CD, Terraform, Kubernetes, Docker, Jenkins, Bitbucket, Git, Crontab
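Code sketch: a minimal, hedged illustration of the Python S3 file-ingestion scripts described in this position, showing only the S3 and DynamoDB legs (a Snowflake load would additionally go through the Spark Snowflake connector). Bucket, prefix, column, and table names are hypothetical placeholders.

```python
# Minimal sketch: read CSV/JSON/Parquet from S3 with PySpark, write curated
# Parquet back to S3, and push a small summary into DynamoDB with boto3.
# Bucket, prefix, column, and table names are hypothetical placeholders.
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-multi-format-ingest").getOrCreate()

RAW = "s3://example-raw-bucket/landing"          # placeholder paths
CURATED = "s3://example-curated-bucket/events"

# Each reader handles one of the incoming file formats.
csv_df = spark.read.option("header", "true").csv(f"{RAW}/csv/")
json_df = spark.read.json(f"{RAW}/json/")
parquet_df = spark.read.parquet(f"{RAW}/parquet/")

# Align on a shared set of columns (assumed) before unioning the three sources.
common_cols = ["event_id", "event_type", "event_ts"]
events = (
    csv_df.select(common_cols)
    .unionByName(json_df.select(common_cols))
    .unionByName(parquet_df.select(common_cols))
)

# Curated copy back to S3 as Parquet.
events.write.mode("overwrite").parquet(CURATED)

# Small per-type counts into DynamoDB (table assumed to already exist).
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("event_type_counts")       # placeholder table name
for row in events.groupBy("event_type").count().collect():
    table.put_item(Item={"event_type": row["event_type"], "count": row["count"]})
```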
Confidential
Data Analyst
Responsibilities:
- Participated in all phases of research including data collection, data cleaning, data mining, developing models and visualizations.
- Collaborated with data engineers and operation team to collect data from internal system to fit the analytical requirements.
- Worked on migrating existing mainframe data and reporting feeds to Hadoop.
- Redefined many attributes and relationships and cleansed unwanted tables/columns using SQL queries.
- Integrated Hadoop with Tableau and SAS analytics to provide end users with analytical reports.
- Worked with the AWS Cloud platform and its features, including EC2, VPC, RDS, EBS, S3, CloudWatch, CloudTrail, CloudFormation, and Auto Scaling.
- Used AWS command line client and management console to interact with AWS resources and APIs.
- Working knowledge of Spark RDD, DataFrame API, Dataset API, Data Source API, Spark SQL, and Spark Streaming.
- Managed large-scale, geographically distributed database systems, including relational databases (Oracle, SQL Server).
- Experience with financial and banking structured data.
- Experience importing and exporting external raw data from the client and converting it into a structured format.
- Designed data quality checks and worked with data engineers to implement them.
- Analyzed raw data and flat files in XML or JSON format.
- Utilized the Spark SQL API in PySpark to extract and load data and perform SQL queries (a minimal sketch follows this position).
- Visually plotted data using Tableau for dashboards and reports.
Environment: Python, AWS, SQL, Tableau, Spark, CloudFormation.
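Code sketch: a minimal, hedged illustration of using the Spark SQL API in PySpark to load data and run SQL queries, as noted in this position. The file path and column names are hypothetical assumptions.

```python
# Minimal Spark SQL sketch: load a CSV, register it as a temp view, and query it.
# The path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-analysis").getOrCreate()

transactions = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://example-analytics-bucket/transactions.csv")
)

# Register the DataFrame so it can be queried with plain SQL.
transactions.createOrReplaceTempView("transactions")

monthly_summary = spark.sql(
    """
    SELECT date_format(to_date(txn_date), 'yyyy-MM') AS month,
           category,
           SUM(amount)                               AS total_amount
    FROM transactions
    GROUP BY date_format(to_date(txn_date), 'yyyy-MM'), category
    ORDER BY month, category
    """
)

monthly_summary.show(truncate=False)
```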