Spark Developer Resume
Plano, TX
PROFESSIONAL SUMMARY:
- 4+ years of experience with Hadoop, Spark, and Big Data processing.
- Worked on performance tuning to ensure that assigned systems were patched, configured, and optimized for maximum functionality and availability. Implemented solutions that reduced single points of failure and improved system uptime.
- Experience with distributed systems, large-scale non-relational data stores, and multi-terabyte data warehouses.
- Strong grasp of data modeling, data marts, database performance tuning, and NoSQL MapReduce systems.
- Comprehensive hands-on experience installing and implementing Big Data solutions including Apache Hadoop, Pig, Hive, HBase, Spark, Sqoop, and Flume, coordinated by ZooKeeper, on multiple projects.
- Experience with the Hadoop 2.0 architecture and with developing YARN applications on it.
- Experience in managing and reviewing Hadoop log files
- Excellent working knowledge of other Hadoop ecosystem components such as the Hadoop Distributed File System and Hadoop daemons, which include Resource Manager, Node Manager, Name Node, Data Node, Secondary Name Node, containers, etc.
- Experience working on Spark and Spark Streaming.
- In-depth understanding of Apache Spark job execution components such as the DAG, lineage graph, DAG scheduler, task scheduler, stages, tasks, logical plan, physical plan, and partitioning.
- Experience in setting up Hadoop/Hive/RDB/Spark clusters on Amazon Web Services (AWS) cloud platforms.
- Worked with data serialization formats for converting complex objects into sequences of bits, including Avro, Parquet, JSON, and CSV.
- Expertise in extending Hive and Pig core functionality by writing custom UDFs and UDAFs.
- Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and buckets (see the sketch at the end of this summary).
- Worked with different Hive file formats such as TEXTFILE, SEQUENCEFILE, AVRO, ORC, and PARQUET for querying and processing.
- Proficient in NoSQL databases like HBase, Scylla and AWS DynamoDB.
- Experience in importing and exporting data using Sqoop between HDFS and Relational Database Systems.
- Populated HDFS with vast amounts of data using Apache Kafka and Flume.
- Knowledge of Kafka installation and integration with Spark Streaming.
- Hands-on experience building data pipelines using Hadoop components Sqoop, Hive, Pig, Spark, Spark SQL.
- Hands-on experience with orchestrating Hadoop/Spark components with Apache Airflow.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data in various formats such as text, ZIP, XML, and JSON.
- Experience in designing both time-optimized and data-intensive automated workflows using Oozie/Airflow.
- Work experience with cloud infrastructure such as Amazon Web Services (AWS) EC2 and S3.
- Used Git for source code and version control management.
- Hands-on experience building Spark/Scala applications with SBT/Maven using Jenkins.
- Experience with RDBMS and writing SQL and PL/SQL scripts used in stored procedures.
- Experience working with both small and large teams; successful in meeting new technical challenges and finding solutions that meet customer needs.
- Excellent problem-solving, analytical, programming, and communication skills, with a proactive mindset.
- Experience working both independently and collaboratively to solve problems and deliver high-quality results in a fast-paced, unstructured environment.
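Below is a minimal sketch of the partitioned Hive external table pattern referenced above, written against Spark SQL with Hive support. The sales table, the staging_sales source table, and the S3 location are hypothetical placeholders, not project-specific schemas.

```scala
import org.apache.spark.sql.SparkSession

object PartitionedHiveTableSketch {
  def main(args: Array[String]): Unit = {
    // Hive support wires Spark SQL to the shared Hive metastore (not embedded Derby).
    val spark = SparkSession.builder()
      .appName("hive-external-table-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Allow dynamic partition values to come from the query output.
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // Hypothetical external table partitioned by order_date, stored as Parquet.
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS sales (
        |  order_id BIGINT,
        |  amount   DOUBLE
        |)
        |PARTITIONED BY (order_date STRING)
        |STORED AS PARQUET
        |LOCATION 's3://example-bucket/warehouse/sales'""".stripMargin)

    // Dynamic-partition insert: order_date in the SELECT decides each row's partition.
    spark.sql(
      """INSERT OVERWRITE TABLE sales PARTITION (order_date)
        |SELECT order_id, amount, order_date FROM staging_sales""".stripMargin)

    spark.stop()
  }
}
```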
TECHNICAL SKILLS:
Big Data Frameworks: Hadoop (HDFS, MapReduce), Spark, Spark SQL, Spark Streaming, Hive, Impala, Kafka, HBase, Flume, Pig, Sqoop, Oozie, Cassandra.
Big Data distributions: Cloudera, Hortonworks, Amazon EMR, Azure
Programming languages: Scala, Python, R, Perl, Shell/Bash scripting
Operating Systems: Windows, Linux (Ubuntu, CentOS)
Databases: SQL Server, MySQL, RDS
IDEs: Eclipse, NetBeans, PyCharm, IntelliJ
Configuration Management: Puppet, Chef, Ansible, SaltStack, Terraform, Vagrant
Continuous Integration: Jenkins, Travis CI, Bamboo, Hudson, TeamCity, Circle CI
Development methodologies: Agile, Waterfall
Cloud Hosting: Amazon Web Services (S3, EC2, EMR, IAM, Lambda, Redshift)
Containerization: Docker, Kubernetes
Messaging Services: ActiveMQ, Kafka, JMS
Source Code Management: Git, SVN, Bitbucket, GitHub
Build Tools: Ant, Maven, Grunt, Gradle, SBT
Planning & Collaboration: JIRA, Slack, Zoom, Clarizen, Asana
Others: PuTTY, WinSCP, Data Lake
EXPERIENCE:
Confidential, Plano, TX
Spark Developer
Responsibilities:
- Developed ETL data pipelines using HDFS, Sqoop, Spark, Spark SQL, and ScyllaDB, orchestrated with Airflow.
- Used Spark to execute interactive queries for processing data streams and integrated with AWS RDS.
- Experience with AWS services including IAM, Data Pipeline, EMR, S3, EC2, Lambda, Glue, Redshift, and RDS.
- Developed batch scripts to fetch data from AWS S3 storage onto ad hoc EC2 and/or EMR instances, perform the required transformations, and create Python visualizations for analysts.
- Implemented Spark code using Scala and Spark SQL to improve data processing speed and increase workflow efficiency.
- Created Airflow workflows to run multiple Spark jobs and periodically orchestrate ETL jobs.
- Experience with Terraform scripts that automate EMR step execution to load data into ScyllaDB.
- De-normalized data coming from Netezza as part of the transformation and loaded it into NoSQL and MySQL databases.
- Developed a Kafka consumer in Scala for consuming data from various Kafka topics (see the sketch after this job's environment listing).
- Implemented data quality checks using Spark Streaming and flagged records as bad or passable.
- Loaded data into Spark RDDs to carry out in-memory computation and generate faster output responses.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions for building a common learner data model that gets data from Kafka in near real time and persists it to Cassandra.
- Involved in converting MapReduce programs into Spark transformations using Spark RDD in Scala.
- Supported BI data analysts and developers with Hive/Pig development, assisted by Tableau.
Environment: HDFS, Spark, Scala, Sqoop, AWS, Terraform, ScyllaDB, MySQL, Oozie, Airflow, Python, Kafka
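A minimal sketch of a Kafka consumer in Scala along the lines of the consumer bullet above; the broker address, group id, and topic name are hypothetical placeholders rather than values from the original project.

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._   // Scala 2.13; use scala.collection.JavaConverters._ on 2.11/2.12
import org.apache.kafka.clients.consumer.KafkaConsumer

object LearnerEventConsumer {
  def main(args: Array[String]): Unit = {
    // Basic consumer configuration; all values here are illustrative.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "learner-events")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Collections.singletonList("learner-events-topic"))

    try {
      while (true) {
        // Poll the broker and process whatever records arrived in this batch.
        val records = consumer.poll(Duration.ofMillis(500))
        for (record <- records.asScala) {
          println(s"partition=${record.partition()} offset=${record.offset()} value=${record.value()}")
        }
      }
    } finally {
      consumer.close()
    }
  }
}
```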
Confidential
Hadoop Developer
Responsibilities:
- Worked with business teams and created Hive queries for ad hoc access.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager
- Involved in review of functional and non-functional requirements
- Created complex Hive tables and executed complex Hive queries on Hive warehouse
- Wrote MapReduce code to convert unstructured data to semi-structured data
- Provided cluster coordination services through ZooKeeper
- Used Pig as an ETL tool for transformations, event joins, and some pre-aggregations before storing the data on HDFS
- Supported BI data analysts and developers with Hive/Pig development, assisted by Tableau
- Developed Hive queries for analysts
- Wrote Hive UDFs to meet application-specific requirements (see the sketch after this job's environment listing)
- Developed Oozie workflows to automate loading data into HDFS and pre-processing with Pig
- Designed a technical solution for real-time analytics using Kafka and HBase.
Environment: Apache Hadoop, HDFS, MapReduce, HBase, Kafka, MySQL, Linux, Apache Hive, Apache Pig, Python, Scala, NoSQL, Flume, Oozie
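A minimal sketch of a simple Hive UDF written in Scala, in the spirit of the UDF bullet above; the class name and the normalization logic are hypothetical examples, not the original application requirements.

```scala
import org.apache.hadoop.hive.ql.exec.UDF

// Hypothetical example UDF: normalize a free-text code column by trimming
// whitespace and upper-casing it. Hive invokes evaluate() once per row.
class NormalizeCode extends UDF {
  def evaluate(raw: String): String = {
    if (raw == null) null
    else raw.trim.toUpperCase
  }
}
```

After packaging the class into a JAR, it would typically be registered in Hive with ADD JAR followed by CREATE TEMPORARY FUNCTION before being used in queries.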
Confidential
Hadoop Developer
Responsibilities:
- Responsible for managing data ingestion from various sources.
- Loaded each day's web-scraped data into the Hadoop cluster using Flume.
- Involved in loading data from the UNIX file system into HDFS (see the sketch after this job's environment listing).
- Created Hive tables and worked on them using HiveQL.
- Used Pig to extract, transform, and load semi-structured data.
- Collected data logs from web servers and integrated into HDFS using Flume.
- Used Sqoop to move data from RDBMS to HDFS.
Environment: Apache Hadoop, HDFS, HBase, MySQL, Linux, Apache Hive, Apache Pig, Python, Flume
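A minimal sketch of copying local (UNIX file system) data into HDFS with the Hadoop FileSystem API, as referenced above; both paths are hypothetical placeholders.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object LocalToHdfs {
  def main(args: Array[String]): Unit = {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    // Hypothetical source and destination paths.
    val localPath = new Path("/data/exports/daily_logs.csv")
    val hdfsPath  = new Path("/user/etl/raw/daily_logs.csv")

    // copyFromLocalFile(delSrc = false, overwrite = true, src, dst)
    fs.copyFromLocalFile(false, true, localPath, hdfsPath)
    fs.close()
  }
}
```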