Hadoop/Spark Developer Resume
San Francisco, CA
SUMMARY
- 6+ years of professional IT experience in the analysis, design, development, testing, and implementation of Big Data technologies such as the Hadoop and Spark ecosystems, data warehousing, and AWS, with a foundation in object-oriented programming.
- 4 years of comprehensive Big Data experience with Hadoop and its ecosystem components, including HDFS, Spark with Scala and Python, ZooKeeper, YARN, MapReduce, Pig, Sqoop, HBase, Hive, Flume, Oozie, Kafka, Spark Streaming, and Tez.
- Hands-on experience using various Hadoop distributions (Hortonworks and Cloudera).
- Experienced in performing CRUD operations using the HBase Java client API.
- Expertise in implementing ad-hoc queries using HiveQL, and good knowledge of creating Hive tables and loading and analyzing data using Hive queries.
- Experienced in working with structured data using HiveQL, join operations, Hive UDFs, partitions, bucketing and internal/external tables.
- Debugged Hive scripts and applied various optimization techniques to MapReduce jobs; wrote custom UDFs and UDAFs to extend Hive core functionality.
- Worked with relative ease across different working strategies, including Agile, Waterfall, Scrum, and Test-Driven Development (TDD) methodologies.
- Experience with all stages of the SDLC under the Agile development model, from requirements gathering through deployment and production support.
- Hands-on experience with AWS components such as EC2, S3, Data Pipeline, RDS, Redshift, and EMR.
- Imported data from different sources such as AWS S3 and the local file system into Spark RDDs, and worked on Amazon Web Services cloud components (EMR, S3, EC2, Lambda).
- Experience developing and maintaining applications built on Amazon Simple Storage Service (S3), AWS Elastic Beanstalk, and AWS CloudFormation.
- Hands-on experience working with Flume to load log data from multiple web sources directly into HDFS.
- Experience importing and exporting data between relational database systems and HDFS using Sqoop.
- Uploaded and processed terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop.
- Very good exposure to working with various file formats (Parquet, Avro & JSON) and compressions (Snappy & Gzip).
- Hands-on experience designing and creating data ingest pipelines using technologies such as Apache Storm and Kafka.
- Good working experience with Spark (Spark Streaming, Spark SQL) using Scala and Kafka; worked on reading multiple data formats on HDFS using Scala.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Hands-on experience with Spark Core, Spark SQL, and the DataFrame/Dataset/RDD APIs.
- Replaced existing MapReduce jobs and Hive scripts with Spark DataFrame transformations and actions.
- Experienced working with Spark Streaming, Spark SQL, and Kafka for real-time data processing (a brief sketch of this pattern follows this summary).
- Created data flows between SQL Server and Hadoop clusters using Apache NiFi.
- Experienced in running queries using Impala and in using BI tools to run ad-hoc queries directly on Hadoop.
- Experience using Kafka brokers as a source for Spark Streaming and processing live streaming data as RDDs.
- Exposure to ETL tools like Talend.
- Used Oozie and ZooKeeper operational services for cluster coordination and workflow scheduling.
- Experience with version control tools like Git, CVS and SVN.
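A minimal Scala sketch of the Kafka-to-Spark streaming pattern referenced above, using Structured Streaming; the broker address, topic, and field names are illustrative placeholders rather than details of any specific project.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object KafkaStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-stream-sketch")
      .getOrCreate()
    import spark.implicits._

    // Read a live stream from Kafka; broker and topic names are placeholders.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "customer-events")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // Count events per type extracted from the JSON payload.
    val counts = events
      .select(get_json_object($"json", "$.eventType").as("eventType"))
      .groupBy($"eventType")
      .count()

    // Write running counts to the console; in practice the sink would be HDFS, Hive, or S3.
    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```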
TECHNICAL SKILLS
Hadoop Ecosystem: HDFS, YARN, Spark, Scala, MapReduce, Hive, Pig, ZooKeeper, Sqoop, Oozie, Flume, Kafka, Impala, NiFi, MongoDB, Cassandra, HBase.
Databases: Oracle, MS-SQL Server, MySQL, PostgreSQL, NoSQL (HBase, Cassandra, MongoDB), Teradata.
IDE and Tools: Eclipse, IntelliJ, PyCharm, Maven, Jenkins.
Amazon Web Services (AWS): EMR, EC2, S3, CloudWatch, IAM, Lambda, and SNS.
Operating Systems: Windows, Linux, UNIX.
Scripting: Python, shell
Version Control: GitHub, SVN, CVS.
Packages: MS Office Suite, MS Visio, MS Project Professional.
Languages: Java, Scala, and Python.
PROFESSIONAL EXPERIENCE
Confidential, San Francisco, CA
Hadoop/Spark Developer
Responsibilities:
- Developed a data pipeline using Kafka, Sqoop, Hive, and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
- Developed Sqoop scripts for importing and exporting data between relational databases and HDFS/Hive.
- Developed design documents covering the possible approaches and identifying the best of them.
- Responsible for managing data coming from different sources.
- Developed business logic using Scala.
- Responsible for loading data from UNIX file systems into HDFS.
- Created Hive tables, loaded them with data, and wrote Hive queries that invoke MapReduce jobs in the backend.
- Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Developed scripts that automated end-to-end data management and synchronization across all the clusters.
- Explored Spark for improving the performance and optimization of the existing algorithms in Hadoop.
- Imported data from different sources such as HDFS and HBase into Spark RDDs.
- Experienced with SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala (see the sketch at the end of this role).
- Implemented the workflows using Apache Oozie framework to automate tasks.
- Configured the connection between Hive and Tableau using Impala as the engine for the BI development tool.
- Worked in an Agile methodology and used JIRA to maintain the project stories.
- Automated database activities using Unix shell scripting.
- Working experience with Linux distributions such as Red Hat and CentOS.
- Good analytical, communication, and problem-solving skills, with enthusiasm for learning new technical and functional skills.
Environment: Hadoop, MapReduce, Hive, Java, Maven, Impala, Pig, Spark, Oozie, Oracle, YARN, GitHub, Tableau, Unix, Cloudera, Kafka, Sqoop, Scala, HBase.
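A minimal Scala sketch of converting a HiveQL-style aggregation into Spark transformations, as referenced above; it is shown with the DataFrame API rather than raw RDDs, and the database, table, and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HiveToSparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-spark-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Equivalent of a HiveQL aggregation, expressed as DataFrame transformations.
    val transactions = spark.table("customer_db.transactions")

    val monthlySpend = transactions
      .filter(col("status") === "POSTED")
      .groupBy(col("customer_id"), date_format(col("txn_date"), "yyyy-MM").as("month"))
      .agg(sum("amount").as("total_spend"))

    // Persist the results back to a Hive table for downstream reporting.
    monthlySpend.write.mode("overwrite").saveAsTable("analytics.monthly_spend")
  }
}
```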
Confidential, Denver, CO
BigData Engineer
Responsibilities:
- Worked in an AWS environment on the development and deployment of custom Hadoop applications.
- Involved in the end-to-end process of Hadoop jobs that used technologies such as Sqoop, Hive, MapReduce, Spark, and shell scripts (for scheduling a few jobs); extracted and loaded data into a data lake environment (Amazon S3) using Sqoop, where it was accessed by business users and data scientists.
- Designed a data workflow model to create a data lake in the Hadoop ecosystem so that reporting tools such as Tableau can plug in to generate the necessary reports.
- Created Source-to-Target Mappings (STMs) for the required tables based on the business requirements for the reports.
- Developed PySpark and Spark SQL code in Apache Spark on Amazon EMR to perform the necessary transformations defined by the STMs.
- Created Hive tables on HDFS to store the data processed by Apache Spark on the Hadoop cluster in Parquet format (see the sketch at the end of this role).
- Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed file formats.
- Loaded log data directly into HDFS using Flume.
- Leveraged AWS S3 as the storage layer for HDFS.
- Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark.
- Used Bitbucket as the code repository and frequently used Git commands (clone, push, and pull, to name a few) against the Git repository.
- Used the Hadoop ResourceManager to monitor jobs run on the Hadoop cluster.
- Used Confluence to store the design documents and the STMs.
- Met with business and engineering teams on a regular basis to keep requirements in sync and deliver against them.
- Used Jira as an agile tool to keep track of the stories worked on under the Agile methodology.
Environment: Spark, PySpark, Spark SQL, Hive, Flume, IntelliJ IDE, AWS CLI, AWS EMR, AWS S3, REST API, shell scripting, Git.
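A minimal sketch of storing Spark-processed data as a partitioned Parquet Hive table, as referenced above. The role's work was done in PySpark; the sketch uses Scala for consistency with the other examples, and the S3 path, schema, and table names are placeholders.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object ParquetHiveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-hive-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Source path on S3 and all names below are illustrative placeholders.
    val raw = spark.read.json("s3://example-bucket/landing/orders/")

    // Apply an STM-style column mapping, then store as a partitioned Parquet Hive table.
    val mapped = raw.selectExpr(
      "order_id",
      "CAST(order_ts AS timestamp) AS order_ts",
      "CAST(amount AS decimal(12,2)) AS amount",
      "region"
    )

    mapped.write
      .mode(SaveMode.Overwrite)
      .format("parquet")
      .partitionBy("region")
      .saveAsTable("curated.orders")
  }
}
```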
Confidential, Lafayette, LA
BigData Engineer
Responsibilities:
- Worked on analyzing the Hadoop cluster and different big data analytic tools, including Spark, HDFS, and Hive.
- Developed Spark code using Scala and Spark SQL for faster testing and data processing.
- Involved in the development of a Spark application for one of the data sources using Scala and Spark.
- Experience in managing and reviewing Hadoop log files.
- Experienced in tuning Spark applications by setting the right batch interval, the correct level of parallelism, and appropriate memory settings (see the sketch at the end of this role).
- Involved in performance tuning wherever there was latency or delay in code execution.
- Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
- Optimized existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Developed various algorithms for generating data patterns and created Airflow workflows to run multiple MapReduce and Hive jobs.
- Implemented test scripts to support test driven development and continuous integration.
- Involved in loading data from the Linux file system to HDFS.
- Worked on tuning Hive to improve performance and solve performance-related issues in Hive scripts, with a good understanding of joins, grouping and aggregation, and how they map to MapReduce jobs.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, managing and reviewing data backups, and managing and reviewing Hadoop log files.
- Created a POC on Cloudera and suggested best practices for the CDH platform.
- Installed and configured the CDH cluster, using Cloudera Manager for easy management of the existing Hadoop cluster.
- Worked on setting up high availability for major production cluster. Performed Hadoop version updates using automation tools.
- Responsible for building scalable distributed data solutions using Hadoop.
Environment: Hadoop, HDFS, YARN, Sqoop, Hive, Cloudera Manager, Shell Scripting, Linux, Apache Airflow, NiFi, Replica, Spark, Scala, Java.
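A minimal Scala sketch of the kind of Spark tuning referenced above (shuffle parallelism, executor memory, serialization, and caching); the specific values are illustrative assumptions and would be chosen based on cluster size and data volume, and the table names are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SparkTuningSketch {
  def main(args: Array[String]): Unit = {
    // Representative tuning knobs; actual values depend on the cluster and workload.
    val spark = SparkSession.builder()
      .appName("spark-tuning-sketch")
      .config("spark.sql.shuffle.partitions", "400")  // match parallelism to total executor cores
      .config("spark.executor.memory", "8g")
      .config("spark.memory.fraction", "0.6")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .enableHiveSupport()
      .getOrCreate()

    // Cache a hot dataset so repeated aggregations avoid re-reading from HDFS.
    val events = spark.table("staging.events").persist(StorageLevel.MEMORY_AND_DISK)

    val summary = events.groupBy("event_type").count()
    summary.write.mode("overwrite").saveAsTable("analytics.event_counts")

    events.unpersist()
  }
}
```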
Confidential
Hadoop Developer
Responsibilities:
- Imported data from various database systems such as DB2 and Oracle into HDFS using Sqoop 1.
- Accepted and processed the material movement feeds in various formats such as CSV, XML, and fixed-length flat files.
- Developed shell scripts to transform the file feeds before processing them into the data mart.
- Developed Hive scripts and UDFs to transform and load the transportation feeds into Hive staging tables.
- Developed Hive scripts to validate the data feeds on HDFS and capture invalid transactions.
- Performed data transformation by joining Hive master-data and transaction tables while running incremental loads to the data marts on Hive.
- Involved in Hive performance tuning by changing join strategies and implementing indexing, partitioning, and bucketing on the transactional data (see the sketch at the end of this role).
- Developed Sqoop jobs to export data back to DB2 tables for downstream front-end applications.
- Experience handling VSAM files on the mainframe and transforming them to a different code page before moving them to HDFS using SFTP.
- Used crontab and Oozie workflows to automate data feed processing from the various sources and the incremental master-data loads from the DB2 tables.
Environment: CDH4, HDFS, Hive, Sqoop, DB2, HBase, Shell scripting, Oozie.
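A minimal sketch of a partitioned Hive staging table and an incremental load, as referenced above. The original work used HiveQL scripts on CDH4; the sketch expresses the same pattern through Spark SQL in Scala for consistency with the other examples, and all database, table, and column names are assumed placeholders.

```scala
import org.apache.spark.sql.SparkSession

object HiveIncrementalLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-incremental-load-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Allow dynamic partitions for the incremental insert.
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // Partitioned staging table for the movement feeds; names are placeholders.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS staging.movement_feed (
        movement_id   STRING,
        material_code STRING,
        quantity      DECIMAL(12,2)
      )
      PARTITIONED BY (feed_date STRING)
      STORED AS PARQUET
    """)

    // Incremental load: append only the latest day's validated records.
    spark.sql("""
      INSERT INTO TABLE staging.movement_feed PARTITION (feed_date)
      SELECT movement_id, material_code, quantity, feed_date
      FROM   landing.movement_feed_raw
      WHERE  feed_date = date_format(current_date(), 'yyyy-MM-dd')
      AND    quantity IS NOT NULL
    """)
  }
}
```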
Confidential
Data Analyst
Responsibilities:
- Using data mining to extract information from data sets and identify correlations and patterns.
- Organizing and transforming information into comprehensible structures.
- Using data to predict trends in the customer base and the consumer population as a whole.
- Performing statistical analysis of data.
- Using tools and techniques to visualize data in easy-to-understand formats, such as diagrams and graphs.
- Preparing reports and presenting these to management or clients.
- Identifying and recommending new ways to save money by streamlining business processes.
- Monitoring data quality and removing corrupt data.
- Communicating with stakeholders to understand data content and business requirements.
- Implemented Power BI solutions based on business requirements and planned the creation of interactive dashboards.
- Created analytical dashboards using Power BI that allowed end users to apply filters.
