Big Data Engineer Resume
Phoenix, AZ
SUMMARY:
- 5 years of hands-on experience in Big Data technologies and Machine Learning algorithms, processing highly distributed, massive volumes of data using MapR and Cloudera Hadoop distributions.
- Solid understanding of Hadoop and YARN architecture and the workings of the Hadoop framework, including the Hadoop Distributed File System (HDFS) and technologies such as MapReduce, Pig, Hive, HBase, Flume, Sqoop, ZooKeeper, Oozie, Storm, Spark, and Kafka.
- Worked on real-time data integration using Kafka data pipelines and Spark Streaming with NoSQL databases such as HBase, Cassandra, and MongoDB (a brief Scala sketch of this pattern follows this summary).
- Experienced in using PL/SQL to write Stored Procedures, Functions and Triggers.
- Extensive knowledge of developing Spark Streaming jobs using RDDs (Resilient Distributed Datasets) and leveraging PySpark and spark-shell as appropriate. Good knowledge of job workflow scheduling and monitoring tools such as Oozie and ZooKeeper.
- Familiar with Amazon Web Services, including provisioning and maintaining AWS resources such as EMR, S3 buckets, EC2 instances, and RDS.
- Implemented machine learning models such as Random Forests, K-Means Clustering, KNN (k-nearest neighbors), Naive Bayes, SVM (Support Vector Machines), Decision Trees, and Linear and Logistic Regression.
- Experienced in all phases of the Software Development Life Cycle (SDLC); worked in Agile environments.
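A minimal Scala sketch of the Kafka-to-HDFS streaming pattern referenced above, using Spark Structured Streaming; it assumes the spark-sql-kafka connector is on the classpath, and the broker address, topic name, and paths are placeholders rather than actual project values.

import org.apache.spark.sql.SparkSession

object KafkaIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-ingest-sketch")
      .getOrCreate()

    // Subscribe to a Kafka topic and expose key/value as strings
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
      .option("subscribe", "events")                     // placeholder topic
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Persist the stream to HDFS as Parquet; the checkpoint directory provides fault tolerance.
    // Writing to Cassandra instead would swap in the spark-cassandra-connector as the sink.
    val query = events.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/events")             // placeholder output path
      .option("checkpointLocation", "hdfs:///checkpoints/events")
      .start()

    query.awaitTermination()
  }
}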
TECHNICAL SKILLS:
OPERATING SYSTEMS: Windows/UNIX
LANGUAGES: Python, Scala, Java, Shell, SQL
RELATIONAL DATABASES: MySQL, Oracle, SQL Server
NoSQL DATABASES: Cassandra, AWS DynamoDB
VERSION CONTROL: GIT, SVN
CLOUD: AWS EMR, AWS EC2, AWS S3, RDS
BIG DATA ECOSYSTEM: HDFS, Pig, MapReduce, YARN, Hive, Sqoop, Flume, Oozie, HBase
DATA INGESTION: Sqoop, Kafka, Flume
DATA PROCESSING: Spark, Hive, MapReduce
MACHINE LEARNING: Spark MLlib, TensorFlow, scikit-learn, Keras, NLTK
WEB TECHNOLOGIES: HTML, XML, JavaScript, jQuery
PROFESSIONAL EXPERIENCE:
Confidential, Phoenix, AZ
Big Data Engineer
Responsibilities:
- Designed and implemented big data ingestion pipelines to ingest data from various data sources using Kafka and Spark Streaming, including data quality checks and transformations, and stored the output in efficient storage formats. Stored the streaming data to HDFS and Cassandra using Scala.
- Documented the requirements, including the existing code to be reimplemented using Spark, Hive, HDFS, HBase, and Elasticsearch.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Built models using statistical techniques and machine learning classification models such as XGBoost, SVM, and Random Forest. Designed and developed advanced predictive analytics models. Modeled and framed business scenarios that are meaningful and impact critical business processes and/or decisions.
- Developed Scala scripts to read all the Parquet tables in a database and rewrite them as JSON files (a brief sketch follows the technologies list below).
- Extensively used Apache Sqoop for efficiently transferring bulk data between Apache Hadoop and relational databases (Oracle, MySQL) for predictive analytics.
Technologies Used: Scala, Python, PySpark, HDFS, HBase, Cassandra, REST API, Hive, Pig, Pandas, NumPy, Unix shell, Apache Spark, Kafka, AWS EC2, S3, Redshift, EMR, Elasticsearch.
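A minimal Scala sketch of the Parquet-to-JSON export described above; the database name and output path are hypothetical, and the job assumes access to the Hive metastore through Spark.

import org.apache.spark.sql.SparkSession

object ParquetToJsonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-to-json-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val db = "analytics_db"             // hypothetical database name
    val outRoot = "hdfs:///export/json" // hypothetical output root

    // Enumerate every table in the database, read it, and rewrite it as JSON files
    spark.sql(s"SHOW TABLES IN $db").collect().foreach { row =>
      val table = row.getAs[String]("tableName")
      spark.table(s"$db.$table")
        .write
        .mode("overwrite")
        .json(s"$outRoot/$table")
    }

    spark.stop()
  }
}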
Confidential
Big Data Engineer
Responsibilities:
- Worked on processing large volumes of data using big data analytics tools such as Hive, Sqoop, Pig, Flume, Oozie, CDH5, HBase, and Scala.
- Used Sqoop to import and export data between relational data sources such as MySQL and HDFS.
- Responsible for managing data coming from different sources; involved in HDFS maintenance and loading of structured and semi-structured data.
- Used Apache Solr to search for specific products each cycle for the business.
- Designed and documented the project use cases, wrote test cases, led the offshore team, and interacted with the client.
- Used Git as version control to check out and check in files.
- Loaded data into external tables using Hive scripts.
- Performed aggregations, joins, and transformations using Hive queries.
- Implemented partitions, dynamic partitions, and buckets in Hive (a brief sketch follows this list).
- Optimized HiveQL queries, improving job performance.
- Developed Sqoop scripts to import and export data from relational sources, and handled incremental loading of customer and transaction data by date.
- Performed Hadoop cluster administration, including adding and removing cluster nodes, capacity planning, performance tuning, monitoring, and troubleshooting.
- Wrote unit test cases for Hive scripts.
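A minimal sketch of the external-table load and dynamic-partition insert described in the Hive bullets above, expressed as HiveQL run through Spark's Hive support; the table names, columns, and paths are hypothetical.

import org.apache.spark.sql.SparkSession

object HivePartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-partition-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Allow dynamic partition inserts (Hive defaults to strict mode)
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // External table over raw delimited files already in HDFS (hypothetical schema and path)
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS staging_txn (
        |  txn_id STRING, customer_id STRING, amount DOUBLE, txn_date STRING)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        |LOCATION 'hdfs:///raw/transactions'""".stripMargin)

    // Partitioned target table, then a dynamic-partition insert keyed on txn_date
    spark.sql(
      """CREATE TABLE IF NOT EXISTS txn_by_date (
        |  txn_id STRING, customer_id STRING, amount DOUBLE)
        |PARTITIONED BY (txn_date STRING)
        |STORED AS PARQUET""".stripMargin)

    spark.sql(
      """INSERT OVERWRITE TABLE txn_by_date PARTITION (txn_date)
        |SELECT txn_id, customer_id, amount, txn_date FROM staging_txn""".stripMargin)

    spark.stop()
  }
}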