
Big Data Developer Resume


Piscataway, NJ

SUMMARY

  • Extensive experience as a Hadoop developer across domains including finance, insurance, and media services.
  • Hands-on experience with the Apache Hadoop ecosystem, including HDFS, MapReduce, Spark, Hive, HBase, Yarn, Sqoop, Avro, Parquet, Flume, Kafka, and Zookeeper.
  • Deep understanding of workload management, schedulers, scalability, and distributed platform architectures.
  • Good knowledge of distributed programming with Spark, specifically in Scala and Python.
  • Proficient with Apache Spark, programming in Scala and Python to analyze large datasets, and with Spark Streaming and Kafka to process real-time data.
  • Experience writing HiveQL queries to preprocess and analyze large volumes of data (a minimal sketch follows this summary).
  • Experience importing and exporting bulk data between HDFS/Hive/HBase and RDBMS using Sqoop.
  • Experience working with RDBMSs including Oracle and MySQL.
  • Experience developing scalable solutions using NoSQL databases including Cassandra, HBase, and MongoDB.
  • Knowledge of data serialization and familiarity with data formats including SequenceFile, Avro, Parquet, XML, and JSON.
  • Experience with commercial Hadoop distributions including Hortonworks HDP and Cloudera CDH.
  • Experience in all phases of the data warehouse life cycle, including requirements analysis, design, coding, testing, and deployment.
  • Developed optimized, scalable machine learning models for predictive modeling using Spark, Spark MLlib, scikit-learn, and other packages; performed feature engineering in Python on Spark to derive effective input features for predictive models.
  • Proficient in statistical modeling, data mining, and machine learning algorithms for data science, forecasting, and predictive analytics, including linear and logistic regression, random forests, K-Means, neural networks, decision trees, SVM, and Naive Bayes.
  • Proficient in Python for web scraping, NLP, optimization, and plotting, with packages including pandas, NumPy, SciPy, scikit-learn, Matplotlib, Seaborn, pyecharts, and NLTK.
  • Strong in data structures, algorithms, and object-oriented design.
  • Experience in unit testing with ScalaTest.
  • Experience in Tableau reporting.
  • Familiar with software development tools like Git.
  • Exposed to various software development methodologies, including Agile and Waterfall.
  • A good team player and self-motivated learner who can work independently in a fast-paced, multitasking environment.
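
To illustrate the HiveQL and columnar-format bullets above, here is a minimal PySpark sketch; the HDFS path, table name, and columns (clickstream, dt, user_id) are hypothetical placeholders rather than datasets from the work described.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hiveql-preprocessing-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Read a columnar (Parquet) dataset from HDFS and expose it to Spark SQL.
    events = spark.read.parquet("hdfs:///data/clickstream/")
    events.createOrReplaceTempView("clickstream")

    # HiveQL-style aggregation: distinct daily active users.
    daily_counts = spark.sql("""
        SELECT dt, COUNT(DISTINCT user_id) AS active_users
        FROM clickstream
        GROUP BY dt
        ORDER BY dt
    """)
    daily_counts.show()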

TECHNICAL SKILLS

  • Hadoop 2.7.3, MapReduce, Spark 2.x, Hive 0.13+, HBase, HDFS, Zookeeper, Kafka 0.10.x, Flume 1.6.0, Yarn, Sqoop
  • Oracle 11g, MySQL 5.x, HBase 0.98, Cassandra 2.1.x, MongoDB 3.2
  • Scala 2.11.8, Python 2.7/3.6, SQL, R, Java 8, HTML5, CSS3, Unix/Bash shell
  • Machine Learning
  • Linux, Mac OS, Windows, CentOS
  • Amazon Web Services EC2/EMR/S3
  • IntelliJ IDEA, PyCharm, Eclipse, Jupyter Notebook, Spyder, RStudio
  • Git/GitHub, Agile/Scrum, Tableau, PuTTY

PROFESSIONAL EXPERIENCE

Confidential, Piscataway, NJ

Big Data Developer

Responsibilities:

  • Designed, developed, implemented, tested, and maintained data ingestion and integration ETL pipelines involving Flume, Kafka, batch processing, Spark Streaming, and big data APIs.
  • Developed and optimized multi-threaded scripts using the Kafka producer and consumer APIs.
  • Developed Spark Streaming programs to process real-time data from Kafka, applying both stateless and stateful transformations (a minimal sketch follows this list).
  • Developed Spark programs in Scala, applying functional programming principles for batch processing.
  • Utilized Spark SQL with the DataFrames API for efficient structured data processing.
  • Stored raw data in HBase for long-term storage and processed results in Cassandra for future decision support and BI analytics.
  • Created multiple Hive tables with partitioning and bucketing for efficient data access.
  • Performed unit testing using ScalaTest.
  • Responsible for building the code and managing dependencies using Maven.
  • Used Git for version control and JIRA for project tracking.
  • Actively participated in the software development lifecycle, including scoping, design, implementation, testing, and code reviews.
  • Worked on Cloudera Hadoop ecosystem with Agile Development methodology.
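
The Kafka-to-Spark-Streaming bullets above can be sketched roughly as follows. The work described was done in Scala, so this PySpark version is only an illustration; the topic, broker address, record layout, and running-total logic are assumptions, and the matching spark-streaming-kafka package would need to be on the classpath.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="kafka-streaming-sketch")
    ssc = StreamingContext(sc, batchDuration=10)        # 10-second micro-batches
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")  # required for stateful ops

    # Direct stream from a hypothetical "transactions" topic (CSV-like records).
    stream = KafkaUtils.createDirectStream(
        ssc, ["transactions"], {"metadata.broker.list": "broker1:9092"})

    # Stateless transformation: parse each value into (account_id, amount) pairs.
    pairs = (stream.map(lambda kv: kv[1].split(","))
                   .map(lambda fields: (fields[0], float(fields[1]))))

    # Stateful transformation: keep a running total per account across batches.
    def update_total(new_amounts, running_total):
        return sum(new_amounts) + (running_total or 0.0)

    totals = pairs.updateStateByKey(update_total)
    totals.pprint()

    ssc.start()
    ssc.awaitTermination()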

Environment: Cloudera CDH 5.x, Hadoop 2.x, Linux, Scala 2.11.8, HDFS, HBase 1.2.x, Kafka 0.10.x, Spark 2.x, Spark Streaming, Spark SQL, Cassandra 2.1.x, Zookeeper 3.4.x, ScalaTest, Git, JIRA

Confidential, Hartford, CT

Data Analyst

Responsibilities:

  • Involved in ETL processes including data cleaning, data processing, and data storage.
  • Used Apache NiFi to transfer data from databases and data lakes to HDFS.
  • Used Hive and Impala for batch reporting and managed long-term storage in HDFS.
  • Developed and optimized Spark Scala programs to perform data enrichment, transformation, and wrangling (a minimal sketch follows this list).
  • Converted raw data to efficient serialization formats such as Parquet to reduce processing time and improve transfer efficiency over the network.
  • Worked with analytics team to build BI dashboards with Python and Tableau.
  • Performed Exploratory Data Analysis (EDA) on procure-to-pay and order-to-cash processes.
  • Implemented new statistical methodologies as needed to generate key decision-making KPIs.
  • Documented system processes and procedures for future reference.
  • Involved in application performance tuning and troubleshooting.
  • Performed unit testing using pytest and ScalaTest.
  • Collaborated on and tracked work with Git.
  • Actively participated and provided constructive feedback during daily stand-up meetings and weekly iteration review meetings.
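
A minimal sketch of the enrichment and Parquet-conversion steps above, written in PySpark for illustration (the work described used Scala); the input paths, join key, and column names are assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("enrich-and-convert").getOrCreate()

    # Raw landing-zone data and a reference table, both hypothetical.
    raw = spark.read.option("header", "true").csv("hdfs:///landing/invoices/")
    vendors = spark.read.option("header", "true").csv("hdfs:///reference/vendors/")

    # Enrichment and wrangling: join in vendor attributes, fix types, drop bad rows.
    enriched = (raw.join(vendors, on="vendor_id", how="left")
                   .withColumn("amount", F.col("amount").cast("double"))
                   .filter(F.col("amount").isNotNull()))

    # Persist in a columnar format to cut processing time and network transfer.
    enriched.write.mode("overwrite").parquet("hdfs:///curated/invoices_parquet/")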

Environment: Hadoop 2.x, HDFS, Hive 0.14, Spark 2.x, Python 3.6, Scala 2.11.8, HBase 1.2.x, NiFi 1.x, Impala, Tableau, Git, Agile

Confidential, Hartford, CT

Big Data Developer

Responsibilities:

  • Loaded and transformed large sets of structured and semi-structured data, including product descriptions, image tags, categories, and delivery information.
  • Created web crawlers in Python to scrape product and seller information from websites and store it in MySQL; parsed HTML pages with regular expressions and Beautiful Soup to extract the relevant fields and build a local database (a minimal sketch follows this list).
  • Used the PyMySQL API to open connections to the MySQL database.
  • Integrated data from different sources; cleaned, transformed, and loaded it into MySQL.
  • Designed the logical and physical data models and generated DDL and DML scripts.
  • Migrated data between RDBMS and HDFS/Hive with Sqoop for data backup.
  • Created Hive tables, analyzed data with Hive queries, and wrote Hive UDFs.
  • Created Python scripts for administration, maintenance, and troubleshooting.
  • Involved in reviewing functional requirements and designing solutions.
  • Involved in requirements gathering, design, development, and testing.
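
A minimal sketch of the scrape-and-load flow above; the URL, CSS selectors, table schema, and credentials are hypothetical placeholders for the actual site and database.

    import re

    import pymysql
    import requests
    from bs4 import BeautifulSoup

    # Fetch and parse one hypothetical listing page.
    html = requests.get("https://example.com/products?page=1", timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    rows = []
    for card in soup.select("div.product-card"):
        name = card.select_one("h2.title").get_text(strip=True)
        seller = card.select_one("span.seller").get_text(strip=True)
        price_text = card.select_one("span.price").get_text()
        price = float(re.sub(r"[^\d.]", "", price_text))  # strip currency symbols
        rows.append((name, seller, price))

    # Load the scraped rows into MySQL through PyMySQL.
    conn = pymysql.connect(host="localhost", user="etl", password="secret",
                           database="catalog", charset="utf8mb4")
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO products (name, seller, price) VALUES (%s, %s, %s)", rows)
    conn.commit()
    conn.close()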

Environment: Linux, MySQL 5.x, Python 3.6, Sqoop 1.4.6, Hive 0.14, HDFS

Confidential, Hartford, CT

Data Analyst

Responsibilities:

  • Moved relational database data into HDFS and Hive dynamic-partition tables using Sqoop with staging tables.
  • Extracted application data from application servers into Spark and HDFS using the Kafka producer API and a Kafka Connect sink.
  • Used Spark Streaming to ingest data into the Spark engine.
  • Performed aggregations using Spark by loading data from HDFS.
  • Worked with different file formats stored in HDFS.
  • Increased job performance by tuning parallelism in Spark.
  • Transformed RDDs to DataFrames for querying and analytics.
  • Built machine learning models with MLlib using Python on Spark (a minimal sketch follows this list).
  • Preprocessed and normalized the data with one-hot encoding.
  • Implemented logistic regression, SVM, decision tree, and K-Means algorithms for a fraud detection model and applied model stacking, improving AUC by 5%.
  • Managed the metadata associated with the ETL processes used to populate the data warehouse.
  • Partnered with other analysts to develop data infrastructure (data pipelines, reports, etc.) and other tools to make analytics easier and more effective.
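
A minimal sketch of the one-hot encoding and logistic regression steps above using Spark ML; the input path, column names (merchant_category, amount, is_fraud), and the 80/20 split are assumptions, with is_fraud taken to be a numeric 0/1 label.

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fraud-model-sketch").getOrCreate()
    df = spark.read.parquet("hdfs:///curated/transactions/")  # hypothetical path

    # Index and one-hot encode a categorical column, assemble features, then fit
    # a logistic regression classifier (OneHotEncoder as used in Spark 2.x).
    stages = [
        StringIndexer(inputCol="merchant_category", outputCol="mcc_idx"),
        OneHotEncoder(inputCol="mcc_idx", outputCol="mcc_vec"),
        VectorAssembler(inputCols=["amount", "mcc_vec"], outputCol="features"),
        LogisticRegression(featuresCol="features", labelCol="is_fraud"),
    ]

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = Pipeline(stages=stages).fit(train)

    evaluator = BinaryClassificationEvaluator(labelCol="is_fraud",
                                              metricName="areaUnderROC")
    print("AUC:", evaluator.evaluate(model.transform(test)))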

Environment: Apache Spark, HBase, YARN, Sqoop, HDFS, Hive, Python 3.6
