Big Data Engineer Resume
White Plains, NY
SUMMARY:
- Data enthusiast with broad experience in IT and data-technology solutions and extensive knowledge of the SDLC and data modeling.
- Extensive experience with the big data ecosystem, including Hadoop, HDFS, YARN, MapReduce, Mesos, NiFi, StreamSets, Kudu, Spark, Hive, Impala, Pig, HBase, Sqoop, Flume, Kafka, Oozie and ZooKeeper.
- In-depth understanding of Hadoop and Spark architecture.
- Hands-on experience using Hive partitioning and bucketing and executing different types of joins on Hive tables (see the sketch after this summary).
- Hands-on experience in HiveQL with a good understanding of joins, grouping and aggregations, and query optimization.
- Worked with efficient storage formats such as Avro, Parquet and ORC integrated with the Hadoop ecosystem (Hive, Impala and Spark); also used Snappy and GZip compression.
- Experience importing and exporting structured and unstructured data to HDFS and Hive tables using Sqoop and Flume.
- Experience with column-oriented NoSQL databases such as HBase and Cassandra and their integration with Hadoop clusters.
- Strong understanding of Spark Core, Spark SQL, PySpark, Spark Streaming and machine learning (SVM, linear and logistic regression, KNN, decision trees, random forests, gradient boosting, naïve Bayes and cross-validation).
- Experience performing exploratory data analysis (EDA), dimensionality reduction (PCA), missing-value treatment and outlier treatment.
- Experience collecting, aggregating and moving large amounts of streaming data using Flume, Kafka and Spark Streaming.
- Good knowledge of AWS services such as EC2, S3, EMR, Redshift, DynamoDB, Aurora and Athena.
- Strong experience writing custom UDFs in Scala, Python and Java to extend Hive and Pig functionality.
- Strong database experience with SQL Server 2008 R2/2017, including T-SQL programming skills for creating stored procedures, functions, triggers and views.
- Experience in data visualization and reporting using Dask, Matplotlib, Seaborn and Tableau.
- Skilled at debugging application code and resolving production issues.
- Enthusiastic team player dedicated to streamlining processes and efficiently resolving project issues.
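The following is a minimal PySpark/HiveQL sketch of the Hive partitioning, bucketing and join pattern referenced above; the database, table and column names are hypothetical placeholders.

```python
# hive_partition_bucket.py - minimal sketch of Hive partitioning/bucketing and a join,
# driven from PySpark; database, table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-demo").enableHiveSupport().getOrCreate()

# Partition by load date and bucket by customer_id to help partition pruning and joins
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.transactions (
        txn_id BIGINT,
        customer_id BIGINT,
        amount DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# A join that benefits from partition pruning (load_date) and bucketing (customer_id)
daily_summary = spark.sql("""
    SELECT c.segment, SUM(t.amount) AS total_amount
    FROM sales.transactions t
    JOIN sales.customers c ON t.customer_id = c.customer_id
    WHERE t.load_date = '2019-06-01'
    GROUP BY c.segment
""")
daily_summary.show()
```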
TECHNICAL SKILLS:
Hadoop Ecosystem: Hadoop 2.1+, Spark 1.3+/2.1+, MapReduce, Pig 0.11+, Flume 1.3+, HBase 0.98+, Oozie, Sqoop 1.4+, HDFS, Kafka 0.8.1+, Zookeeper 3.4+, Airflow, Hive 0.10+/2.2+, Cloudera 4.X/5.X, Hortonworks
Web Technologies: Oracle WebLogic 11g/12c, OHS 11g, JSF 2.1, Flask 1.0+, HTML 5, CSS 3.3+, REST, JSON, XML, Tomcat 8.0+/9.0+, JBOSS 6.X, Splunk 6.5.X
Languages: Java 7/8, Scala 2.0+, Python 2.7+/3.3+, SQL, Pig Latin, Cypher, Julia, Shell Scripting, HQL, T-SQL, CQL
Cloud Technologies: AWS (EC2, S3, EMR, Redshift, DynamoDB, VPC, Aurora, Athena, SQS, SNS), Cloudcraft, Databricks Community Cloud
Machine Learning: Regressions, KNN, Random Forests, SVM, Decision Trees, Ensemble and Stacking methods, MLlib
Data Analysis and Visualization: Kibana 5.X, Tableau 10.2, Matplotlib, xlrd, Pandas, NumPy
Databases: MySQL 5.0+, MS SQL Server 2017/2008 R2, PostgreSQL 9.6, Cassandra 2.0+, Oracle 11g/12c, Neo4j 3.5.6, MongoDB 3.6, Elasticsearch 2.X
Others: Git, GitHub, GitLab, JIRA, Jenkins, Maven 3, Hibernate 2, SSIS 2008, Spring 3, MVC, Bonsai Express, Docker, Vagrant
PROFESSIONAL EXPERIENCE:
Confidential
Big Data Engineer
Responsibilities:
- Used NiFi to load flat files into Hive tables.
- Used YARN as the resource manager and HDFS as distributed storage in the cluster.
- Ran HiveQL scripts to derive insights.
- Checked the raw tables in the database for the correct attribute file.
- Developed Python scripts to check the attribute file for approved product codes and send email notifications for any discrepancies (see the sketch after this role).
- Loaded the files into HBase tables for downstream applications.
- Gained experience with various NoSQL databases and comprehensive knowledge of process improvement, normalization/denormalization, data extraction, data cleansing and data manipulation.
Environment: SQL Server 2017/2008 R2, Python 3.6, NiFi 1.9.0, Hadoop 2.7
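Below is a minimal sketch of the attribute-file check and email notification described in this role; the file paths, column name and SMTP host are hypothetical placeholders.

```python
# attribute_check.py - sketch of the product-code validation and email alert;
# paths, the "product_code" column and the SMTP host are placeholders.
import csv
import smtplib
from email.message import EmailMessage

APPROVED_CODES_FILE = "/data/reference/approved_product_codes.txt"  # placeholder path
ATTRIBUTE_FILE = "/data/incoming/attribute_file.csv"                # placeholder path

def load_approved_codes(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def find_discrepancies(attribute_path, approved):
    bad_rows = []
    with open(attribute_path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("product_code") not in approved:  # assumed column name
                bad_rows.append(row)
    return bad_rows

def notify(discrepancies):
    msg = EmailMessage()
    msg["Subject"] = f"Attribute file check: {len(discrepancies)} unapproved product codes"
    msg["From"] = "etl-alerts@example.com"
    msg["To"] = "data-team@example.com"
    msg.set_content("\n".join(str(r) for r in discrepancies[:50]))
    with smtplib.SMTP("smtp.example.com") as server:     # placeholder SMTP host
        server.send_message(msg)

if __name__ == "__main__":
    approved = load_approved_codes(APPROVED_CODES_FILE)
    discrepancies = find_discrepancies(ATTRIBUTE_FILE, approved)
    if discrepancies:
        notify(discrepancies)
```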
Confidential, White Plains, NY
DataOps Engineer
Responsibilities:
- Optimized Spark applications to perform data cleansing and data validation.
- Built a data pipeline using Spark, Hive, COBOL copybooks and Sqoop to transform and analyze data.
- Created Sqoop scripts to import/export data from RDBMS to the S3 data store.
- Created Spark applications extensively using Spark DataFrames and the Spark SQL API.
- Collaborated with platform engineers to develop a Python-based Kafka producer API that captures live streaming data into various Kafka topics (see the producer sketch after this role).
- Developed a Spark Streaming application to consume data from Kafka topics and insert the processed streams into HBase.
- Applied broadcast variables in Spark and efficient joins in Hive for data processing.
- Used Spark SQL to perform enrichment and prepare different levels of behavioral summaries.
- Implemented partitioning and bucketing in Hive to enhance query efficiency and join performance.
- Experience in the Amazon cloud environment using EMR clusters, S3 and Redshift.
Environment: Spark Streaming/Scala 2.11.8, Spark 2.2, Hive 2.3.2, Kafka 2.0.0, Sqoop 1.4.X, Hortonworks Distribution, Hadoop 2.7, EMR, COBOL copybooks, Redshift
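A minimal sketch of a Python Kafka producer like the one described in this role, using the kafka-python client; the broker address, topic name and record layout are hypothetical placeholders.

```python
# kafka_producer.py - sketch of a Python Kafka producer (kafka-python);
# broker address, topic name and event fields are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker1:9092",                        # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",                                              # wait for full acknowledgment
)

def publish(event: dict, topic: str = "live-events") -> None:
    """Send one event to the given Kafka topic."""
    producer.send(topic, value=event)

if __name__ == "__main__":
    publish({"user_id": 42, "action": "click"})
    producer.flush()  # block until queued records are delivered
```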
Confidential
Big Data Engineering
Responsibilities:
- Responsible for exploratory data analysis (EDA) and dimensionality reduction (PCA).
- Performed variable identification, missing-value treatment, outlier treatment, variable transformation, and univariate and bivariate analysis.
- Loaded data in various formats (flat files, JSON, Avro, Parquet) into the Spark cluster (see the sketch after this role).
- Applied Spark transformations and actions using Scala.
- Cleaned data and stored it in Hive tables for analysis.
- Connected Hive tables to Tableau and performed data visualization for reporting.
- Plotted trend and pattern analyses and compared companies' market capitalization from historical data.
- Created a 14-node Spark cluster with 11 executors and 1 driver.
- Created UDFs and made the functions available to each executor.
- Used GitHub for version control, JIRA for issue tracking.
Environment: CDH 5.X, Hadoop 2.6.X, Python 3.6, Scala 2.11.8, Spark 2.1, Hive 2.2.0, Tableau 10.2
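A sketch of the load/clean/store flow described in this role, shown in PySpark for illustration (the role itself used Scala); the paths, column names and Hive table name are hypothetical placeholders.

```python
# spark_load_clean.py - illustrative PySpark version of the load/clean/store flow;
# paths, columns and the Hive table name are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("load-clean-store")
         .enableHiveSupport()          # allow writing to Hive tables
         .getOrCreate())

# Load data in different formats into DataFrames
prices = spark.read.parquet("/data/prices/")          # placeholder path
companies = spark.read.json("/data/companies.json")   # placeholder path

# Example UDF; registering it this way ships the function to every executor
normalize_ticker = F.udf(lambda s: s.strip().upper() if s else None, StringType())

cleaned = (prices
           .dropna(subset=["ticker", "close"])
           .withColumn("ticker", normalize_ticker("ticker"))
           .join(companies, "ticker", "left"))

# Store the cleaned data in a Hive table for downstream analysis (e.g. Tableau)
cleaned.write.mode("overwrite").saveAsTable("analytics.market_history")
```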
Confidential
Hadoop Developer
Responsibilities:
- Extensively involved in installation and configuration of the Cloudera Hadoop distribution: NameNode, Secondary NameNode, JobTracker, TaskTrackers and DataNodes.
- Developed MapReduce programs in Java and used Sqoop to import data from an Oracle database.
- Responsible for building scalable distributed data solutions using Hadoop; wrote various Hive and Pig scripts.
- Moved data from HDFS to HBase using MapReduce and a bulk output format class.
- Experienced with scripting languages such as Python and shell scripts.
- Developed various Python scripts for vulnerability detection and data validation (see the validation sketch after this role).
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs that run independently based on time and data availability.
- Experienced in handling administration activities using Cloudera Manager.
- Expertise in partitioning and bucketing concepts in Hive.
- Analyzed weblog data using HiveQL and integrated Oozie with the rest of the Hadoop stack.
- Utilized cluster coordination services through ZooKeeper.
- Created scripts for data modeling and data import/export; extensive experience deploying, managing and developing on Cassandra clusters.
- Created partitioned Hive tables and worked on them using HiveQL.
- Developed Shell scripts to automate routine DBA tasks.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring, troubleshooting, managing and reviewing data backups and Hadoop log files.
Environment: Hadoop 2.1.0, Pig 0.9.0, Python 3.3.0, Hive 0.10, Oozie 3.3.1, Sqoop 1.4.3, HBase 2.2.0, Java 7, Avro, CDH 4.0, Zookeeper 3.4.5, Cassandra 2.0 and Shell Scripting
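An illustrative sketch of the kind of data-validation script mentioned in this role; the input path, required columns and checks are hypothetical placeholders.

```python
# validate_feed.py - sketch of a simple data-validation script;
# the input path, required columns and checks are placeholders.
import csv
import sys

REQUIRED_COLUMNS = {"id", "event_date", "amount"}  # assumed schema
INPUT_FILE = "/data/feeds/daily_extract.csv"       # placeholder path

def validate(path):
    errors = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing columns: {sorted(missing)}"]
        for i, row in enumerate(reader, start=2):
            if not row["id"]:
                errors.append(f"line {i}: empty id")
            try:
                float(row["amount"])
            except (ValueError, TypeError):
                errors.append(f"line {i}: non-numeric amount {row['amount']!r}")
    return errors

if __name__ == "__main__":
    problems = validate(INPUT_FILE)
    for p in problems:
        print(p)
    sys.exit(1 if problems else 0)
```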
Confidential
Associate Engineer
Responsibilities:
- Involved in system design based on the Spring, Struts and Hibernate frameworks.
- Implemented the business logic in standalone Java classes using core Java.
- Developed database (SQL Server) applications.
- Worked with Spring's HibernateTemplate to access the SQL Server database.
- Created views and functions and developed stored procedures to implement application functionality on the database side for performance improvement.
- Designed, implemented and tested new features using T-SQL programming.
- Optimized existing data aggregation and reporting for better performance.
- Performed varied analyses to support organization and client improvement.
Environment: SQL Server 2012/2008 R2, Spring 3.0, Maven 3.0, HTML, JavaScript 5.0, Hibernate 3.0, JSF 2.1
Confidential
Jr. Developer
Responsibilities:
- Analyzed user requirements and developed specifications for various database applications.
- Studied design documents and understood the business needs and requirements of the project; participated in discussions and peer-review sessions to arrive at an optimal design plan.
- Involved in project planning and scheduling for the database module with project managers.
- Enhanced performance using optimization techniques such as normalization, indexing and transaction isolation levels.
- Experience creating jobs, alerts, SQL Mail Agent notifications and schedules for SSIS packages in SQL Server Agent.
Environment: MS SQL Server 2008 R2, SSIS 2008, T-SQL, Software Development Life Cycle (SDLC), SQL Server Management Studio 2008, Windows Server 2008