- Around 8 years of professional IT experience, spanning the complete project life cycle (design, development, analytics, testing, implementation, and visualization) in the Big Data ecosystem, including around 4 years of work in the ingestion, storage, querying, processing, and analysis of Big Data, with hands-on experience in the Hadoop ecosystem (YARN, HDFS) and its components: Hive, Pig, HBase, Sqoop, Hue, Kafka, Flume, Oozie, ZooKeeper, Spark, Spark SQL, and Spark Streaming.
- Worked hands-on with different Hadoop distributions, including Hortonworks, AWS Elastic MapReduce (EMR), and Cloudera.
- Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop/Spark jobs on AWS.
- Familiar with Spark's machine learning library (MLlib) and Microsoft Azure; used linear regression, logistic regression, k-means, SVM, and decision tree algorithms.
- Created a Hadoop (HDP) cluster on the Microsoft Azure cloud and worked with it.
- Experience working with on-premises and cloud-based Hadoop clusters.
- Working experience with Amazon Simple Storage Service (S3), Elastic Compute Cloud (EC2), and Amazon Elastic MapReduce (EMR).
- Implemented Spark using the PySpark API, utilizing DataFrames and the Spark SQL API for faster data processing.
- Hands-on experience improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Working experience building Spark applications with build tools such as SBT, Maven, and Gradle.
- Good experience with different file formats (text, SequenceFile, RCFile, ORC, Parquet, Avro, and JSON) and compression codecs (GZip, LZO, BZip2, and Snappy).
- Good knowledge of relational databases such as MySQL and Oracle, and of NoSQL databases such as HBase and MongoDB.
- Working knowledge of version control systems and tools such as Git and Subversion.
- Working knowledge of UNIX/Linux systems, including experience with shell scripting.
- Working experience handling semi-structured and unstructured data from different data sources.
- Working experience developing MapReduce programs using combiners, map-side joins, reduce-side joins, distributed cache, compression techniques, and multiple inputs and outputs.
- Working experience performing ad-hoc analysis on structured data using HiveQL, joins, and Hive UDFs; good exposure to counters, shuffle and sort parameters, dynamic partitions, and bucketing for performance improvement.
- Worked with IDEs such as Eclipse and IntelliJ IDEA.
- Working knowledge of Java and SQL in application development and deployment.
Big Data Associated: HDFS, MapReduce, Tez, Pig, Hive, Sqoop, Flume, HBase, Oozie, Apache Spark, Spark SQL, Spark MLlib, Spark Streaming, NiFi, Elasticsearch, Kafka.
Process/Data Modeling: MS Visio, UML Diagrams and ER Studio
Cluster Manager Tools: HDP Ambari, Cloudera Manager, Hue
ETL/ELT/Databases: Hive, Spark, HBase, MongoDB, Spark SQL, MS Access, Oracle, DB2, MySQL, SQL Developer, SQL Server 2000/2005/2008, and Toad
Languages: Python, Scala, Shell Scripting, Java, PL/SQL
Web-Technologies: HTML, DHTML, XML, CSS
Cloud Components: Amazon S3, EC2, Redshift, Amazon RDS
Operating Systems: Linux, Ubuntu, RHEL, Windows 2000/2003/2008/XP/7/8/10.
IDE: Eclipse and IntelliJ IDEA
Confidential, Philadelphia, PA
Big Data Engineer/Spark Developer
- Created external and managed Hive tables based on requirements.
- Worked with a lambda architecture to handle and process batch and real-time data.
- Using Sqoop, ingested data from relational data stores into HDFS.
- Using Kafka, collected real-time streaming data, log data from web applications, and clickstream data; analyzed part of the data with Spark Streaming and stored the rest in HDFS for future use.
- Wrote Hive queries to analyze data in the Hive warehouse using Hive Query Language (HiveQL), and worked with Hive tables, queries, partitioning, and bucketing.
- Used different PySpark APIs to perform the necessary transformations and actions on data arriving from Kafka in real time.
- Performed various parsing techniques using PySpark (Python + Spark) APIs to cleanse the data from Kafka.
- Performed data profiling, identified data quality issues, and validated rules for data integrity and data quality as they relate to business requirements.
- Created Kafka + Spark Streaming pipelines for the speed layer and Sqoop + Hive/Spark pipelines for the batch layer, submitting them to the Spark or Hadoop cluster respectively.
- Built Spark applications with SBT and submitted them to the cluster.
- Hands-on experience with the AWS infrastructure services Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2).
- Loaded consumer response data from an AWS S3 bucket into Hive external tables at an HDFS location to serve as the feed for Tableau dashboards.
- Created and scheduled batch processing pipelines using Apache NiFi.
- Used Spark SQL to process large volumes of structured data.
Environment: Apache Hadoop, Apache Sqoop, Apache NiFi, Apache Kafka, Java (JDK 1.8), Hive, PySpark, Spark, Spark SQL, Spark Streaming, Scala, Tableau.
Confidential, Minneapolis, MN
Big Data Engineer/Spark Developer
- Applied different machine learning algorithms, such as decision trees, regression models, neural networks, and SVMs, to identify fraudulent profiles using the scikit-learn package with PySpark. Used the K-Means clustering technique to detect outliers and to cluster unlabeled data.
- Ingested flat files from local Unix file systems into HDFS, and used Sqoop to ingest structured data from legacy RDBMS systems into HDFS.
- Developed code for importing and exporting data into HDFS and Hive using Sqoop.
- Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Used the DataFrame API in Scala to convert distributed collections of data into named columns, and developed predictive analytics using the Apache Spark Scala APIs.
- Worked with Apache Hadoop ecosystem components such as HDFS, Hive, and Sqoop, as well as with Spark, Scala, and Python.
- Coordinated with the data science team in creating PySpark jobs.
- Wrote Hive join queries to fetch information from multiple tables and collect the output from Hive; used Hive to analyze partitioned and bucketed data and compute various metrics for dashboard reporting.
- Used the Oozie workflow scheduler to run Hive jobs; extracted files through Sqoop, placed them in HDFS, and processed them.
Environment: Hadoop, Sqoop, Hive, Spark, HDFS, Scala, Pyspark, Spark SQL, JDBC, Kafka
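The K-Means outlier detection mentioned above might look like the following sketch. The data is synthetic and the distance-based threshold is one illustrative choice, not the method actually used on the fraud profiles:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic unlabeled "profile" features: two dense groups plus one extreme point
normal = rng.normal(0.0, 0.5, size=(50, 2))
shifted = rng.normal(5.0, 0.5, size=(50, 2))
outlier = np.array([[20.0, 20.0]])
X = np.vstack([normal, shifted, outlier])

# Cluster the unlabeled data; n_init restarts guard against poor local optima
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each point to its assigned centroid; large distances flag outliers
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
flagged = np.where(dist > dist.mean() + 3 * dist.std())[0]
```

The same pattern scales out under Spark by swapping in `pyspark.ml.clustering.KMeans` over a DataFrame of feature vectors.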
Confidential, Columbus, IN
- Analyzed data using different Big Data analytic tools, including Pig, Hive, and MapReduce jobs, on a Hadoop cluster created on Amazon Web Services (AWS).
- Created Pig Latin scripts to sort, group, join, and filter the enterprise data.
- Implemented partitioning, dynamic partitions, and buckets in Hive on Avro files to meet business requirements.
- Implemented data integrity and data quality checks using Linux scripts.
- Used Flume to tail application log files into HDFS.
- Scheduled Hive and Pig jobs using Oozie workflows.
- Performed performance tuning and memory optimization of MapReduce and Hive applications.
- Worked on end-to-end automation of the application.
- Responsible for continuous build and integration (CI/CD) with Jenkins and deployment using XL Deploy.
- Actively involved in code reviews, bug fixes, and enhancements.
Environment: Hadoop, HDFS, Apache Sqoop, MySQL, Apache Hive, Pig, MapReduce, Core Java, Shell Scripting, Eclipse, Git, Jenkins.
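The Linux-based data integrity and quality checks could be sketched like this; the feed file name, delimiter, and column layout are invented for the example:

```shell
#!/bin/sh
# Minimal data-quality gate for a delimited feed file; all names are illustrative.
cat > /tmp/feed.csv <<'EOF'
id,name,amount
1,alice,10
2,,20
3,carol,30
1,alice,10
EOF

# Skip the header, then count records, empty name fields, and exact duplicate rows
total=$(tail -n +2 /tmp/feed.csv | wc -l)
nulls=$(tail -n +2 /tmp/feed.csv | awk -F, '$2 == "" {n++} END {print n+0}')
dups=$(tail -n +2 /tmp/feed.csv | sort | uniq -d | wc -l)

echo "records=$total nulls=$nulls dups=$dups"
```

A real gate would compare these counts against thresholds and exit non-zero to stop the downstream Hive load when a check fails.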
- Created custom PL/SQL procedures to read data from flat files and load it into an Oracle database using SQL*Loader.
- Developed PL/SQL procedures and database triggers to validate input data and implement business rules.
- Created records, tables, and collections, improving performance by reducing context switching.
- Created database objects such as packages, procedures, and functions according to client requirements.
- Used SSIS to create ETL packages to validate, extract, transform, and load data into data warehouse and data mart databases feeding OLAP databases.
- Created PL/SQL packages, procedures, and functions applying the business logic to load data into the relevant database tables, and converted data from different source systems into Oracle format using T-SQL.
- Created and modified stored procedures, functions, packages, and triggers using TOAD.
- Responsible for tuning ETL mappings to optimize load and query performance.
- Developed Oracle Forms for end users using Oracle Forms Builder 10g.
- Extensively used advanced PL/SQL features such as records, tables, object types, and dynamic SQL.
- Tuned ETL procedures and star schemas to optimize load and query performance.
Environment: Oracle 10g, T-SQL, SQL*Plus, SQL*Loader, PL/SQL Developer, Web Services, SSIS, SSRS, TOAD.
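A sketch of the kind of PL/SQL input-validation procedure described above; the procedure, parameter names, and the rule itself are hypothetical, not taken from the engagement:

```sql
-- Hypothetical validation routine: rejects null or negative amounts
CREATE OR REPLACE PROCEDURE validate_stage_row (
    p_amount IN  NUMBER,
    p_status OUT VARCHAR2
) IS
BEGIN
    IF p_amount IS NULL OR p_amount < 0 THEN
        p_status := 'REJECTED';
    ELSE
        p_status := 'ACCEPTED';
    END IF;
END validate_stage_row;
/
```

A database trigger or a post-SQL*Loader step would call such a procedure before rows move from staging to target tables.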
- Involved in various phases of the Software Development Life Cycle (SDLC), such as design, development, and unit testing.
- Followed the Agile Scrum methodology for the development process.
- Developed JSPs for client data presentation and client-side data validation within forms.
- Developed the application using the Spring MVC framework.
- Used Spring IoC to inject values for dynamic parameters.
- Developed a JUnit testing framework for unit-level testing.
- Actively involved in code review and bug fixing for improving the performance.
- Documented application for its functionality and its enhanced features.
- Created connections through JDBC and used JDBC statements to call stored procedures.
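As an illustration of the form-field validation and JUnit-style checks mentioned above, here is a minimal sketch; the class and method names are invented, and plain `main` checks stand in for the JUnit cases to keep the snippet self-contained:

```java
// Hypothetical validator mirroring the kind of form-field checks described above.
public class FormValidator {

    // True when the submitted value is a non-empty integer field.
    public static boolean isNumericField(String value) {
        if (value == null || value.trim().isEmpty()) {
            return false;
        }
        try {
            Long.parseLong(value.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-ins for JUnit test cases
        if (!isNumericField("12345")) throw new AssertionError("valid input rejected");
        if (isNumericField("12a45")) throw new AssertionError("invalid input accepted");
        if (isNumericField("   ")) throw new AssertionError("blank input accepted");
        System.out.println("all checks passed");
    }
}
```

In the actual application such checks would sit behind a Spring MVC validator or controller, with JUnit exercising each branch.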