- 5 years of programming experience across all phases of the Software Development Life Cycle (SDLC).
- Expertise in developing Big Data applications, with experience in Hadoop ecosystem components such as Spark, Hive, Sqoop, Pig and Oozie.
- Hands-on experience developing and debugging Spark jobs to process large datasets.
- Excellent knowledge and understanding of Distributed Computing and Parallel processing frameworks.
- Experience working with Cloudera and Hortonworks Hadoop distributions.
- Worked on importing and exporting data to and from HDFS and Hive using Sqoop.
- Experience in creating Hive tables, loading them using Sqoop, and processing the data using HiveQL.
- Extensive experience in developing Pig Latin scripts and using Hive Query Language for data analytics.
- Extended Hive and Pig core functionality by writing custom UDFs for data analysis.
- Good experience with job scheduling tools such as Oozie.
- Experience running Hive queries through Spark SQL, integrated with a Spark environment implemented in Scala.
- Hands-on experience with file formats such as JSON, Avro and Parquet.
- Experience converting SQL queries into Spark transformations using Spark RDDs, DataFrames and Scala, and performing map-side joins on RDDs.
- Experience in Hadoop administration activities such as installation and configuration of clusters using Apache and Cloudera.
- Adequate knowledge of Agile and Waterfall methodologies.
- Good experience working with Tableau, including enabling JDBC/ODBC connectivity from it to Hive tables.
- Well versed in the UNIX and Linux command line and shell scripting.
- Extensive experience developing stored procedures, functions, triggers and complex SQL queries using Oracle PL/SQL.
- Strong written and oral communication skills. Quick to learn and adapt to emerging technologies and paradigms.
- Highly motivated, with the ability to work independently or as an integral part of a team, and committed to the highest levels of professionalism.
Big Data Technologies: Hadoop, MapReduce 2.0, Pig, Hive, Sqoop, Oozie, Spark, Kafka.
Databases: Oracle 11g/10g.
Cloud Platforms/Version Control: AWS/Git.
Programming/Scripting Languages: Scala, Python, Unix shell scripting.
Operating System: Mac OS, Linux (Various Versions), Windows 2003/7/8/8.1/XP.
Development Tools: PyCharm, Eclipse, IntelliJ.
Confidential, Atlanta, GA
- Designed and developed applications on the data lake to transform data according to business users' requirements so they could perform analytics.
- Developed shell scripts to perform data quality validations such as record count, file name consistency and duplicate file checks, and to create Hive tables and views.
- Created views that mask the PHI columns of tables so that unauthorized teams cannot see the PHI data.
- Used the Parquet file format to get better storage efficiency and performance for publish tables.
- Worked with NoSQL databases such as HBase, creating HBase tables to store the audit data of the RAWZ and APPZ tables.
- Responsible for managing data coming from different sources; involved in HDFS maintenance and loading of structured and unstructured data.
- Developed shell scripts for performing transformation logic and loading the data from raw zone to app zone.
- Responsible for developing Spark wrapper scripts in Python to perform transformations on the data.
- Responsible for creating the mapping document from source fields to destination fields.
- Created different data pipelines using StreamSets to land data from source to the raw zone.
- Worked with different file types such as CSV, TXT and fixed-width to load data from source into RAWZ tables.
- Experienced in using Kafka as a data pipeline for JSON data between source and destination.
- Responsible for creating jobs using Control-M.
- Responsible for production activities and production support.
- Responsible for resolving the production issues.
- Worked in an Agile Scrum model and was involved in sprint activities.
- Worked with Bitbucket and Jira to deploy projects into production environments.
Environment: Apache Hive, HBase, Spark, Python, Agile, StreamSets, Bitbucket, Cloudera, Kafka, Hadoop, Shell Scripting.
Confidential, Madison, WI
- Applied several Spark APIs to perform necessary transformations and actions on data that came from mainframe files.
- Created and worked on large data frames with a schema of more than 300 columns.
- Ingested data into Amazon S3 using Sqoop and applied data transformations using Python.
- Developed UDFs as needed for use in Pig and Hive queries.
- Created Hive tables, and loaded and analyzed data using Hive scripts. Implemented partitioning and dynamic partitions in Hive.
- Deployed and analyzed large volumes of data using Hive as well as HBase.
- Queried data using Spark SQL on top of the Spark engine.
- Used Amazon EMR to run PySpark jobs in the cloud.
- Created HBase tables as a centralized PIT table that stores all the information from the remaining tables and is used to incrementally load data into the Hive tables.
- Created Hive tables to store various data formats of PII data coming from the raw Hive tables.
- Developed Sqoop jobs to import/export data between RDBMS and the S3 data store.
- Fine-tuned PySpark applications/jobs to improve the efficiency and overall processing time of the pipelines.
- Knowledge of writing Hive queries and running scripts in Tez mode to improve performance on Hortonworks Data Platform.
- Worked on a 10-node cluster in AWS for the Dev and QA environments.
- Used Bitbucket for version control.
Environment: Amazon EMR, Amazon S3, Apache Hive, Sqoop, Spark, Python, Agile, PyCharm, Bitbucket, Hortonworks.
Confidential, Bothell, WA
- Used different Scala APIs to perform necessary transformations and actions on data that came in batches from different sources.
- Applied various parsing techniques using Spark APIs to cleanse the data from Kafka.
- Experienced in working with Spark SQL on different file formats like Avro and Parquet.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Experienced in performance tuning of Spark applications, including setting the right batch interval, the correct level of parallelism, and memory tuning.
- Set up Hive to run on Spark and analyzed the data using Spark SQL queries.
- Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
- Implemented incremental imports of analyzed data into MySQL tables using Sqoop.
- Experience in importing and exporting data between HDFS and relational database systems using Sqoop.
- Moved relational database data into Hive dynamic-partition tables using Sqoop and staging tables.
- Implemented workflows using the Apache Oozie framework to automate tasks.
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports.
Environment: Hadoop, HDFS, Apache Hive, Sqoop, Apache Spark, Scala, Shell Scripting, Agile, Oracle, Cloudera.
- Gathered requirements and interacted with the client/onsite team to ensure a clear understanding of them.
- Participated in defining and implementing project-level standards and guidelines, and ensured adherence to enterprise-level policies.
- Extracted data from various sources across the organization (Oracle, SQL Server and flat files) and loaded it into the staging area.
- Used techniques like source query tuning, single pass reading and caching lookups to achieve optimized performance in the existing sessions.
- Developed test cases and tested the reports.
- Created and scheduled sessions and batch processes to run on demand, on time, or only once using Informatica Workflow Manager, and monitored the data loads using the Workflow Monitor.
- Developed various daily and monthly ETL load jobs using Control-M and modified existing Control-M jobs based on business requirements.
- Worked with the testing team to define a robust test plan and supported them during functional testing of the application.
- Contributed to performance tuning and volume testing of the application.
- Performed impact analysis for change requests.
- Reviewed and deployed the code.
- Involved in fixing the UAT defects raised by the testing team within the timelines.
- Tracked and reported the status of the project at frequent intervals.
Environment: Informatica PowerCenter 9.x, Oracle 10g, SQL, UNIX, Control-M, Waterfall methodology.