- Over 7 years of experience in the IT industry, including SQL database development, query performance tuning, ETL, and data analysis in Python and SAS.
- 4 years of hands-on experience with the big data ecosystem, including Hadoop, Spark, Scala, MapReduce, Pig, Hive, Impala, and Sqoop.
- Hands-on experience importing data from relational databases such as Oracle and MySQL into HDFS, and exporting it back, using Sqoop.
- Good knowledge of relational databases such as Microsoft SQL Server and MySQL.
- Hands-on experience in application development using Scala, Spark, and RDBMS.
- Good working knowledge of Apache Spark for fast, large-scale, in-memory MapReduce-style processing.
- Expertise in developing scripts using Hive Query Language (HiveQL). Used Spark as an ETL tool for transformations, event joins, filtering, and aggregation.
- Capable of processing large sets of structured, semi-structured, and unstructured data.
- Extensive knowledge of the Linux OS, utilities, and shell scripting.
- Strong analytical skills, with experience in problem solving and root cause analysis.
- Strong exposure to the software development life cycle (SDLC) and Agile development methodologies.
- Data ingestion using Sqoop.
Big Data: Hadoop, Spark, HBase, Hive, Sqoop, Flume, Kafka, MapReduce.
Programming Languages: Python, PL/SQL, Scala, Unix shell scripting, NoSQL.
SQL Server Tools: SQL Server Management Studio, SSIS, SSRS, SSAS, SQL Server Optimization Wizard, MongoDB (NoSQL).
Databases: MS SQL Server, MySQL, MongoDB, Oracle.
Languages: HTML, T-SQL, MongoDB (NoSQL).
Operating Systems: Linux, Unix, Windows, Mac
Tools: Maven, JIRA, Slack, Microsoft Office.
Big Data Developer
Confidential, Dallas, TX
- Imported data from RDBMS into HDFS using Sqoop with options such as append, delete target directory, where conditions, column selection, an increased number of mappers, split-by for tables without a primary key, free-form queries, and incremental imports (last value, last modified).
- Worked with NoSQL databases such as MongoDB, creating collections to load large sets of semi-structured data from various sources.
- Processed data through Spark DataFrames using SQLContext: registered temp tables and applied SQL logic to develop ETL operations such as select, aggregation, group by, order by, and joins across temp tables.
- Created Sqoop jobs with incremental loads to populate Hive external tables.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and RDDs.
- Worked on a POC comparing the processing time of Impala with that of Apache Hive for batch applications, with the aim of adopting Impala in the project.
- Created Hive tables, and loaded and analyzed data using Hive queries.
- Developed Hive (version 0.12) scripts for end-user and analyst requirements to perform ad hoc analysis.
- Created table partitions, either manual (static) or dynamic, and bucketing for better query performance.
- Imported data from MySQL into HDFS on a regular basis using Sqoop.
- Read data in Spark from sources such as Hive, RDBMS, Cassandra, and files in formats such as ORC, Parquet, and JSON using SQLContext.
- Saved processed files in Spark using options such as file format (Avro, Parquet, text files), save mode (overwrite, append), and save target (file, Hive table).
- Used Hive options to write Hive ETL, such as overwriting tables, changing data types, adding columns, changing SerDe properties, and applying the substring function.
- Worked on migrating SQL ETL programs to Spark transformations using Spark and Scala.
- Loaded and transformed large sets of JSON files into HDFS for further processing using Spark.
- Managed and reviewed Hadoop log files. Tested raw data and executed performance scripts.
- Created DataFrames from CSV files in Spark 2.0, using case classes and StructType to define the schema metadata.
- Loaded files into Spark RDDs and DataFrames and applied transformations such as splitting records, reduceByKey, filter, and sortByKey.
- Participated in a POC for real-time data streaming using Spark Streaming, Kafka, and Flume.
- Developed Scala programs built with Maven for deployment.
- Used Avro and Parquet file formats for data serialization.
- Developed Hive UDFs to bring all customer email IDs into a structured format.
- Created stored procedures, functions, temp tables, derived tables, cursors, and views for use in different tasks in the SQL database.
- Participated in data modeling of Hive tables to ensure optimal outcomes.
- Exported data from HDFS into RDBMS using options such as a staging table to prevent inconsistency in production data.
- Performed database optimization using indexes, query plans, and query hints.
- Used Impala to run queries for performance optimization.
- Imported data from RDBMS into Hive using Sqoop with options such as Hive table, Hive overwrite, Hive partition column, and direct mode.
- Performed backup, restore, and configuration of databases in UAT and production environments.
Environment: Cloudera CDH5.1, Hadoop, Sqoop, MapReduce, Hive, Oozie, Tableau, PuTTY, Eclipse IDE, HP QC, Java, Oracle, MySQL, Kafka, Impala.
Big Data/Database Developer
Confidential, Dallas, TX
- Created stored procedures to transform data and worked extensively in T-SQL to meet transformation needs during data loads.
- Implemented proofs of concept on the Hadoop stack and various big data analytics tools, including migration from different databases (e.g., MySQL, SQL Server) to Hadoop.
- Successfully loaded files into Hive and HDFS from MongoDB, Cassandra, and HBase.
- Loaded the dataset into Hive for ETL operations.
- Wrote SQL queries, complex stored procedures, triggers, views, constraints, joins, DDL, DML, and user-defined functions (UDFs) to implement business logic.
- Designed and created SQL Server tables.
- Captured data from existing databases that provide SQL interfaces using Sqoop.
- Assisted in transferring databases, including tempdb, to new servers using T-SQL scripts.
- Analyzed end-user database needs and provided efficient solutions.
- Managed indexes and statistics, and optimized queries using execution plans to tune the database.
- Responsible for performance tuning and optimization of stored procedures using SQL Profiler and the Database Tuning Wizard.
Environment: Production, SQL Server 2012, SSIS, Cloudera, Hadoop, Sqoop, Hive, Oozie, PuTTY, NoSQL databases.
- Created and implemented triggers in T-SQL to facilitate consistent data entry into the database.
- Created parameterized and drill-through reports in SSRS.
- Created parameterized stored procedures, SSIS packages, triggers, cursors, tables, views, and SQL joins for building various applications.
- Created stored procedures and triggers for processing large volumes of data.
- Used various transformations in SSIS to load data from flat files and FTP to the SQL databases.
- Designed and developed data load jobs using SSIS packages and scheduled them in SQL Server Agent.
- Created SSIS packages (.dtsx files) to validate, extract, transform, and load data into data warehouse and data mart databases.
- Configured SSIS data flows and individual data flow elements, and monitored package performance.
- Wrote stored procedures and user-defined scalar functions (UDFs) for use in SQL scripts.
- Extensively used joins and subqueries to simplify complex queries involving multiple tables.
- Created and modified databases and tables, performed data manipulation, and generated reports.
- Used T-SQL language and system functions to query the database.
- Developed SSRS reports according to the Business Requirement Document (BRD).
Environment: Python, Microsoft SQL Server 2008 and 2012, SSRS, SSIS.