- Experienced in Python, Apache Spark and Hadoop ecosystem products (HDFS, MapReduce, Hive, Pig, Tez, Sqoop, Flume, Kafka, Zookeeper, Oozie).
- Experience working on AWS Cloud Computing Infrastructure products (EC2, RDS, S3, EMR, Lambda, Redshift, Athena).
- Experienced with Linux and shell scripting.
- Experience in designing data lake architecture, dimensional modelling and machine learning algorithms.
- Proficient in Workload Schedulers such as Airflow, TWS, and Oozie.
- Experienced working on NoSQL databases such as HBase.
- Used NiFi to automate data movement between different Hadoop systems.
- Experience loading log data collected from web services into HDFS using Flume.
- Extensive knowledge of real-time data streaming technologies such as Kafka and Spark Streaming.
- Imported data from Kafka consumers into HBase using Spark Streaming.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Experienced in loading data into Spark and performing in-memory computation to generate the output response.
- Worked with Spark SQL to load tables into HDFS and run SELECT queries on top of them.
- Experienced with Agile projects involving Scrum, and with Jira for project, issue and incident tracking.
- Experience with source control technologies such as Git.
- Proficient in working with Talend and Tableau.
Big Data Eco System: Hadoop, Spark, HDFS, Scala, Kafka, MapReduce, Hive, Pig, Sqoop, Oozie, NiFi, Flume, HBase, Zookeeper, Hue, Cloudera, Hortonworks Sandbox, Databricks
Programming Skills: Shell Scripting, Scala, Python, Java, and SQL
Operating System: Linux, Unix, Ubuntu
Design: Data lake and data platform design, E-R modelling, Dimensional modelling, UML
Tools: PyCharm/IntelliJ, Git, Amazon AWS, Jira, MobaXterm/PuTTY, Postman, Talend for Big Data, Tableau, Eclipse, GitHub, Jenkins, Kubernetes
Database: HBase, MongoDB, Redshift, DynamoDB, RDS, MySQL, PostgreSQL, Oracle 11g/10g, Teradata
Web Technologies: HTML, CSS, XML, ReactJS, jQuery, REST Framework - Django, Flask
- Working as a Cloud Big Data Engineer, building a data platform and shared services for a precision medicine solution.
- Designing and implementing cloud-scale distributed ETL/ELT data pipelines.
- Designing systems, services and frameworks to address high-volume, complex data collection, processing, transformation and reporting for analytical purposes.
- Working with Docker, Kubernetes, CircleCI and Jenkins.
- Working on Airflow scripts for scheduling jobs on Kubernetes.
- Writing code to consume from Kinesis streams, setting up Kinesis streams to deliver data to S3, and using Athena to query S3.
- Writing Spark jobs on EMR to load data from S3 into Redshift, and writing scripts to find similar patients.
- Setting up a rules engine for data de-identification.
- Migrating Redshift data and implementing insert/update/delete logic on Redshift.
- Handling knowledge base web ontology data and RDF.
- Used New Relic to create application metric dashboards and to set up application performance monitoring with alerts to a Slack channel.
- Understanding of semantic versioning (semver) for software releases.
- Limited hands-on experience with packagecloud and bumpversion.
- Working as a Big Data Developer.
- Developed job processing scripts using Oozie workflow.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which reads data from Kafka in near real time and persists it into Cassandra.
- Developed Spark scripts using Scala shell commands as per requirements.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark for data aggregation and queries, and for writing data back into the OLTP system through Sqoop.
- Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and memory tuning.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
- Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze the logs produced by the Spark cluster.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark.
- Designed, developed and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems, as well as RDBMS and NoSQL data stores for data access and analysis.
- Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, with a view to implementing Impala in the project.
- Worked on a 120-node cluster.
- Worked extensively with Sqoop for importing metadata from Oracle.
- Analyzed the SQL scripts and designed the solution for implementation in PySpark.
- Responsible for developing a data pipeline on AWS to extract data from weblogs and store it in HDFS.
- Involved in creating Hive tables, and in loading and analyzing data using Hive queries.
- Tuned Hive and Spark programs on Hadoop to optimize latency and throughput.
- Developed Hive queries to process the data and generate data cubes for visualization.
- Implemented schema extraction for Parquet and Avro file formats in Hive.
- Used reporting tools like Tableau to connect to Hive for generating daily data reports.
- Collaborated with the infrastructure, database, application and BI teams to ensure data quality and availability.
- Attending daily scrums and collaborating with the team.
- Architected the ELT data pipelines for continuous processing of large data files. Created a wrapper script that invoked a series of scripts to load data into HDFS iteratively for every source file found at the inbound (edge node) location.
- Created shell scripts to create internal and external tables by reading column names and data types from a metadata file.
- Extracted data with Sqoop to a CSV file and transferred it via FTP to the target server so that downstream applications could consume it.
- Created indexes and tuned SQL queries in Hive, and worked on database connections using Sqoop.
- Identified performance issues and bottlenecks in Talend jobs and HiveQL queries.
- Extensively used HiveQL to query data in Hive tables and load data into other Hive tables in the data pipeline. Used the ORC file format to improve the performance of Hive queries. Implemented dynamic partitioning and bucketing in Hive for performance improvement.
- Extended Hive core functionality with custom user-defined functions (UDFs), user-defined table-generating functions (UDTFs) and user-defined aggregate functions (UDAFs) using Python.
- Responsible for analyzing and cleansing raw data by running Hive/Impala queries and PySpark scripts on the data. Also used PySpark to analyze large data sets and store the results back to HBase.
- Worked on the release and go-live of the project; worked production issues and fixed code defects by analyzing Talend job logs.
- Working on re-deployment of code fixes to Talend and of scripts to HDFS.
- Automated stats collection for the batch load of the EDL pipeline using shell scripts and Hive.
- Working on manual data reprocessing for data catch-up, missing files, and data issues.
- Working on Talend administration, including project creation and job scheduling.
- Worked extensively with Talend Open Studio for Big Data to design ELT jobs for data processing, using Talend Big Data components such as tHDFSExist, tHiveCreateTable, tHiveRow, tHDFSInput, tHDFSOutput, tHiveLoad, tHDFSCopy, tHDFSDelete, tHDFSList, tHDFSConnection, tHiveConnection, tHiveClose, tHDFSGet, tHDFSPut, tHDFSProperties, tHDFSCompare, tHDFSRename, tHBaseInput, tHBaseOutput, tSqoopExport, tSqoopImport, tSqoopImportAllTables and tSqoopMerge. Also used general Talend components such as tFilterRow, tMap, tJoin, tPreJob, tPostJob, tFileList, tSplitRow, tAddCRCRow, tJava, tAggregateRow, tDie, tWarn and tLogRow.
- Performing admin activities such as JobServer configuration, Command Line setup, project creation, user access assignment and job scheduling.
- Responsible for cluster maintenance: adding and removing cluster nodes, cluster monitoring and troubleshooting, and managing and reviewing data backups and Hadoop log files.