Big Data Developer Resume

New York

SUMMARY

  • 8+ years of experience in application development, enhancement, and implementation. Looking forward to engagements that help organizations derive useful insight from data for profitability, from data sourcing through transformation to analytics on the transformed data.
  • Hands-on experience with NoSQL and big data stores such as HBase and Hive.
  • Hands-on experience streaming data using Kafka and the Spark Streaming API in Scala (see the streaming sketch after this list).
  • Hands-on experience processing Avro, JSON, and XML files.
  • Hands-on experience with Kafka producers, consumers, the Schema Registry, and offset management.
  • Expert-level skills in big data tools such as Pig, HDFS, Hive, HBase, Impala, MapReduce, Sqoop, Oozie, Spark Core, and Spark SQL.
  • Understanding of PL/SQL, including building triggers, tables, collections, functions, and procedures.
  • Hands-on experience working with NoSQL databases including HBase and MongoDB.
  • Hands-on experience writing ad-hoc queries for moving data from HDFS to Hive and analyzing the data using HiveQL.
  • Hands-on experience handling database issues and connections with SQL and NoSQL databases such as MongoDB, Cassandra, Redis, CouchDB, and DynamoDB by installing and configuring the corresponding Python packages.
  • Good understanding of cloud-based technologies such as AWS.
  • Good knowledge of Bitbucket and GitHub Enterprise.
  • Knowledge of Docker for creating containers from Dockerfiles and orchestrating them with Docker Compose and Kubernetes.
  • Hands-on experience with version control using Bitbucket.
  • Hands-on experience deploying applications through CI/CD pipelines.
  • Hands-on experience with AWS and Azure technologies.
  • Hands-on experience developing Teradata stored procedures and functions and tuning SQL on large databases.
  • ETL from databases such as SQL Server and Oracle 11g to Hadoop HDFS in a data lake.
  • Expertise in Amazon Web Services, including Elastic Compute Cloud (EC2) and DynamoDB.
  • Hands-on experience in Python and Scala.
  • Knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions and data warehouse tools for reporting and data analysis.
  • UNIX scripting and job scheduling using crontab.
  • Good understanding of machine learning algorithms.
  • Hands-on experience implementing projects using Agile methodology.
  • Hands-on experience with Azure Databricks, Data Factory, Blob Storage, Data Lake Storage, Cosmos DB, Synapse, Azure SQL Server, Event Hubs, Stream Analytics, and Data Flow.
  • Good understanding of AWS models.
  • Worked on Azure AD B2C to provide authentication and authorization management for users.
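
As an illustration of the Kafka and Spark streaming experience above, the sketch below shows a minimal PySpark Structured Streaming job that consumes JSON events from a Kafka topic and lands them as Parquet. The broker address, topic name, event schema, and paths are illustrative assumptions, not details from this resume.

```python
# Minimal sketch: consume JSON events from Kafka with Spark Structured Streaming
# and land them as Parquet. Requires the spark-sql-kafka connector on the classpath.
# Broker, topic, schema, and paths are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-json-ingest").getOrCreate()

# Hypothetical event schema; real Avro/JSON/XML payloads would differ.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
       .option("subscribe", "events")                       # assumed topic
       .option("startingOffsets", "latest")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
             .select(from_json(col("json"), schema).alias("e"))
             .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "/data/landing/events")             # assumed output path
         .option("checkpointLocation", "/data/checkpoints/events")
         .start())
query.awaitTermination()
```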

TECHNICAL SKILLS

Big data Technologies: HDFS, Map Reduce, Pig, Hive, Sqoop, Oozie, Scala, Spark, Kafka, Airflow, Flume, Ambari, Hue

Hadoop Frameworks: Cloudera CDH, Hortonworks HDP, MapR

Database: Oracle 10g/11g, PL/SQL, MySQL, MS SQL Server 2012, DB2

Language: C, C++, Scala, Python

Cloud & DevOps (CI/CD): Atlassian Bamboo, GitHub Actions

AWS Components: IAM, S3, EMR, EC2, Lambda, Redshift

Methodologies: Agile, Waterfall

Build Tools: Maven, Gradle, Jenkins

NoSQL Databases: HBase, Cassandra, MongoDB, DynamoDB

IDE Tools: Eclipse, Net Beans, IntelliJ

Modelling Tools: Rational Rose, StarUML, Visual Paradigm for UML

BI Tools: Tableau

Operating System: Windows 7/8/10, Vista, UNIX, Linux, Ubuntu, Mac OS X

PROFESSIONAL EXPERIENCE

Big Data Developer

Confidential, New York

Responsibilities:

  • Created a framework for data ingestion from various sources into Hadoop using Spark and Python.
  • Created an ETL framework to hydrate the data lake using PySpark.
  • Created UNIX shell scripts to launch the Spark jobs.
  • Used the Kafka consumer API in Scala to consume data from Kafka topics.
  • Analyzed existing SQL scripts and designed the solution to implement them using PySpark.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying.
  • Created a framework in Python to perform data cleanup.
  • Deployed Spark jobs on Amazon EMR and ran them on AWS clusters.
  • Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines (a sample DAG sketch follows this list).
  • Developed Python code to gather data from HBase (Cornerstone) and designed the solution to implement it using PySpark.
  • Implemented a Kafka-based model that pulls the latest records into Hive external tables.
  • Loaded all datasets into Hive and Cassandra from source CSV files using Spark/PySpark (a loading sketch follows this list).
  • Implemented a continuous delivery pipeline with Docker, GitHub, and AWS.
  • Strong experience implementing data warehouse solutions on Amazon Web Services (AWS) Redshift; worked on various projects to migrate data from on-premises databases to AWS Redshift, RDS, and S3.
  • Leveraged cloud and GPU computing technologies, such as AWS, for automated machine learning and analytics pipelines.
  • Created an ETL framework using Spark on AWS EMR in Scala/Python.
  • Exported the analyzed data to Teradata using Sqoop for visualization and to generate reports for the BI team.
  • Created an ingestion framework using Kafka, EMR, Aurora, and Cassandra in Python/Scala.
  • Migrated the computational code in HQL to PySpark.
  • Imported data into HDFS from various SQL databases and files using Sqoop, and from streaming systems using Storm, into the big data lake.
  • Completed data extraction, aggregation, and analysis in HDFS using PySpark and stored the required data in Hive.
  • Used Apache Kafka to develop a data pipeline that carries logs as a stream of messages using producers and consumers.
  • Sound knowledge of programming Spark using Scala.
  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Populated HDFS and HBase with huge amounts of data using Apache Kafka.
  • Experienced in working with various data sources such as Teradata and Oracle; successfully loaded files from Teradata to HDFS and from HDFS into Hive and Impala.
  • Installed the Oozie workflow engine to run multiple Hive and Pig jobs that run independently based on time and data availability.
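
As referenced in the list above, the CSV-to-Hive loading follows a common PySpark pattern; a minimal sketch is shown below. The source path, database, and table names are illustrative assumptions, not details from this role.

```python
# Minimal sketch: load source CSV files into a Hive table with PySpark.
# Paths, database, and table names are assumptions for illustration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv-to-hive")
         .enableHiveSupport()
         .getOrCreate())

df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("/data/source/customers/*.csv"))     # assumed source location

# Light cleanup before persisting (drop exact duplicate rows).
df = df.dropDuplicates()

(df.write
   .mode("overwrite")
   .saveAsTable("datalake.customers"))          # assumed Hive database.table
```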
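
The Airflow orchestration bullet above is likewise only named at a high level; the sketch below shows a simple DAG that submits one of the Spark jobs nightly via spark-submit. The DAG id, schedule, and job path are assumptions for illustration.

```python
# Minimal sketch: an Airflow DAG that runs a nightly Spark ingestion job.
# DAG id, schedule, and spark-submit command are illustrative assumptions.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_ingestion",          # assumed DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 2 * * *",       # assumed nightly schedule
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="spark_ingest",
        bash_command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "/opt/jobs/ingest_to_hive.py"   # assumed job path
        ),
    )
```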

Environment: SQL Server, Teradata, Hive, Looker, Cassandra, EMR, Presto, Aurora, Oracle, Putty, WinSCP, SFTP, Hadoop (Cloudera) cluster, MapR cluster, Jupyter Notebook, PyCharm, IntelliJ, Bitbucket, Azure Databricks, Data Factory, Cosmos DB, Unravel, Avro, JSON, XML, Unix, Python, Scala, Ansible, Spark, HDFS, Hive, HBase, Sqoop, Kafka, Spark Streaming

Data Developer

Confidential, Englewood, CO

Responsibilities:

  • Created a framework for data ingestion from various sources into Hadoop using Spark and Python.
  • Created a test automation framework in Python.
  • Created control structures.
  • Experienced in writing live real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
  • Worked on Amazon AWS concepts such as EMR and EC2 web services for fast and efficient processing of big data.
  • Developed PySpark programs, created DataFrames, and worked on transformations (a transformation sketch follows this list).
  • Involved in loading data from Linux file systems, servers, and Java web services using Kafka producers and partitions.
  • Applied custom Kafka encoders for custom input formats to load data into Kafka partitions.
  • Implemented a POC with Hadoop, extracting data into HDFS with Spark.
  • Used Spark SQL with Scala to create DataFrames and performed transformations on them.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed code to read the data stream from Kafka and send it to the respective bolts through the respective streams.
  • Created Databricks notebooks using SQL and Python and automated the notebooks using jobs.
  • Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
  • Developed Spark applications using Scala for easy Hadoop transitions.
  • Implemented applications with Scala along with the Akka and Play frameworks.
  • Optimized the code using PySpark for better performance.
  • Worked on Spark Streaming using Apache Kafka for real-time data processing.
  • Experienced in optimizing Hive queries and joins to handle different data sets.
  • Involved in ETL, data integration, and migration by writing Pig scripts.
  • Developed MapReduce programs to cleanse the data in HDFS obtained from heterogeneous data sources.
  • Ran data formatting scripts in Python and created terabyte-scale CSV files to be consumed by Hadoop MapReduce jobs.
  • Processed metadata files into AWS S3 and an Elasticsearch cluster.
  • Worked hands-on with NoSQL databases like MongoDB for a POC on storing images and URIs.
  • Designed and implemented MongoDB and an associated RESTful web service.
  • Involved in writing test cases and implementing test classes using MRUnit and mocking frameworks.
  • Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
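
The DataFrame transformation work above is described only at a high level; below is a minimal PySpark sketch of the kind of filter/derive/aggregate steps involved. Table and column names are illustrative assumptions.

```python
# Minimal sketch: typical PySpark DataFrame transformations
# (filter, derived column, aggregation). Table/column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transformations").getOrCreate()

orders = spark.table("staging.orders")        # assumed source table

daily_revenue = (orders
    .filter(F.col("status") == "COMPLETE")                     # keep completed orders
    .withColumn("order_date", F.to_date("order_ts"))           # derive a date column
    .groupBy("order_date", "region")
    .agg(F.sum("amount").alias("revenue"),
         F.countDistinct("customer_id").alias("customers")))

daily_revenue.write.mode("overwrite").saveAsTable("marts.daily_revenue")  # assumed target
```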

Environment: SQL Server, Teradata, Hive, HBase, Putty, WinSCP, SFTP, Hadoop (Cloudera) cluster, MapR cluster, Jupyter Notebook, PyCharm, IntelliJ, Bitbucket, Azure Databricks, Data Factory, Cosmos DB, Unravel, Avro, JSON, XML, Unix, Python, Scala, Ansible, Spark, HDFS, Hive, HBase, Sqoop, Kafka, Spark Streaming, Spark SQL

Data Engineer

Confidential, Schaumburg, IL

Responsibilities:

  • Created a framework for data ingestion from various sources into Hadoop using Python and Spark.
  • Created a test automation framework using Scala.
  • Created a utility method to flatten JSON events to a granular level using Scala and Spark (see the flattening sketch after this list).
  • Performed end-to-end delivery of PySpark ETL pipelines on Azure Databricks to transform data orchestrated via Azure Data Factory (ADF), scheduled through Azure Automation accounts and triggered using Tidal Scheduler.
  • Data modeled HBase tables to load large sets of structured, semi-structured, and unstructured data coming from UNIX, NoSQL, and a variety of other data sources.
  • Solved performance issues in Hive scripts with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
  • Developed UDFs in Java as and when necessary for use in Hive queries.
  • Coordinated with various stakeholders such as the end client, DBA teams, the testing team, and business analysts.
  • Involved in gathering requirements and developing a project plan.
  • Involved in understanding requirements, functional specifications, design documentation, and testing strategies.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
  • Enhanced the data ingestion framework by creating more robust and secure data pipelines.
  • Created Azure Functions and configured them to receive events from the Synapse warehouse.
  • Implemented data streaming capability using Kafka and Talend for multiple data sources.
  • Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
  • Migrated an existing on-premises application to Azure.
  • Used Spark SQL to load data, created schema RDDs on top of it that load into Hive tables, and handled structured data using Spark SQL.
  • Involved in UI design, coding, and database handling.
  • Used Azure services such as ADLS and Synapse Analytics for small data sets.
  • Involved in unit testing and bug fixing.
  • Worked across the entire Software Development Life Cycle (SDLC), both as part of a team and independently.
  • Wrote SQL queries against the database and provided data extracts to users on request.
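
The JSON-flattening utility above was written in Scala per this role; the sketch below illustrates the same idea in PySpark for comparison, recursively promoting struct fields to top-level columns and exploding array fields into rows. The helper name and input path are assumptions.

```python
# Minimal sketch: flatten nested JSON events into granular columns with PySpark.
# The resume's utility was in Scala; this Python version illustrates the approach.
from pyspark.sql import DataFrame, functions as F
from pyspark.sql.types import ArrayType, StructType

def flatten(df: DataFrame) -> DataFrame:
    """Repeatedly expand struct fields and explode array fields until none remain."""
    while True:
        complex_fields = [f for f in df.schema.fields
                          if isinstance(f.dataType, (StructType, ArrayType))]
        if not complex_fields:
            return df
        field = complex_fields[0]
        if isinstance(field.dataType, StructType):
            # Promote each nested field to a top-level column.
            expanded = [F.col(f"{field.name}.{c.name}").alias(f"{field.name}_{c.name}")
                        for c in field.dataType.fields]
            df = df.select("*", *expanded).drop(field.name)
        else:
            # One output row per array element (keeps rows with empty arrays).
            df = df.withColumn(field.name, F.explode_outer(field.name))

# Example usage (path is hypothetical):
# flat = flatten(spark.read.json("/data/raw/events"))
```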

Environment: SQL Server, Teradata, MongoDB, Oracle, Putty, WinSCP, SFTP, Hadoop (Cloudera) cluster, MapR cluster, Jupyter Notebook, PyCharm, IntelliJ, Bitbucket, Bamboo, AWS/Azure, Avro, JSON, XML, Unix, Python, Scala, PL/SQL, Ansible, Spark, HDFS, Hive, HBase, Sqoop, Kafka, Spark Streaming, Spark SQL, Jenkins, AWS/Azure models

Bigdata Engineer

Confidential

Responsibilities:

  • Creating data stores, datasets, and a virtual warehouse in the lake, then creating Spark and Hive refiners to implement the existing SQL stored procedures.
  • Implementing the BDF Data Lake, which provides a platform to manage data in a central location so that anyone in the firm can rapidly query, analyze, or refine the data in a standard way.
  • Creating a framework for moving legacy data from RDBMS, mainframe, Teradata, and external source-system data warehouses to the Hadoop data lake, and migrating the data processing to the lake using Python and Spark.
  • Creating a test automation framework using Scala.
  • Creating an audit framework using Spark, Python, and HBase.
  • Creating reusable user-defined functions in Java/Python.
  • Shell scripting for automating job orchestration.
  • Providing technical assistance and training to the team.
  • Creating reconciliation jobs for validating data between the source and the lake.
  • Optimizing applications.
  • Implementing Kafka as a pub/sub engine.
  • Performing real-time streaming of data using Kafka and Spark.
  • Performing unit testing and integration testing using the JUnit framework.
  • Creating reusable modules for Spark streaming and data movement between different zones using Python.
  • Creating a training module to generate recommendation reports using the k-means clustering algorithm (see the sketch after this list).
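
The k-means training module above is only named; the sketch below shows one minimal way such a module could look with scikit-learn (which appears in this role's environment). The input file, feature columns, and number of clusters are illustrative assumptions.

```python
# Minimal sketch: cluster customers with k-means and emit a simple per-segment
# report that could feed recommendations. Input file, feature columns, and k
# are assumptions for illustration.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("customer_features.csv")                    # assumed extract from the lake
features = df[["recency_days", "frequency", "monetary"]]     # assumed feature columns

# Standardize features so no single scale dominates the distance metric.
X = StandardScaler().fit_transform(features)

model = KMeans(n_clusters=5, n_init=10, random_state=42)     # k chosen for illustration
df["segment"] = model.fit_predict(X)

# Per-segment averages serve as the recommendation report.
report = df.groupby("segment")[["recency_days", "frequency", "monetary"]].mean()
report.to_csv("segment_report.csv")
```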

Environment: Oracle, SQL Server, Hive, HBase, Putty, WinSCP, SFTP, Hadoop (Cloudera) cluster, MapR cluster, Jupyter Notebook, PyCharm, NumPy, SciPy, scikit-learn, PL/SQL, XML, SQL*Plus, Unix, Python, Scala, Ansible, k-means clustering, Spark, HDFS, Hive, HBase, Sqoop, Kafka

Software Engineer

Confidential

Responsibilities:

  • Created a POC in big data from end to end using Pig, Hive, HDFS, and Tableau
  • End-to-end development of a Hadoop project
  • Prepared PL/SQL packages and procedures for the back-end processing of the proposed database design
  • Delivered PL/SQL training sessions for co-workers covering the latest PL/SQL features and PL/SQL performance tuning
  • Drafted tables, synonyms, sequences, views, PL/SQL stored procedures, and triggers
  • Facilitated testing and code review
  • Performed performance tuning of the overall system by eliminating redundant joins, creating indexes, and removing redundant code
  • Developed UNIX shell scripts for part processing
  • Utilized Oracle Designer 6i to perform data modelling
  • Documented Tech Specs for the proposed database design
  • Devised PL/SQL packages and procedures for the back-end processing of the proposed database design
  • Facilitated management of the database
  • Designed tables, synonyms, sequences, views, PL/SQL stored procedures and triggers
  • Performed testing and code review
  • Conducted performance tuning of the overall system by eliminating redundant joins, creating indexes, and removing redundant code. Developed UNIX shell scripts to perform a nightly refresh of the test system from production databases. Monitored user profiles, roles, and privileges for the Sybase database
  • Maintained technical and functional documentation and all deliverables.

Environment: Oracle Forms Developer, Report Developer, HP-UX 11i, UNIX Sun Solaris 5.8, Oracle 9i, Putty, WinSCP, PL/SQL, SQL*Plus, UNIX scripting
