- 8+ years of experience in the IT industry, with expertise in Big Data/Hadoop development and in analysis, design, development, testing, documentation, deployment, and integration using SQL and Big Data technologies.
- Solid experience in Big Data Analytics using HDFS, Hive, Impala, Kafka, Pig, Sqoop, MapReduce, HBase, Spark, Spark SQL, YARN, Spark Streaming, Zookeeper, Hue, Flume, Oozie.
- Develop data set processes for data modelling, and Data mining. Recommend ways to improve data reliability, efficiency and quality.
- Experience in importing and exporting data using Sqoop from HDFS to relational database systems and vice versa, and in loading it into partitioned Hive tables.
- Good knowledge of writing MapReduce jobs through Pig, Hive, and Sqoop.
- Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Experience in design and development of an ingestion framework from multiple sources to Hadoop using the Spark framework with PySpark and PyCharm.
- Experience in developing Map Reduce Programs using Apache Hadoop for analyzing the big data as per the requirement.
- Experience in installing, configuring, and administering Hadoop clusters of major Hadoop distributions.
- Experience in development, implementation, and testing of Business Intelligence and Data Warehousing solutions.
- Expertise in the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
- Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.
- Hands-on experience with Spark MLlib utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
- Experience in IBM DataStage design and development of parallel ETL jobs
- Experience in designing star schema, Snowflake schema for Data Warehouse, ODS architecture.
- Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
- Experience with the Azure cloud platform (Data Lake, Data Storage, Data Factory, Databricks, Azure SQL Database) and in migrating SQL databases to Azure.
- Experience in developing a data pipeline through Kafka-Spark API.
- Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to and from different source systems, including flat files.
- Experienced in building automated regression scripts in Python to validate ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
- Proficiency in SQL across several dialects (commonly MySQL, PostgreSQL, Redshift, SQL Server, and Oracle).
- Proficient in collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.
- Expertise in Python and Scala, including user-defined functions (UDFs) for Hive and Pig written in Python.
- Experienced in developing and supporting Oracle, SQL, PL/SQL, and T-SQL queries.
- Experience in working with Excel Pivot and VBA macros for various business scenarios.
- Expertise in SQL Server Analysis Services (SSAS) and SQL Server Reporting Services (SSRS).
- Ability to program in various languages such as Python, Java, C++, and Scala.
- Good knowledge of Data Marts, OLAP, and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema modeling, Snowflake modeling for fact and dimension tables) using Analysis Services.
- Excellent in performing data transfer activities between SAS and various databases and data file formats like XLS, CSV, etc.
- Creative skills in developing elegant solutions to challenges related to pipeline engineering
Big Data Tools/ Hadoop Ecosystem: Map Reduce, Spark, Airflow, Nifi, HBase, Hive, Pig, Sqoop, Kafka, Oozie, Hadoop
Databases: Oracle 12c/11g/10g, Teradata R15/R14, MySQL, SQL Server, NoSQL (MongoDB, Cassandra, HBase).
ETL/Data warehouse Tools: Informatica and Tableau.
BI Tools: SSIS, SSRS, SSAS.
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: SQL, PL/SQL, and UNIX shell scripting.
Cloud Platform: Amazon Web Services (AWS), Microsoft Azure
Cloud Management: Amazon Web Services (AWS): EC2, EMR, S3, Redshift, Lambda, Athena; MS Azure: Data Lake, Data Storage, Databricks, Data Factory
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
Operating System: Windows, Unix, Sun Solaris
Methodologies: System Development Life Cycle (SDLC), Agile
Confidential, Phoenix, AZ
Senior Big Data Engineer
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, PySpark, Impala, Tealeaf, pair RDDs, NiFi, DevOps, and Spark on YARN.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.
- Responsible for managing data coming from different sources through Kafka.
- Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds.
- Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
- Developed Java MapReduce programs for the analysis of sample log files stored in the cluster.
- Wrote MapReduce programs and Hive UDFs in Java.
- Developed a Spark job in Java which indexes data into ElasticSearch from external Hive tables which are in HDFS.
- Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.
- Strong knowledge of the architecture and components of Tealeaf, and efficient in working with Spark Core and Spark SQL. Designed and developed RDD Seeds using Scala and Cascading. Streamed data to Spark Streaming using Kafka.
- Exposure to Spark, Spark Streaming, Spark MLlib, Snowflake, and Scala, and to creating and handling DataFrames in Spark with Scala.
- Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
- Implemented data quality checks in the Talend ETL tool; good knowledge of data warehousing.
- Installed application on AWS EC2 instances and configured the storage on S3 buckets.
- Stored data in AWS S3, used in place of HDFS, and ran EMR programs on the stored data.
- Used the AWS-CLI to suspend an AWS Lambda function. Used AWS CLI to automate backups of ephemeral data-stores to S3 buckets, EBS.
- Experience in using Kafka and Kafka brokers to initiate a Spark context and process live streaming data.
- Developed custom Kafka producers and consumers for publishing and subscribing to Kafka topics.
- Migrated MapReduce jobs to Spark jobs to achieve better performance.
- Used the Spark DataFrame API in Scala to analyze data.
- Worked on setting up and configuring AWS EMR clusters and used Amazon IAM to grant users fine-grained access to AWS resources.
- Evaluating client needs and translating their business requirement to functional specifications thereby onboarding them onto Hadoop ecosystem.
- Extracted and updated the data into HDFS using Sqoop import and export.
- Developed Hive UDFs to incorporate external business logic into Hive scripts and developed scripts to join data sets.
- Worked on AWS CLI Auto Scaling and Cloud Watch Monitoring creation and update.
- Worked with various HDFS file formats like Parquet, IAM, and JSON for serializing and deserializing.
- Implemented Cluster for NoSQL tool HBase as a part of POC to address HBase limitations.
- Used IAM to detect and stop risky identity behaviors using rules, machine learning, and other statistical algorithms
- Developed end-to-end data processing pipelines that begin with receiving data through the distributed messaging system Kafka and end with persisting data into Cassandra.
- Developed AWS Lambda functions in Python that invoke Python scripts to perform various transformations and analytics on large data sets in EMR clusters.
- Developed Apache Spark applications for processing data from various streaming sources.
- Responsible for developing a data pipeline using Spark, Scala, and Apache Kafka to ingest data from the CSL source and store it in a protected HDFS folder.
- Implemented many Kafka ingestion jobs to consume data for real-time and batch processing.
- Responsible for developing a data pipeline with Amazon AWS to extract data from weblogs and store it in HDFS; worked extensively with Sqoop for importing metadata from Oracle.
- Good experience in using the relational databases Oracle, MySQL, SQL Server, and PostgreSQL.
- Good understanding of NoSQL databases and hands-on experience writing applications on the NoSQL databases HBase, Cassandra, and MongoDB.
- Very good implementation experience of Object-Oriented concepts, Multithreading and Java/Scala
- Designed and developed Security Framework to provide fine grained access to objects in AWS S3 using AWS Lambda, DynamoDB.
- Used Jira for ticketing and tracking issues and Jenkins for continuous integration and continuous deployment.
- Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake Schemas.
- Experienced with Scala and Spark in improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, pair RDDs, and Spark on YARN.
- Worked on designing the MapReduce and YARN flow, writing MapReduce scripts, performance tuning, and debugging.
- Developed a NiFi workflow to pick up data from an SFTP server and send it to a Kafka broker.
- Developed Oozie workflows to run multiple Hive, Pig, Tealeaf, MongoDB, Git, Sqoop, and Spark jobs.
Environment: Hadoop (HDFS, MapReduce), Kafka, Scala, MongoDB, Java, Pig, Sqoop, Flume, DevOps, HBase, AWS Services (Lambda, EMR, Auto Scaling, EC2, S3, IAM, CloudWatch, DynamoDB), YARN, PostgreSQL, Spark, Impala, Oozie, Hue, Oracle, NiFi, Git.
Confidential, Blue Ash, OH
Big Data Engineer
- Created notebooks using Databricks, Scala, and Spark to capture data from Delta tables in Delta Lake.
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Creation of data aggregation and pipelining using Kafka and Storm
- Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources
- Implemented Copy activity, Custom Azure Data Factory Pipeline Activities
- Hands-on experience with Kafka-Storm on the HDP platform for real-time analysis.
- Responsible for wide-ranging data ingestion using Sqoop and HDFS commands. Accumulated partitioned data in various storage formats such as text, JSON, and Parquet. Involved in loading data from the Linux file system to HDFS.
- Involved in all steps and scope of the project's reference data approach to MDM; created a data dictionary and source-to-target mapping in the MDM data model.
- Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization
- Transformed business problems into Big Data solutions and defined the Big Data strategy and roadmap. Installed, configured, and maintained data pipelines.
- Primarily involved in Data Migration using SQL, SQL Azure, Azure Storage, and Azure Data Factory, SSIS, PowerShell
- Designing the business requirement collection approach based on the project scope and SDLC methodology.
- Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and a write-back tool, in both directions.
- Experience managing Azure Data Lake Storage (ADLS) and Data Lake Analytics, with an understanding of how to integrate them with other Azure services. Knowledge of U-SQL.
- Monitored cluster health by Setting up alerts using Nagios and Ganglia
- Working on tickets opened by users regarding various incidents, requests
- Wrote UNIX shell scripts to automate jobs and scheduled them as cron jobs using crontab.
- Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Developed Kafka producers and consumers that efficiently ingested data from various data sources.
- Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer
- Wrote production-level machine learning classification models and ensemble classification models from scratch using Python and PySpark to predict binary values for certain attributes within a given time frame.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL, and the MLlib libraries.
- Applied various machine learning algorithms and statistical modeling techniques, such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering, to identify volume using the scikit-learn package in Python, R, and MATLAB. Collaborated with data engineers and software developers to develop experiments and deploy solutions to production.
- Wrote fully parameterized Databricks code and ADF pipelines for efficient code management.
- Wrote research reports describing the experiments conducted, results, and findings, and made strategic recommendations to technology, product, and senior management. Worked closely with regulatory delivery leads to ensure robustness in prop trading control frameworks using Hadoop, Python, Jupyter Notebook, Hive, and NoSQL.
- Involved in unit testing the code and provided feedback to the developers. Performed unit testing of the application using NUnit.
- Performed all necessary day-to-day Git support for different projects; responsible for the design and maintenance of the Git repositories and the access control strategies.
Environment: Hadoop, Kafka, Spark, Sqoop, Spark SQL, Spark Streaming, Hive, Scala, Pig, NoSQL, Impala, Oozie, HBase, Zookeeper, Azure, Databricks, Data Lake, Data Factory, Unix/Linux Shell Scripting, Python, PyCharm, Informatica PowerCenter.
Confidential, Jersey City, NJ
Big Data Developer
- Designed solutions for high-volume data stream ingestion and processing with low latency.
- Worked on Big Data on AWS cloud services such as EC2, S3, EMR, and DynamoDB.
- Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad hoc queries. This allowed for a more reliable and faster reporting interface, giving sub-second query response for basic queries.
- Optimized the TensorFlow Model for efficiency
- Analyzed the system for new enhancements and functionalities and performed impact analysis of the application for implementing ETL changes.
- Created Entity Relationship Diagrams (ERD), Functional diagrams, Data flow diagrams and enforced referential integrity constraints and created logical and physical models using Erwin.
- Strong understanding of AWS components such as EC2 and S3
- Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment using AWS Lambda and AWS CodePipeline. Provisioned using the Hadoop ecosystem tools Hive, Pig, Sqoop, and Kafka, together with Python, Spark, Scala, NoSQL, NiFi, and Druid.
- Designed and implemented big data ingestion pipelines to ingest multi-terabyte data from various data sources using Kafka and Spark Streaming, including data quality checks and transformation, and stored the results in efficient storage formats. Performed data wrangling on multi-terabyte datasets from various data sources for a variety of downstream purposes, such as analytics, using PySpark.
- Was responsible for ETL and data validation using SQL Server Integration Services.
- Defined and deployed monitoring, metrics, and logging systems on AWS.
- Connected to Amazon Redshift through Tableau to extract live data for real time analysis.
- Implemented a continuous delivery pipeline with Docker, GitHub, and AWS.
- Built performant, scalable ETL processes to load, cleanse and validate data
- Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
- Analyzed the existing application programs and tuned SQL queries using the execution plan, Query Analyzer, SQL Profiler, and Database Engine Tuning Advisor to enhance performance.
- Published interactive data visualizations, dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.
- Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology for each job.
- Measured Efficiency of Hadoop/Hive environment ensuring SLA is met
Environment: Oracle, Kafka, Python, Redshift, Informatica, AWS, SQL Server, Erwin, RDS, NoSQL, Snowflake Schema, MySQL, DynamoDB, Docker, PostgreSQL, Tableau, GitHub
- Developed workflows in Oozie and Airflow to automate the tasks of loading data into HDFS and pre-processing it with Pig and Hive.
- Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
- Assisting in designing the overall ETL solutions including analyzing data, preparation of high level and detailed design documents, test and data validation plans and deployment strategy.
- Experienced in querying data using SparkSQL on top of Spark engine
- Involved in managing and monitoring the Hadoop cluster using Cloudera Manager.
- Used Python and Shell scripting to build pipelines.
- Extensively worked in database components like SQL, PL/SQL, Stored Procedures, Stored Functions, Packages and Triggers.
- Supporting other ETL developers, providing mentoring, technical assistance, troubleshooting and alternative development solutions
- Experienced in writing real-time processing jobs using Spark Streaming with Kafka.
- Used HiveQL to analyze the partitioned and bucketed data and compute various metrics for reporting
Environment: Oozie, Pig, Airflow, Hadoop, HDFS, Spark, HiveQL, Informatica, Oracle, PL/SQL, SQL Server, Linux, Shell Scripting, Unix
- Created and modified shell scripts for scheduling various data cleansing scripts and ETL load processes.
- Developed testing scripts in Python, prepared test procedures, analyzed test result data, and suggested improvements to the system and software.
- Responsible for importing data to HDFS using Sqoop from different RDBMS servers and exporting data using Sqoop to the RDBMS servers after aggregations for other ETL operations.
- Involved in functional testing, integration testing, regression testing, smoke testing, and performance testing. Tested Hadoop MapReduce jobs developed in Python, Pig, and Hive.
- Experience in designing and developing applications in PySpark using Python to compare the performance of Spark with Hive.
- Wrote and executed test cases and reviewed them with the business and development teams.
- Worked on debugging, performance tuning, and analyzing data using the Hadoop components Hive and Pig.
- Implemented Defect Tracking process using JIRA tool by assigning bugs to Development Team
- Automated the regression tool (Qute), reducing manual effort and increasing team productivity.
Environment: Hadoop, MapReduce, HDFS, Pig, HiveQL, MySQL, UNIX Shell Scripting, Java, Spark, SSIS, JSON, Hive, Sqoop.