- 10+ years of extensive hands-on experience in the IT industry, including deployment of Hadoop ecosystem and Google Cloud technologies such as MapReduce, YARN, Sqoop, Flume, Pig, Hive, BigQuery, and Bigtable, plus 5+ years of experience with Spark, Storm, Scala, and Python.
- Experience in OLTP and OLAP design, development, testing, implementation, and support of enterprise data warehouses.
- Strong knowledge of Hadoop cluster capacity planning, performance tuning, and cluster monitoring.
- Extensive experience across the business data science project life cycle, including data acquisition and data cleaning.
- Experience in cloud computing on Google Cloud Platform with technologies such as Dataflow, Pub/Sub, BigQuery, and related tools.
- Good exposure to MapReduce programming in Java, Pig Latin scripting, distributed applications, and HDFS.
- Good understanding of NoSQL databases and hands-on experience writing applications on the NoSQL databases HBase, Cassandra, and MongoDB.
- Experienced with Scala and Spark in improving the performance of existing Hadoop algorithms using SparkContext, Spark SQL, pair RDDs, and Spark on YARN.
- Experienced in installing, configuring, supporting, and managing Hadoop clusters using Apache, Cloudera, and Hortonworks distributions, Cloud Storage, and Amazon Web Services (AWS) and related technologies: DynamoDB, EMR, S3, ML.
- Experience deploying NiFi data flows in production and integrating data from multiple sources such as Cassandra and MongoDB.
- Experience deploying templates to environments via the NiFi REST API integrated with other automation tools
- Experience in benchmarking Hadoop clusters for analysis of queue usage
- Experienced in working with Mahout for applying machine learning techniques in the Hadoop Ecosystem.
- Good experience with Amazon Web Services such as Redshift, Data Pipeline, and ML.
- Good experience moving data in and out of Hadoop, RDBMS, NoSQL, and UNIX systems using Sqoop and other traditional data movement technologies.
- Experience integrating the Quartz scheduler with Oozie workflows to pull data from multiple data sources in parallel using fork.
- Experience in installation, configuration, support and management of a Hadoop Cluster using Cloudera Distributions.
- Experience creating visual reports, graphical analyses, and dashboards with Tableau and Informatica on historical data stored in HDFS, and performing data analysis with Splunk Enterprise.
- Good experience with version control using Git. Extensive knowledge of GitHub and Bitbucket.
- Experienced in job scheduling and monitoring using Oozie and ZooKeeper.
Big Data Ecosystems: HDFS, MapReduce, Pig, Hive, Impala, YARN, Oozie, ZooKeeper, Apache Spark, Apache Crunch, Apache NiFi, Apache Storm, Apache Kappa, Apache Kafka, Sqoop, Flume
Cloud Technologies: Google Cloud Platform, Pub/Sub, Dataflow, BigQuery
Scripting Languages: Python, shell
Programming Languages: Python, Java
Databases: MongoDB, Netezza, SQL Server, MySQL, Oracle, DB2
IDEs / Tools: Eclipse, JUnit, Maven, Ant, MS Visual Studio, NetBeans
Methodologies: Agile, Waterfall
Confidential, San Antonio, TX
Google Cloud Platform & Big Data Engineer
- Involved in the process of designing Google Cloud Architecture.
- Designed and automated Dataflow pipelines that ingest data from real-time and batch sources.
- Configured Kubernetes cluster for deployment and execution of code.
- Upgraded the existing Cassandra cluster to the latest releases.
- Wrote Dataflow pipelines and transformations in the preprocessing layer
- Performed stress and performance testing and benchmarking on the cluster.
- Tuned the cluster to achieve maximum throughput and minimum execution time based on the benchmarking results
- Migrated the data from one datacenter to another datacenter.
- Configured, documented, and demonstrated inter-node communication between Cassandra nodes and clients using SSL encryption.
Confidential, Houston, TX
Big Data Engineer / Hadoop Developer
- Used Hive queries in Spark SQL to analyze and process data
- Handled different data formats such as Avro, Parquet, and ORC
- Imported and exported data between MySQL and HDFS using the ETL tool Sqoop, along with Teradata Studio and DBeaver
- Hands on experience in installation, configuration, supporting and managing Hadoop Clusters
- Implemented Optimized Map Joins to get data from different sources to perform cleaning operations before applying the algorithms
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive
- Enhanced and optimized product Spark code to aggregate, group, and run data mining tasks using the Spark framework, and handled JSON data
- Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster data processing
- Involved in developing a RESTful service using the Python Flask framework
- Used Python modules such as requests, urllib, urllib2 for web crawling
- Experienced in managing and reviewing Hadoop log files
- Involved in business analysis and technical design sessions with business and technical staff to develop requirements document and ETL design specifications.
- Wrote complex SQL scripts to avoid Informatica lookups and improve performance on heavy data volumes.
- Created and monitored sessions using Workflow Manager and Workflow Monitor.
- Involved in loading data from UNIX file system to HDFS
- Designed and developed Spark SQL scripts based on functional specifications
- Designed and developed extract, transform, and load (ETL) mappings, procedures, and schedules, following the standard development lifecycle
- Defined job flows and developed simple to complex MapReduce jobs per requirements.
- Optimized MapReduce jobs to use HDFS efficiently by applying various compression mechanisms
- Worked with Informatica Source Analyzer, Mapping Designer, Mapplets, and Transformations
- Developed end-to-end ETL batch and streaming data integration into Hadoop (MapR), transforming data
- Created highly optimized SQL queries for MapReduce jobs, matching each query to the appropriate Hive table configuration to generate efficient reports
- Worked closely with Quality Assurance, Operations and Production support group to devise the test plans, answer questions and solve any data or processing issues
- Worked on a large-scale Hadoop YARN cluster for distributed data processing and analysis using Databricks connectors, Spark Core, Spark SQL, Sqoop, Hive, and NoSQL databases
- Wrote Spark SQL scripts to optimize query performance
- Well-versed in Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, and MapReduce programming
- Implemented Hive UDFs and tuned performance for better results
- Developed and tuned SQL in HiveQL, Drill, and Spark SQL
- Used Sqoop to import and export data between Oracle DB and HDFS and Hive
- Developed Spark code using Spark RDDs and Spark SQL/Streaming for faster data processing
Confidential, Bentonville, AR
Senior Software Engineer - Big Data Engineer
- Participated in Agile ceremonies and provided status updates to the team and product owner
- Designed and built ETL pipelines to automate the ingestion of structured and unstructured data
- Implemented and configured big data technologies and tuned processes for performance at scale
- Proficient in and knowledgeable about best practices for Hadoop (YARN, HDFS, MapReduce)
- Created Spark jobs to process terabytes of data every day for daily analytics
- Developed and built frameworks and prototypes that integrate big data and advanced analytics to support business decisions
- Assisted application development teams during application design and development for highly complex and critical data projects
- Created data management policies, procedures, and standards
- Worked with end users to ensure the analytics transformed data into knowledge in focused and meaningful ways
Confidential, Philadelphia, PA
Data Analyst / Big Data Engineer
- Created Bash and Python scripts for automation of data ingestion
- Prepared delivery prerequisites to procure approvals from the management.
- Used the Python tools lettuce and behave for BDD testing to support defect-free delivery
- Migrated files from on-premises storage to AWS S3 to make data available for API consumption
- Used Jenkins and Git for versioning and build pipelines, and deployed into production
- Applied Hadoop best practices to optimize storage and processing: partitioning, bucketing, and ORC and Parquet files
- Monitored jobs using Hue for debugging and resolving issues
- Created Impala scripts to quickly retrieve ad hoc results for customers
- Consumed Kafka streams in Spark to process batch streams and apply analytics
- Used the NoSQL database HBase to perform CRUD operations for maintaining customer data
- Delivered reports that saved customers $1M in costs.
- Achieved 97% customer satisfaction on the work delivered.
- Optimized hive and
Confidential, McLean, VA
Sr. Hadoop/Spark Developer
- Involved in deploying systems on Amazon Web Services (AWS) Infrastructure services EC2.
- Configured and deployed web applications on AWS servers using SBT and Play.
- Migrated MapReduce jobs into Spark RDD transformations using Scala.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Developed Spark code using Spark RDDs and Spark SQL/Streaming for faster data processing.
- Performed configuration, deployment and support of cloud services including Amazon Web Services (AWS).
- Working knowledge of various AWS technologies like SQS Queuing, SNS Notification, S3 storage, Redshift, Data Pipeline, EMR.
- Responsible for all public (AWS) and private (OpenStack/VMware/DCOS/Mesos/Marathon) cloud infrastructure
- Developed a Flume ETL job that handles data from an HTTP source with HDFS as the sink, and configured data pipelining.
- Used Hive data warehouse tool to analyze the unified historic data in HDFS to identify issues and behavioral patterns.
- Involved in developing a RESTful service using the Python Flask framework.
- Experienced in working with the Python GUI framework PyJamas.
- Experienced in using Apache Drill for interactive analysis of large-scale datasets in data-intensive distributed applications.
- Developed end to end ETL batch and streaming data integration into Hadoop (MapR), transforming data.
- Used Python modules such as requests, urllib, and urllib2 for web crawling.
- Tools used extensively include Spark, Drill, Hive, HBase, Kafka and MapR Streams, PostgreSQL, and StreamSets
Confidential, Chesterfield, MI
- Well-versed in Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, and MapReduce programming.
- Developed MapReduce programs to remove irregularities and aggregate the data.
- Developed cluster coordination services through ZooKeeper.
- Implemented Hive UDFs and tuned performance for better results
- Developed Pig Latin scripts to extract data from log files and store it in HDFS. Created user-defined functions (UDFs) to pre-process data for analysis
- Implemented Optimized Map Joins to get data from different sources to perform cleaning operations before applying the algorithms.
- Created highly optimized SQL queries for MapReduce jobs, matching each query to the appropriate Hive table configuration to generate efficient reports.
- Used other packages such as BeautifulSoup for data parsing in Python.
- Developed and tuned SQL in HiveQL, Drill, and Spark SQL.
- Used Sqoop to import and export data between Oracle DB and HDFS, Hive, and HBase.
- Implemented CRUD operations on HBase data using the Thrift API to get real-time insights.
- Identified data sources for various senior-management reports and wrote complex SQL queries.