- 8+ years of professional software development experience, including 6+ years of expertise in the Hadoop ecosystem, cloud engineering, and data warehousing.
- Sound experience with AWS services such as Amazon EC2, S3, EMR, Amazon RDS, VPC, Amazon Elastic Load Balancing, IAM, Auto Scaling, CloudWatch, SNS, SES, SQS, and Lambda for triggering resources.
- Experience in building data pipelines using Azure Data Factory and Azure Databricks, loading data into Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling and granting database access.
- Good experience with Azure services such as HDInsight, Stream Analytics, Active Directory, Blob Storage, Cosmos DB, and Storage Explorer.
- Experience in large-scale application development using the Big Data ecosystem: Hadoop (HDFS, MapReduce, YARN), Spark, Kafka, Hive, Impala, HBase, Sqoop, Pig, Airflow, Oozie, Zookeeper, Ambari, Flume and Nifi.
- Proficiency in multiple databases such as Teradata, MongoDB, Cassandra, MySQL, Oracle and MS SQL Server.
- Worked on Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring and Cloud Deployment Manager.
- Strong Hadoop and platform support experience with the entire suite of tools and services in major Hadoop distributions: Cloudera, Amazon EMR, Azure HDInsight, and Hortonworks.
- Proficient in handling and ingesting terabytes of streaming data (Kafka, Spark Streaming, Storm) and batch data, and in automation and scheduling (Oozie, Airflow).
- Profound knowledge in developing production-ready Spark applications using Spark components such as Spark SQL, MLlib, GraphX, DataFrames, Datasets, Spark ML and Spark Streaming.
- Expertise in developing multiple Confluent Kafka producers and consumers to meet business requirements, storing the stream data in HDFS and processing it using Spark.
- Strong working experience with SQL and NoSQL databases (Cosmos DB, MongoDB, HBase, Cassandra), data modeling, tuning, disaster recovery, backup and creating data pipelines.
- Experienced in scripting with Python (PySpark), Java, Scala and Spark SQL for development and for aggregating data from various file formats such as XML, JSON, CSV, Avro, Parquet and ORC.
- Extensive experience in data analysis using HiveQL, Hive ACID tables, Pig Latin queries and custom MapReduce programs, achieving improved performance.
- Hands-on experience with the Hadoop ecosystem, including Spark, Kafka, HBase, Pig, Impala, Sqoop, Oozie, Flume, Mahout and Storm, and with Tableau and Talend big data technologies.
- Good knowledge of the architecture and components of Spark; efficient in working with Spark Core, Spark SQL, Spark Streaming and Kafka; expertise in building PySpark and Spark-Scala applications for interactive analysis, batch processing and stream processing.
- Experience in configuring Spark Streaming to receive real-time data from Apache Kafka and store the stream data in HDFS; expertise in using Spark SQL with various data sources such as JSON, ORC, Parquet and Hive.
- Extensively used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data and used DataFrame operations to perform the required validations on the data. Involved in loading structured and semi-structured data into Spark clusters using the Spark SQL and DataFrame APIs.
- Capable of using Amazon S3 for data transfer over SSL, with data encrypted automatically once it is uploaded. Skilled in using Amazon Redshift to perform large-scale database migrations.
- Ingested data into Snowflake Cloud Data Warehouse using Snowpipe. Extensive experience with micro-batching to ingest millions of files into Snowflake as files arrive in the staging area.
- Developed Hive scripts for extraction, transformation and loading of data into the data warehouse.
Big Data Ecosystem: HDFS, YARN, MapReduce, Sqoop, Hive, Oozie, Pig, Spark, Zookeeper, Cloudera Manager, Kafka, Flume, Nifi, Airflow, StreamSets, Kafka Connect
Cloud Technologies and Services: Amazon AWS (EMR, EC2, EBS, RDS, S3, Athena, Glue, Elasticsearch, Lambda, SQS, DynamoDB, Redshift, Kinesis), Microsoft Azure (Databricks, Data Lake, Blob Storage, Azure Data Factory, SQL Database, SQL Data Warehouse, Cosmos DB, Azure Active Directory), Google Cloud Platform
NoSQL Databases: HBase, Cassandra, DynamoDB, MongoDB
Programming & Scripting: Python, Scala, Java, SQL, HiveQL, PowerShell and Bash scripting
Databases and ORM: Oracle, MySQL, Teradata, PostgreSQL, Django ORM, SQLAlchemy
Version Control: Git, SVN
IDE Tools: Eclipse, PyCharm, Visual Studio Code, Sublime Text, IntelliJ, Jupyter Notebook
Operating Systems: Ubuntu, Mac OS X, CentOS, Windows 10/8/7
Sr. Big Data Engineer (Azure)
Confidential, Bentonville, Arkansas
- Installed and designed solutions with Apache Hadoop big data components such as HDFS, MapReduce, YARN, Hive, HBase, Sqoop, Pig, Ambari and Nifi.
- Worked with different data sources such as HDFS, Hive and Teradata for Spark to process the data.
- Migrated from JMS Solace to Apache Kafka and used Zookeeper to manage synchronization, serialization, and coordination across the cluster.
- Designed and created Azure Data Factory (ADF) pipelines extensively to ingest data from relational and non-relational source systems to meet business functional requirements.
- Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Created and provisioned numerous Databricks clusters for batch and continuous streaming data processing and installed the required libraries on the clusters.
- Developed ADF pipelines to load data from on-premises systems to Azure cloud storage and databases.
- Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
- Developed Python programs to read and manipulate data from various Teradata sources and convert it into a single CSV file.
- Involved in migrating Teradata queries to Snowflake data warehouse queries.
- Worked extensively with SparkContext, Spark SQL, RDD transformations and actions, and DataFrames.
- Developed custom ETL solutions and batch processing and real-time data ingestion pipelines to move data in and out of Hadoop using PySpark and shell scripting.
- Ingested large volumes and varieties of data from disparate source systems into Azure Data Lake Gen2 using Azure Data Factory V2 and Azure Cluster services. Developed Spark Scala notebooks to perform data cleaning and transformation on various tables.
- Implemented Spark applications in Scala and used Spark SQL heavily for faster development and processing of data.
- Built scripts for data modeling and mining to give PMs and EMs easier access to Azure Logs and App Insights.
- Imported enterprise data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and loaded the data into HBase tables.
- Analyzed customer usage metrics and reported on them using Power BI; presented this analysis to leadership and to a motivated team of engineers and product managers to support product growth.
- Worked on various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Deployed the Big Data Hadoop application using Talend on AWS (Amazon Web Services) and on Microsoft Azure.
- Extensively leveraged the Talend Big Data components (tHDFSOutput, tPigmap, tHive, tHDFSCon) for data ingestion and data curation from several heterogeneous data sources.
- Performed ongoing monitoring, automation, and refinement of data engineering solutions.
- Created linked services to land data from an SFTP location in Azure Data Lake.
- Wrote several Teradata SQL queries using Teradata SQL Assistant for ad hoc data pull requests.
- Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
- Worked with both Agile and Waterfall methodologies in a fast-paced environment.
Environment: Azure Data Factory (ADF v2), Azure Databricks (PySpark), Azure Data Lake, PySpark, Hive, Apache Nifi 1.8.0, Jenkins, Kafka, Azure Function Apps, Blob Storage, SQL Server, Spark SQL, SQL, Agile Methodology.
Big Data Engineer
Confidential, Boston, Massachusetts
- Hands-on experience in working with the Cloudera Hadoop distribution platform and AWS cloud services.
- Processed numerous terabytes of data stored in AWS S3 using Elastic MapReduce (EMR).
- Ingested incremental and full-load data from various sources such as Teradata and SQL Server into S3 using Spark JDBC connections.
- Created Hive queries that helped market analysts spot emerging trends by comparing incremental data with Teradata reference tables and historical metrics.
- Created Athena and Hive external tables over S3 files and created partitioned data for ad hoc analysis.
- Responsible for managing incoming data from various sources and involved in HDFS maintenance and loading of structured data.
- Worked on AWS EC2, EMR and S3 to create clusters and manage data using S3.
- Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations in Python using the Spark framework.
- Built S3 buckets, managed policies for S3 buckets, and used S3 and Glacier for storage and backup on AWS.
- Used the Sqoop framework to load batch-processed data from various data sources into Hadoop.
- Worked across the lifecycle from analysis to production implementation, with emphasis on identifying data validation needs, developing logic and transformations as per requirements, and creating notebooks to load the data into Delta Lake.
- Worked on the transformation layer using Apache Spark RDDs, DataFrame APIs, and Spark SQL, applying various transformations and aggregations provided by the Spark framework.
- Worked on Spark integration with Hive and DB2 at the ingestion layer to handle different file formats such as Parquet and JSON.
- Designed and created ETL jobs in Talend to load huge volumes of data into MongoDB, the Hadoop ecosystem and relational databases.
- Created an automated Databricks workflow notebook to run multiple data loads (Databricks notebooks) in parallel using Python.
- Developed ETL pipelines using PySpark with the Spark SQL and DataFrame APIs.
- Developed and maintained various data ingestion and continuous integration and delivery (CI/CD) pipelines as per the design architecture and processes: source to landing, landing to curated, and curated to processed.
- Wrote Databricks notebooks (Python) for handling large volumes of data, transformations, and computations across various file formats.
- Created Hive tables as per requirements, defining internal or external tables with appropriate static or dynamic partitions and bucketing for efficiency.
- Designed and implemented Hive queries and functions for evaluation, filtering, loading, and storing of data.
- Executed Hive queries on tables stored in Hive to perform data analysis that met business requirements.
- Worked on integrating Git into the continuous integration (CI) environment along with Jenkins.
- Used Jira for issue and project tracking, Git for version control, and Airflow for job scheduling.
Environment: AWS Cloud services - S3, EMR, Athena, IAM roles, AWS Glue, HDFS, Sqoop, Hive, Spark, Databricks, PySpark, Python.