- Over 8 years of experience as a Spark and Hadoop developer using Scala and Python on big data platforms, including Cloudera and Hortonworks.
- In-depth knowledge of the big data stack, including the Hadoop ecosystem: Hadoop, MapReduce, YARN, Sqoop, Flume, Kafka, Spark, Spark DataFrames, Spark SQL, and Spark Streaming.
- Explored Spark with Scala to improve the performance of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Good knowledge of importing and exporting data between HDFS and relational database management systems (RDBMS) using Sqoop.
- Sound knowledge of data ingestion using Kafka and Flume.
- Experience as an Azure Data Engineer working with Azure Data Lake Storage, Azure analytical services, Azure Cosmos DB, big data technologies (Hadoop and Apache Spark), and Databricks.
- Excellent understanding of relational databases. Created normalized databases, wrote stored procedures, and used JDBC to communicate with databases. Experienced with MySQL and SQL Server.
- Understanding of S3 and storage buckets in AWS.
- Good knowledge of AWS services such as EC2, S3, CloudFront, RDS, DynamoDB, and Elasticsearch.
- Working knowledge of PostgreSQL and NoSQL databases such as Cassandra and HBase.
- Experienced in running queries with Impala and in using BI tools to run ad-hoc queries directly on Hadoop.
- Good experience with the Oozie framework and with automating daily import jobs.
- Experienced in managing Hadoop clusters and services using Cloudera Manager.
- Experienced in troubleshooting errors in HBase Shell/API, Pig, Hive, and MapReduce.
- Highly experienced in importing and exporting data between HDFS and Relational Database Management systems using Sqoop.
- Collected log data from various sources and integrated it into HDFS using Flume.
- Assisted the deployment team in setting up the Hadoop cluster and services.
- Good experience generating statistics, extracts, and reports from Hadoop.
- Good understanding of NoSQL databases and hands-on experience writing applications on NoSQL databases such as Cassandra and MongoDB.
- Good knowledge of querying data from Cassandra for searching, grouping, and sorting.
- Good knowledge of AWS services such as EMR and EC2, which provide fast and efficient processing of big data.
- Strong experience in core Java, Scala, SQL, PL/SQL, and RESTful web services.
- Good knowledge of cluster benchmarking and performance tuning.
- Experience in machine learning and data mining with large structured and unstructured data sets, including data acquisition, data validation, predictive modeling, and data visualization, using R and Python along with big data technologies such as Hadoop and Spark.
- Proficient in managing the entire data science project life cycle, with active involvement in all phases: data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling, dimensionality reduction using Principal Component Analysis, testing and validation using ROC plots and k-fold cross-validation, and data visualization.
- Deep understanding of statistical modeling, multivariate analysis, model testing, problem analysis, and model comparison and validation.
- Experience using R and Python packages such as scikit-learn, ggplot2, caret, dplyr, plyr, pandas, NumPy, seaborn, SciPy, Matplotlib, Beautiful Soup, and rpy2.
- Determined, committed and hardworking individual with strong communication, interpersonal and organizational skills.
Hadoop Ecosystem: Hadoop, Spark, Impala, Hive, Oozie, Ambari, Sqoop, MapReduce, HDFS
Machine Learning Algorithms: Regression, Classification, Azure Machine Learning, PySpark, Spark MLlib.
Programming Languages: Python, Scala, R.
Reporting and Visualization: Tableau, Power BI
Databases and Query Languages: Cassandra, SQL and MySQL, Spark SQL, HiveQL.
Streaming Frameworks: Flume, Kafka, Spark Streaming.
Tools: RStudio, PyCharm, Jupyter Notebook, IntelliJ, Eclipse, NetBeans, Databricks.
Platforms: Linux, Windows and OS X.
Methodologies: Agile and Waterfall Models.
Confidential, Dallas, TX
Sr. Data Engineer / ML Engineer
- Set up a Spark cluster on AKS (Azure Kubernetes Service) in a Linux virtual machine
- Developed and maintained machine learning model pipelines
- Migrated machine learning models from Dev to OnPrem in AKS using build and release pipelines
- Debugged errors and connectivity issues in existing pipelines
- Analyzed, strategized, and implemented the Azure migration of applications and databases to the cloud
- Used Kibana to review logs and compare model features between OnPrem and Dev
- Used Azure DevOps to update and deploy pipelines
- Wrote aggregated data to Azure Cache / Redis
- Captured model logs and features from MongoDB
- Ingested data and converted it from ORC to Parquet using Azure Data Factory
- Created notebooks in Azure Databricks using PySpark
Environment: Microsoft Azure, Jira Align, Hive, HBase, PySpark, MongoDB, Robo 3T, Kafka, Azure Kubernetes, Fluentd, Redis, Azure DevOps, Linux, Azure Databricks, Azure Data Factory
Confidential, Phoenix, AZ
Sr. Data Engineer
- Designed and developed ETL integration patterns using Python on Spark
- Imported data into and exported data from HDFS and Hive using Sqoop
- Ran Hadoop streaming jobs to process terabytes of XML data
- Loaded and transformed large sets of structured, semi-structured, and unstructured data
- Reviewed HDFS usage and system design for future scalability and fault tolerance. Installed and configured Hadoop HDFS, MapReduce, Pig, Hive, and Sqoop.
- Thorough understanding of and hands-on experience with portfolio, resource, and financial management applications and processes such as MS Project Server 2016 and Planview 15/17.5
- Developed and prepared reports using available analytics and data mining tools, including MS BI Suite (SSRS, SSIS, SSAS), SQL Server 2008/2008 R2/2012/2014, Oracle 9i/10g/11g, Azure, MS Visual Studio 2010/2012, Report Builder 2012/2016, MS Power BI Desktop/Online, and Tableau Desktop/Server/Reader
- Provided guidance to the development team using PySpark as an ETL platform
- Extensive hands-on experience architecting and designing data warehouses and databases, data modeling, and building SQL objects such as tables, views, user-defined and table-valued functions, stored procedures, triggers, and indexes
- Wrote complex T-SQL queries using complex joins, CTEs, derived tables, subqueries, and complex aggregations
- Created notebooks in Azure Databricks using PySpark
- Analyzed, strategized, and implemented the Azure migration of applications and databases to the cloud.
- Configured SQL Server Master Data Services (MDS) in Windows Azure IaaS
- Strong knowledge of business processes, data and information management, and data quality standards and processes
- Efficiently processed and transformed very large data sets in Databricks.
- Built SSRS tabular and matrix reports, charts, parameters, subreports, and indicators/gauges using Visual Studio 2012 and Report Builder 2016 with multiple data sources and datasets
- Experience working with the SharePoint Business Intelligence Center to deploy SharePoint integrated/native reports and dashboards and to integrate them with the Report App
- Collaborated with business stakeholders to gather and understand requirements and to demonstrate report and dashboard functionality
- Experience building Power BI dashboards through OData (accessing datasets), cubes, and tabular models using data modeling, data blending, and calculated measures and columns with advanced DAX/MDX queries
- Hands-on experience with Power Query Editor to transform and blend data and to create and modify parameters and filters.
- Supported the Corporate and CFO communities in designing and developing financial and management reports from SAP Business Warehouse on HANA
- Hands-on experience building Tableau dashboards and visualizations using cross tabs, heat maps, bar charts, calculations, groups, and sets.
- In-depth knowledge of ETL concepts; troubleshot issues with SSIS packages and worked closely with ETL developers as part of the data analysis and configuration process
- Served as point of contact for evaluating report requests: determined requestors' needs and objectives, identified the correct methodology for extracting data (including data sources and criteria), and ensured that delivered reports were accurate, timely, and appropriately formatted.
Environment: MS Project Server 2013/2016, Enterprise One 15/17, JIRA, SharePoint 2013/2016, O365, OFSAA
Confidential, Bellevue, WA
Sr. Data Engineer
- Developed and maintained data pipelines on the Azure analytics platform using Azure Databricks, PySpark, and Python.
- Ingested data from various APIs and wrote modules to store it in S3 buckets.
- Transformed batch and streaming data, encrypting fields before storing the data in the data warehouse for ad-hoc querying and analysis.
- Validated data fields from downstream sources to ensure data uniformity.
- Converted ingested data (CSV, XML, JSON) to compressed Parquet files.
- Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation.
- Collaborated within and across Agile teams to design, develop, test, implement, and support technical solutions using full-stack development tools and technologies
- Worked with a team of developers with deep experience in machine learning, distributed microservices, and full stack systems
- Used Java, Scala, and Python with open-source RDBMS and NoSQL databases and cloud-based data warehousing services such as Snowflake
- Performed unit tests and conducted reviews with other team members to make sure the code was rigorously designed, elegantly coded, and effectively tuned for performance
- Worked with PySpark in the Hadoop ecosystem on Amazon EMR and Databricks.
- Wrote unit tests and deployed production-level code using Git version control.
- Built a data pipeline that ingested data from disparate sources into a unified platform.
- Constructed robust, high-volume data pipelines and architecture to prepare data for analysis by the client.
- Designed a custom ETL and data warehouse solution to centrally store, associate and aggregate data from across multiple domains and analytics platforms.
Environment: Microsoft Azure, Spark 1.6, HBase 1.2, Tableau 10, Power BI, Python 3.4, Scala, PySpark, HDFS, Flume 1.6, Cloudera Manager, MongoDB, SQL, GitHub, Linux, Spark SQL, Kafka, Sqoop 1.4.6, AWS (S3)
Confidential, Austin, TX
Sr. Data Engineer
- Built scalable distributed data solutions in a Hadoop cluster environment using the Hortonworks distribution.
- Converted raw data to serialized formats such as Avro and Parquet to reduce data processing time and improve the efficiency of data transfer over the network.
- Worked on building end to end data pipelines on Hadoop Data Platforms.
- Worked on normalization and denormalization techniques for optimum performance in relational and dimensional database environments.
- Designed, developed, and tested Extract, Transform, Load (ETL) applications with different types of sources.
- Created files and tuned SQL queries in Hive using Hue. Ran MapReduce jobs through Hive by querying the available data.
- Explored Spark to improve the performance of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Used PySpark to work with Spark libraries from Python scripts for data analysis.
- Converted HiveQL queries into Spark transformations using Spark RDDs and Scala.
- Created User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) in Pig and Hive.
- Worked on building custom ETL workflows using Spark/Hive to perform data cleaning and mapping.
- Implemented custom Kafka encoders for a custom input format to load data into Kafka partitions.
- Supported the cluster and topics through Kafka Manager. Wrote CloudFormation scripts for security and resource automation.
Environment: Python, HDFS, MapReduce, Flume, Kafka, Zookeeper, Pig, Hive, HQL, HBase, Spark, ETL, Web Services, Red Hat Linux, Unix.
- Created and maintained optimal data pipeline architecture.
- Assembled large, complex data sets that meet functional / non-functional business requirements.
- Identified, designed and implemented internal process improvements: automating manual processes, optimizing data delivery, re-designing infrastructure for greater scalability, etc.
- Built the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources using SQL and Azure ‘big data’ technologies.
- Built analytics tools that utilize the data pipeline to provide actionable insights into customer acquisition, operational efficiency and other key business performance metrics.
- Created data tools that helped analytics and data science team members build and optimize the product.
- Performed quality assurance and testing of the SQL Server environment.
- Developed new processes to facilitate import and normalization, including data files for counterparties.
- Worked with business stakeholders, application developers, and production teams across functional units to identify business needs and discuss solution options.
- Ensured that best practices were applied and data integrity was maintained through security, documentation, and change management.
Environment: SQL Server 2005 Enterprise Edition, T-SQL, Enterprise Manager, VBS.