Sr. Big Data Engineer/ Cloud Data Engineer Resume
Dallas, TX
SUMMARY
- 8 years of experience in the IT industry on big data platforms, with extensive hands-on experience in the Apache Hadoop ecosystem and enterprise application development.
- Good knowledge of extracting models and trends from raw data in collaboration with the data science team.
- Experience across the Hadoop ecosystem in ingestion, storage, querying, processing, and analysis of big data.
- Hands-on experience with AWS data analytics services such as Athena, Glue Data Catalog, and QuickSight.
- Performed the migration of Hive and MapReduce jobs from on-premises MapR to the AWS cloud using EMR and Qubole.
- Experience in installation, configuration, supporting and managing Hadoop Clusters using HDP and other distributions
- Hands-on expertise with AWS databases such as RDS (Aurora), Redshift, DynamoDB, and ElastiCache (Memcached and Redis).
- Extensive experience with real-time streaming technologies such as Spark, Storm, and Kafka.
- Hands on experience on tools like Pig & Hive for data analysis, Sqoop for data ingestion, Oozie for scheduling and Zookeeper for coordinating cluster resources
- Worked on a Scala code base related to Apache Spark, performing actions and transformations on RDDs, DataFrames, and Datasets using Spark SQL and Spark Streaming contexts.
- Proficient in analyzing large unstructured data sets using Pig, and in developing and designing POCs using MapReduce and Scala and deploying them on YARN clusters.
- Experienced in developing MapReduce programs using Apache Hadoop for working with big data.
- Good understanding of Apache Spark high-level architecture and performance tuning patterns.
- Parsed data from S3 via Python API calls through Amazon API Gateway, generating batch sources for processing.
- Good understanding of AWS SageMaker.
- Extracted, transformed, and loaded data from different formats and sources such as JSON files and databases, and exposed it for ad-hoc/interactive queries using Spark SQL.
- Proficient in big data tools like Hive and Spark and the relational data warehouse tool Teradata.
- Responsible for data engineering functions including data extraction, transformation, loading, and integration in support of enterprise data infrastructures: data warehouses, operational data stores, and master data management.
- Solid experience in and understanding of designing and operationalizing large-scale data and analytics solutions on the Snowflake Data Warehouse.
- Developed ETL pipelines in and out of the data warehouse using a combination of Python and SnowSQL.
- Experience with Google Cloud components, Google Container Builder, GCP client libraries, and the Cloud SDK.
- Substantial experience in Spark 3.0 integration with Kafka 2.4.
- Experience in setting up monitoring infrastructure for Hadoop cluster using Nagios and Ganglia.
- Sustained BigQuery, PySpark, and Hive code by fixing bugs and delivering the enhancements required by business users.
- Worked with AWS and GCP clouds, using GCP Cloud Storage, Dataproc, Dataflow, and BigQuery alongside AWS EMR, S3, Glacier, and EC2 instances with EMR clusters.
- Proficient in statistical methodologies including hypothesis testing, ANOVA, time series, principal component analysis, factor analysis, cluster analysis, and discriminant analysis.
- Expertise in transforming business resources and requirements into manageable data formats and analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
- Worked with various text analytics libraries like Word2Vec, GloVe, LDA and experienced with Hyper Parameter Tuning techniques like Grid Search, Random Search, model performance tuning using Ensembles and Deep Learning.
- Worked on proofs of concept (PoCs) and gap analysis; gathered the data needed for analysis from different sources and prepared it for exploration using data munging and Teradata.
- Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
- Experienced in building automated regression scripts for validating ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.
- Proficiency in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
- Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Expertise in Python and Scala; developed user-defined functions (UDFs) for Hive and Pig using Python.
- Experience in developing Map Reduce Programs using Apache Hadoop for analyzing the big data as per the requirement.
- Hands-on experience with Spark MLlib utilities such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.
- Experience in working with Flume and NiFi for loading log files into Hadoop.
- Experience in working with NoSQL databases like HBase and Cassandra.
- Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto the HDFS.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
- Worked with Cloudera and Hortonworks distributions.
- Expert in developing SSIS/DTS Packages to extract, transform and load (ETL) data into data warehouse/ data marts from heterogeneous sources.
- Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.
- Good knowledge of integrating Spark Streaming with Kafka for real time processing of streaming data
- Good knowledge of data marts, OLAP, and dimensional data modeling with the Ralph Kimball methodology (star schema and snowflake schema modeling for fact and dimension tables) using Analysis Services.
- Strong analytical and problem-solving skills and the ability to follow through with projects from inception to completion.
TECHNICAL SKILLS
Big Data Tools: Hadoop Ecosystem, MapReduce, Spark 2.3, Airflow 1.10.8, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0
BI Tools: SSIS, SSRS, SSAS.
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: SQL, PL/SQL, and UNIX.
Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
Cloud Platform: AWS, Azure, Google Cloud.
Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena
Databases: Oracle, MySQL, SQL Server, Teradata
NoSQL Databases: MongoDB, Cassandra, HBase
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.
Operating System: Windows, Unix, Sun Solaris
PROFESSIONAL EXPERIENCE
Confidential, Dallas, TX
Sr. Big Data Engineer/ Cloud Data Engineer
Responsibilities:
- Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling via AWS CloudFormation
- Supported continuous storage in AWS using Elastic Block Store (EBS), S3, and Glacier; created volumes and configured snapshots for EC2 instances
- Used the DataFrame API in Scala to convert distributed collections of data into named columns, and developed predictive analytics using Apache Spark Scala APIs
- Developed Scala scripts using DataFrames/SQL/Datasets and RDDs/MapReduce in Spark for data aggregation and queries, and for writing data back into the OLTP system through Sqoop
- Developed Hive queries to pre-process the data required for running the business process
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios
- Implemented a generalized solution model using AWS SageMaker
- Extensive expertise using the core Spark APIs and processing data on an EMR cluster
- Developed solutions to leverage ETL tools and identified opportunities for process improvements using Informatica and Python
- Conduct root cause analysis and resolve production problems and data issues
- Partnered with ETL developers to ensure that data was well cleaned and the data warehouse was up to date for reporting purposes using Pig.
- Led the team to design, develop, test, and deliver end-to-end deliverables
- Collaborated with Business Analysts and SMEs across departments to gather business requirements and identify workable items for further development.
- Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
- Performance tuning, code promotion and testing of application changes
- Conducted performance analysis, optimized data processes, and made recommendations for continuous improvement of the data processing environment.
- Developed a data platform from scratch and took part in the requirement gathering and analysis phase of the project, documenting the business requirements
- Performed simple statistical profiling of the data, such as cancel rate, variance, skewness, kurtosis, and runs of trades for each stock per day, grouped into 1-, 5-, and 15-minute intervals.
- Designed and implemented large-scale pub-sub message queues using Apache Kafka
- Worked on configuring Zookeeper, Kafka, and Logstash clusters for data ingestion, on Elasticsearch performance and optimization, and on Kafka for live streaming of data.
- Used PySpark and Pandas to calculate the moving average and RSI score of the stocks and loaded the results into the data warehouse (see the moving-average/RSI sketch after this list).
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs
- Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
- Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
- Developed and validated machine learning models including Ridge and Lasso regression for predicting total amount of trade.
- Boosted the performance of regression models by applying polynomial transformation and feature selection and used those methods to select stocks.
- Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
- Utilized Agile and Scrum methodology for team and project management.
- Used Git for version control with colleagues.
- Experienced in day-to-day DBA activities including schema management, user management (creating users, synonyms, privileges, roles, quotas, tables, indexes, sequences), space management (tablespaces, rollback segments), monitoring (alert log, memory, disk I/O, CPU, database connectivity), scheduling jobs, and UNIX shell scripting.
- Expertise in using Docker to run and deploy applications in multiple containers with Docker Swarm and Docker Weave.
- Worked on designing ETL pipelines to retrieve datasets from MySQL and MongoDB into an AWS S3 bucket, and managed bucket and object access permissions
- Responsible for designing and developing data ingestion from Kroger using Apache NiFi/Kafka
- Developed complex Talend ETL jobs to migrate data from flat files to databases. Pulled files from the mainframe into the Talend execution server using multiple FTP components.
- Developed Talend ESB services and deployed them on ESB servers on different instances.
- Architected and designed serverless application CI/CD using the AWS Serverless Application Model (SAM) for Lambda.
- Developed stored procedures and views in Snowflake and used them in Talend for loading dimensions and facts.
- Developed merge scripts to UPSERT data into Snowflake from an ETL source (see the Snowflake MERGE sketch after this list)
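A minimal sketch of the moving-average and RSI calculation mentioned in the bullets above, written with Pandas. The column name ("close"), the window lengths, and the toy prices are illustrative assumptions, not details taken from the project.

```python
# Sketch of a moving-average / RSI computation over a price series.
# Column name, window sizes, and sample data are assumptions.
import pandas as pd

def add_indicators(prices: pd.DataFrame, ma_window: int = 20, rsi_window: int = 14) -> pd.DataFrame:
    out = prices.copy()

    # Simple moving average over the closing price.
    out["moving_avg"] = out["close"].rolling(ma_window).mean()

    # Relative Strength Index: average gain vs. average loss over the window.
    delta = out["close"].diff()
    gain = delta.clip(lower=0).rolling(rsi_window).mean()
    loss = (-delta.clip(upper=0)).rolling(rsi_window).mean()
    out["rsi"] = 100 - (100 / (1 + gain / loss))
    return out

if __name__ == "__main__":
    # Toy usage with a short price series.
    df = pd.DataFrame({"close": [10, 10.5, 10.2, 10.8, 11.0, 10.7, 11.2, 11.5]})
    print(add_indicators(df, ma_window=3, rsi_window=3))
```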
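A minimal sketch of an UPSERT into Snowflake via a MERGE statement, run from Python with the Snowflake connector. The database, schema, table, and column names are hypothetical, and connection parameters would come from configuration or a secrets store.

```python
# Sketch of a Snowflake UPSERT (MERGE) issued from Python.
# Table, stage, and column names are placeholders.
import snowflake.connector

MERGE_SQL = """
MERGE INTO analytics.orders AS tgt
USING staging.orders_delta AS src
  ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET
  tgt.status     = src.status,
  tgt.updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, status, updated_at)
  VALUES (src.order_id, src.status, src.updated_at)
"""

def upsert_orders(conn_params: dict) -> None:
    conn = snowflake.connector.connect(**conn_params)
    try:
        with conn.cursor() as cur:
            cur.execute(MERGE_SQL)
    finally:
        conn.close()
```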
Environment: HDFS, Hive, Spark (PySpark, Spark SQL, Spark MLlib), Kafka, Linux, Python 3.x (scikit-learn, NumPy, Pandas), Tableau 10.1, GitHub, AWS EMR/EC2/S3/Redshift, Pig, JSON and Parquet file formats, MapReduce, Spring Boot, Cassandra, Swamp, Data Lake, Sqoop, Oozie, MySQL, MongoDB.
Confidential, Rochester, MN
Sr. Big Data Engineer
Responsibilities:
- Transforming business problems into Big Data solutions and define Big Data strategy and Roadmap. Installing, configuring, and maintaining Data Pipelines
- Developed the features, scenarios, and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin, and Ruby.
- Designing the business requirement collection approach based on the project scope and SDLC methodology.
- Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse, and write-back tools, and back again.
- Extracted files from Hadoop and dropped them into S3 on a daily/hourly basis. Worked with data governance and data quality teams to design various models and processes.
- Involved in all steps and the scope of the project's reference data approach to MDM; created a data dictionary and mappings from sources to the target in the MDM data model.
- Experience managing Azure Data Lake Storage (ADLS) and Data Lake Analytics and an understanding of how to integrate with other Azure services. Knowledge of U-SQL.
- Responsible for working with various teams on a project to develop analytics-based solution to target customer subscribers specifically.
- Created functions and assigned roles in AWS Lambda to run Python scripts, and built AWS Lambda functions in Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
- Responsible for wide-ranging data ingestion using Sqoop and HDFS commands. Accumulated partitioned data in various storage formats such as text, JSON, and Parquet. Involved in loading data from the Linux file system into HDFS.
- Stored data files in Google Cloud Storage buckets on a daily basis. Used Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.
- Developed a Spark streaming application to read raw packet data from Kafka topics, format it as JSON, and push it back to Kafka for downstream use cases (see the Kafka streaming sketch after this list).
- Developed a high-fidelity Spark/Kafka streaming application that consumes JSON-format packet messages and returns geolocation data to the mobile application for a requested IMEI.
- Started working with AWS for the storage and handling of terabytes of data for customer BI reporting tools.
- Monitored cluster health by setting up alerts using Nagios and Ganglia
- Worked on tickets opened by users regarding various incidents and requests
- Created a Lambda deployment function and configured it to receive events from S3 buckets (see the Lambda handler sketch after this list)
- Wrote UNIX shell scripts to automate jobs and scheduled cron jobs for job automation using Crontab.
- Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer
- Developed Mappings using Transformations like Expression, Filter, Joiner and Lookups for better data messaging and to migrate clean and consistent data
- Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL, and the MLlib libraries.
- Data integration ingests, transforms, and integrates structured data and delivers it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
- Applied various machine learning algorithms and statistical modeling techniques such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVMs, and clustering to identify volume, using the scikit-learn package in Python, R, and MATLAB. Collaborated with data engineers and software developers to develop experiments and deploy solutions to production.
- Created and published multiple dashboards and reports using Tableau Server, and worked on text analytics, Naive Bayes, sentiment analysis, creating word clouds, and retrieving data from Twitter and other social networking platforms.
- Worked on data that was a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
- Tackled a highly imbalanced fraud dataset using undersampling with ensemble methods, oversampling, and cost-sensitive algorithms.
- Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python scikit-learn.
- Designed and developed the architecture for a data services ecosystem spanning relational, NoSQL, and big data technologies.
- Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target system from multiple sources
- Involved in unit testing the code and provided feedback to the developers. Performed unit testing of the application using NUnit.
- Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake schemas.
- Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization
- Optimized the algorithm with stochastic gradient descent; fine-tuned algorithm parameters with manual tuning and automated tuning such as Bayesian optimization.
- Wrote research reports describing the experiments conducted, results, and findings, and made strategic recommendations to technology, product, and senior management. Worked closely with regulatory delivery leads to ensure robustness in prop trading control frameworks using Hadoop, Python Jupyter Notebook, Hive, and NoSQL.
- Wrote production level Machine Learning classification models and ensemble classification models from scratch using Python and PySpark to predict binary values for certain attributes in certain time frame.
- Performed all necessary day-to-day Git support for different projects; responsible for the design and maintenance of the Git repositories and the access control strategies.
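A minimal sketch of the Kafka-to-Kafka formatting job described in the bullets above, written with PySpark Structured Streaming. Topic names, the comma-delimited packet layout, the broker address, and the checkpoint path are assumptions, and the spark-sql-kafka connector package is assumed to be available on the classpath.

```python
# Sketch: read raw packets from Kafka, re-serialize as JSON, publish back to Kafka.
# Topic names, broker, field layout, and checkpoint path are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("packet-formatter").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "raw-packets")
       .load())

# Assume the raw payload is a comma-delimited record: imei,lat,lon,ts.
fields = raw.select(F.split(F.col("value").cast("string"), ",").alias("f"))
formatted = fields.select(
    F.to_json(F.struct(
        F.col("f")[0].alias("imei"),
        F.col("f")[1].alias("lat"),
        F.col("f")[2].alias("lon"),
        F.col("f")[3].alias("ts"),
    )).alias("value"))

query = (formatted.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("topic", "formatted-packets")
         .option("checkpointLocation", "/tmp/checkpoints/packet-formatter")
         .start())
query.awaitTermination()
```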
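A minimal sketch of a Python Lambda handler wired to S3 event notifications, as referenced in the bullet about receiving events from S3 buckets. The downstream processing step is a placeholder.

```python
# Sketch of a Lambda handler triggered by S3 ObjectCreated events.
# The processing step (print) stands in for real downstream logic.
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Reads each newly created object referenced in the S3 event."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        obj = s3.get_object(Bucket=bucket, Key=key)
        body = obj["Body"].read()

        # Placeholder for real processing (parse, validate, forward, etc.).
        print(json.dumps({"bucket": bucket, "key": key, "size": len(body)}))

    return {"statusCode": 200}
```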
Environment: Hadoop, Kafka, Spark, Sqoop, Azure, Docker, Spark SQL, TDD, Spark Streaming, Hive, Scala, Pig, NoSQL, Impala, Oozie, HBase, Data Lake, Zookeeper, Python 3.6, AWS (Glue, Lambda, Step Functions, SQS, CodeBuild, CodePipeline, EventBridge, Athena), Unix/Linux shell scripting, PyCharm, Informatica PowerCenter
Confidential, Jersey City, NJ
Big Data Developer
Responsibilities:
- Responsibilities include gathering business requirements, developing strategy for data cleansing and data migration, writing functional and technical specifications, creating source to target mapping, designing data profiling and data validation jobs in Informatica, and creating ETL jobs in Informatica.
- Worked on Hadoop cluster which ranged from 4-8 nodes during pre-production stage and it was sometimes extended up to 24 nodes during production.
- Built APIs that will allow customer service representatives to access the data and answer queries.
- Designed changes to transform current Hadoop jobs to HBase.
- Handled fixing of defects efficiently and worked with the QA and BA team for clarifications.
- Responsible for Cluster maintenance, Monitoring, commissioning and decommissioning Data nodes, Troubleshooting, Manage and review data backups, Manage & review log files.
- Extended the functionality of Hive with custom UDFs and UDAFs.
- The new Business Data Warehouse (BDW) improved query/report performance, reduced the time needed to develop reports, and established a self-service reporting model in Cognos for business users.
- Implemented bucketing and partitioning using Hive to assist users with data analysis.
- Used Oozie scripts for deployment of the application and Perforce as the secure versioning software.
- Implemented partitioning, dynamic partitions, and buckets in Hive (see the Hive partitioning sketch after this list).
- Develop database management systems for easy access, storage, and retrieval of data.
- Perform DB activities such as indexing, performance tuning, and backup and restore.
- Expertise in writing Hadoop Jobs for analyzing data using Hive QL (Queries), Pig Latin (Data flow language), and custom MapReduce programs in Java.
- Performed various optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Expert in creating Hive UDFs using Java to analyze data efficiently.
- Responsible for loading data from the BDW Oracle database and Teradata into HDFS using Sqoop.
- Implemented AJAX, JSON, and JavaScript to create interactive web screens.
- Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as MongoDB. Involved in loading and transforming large sets of structured, semi-structured, and unstructured data and analyzing them by running Hive queries. Processed image data through the Hadoop distributed system using Map and Reduce, then stored it in HDFS.
- Created Session Beans and controller Servlets for handling HTTP requests from Talend
- Performed data visualization and designed dashboards with Tableau, and generated complex reports including charts, summaries, and graphs to interpret the findings for the team and stakeholders.
- Wrote documentation for each report including purpose, data source, column mapping, transformation, and user group.
- Utilized Waterfall methodology for team and project management
- Used Git for version control with the Data Engineer team and Data Scientist colleagues. Created Tableau dashboards using stacked bars, bar graphs, scatter plots, geographical maps, Gantt charts, etc. with the Show Me functionality, and built dashboards and stories as needed using Tableau Desktop and Tableau Server.
- Responsible for daily communications to management and internal organizations regarding status of all assigned projects and tasks.
- Executed quantitative analysis on chemical products to recommend effective combinations
- Performed statistical analysis using SQL, Python, R Programming and Excel.
- Worked extensively with Excel VBA Macros, Microsoft Access Forms
- Imported, cleaned, filtered, and analyzed data using tools such as SQL, Hive, and Pig.
- Used Python and SAS to extract, transform, and load source data from transaction systems and generated reports, insights, and key conclusions.
- Manipulated and summarized data to maximize possible outcomes efficiently
- Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, allowing end users to understand the data on the fly using quick filters for on-demand information.
- Analyzed and recommended improvements for better data consistency and efficiency
- Designed and developed data mapping procedures and the ETL data extraction, data analysis, and loading process for integrating data using R programming.
- Effectively communicated plans, project status, project risks, and project metrics to the project team, and planned test strategies in accordance with project scope.
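A minimal sketch of the Hive partitioning, dynamic-partition, and bucketing pattern referenced above, issued from Python via PyHive (an assumed client; the original work may have used the Hive CLI or HiveQL scripts directly). Host, database, table, and column names are hypothetical.

```python
# Sketch: create a partitioned, bucketed Hive table and load it with
# dynamic partitions. Connection details and names are placeholders.
from pyhive import hive

cursor = hive.connect(host="hive-server", port=10000, database="default").cursor()

# Partition by load date, bucket by customer id to speed up joins and sampling.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS sales_bucketed (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# Allow dynamic partitions so load_date is derived from the SELECT output.
cursor.execute("SET hive.exec.dynamic.partition=true")
cursor.execute("SET hive.exec.dynamic.partition.mode=nonstrict")

cursor.execute("""
    INSERT OVERWRITE TABLE sales_bucketed PARTITION (load_date)
    SELECT order_id, customer_id, amount, load_date
    FROM staging_sales
""")
```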
Environment: Cloudera CDH4.3, Hadoop, Pig, Hive, Informatica, HBase, MapReduce, HDFS, Sqoop, Impala, SQL, Kafka, Tableau, Python, SAS, Flume, Oozie, Linux.
Confidential, Malvern, PA
Data Engineer
Responsibilities:
- Gathered business requirements, definition and design of the data sourcing, worked with the data warehouse architect on the development of logical data models.
- Created sophisticated visualizations, calculated columns, and custom expressions, and developed map charts, cross tables, bar charts, treemaps, and complex reports involving property controls and custom expressions.
- Investigated market sizing, competitive analysis and positioning for product feasibility. Worked on Business forecasting, segmentation analysis and Data mining.
- Automated Diagnosis of Blood Loss during Emergencies and developed Machine Learning algorithm to diagnose blood loss.
- Extensively used Agile methodology as the Organization Standard to implement the data Models. Used Micro service architecture with Spring Boot based services interacting through a combination of REST and Apache Kafka message brokers.
- Created several types of data visualizations using Python and Tableau. Extracted large volumes of data from AWS using SQL queries to create reports.
- Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database.
- Analyzed functional and non-functional business requirements and translate into technical data requirements and create or update existing logical and physical data models. Developed a data pipeline using Kafka to store data into HDFS.
- Performed regression testing for golden test cases from the state (end-to-end test cases) and automated the process using Python scripts.
- Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying
- Generated graphs and reports using the ggplot2 package in RStudio for analytical models. Developed and implemented an R Shiny application showcasing machine learning for business forecasting.
- Developed predictive models using Decision Tree, Random Forest, and Naïve Bayes.
- Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python for developing various machine learning algorithms. Expertise in R, MATLAB, Python, and their respective libraries.
- Researched reinforcement learning and control (TensorFlow, Torch) and machine learning models (scikit-learn).
- Hands-on experience implementing Naive Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, and Principal Component Analysis.
- Performed K-means clustering, regression, and decision trees in R. Worked on data cleaning and reshaping, and generated segmented subsets using NumPy and Pandas in Python.
- Implemented various statistical techniques to manipulate the data, such as missing data imputation, principal component analysis, and sampling.
- Worked on R packages to interface with Caffe Deep Learning Framework. Perform validation on machine learning output from R.
- Applied different dimensionality reduction techniques such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) to the feature matrix.
- Performed univariate and multivariate analysis on the data to identify any underlying pattern in the data and associations between the variables.
- Responsible for design and development of Python programs/scripts to prepare transform and harmonize data sets in preparation for modeling.
- Worked with Market Mix Modeling to strategize the advertisement investments to better balance the ROI on advertisements.
- Implemented clustering techniques like DBSCAN, K-means, K-means++ and Hierarchical clustering for customer profiling to design insurance plans according to their behavior pattern.
- Used grid search to evaluate the best hyper-parameters for the model and the k-fold cross-validation technique to train the model for best results (see the grid search sketch after this list).
- Worked with Customer Churn Models including Random forest regression, lasso regression along with pre-processing of the data.
- Used Python 3.X(NumPy, SciPy, pandas, scikit-learn, seaborn) and Spark 2.0 (PySpark, MLlib) to develop variety of models and algorithms for analytic purposes.
- Performed data cleaning, feature scaling, and feature engineering using the Pandas and NumPy packages in Python, and built models using deep learning frameworks
- Applied various machine learning algorithms and statistical models such as Decision Tree, Text Analytics, Sentiment Analysis, Naive Bayes, Logistic Regression, and Linear Regression using Python to determine the accuracy rate of each model
- Implemented univariate, bivariate, and multivariate analysis on the cleaned data to derive actionable insights on the 500-product sales data using visualization techniques in Matplotlib, Seaborn, and Bokeh, and created reports in Power BI.
- Decommissioning nodes and adding nodes in the clusters for maintenance
- Adding new users and groups of users as per the requests from the client
- Working on tickets opened by users regarding various incidents, requests
- Involved in creating Hive tables, loading with data and writing hive queries which will run internally in Map Reduce way.
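A minimal sketch of the grid search with k-fold cross-validation mentioned above, using scikit-learn. The random-forest estimator, the parameter grid, and the synthetic data are illustrative choices rather than project details.

```python
# Sketch: hyper-parameter tuning with GridSearchCV and k-fold cross-validation.
# Estimator, grid, and data are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 10, None],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc",
    n_jobs=-1,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV score:", round(search.best_score_, 4))
```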
Environment: Spark, Redshift, Python, HDFS, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile Methodology.
Confidential
Hadoop/ Spark Developer
Responsibilities:
- Involved in various phases of development analyzed and developed the system going through Agile Scrum methodology.
- Responsible for data extraction and data ingestion from different data sources into Hadoop Data Lake by creating ETL pipelines using Pig and Hive.
- Built pipelines to move hashed and un-hashed data from XML files to Data lake.
- Developed Spark scripts using Python on Azure HDInsight for data aggregation and validation, and verified their performance against MR jobs (see the PySpark aggregation sketch after this list).
- Extensively worked with Spark-SQL context to create data frames and datasets to preprocess the model data.
- Experience with Cloud Service Providers such as Amazon AWS, Microsoft Azure, and Google GCP
- Data analysis: expertise in analyzing data using Pig scripting, Hive queries, Spark (Python), and Impala.
- Experienced in loading and transforming large sets of Structured, Semi-Structured and Unstructured data and analyzed them by running Hive queries and Pig scripts.
- Involved in designing the row key in HBase to store Text and JSON as key values in HBase table and designed row key in such a way to get/scan it in a sorted order.
- Wrote JUnit tests and integration test cases for those microservices.
- Worked in Azure environment for development and deployment of Custom Hadoop Applications.
- Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
- Developed NiFi workflow to pick up the multiple files from ftp location and move those to HDFS on daily basis.
- Scripting: Expertise in Hive, PIG, Impala, Shell Scripting, Perl Scripting, and Python.
- Developed business logic using Kafka Direct Stream in Spark Streaming and implemented business transformations.
- Proven experience with ETL frameworks such as Airflow and Luigi.
- Created Hive schemas using performance techniques like partitioning and bucketing.
- Used Hadoop YARN to perform analytics on data in Hive.
- Developed and maintained batch data flow using HiveQL and Unix scripting
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
- Build large-scale data processing systems in data warehousing solutions, and work with unstructured data mining on NoSQL.
- Specified the cluster size, allocating Resource pool, Distribution of Hadoop by writing the specification texts in JSON File format.
- Extracted and loaded data into Data Lake environment (MS Azure) by using Sqoop which was accessed by business users.
- Primarily involved in Data Migration process using Azure by integrating with GitHub repository and Jenkins.
- Developed workflow in Oozie to manage and schedule jobs on Hadoop cluster to trigger daily, weekly and monthly batch cycles.
- Configured Hadoop tools like Hive, Pig, Zookeeper, Flume, Impala and Sqoop.
- Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
- Queried both Managed and External tables created by Hive using Impala.
- Developed customized Hive UDFs and UDAFs in Java, JDBC connectivity with hive development and execution of Pig scripts and Pig UDF’s.
- Used windows Azure SQL reporting services to create reports with tables, charts and maps.
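A minimal sketch of the kind of PySpark aggregation and validation script referenced above. The input path, schema fields, and validation rule are assumptions.

```python
# Sketch: validate incoming records, then aggregate daily totals per account.
# Paths, column names, and the validation rule are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("aggregation-validation").getOrCreate()

# Illustrative HDInsight blob-storage path.
events = spark.read.json("wasbs:///data/events/")

# Basic validation: drop records missing a key or carrying a negative amount.
valid = events.filter(F.col("account_id").isNotNull() & (F.col("amount") >= 0))
rejected_count = events.count() - valid.count()

# Aggregation: daily totals per account.
daily_totals = (valid
                .groupBy("account_id", F.to_date("event_ts").alias("event_date"))
                .agg(F.sum("amount").alias("total_amount"),
                     F.count(F.lit(1)).alias("txn_count")))

(daily_totals.write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("wasbs:///curated/daily_totals/"))

print(f"rejected records: {rejected_count}")
```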
Environment: Hadoop 3.0, Spark 2.4, Hive 2.3, Azure, Microservices, AWS, Java 8, MapReduce, Agile, HBase 1.2, JSON, Kafka, JDBC, Pig 0.17