
GCP Big Data Engineer Resume


St. Louis

SUMMARY

  • 7+ years of experience as a Data Engineer and Data Modeler with excellent knowledge of BI, data warehousing, ETL, cloud, and big data technologies, and hands-on experience in IT data analytics projects.
  • More than 2 years of professional experience with healthcare clients and 3+ years in the financial/banking domain.
  • In-depth knowledge of Hadoop architecture and its components such as YARN, HDFS, NameNode, DataNode, JobTracker, ApplicationMaster, ResourceManager, TaskTracker, and the MapReduce programming paradigm.
  • Extensive experience in Hadoop development of enterprise level solutions utilizing Hadoop components such as Apache Spark, MapReduce, HDFS, Sqoop, PIG, Hive, HBase, Oozie, Flume, Nifi, Kafka, Zookeeper, and YARN.
  • More than 2 years of hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud services such as BigQuery, Cloud Dataproc, Google Cloud Storage, Composer, Dataflow, Pub/Sub, and Cloud SQL.
  • Hands-on experience with GCP: GCS buckets, Cloud Functions, Cloud Dataflow, gsutil, bq command-line utilities, and Stackdriver.
  • Profound experience in performing Data Ingestion, Data Processing (Transformations, enrichment, and aggregations). Strong Knowledge on Architecture of Distributed systems and parallel processing, In-depth understanding of MapReduce programming paradigm and Spark execution framework.
  • Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow (see the Airflow sketch at the end of this summary).
  • Strong experience working with AWS services such as S3, Glue, EMR, Redshift, and RDS.
  • In-depth knowledge of Snowflake database, schema, and table structures; defined virtual warehouse sizing in Snowflake for different types of workloads.
  • Experience with Snowflake cloud data warehouse and AWS S3 bucket.
  • Responsible for implementing monitoring solutions with Docker and Jenkins.
  • Experience with infrastructure-as-a-service (preferably AWS) and Docker.
  • Hands-on experience on Performance tuning of Big Data workloads.
  • Basic knowledge of and experience in database security (Oracle, MySQL, Sybase, DB2).
  • Expert knowledge and understanding of Oracle database technologies.
  • Experienced with the OpenShift platform for managing Docker containers.
  • Created development and test environments for different applications by provisioning Kubernetes clusters on AWS using Docker.
  • Hands-on experience with Snowflake utilities (SnowSQL, Snowpipe) and big data modelling techniques using Python.
  • Expertise in maintaining and enhancing existing services, applications, and platforms, including but not limited to bug fixes, feature enhancements, and performance tuning.
  • Experience in resolving on-going maintenance issues and bug fixes, monitoring Informatica sessions as well as performance tuning of mappings and sessions.
  • Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analysing and transforming the data to uncover insights into customer usage patterns.
  • Experience in Data analysis, Database programming (Stored procedures; Triggers, Views), Table Partitioning, performance tuning.
  • In-depth knowledge of AWS cloud services such as Compute, Networking, Storage, and Identity & Access Management.
  • Worked on scalable distributed data system using Hadoop ecosystem in AWS EMR, MapR distribution.
  • Experienced in improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, the DataFrame API, Spark Streaming, and pair RDDs; worked extensively with PySpark and Scala.
  • Experienced with big data technologies such as Hadoop (MapReduce and Hive), Sqoop, HDFS, Spark Streaming with Kafka, and NoSQL databases such as MongoDB and Cassandra.
  • Experience with Apache Airflow, including installing, configuring, and monitoring an Airflow cluster.
  • Good exposure to cloud technologies, including AWS services such as Athena, Lambda, Step Functions, and SQL.
  • Handled ingestion of data from different sources into HDFS using Sqoop and Flume, performed transformations using Hive and MapReduce, and loaded the results into HDFS. Managed Sqoop jobs with incremental loads to populate Hive external tables. Experienced in importing streaming data into HDFS using Flume sources and sinks and transforming the data with Flume interceptors.
  • Implemented security requirements for Hadoop and integrated with the Kerberos authentication infrastructure: KDC server setup, realm/domain creation, and ongoing management.
  • Highly proficient in data modelling, retaining RDBMS concepts, logical and physical data modelling up to Third Normal Form (3NF), and multidimensional data modelling schemas (star schema, snowflake modelling, facts and dimensions).
  • Expertise in data analysis, design, development, implementation, and testing using data conversions, extraction, transformation, and loading (ETL) with SQL Server, Oracle, and other relational and non-relational databases; experienced in developing ETL applications on large volumes of data using tools such as MapReduce, Spark (Scala), PySpark, Spark SQL, and Pig.
  • Experience with commercial BI tools such as Tableau and Cognos, as well as open-source alternatives, and experienced working with data sources such as Oracle, SQL Server, DB2, Teradata, and Netezza.
  • Demonstrable architecture experience, specifically with standard CRM packages and within the financial services industry.
  • Sound knowledge of object-oriented programming in Python.
  • Highly Skilled at Python coding using SQL, NumPy, Pandas and Spark for Data Analysis and Model building, deploying, and operating available, scalable, and fault-tolerant systems using Amazon Web Services (AWS).
  • Experience with various ETL engines, such as Oracle Warehouse Builder (OWB) and Microsoft SQL Server Integration Services (SSIS), as well as strong experience writing complete ETL processes with PL/SQL and scripting languages (Bash/Python).
  • Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL server, and PostgreSQL. Created Java apps to handle data in MongoDB and HBase. Used Phoenix to create SQL layer on HBase.
  • Experience in designing and creating RDBMS Tables, Views, User Created Data Types, Indexes, Stored Procedures, Cursors, Triggers and Transactions.
  • Expert in designing ETL data flows, creating mappings/workflows to extract data from SQL Server, and performing data migration and transformation from Oracle, Access, and Excel sheets using SQL Server SSIS.
  • Expert in designing Parallel jobs using various stages like Join, Merge, Lookup, remove duplicates, Filter, Dataset, Lookup file set, Complex flat file, Modify, Aggregator, XML.
  • Experienced in Consolidating and auditing Metadata from disparate tools and sources, including business intelligence (BI), extract, transform, and load (ETL), relational databases, modelling tools, and third-party metadata into a single repository.
  • Experienced in using distributed computing architectures such as AWS products (e.g. S3, EC2, Redshift, and EMR), Hadoop and effective use of Map-Reduce, SQL and Cassandra to solve Big Data type problems.
  • Expertise in writing SQL Queries, Dynamic-queries, sub-queries and complex joins for generating Complex Stored Procedures, Triggers, User-defined Functions, Views and Cursors and extensive experience in advanced SQL Queries and PL/SQL stored procedures.
  • Experienced in deploying and scheduling SSRS reports to generate daily, weekly, monthly, and quarterly reports, including current status, and in designing and deploying reports with drill-down, drill-through, drop-down, parameterized, and linked options.
  • Experienced in fact dimensional modelling (Star schema, Snowflake schema), transactional modelling and SCD (Slowly changing dimension)
  • Experienced with JSON-based RESTful web services and XML/QML-based SOAP web services; also worked on various applications using Python-integrated IDEs such as Sublime Text and PyCharm.
  • Developed web-based applications using Python, DJANGO, QT, C++, XML, CSS3, HTML5, DHTML, JavaScript and jQuery.
  • Extensive working experience in Normalization and De-Normalization techniques for both OLTP and OLAP systems in creating Database Objects like tables, Constraints (Primary key, Foreign Key, Unique, Default), Indexes.
  • Experienced in working in SDLC, Agile and Waterfall Methodologies.
  • Building and productionizing predictive models on large datasets by utilizing advanced statistical modelling, machine learning, or other data mining techniques.
  • Developed intricate algorithms based on deep-dive statistical analysis and predictive data modelling that were used to deepen relationships, strengthen longevity and personalize interactions with customers.
  • Strong skills in analytical, presentation, communication, problem solving with the ability to work independently as well as in a team and had the ability to follow the best practices and principles defined for the team.
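
A minimal sketch of the kind of Airflow job scheduling described in this summary, assuming Airflow 2.x; the DAG id, owner, and bash commands are illustrative placeholders rather than actual production jobs.

```python
# Minimal Airflow DAG sketch: two chained tasks on a daily schedule.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",        # hypothetical owner
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_incremental_load",    # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pull incremental data'",
    )
    load = BashOperator(
        task_id="load_to_warehouse",
        bash_command="echo 'load into warehouse'",
    )
    extract >> load                      # load runs only after extract succeeds
```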

TECHNICAL SKILLS

Big Data Technologies: Hadoop, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, YARN, Apache Spark, Mahout, Spark MLlib

Databases: Oracle, MySQL, SQL Server, MongoDB, Cassandra, DynamoDB, PostgreSQL, Teradata, Cosmos

Programming: Python, PySpark, Scala, C, C++, Shell script, Perl script, SQL

Cloud Technologies: AWS, Google Cloud Platform (Cloud Storage, BigQuery, Composer, Cloud Dataproc, Cloud SQL, Cloud Functions, Cloud Pub/Sub, Dataflow)

Frameworks: Django REST framework, MVC, Hortonworks

Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistance, Postman

ETL Tool: Informatica PowerCenter 9.1/8.6/8.5/8.1/7.1

Versioning tools: SVN, Git, GitHub

Operating Systems: Windows 7/8/XP/2008/2012, Ubuntu Linux, MacOS

Network Security: Kerberos

Database Modelling: Dimension Modelling, ER Modelling, Star Schema Modelling, Snowflake Modelling

Monitoring Tool: Apache Airflow

Visualization/ Reporting: Tableau, ggplot2, matplotlib, MS Office, SSRS and Power BI

Build Tools: Jenkins, Toad, SQL Loader, PostgreSQL, Talend, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI

Machine Learning Techniques: Linear & Logistic Regression, Classification and Regression Trees, Random Forest, Associative rules, NLP and Clustering

Web Technologies: HTML, JavaScript, CSS, J2EE, jQuery

Methodologies: Agile, Waterfall

PROFESSIONAL EXPERIENCE

Confidential

GCP BIG DATA ENGINEER

Responsibilities:

  • Working on building data pipelines and end-to-end ETL processes for ingesting data into GCP.
  • Running Dataflow jobs, written with Apache Beam in Python, to perform heavy historical data loads.
  • Designed and automated the pipeline to transfer the data to Stakeholders which provides centralized KPIs and reports for CVS.
  • Building multiple programs with Python and Apache Beam and executing them in Cloud Dataflow to run data validation between raw source files and BigQuery tables.
  • Loading historical data for multiple resources into BigQuery tables.
  • Working on cleansing the raw data for Immunization, Provenance, Patient, and Practitioner resources to load into the FHIR viewer.
  • Optimized and fine-tuned the scripts to achieve efficient processing in cloud data platform and FHIR load.
  • Used the Whistle mapping language along with pandas in Python, and surfaced the data in Bigtable for stakeholders' ad hoc querying.
  • Monitoring BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver across all environments.
  • Analyzing various types of raw files (JSON, CSV, XML, RRF) with Python using pandas, NumPy, etc.
  • Processing and loading bounded and unbounded data from Google Pub/Sub to BigQuery using Cloud Dataflow with Python (a minimal Beam sketch follows this list).
  • Creating Pub/Sub topics and configuring them in the codebase.
  • Reading data from BigQuery tables and publishing it to Pub/Sub topics.
  • Developing python scripts to validate flat file with control files.
  • Performing parallel processing to load the data into FHIR and Big Query.
  • Perform data profiling, analysis, and mining activities by writing queries using DB tools such as SQL Developer, SQL Server, and PostgreSQL.
  • Developed guidelines for Airflow cluster and DAGs. Performance tuning of the DAGs and task implementation.
  • Leveraged Google Cloud Platform Services to process and manage the data from streaming and file-based sources.
  • Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
  • Using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and YARN.
  • Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that receives data from Kafka in real time and persists it to Cassandra.
  • Developed a Kafka consumer API in Python for consuming data from Kafka topics.
  • Consumed Extensible Markup Language (XML) messages using Kafka and processed the XML files using Spark Streaming to capture User Interface (UI) updates.
  • Hands-on experience running Dataflow jobs and working on Dataproc clusters.
  • Created Spark jobs to extract data from Hive tables and process it using Dataproc.
  • Performed historical and incremental data loads to Cloud Storage using Hadoop utilities and loaded the data into BigQuery using the bq command-line tool.
  • Writing a Python program that extracts zip files from the NIH website to refresh the data every month, applies the required filters and joins using SQL queries, and writes the output to a JSON file.
  • Scheduling multiple jobs using Airflow.
  • Building data pipelines in Airflow on GCP for ETL and Dataflow jobs with daily incremental loads, using different Airflow operators.
  • Used HiveQL to analyse the partitioned and bucketed data stored in Hive and executed Hive queries on Parquet tables to perform data analysis meeting the business specification logic.
  • Used Apache Kafka to aggregate web log data from multiple servers and make it available in downstream systems for data analysis and engineering roles.
  • Worked on implementing Kafka security and boosting its performance.
  • Wrote several MapReduce jobs using PySpark and NumPy and used Jenkins for continuous integration.
  • Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Hive, Pig, and MapReduce access for new users.
  • Experience using Avro, Parquet, RCFile, and JSON file formats; developed UDFs in Hive.
  • Developed custom UDFs in Python and used them for sorting and preparing the data.
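
A minimal sketch of the streaming Pub/Sub-to-BigQuery Dataflow job referenced in the list above, assuming the apache-beam[gcp] package; the project, topic, bucket, and table names are illustrative placeholders, not the actual pipeline.

```python
# Minimal Beam sketch: read JSON messages from Pub/Sub, write rows to BigQuery.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    options = PipelineOptions(
        streaming=True,
        project="my-project",                 # placeholder project id
        runner="DataflowRunner",              # use "DirectRunner" for local tests
        region="us-central1",
        temp_location="gs://my-bucket/tmp",   # placeholder bucket
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/patient-events")
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:staging.patient_events",   # placeholder table
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```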

Confidential, St. Louis

Sr. Data Engineer

Responsibilities:

  • Responsible for building the ETL (Extract, Transform, Load) pipelines from the data lake to different databases based on the requirements.
  • The current project involves cloud migration from Oracle (on-prem) to GCP. Developed and automated the data migration process using Python and shell scripts, dealing with complex applications and frameworks and considering key runtime characteristics such as low latency and availability.
  • Worked on data extraction to CSV files using Python. These files are encrypted with PGP and moved to GCS cloud storage using a file mover integrated in Java, then downloaded, decrypted, and imported into Cloud SQL.
  • Worked on Data Replication using Striim application.
  • Ensured successful creation of the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources using Spark, SQL, HDFS, Hive, MapReduce, Druid, Python, Unix, Hue, and shell scripting.
  • Understanding of the Airflow REST API and worked on integrating Airflow into the platform ecosystem.
  • Worked on orchestrating Airflow workflows in a hybrid cloud environment.
  • Worked on creating data ingestion processes to maintain a Global Data Lake on GCP and BigQuery.
  • Automated multiple Dataflow jobs using shell scripting; the jobs run and are integrated in Python.
  • Expertise in handling huge volumes (TBs) of data, writing efficient code using Spark SQL and Beam, and loading the data into Google Cloud Storage to lower run time during the cloud migration.
  • Built the catalog tables using batch processing and multiple complex joins to combine dimension tables of the store transactions and the e-commerce transactions, which have millions of records every day.
  • Participated in maintaining Oracle Data Guard databases using OEM 12c Grid Control and Data Guard Broker with the database administration team.
  • Developed proof-of-concept prototype with faster iterations to develop and maintain design documentation, test cases, monitoring and performance evaluations using Git, Putty, Maven, Confluence, ETL, Automatic, Zookeeper, Cluster Manager.
  • Used the Airflow Scheduler for turning the Python files contained in the DAGs folder into DAG objects that contain tasks to be scheduled.
  • Worked on building data applications on the Snowflake Cloud Data Platform, developing strategies such as selecting virtual warehouse sizes by service or feature, adjusting cluster counts to match the expected workloads, and targeting workloads to the right services.
  • Worked on tuning the Snowflake database to maximise the query performance.
  • Possesses good domain knowledge from working with one of the largest credit reporting agencies, along with knowledge of analytics and building custom frameworks to support data-engineering needs in the team.
  • Modified and implemented existing Dataflow jobs to solve data-related issues and validated the results.
  • Performed data quality checks to ensure high quality of data and data cleansing using python.
  • Designed the whole test strategy for implementing data importing jobs to GCP and solved importing issues.
  • Involved in automating read, extract, transform, and load data using data flow jobs written using python and shell.
  • Performed data analysis according to the requirement to troubleshoot data related issues and assist to resolve the issues.
  • Implemented multiple configurations by filtering the data according to the requirement. Specifically, to clean the data to remove errors and check consistency and completeness of the data.
  • Involved in data extraction, data migration, data validation, data encryption, data decryption, and data replication from on-prem to GCP, including bi-directional replication.
  • Extracting data using Python scripts and parallelizing the extraction with Python's multiprocessing module, sized to the number of processors on the machine, to reduce complexity and run time when handling huge data volumes; also involved in splitting large CSV files into chunks (see the multiprocessing sketch after this list).
  • Automated the data quality rules and de-duplication processes to keep the data lake more accurate.
  • Used shell scripting to automate the validations between different databases for each module and report data quality to users, using the Aorta and Unified Data Pipeline frameworks.
  • Predominantly worked with Google Cloud Platform (GCP) services: Compute Engine for hosting a .NET app on IIS (app server); Cloud SQL for PostgreSQL (SSS DB and Lightbox DB) as the database; Internal Load Balancer for application server endpoints; HTTP Load Balancer; Stackdriver for logging and monitoring; VPC and other shared-services VPCs; IAM; DNS; and KMS.
  • Modifying and creating new workflows in Automatic to schedule Hive queries, Hadoop commands, and shell scripts on a daily basis.
  • Worked on a migration project to migrate data from different sources to Google Cloud Platform (GCP) using UDP framework and transforming the data using Spark Scala scripts.
  • Troubleshoot and analyze complex production problems related to data, network file delivery, and application issues independently identified by the business owners, and provide solutions for recovery.
  • Improving the performance and optimizing existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Successfully created the linked services on both the source and destination servers.
  • Created automated workflows with the help of triggers.
  • Used DDL and DML for writing triggers, stored procedures, and data manipulation.
  • Set up SQL Server linked servers to connect multiple servers/databases.
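
A minimal sketch of the parallel extraction approach mentioned in the list above, using Python's multiprocessing module sized to the machine's CPU count; the table list and the stand-in extract_table body are hypothetical placeholders for the real database client code.

```python
# Minimal sketch: extract several tables to CSV in parallel worker processes.
import csv
import multiprocessing as mp
from pathlib import Path

TABLES = ["customers", "orders", "payments"]    # hypothetical table list


def extract_table(table: str) -> str:
    """Stand-in extraction: write one CSV per table and return its path."""
    out_path = Path(f"/tmp/{table}.csv")
    rows = [["id", "value"], ["1", "example"]]   # placeholder for a DB cursor
    with out_path.open("w", newline="") as fh:
        csv.writer(fh).writerows(rows)
    return str(out_path)


def main():
    workers = mp.cpu_count()                     # one worker per processor
    with mp.Pool(processes=workers) as pool:
        for path in pool.imap_unordered(extract_table, TABLES):
            print(f"extracted {path}")


if __name__ == "__main__":
    main()
```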

Environment: Hadoop, GCP, Google cloud storage, Big Query, Python, Shell Scripting, Data proc, Spark, Airflow, Scala, Teradata, Hive, Aorta, Sqoop, SQL, DB2, UDP, GitHub, Striim software.

Confidential, Greater NYC, NY

Sr. Data Engineer

Responsibilities:

  • Extensive experience in working with AWS cloud Platform (EC2, S3, EMR, Redshift, Lambda and Glue).
  • Working knowledge of Spark RDD, Data Frame API, Data set API, Data Source API, Spark SQL and Spark Streaming.
  • Built S3 buckets, managed policies for S3 buckets, and used S3 and Glacier for storage and backup on AWS.
  • Worked on integrating services such as GitHub, AWS CodePipeline, Jenkins, and AWS Elastic Beanstalk to create a deployment pipeline.
  • Managed servers on the Amazon Web Services (AWS) platform instances using Puppet configuration management.
  • Developed many data warehouse solutions in AWS Redshift; involved in Redshift database development by inserting bulk records, copying data from S3, creating and managing clusters and tables, and performing data analysis queries. Also performed tuning and query optimization in AWS Redshift.
  • Used AWS Lambda as a serverless backend with Python 3.6 and the boto3 library, and implemented Lambda concurrency with DynamoDB Streams to trigger multiple Lambdas in parallel (a Lambda sketch follows this list).
  • Developed a data warehouse model in Snowflake for over 80 datasets using WhereScape.
  • Worked on AWS CodePipeline, CodeDeploy, CodeBuild, and CodeCommit for setting up continuous integration and deployments.
  • Worked on integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
  • Redesigned the views in Snowflake to increase performance.
  • Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop.
  • Using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
  • Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that receives data from Kafka in real time and persists it to Cassandra.
  • Developed a Kafka consumer API in Python for consuming data from Kafka topics.
  • Consumed Extensible Markup Language (XML) messages using Kafka and processed the XML files using Spark Streaming to capture User Interface (UI) updates.
  • In this project, the data warehouse was built using Informatica PowerCenter 8.6.1, extracting data from various sources including flat files, SAP ABAP, and Teradata.
  • Developed a pre-processing job using Spark DataFrames to flatten JSON documents into flat files (a PySpark sketch follows this list).
  • Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
  • Experienced in writing live real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
  • Used AWS Redshift, S3, Spectrum, and Athena services to query large amounts of data stored in S3 to create a virtual data lake without going through an ETL process.
  • Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for data sets processing and storage.
  • Experienced in Maintaining the Hadoop cluster on AWS EMR.
  • Loaded data into S3 buckets using AWS Glue and PySpark. Involved in filtering data stored in S3 buckets using Elastic search and loaded data into Hive external tables.
  • Responsible for developing data pipeline with Amazon AWS to extract the data from weblogs and store in Amazon EMR.
  • Configured Snowpipe to pull data from S3 buckets into Snowflake tables.
  • Stored incoming data in the Snowflake staging area.
  • Created numerous ODI interfaces to load data into the Snowflake DB; worked on Amazon Redshift to consolidate all data warehouses into one data warehouse.
  • Used the Airflow UI to define connections.
  • Worked on establishing connections to the Redshift database using connection IDs for applications that use Airflow for workflow management.
  • Good understanding of Cassandra architecture, replication strategy, gossip, snitches etc.
  • Worked on Informatica Power Center tools- Designer, Repository Manager, Workflow Manager, and Workflow Monitor.
  • Maintained stored definitions, transformation rules and targets definitions using Informatica repository Manager.
  • Used various transformations like Filter, Expression, Sequence Generator, Update Strategy, Joiner, Stored Procedure, and Union to develop robust mappings in the Informatica Designer.
  • Created and Configured Workflows and Sessions to transport the data to target warehouse Oracle tables using Informatica Workflow Manager
  • Designed columnar families in Cassandra and Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra as per the business requirement.
  • Used the Spark Data Cassandra Connector to load data to and from Cassandra.
  • Worked from scratch on Kafka configuration, including managers and brokers.
  • Used Informatica PowerCenter to create mappings and mapplets that transform the data according to the business rules.
  • Involved in extracting data from flat files, Oracle, SQL, and DB2 into the Operational Data Source (ODS); the data from the ODS was extracted, transformed, and loaded with business logic applied into the Global Data Warehouse using Informatica PowerCenter 9.1.0 tools.
  • Experienced in creating data models for clients' transactional logs and analysed data from Cassandra tables for quick searching, sorting, and grouping using the Cassandra Query Language.
  • Tested cluster performance using the cassandra-stress tool to measure and improve read/write performance.
  • Used HiveQL to analyze the partitioned and bucketed data stored in Hive and executed Hive queries on Parquet tables to perform data analysis meeting the business specification logic.
  • Used Apache Kafka to aggregate web log data from multiple servers and make it available in downstream systems for data analysis and engineering roles.
  • Worked on implementing Kafka security and boosting its performance.
  • Experience using Avro, Parquet, RCFile, and JSON file formats; developed UDFs in Hive.
  • Developed custom UDFs in Python and used them for sorting and preparing the data.
  • Worked on custom loaders and storage classes in Pig to handle several data formats such as JSON, XML, and CSV, generating bags for processing.
  • Developed Sqoop and Kafka jobs to load data from RDBMS and external systems into HDFS and Hive.
  • Developed Oozie coordinators to schedule Hive scripts and create data pipelines.
  • Wrote several MapReduce jobs using PySpark and NumPy and used Jenkins for continuous integration.
  • Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Hive, Pig, and MapReduce access for new users.
  • Continuous monitoring and managing the Hadoop cluster through Cloudera Manager.
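
A minimal sketch of a Python Lambda handler for the DynamoDB Streams trigger pattern mentioned above, assuming boto3 is available in the Lambda runtime; the bucket name and key prefix are illustrative placeholders.

```python
# Minimal Lambda sketch: archive DynamoDB stream records as JSON objects in S3.
import json

import boto3

s3 = boto3.client("s3")                        # reused across warm invocations
ARCHIVE_BUCKET = "my-archive-bucket"           # placeholder bucket


def handler(event, context):
    """Archive each INSERT/MODIFY stream record to S3 and report the count."""
    archived = 0
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue
        new_image = record["dynamodb"].get("NewImage", {})
        key = f"changes/{record['eventID']}.json"   # placeholder key prefix
        s3.put_object(
            Bucket=ARCHIVE_BUCKET,
            Key=key,
            Body=json.dumps(new_image).encode("utf-8"),
        )
        archived += 1
    return {"archived": archived}
```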
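
A minimal sketch of flattening nested JSON with Spark DataFrames, as described in the pre-processing bullet above; the input path, column names, and output path are illustrative placeholders.

```python
# Minimal PySpark sketch: explode a nested array and pull nested fields up.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("flatten-json").getOrCreate()

# Nested JSON, one record per line, e.g.
# {"id": 1, "user": {"name": "a"}, "events": [{"type": "click"}]}
df = spark.read.json("s3://my-bucket/raw/events/")      # placeholder path

flat = (
    df.withColumn("event", explode(col("events")))      # one row per array item
      .select(
          col("id"),
          col("user.name").alias("user_name"),          # nested field flattened
          col("event.type").alias("event_type"),
      )
)

# Write a single delimited flat file partition (CSV) for downstream loads.
flat.coalesce(1).write.mode("overwrite").option("header", True).csv(
    "s3://my-bucket/flat/events/"                        # placeholder path
)
```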

Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, PySpark, Shell scripting, Snowflake, Apache Airflow, Informatica PowerCenter 9.1/8.6/8.5/8.1/7.1, Workflow Manager, Workflow Monitor, Informatica PowerConnect/PowerExchange, Linux, MySQL, Oracle Enterprise DB, SOLR, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, Soap, Cassandra and Agile Methodologies.

Confidential, Fort Worth, TX

Sr. Data Engineer

Responsibilities:

  • Designed and Implemented Big Data Analytics architecture, transferring data from Oracle.
  • Created DDLs for tables and executed them to create tables in the warehouse for ETL data loads.
  • Implemented logical and physical relational databases and maintained database objects in the data model using Erwin.
  • Designed, implemented, and maintained database schemas, entity relationship diagrams, data models, tables, stored procedures, functions and triggers, constraints, clustered and non-clustered indexes, partitioned tables, views, rules, defaults, and complex SQL statements for business requirements and performance enhancement.
  • Developed data pipeline using Flume, Sqoop, Pig and Java map reduce and Spark to ingest customer behavioural data and purchase histories into HDFS for analysis.
  • Involved in forward engineering of the logical models to generate the physical model using Erwin, with subsequent deployment to the enterprise data warehouse.
  • Worked on big data with AWS cloud services, i.e., EC2, S3, EMR, and DynamoDB.
  • Managed security groups on AWS, focusing on high-availability, fault-tolerance, and auto scaling using Terraform templates. Along with Continuous Integration and Continuous Deployment with AWS Lambda and AWS code pipeline.
  • Wrote various data normalization jobs for new data ingested into Redshift.
  • Created various complex SSIS/ETL packages to extract, transform, and load data.
  • Advanced knowledge of Confidential Redshift and MPP database concepts.
  • Migrated the on-premises database structure to the Confidential Redshift data warehouse.
  • Was responsible for ETL and data validation using SQL Server Integration Services.
  • Defined and deployed monitoring, metrics, and logging systems on AWS.
  • Created and managed a Docker deployment pipeline for custom application images in the AWS cloud using Jenkins.
  • Experience using Docker containers in implementing a high-level API to provide lightweight containers that run the process.
  • Worked on the creation of custom Docker container images, tagging and pushing the images to the Docker repository.
  • Extensively worked on creating Dockerfiles, building the images, running Docker containers, and managing Dockerized applications using Docker Cloud. Used Docker Swarm for clustering and scheduling Docker containers.
  • Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad hoc queries. This allowed for a more reliable and faster reporting interface, giving sub-second query response for basic queries.
  • Worked on publishing interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.
  • Designed the data marts using the Ralph Kimball's Dimensional Data Mart modelling methodology using Erwin.
  • Exporting the analyzed and processed data to the RDBMS using Sqoop for visualization and for generation of reports for the BI team.
  • Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
  • Worked on designing, building, deploying, and maintaining MongoDB.
  • Designed SSIS packages to bring data from existing OLTP databases into the new data warehouse using various transformations and tasks such as Sequence Container, Script, For Loop and Foreach Loop Container, Execute SQL/Package, Send Mail, File System, Conditional Split, Data Conversion, Derived Column, Lookup, Merge Join, Union All, OLE DB source and destination, and Excel source and destination, with multiple data flow tasks.
  • Developed an ETL framework using Spark and Hive (including daily runs, error handling, and logging) to produce useful data.
  • Coordinated with the team and developed a framework to generate daily ad hoc reports and extracts from enterprise data, automated using Oozie.
  • Improved the performance of SSIS packages by implementing parallel execution, removing unnecessary sorting, and using optimized queries and stored procedures.
  • Optimized existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
  • Developed pipeline for POC to compare performance/efficiency while running pipeline using the AWS EMR Spark cluster and Cloud Dataflow on GCP.
  • Configure and manage data sources, data source views, cubes, dimensions, mining structures, roles, defined hierarchy, and usage-based aggregations with SSAS.
  • Responsible for maintaining and tuning existing cubes using SSAS and Power BI.
  • Worked on cloud deployments using maven, docker and Jenkins.
  • Designed and Co-ordinated with Data Science team in implementing Advanced Analytical Models in Hadoop Cluster over large Datasets.
  • Created monitors, alarms, notifications, and logs for Lambda functions, Glue jobs, and EC2 hosts using CloudWatch.
  • Used AWS Glue for data transformation, validation, and cleansing.
  • Used Python boto3 to configure services such as AWS Glue, EC2, and S3, as sketched below.
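
A minimal sketch of driving AWS Glue from Python with boto3, assuming the crawler and job already exist; their names, the region, and the S3 prefix are illustrative placeholders.

```python
# Minimal boto3 sketch: re-crawl an S3 prefix, then start a Glue job run.
import boto3

glue = boto3.client("glue", region_name="us-east-1")    # placeholder region


def refresh_catalog_and_transform():
    # Re-crawl the S3 prefix so new files show up as catalog table partitions.
    glue.start_crawler(Name="s3-raw-data-crawler")       # placeholder crawler

    # Kick off the PySpark Glue job that transforms the crawled data.
    response = glue.start_job_run(
        JobName="transform-raw-to-curated",              # placeholder job name
        Arguments={"--target_prefix": "s3://my-bucket/curated/"},
    )
    return response["JobRunId"]


if __name__ == "__main__":
    print("started Glue job run:", refresh_catalog_and_transform())
```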

Environment: Erwin 9.6, Oracle 12c, MS Office, SQL, SQL Loader, Docker, PL/SQL, DB2, SharePoint, Talend, Redshift, SQL Server, Hadoop, Spark, AWS.

Confidential

Sr. Data Analyst

Responsibilities:

  • Worked with users to identify the most appropriate source of record required to define the asset data for financing.
  • Performed data profiling in the target DWH.
  • Experience using OLAP functions such as COUNT, SUM, and CSUM.
  • Performed data analysis and data profiling using complex SQL on various source systems, including Oracle and Teradata.
  • Familiarity with GitHub for project management and versioning.
  • Strong programming skills in Python.
  • Developed Python scripts to automate data sampling process. Ensured the data integrity by checking for completeness, duplication, accuracy, and consistency.
  • Developed normalized Logical and Physical database models for designing an OLTP application.
  • Developed new scripts for gathering network and storage inventory data and ingesting the data into Splunk.
  • Developed timed reports and alerts and managed Splunk applications.
  • Standardized Splunk forwarder deployment, configuration, and maintenance across a variety of UNIX and Windows platforms.
  • Leveraged programming skills in Perl and Python to automate various aspects of the Splunk environment.
  • Created user interfaces that allow customers to manage their own Splunk instances.
  • Imported customer data into Python using the pandas library and performed various data analyses, finding patterns in the data that informed key decisions for the company (see the pandas sketch after this list).
  • Created tables in Hive and loaded the structured (resulted from Map Reduce jobs) data.
  • Developed many queries using HiveQL and extracted the required information.
  • Exported the required information to an RDBMS using Sqoop to make the data available to the claims processing team to assist in processing claims based on the data.
  • Designed and deployed rich graphical visualizations with drill-down, drop-down, and parameterized options using Tableau.
  • Extracted data from the database using SAS/ACCESS and SAS SQL procedures and created SAS data sets.
  • Created Teradata SQL scripts using OLAP functions like RANK () to improve the query performance while pulling the data from large tables.
  • Worked on MongoDB database concepts such as locking, transactions, indexes, replication, schema design, etc. Performed Data analysis using Python Pandas.
  • Good experience in Agile Methodologies, Scrum stories, and sprints experience in a Python-based environment, along with data analytics and Excel data extracts.
  • Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
  • Worked on data manipulation and analysis with Python or R.
  • Involved in defining the source to target data mappings, business rules, and business and data definitions.
  • Responsible for defining the key identifiers for each mapping/interface.
  • Responsible for defining the functional requirement documents for each source to target interface.
  • Hands-on experience with pivot tables and graphs in MS Excel.
  • Used advanced Excel features such as pivot tables and charts for generating graphs.
  • Designed and developed weekly and monthly reports using MS Excel techniques (charts, graphs, pivot tables) and PowerPoint presentations.
  • Strong Excel skills, including pivot tables, VLOOKUP, conditional formatting, and large record sets, as well as data manipulation and cleaning.
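
A minimal sketch of the kind of pandas profiling and analysis described above, assuming a customer CSV extract; the file name and column names are illustrative placeholders, not the actual data set.

```python
# Minimal pandas sketch: completeness, duplication, and simple pattern checks.
import pandas as pd

df = pd.read_csv("customers.csv", parse_dates=["signup_date"])  # placeholder

# Completeness and duplication checks used during data sampling.
print(df.isna().mean().sort_values(ascending=False))   # % missing per column
print("duplicate rows:", df.duplicated(subset="customer_id").sum())

# Simple pattern-finding: monthly signups and spend by segment.
monthly = df.set_index("signup_date").resample("M")["customer_id"].count()
by_segment = df.groupby("segment")["total_spend"].agg(["count", "mean", "sum"])
print(monthly.tail(12))
print(by_segment.sort_values("sum", ascending=False))
```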

Environment: SAS/ACCESS, SAS SQL, MS Excel, Python, pandas, RDBMS

Confidential 

Data Analyst

Responsibilities:

  • Responsible for the design and development of advanced R/Python programs to transform and harmonize data sets in preparation for modelling.
  • Experience in identifying, profiling, and mapping the data in meaningful ways across the enterprise data landscape.
  • Worked on identifying data requirements and building solutions that leverage data assets.
  • Developed large data sets from structured and unstructured data and performed data mining.
  • Partnered with modelers to develop data frame requirements for projects.
  • Performed ad hoc reporting, customer profiling, and segmentation using R/Python.
  • Tracked various campaigns, generating customer profiling analysis and data manipulation.
  • Provided R/SQL programming, with detailed direction, in the execution of data analysis that contributed to the final project deliverables. Responsible for data mining.
  • Analysed large datasets to answer business questions by generating reports and outcome-driven marketing strategies.
  • Extracted data from Twitter using Java and the Twitter API, parsed the JSON-formatted Twitter data, and uploaded it to the existing system's database.
  • Used Python 2.7 to apply time series models, identifying fast-growth opportunities for our clients.
  • Analysed the traffic queries of the Baidu search engine using classification algorithms (see the sketch after this list).
  • Assisted in improving the liquidity of our ads model.
  • Assessed the strengths and weaknesses of products based on client and traffic data.
  • Involved in fixing bugs and minor enhancements for the front-end modules.
  • Created class models, sequence diagrams, and activity diagrams for the application's SDLC process.
  • Worked with the testing team on system testing, integration testing, and UAT.
  • Ensured quality in deliverables.
  • Conducted design reviews and technical reviews with other project stakeholders.
  • Was part of the complete project life cycle from requirements to production support; involved in loading data from RDBMS and weblogs into HDFS using Sqoop.
  • Worked on loading the data from MySQL to HBase where necessary using Sqoop.
  • Developed Hive queries for Analysis across different banners.
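
A minimal sketch of the query-classification idea mentioned above, assuming scikit-learn; the toy queries and labels are illustrative placeholders rather than project data.

```python
# Minimal sketch: TF-IDF features plus logistic regression to label query text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

queries = ["cheap flights to nyc", "flight deals", "buy running shoes",
           "best trail shoes"]                     # placeholder query text
labels = ["travel", "travel", "retail", "retail"]  # placeholder labels

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(queries, labels)

print(model.predict(["discount sneakers online"]))  # expected: ['retail']
```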

Environment: Python, R, Tableau 6.1, MDM, QlikView, MLlib, PL/SQL, HDFS, Teradata 14.1, JSON, MapReduce, MySQL, Spark, R Studio, Mahout
