Lead GCP Data Engineer Resume

NYC, NY

SUMMARY:

  • 8+ years of professional experience in information technology, with expertise in big data, Hadoop, Spark, Hive, Impala, Sqoop, Flume, Kafka, SQL tuning, ETL development, report development, database development, and data modeling, plus strong knowledge of Oracle database architecture.
  • Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, the bq command-line utilities, Dataproc, and Stackdriver.
  • Strong knowledge of and experience with the Cloudera ecosystem (HDFS, YARN, Hive, Sqoop, Flume, HBase, Oozie, Kafka, Pig), data pipelines, and data analysis and processing with Hive SQL, Impala, Spark, and Spark SQL.
  • Used Flume, Kafka, and Spark Streaming to ingest real-time or near-real-time data into HDFS.
  • Analyzed data and provided insights with R and Python pandas.
  • Hands-on experience architecting ETL transformation layers and writing Spark jobs to do the processing.
  • Good programming experience with Python and Scala.
  • Hands-on experience with NoSQL databases such as HBase and Cassandra.
  • Analyzed approaches for migrating Oracle databases to Redshift.
  • Experience with scripting languages such as PowerShell, Perl, and shell.
  • Expert knowledge of and experience in fact/dimensional modeling (star schema, snowflake schema), transactional modeling, and SCD (slowly changing dimensions).
  • Extensive experience writing MS SQL and T-SQL procedures and Oracle (TOAD) functions and queries.
  • Effective team member: collaborative, and comfortable working independently.
  • Proficient in achieving Oracle SQL plan stability and maintaining SQL plan baselines, using ASH, AWR, ADDM, and SQL Advisor for proactive follow-up and SQL rewrites.
  • Experience with shell scripting to automate various activities.
  • Application development with Oracle Forms, reporting with OBIEE, Discoverer, and Report Builder, and ETL development.

PROFESSIONAL EXPERIENCE:

Confidential, NYC, NY

Lead GCP Data Engineer

Responsibilities:

  • Built and architected multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinated tasks among the team.
  • Designed and architected the various layers of the data lake.
  • Designed star schemas in BigQuery.
  • Loaded Salesforce data every 15 minutes on an incremental basis into the BigQuery raw and UDM layers using SOQL, Google Dataproc, GCS buckets, Hive, Spark, Scala, Python, gsutil, and shell scripts.
  • Used REST APIs with Python to ingest data from external sites into BigQuery.
  • Built a Python and Apache Beam program and ran it in Cloud Dataflow to validate data between raw source files and BigQuery tables.
  • Built a configurable Scala and Spark framework that connects to common data sources such as MySQL, Oracle, Postgres, SQL Server, Salesforce, and BigQuery and loads their data into BigQuery.
  • Monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver across all environments.
  • Opened SSH tunnels to Google Dataproc to access the YARN manager and monitor Spark jobs.
  • Submitted Spark jobs using gsutil and Spark submission to run them on the Dataproc cluster.
  • Wrote a Python program to maintain raw file archival in GCS buckets.
  • Analyzed various types of raw files (JSON, CSV, XML) with Python using pandas, NumPy, etc.
  • Wrote Scala programs for Spark transformations in Dataproc.

Environment: GCP, BigQuery, GCS buckets, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, bq command-line utilities, Dataproc, VM instances, Cloud SQL, MySQL, Postgres, SQL Server, Salesforce SOQL, Python, Scala, Spark, Hive, Sqoop, Spark SQL.
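
The data validation between raw source files and BigQuery tables mentioned above could look roughly like the following. This is a minimal sketch in plain Python, not the Apache Beam job that ran in Cloud Dataflow: the function names, the column list, and the sample records are all hypothetical, and the table rows are assumed to have already been read back as dictionaries.

```python
import csv
import hashlib
import io


def row_fingerprint(row, columns):
    """Return a stable hash of one record, using a fixed column order."""
    joined = "\x1f".join(str(row[c]).strip() for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def validate(source_csv_text, table_rows, columns):
    """Compare a raw CSV file against rows read back from the target table."""
    source_rows = list(csv.DictReader(io.StringIO(source_csv_text)))
    source_prints = {row_fingerprint(r, columns) for r in source_rows}
    table_prints = {row_fingerprint(r, columns) for r in table_rows}
    return {
        "source_count": len(source_rows),
        "table_count": len(table_rows),
        # Distinct source records that never reached the table.
        "missing_in_table": len(source_prints - table_prints),
        # Distinct table records that have no match in the source file.
        "unexpected_in_table": len(table_prints - source_prints),
    }


# Hypothetical sample data, for illustration only.
raw_file = "id,name\n1,alice\n2,bob\n3,carol\n"
loaded = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
report = validate(raw_file, loaded, ["id", "name"])
print(report)
```

Comparing row counts alongside fingerprints matters because the fingerprint sets ignore duplicates; a count mismatch with no missing fingerprints would point to repeated rows.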

Confidential, NYC, NY

GCP Data Engineer

Responsibilities:

  • Used Cloud Functions with Python to load CSV files into BigQuery as they arrived in GCS buckets.
  • Wrote a program to download a SQL dump from their equipment maintenance site and load it into a GCS bucket, then loaded the dump from the GCS bucket into MySQL (hosted in Google Cloud SQL) and loaded the data from MySQL into BigQuery using Python, Scala, Spark, and Dataproc.
  • Processed and loaded bounded and unbounded data from Google Pub/Sub topics into BigQuery using Cloud Dataflow with Python.
  • Created firewall rules to access Google Dataproc from other machines.
  • Wrote Scala programs for Spark transformations in Dataproc.

Environment: GCP, BigQuery, GCS buckets, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, Dataproc, VM instances, Cloud SQL, MySQL, Postgres, SQL Server, Python, Scala, Spark, Hive, Spark SQL.

Confidential, Dallas, TX

Big Data Engineer

Responsibilities:

  • Analyzed client data using Scala, Spark, and Spark SQL, and presented an end-to-end data lake design to the team.
  • Designed the transformation layers for the ETL in Scala and Spark and distributed the work among the team, including myself.
  • Kept the team motivated to deliver the project on time and worked side by side with the other members as a team member.
  • Designed and developed Spark jobs in Scala to implement end-to-end data pipelines for batch processing.
  • Performed fact/dimensional modeling and proposed solutions for loading the model.
  • Processed data with Scala, Spark, and Spark SQL and loaded it into partitioned Hive tables in Parquet format.
  • Developed Spark jobs with partitioned RDDs (hash, range, custom) for faster processing.
  • Developed and deployed the results as Spark and Scala code on a Hadoop cluster running on GCP.
  • Developed a near-real-time data pipeline using Flume, Kafka, and Spark Streaming to ingest client data from their web log server and apply transformations.
  • Developed Sqoop scripts and Sqoop jobs to ingest data from client-provided databases in batches on an incremental basis.
  • Used DistCp to load files from S3 to HDFS; processed, cleansed, and filtered data using Scala, Spark, Spark SQL, Hive, and Impala queries; and loaded it into Hive tables for data scientists to apply their ML algorithms and generate recommendations as part of the data lake processing layer.
  • Defined data pipelines for various clients.
  • Built part of an Oracle database in Redshift.
  • Loaded data into NoSQL databases (HBase, Cassandra).
  • Combined all of the above steps in an Oozie workflow to run the end-to-end ETL process.
  • Used YARN in Cloudera Manager to monitor job processing.
  • Developed under the Scrum methodology in a CI/CD environment using Jenkins.
  • Participated in the architecture council for database architecture recommendations.
  • Performed deep analysis of SQL execution plans and recommended hints, query restructuring, indexes, or materialized views for better performance.
  • Deployed EC2 instances for Oracle databases.

Environment: Hadoop ecosystem (HDFS, YARN, Pig, Hive, Sqoop, Flume, Oozie, Kafka, Hive SQL, Impala, Spark), Scala, Python, HBase, Cassandra, EC2, EBS volumes, VPC, S3, Oracle 12c, Oracle Enterprise Linux, shell scripting.
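
The hash and range partitioning strategies listed among the responsibilities above can be illustrated with a small sketch. It is written in plain Python rather than Spark, so it only shows how keys map to partitions; the function names, the CRC32 hash, the range bounds, and the sample records are assumptions made for illustration and are not taken from the original jobs.

```python
import bisect
import zlib
from collections import defaultdict


def hash_partition(key, num_partitions):
    """Assign a key to a partition by a deterministic hash (CRC32 here)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def range_partition(key, upper_bounds):
    """Assign a key to the first range whose upper bound is >= the key.

    upper_bounds must be sorted; keys above the last bound go to one
    extra partition at the end.
    """
    return bisect.bisect_left(upper_bounds, key)


def partition_records(records, partitioner):
    """Group (key, value) pairs by the partition index the partitioner returns."""
    parts = defaultdict(list)
    for key, value in records:
        parts[partitioner(key)].append((key, value))
    return dict(parts)


# Hypothetical (customer_id, amount) pairs, for illustration only.
records = [(5, 10.0), (17, 2.5), (42, 7.0), (5, 1.0), (99, 3.0)]

by_hash = partition_records(records, lambda k: hash_partition(k, 4))
by_range = partition_records(records, lambda k: range_partition(k, [10, 50]))
print(by_range)
```

In both schemes, records that share a key always land in the same partition, which is what lets per-key operations avoid a later shuffle. Range partitioning additionally keeps neighboring keys together, at the cost of needing sensible bounds to avoid skew.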

Confidential, Dallas, TX

Data Analyst

Responsibilities:

  • Devised simple and complex SQL scripts to check and validate data flow in various applications.
  • Performed Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export through Python.
  • Devised PL/SQL stored procedures, functions, triggers, views, and packages. Used indexing, aggregation, and materialized views to optimize query performance.
  • Developed logistic regression models (using R and Python) to predict subscription response rate based on customer variables such as past transactions, response to prior mailings, promotions, demographics, and interests and hobbies.
  • Created Tableau dashboards and reports for data visualization, reporting, and analysis, and presented them to the business.
  • Created data connections and published them on Tableau Server for use with operational or monitoring dashboards.
  • Knowledge of Tableau administration tools for configuration, adding users, managing licenses and data connections, scheduling tasks, and embedding views by integrating with other platforms.
  • Worked with senior management to plan, define, and clarify dashboard goals, objectives, and requirements.
  • Responsible for daily communications to management and internal organizations regarding status of all assigned projects and tasks.
