GCP Data Engineer Resume
NYC
SUMMARY
- A Data Engineer with over 8 years of experience working with ETL, Big Data, Python/Scala, Relational Database Management Systems (RDBMS), and enterprise-level cloud-based computing and applications.
- Comprehensive experience with the Hadoop ecosystem, utilizing technologies such as MapReduce, Hive, HBase, Spark, Sqoop, Kafka, Oozie, ZooKeeper, and AWS.
- Designed Hive tables with partitioning and bucketing to optimize query performance.
- Experience with developing User Defined Functions (UDFs) in Apache Hive using Java, Scala, and Python.
- Hands-on experience with GCP services such as BigQuery, Cloud Storage (GCS), Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, Dataproc, and Operations Suite (formerly Stackdriver).
- Performed in-memory data processing for batch, real-time, and advanced analytics using Apache Spark (Spark Core, Spark SQL, and Streaming).
- Ingested data into Hadoop from various data sources such as Oracle, MySQL, and Teradata using Sqoop.
- Experienced in Agile and Waterfall methodologies for project execution.
- Strong knowledge in NoSQL column-oriented databases like HBase and their integration with Hadoop.
- Experience in setting up Hadoop clusters on cloud platforms like AWS and GCP.
- Extensive experience working with AWS Cloud services and AWS SDKs to work with services like AWS API Gateway, Lambda, S3, IAM, and EC2.
- Customized dashboards and managed user and group permissions with Identity and Access Management (IAM) in AWS.
- Expertise in database performance tuning and data modeling.
- Experienced in securing Hadoop clusters with Kerberos and integrating with LDAP/AD at the enterprise level.
- Involved in best practices for Cassandra, migrating applications to the Cassandra database from the legacy platform for Choice.
- Experienced in developing MapReduce programs using Apache Hadoop for Big Data workloads.
- Good understanding of XML methodologies (XML, XSL, XSD) including SOAP.
- Used the Spark Cassandra Connector to load data to and from Cassandra and analyzed the data with Apache Spark.
- Hands-on experience in Apache Spark creating RDDs and DataFrames, applying transformations and actions, and converting RDDs to DataFrames (a minimal PySpark sketch follows this list).
- Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Experience working with multi-cluster and virtual warehouses in Snowflake.
- Expertise in creating Spark applications using Python (PySpark) and Scala.
- Experience using the Presto SQL engine as a serverless SQL tool with Python for high-performance queries over large, varied Big Data sets.
- Experience implementing CRUD operations using NoSQL REST APIs.
- Additionally, have substantial experience and a deep knowledge of relational and non-relational databases.
- Added support for Amazon AWS S3 and RDS to host static/media files and the database into Amazon Cloud.
- Implemented AWS Lambda functions to drive real-time monitoring dashboards from system logs.
- Extensive experience in IT data analytics projects, with hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.
- Able to work in parallel across both GCP and Azure clouds.
- Exceptional experience in Python, SQL Server Reporting Services, Analysis Services, Tableau, Power BI, and data visualization tools.
- Proficient in SQLite, MySQL, and SQL databases with Python.
- Demonstrated experience in building and maintaining reliable and scalable ETL pipelines on big data platforms.
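As referenced above, a minimal, hypothetical PySpark sketch of the RDD-to-DataFrame workflow: create an RDD, convert it to a DataFrame, and apply transformations and an action. Column names and sample data are illustrative assumptions, not values from any actual project.

```python
# Hypothetical PySpark sketch: RDD -> DataFrame, transformations, and an action.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-to-dataframe-example").getOrCreate()

# Build an RDD from an in-memory collection (stand-in for a real source).
rdd = spark.sparkContext.parallelize([
    ("2023-01-01", "web", 120.0),
    ("2023-01-01", "mobile", 75.5),
    ("2023-01-02", "web", 98.25),
])

# Convert the RDD to a DataFrame with named columns.
df = rdd.toDF(["event_date", "channel", "amount"])

# Transformations: filter, then aggregate by date and channel.
daily_totals = (
    df.filter(F.col("amount") > 0)
      .groupBy("event_date", "channel")
      .agg(F.sum("amount").alias("total_amount"))
)

# Action: materialize and display the result.
daily_totals.show()
```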
TECHNICAL SKILLS:
Big Data Frameworks: Hadoop (HDFS, MapReduce), Spark, Spark SQL, Spark Streaming, Hive, Impala, Kafka, HBase, Flume, Pig, Sqoop, Oozie, Cassandra.
Cloud Technologies: GCP, AWS
Programming languages: Core Java, Scala, Python, Shell scripting
Operating Systems: Windows, Linux (Ubuntu, CentOS)
Databases: Oracle, SQL Server, MySQL
Designing Tools: UML, Visio
IDEs: Eclipse, NetBeans
Java Technologies: JSP, JDBC, Servlets, Junit
Web Technologies: XML, HTML, JavaScript, jQuery, JSON
Linux Experience: System Administration Tools, Puppet
Development methodologies: Agile, Waterfall
Logging Tools: Log4j
Application / Web Servers: Apache Tomcat, WebSphere
Messaging Services: ActiveMQ, Kafka, JMS
Version Tools: Git and CVS
PROFESSIONAL EXPERIENCE
Confidential, NYC
GCP Data Engineer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Built data pipelines using Airflow in GCP for ETL-related jobs, using different Airflow operators.
- Developed a pre-processing job using Spark DataFrames to flatten JSON documents into flat files.
- Loaded DStream data into Spark RDDs and performed in-memory computation to generate output responses.
- Designed GCP Cloud Composer DAGs to load data from on-prem CSV files into GCP BigQuery tables, and scheduled the DAGs to load in incremental mode.
- Configured Snowpipe to pull data from Google Cloud Storage buckets into Snowflake tables.
- Good understanding of Cassandra architecture, replication strategy, gossip, snitches etc.
- Used HiveQL to analyze partitioned and bucketed data and executed Hive queries on Parquet tables.
- Stored data in Hive to perform data analysis meeting business specification logic.
- Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for data analysis and engineering roles.
- Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked on implementing Kafka security and boosting its performance.
- Migrated an existing on-premises application to AWS using various services.
- Maintained the Hadoop cluster on GCP using Google Cloud Storage, BigQuery, and Dataproc.
- Worked with Spark to improve performance and optimize existing algorithms in Hadoop.
- Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
- Used the GCP environment for Cloud Functions (event-based triggering) and Cloud Monitoring and alerting.
- Developed Oozie coordinators to schedule Hive scripts to create Data pipelines.
- Performed on-cluster testing of HDFS, Hive, Pig, and MapReduce and provisioned cluster access for new users.
- Used Apache Airflow in the GCP Composer environment to build data pipelines, using operators such as the Bash operator, Hadoop operators, Python callables, and branching operators (a minimal DAG sketch follows this list).
- Built NiFi dataflows to consume data from Kafka, apply transformations, place the data in HDFS, and expose a port to run a Spark Streaming job.
- Used Cloud Functions with Python to load data into BigQuery upon arrival of CSV files in a GCS bucket.
- Worked on the Spark RDD, DataFrame API, Dataset API, Data Source API, Spark SQL, and Spark Streaming.
- Used Spark Streaming APIs to perform transformations and actions on the fly.
- Developed a Kafka consumer API in Python for consuming data from Kafka topics.
- Consumed Extensible Markup Language (XML) messages using Kafka and processed the XML file using Spark Streaming to capture User Interface (UI) updates.
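As referenced above, a minimal sketch of a Cloud Composer (Airflow) DAG that loads CSV files from a GCS bucket into a BigQuery table. The bucket name, dataset, table, and schedule are hypothetical placeholders, not values from the actual project.

```python
# Hypothetical Airflow DAG sketch (Cloud Composer): load CSVs from GCS into BigQuery.
# Bucket, dataset, table, and schedule are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="gcs_csv_to_bigquery_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",   # daily incremental loads
    catchup=False,
) as dag:

    load_csv_to_bq = GCSToBigQueryOperator(
        task_id="load_csv_to_bq",
        bucket="example-landing-bucket",              # assumed bucket name
        source_objects=["incoming/*.csv"],            # assumed object prefix
        destination_project_dataset_table="example_dataset.example_table",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_APPEND",             # append for incremental mode
        autodetect=True,
    )
```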
Environment: Spark, Spark Streaming, Spark SQL, AWS, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, PySpark, Shell scripting, Linux, MySQL, Oracle Enterprise DB, Solr, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, SOAP, Cassandra & Agile Methodologies.
Confidential, New York
Senior Data Engineer
Responsibilities:
- Involved in writing Spark applications using Python to perform various data cleansing, validation, transformation, and summarization activities according to the requirement.
- Developed multiple POCs using PySpark and deployed them on the YARN cluster, compared the performance of Spark with Hive and SQL/Teradata, and developed code to read multiple data formats on HDFS using PySpark.
- Loaded the data into Spark DataFrames and performed in-memory computation to generate the output as per the requirements.
- Worked on AWS Cloud to convert existing on-premises processes and databases to the AWS Cloud.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
- Used AWS Redshift, S3, Spectrum, and Athena to query large amounts of data stored on S3, creating a virtual data lake without having to go through the ETL process.
- Developed a PySpark job to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
- Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS.
- Developed a daily process to do incremental import of data from DB2 and Teradata into Hive tables using Sqoop.
- Analyzed the SQL scripts and designed the solution to implement using Spark.
- Worked on importing metadata into Hive using Python and migrated existing tables and the data pipeline from Legacy to AWS cloud (S3) environment and wrote Lambda functions to run the data pipeline in the cloud.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Extensively worked with Partitions, Dynamic Partitioning, bucketing tables in Hive, designed both Managed and External tables, also worked on optimization of Hive queries.
- Designed, developed, and created ETL (Extract, Transform, and Load) packages using Python to load data into data warehouse tools (Teradata) from databases such as Oracle SQL Developer and MS SQL Server.
- Utilized the built-in Python json module to parse member data in JSON format using json.loads and json.dumps, and loaded it into a database for reporting.
- Consumed REST APIs using the Python requests library (GET and POST operations) to fetch data (a minimal sketch follows this list).
- Experience building data pipelines in Python/PySpark/HiveQL/Presto/BigQuery and building Python DAGs in Apache Airflow.
- Used the Pandas API to arrange the data in time-series and tabular formats for timestamp-based data manipulation and retrieval during various loads into the Data Mart.
- Used Python libraries and SQL queries/subqueries to create several datasets producing statistics, tables, figures, charts, and graphs; experienced in software development using IDEs such as PyCharm and Jupyter Notebook.
- Worked on bash scripting to automate the Python jobs for day-to-day administration.
- Performed data extraction and manipulation over large relational datasets using SQL, Python, and other analytical tools.
- Extensively worked with Teradata utilities like BTEQ, Fast Export, Fast Load, Multi Load to export and load Claims & Callers data to/from different source systems including flat files.
- Used Power BI, Power Pivot to develop data analysis prototype, and used Power View and Power Map to visualize reports.
- Developed SSIS packages bringing data from diverse sources such as Excel, SQL server, flat files, Oracle DB for the daily load to create and maintain a centralized data warehouse.
- Designed and configured SSIS packages to migrate data from Oracle and legacy systems using various transformations.
- Published Power BI reports to the required organizations and made Power BI dashboards available in web clients and mobile apps.
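As noted above, a minimal, hypothetical sketch of consuming a REST API with the Python requests library and handling the JSON payload. The endpoint URL, field names, and payload are illustrative assumptions.

```python
# Hypothetical sketch: fetch member data over REST and parse the JSON payload.
# The endpoint URL and field names are illustrative placeholders.
import json
import requests

BASE_URL = "https://api.example.com/members"  # assumed endpoint

# GET: fetch member records as JSON.
response = requests.get(BASE_URL, params={"status": "active"}, timeout=30)
response.raise_for_status()
members = response.json()  # equivalent to json.loads(response.text)

# POST: send a new record back to the service.
payload = {"member_id": 123, "plan": "standard"}
created = requests.post(
    BASE_URL,
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
    timeout=30,
)
created.raise_for_status()

print(f"Fetched {len(members)} members; created record {created.json()}")
```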
Environment: AWS EMR, AWS Glue, Redshift, Hadoop, HDFS, Teradata, SQL, Oracle, Hive, Spark, Python, Hive, Sqoop, MicroStrategy, Excel.
Confidential, New Jersey
Data Engineer
Responsibilities:
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Used AWS Redshift to extract, transform, and load data from various heterogeneous data sources and destinations.
- Created tables and stored procedures, and extracted data using T-SQL for business users whenever required.
- Performed data analysis and design; created and maintained large, complex logical and physical data models and metadata repositories using ERwin and MB MDR.
- Wrote shell scripts to trigger DataStage jobs.
- Assisted service developers in finding relevant content in the existing reference models.
- Worked with sources like Access, Excel, CSV, Oracle, and flat files using connectors, tasks, and transformations provided by AWS Data Pipeline.
- Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Developed a PySpark script to protect raw data by applying hashing algorithms to client-specified columns (a minimal sketch follows this list).
- Responsible for the design, development, and testing of the database, and developed stored procedures, views, and triggers.
- Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
- Performed ETL testing activities such as running jobs, extracting data from the database using the necessary queries, transforming it, and uploading it into the data warehouse servers.
- Performed pre-processing using Hive and Pig.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Implemented Copy activities and custom Azure Data Factory pipeline activities.
- Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
- Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Ensured deliverables (daily, weekly, and monthly MIS reports) were prepared to satisfy project requirements, cost, and schedule.
- Worked on DirectQuery in Power BI to compare legacy data with current data and generated reports and dashboards.
- Designed SSIS packages to extract, transfer, and load (ETL) existing data into SQL Server from different environments for the SSAS (OLAP) cubes.
- Used SQL Server Reporting Services (SSRS) to create and format cross-tab, conditional, drill-down, Top N, summary, form, OLAP, sub-, ad-hoc, parameterized, interactive, and custom reports.
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets using Power BI.
- Developed visualizations and dashboards using Power BI.
- Used ETL to implement Slowly Changing Dimension transformations to maintain historical data in the data warehouse.
- Created dashboards for analyzing POS data using Power BI
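As referenced above, a minimal, hypothetical PySpark sketch of hashing client-specified columns. The column names, sample data, and the choice of SHA-256 are illustrative assumptions, not details from the actual engagement.

```python
# Hypothetical sketch: hash client-specified columns of a raw DataFrame with SHA-256.
# Column names and data are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("column-hashing-example").getOrCreate()

raw_df = spark.createDataFrame(
    [("alice@example.com", "123-45-6789", 250.0)],
    ["email", "ssn", "amount"],
)

# Columns the (hypothetical) client asked to protect.
sensitive_columns = ["email", "ssn"]

masked_df = raw_df
for col_name in sensitive_columns:
    # sha2(..., 256) returns the SHA-256 hex digest of the column value.
    masked_df = masked_df.withColumn(col_name, F.sha2(F.col(col_name).cast("string"), 256))

masked_df.show(truncate=False)
```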
Environment: MS SQL Server 2016, T-SQL, SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), SQL Server Analysis Services (SSAS), Management Studio (SSMS), Advanced Excel (formulas, pivot tables, HLOOKUP, VLOOKUP, macros), Spark, Python, ETL, Power BI, Tableau, Hive/Hadoop, Snowflake, AWS Data Pipeline, IBM Cognos 10.1, DataStage and QualityStage 7.5, Cognos Report Studio 10.1, Cognos 8 & 10 BI, Cognos Connection, Cognos Office Connection, Cognos 8.2/3/4
Confidential
Data Engineer
Responsibilities:
- Designed a stream processing job in Spark Streaming, coded in Scala.
- Ingested information from several sources like Kafka, Flume, and TCP sockets.
- Processed data using advanced algorithms expressed with high-level functions like map, reduce, join, and window.
- Installed, configured, and maintained Apache Hadoop clusters for application development and major components of the Hadoop ecosystem: Hive, Pig, HBase, Sqoop, Flume, Oozie, and ZooKeeper.
- Implemented a six-node CDH4 Hadoop cluster on CentOS.
- Imported and exported data into HDFS and Hive from different RDBMS using Sqoop.
- Experienced in defining job flows to run multiple MapReduce and Pig jobs using Oozie.
- Imported log files into HDFS using Flume and loaded them into Hive tables for querying.
- Monitored running MapReduce programs on the cluster.
- Responsible for loading data from UNIX file systems to HDFS.
- Used HBase-Hive integration and wrote multiple Hive UDFs for complex queries.
- Involved in writing APIs to read HBase tables, cleanse the data, and write to another HBase table.
- Created multiple Hive tables, implemented Partitioning, Dynamic Partitioning and Buckets in Hive for efficient data access.
- Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed formats.
- Experienced in running batch processes using Pig scripts and developed Pig UDFs for data manipulation according to business requirements.
- Set up VirtualBox to gain access to a Linux environment and set up Vagrant to provision and install the software required to run the Spark job.
- The packaged job was first extracted to a deployment folder and then deployed to YARN so that YARN could handle scheduling and resource management.
- Fed inbound events into the Scala-project-inbound topic to check whether the window summary event functioned as intended (a minimal streaming sketch follows this list).
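The windowed summary described above was implemented in Scala; below is a minimal, hypothetical sketch of the same idea (read from Kafka, apply a time window, aggregate) shown in PySpark purely for illustration. The broker address, window size, and output sink are assumptions.

```python
# Hypothetical sketch: consume events from a Kafka topic and compute a windowed summary.
# Broker address, window size, and output sink are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("windowed-summary-example").getOrCreate()

# Read the inbound events from Kafka as a streaming DataFrame.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "scala-project-inbound")
    .load()
)

# Treat the Kafka value as a plain string and use the Kafka timestamp for windowing.
parsed = events.select(
    F.col("value").cast("string").alias("event"),
    F.col("timestamp"),
)

# Count events per 1-minute window (the "window summary").
summary = (
    parsed.withWatermark("timestamp", "2 minutes")
    .groupBy(F.window("timestamp", "1 minute"))
    .agg(F.count("*").alias("event_count"))
)

# Write the running summary to the console for inspection.
query = summary.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```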
Environment: Scala 2.12.3, Spark Streaming, Apache Hadoop 2.7.2, HDFS, YARN, slf4j 1.7.7, Kafka 0.11.0.1, json4s 3.2.11, jodaTime 2.3, VirtualBox, Vagrant, Cassandra 3.11
Confidential
SQL Developer
Responsibilities:
- Gathered business requirements and converted them into new T-SQL stored procedures in Visual Studio for a database project.
- Performed unit tests on all code and packages.
- Analyzed requirements and impact by participating in Joint Application Development (JAD) sessions with business clients online.
- Performed and automated SQL Server version upgrades, patch installs and maintained relational databases.
- Performed front line code reviews for other development teams.
- Modified and maintained SQL Server stored procedures, views, ad-hoc queries, and SSIS packages used in the search engine optimization process.
- Updated existing reports and created new reports using Microsoft SQL Server Reporting Services; the team consisted of two developers.
- Created files, views, tables, and datasets to support the Sales Operations and Analytics teams.
- Monitored and tuned database resources and activities for SQL Server databases.
Environment: Visual Studio 2013 (Dev12), SSMS, Power BI Desktop 2.19