Sr. GCP Data Engineer Resume
San Jose, CA
SUMMARY:
- Overall 8+ years of experience in the IT industry, including big data environments, the Hadoop ecosystem, Java, and the design, development, and maintenance of applications using Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, YARN, Oozie, and Zookeeper.
- Experience in developing custom UDFs for Pig and Hive to incorporate methods and functionality of Python/Java into Pig Latin and HQL (HiveQL).
- Good experience in Tableau for Data Visualization and analysis on large data sets, drawing various conclusions.
- Company provides Tableau dashboards and data sources to healthcare and hospital corporations to help them understand demand, pricing, geospatial claim concentrations, and de-identified and re-identified customer claims across multiple claims networks, and to support consultation on hospital expansion and location, as requested.
- Good working experience with application and web servers such as JBoss and Apache Tomcat.
- Good knowledge of Amazon Web Services (AWS) concepts such as the EMR and EC2 web services, which provide fast and efficient processing for Teradata big data analytics.
- Expertise in big data architectures such as Hadoop (Hortonworks, Cloudera) distributed systems, MongoDB, and NoSQL.
- Experience building Spark-based applications that load streaming data with low latency using Kafka and PySpark.
- Hands-on experience with Hadoop/big data technologies for the storage, querying, processing, and analysis of data.
- Expertise in programming with Python, Spark, and SQL.
- Utilized the Spark SQL API in PySpark to extract and load data and run SQL queries.
- Involved in data warehouse design, data integration, and data transformation using Apache Spark and Python.
- Worked on downloading BigQuery data into pandas or Spark data frames for advanced ETL capabilities (a brief sketch follows this summary).
- Developed complex yet maintainable and easy-to-use Python code that satisfies application requirements for data processing and analytics using built-in libraries.
- Experience in creating and executing Data pipelines in GCP and AWS platforms.
- Hands-on experience with GCP: BigQuery, GCS, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, and Dataproc.
- Experience in data architecture and design.
- Experience in using PL/SQL to write Stored Procedures, Functions and Triggers.
- Experience in development of Big Data projects using Hadoop, Hive, HDP, Pig, Flume, Storm and MapReduce open-source tools.
- Experience developing Spark applications using PySpark, Scala, and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Experienced in using Pig scripts to do transformations, event joins, filters, and pre-aggregations before storing the data into HDFS.
- Solid experience and understanding of implementing large-scale data warehousing programs and end-to-end data integration solutions on Snowflake Cloud, AWS Redshift, Informatica Intelligent Cloud Services (IICS - CDI), and Informatica PowerCenter integrated with multiple relational databases (MySQL, Teradata, Oracle, Sybase, SQL Server, DB2).
- Experience in installation, configuration, supporting and managing Hadoop clusters.
- Experience in working with MapReduce programs using Apache Hadoop for working with Big Data.
- Experience in installation, configuration, supporting and monitoring Hadoop clusters using Apache, Cloudera distributions and AWS.
- Strong hands-on experience with AWS services, including but not limited to EMR, S3, EC2, Route 53, RDS, ELB, DynamoDB, CloudFormation, etc.
- Excellent working experience in Scrum / Agile framework and Waterfall project execution methodologies.
- Hands-on experience with the Hadoop ecosystem and related big data technologies, including Spark, Kafka, HBase, Scala, Pig, Hive, Sqoop, Oozie, Flume, and Storm.
- Professional in deploying and configuring Elasticsearch, Logstash, Kibana (ELK) and AWS Kinesis for log analytics and skilled in monitoring servers using Nagios, Splunk, AWS CloudWatch, and ELK.
- Worked on Spark and Spark Streaming, using the core Spark API to explore Spark features and build data pipelines.
- Experienced in maintaining CI/CD (continuous integration and deployment) pipelines and applying automation to environments and applications. Worked with automation tools such as Git, Terraform, and Ansible.
- Experienced in working with different scripting technologies like Python, UNIX shell scripts.
- Good knowledge of Amazon Web Services (AWS) concepts such as the EMR and EC2 web services; successfully loaded files into HDFS from Oracle, SQL Server, Teradata, and Netezza using Sqoop.
- Excellent knowledge of big data infrastructure: distributed file systems (HDFS) and parallel processing (the MapReduce framework).
- Expert in Amazon EMR, Spark, Kinesis, S3, ECS, ElastiCache, DynamoDB, and Redshift.
- Strong experience and knowledge of NoSQL databases such as MongoDB and Cassandra.
- Very keen on learning newer technology stacks such as Google Cloud Platform (GCP).
- Experience in converting existing AWS infrastructure to a serverless architecture (AWS Lambda, AWS Kinesis) through the creation of a serverless architecture using AWS Lambda, API Gateway, Route 53, and S3 buckets.
- Extensive experience in IT data analytics projects, with hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer.
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
- Worked in parallel across both GCP and AWS clouds.
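As a concrete illustration of the BigQuery-to-pandas bullet above, here is a minimal sketch, assuming Application Default Credentials are configured and the google-cloud-bigquery and pandas packages are installed; the project, dataset, table, and column names are hypothetical placeholders.

```python
# Minimal sketch: pull a BigQuery result set into a pandas DataFrame for
# downstream ETL. Project, dataset, table, and column names are hypothetical.
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # assumes ADC credentials

query = """
    SELECT claim_id, claim_amount, service_date
    FROM `my-gcp-project.claims_dw.claims`
    WHERE service_date >= '2021-01-01'
"""

# Run the query and materialize the result as a pandas DataFrame.
df = client.query(query).to_dataframe()

# Simple pandas-side transformation before loading the data elsewhere.
monthly = (
    df.assign(month=pd.to_datetime(df["service_date"]).dt.to_period("M"))
      .groupby("month")["claim_amount"]
      .sum()
      .reset_index()
)
print(monthly.head())
```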
TECHNICAL SKILLS:
Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Sqoop, Oozie, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm.
Programming Languages: Java, Python, Hibernate, JDBC, JSON, HTML, CSS.
Cloud Technologies: AWS, GCP, Amazon S3, EMR, Redshift, Lambda, Athena, Composer, BigQuery.
Script Languages: Python, shell scripting (Bash).
Databases: Oracle, MySQL, SQL Server, PostgreSQL, HBase, Snowflake, Cassandra, MongoDB.
Version Control and Tools: Git, Maven, SBT, CBT.
Web/Application server: Apache Tomcat, WebLogic, WebSphere
PROFESSIONAL EXPERIENCE:
Confidential, San Jose, CA
Sr GCP Data Engineer
Responsibilities:
- Migrated an entire Oracle database to BigQuery and used Power BI for reporting. Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators (see the DAG sketch after this list).
- Experienced with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
- Experienced with Google Cloud components, Google Container Builder, GCP client libraries, and the Cloud SDK.
- Involved in migrating an on-premises Hadoop system to GCP (Google Cloud Platform).
- Worked on analysis and understanding of data from different domains in order to integrate it into the Data Marketplace.
- Developed PySpark programs, created data frames, and worked on transformations.
- Worked with AWS and GCP clouds, using GCP Cloud Storage, Dataproc, Dataflow, and BigQuery alongside AWS EMR, S3, Glacier, and EC2 with EMR clusters.
- Experienced working with services such as Data Lake, Data Lake Analytics, SQL Database, Synapse, Databricks, Data Factory, Logic Apps, and SQL Data Warehouse, as well as GCP services such as BigQuery, Dataproc, and Pub/Sub.
- Worked on analyzing data using PySpark and Hive based on ETL mappings.
- Experienced in implementing Continuous Delivery pipelines with Maven, Ant, Jenkins, and GCP. Experienced with Hadoop 2.6.4 and Hadoop 3.1.5.
- Developed multi-cloud strategies to make better use of GCP (for its PaaS offerings).
- Experienced in migrating legacy systems into GCP technologies.
- Stored data files in Google Cloud Storage buckets on a daily basis, using Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.
- Developed a PySpark script to merge static and dynamic files and cleanse the data.
- Worked with different business units to drive the design and development strategy.
- Created functional specifications and technical design documentation. Coordinated with different teams such as Cloud Security, Identity Access Management, Platform, and Network to obtain all necessary accreditations and complete the intake process.
- Leveraged cloud and GPU computing technologies such as GCP for automated machine learning and analytics pipelines.
- Worked on POC to check various cloud offerings including Google Cloud Platform (GCP).
- Compared self-hosted Hadoop with GCP's Dataproc, and explored Bigtable (managed HBase) use cases and performance evaluation.
- Designed various Jenkins jobs to continuously integrate processes and executed CI/CD pipelines using Jenkins.
- Involved in setting up the Apache Airflow service in GCP.
- Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
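A minimal Airflow DAG sketch for the GCP ETL pipelines described above, assuming a Cloud Composer / Airflow environment with the apache-airflow-providers-google package installed; the bucket, project, dataset, and table names are hypothetical.

```python
# Minimal Airflow DAG sketch: load daily CSV extracts from GCS into a BigQuery
# staging table, then run a BigQuery SQL transformation. All resource names
# below are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="daily_claims_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Load the day's CSV extracts from GCS into a staging table.
    load_to_staging = GCSToBigQueryOperator(
        task_id="load_to_staging",
        bucket="my-etl-bucket",
        source_objects=["claims/{{ ds }}/*.csv"],
        destination_project_dataset_table="my-gcp-project.staging.claims",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_TRUNCATE",
    )

    # Transform staged rows into the reporting table with a BigQuery SQL job.
    transform = BigQueryInsertJobOperator(
        task_id="transform_claims",
        configuration={
            "query": {
                "query": """
                    INSERT INTO `my-gcp-project.reporting.claims_daily`
                    SELECT claim_id, SUM(claim_amount) AS total_amount
                    FROM `my-gcp-project.staging.claims`
                    GROUP BY claim_id
                """,
                "useLegacySql": False,
            }
        },
    )

    load_to_staging >> transform
```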
Environment: GCP, PySpark, GCP Dataproc, BigQuery, Hadoop, Hive, GCS, Python, Snowflake, DynamoDB, Oracle Database, Power BI, Cloud SDK, Dataflow, Glacier, EC2, EMR cluster, SQL Database, Synapse, Databricks.
Confidential, New York, NY
Sr. Data Engineer
Responsibilities:
- Performed complex transformations from different sources in AWS Redshift and unloaded the result datasets into a Hive/Presto stage built on the AWS S3 data lake.
- Built the Hive query with a set of applicable parameters to load data from the Hive/Presto stage to the actual Hive/Presto target table, which is also built on an AWS S3 path. In some cases, unloading data from the stage to the target table happens through AWS RDS.
- Worked with Amazon Web Services (AWS) using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.
- Performed tuning of AWS Redshift SQL queries by effectively using appropriate distribution styles and keys.
- Worked with the Hadoop framework, involving the Hadoop Distributed File System and components such as Pig, Hive, Sqoop, and PySpark.
- Built Python programs to extract data from AWS S3 and load it into SQL Server for one of the business teams, as they are not exposed to the cloud.
- Developed PySpark scripts that push MSSQL tables into the big data platform, where the data is stored in Hive tables.
- Conducted statistical analysis on Healthcare data using python and various tools.
- Experience working with healthcare data, developing data pre-processing pipelines for data such as DICOM and non-DICOM images of X-rays, CT scans, etc.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (Parquet/text files) into AWS Redshift.
- Used the AWS Glue Data Catalog with crawlers to catalog data in S3 and performed SQL query operations using AWS Athena.
- Created external tables with partitions using AWS Athena and Redshift.
- Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
- Implemented a proof of concept deploying this product in an AWS S3 bucket and Snowflake.
- Utilized AWS services with a focus on big data architecture/analytics/enterprise data warehouse and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful, valuable information for better decision-making.
- Developed Scala scripts using both data frames/SQL and RDDs in Spark for data aggregation, queries, and writing back into the S3 bucket.
- Wrote, compiled, and executed programs as necessary using Apache Spark to perform ETL jobs with ingested data (see the PySpark sketch after this list).
- Implemented AWS Elastic Container Service (ECS) scheduler to automate application deployment in the cloud using Docker Automation techniques.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
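A minimal PySpark sketch of the S3 read, Spark SQL aggregation, and S3 write pattern described above (shown in PySpark rather than Scala); the S3 paths and column names are hypothetical, and it assumes the cluster (e.g. EMR) already has S3 access configured.

```python
# Minimal PySpark ETL sketch: read raw parquet from S3, aggregate with both
# Spark SQL and the DataFrame API, write the curated result back to S3.
# Paths and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims-etl").getOrCreate()

# Ingest raw parquet files from the S3 data lake.
raw = spark.read.parquet("s3://my-datalake/raw/claims/")

# Register as a temp view so the transformation can be expressed in Spark SQL.
raw.createOrReplaceTempView("claims")

aggregated = spark.sql("""
    SELECT provider_id,
           COUNT(*)          AS claim_count,
           SUM(claim_amount) AS total_amount
    FROM claims
    WHERE claim_status = 'PAID'
    GROUP BY provider_id
""")

# The same aggregation via the DataFrame API, for comparison.
aggregated_df = (
    raw.filter(F.col("claim_status") == "PAID")
       .groupBy("provider_id")
       .agg(F.count("*").alias("claim_count"),
            F.sum("claim_amount").alias("total_amount"))
)

# Write the curated result back to the S3 stage used by Hive/Presto.
aggregated.write.mode("overwrite").parquet("s3://my-datalake/stage/claims_by_provider/")
```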
Environment: Spark, Spark Streaming, Scala, AWS S3 bucket, Snowflake, Python, PySpark, Spark SQL, Teradata, Tableau, CSV, JSON.
Confidential, Twin Cities, MN
Data Engineer
Responsibilities:
- Worked with different data feeds such as JSON, CSV, XML, and DAT, and implemented the data lake concept.
- Developed Informatica design mappings using various transformations.
- Most of the infrastructure is on AWS: used the AWS EMR distribution for Hadoop, AWS EC2 for Kafka, and AWS S3 for raw file storage.
- Used AWS Lambda to perform data validation, filtering, sorting, or other transformations for every data change in a DynamoDB table and to load the transformed data into another data store.
- Programmed ETL functions between Oracle and Amazon Redshift. Maintained end-to-end ownership of analyzed data, developed frameworks, and handled implementation, build, and communication for a range of customer analytics projects.
- Developed and implemented ETL pipelines using Python, SQL, Spark, and PySpark to ingest data and updates into relevant databases.
- Good exposure to the IRI end-to-end analytics service engine and the new big data platform (Hadoop loader framework, big data Spark framework, etc.).
- Used a Kafka producer to ingest raw data into Kafka topics and ran the Spark Streaming app to process clickstream events (see the streaming sketch after this list). Performed data analysis and predictive data modeling.
- Wrote Python, Spark, and PySpark scripts to build ETL pipelines that automate data ingestion and update data in the relevant databases and tables.
- Explored clickstream event data with Spark SQL. Contributed to the architecture and hands-on production implementation of the big data MapR Hadoop solution for digital media marketing using telecom data, shipment data, point of sale (POS), and exposure data.
- Responsibilities included platform specification and redesign of load processes, as well as projections of future platform growth.
- Coordinated deployments to the QA and PROD environments. Python was used to automate Hive jobs and read configuration files.
- Used Spark for fast processing of data, working with both the Spark shell and a Spark standalone cluster.
- Used Hive to analyze the partitioned data and compute various metrics for reporting.
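A minimal sketch of consuming the Kafka clickstream described above, shown here with Spark Structured Streaming rather than the older DStream API; the broker address, topic name, and event schema are hypothetical, and the spark-sql-kafka connector package is assumed to be available on the cluster.

```python
# Minimal sketch: read JSON clickstream events from a Kafka topic and compute
# windowed page-view counts. Broker, topic, and schema are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream-streaming").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("event_time", TimestampType()),
])

# Subscribe to the Kafka topic the producer writes raw events into.
events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker-1:9092")
         .option("subscribe", "clickstream")
         .load()
         # Kafka values arrive as bytes; parse the JSON payload.
         .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
         .select("e.*")
)

# Count page views per 5-minute window as a simple clickstream metric.
page_views = (
    events.withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "5 minutes"), "page")
          .count()
)

query = (
    page_views.writeStream
              .outputMode("update")
              .format("console")
              .start()
)
query.awaitTermination()
```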
Environment: MapReduce, HDFS, Hive, Python, Scala, Kafka, Spark, Spark SQL, Oracle, Informatica 9.6, SQL, MapR, Sqoop, Zookeeper, AWS EMR, AWS S3, Data Pipeline, Jenkins, Git, JIRA, Unix/Linux, Agile Methodology, Scrum.
Confidential
Data Analyst
Responsibilities:
- Built scalable and deployable machine learning models.
- Utilized Sqoop to ingest real-time data. Used the analytics libraries Scikit-learn, MLlib, and MLxtend. Extensively used Python data science packages such as Pandas, NumPy, Matplotlib, Seaborn, SciPy, Scikit-learn, and NLTK.
- Performed exploratory data analysis to find trends and clusters. Built models using techniques such as regression, tree-based ensemble methods, time series forecasting, KNN, clustering, and isolation forest methods.
- Worked on data that was a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
- Extensively performed large data reads/writes to and from CSV and Excel files using pandas.
- Tasked with maintaining RDDs using Spark SQL.
- Communicated and coordinated with other departments to collect business requirements. Tackled a highly imbalanced fraud dataset using undersampling with ensemble methods, oversampling, and cost-sensitive algorithms.
- Improved fraud prediction performance by using random forests and gradient boosting for feature selection with Python Scikit-learn.
- Implemented machine learning models (logistic regression, XGBoost) with Python Scikit-learn. Optimized the algorithms with stochastic gradient descent and fine-tuned algorithm parameters with manual tuning and automated tuning such as Bayesian optimization (a brief scikit-learn sketch follows this list).
- Developed a technical brief based on the business brief. This contains detailed steps and stages of developing and delivering the project including timelines.
- After sign-off from the client on technical brief, started developing the SAS codes.
- Wrote data validation SAS code with the help of the UNIVARIATE and FREQ procedures.
- Measured ROI based on the differences between pre- and post-promotion KPIs. Extensively used SAS procedures such as IMPORT, EXPORT, SORT, FREQ, MEANS, FORMAT, APPEND, UNIVARIATE, DATASETS, and REPORT.
- Standardized the data with the help of PROC STANDARD.
- Responsible for maintaining and analyzing large datasets used to analyze risk by domain experts. Developed Hive queries that compared new incoming data against historic data. Built tables in Hive to store large volumes of data.
- Used big data tools in Spark (Spark SQL, MLlib) to conduct real-time analysis of credit card fraud on AWS. Performed data audits, QA of SAS code/projects, and sense checks of results.
- Iteratively rebuilt models to deal with changes in data, refining them over time.
- Extensively used SQL queries for legacy data retrieval jobs. Tasked with migrating the Django database from MySQL to PostgreSQL.
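A minimal scikit-learn sketch of the imbalanced-fraud workflow described above (random forest feature selection followed by a boosted model), using GradientBoostingClassifier as a stand-in for XGBoost; the input file and label column are hypothetical.

```python
# Minimal sketch of the imbalanced-fraud workflow: cost-sensitive random
# forest feature selection, then a boosted classifier on the selected
# features. The CSV path and label column are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("transactions.csv")          # hypothetical input file
X, y = df.drop(columns=["is_fraud"]), df["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Cost-sensitive random forest: class_weight offsets the class imbalance.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)

# Keep features whose importance exceeds the mean importance (default threshold).
selector = SelectFromModel(rf).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Boosted model on the selected features (stand-in for XGBoost).
gb = GradientBoostingClassifier(random_state=42).fit(X_train_sel, y_train)
print(classification_report(y_test, gb.predict(X_test_sel)))
```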
Environment: Spark, Hadoop, AWS, SAS Enterprise Guide, SAS/MACROS, SAS/ACCESS, SAS/STAT, SAS/SQL, Oracle, MS Office, Python (scikit-learn, pandas, NumPy), Machine Learning (logistic regression, XGBoost), gradient descent algorithm, Bayesian optimization, Tableau.
Confidential
SQL Server DBA
Responsibilities:
- Involved in wire server upgrades and consolidation, the Sheshunoff Deposit scorecard application and database upgrade, the Compass money room application upgrade and DR environment implementation, etc.
- Worked with business units, application users, and branch managers to provide different types of support such as password resets, creating user logins, and granting object permissions to make sure applications are operating well.
- Used Idera Diagnostic Manager to monitor resource utilization such as disk space used (as a percentage), I/O bottlenecks, CPU bottlenecks, blocking, etc.
- Excellent at creating reports with SQL Server Reporting Services (SSRS). Used SSRS to create standard, customized, on-demand, and ad hoc reports, and was involved in analyzing multi-dimensional reports in SSRS.
- Followed the standard HPSM ticketing system procedure before making any change in the production environment.
- Experience in creating ETL mappings in Informatica to load data from different sources to different target databases.
- Designed user defined hierarchies in SSAS including parent-child hierarchy.
- Created new cubes and modified existing cubes using SSAS 2008 to make data available for the decision makers.
- Used SSIS, an Extract, Transform, Load (ETL) tool, to gather data from various data sources and created packages for different data loading operations for the application, along with bcp utilities.
- Used the Idera SQL Safe tool to back up and restore databases and for extensive database space/alert management.
- Implemented database maintenance plans for database optimization. Updated and generated database script files whenever changes were made to stored procedures or views.
- Used DBCC utilities and updated database statistics weekly to fix data corruption in application databases.
- Used the DBArtisan tool to perform administrative tasks on both Oracle and MS SQL servers. Administered MS SQL Server by creating user logins with appropriate roles, dropping and locking logins, monitoring user accounts, creating groups, and granting privileges to users and groups.
Environment: Windows 2000/2003/Windows 7 SP1, MS SQL Server 2000/2005, Oracle 10g/9i/8i, SharePoint MOSS, SSAS, Microsoft Access 2010/2003/2000/8.0, Tool 6.0.1907.2, C, VB.NET, VBScript, XML, Visual Studio .NET, Visio Enterprise, MS Reporting Services, MS Replication Server.