Senior Data Engineer Resume
SUMMARY:
- 7+ years of experience working with most Hadoop ecosystem components, AWS cloud services, Google Cloud Platform, Apache Spark, and Apache Beam, and 3+ years of experience with Talend Data Integration/Big Data Integration (6.1/5.x) and Talend Data Quality. An enthusiastic and dedicated engineer who accurately performs challenging tasks that demand precision and strong analytical ability. Strengths include time management, problem-solving, and decision-making, with the ability to set priorities and consistently deliver results.
- Extensive hands-on experience with major Hadoop ecosystem components such as MapReduce, HDFS, Hive, HBase, Sqoop, and Apache Solr.
- Developed jobs, components, and Joblets in Talend; designed ETL jobs/packages using Talend Integration Suite.
- Created Talend jobs to load data into various Oracle tables, utilizing Oracle stored procedures and writing Java code to capture global map variables and reuse them in jobs.
- Good experience developing multiple Kafka producers and consumers to meet business requirements (a brief sketch follows this summary).
- Experience writing custom Mapper, Reducer, Partitioner, and Combiner classes in MRv1/MRv2.
- Extended Hive functionality with custom UDFs and UDAFs.
- Experience importing and exporting relational data into HDFS using Sqoop and custom MapReduce jobs.
- Working experience implementing ETL pipelines with AWS services such as Glue, Lambda, EMR, Athena, S3, SNS, Kinesis, and Data Pipeline, along with PySpark.
- Developed UDFs for encrypting and decrypting data in Hive tables.
- Worked on Hive-HBase integration, retrieving HBase data from Hive and vice versa.
- Worked extensively on HBase bulk loading and developed MapReduce programs to generate HFiles and load them into HBase.
- Hands-on experience setting up SolrCloud with HDFS as the underlying index storage.
- Deep hands-on development expertise in data architecture, batch and real-time data integration, Azure Databricks, the Snowflake analytical warehouse, Apache Spark, relational databases, and OLAP (SQL Server).
- Versatile experience with the Azure cloud: 4 years of experience in software development, analysis, datacenter migration, and Azure Data Factory (ADF) V2, managing databases and Azure data platform services such as Azure Data Lake Storage (ADLS), Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL Data Warehouse, SQL Server, and data warehouses.
- Expert in generating on-demand and scheduled reports for business analysis and management decisions using SQL Server Reporting Services (SSRS) and Power BI.
- Created Excel reports and dashboards and performed data validation using VLOOKUP, HLOOKUP, macros, formulas, INDEX MATCH, slicers with pivot tables and GETPIVOTDATA, Power View, Power Map, and heat maps.
- Expert in building Microsoft Power BI reports and dashboards and publishing them to end users for executive-level business decisions.
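Illustrative of the Kafka producer/consumer work above, a minimal Python sketch using the kafka-python client; the broker address and topic name are placeholders, not details from any actual engagement.

```python
# Minimal sketch of a Kafka producer/consumer pair using kafka-python.
# Broker and topic names below are hypothetical placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["localhost:9092"]   # assumed broker address
TOPIC = "orders"               # hypothetical topic name

# Producer: serialize dicts as JSON and publish them.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"order_id": 42, "amount": 19.99})
producer.flush()

# Consumer: read from the beginning of the topic and deserialize each record.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)       # process each event as it arrives
```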
TECHNICAL SKILLS:
Hadoop: Spark, Hive, Oozie, Sqoop, Kafka, HDFS, YARN, Zeppelin, and HBase
AWS: EMR, Glue, Athena, DynamoDB, Redshift, RDS, Data pipelines, Lake formation, S3, IAM, CloudFormation, EC2, ELB/CLB, Terraform
Operating systems: Amazon Linux 1 and 2, Custom AMIs based on Amazon Linux with encryption, Windows, CentOS, RHEL
Programming languages: Java, Python, Scala, SAS, Spark, Glue ETL
Web: Servlets, JSP, Spring MVC, Spring Boot, Hibernate
Frontend: HTML, XML, React.js, AngularJS, Node.js
Database: DynamoDB, HBase, Teradata, MongoDB, MySQL, SQL Server 2008, PostgreSQL
Version control: Git, SVN, SourceTree
Scripting languages: Shell scripting, PowerShell, Bash
DevOps platforms: Docker, Jenkins, Kubernetes, Ansible
Streaming platforms: Kafka, Confluent Kafka
Azure: Data Lakes, Data Factory, SQL Data warehouse, Data Lake Analytics, Databricks, other Azure services
Data Analytics: ML, AI, MLOPS, Tableau, PowerBI, Microsoft Excel
Big Data: GCP, Airflow, MongoDB, Fusion
PROFESSIONAL EXPERIENCE:
Confidential
Senior Data Engineer
Responsibilities:
- Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
- Coordinated with development teams to determine application requirements and wrote scalable code in Python.
- Configured Spark Streaming to receive real-time data from Kafka, persist the stream to HDFS, and process it using Spark and Scala (a brief sketch follows this list).
- Analyzed structured, unstructured, and file system data, loaded it into HBase tables per project requirements using IBM Big SQL with Sqoop, processed it using Spark SQL in-memory computation, and wrote the results to Hive and HBase.
- Defined virtual warehouse sizing in Snowflake for different types of workloads.
- Built the logical and physical data models for Snowflake as per the required changes.
- Developed, optimized, and fine-tuned SQL queries using SnowSQL; worked alongside AWS architecture and engineering teams to design and implement scalable software services.
- Built ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake (a Python sketch also follows this list).
- Set up alerting and monitoring using Stackdriver in GCP.
- Built a real-time event data collection and storage platform on GCP to support high-throughput ingestion.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the results in Parquet format on HDFS.
- Loaded data from Linux/Unix file systems into HDFS and used PuTTY to connect between Unix and Windows systems and access data files in the Hadoop environment.
- Developed and implemented HBase storage for large de-normalized datasets, then applied transformations to those datasets using Spark/Scala.
- Automated a previously manual ETL process to integrate data from multiple sources and load it into Amazon Redshift using Python.
- Created Spark jobs to extract data from Hive tables and process it using Dataproc; loaded historical data into Cloud Storage using Hadoop utilities and into BigQuery using the bq tools.
- Worked on Google Cloud Platform Architecture for business needs by designing, building, and configuring applications to meet business process and application requirements.
- Developed and implemented a custom Spark ETL component to extract data from upstream systems, push it to HDFS, and store it in HBase in a wide-row format.
- Provided guidance to the development team working on PySpark as an ETL platform
- Practical understanding of the Data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
- Managed Google Cloud Storage buckets to use the storage service efficiently.
- Optimized Hive queries using best practices and appropriate parameters, working with Hadoop, YARN, Python, and PySpark.
- Created and cloned jobs and job streams in the TWS tool and promoted them to higher environments.
- Prepared reports using graphs, charts, dashboards, and other visualization techniques.
- Created dashboards and reports utilizing BI technologies to show analytical results for internal governance team usage.
- Developed, built, implemented, and troubleshot CI/CD pipelines, connected the Jenkins master to GitHub via plugins, and installed security components in on-premises and virtualized environments.
- Created Talend ETL jobs to receive attachment files from POP email using tPOP, tFileList, and tFileInputMail, then loaded data from the attachments into the database and archived the files.
- Well versed with Talend Big Data, Hadoop, Hive and used Talend Big data components like tHDFSInput, tHDFSOutput, tPigLoad, tPigFilterRow, tPigFilterColumn, tPigStoreResult, tHiveLoad, tHiveInput, tHbaseInput, tHbaseOutput, tSqoopImport and tSqoopExport.
- Designed and developed data pipeline architectures using Hadoop, Spark, Kafka, and related AWS services.
- Expertise working with and understanding existing infrastructure as well as target cloud infrastructure in order to build Terraform scripts that describe and deploy infrastructure.
- Created, managed, and maintained Terraform modules and helped define cloud security requirements and audit configurations.
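Illustrative of the Spark Streaming bullets above, a minimal PySpark sketch that reads a Kafka topic and persists the stream to HDFS as Parquet; the broker, topic, and paths are placeholders, the production jobs described above were written in Scala, and the spark-sql-kafka package is assumed to be on the classpath.

```python
# Minimal PySpark sketch: stream from Kafka and write Parquet files to HDFS.
# Broker, topic, and HDFS paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read the Kafka topic as a streaming DataFrame.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # assumed broker
    .option("subscribe", "events")                       # hypothetical topic
    .load()
    .select(col("value").cast("string").alias("payload"), col("timestamp"))
)

# Persist the stream to HDFS in Parquet format with checkpointing.
query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/events/parquet")           # placeholder path
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```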
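Illustrative of the Snowflake/SnowSQL bullets above, a minimal Python sketch using the snowflake-connector-python package to run a COPY INTO load followed by a row-count check; the account, warehouse, stage, and table names are placeholders.

```python
# Minimal sketch of loading a staged file into Snowflake from Python.
# All connection parameters and object names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345",          # assumed account identifier
    user="ETL_USER",
    password="********",
    warehouse="ETL_WH",         # hypothetical virtual warehouse
    database="ANALYTICS",
    schema="STAGING",
)
try:
    cur = conn.cursor()
    # Copy a staged CSV file into a target table, then run a simple check.
    cur.execute(
        "COPY INTO STAGING.ORDERS FROM @ORDERS_STAGE "
        "FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)"
    )
    cur.execute("SELECT COUNT(*) FROM STAGING.ORDERS")
    print("rows loaded:", cur.fetchone()[0])
finally:
    conn.close()
```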
Environment: Scala, Spark framework, Linux, Jira, Bitbucket, IBM Big SQL, Hive, HBase, Kafka, IntelliJ IDEA, Maven, Db2 Visualizer, ETL, TeamCity, WinSCP, PuTTY, IBM TWS (Tivoli Workload Scheduler), Windows, Data Factory, Tableau, Talend, DataStage, ETL pipelines, Agile Scrum, Google Cloud Platform, PySpark, Erwin r7.1/7.2, ER Studio V8.0.1, Oracle Designer, GCP services (Dataproc, Dataflow, BigQuery), AVRO, Java 8, AWS
Confidential
Senior Data Engineer
Responsibilities:
- Used Apache Spark for data analytics such as filtering, join enrichment, Spark SQL, and Spark Streaming.
- Installed and configured a multi-node Hadoop cluster for data storage and processing.
- Used SSMS to access, configure, administer all components of SQL Server, Azure SQL Database, and Azure Synapse Analytics.
- Developed Hive UDFs and Pig UDFs using Python in the Microsoft HDInsight environment (a brief sketch follows this list).
- Importing and exporting data into HDFS, HBase, and Hive using Sqoop.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra.
- Involved in various projects related to Data Modeling, System/Data Analysis, Design, and Development for both OLTP and Data warehousing environments.
- Ensure necessary system security by using best-in-class AWS cloud security solutions.
- Set up GCP firewall rules to allow or deny traffic to and from VM instances based on specified configurations, and used GCP Cloud CDN (content delivery network) to deliver content from GCP cache locations, drastically improving user experience and latency.
- Experienced in deploying Java projects using Maven/ANT and Jenkins.
- DevOps and CI/CD pipeline knowledge, mainly TeamCity and Selenium.
- Worked on Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
- Implement continuous integration/continuous delivery (CI/CD) pipelines in AWS when necessary.
- Designed, developed, tested, deployed, maintained, and improved data integration pipeline objects built with Apache Spark/PySpark in Python or Scala.
- Experienced in working with AWS services like EMR, Redshift, S3, Glue, Kinesis, and Lambda for serverless ETL.
- Managed the development and performance of SQL databases for web applications, businesses, and organizations using SQL server management studio (SSMS).
- Engineered, designed, developed, and advanced Query Processing and Self-Tuning functionality using Synapse SQL.
- Demonstrated strength in data modeling, ETL development, and data warehousing
- Load and transform large sets of structured and semi-structured data. Implemented solutions using Hadoop.
- Assisted management with tool and metrics development, data interpretation and analysis, and process improvement in PHP.
- Developed enterprise-scale data platforms and pipelines to support analytics and machine learning/artificial intelligence solutions
- Solid familiarity with Azure's analytics stack: Data Lake, Data Explorer/Kusto, Storage, Data Factory, Synapse, Databricks, and HDInsight.
- Worked with engineers, developers, and QA on the development of current and future applications related to the content management line of business.
- Used Oozie and Zookeeper operational services for coordinating cluster and scheduling workflows.
- Implemented ML programs to analyze large datasets in the warehouse for BI purposes.
- Worked on the implementation of a log producer that watches application logs, transforms incremental log records, and sends them to Kafka, coordinated by ZooKeeper.
- Developed a wrapper in Python to run this application alongside other applications.
- Used the most common Talend components (tMap, tDie, tConvertType, tFlowMeter, tLogCatcher, tRowGenerator, tSetGlobalVar, tHashInput, tHashOutput, and many more).
- Worked on various Talend components such as tMap, tFilterRow, tAggregateRow, tFileExist, tFileCopy, tFileList, and tDie.
- Experienced in building automation using Jenkins, Maven, ANT.
- Created a de-normalized BigQuery schema for analytical and reporting requirements.
- Designed, led, and managed the SAS environment.
- Assisted in the establishment of new users. Responded to requests and problems from the Service Desk.
- Completed capacity planning and monitored utilization indicators.
- Planned and implemented upgrades to SAS infrastructure technology.
- Remediated SAS infrastructure security vulnerabilities on a quarterly basis.
- Made changes to the SAS infrastructure deployment architecture where necessary.
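Illustrative of the Hive-UDFs-in-Python bullet above: Hive usually plugs Python logic in through its TRANSFORM clause (streaming rows to the script over stdin/stdout), so the sketch below shows that pattern rather than a native UDF; the column layout and timestamp format are hypothetical.

```python
#!/usr/bin/env python
# Minimal sketch of Python logic invoked from Hive via TRANSFORM.
# Hive streams rows as tab-separated text on stdin; the assumed layout
# is: user_id \t raw_timestamp.
import sys
from datetime import datetime

for line in sys.stdin:
    user_id, raw_ts = line.rstrip("\n").split("\t")
    # Normalize the timestamp to an ISO date; flag bad rows instead of failing.
    try:
        day = datetime.strptime(raw_ts, "%m/%d/%Y %H:%M:%S").strftime("%Y-%m-%d")
    except ValueError:
        day = "unknown"
    print("%s\t%s" % (user_id, day))
```

From Hive, such a script would typically be registered with ADD FILE and invoked as SELECT TRANSFORM(user_id, ts) USING 'python normalize_ts.py' AS (user_id, day) FROM events; (table and script names here are illustrative only).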
Environment: Scala 10.4, Apache Spark 1.6.2, Apache Hadoop 2.6, HDFS, MapReduce, Pig, Hive, Sqoop, Flume, HBase, JBoss 6.1 server, Oracle DB 10g, Kafka, Tableau, Talend Data Integration 6.1/5.5.1, Talend Enterprise Big Data Edition 5.5.1, Talend Administrator Console, AWS cloud services, ETL pipelines, Jenkins, Maven, Agile Scrum, Google Cloud Platform
Confidential, Cleveland, OH
Data Engineer
Responsibilities:
- Prepared design blueprints and application flow documentation
- Maintained data in the data lake through extraction, transformation, and loading of data coming from the Teradata database.
- Responsible for creating Hive tables to load data imported from MySQL using Sqoop.
- Experienced in creating Hive schemas and external tables, managing views, and performing join operations in Spark using Hive.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and working with spark-shell (a brief sketch follows this list).
- Extensive experience in Relational Data Modeling, Dimensional Data Modeling, Logical/Physical Design, ER Diagrams, Forward and Reverse Engineering, Publishing Erwin diagrams, analyzing data sources, and creating interface documents.
- Writing optimized SQL queries for integration with other applications and creating database triggers for use in automation. Maintaining data quality and overseeing database security.
- Developed Spark code using Java and Spark-SQL for faster testing and data processing.
- Designed Solution architecture on GCP in multiple projects.
- Used Spark SQL to process a huge amount of structured data.
- Analyzed existing SQL scripts and redesigned them using PySpark SQL for better performance.
- Expertise in Microsoft Azure Cloud Services (PaaS & IaaS), Application Insights, Document DB, Internet of Things (IoT), Azure Monitoring, Key Vault, Visual Studio Online (VSO), and SQL Azure.
- Designing, developing, troubleshooting, evaluating, deploying, and documenting data management and business intelligence systems, enabling stakeholders to manage the business and make effective decisions.
- Implemented scalable and sustainable data engineering solutions using tools such as Databricks, Matillion, Snowflake, Apache Spark, and Python.
- Created, maintained, and optimized data pipelines as workloads moved from development to production for specific use cases.
- Designed, developed, and implemented ETL pipelines using the Python API of Apache Spark (PySpark).
- Performance-tuned PySpark scripts.
- Integrated Apache Storm with Kafka to perform web analytics and move clickstream data from Kafka to HDFS.
- Scheduling and allocating work, providing advice and guidance, and resolving problems to meet technical performance and financial objectives
- Implemented Spark DataFrame transformations and actions to migrate MapReduce algorithms.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and Spark on YARN.
- Used the DataFrame API to pre-process large sets of structured data in different file formats (text, CSV, sequence files, XML, JSON, and Parquet), converting distributed collections of data into named columns.
- Designed and implemented highly performant data ingestion pipelines from multiple sources using Apache Spark and/or Azure Databricks.
- Experienced in Architecting applications for the Google Cloud Platform.
- Expert in implementing advanced procedures like text analytics and processing using in-memory computing capabilities like Apache Spark written in Java.
- Responsible for building and maintaining three batch frameworks utilizing AutoSys and Unix Korn shell scripts.
- Experience managing Azure Data Lake Storage (ADLS) and Data Lake Analytics and an understanding of how to integrate with other Azure services; knowledge of U-SQL and how it can be used for data transformation as part of a cloud data integration strategy.
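Illustrative of the Hive/SQL-to-Spark conversion bullets above, a minimal PySpark sketch that rewrites a Hive-style aggregation as DataFrame transformations; the table and column names are hypothetical.

```python
# Minimal PySpark sketch: convert a Hive/SQL aggregation into DataFrame calls.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("hive-to-dataframe")
    .enableHiveSupport()
    .getOrCreate()
)

# Original Hive-style query (placeholder table/columns):
#   SELECT region, SUM(amount) AS total_sales
#   FROM sales WHERE year = 2020 GROUP BY region;
sales = spark.table("sales")                    # Hive table (hypothetical)
totals = (
    sales.filter(F.col("year") == 2020)
    .groupBy("region")
    .agg(F.sum("amount").alias("total_sales"))
)
totals.show()
```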
Environment: Hadoop, HDFS, Hive, Java 1.7, Spark 1.6, Kafka, SQL, HBase, UNIX shell scripting, MapReduce, PuTTY, WinSCP, IntelliJ, Teradata, Linux, Redshift, Azure Databricks, Google Cloud Platform, Python, PySpark.
Confidential, Costa Mesa, CA
Data Engineer
Responsibilities:
- Processed raw log files from set-top boxes using Java MapReduce code and shell scripts and stored them as text files in HDFS.
- Extensive involvement in designing Azure Resource Manager templates and custom build steps using PowerShell.
- Worked with Apache Sqoop, Flume, Java MapReduce programs, Hive queries, and Pig scripts.
- Generated required reports for the operations team from the ingested data using Oozie workflows and Hive queries.
- Worked with reporting and BI tools such as Microsoft SQL Server Reporting Services (SSRS) and SAP Crystal Reports.
- Worked on NoSQL database systems, such as MongoDB and CouchDB
- Experience in ETL/Pipeline Development using tools such as Azure Databricks, Matillion, Apache Spark, and Python
- Wrote MapReduce code to convert unstructured and semi-structured data into structured data and load it into Hive tables.
- Coordinate all Scrum Ceremonies including Sprint Planning, Daily Standups, Sprint retrospectives, Sprint Demos, Story Grooming, and Release Planning
- Involved in a Spark Streaming solution for time-sensitive, revenue-generating reports to keep pace with data from upstream set-top box (STB) systems.
- Experience working on HBase with Apache Phoenix as a data layer serving web requests within SLA requirements.
- Experienced in architecting and designing solutions leveraging services such as BigQuery, Cloud Dataflow, Cloud Pub/Sub, and Cloud Bigtable.
- Worked on SFDC ODATA connector to get the data from NodeJS services which in turn fetch the data from HBase.
- Hands-on experience in Azure development; worked on Azure web applications, App Services, Azure Storage, Azure SQL Database, virtual machines, Fabric Controller, Azure AD, Azure Search, and Notification Hubs.
- Utilized AWS S3 to push/store and pull data for external applications (a brief sketch follows this list).
- Responsible for functional requirements gathering, code reviews, deployment scripts, and procedures, offshore coordination, and on-time deliverables.
- Automated the build and configuration of IaaS-based solutions in Google Cloud Platform.
- Leveraged Google Cloud Platform Services to process and manage the data from streaming and file-based sources
- Experience in migrating the existing v1 (Classic) Azure infrastructure into v2 (ARM), scripting and templating the whole end-to-end process as much as possible so that it is customizable for each area being migrated.
- Designed, configured, and deployed Microsoft Azure for a multitude of applications utilizing the Azure stack (including Compute, Web & Mobile, Blobs, Resource Groups, Azure SQL, Cloud Services, and ARM), focusing on high availability, fault tolerance, and auto-scaling.
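Illustrative of the S3 push/pull bullet above, a minimal boto3 sketch; the bucket and key names are placeholders.

```python
# Minimal sketch of pushing and pulling data files with Amazon S3 via boto3.
# Bucket, key, and file names are hypothetical.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-bucket"      # placeholder bucket name

# Push a local extract up to S3 for downstream applications.
s3.upload_file("daily_extract.csv", BUCKET, "inbound/daily_extract.csv")

# Pull a file produced by an external application back down for processing.
s3.download_file(BUCKET, "outbound/results.csv", "results.csv")
```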
Environment: Apache Hadoop, HDFS, Pig, Hive, Flume, Kafka, MapReduce, Sqoop, Spark, Oozie, Linux, NodeJS, SFDC, ODATA, AWS, Agile Scrum, GCP
Confidential, Seattle, WA
Data Analyst
Responsibilities:
- Experience extracting data from MySQL into HDFS using Sqoop.
- Exported the analyzed data to the Relational databases using Sqoop for performing visualization and generating reports for the Business Intelligence team.
- Developed simple to complex MapReduce jobs (a brief sketch follows this list).
- Analyzed data by running Hive queries and Pig scripts to understand user behavior, and created partitioned tables in Hive.
- Administered and supported the Hortonworks distribution.
- Wrote Korn shell, Bash shell, and Perl scripts to automate most database maintenance tasks.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Monitored running MapReduce programs on the cluster.
- Responsible for loading data from UNIX file systems to HDFS.
- Created documents and executed software designs in PHP that involved complicated workflows or multiple product areas.
- Owned all requests for an alternate UNIX/Oracle-based system, including bug fixes, change requests, and tuning; performed implementation, testing, and documentation for this system.
- Consulted with project managers, business analysts, and development teams on application development and business plans.
- Installed and configured Hive and Created Hive UDFs.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that invoke and run MapReduce jobs in the backend.
- Implemented the workflows using the Apache Oozie framework to automate tasks.
- Developed scripts and automated data management from end to end and sync up between the clusters.
- Designed, developed, tested, and deployed Power BI scripts and performed detailed analytics.
- Performed DAX queries and functions in Power BI.
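Illustrative of the MapReduce bullets above, a minimal Hadoop Streaming word-count sketch in Python (the jobs described above were written in Java; this shows only the mapper/reducer pattern, with a single script serving both roles via a command-line argument).

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming sketch of a word-count MapReduce job.
# Run as "python wordcount.py map" for the mapper and
# "python wordcount.py reduce" for the reducer.
import sys

def mapper():
    # Emit (word, 1) pairs as tab-separated lines.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word.lower())

def reducer():
    # Hadoop sorts mapper output by key, so equal words arrive together.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Such a script would be submitted with the hadoop-streaming jar, passing it as both the -mapper ("python wordcount.py map") and -reducer ("python wordcount.py reduce") commands along with -input and -output paths.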
Environment: Apache Hadoop, Java, Bash, ETL, MapReduce, Hive, Pig, Hortonworks, deployment tools, DataStax, flat files, Oracle 11g/10g, MySQL, Windows NT, UNIX, Sqoop, Oozie, Tableau.