Data Engineer Resume
Boyertown, PA
SUMMARY
- 7+ years of experience in the analysis, design, development, and implementation of ETL projects and complex ETL pipelines, including work in Palantir Foundry
- Experienced in the full SDLC, from design and development through testing and documentation, using various methodologies
- Extensive experience in data warehousing and decision support systems, including implementing end-to-end data warehousing solutions
- Proficient in understanding business processes / requirements and translating them into technical requirements
- Excellent communication skills; work successfully in fast-paced, multitasking environments both independently and in collaborative teams; a self-motivated, enthusiastic learner
- Strong in data warehousing concepts and dimensional modeling, including Star Schema and Snowflake Schema methodologies
- Extensive experience in the design, development, and implementation of data warehouses and data marts using Informatica, Redshift, Teradata, Python, Airflow, and Unix
- Experienced in relational databases including Oracle, Teradata, PostgreSQL, and Redshift
- Extensive Experience in Performance Tuning in Informatica & Teradata
- Created Informatica Mappings and Mapplets using different transformations
- Extensively used different Informatica transformations like Source Qualifier, Filter, Aggregator, Expression, Connected and Unconnected Lookup, Sequence Generator, Router, Update Strategy, Normalizer and Transaction Control
- Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files
- Proficient in SQL and Informatica Pushdown Optimization (PDO), with experience in shell scripting on UNIX and Linux
- Over 4 years of experience with the Apache Hadoop ecosystem, including the Apache Spark framework, HBase, and Hive, building pipelines and migrating data to and from the cloud
- Result-oriented professional experienced in creating data mapping documents, writing functional specifications and queries, and normalizing data from 1NF to 3NF/4NF; skilled in requirements gathering, system and data analysis, data architecture, database design and modeling, and the development, implementation, and maintenance of OLTP and OLAP databases
- Extensive experience designing, creating, testing, and maintaining complete data management flows across data ingestion, curation, and provisioning, with in-depth knowledge of Spark APIs (Spark SQL, DataFrame DSL, Streaming), file formats such as Parquet and JSON, and performance tuning of Spark applications
- Gathered and translated business requirements into technical designs and implemented the physical aspects of those designs by creating materialized views, views, and lookups
- Experience designing and testing highly scalable, mission-critical systems, including Spark jobs in both Scala and PySpark as well as Kafka
- Expertise in writing end-to-end data processing jobs to analyze data using MapReduce, Spark, and Hive
- Created Informatica metadata queries for project impact analysis, reducing rework and manual effort and enabling accurate estimates
- Proven ability to manage multiple projects and coordinate efforts of several parties to achieve milestones within scope and time
- Experienced in Python programming to set up and orchestrate ETL pipelines
- Experienced in handling slowly changing dimensions (SCD Types 1 and 2)
- Experienced with normalization and denormalization concepts and datasets
- Experienced in Job Monitoring, setting up alarms on job failures, troubleshooting and restarting jobs based on the error
- Experienced in troubleshooting Data Quality Issues, ETL Job failures and backfills
- Experienced in Data Governance practices & processes
- Experienced in code deployments to Production Systems and automating daily operational tasks
- Experienced in project deployment using CI/CD
- Experienced with event-driven and scheduled AWS Lambda functions to trigger various AWS resources (a minimal sketch appears after this summary)
- Developed Python automation scripts to facilitate quality testing
- Estimate effort and timelines and highlight associated risks
- Experience in cleansing and trapping incorrect data
- Good working experience in Agile (SCRUM) and waterfall methodologies, with high-quality deliverables delivered on time
- Wrote Python modules to extract/load asset data from the MySQL source database
- Good working experience in using version control systems like Git and GitHub
- Excellent problem-solving and sound decision-making capabilities; recognized by associates for data quality, alternative solutions, and confident decision making
- Good working knowledge of the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, IAM, DynamoDB, Redshift, CloudWatch, Auto Scaling, Security Groups, CloudFormation, Kinesis, SQS, and SNS
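A minimal sketch of the event-driven Lambda pattern noted above, assuming an S3 "ObjectCreated" notification triggering a downstream AWS Glue job; the bucket handling and the `daily_curation_job` name are illustrative placeholders rather than project code:

```python
# Hedged sketch: Lambda handler fired by an S3 event that kicks off a Glue job.
# The Glue job name and the "--source_path" argument are hypothetical examples.
import os
import urllib.parse

import boto3

glue = boto3.client("glue")


def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Start the downstream Glue job, passing the newly landed object as an argument.
        response = glue.start_job_run(
            JobName=os.environ.get("GLUE_JOB_NAME", "daily_curation_job"),
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        print(f"Started Glue run {response['JobRunId']} for s3://{bucket}/{key}")
```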
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential, Boyertown, PA
Responsibilities:
- Created workflows and data pipelines using Apache Airflow
- Loaded Kinesis stream data through Kinesis Data Firehose into the S3 data lake
- Performed ETL on the S3 data lake using transient EMR clusters
- Used the Glue Data Catalog as the Hive metastore
- Loaded data from S3 into the Redshift warehouse using the COPY command (see the sketch at the end of this role)
- Created Hive tables with the DynamoDB SerDe for daily loads of data from DynamoDB to S3
- Created S3 lifecycle policies to clean up old S3 objects and to move data from the S3 Standard storage tier to Glacier for archiving
- Wrote Spark jobs to apply transformations and business rules, munge S3 data, and join it with reference data
- Wrote shell scripts for submitting Spark jobs
- Performed Redshift modeling to choose distribution styles and sort key definitions
- Ran Redshift VACUUM FULL and ANALYZE commands to reclaim the space occupied by deleted rows, re-sort the tables, and update the metadata used in query plan generation
- Performed data profiling to understand the context of the data
- Developed Airflow jobs for Data cleaning using Business Rules
- Analyzed and tuned AWS EMR jobs by adjusting the required task nodes and AWS Spot Instance types in the CloudFormation template
- Applied transformations to data in the S3 staging zone
- Wrote extensive Spark/Scala code using DataFrames, Datasets, and RDDs to transform transactional database data and load it into Redshift tables
- Implemented AWS Redshift design best practices, optimizing queries through the distribution style of fact tables
- Reduced the latency of Spark jobs by tweaking Spark configurations and applying other performance optimization techniques such as memory tuning, serializing RDD data structures, broadcasting large variables, and improving data locality
- Implemented the Delta Lake feature to overcome the challenges of backfills and re-ingestion into the data lake
- Used AWS auto scaling to grow and shrink clusters based on the intensity of the computation
- Used a serverless computing platform (AWS Lambda) for running the Spark jobs
- Created views in Redshift for BI teams' reporting use
- Built an ontology-backed time series analysis and monitoring product for Palantir Foundry using Java and TypeScript, enabling clients to visually analyze petabytes of time series data in the context of real-world assets
- Built the first machine learning product for Palantir Foundry using Java, Python, and TypeScript, enabling clients to manage ML models in Foundry and understand model performance
- Designed and prototyped new datastore and schema to replace non-performant legacy datastore for one of Palantir’s largest commercial customers
- Troubleshot Redshift performance using AWS CloudWatch metrics for CPU utilization and storage utilization
- Developed serverless workflows using AWS Step Function service and automated the workflow using AWS CloudWatch.
- Actively participated in meetings and resolved technical issues
Environment: Apache Airflow, Redshift, Hadoop, Spark, AWS EMR, DynamoDB, S3 Data Lake, Hive, EC2, AWS Glue, Scala, Kinesis Data Streams, Kinesis Data Firehose.
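A minimal sketch of the S3-to-Redshift load and maintenance pattern described in this role (COPY followed by VACUUM FULL and ANALYZE), assuming Parquet data staged under an S3 prefix; the cluster endpoint, IAM role ARN, and table names are placeholders:

```python
# Hedged sketch: load staged Parquet from S3 into Redshift, then reclaim space
# and refresh planner statistics. Connection details below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="********",
)
conn.autocommit = True  # VACUUM cannot run inside a transaction block

copy_sql = """
    COPY analytics.fact_orders
    FROM 's3://example-data-lake/curated/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

with conn.cursor() as cur:
    cur.execute(copy_sql)
    cur.execute("VACUUM FULL analytics.fact_orders;")  # reclaim deleted rows, re-sort
    cur.execute("ANALYZE analytics.fact_orders;")      # update metadata for query plans

conn.close()
```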
Data Engineer
Confidential, Irving, TX
Responsibilities:
- Implemented Hadoop framework to capture user navigation across the application to validate the user interface and provide analytic feedback/result to the UI team.
- Developed MapReduce jobs using the Java API and Pig Latin.
- Worked on loading data into the cluster from dynamically generated files using Flume, and exported data from the cluster to relational database management systems using Sqoop.
- Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.
- Managed resources and scheduling across the cluster using Azure Kubernetes Service.
- Worked on Oozie to define and schedule Apache Hadoop jobs as directed acyclic graphs (DAGs) of actions with control flows.
- Involved in creating Hive tables and working on them using HiveQL and performing data analysis using Hive and Pig.
- Designed the distribution strategy for tables in Azure SQL Data Warehouse.
- Experienced in binding services in the cloud; installed Pivotal Cloud Foundry (PCF) on Azure to manage the containers created by PCF.
- Applied transformations to the data loaded into Spark DataFrames and performed in-memory computation to generate the output response.
- Good knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
- Used the Spark DataFrame API to perform analytics on Hive data and used DataFrame operations to perform the required validations on the data.
- Built end-to-end ETL models to sort through vast amounts of customer feedback and derive actionable insights and tangible business solutions.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the Spark engine and Spark SQL for data analysis, providing results to the data scientists for further analysis.
- Prepared scripts to automate the ingestion process using PySpark and Scala as needed from various sources such as APIs, AWS S3, Teradata, and Snowflake.
- Responsible for loading data pipelines from web servers and Teradata using Sqoop with the Kafka and Spark Streaming APIs.
- Worked on Spark SQL and DataFrames for faster execution of Hive queries using the Spark SQL context (see the sketch at the end of this role).
- Worked with AWS components such as Amazon EC2 instances, S3 buckets, CloudFormation templates, and the Boto library
- Responsible for managing data from multiple sources.
- Wrote Pig scripts to run ETL jobs on the data in HDFS for future testing.
- Used Hive to analyze the data and checked for correlation.
- Imported data using Sqoop to load data from MySQL to HDFS and Hive on a regular basis.
- Automated regular data imports into Hive partitions using Sqoop orchestrated by Apache Oozie.
- Used the A/B testing, multivariate testing, and conversion optimization techniques across digital platforms.
- Used Azure Data Factory, SQL API, and Mongo API and integrated data from MongoDB, MS SQL, and cloud (Blob, Azure SQL DB).
- Created Python programs to automate the creation of Excel sheets from data read out of Redshift databases
- Assisting with Palantir analytics use cases for data loads
- Providing Data integration, Data ingestion for the use cases required for the Palantir analytics
- Wrote Mesa code to help with the data transformation and data integration required for Palantir analytics use cases
- Performed Data Analysis on the Analytic data present in Teradata, Hadoop/Hive/Oozie/Sqoop, and AWS using SQL, Teradata SQL Assistant, Python, Apache Spark, SQL Workbench
- Used Flume to extract log files data and send the data into HDFS
- Created history tables and views on top of the data mart/production databases for ad-hoc requests, using Teradata, Hadoop/Hive/Oozie/Sqoop, BTEQ, and UNIX.
- Developed SQL scripts for data loading and table creation in Teradata and Hadoop/HIVE/Oozie/Sqoop
- Supported MapReduce programs running on the cluster.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Used Azure Synapse to manage processing workloads and served data for BI and prediction needs.
- Designed and automated Custom-built input adapters using Spark, Sqoop, and Oozie to ingest and analyze data from RDBMS to Azure Data Lake.
- Used Agile methodology in developing the application, which included iterative application development, weekly status report, and stand-up meetings.
Environment: Hadoop, MapReduce, HDFS, Pig, Hive, HBase, Flume, Zookeeper, Agile, Cloudera Manager, Oozie, MySQL, SQL, Azure, AWS, Azure DevOps, Linux
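A minimal PySpark sketch of the Hive-on-Spark pattern referenced in this role: querying a Hive table through Spark SQL, validating with DataFrame operations, and writing Parquet. The database, table, column names, and output path are illustrative placeholders:

```python
# Hedged sketch: Spark SQL over Hive data with a simple DataFrame validation step.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive_feedback_curation")
    .enableHiveSupport()  # lets Spark SQL resolve existing Hive tables
    .getOrCreate()
)

feedback = spark.sql("SELECT * FROM analytics_db.customer_feedback")

# Example validation: drop rows missing a customer id and standardize a timestamp column.
curated = (
    feedback
    .filter(F.col("customer_id").isNotNull())
    .withColumn("event_ts", F.to_timestamp("event_ts", "yyyy-MM-dd HH:mm:ss"))
)

curated.write.mode("overwrite").parquet("hdfs:///curated/customer_feedback/")
```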
Data Engineer
Confidential, Evansville, IN
Responsibilities:
- Responsible for gathering requirements, system analysis, design, development, testing, and deploying ETL Pipelines
- Worked on the Palantir Foundry tool to create contract entity models and design patterns for the automation of development and production environments.
- Participated in the complete SDLC process
- Developed transformation logic and designed various Complex Mappings and Mapplets using the Informatica Designer
- Maintained a framework written in Python to process data
- Wrote quality checks for the processed data
- Wrote SQL queries to make sure the retrieved data adheres to the schema and has no discrepancies
- Troubleshot failures and developed production hotfixes
- Performed mapping optimizations to ensure maximum efficiency
- Performed historical data population and correction
- Built numerous Lambda functions in Python and automated processes using the events that trigger them
- Built a framework for data processing on AWS Glue to increase speed and efficiency and decrease costs (see the sketch at the end of this role)
- Resolved Spark and YARN resource management issues, including shuffle issues, out-of-memory and heap space errors, and schema compatibility problems.
- Imported and exported data using Sqoop between HDFS and relational databases (Oracle and Netezza).
- Converted Hive/SQL queries into Spark transformations using Spark DataFrames and RDDs.
- Extensively involved in Installation and configuration of Cloudera Hadoop Distribution.
- Built an ETL pipeline to scale the data processing flow with rapid data growth by exploring Spark and improving the performance of the existing Hadoop algorithm using SparkContext, Spark SQL, DataFrames, Pair RDDs, and Spark on YARN.
- Created an automated loan leads and opportunities match-back model used to analyze loan performance and convert more business leads
- Ingested forecasted budgets history into data warehouse
- Worked on PySpark APIs for data transformations.
- Monitored database performance with the help of AWS CloudWatch and communicated with users to consume resources optimally
- Wrote Shell and Python scripts to automate loading data and kicking off some parts of the data pipeline
- Wrote SQL queries to perform checks on the retrieved data
- Created Informatica Mappings and Mapplets using different transformations
- Extensively used different Informatica transformations like Source Qualifier, Filter, Aggregator, Expression, Connected and Unconnected Lookup, Sequence Generator, Router, Update Strategy, Normalizer and Transaction Control
- Performed development using Teradata utilities such as BTEQ, FastLoad, MultiLoad, and TPT to populate data into the DW
- Implemented various Teradata-specific features such as selection of PI, USI/NUSI, PPI, and compression based on requirements
- Built tables as slowly changing dimensions (SCD Types 1 and 2) with both Informatica and Teradata
- Developed a CI/CD pipeline
- Wrote Python modules to extract/load asset data from the MySQL source database
- Estimated effort and timelines and highlighted associated risks
- Performed end-to-end unit testing and documented the results in unit test plans
- Cleansed and trapped incorrect data; fine-tuned Informatica code (mappings and sessions) and SQL to obtain optimal performance and throughput
- Followed Agile (SCRUM) methodology with high-quality deliverables delivered on time
Environment: Informatica Power Center 9.6, Informatica Power Center 11, SQL, Oracle, MySQL, Teradata 13, Teradata SQL Assistant, Control M, Shell Scripting, Python 3, AWS Redshift, AWS Glue.
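A minimal skeleton of an AWS Glue PySpark job in the spirit of the Glue processing framework mentioned in this role; the catalog database, table, and S3 path are hypothetical placeholders, not the actual pipeline:

```python
# Hedged sketch: standard Glue job boilerplate plus a small transform-and-write step.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Glue Data Catalog and convert to a Spark DataFrame for transforms.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db",        # placeholder catalog database
    table_name="loan_leads",  # placeholder table
)
df = dyf.toDF().dropDuplicates(["lead_id"]).filter("lead_id IS NOT NULL")

# Write the curated output back to S3 as Parquet.
df.write.mode("overwrite").parquet("s3://example-curated-bucket/loan_leads/")

job.commit()
```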
Data Engineer
Confidential
Responsibilities:
- Performed ETL using Python and Redshift, reading data from the Amazon S3 service
- Used Apache Airflow to orchestrate the ETL workflow (see the sketch at the end of this role)
- Modeled data in PostgreSQL, designing fact and dimension tables using snowflake schema methodologies
- Set up S3 buckets and access control policies using IAM
- Set up and configured IAM roles and attached policies, and provisioned Virtual Private Cloud (VPC) components (subnets, Internet Gateway (IGW), Security Groups) and EC2 instances for an AWS Redshift cluster using the Python AWS SDK
- Created scripts in Python to read CSV, JSON, and Parquet files from S3 buckets and load them into Redshift
- Processed files from Data Lake to populate Facts and Dimension tables using Apache Spark and writing them back to S3 in Parquet format
- Wrote data quality checks for the processed data
- Wrote SQL queries to make sure the retrieved data adheres to the schema and has no discrepancies
- Used Git and GitHub for version control
- Created documentation in Markdown and in Jupyter notebooks
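A minimal Airflow 2.x-style sketch of the orchestration described in this role: a daily DAG that stages the day's files from S3 and then loads Redshift. The DAG id, callables, and connection handling are illustrative placeholders rather than the original pipeline:

```python
# Hedged sketch: two-step daily DAG (stage from S3, then load Redshift).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def stage_files_from_s3(**context):
    # Placeholder: list/download the day's CSV, JSON, and Parquet objects from S3.
    print("staging objects for", context["ds"])


def load_redshift(**context):
    # Placeholder: issue COPY statements against the Redshift cluster.
    print("loading Redshift for", context["ds"])


with DAG(
    dag_id="s3_to_redshift_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    stage = PythonOperator(task_id="stage_files", python_callable=stage_files_from_s3)
    load = PythonOperator(task_id="load_redshift", python_callable=load_redshift)

    stage >> load
```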
ETL/Data Warehouse Developer
Confidential
Responsibilities:
- Gathered requirements from Business and documented for project development.
- Coordinated design reviews, ETL code reviews with teammates.
- Developed mappings using Informatica to load data from sources such as Relational tables, Sequential files into the target system.
- Extensively worked with Informatica transformations.
- Created datamaps in Informatica to extract data from Sequential files.
- Extensively worked on UNIX Shell Scripting for file transfer and error logging.
- Scheduled processes in ESP Job Scheduler.
- Performed Unit, Integration and System testing of various jobs.
Environment: Informatica Power Center 8.6, Oracle 10g, SQL Server 2005, UNIX Shell Scripting, ESP job scheduler.