Data Engineer Resume
Boyertown, PA
SUMMARY
- 7+ years of experience in the analysis, design, development, and implementation of ETL projects and complex ETL pipelines, including work in Palantir Foundry
- Experienced in the full SDLC, from design and development through testing and documentation, using various methodologies
- Extensive experience in data warehousing and decision support systems, including implementing end-to-end data warehousing solutions
- Proficient in understanding business processes / requirements and translating them into technical requirements
- Excellent communication skills; work successfully in fast-paced, multitasking environments both independently and in collaborative teams; a self-motivated, enthusiastic learner
- Strong in data warehousing concepts and dimensional modeling, including Star Schema and Snowflake Schema methodologies
- Extensive experience in the design, development, and implementation of data warehouses and data marts using Informatica, Redshift, Teradata, Python, Airflow, and Unix
- Experienced in relational databases including Oracle, Teradata, PostgreSQL, and Redshift
- Extensive Experience in Performance Tuning in Informatica & Teradata
- Created Informatica Mappings and Mapplets using different transformations
- Extensively used different Informatica transformations like Source Qualifier, Filter, Aggregator, Expression, Connected and Unconnected Lookup, Sequence Generator, Router, Update Strategy, Normalizer and Transaction Control
- Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files
- Proficient in SQL and Informatica Pushdown Optimization (PDO), with experience in shell scripting on UNIX and Linux
- Over 4 years of experience with the Apache Hadoop ecosystem, including the Apache Spark framework, HBase, and Hive, building pipelines and migrating data to and from the cloud
- Result-oriented professional experienced in creating data mapping documents, writing functional specifications and queries, and normalizing data from 1NF to 3NF/4NF; skilled in requirements gathering, system and data analysis, data architecture, database design and modeling, and the development, implementation, and maintenance of OLTP and OLAP databases
- Extensive experience designing, creating, testing, and maintaining complete data management flows across data ingestion, curation, and provisioning, with in-depth knowledge of Spark APIs (Spark SQL, DataFrame DSL, Streaming), file formats such as Parquet and JSON, and performance tuning of Spark applications
- Gathered and translated business requirements into technical designs and implemented the physical aspects of those designs by creating materialized views, views, and lookups
- Experience designing and testing highly scalable, mission-critical systems, including Spark jobs in both Scala and PySpark as well as Kafka
- Expertise in writing end-to-end data processing jobs to analyze data using MapReduce, Spark, and Hive
- Created Informatica metadata queries for project impact analysis, reducing rework and manual effort and enabling accurate estimates
- Proven ability to manage multiple projects and coordinate efforts of several parties to achieve milestones within scope and time
- Experienced in Python programming to set up and orchestrate ETL pipelines
- Experienced in handling slowly changing dimensions (SCD Types 1 and 2)
- Experienced with normalization and denormalization concepts and datasets
- Experienced in Job Monitoring, setting up alarms on job failures, troubleshooting and restarting jobs based on the error
- Experienced in troubleshooting Data Quality Issues, ETL Job failures and backfills
- Experienced in Data Governance practices & processes
- Experienced in code deployments to Production Systems and automating daily operational tasks
- Experienced in project deployment using CI/CD
- Experienced with event-driven and scheduled AWS Lambda functions to trigger various AWS resources (a minimal sketch appears after this summary)
- Developed Python automation scripts to facilitate quality testing
- Estimate effort and timelines and highlight associated risks
- Experience in cleansing and trapping incorrect data
- Good working experience in Agile (SCRUM) and waterfall methodologies, with high-quality deliverables delivered on time
- Wrote Python modules to extract/load asset data from the MySQL source database
- Good working experience in using version control systems like Git and GitHub
- Excellent problem-solving and sound decision-making capabilities; recognized by associates for data quality, alternative solutions, and confident decision making
- Good working knowledge of the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, IAM, DynamoDB, Redshift, CloudWatch, Auto Scaling, Security Groups, CloudFormation, Kinesis, SQS, and SNS
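A minimal sketch of the event-driven Lambda pattern noted above, assuming an S3 "ObjectCreated" notification triggering a downstream AWS Glue job; the bucket handling and the `daily_curation_job` name are illustrative placeholders rather than project code:

```python
# Hedged sketch: Lambda handler fired by an S3 event that kicks off a Glue job.
# The Glue job name and the "--source_path" argument are hypothetical examples.
import os
import urllib.parse

import boto3

glue = boto3.client("glue")


def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Start the downstream Glue job, passing the newly landed object as an argument.
        response = glue.start_job_run(
            JobName=os.environ.get("GLUE_JOB_NAME", "daily_curation_job"),
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        print(f"Started Glue run {response['JobRunId']} for s3://{bucket}/{key}")
```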
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential, Boyertown, PA
Responsibilities:
- Created workflows and data pipelines using Apache Airflow
- Loaded Kinesis stream data through Kinesis Data Firehose into the S3 data lake
- Performed ETL on the S3 data lake using transient EMR clusters
- Used the Glue Data Catalog as the Hive metastore
- Loaded data from S3 into the Redshift warehouse using the COPY command (see the sketch at the end of this role)
- Created Hive tables with the DynamoDB SerDe for daily loads of data from DynamoDB to S3
- Created S3 lifecycle policies to clean up old S3 objects and to move data from the S3 Standard storage tier to Glacier for archiving
- Wrote Spark jobs to apply transformations and business rules, munge S3 data, and join it with reference data
- Wrote shell scripts for submitting Spark jobs
- Performed Redshift modeling to choose distribution styles and sort key definitions
- Ran Redshift VACUUM FULL and ANALYZE commands to reclaim the space occupied by deleted rows, re-sort the tables, and update the metadata used in query plan generation
- Performed data profiling to understand the context of the data
- Developed Airflow jobs for Data cleaning using Business Rules
- Analyzed and tuned AWS EMR jobs by adjusting the required task nodes and AWS Spot Instance types in the CloudFormation template
- Applied transformations to data in the S3 staging zone
- Wrote extensive Spark/Scala code using DataFrames, Datasets, and RDDs to transform transactional database data and load it into Redshift tables
- Implemented AWS Redshift design best practices, optimizing queries through the distribution style of fact tables
- Reduced the latency of Spark jobs by tweaking Spark configurations and applying other performance optimization techniques such as memory tuning, serializing RDD data structures, broadcasting large variables, and improving data locality
- Implemented the Delta Lake feature to overcome the challenges of backfills and re-ingestion into the data lake
- Used AWS auto scaling to grow and shrink clusters based on the intensity of the computation
- Used a serverless computing platform (AWS Lambda) for running the Spark jobs
- Created views in Redshift for BI teams' reporting use
- Built an ontology-backed time series analysis and monitoring product for Palantir Foundry using Java and TypeScript, enabling clients to visually analyze petabytes of time series data in the context of real-world assets
- Built the first machine learning product for Palantir Foundry using Java, Python, and TypeScript, enabling clients to manage ML models in Foundry and understand model performance
- Designed and prototyped new datastore and schema to replace non-performant legacy datastore for one of Palantir’s largest commercial customers
- Troubleshot Redshift performance using AWS CloudWatch metrics for CPU utilization and storage utilization
- Developed serverless workflows using AWS Step Function service and automated the workflow using AWS CloudWatch.
- Actively participated in meetings and resolved technical issues
Environment: Apache Airflow, Redshift, Hadoop, Spark, AWS EMR, DynamoDB, S3 Data Lake, Hive, EC2, AWS Glue, Scala, Kinesis Data Streams, Kinesis Data Firehose.
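A minimal sketch of the S3-to-Redshift load and maintenance pattern described in this role (COPY followed by VACUUM FULL and ANALYZE), assuming Parquet data staged under an S3 prefix; the cluster endpoint, IAM role ARN, and table names are placeholders:

```python
# Hedged sketch: load staged Parquet from S3 into Redshift, then reclaim space
# and refresh planner statistics. Connection details below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="********",
)
conn.autocommit = True  # VACUUM cannot run inside a transaction block

copy_sql = """
    COPY analytics.fact_orders
    FROM 's3://example-data-lake/curated/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

with conn.cursor() as cur:
    cur.execute(copy_sql)
    cur.execute("VACUUM FULL analytics.fact_orders;")  # reclaim deleted rows, re-sort
    cur.execute("ANALYZE analytics.fact_orders;")      # update metadata for query plans

conn.close()
```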
Data Engineer
Confidential, Irving, TX
Responsibilities:
- Implemented Hadoop framework to capture user navigation across the application to validate the user interface and provide analytic feedback/result to the UI team.
- Developed MapReduce jobs using the Java API and Pig Latin.
- Worked on loading data into the cluster from dynamically generated files using Flume, and exported data from the cluster to relational database management systems using Sqoop.
- Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.
- Managed resources and scheduling across the cluster using Azure Kubernetes Service.
- Worked on Oozie to define and schedule Apache Hadoop jobs as directed acyclic graphs (DAGs) of actions with control flows.
- Involved in creating Hive tables and working on them using HiveQL and performing data analysis using Hive and Pig.
- Designed the distribution strategy for tables in Azure SQL Data Warehouse.
- Experienced in binding services in the cloud; installed Pivotal Cloud Foundry (PCF) on Azure to manage the containers created by PCF.
- Applied transformations to the data loaded into Spark DataFrames and performed in-memory computation to generate the output response.
- Good knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
- Used the Spark DataFrame API to perform analytics on Hive data and used DataFrame operations to perform the required validations on the data.
- Built end-to-end ETL models to sort through vast amounts of customer feedback and derive actionable insights and tangible business solutions.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the Spark engine and Spark SQL for data analysis, providing results to the data scientists for further analysis.
- Prepared scripts to automate the ingestion process using PySpark and Scala as needed from various sources such as APIs, AWS S3, Teradata, and Snowflake.
- Responsible for loading data pipelines from web servers and Teradata using Sqoop with the Kafka and Spark Streaming APIs.
- Worked on Spark SQL and DataFrames for faster execution of Hive queries using the Spark SQL context (see the sketch at the end of this role).
- Worked with AWS components such as Amazon EC2 instances, S3 buckets, CloudFormation templates, and the Boto library
- Responsible for managing data from multiple sources.
- Wrote Pig scripts to run ETL jobs on the data in HDFS for future testing.
- Used Hive to analyze the data and checked for correlation.
- Imported data using Sqoop to load data from MySQL to HDFS and Hive on a regular basis.
- Automated regular data imports into Hive partitions using Sqoop orchestrated by Apache Oozie.
- Used the A/B testing, multivariate testing, and conversion optimization techniques across digital platforms.
- Used Azure Data Factory, SQL API, and Mongo API and integrated data from MongoDB, MS SQL, and cloud (Blob, Azure SQL DB).
- Created Python programs to automate the creation of Excel sheets from data read out of Redshift databases
- Assisting with Palantir analytics use cases for data loads
- Providing Data integration, Data ingestion for the use cases required for the Palantir analytics
- Wrote Mesa code to help with the data transformation and data integration required for Palantir analytics use cases
- Performed Data Analysis on the Analytic data present in Teradata, Hadoop/Hive/Oozie/Sqoop, and AWS using SQL, Teradata SQL Assistant, Python, Apache Spark, SQL Workbench
- Used Flume to extract log files data and send the data into HDFS
- Created history tables and views on top of the data mart/production databases for ad-hoc requests, using Teradata, Hadoop/Hive/Oozie/Sqoop, BTEQ, and UNIX.
- Developed SQL scripts for data loading and table creation in Teradata and Hadoop/HIVE/Oozie/Sqoop
- Supported MapReduce programs running on the cluster.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Used Azure Synapse to manage processing workloads and served data for BI and prediction needs.
- Designed and automated Custom-built input adapters using Spark, Sqoop, and Oozie to ingest and analyze data from RDBMS to Azure Data Lake.
- Used Agile methodology in developing the application, which included iterative application development, weekly status report, and stand-up meetings.
Environment: Hadoop, MapReduce, HDFS, Pig, Hive, HBase, Flume, Zookeeper, Agile, Cloudera Manager, Oozie, MySQL, SQL, Azure, AWS, Azure DevOps, Linux
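A minimal PySpark sketch of the Hive-on-Spark pattern referenced in this role: querying a Hive table through Spark SQL, validating with DataFrame operations, and writing Parquet. The database, table, column names, and output path are illustrative placeholders:

```python
# Hedged sketch: Spark SQL over Hive data with a simple DataFrame validation step.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive_feedback_curation")
    .enableHiveSupport()  # lets Spark SQL resolve existing Hive tables
    .getOrCreate()
)

feedback = spark.sql("SELECT * FROM analytics_db.customer_feedback")

# Example validation: drop rows missing a customer id and standardize a timestamp column.
curated = (
    feedback
    .filter(F.col("customer_id").isNotNull())
    .withColumn("event_ts", F.to_timestamp("event_ts", "yyyy-MM-dd HH:mm:ss"))
)

curated.write.mode("overwrite").parquet("hdfs:///curated/customer_feedback/")
```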
Data Engineer
Confidential, Evansville, IN
Responsibilities:
- Responsible for gathering requirements, system analysis, design, development, testing, and deploying ETL Pipelines
- Worked on the Palantir Foundry tool to create contract entity models and design patterns for the automation of development and production environments.
- Participated in the complete SDLC process
- Developed transformation logic and designed various Complex Mappings and Mapplets using the Informatica Designer
- Maintained a framework written in Python to process data
- Wrote quality checks for the processed data
- Wrote SQL queries to make sure the retrieved data adheres to the schema and has no discrepancies
- Troubleshot failures and developed production hotfixes
- Performed mapping optimizations to ensure maximum efficiency
- Performed historical data population and correction
- Built numerous Lambda functions in Python and automated processes using the events that trigger them
- Built a framework for data processing on AWS Glue to increase speed and efficiency and decrease costs (see the sketch at the end of this role)
- Resolved Spark and YARN resource management issues, including shuffle issues, out-of-memory and heap space errors, and schema compatibility problems.
- Imported and exported data using Sqoop between HDFS and relational databases (Oracle and Netezza).
- Converted Hive/SQL queries into Spark transformations using Spark DataFrames and RDDs.
- Extensively involved in Installation and configuration of Cloudera Hadoop Distribution.
- Built an ETL pipeline to scale the data processing flow with rapid data growth by exploring Spark and improving the performance of the existing Hadoop algorithm using SparkContext, Spark SQL, DataFrames, Pair RDDs, and Spark on YARN.
- Created an automated loan leads and opportunities match-back model used to analyze loan performance and convert more business leads
- Ingested forecasted budgets history into data warehouse
- Worked on PySpark APIs for data transformations.
- Monitored database performance with the help of AWS CloudWatch and communicated with users to consume resources optimally
- Wrote Shell and Python scripts to automate loading data and kicking off some parts of the data pipeline
- Wrote SQL queries to perform checks on the retrieved data
- Created Informatica Mappings and Mapplets using different transformations
- Extensively used different Informatica transformations like Source Qualifier, Filter, Aggregator, Expression, Connected and Unconnected Lookup, Sequence Generator, Router, Update Strategy, Normalizer and Transaction Control
- Performed development using Teradata utilities such as BTEQ, FastLoad, MultiLoad, and TPT to populate data into the DW
- Implemented various Teradata-specific features such as selection of PI, USI/NUSI, PPI, and compression based on requirements
- Built tables as slowly changing dimensions (SCD Types 1 and 2) with both Informatica and Teradata
- Developed a CI/CD pipeline
- Wrote Python modules to extract/load asset data from the MySQL source database
- Estimated effort and timelines and highlighted associated risks
- Performed end-to-end unit testing and documented the results in unit test plans
- Cleansed and trapped incorrect data; fine-tuned Informatica code (mappings and sessions) and SQL to obtain optimal performance and throughput
- Followed Agile (SCRUM) methodology with high-quality deliverables delivered on time
Environment: Informatica Power Center 9.6, Informatica Power Center 11, SQL, Oracle, MySQL, Teradata 13, Teradata SQL Assistant, Control M, Shell Scripting, Python 3, AWS Redshift, AWS Glue.
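A minimal skeleton of an AWS Glue PySpark job in the spirit of the Glue processing framework mentioned in this role; the catalog database, table, and S3 path are hypothetical placeholders, not the actual pipeline:

```python
# Hedged sketch: standard Glue job boilerplate plus a small transform-and-write step.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the Glue Data Catalog and convert to a Spark DataFrame for transforms.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db",        # placeholder catalog database
    table_name="loan_leads",  # placeholder table
)
df = dyf.toDF().dropDuplicates(["lead_id"]).filter("lead_id IS NOT NULL")

# Write the curated output back to S3 as Parquet.
df.write.mode("overwrite").parquet("s3://example-curated-bucket/loan_leads/")

job.commit()
```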
Data Engineer
Confidential
Responsibilities:
- Performed ETL using Python and Redshift, reading data from the Amazon S3 service
- Used Apache Airflow to orchestrate the ETL workflow (see the sketch at the end of this role)
- Modeled data in PostgreSQL, designing fact and dimension tables using snowflake schema methodologies
- Set up S3 buckets and access control policies using IAM
- Set up and configured IAM roles and attached policies, and provisioned Virtual Private Cloud (VPC) components (subnets, Internet Gateway (IGW), Security Groups) and EC2 instances for an AWS Redshift cluster using the Python AWS SDK
- Created scripts in Python to read CSV, JSON, and Parquet files from S3 buckets and load them into Redshift
- Processed files from Data Lake to populate Facts and Dimension tables using Apache Spark and writing them back to S3 in Parquet format
- Wrote data quality checks for the processed data
- Wrote SQL queries to make sure the retrieved data adheres to the schema and has no discrepancies
- Used Git and GitHub for version control
- Created documentation in Markdown and in Jupyter notebooks
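A minimal Airflow 2.x-style sketch of the orchestration described in this role: a daily DAG that stages the day's files from S3 and then loads Redshift. The DAG id, callables, and connection handling are illustrative placeholders rather than the original pipeline:

```python
# Hedged sketch: two-step daily DAG (stage from S3, then load Redshift).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def stage_files_from_s3(**context):
    # Placeholder: list/download the day's CSV, JSON, and Parquet objects from S3.
    print("staging objects for", context["ds"])


def load_redshift(**context):
    # Placeholder: issue COPY statements against the Redshift cluster.
    print("loading Redshift for", context["ds"])


with DAG(
    dag_id="s3_to_redshift_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    stage = PythonOperator(task_id="stage_files", python_callable=stage_files_from_s3)
    load = PythonOperator(task_id="load_redshift", python_callable=load_redshift)

    stage >> load
```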
ETL/Data Warehouse Developer
Confidential
Responsibilities:
- Gathered requirements from Business and documented for project development.
- Coordinated design reviews, ETL code reviews with teammates.
- Developed mappings using Informatica to load data from sources such as Relational tables, Sequential files into the target system.
- Extensively worked with Informatica transformations.
- Created datamaps in Informatica to extract data from Sequential files.
- Extensively worked on UNIX Shell Scripting for file transfer and error logging.
- Scheduled processes in ESP Job Scheduler.
- Performed Unit, Integration and System testing of various jobs.
Environment: Informatica Power Center 8.6, Oracle 10g, SQL Server 2005, UNIX Shell Scripting, ESP job scheduler.