Data Engineer/Data Analyst Resume
New Jersey
SUMMARY
- 7 years of experience as a Data Engineer/Data Analyst using Python, NumPy, Pandas, AWS, Postgres, Kafka, Cassandra, and MongoDB.
- Hands-on experience using Pandas DataFrames, NumPy, Matplotlib, and Seaborn to create correlation, bar, and time-series plots.
- Skilled in Tableau Desktop, creating visualizations such as bar charts, line charts, scatter plots, and pie charts.
- Good experience conducting exploratory data analysis and data mining and working with statistical models.
- Worked with Python and Bash scripting to automate tasks and create data pipelines.
- Good experience designing and creating data ingestion pipelines using Apache Kafka.
- Experienced with data preprocessing techniques such as handling missing data, encoding categorical data (dummy variables), and feature scaling (see the illustrative sketch at the end of this summary).
- Good knowledge of data modeling: creating schemas (snowflake/star) and tables (fact/dimension) in a data warehouse. Experienced with 3NF, normalization, and denormalization of tables depending on the use case.
- Well versed with big data on AWS cloud services, i.e., EC2, S3, Glue, Athena, DynamoDB, and Redshift.
- Experienced in query optimization in MySQL and Teradata using EXPLAIN commands to improve query performance.
- Proficient with SQL/NoSQL databases such as MongoDB, Cassandra, MySQL, and PostgreSQL.
- Exposure to various AWS services such as Lambda, VPC, IAM, Load Balancing, CloudWatch, SNS, SQS, and Auto Scaling.
- Good experience with data pipeline/workflow management tools such as AWS Kinesis and Apache Airflow.
- Experienced in performing ETL operations using Apache Airflow DAGs and Informatica PowerCenter to load data into data warehouses.
- Knowledge of container orchestration platforms such as Kubernetes; experienced in building images with Docker and deploying them to a private registry.
- Experienced with the full software development life cycle, architecting scalable platforms, object-oriented programming, database design, and Kanban methodologies.
- Experience in UNIX shell scripting for processing large volumes of data from varied sources and loading it into databases such as Teradata.
- Wrote subqueries, stored procedures, triggers, cursors, and functions on MongoDB, MySQL, and PostgreSQL databases.
- Thorough understanding of providing specifications using Waterfall and Agile software methodologies to model systems and business processes.
- Experience in design and development of ETL methodology supporting data migration, data transformation, and processing in a corporate-wide ETL solution using Teradata.
- Worked on various applications using Python-integrated IDEs such as Jupyter, Spyder, Eclipse, VS Code, IntelliJ, Atom, and PyCharm.
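A minimal sketch of the preprocessing steps listed above (missing-value handling, dummy variables, feature scaling), assuming pandas and scikit-learn; the file and column names are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical input: numeric 'age' and 'income' columns plus a categorical 'segment' column.
df = pd.read_csv("customers.csv")

# Handle missing data: median for numeric fields, mode for the categorical field.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])

# Categorical data: one-hot encode, dropping the first level to avoid the dummy-variable trap.
df = pd.get_dummies(df, columns=["segment"], drop_first=True)

# Feature scaling: standardize the numeric features.
scaler = StandardScaler()
df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])
```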
TECHNICAL SKILLS
Libraries: Keras, TensorFlow, Gym, scikit-learn, Matplotlib, Seaborn, NumPy, Pandas, Boto3, Beautiful Soup, PySpark, Gurobi
SQL/NoSQL: PostgreSQL, MongoDB, Cassandra, MySQL, MS SQL, Kafka
Language: Python, R, C, C++.
Operating System: Windows, Red Hat Linux
Version Control: Git, GitHub, SVN
Architecture: Relational DBMS, OLAP, OLTP.
Reporting Tools: Power BI, Tableau, SSRS (SQL Server Reporting Services)
ETL Tools: Apache Airflow, Informatica, SSIS (SQL Server Integration Services)
PROFESSIONAL EXPERIENCE
Confidential - New Jersey
Data Engineer/Data Analyst
Responsibilities:
- Utilized AWS EMR with Spark to perform batch processing operations on various big data sources.
- Installed applications on AWS EC2 instances and configured S3 buckets for storage.
- Created data pipelines to extract data from various APIs and ingest it into S3 buckets using Apache Airflow (see the DAG sketch after this list).
- Deployed Lambda functions triggered by S3 bucket events to perform data transformations and load the data into AWS Redshift (see the Lambda sketch after this list).
- Worked with Linux EC2 instances and RDBMS databases.
- Developed Python and shell scripts to run Spark jobs for batch processing data from various sources.
- Working knowledge of Amazon's Elastic Compute Cloud (EC2) for computational tasks and Simple Storage Service (S3) for object storage.
- Experience with AWS EC2, configuring servers for Auto Scaling and Elastic Load Balancing.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
- Extracted data from AWS Redshift using Spark and performed data analysis.
- Developed a POC to perform ETL operations using AWS Glue to load Kinesis stream data into S3 buckets.
- Performed data cleaning, data quality checks, and data governance for incremental loads.
- Normalized data and created correlation plots and scatter plots to find underlying patterns.
- Filtered and cleaned data by reviewing reports, printouts, and performance indicators to locate and correct code problems.
- Tested dashboards to ensure data matched business requirements and to identify discrepancies in the underlying data.
- Created data pipelines using AWS Step Functions, implemented state machines, and ran pipelines on different schedules to process data.
- Developed and implemented databases, data collection systems, data analytics, and other strategies that optimize statistical efficiency and quality.
- Worked with management to prioritize business and information needs.
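A minimal sketch of the kind of Airflow DAG described above (API extract landed in S3), assuming Airflow 2.x, requests, and boto3; the API URL, bucket, and task names are hypothetical:

```python
from datetime import datetime
import json

import boto3
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_api_to_s3(**context):
    # Pull a page of records from a hypothetical REST API.
    resp = requests.get("https://api.example.com/v1/orders", timeout=30)
    resp.raise_for_status()

    # Land the raw payload in S3, partitioned by the logical run date.
    key = f"raw/orders/{context['ds']}.json"
    boto3.client("s3").put_object(
        Bucket="example-data-lake",  # hypothetical bucket
        Key=key,
        Body=json.dumps(resp.json()),
    )


with DAG(
    dag_id="api_to_s3_ingest",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_api_to_s3", python_callable=extract_api_to_s3)
```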
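And a hedged sketch of the S3-triggered Lambda pattern mentioned above, issuing a Redshift COPY through the Redshift Data API (one possible approach; cluster, database, role, and table names are hypothetical):

```python
import boto3

redshift_data = boto3.client("redshift-data")


def handler(event, context):
    """Triggered by an S3 ObjectCreated event; loads each new object into Redshift."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # COPY the newly arrived file into a staging table (names are hypothetical).
        sql = (
            f"COPY staging.orders FROM 's3://{bucket}/{key}' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "
            "FORMAT AS JSON 'auto';"
        )
        redshift_data.execute_statement(
            ClusterIdentifier="example-cluster",
            Database="analytics",
            DbUser="etl_user",
            Sql=sql,
        )
```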
Environment: Python, AWS, Jira, Git, CI/CD, Docker, Kubernetes, Web Services, Spark Streaming API, Kafka, Cassandra, MongoDB, JSON, Bash scripting, Linux, SQL, Apache Airflow.
Confidential - Sioux Falls, SD
Data Engineer
Responsibilities:
- Implemented Spark using Python and utilized DataFrames and the Spark SQL API for processing and querying data.
- Developed Spark applications using the Python driver and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked with various data formats such as Parquet, JSON, and CSV using Spark.
- Experienced in real-time streaming with Kafka as a data pipeline, using the Spark Streaming module (see the streaming sketch after this list).
- Consumed Kafka messages and loaded the data into a Cassandra cluster deployed in containers.
- Developed preprocessing jobs using Spark DataFrames to flatten JSON files (see the flattening sketch after this list).
- Ingested data into a MySQL RDBMS, performed transformations, and exported the transformed data to Cassandra.
- Assisted in keyspace, table, and secondary index creation in the Cassandra database.
- Performed query optimization of tables through load testing using the cassandra-stress tool.
- Created various POCs in Python using the PySpark module and MLlib.
- Worked with various teams to deploy containers on site using Kubernetes to run Cassandra clusters in a Linux environment.
- Good knowledge of Kubernetes architecture, including the scheduler, pods, nodes, the kubectl/API server, and the etcd database.
- Experience in creating tables in Cassandra clusters, building images using Docker, and deploying the images.
- Assisted in developing schemas and tables for the Cassandra cluster to ensure good query performance for the front-end application.
- Experience in managing a MongoDB environment from availability, performance, and scalability perspectives.
- Worked on ad hoc queries, indexing, replication, load balancing, and aggregation in MongoDB.
- Used GitHub for version control.
- Worked in a UNIX environment.
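A hedged sketch of a Kafka-to-Cassandra streaming job of the kind described above, using Spark Structured Streaming; it assumes the spark-sql-kafka and spark-cassandra-connector packages are on the classpath, and the broker, topic, keyspace, and schema are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka_to_cassandra").getOrCreate()

# Hypothetical event schema carried in the Kafka message value.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_ts", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "user-events")                # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)


def write_to_cassandra(batch_df, batch_id):
    # Requires the spark-cassandra-connector; keyspace and table are hypothetical.
    (batch_df.write.format("org.apache.spark.sql.cassandra")
     .options(keyspace="analytics", table="user_events")
     .mode("append")
     .save())


events.writeStream.foreachBatch(write_to_cassandra).start().awaitTermination()
```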
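And a minimal sketch of flattening nested JSON with Spark DataFrames, as mentioned above; the paths and field names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("flatten_json").getOrCreate()

# Hypothetical nested JSON: {"order_id": ..., "customer": {...}, "items": [...]}
orders = spark.read.json("s3a://example-data-lake/raw/orders/")

flat = (
    orders
    # Explode the array of line items into one row per item.
    .withColumn("item", explode(col("items")))
    # Promote nested struct fields to top-level columns.
    .select(
        col("order_id"),
        col("customer.id").alias("customer_id"),
        col("customer.name").alias("customer_name"),
        col("item.sku").alias("sku"),
        col("item.qty").alias("qty"),
    )
)

flat.write.mode("overwrite").parquet("s3a://example-data-lake/curated/orders/")
```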
Environment: Python, Spark, Kafka, JSON, GitHub, Linux, Flask, Varnish, Nginx, REST, CI/CD, Kubernetes, Helm, MongoDB, Cassandra.
Confidential - Norfolk, VA
Data Engineer
Responsibilities:
- Involved in developing Python scripts and using Informatica and other ETL tools for extraction, transformation, and loading of data into Teradata.
- Worked with a team of business analysts on requirements gathering, business analysis, and project coordination for creating reports.
- Performed unit and integration testing of Informatica sessions, batches, and target data.
- Responsible for using Autosys and Workflow Manager tools to schedule Informatica jobs.
- Responsible for creating workflows and sessions using Informatica Workflow Manager and monitoring workflow runs and statistics in Informatica Workflow Monitor.
- Developed complex Informatica mappings using transformations such as Connected and Unconnected Lookup, Router, Filter, Aggregator, Expression, Normalizer, and Update Strategy for large volumes of data.
- Involved in testing, debugging, validation, and performance tuning of the data warehouse; helped develop optimal solutions for data warehouse deliverables.
- Moved data from source systems to different schemas based on dimension and fact tables using Type 1 and Type 2 slowly changing dimensions (see the simplified Type 2 sketch after this list).
- Interacted with key users, assisted them with various data issues, understood their data needs, and helped with data analysis.
- Performed query optimization for various SQL tables using the Teradata EXPLAIN command.
- Developed procedures to populate the customer data warehouse with transaction data and cycle and monthly summary data.
- Experienced in the Linux environment, scheduling jobs and file transfers.
- Very good understanding of database skew, PPI, join methods, aggregation, and hashing.
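A simplified pandas illustration of the Type 2 slowly-changing-dimension logic described above (the actual work used Informatica mappings; the column names and tracked attribute are hypothetical):

```python
import pandas as pd

HIGH_DATE = pd.Timestamp("9999-12-31")


def apply_scd_type2(dim: pd.DataFrame, incoming: pd.DataFrame, load_date: pd.Timestamp) -> pd.DataFrame:
    """Expire changed rows and insert new versions (simplified Type 2 logic)."""
    dim = dim.copy()
    current = dim[dim["end_date"] == HIGH_DATE]
    merged = incoming.merge(current, on="customer_id", how="left", suffixes=("", "_old"))

    # Rows whose tracked attribute changed (or brand-new keys) become new versions.
    changed = merged[merged["address"] != merged["address_old"]]

    # Expire the previously current versions of changed keys.
    dim.loc[
        (dim["end_date"] == HIGH_DATE) & dim["customer_id"].isin(changed["customer_id"]),
        "end_date",
    ] = load_date

    # Append the new versions with an open-ended validity window.
    new_rows = changed[["customer_id", "address"]].assign(start_date=load_date, end_date=HIGH_DATE)
    return pd.concat([dim, new_rows], ignore_index=True)
```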
Environment: Informatica PowerCenter, Teradata, XML, flat files, cron jobs, Linux, Bash shell, Python scripting.
Confidential
Data Analyst
Responsibilities:
- Created data visualizations and developed dashboards and stories in Tableau.
- Utilized various chart types such as scatter plots, bar charts, pie charts, and heat maps in Tableau for data analysis.
- Analyzed KPI results frequently to assist in the development of performance improvement concepts.
- Experience using Excel pivot charts and VBA tools to create customized reports and analyses.
- Assisted with sizing, query optimization, buffer tuning, backup and recovery, installations, upgrades, and security, along with other administration functions, as part of the profiling plan.
- Ensured production data was replicated into the data warehouse from the processing databases without data anomalies.
- Designed databases for referential integrity and was involved in logical design planning.
- Analyzed code to improve query optimization and verify that tables use indexes.
- Created and tested MySQL programs, forms, reports, triggers, and procedures for the data warehouse.
- Involved in troubleshooting and fine-tuning databases for performance and concurrency.
- Automated the code release process, bringing the total time for code releases from 8 hours to 1 hour.
- Developed a fully automated continuous integration system using Git, MySQL, and custom tools developed in Python and Bash.
- Working experience with Tableau Desktop, generating reports by writing SQL queries (see the reporting sketch after this list).
- Played a key role in a department-wide transition from Subversion to Git, which increased efficiency for the development community.
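A minimal sketch of pulling KPI data from MySQL into pandas to feed a report or Tableau data source, as described above; it assumes SQLAlchemy with the PyMySQL driver, and the connection string, table, and columns are hypothetical:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string for the reporting warehouse.
engine = create_engine("mysql+pymysql://report_user:***@db.example.com:3306/warehouse")

# Hypothetical KPI query against a fact table.
query = """
    SELECT region,
           SUM(revenue) AS revenue,
           COUNT(DISTINCT customer_id) AS customers
    FROM fact_orders
    WHERE order_date >= '2023-01-01'
    GROUP BY region
"""

kpis = pd.read_sql(query, engine)
kpis.to_csv("monthly_kpis.csv", index=False)  # extract consumed by a Tableau data source
```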
Environment: Python, MySQL, Tableau, JSON, GitHub, Linux.