Sr. Data Engineer Resume
Bielenberg Dr, St. Paul
SUMMARY
- A professional with 7+ years of experience in the IT industry in Predictive Modeling, Data Analytics, Data Modeling, Data Conversion, Data Migration, Data Integration, Data Architecture, Data Analysis, ETL tools, Data Mining, Big Data, Data Warehousing, the Hadoop Eco System, Apache technologies, Python, Spark, Scala, MS Excel, SQL, Erwin, AWS, Azure, and Google Cloud Platform, with work across industry verticals such as Healthcare, Banking, and Financial Services.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing Data Mining, Data Acquisition, Data Preparation, Data Manipulation, Validation, Visualization, and reporting solutions that scale across massive volumes of data.
- Excellent working knowledge of Big Data Hadoop (Hortonworks), HDFS architecture, R, Python, Java, Jupyter, Pandas, NumPy, scikit-learn, Matplotlib, Keras, Hive, NoSQL (HBase), Sqoop, Pig, MapReduce, and Spark MLlib.
- Experience using Python for loading and extraction as well as Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computations.
- Excellent Programming skills at a higher level of abstraction using Scala and Python.
- Good understanding of NoSQL databases like HBase and Cassandra.
- Experience with Data Lakes and Business Intelligence tools in Azure.
- Experience in working with different databases such as Oracle, SQL Server, MySQL and writing Stored Procedures, Functions, Joins and Triggers for different Data Models.
- Extensively worked on Spark with Scala on clusters for computational analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle/Snowflake.
- Familiar with data architecture including data ingestion pipeline design, Hadoop information architecture, data modeling and data mining, machine learning and advanced data processing.
- Used Python to automate Hive queries and to read configuration files.
- Experience optimizing ETL workflows.
- Experienced with AWS and Azure services to smoothly manage applications in the cloud and to create or modify instances.
- Hands on experience in Stream processing frameworks such as Storm, Spark Streaming.
- Experience with Amazon Web Services (AWS) products such as S3, EC2, EMR, and RDS.
- Efficient in developing Logical and Physical Data Models and organizing data per business requirements using Avro, Sybase PowerDesigner, Erwin, and ER/Studio in both OLTP and OLAP applications.
- Worked on UNIX shell scripting with good knowledge of various Unix command line commands.
- Worked with container-based technologies like Docker and Kubernetes.
- Designed and developed visualization solutions in Tableau and Matplotlib.
- Expertise in implementing merging and branching strategies, defect fixes, and configuration of version control tools like Git (Bitbucket and GitHub) for smooth release management into production environments.
- Adept at Web Development and experience in developing front end applications using JavaScript, CSS, XML and HTML.
- Experienced in working in Agile and Waterfall Methodologies.
- Strong analytical, presentation, communication, and problem-solving skills, with the ability to work independently as well as in a team and to follow the best practices and principles defined for the team.
TECHNICAL SKILLS:
Hadoop Eco-System: Cloudera Distribution, HDFS, YARN, Resource Manager, MapReduce, Pig, Sqoop, HBase, Hive, Flume, Cassandra, Spark, Storm, Scala, Impala.
Programming Languages: R, Python, SQL, Scala, UNIX, JAVA.
Database Tools: Snowflake (cloud), Oracle, SQL Server, MySQL, NoSQL, MongoDB.
ETL Tools: Hadoop, Informatica PowerCenter, Dataddo.
Cloud Platform: AWS, Azure, Google Cloud Platform.
Operating System: Windows, Dos, Unix/Linux.
Web-technologies: HTML, XML, CSS, JavaScript.
Data Visualization: Tableau, Matplotlib.
Data Modeling: ER/Studio, Erwin, DbSchema.
Data warehousing: Big Data, Pentaho, ETL Development, Amazon Redshift, Informatica Data Quality (IDQ).
Machine Learning Algorithm: Logistic Regression, K-means, Decision Tree.
Software Life Cycle: Waterfall, Agile.
Version Tools: GIT, CVS
Development Tools: Visual Studio, Eclipse, NetBeans.
Security Management: Hortonworks Ambari, Cloudera Manager, Apache Knox, XA Secure, Kerberos.
Web/App Server: UNIX server, Apache Tomcat.
PROFESSIONAL EXPERIENCE
Confidential, Bielenberg Dr, St Paul
Sr. Data Engineer
Responsibilities:
- Prepared the extract-and-load design document, which covers the database structure, change data capture, error handling, and restart and refresh strategies.
- Involved in developing Python scripts and other ETL tools for extraction, transformation, and loading of data into the data warehouse.
- Worked with different data feeds such as JSON, CSV, XML, and DAT, and implemented the Data Lake concept.
- Most of the infrastructure is on AWS: AWS EMR for the Hadoop distribution and AWS S3 for raw file storage.
- Used AWS Lambda to perform data validation, filtering, sorting, or other transformations for every data change in a DynamoDB table and load the transformed data to another data store.
- Programmed ETL functions between Oracle and Amazon Redshift.
- Used Python to automate Hive queries and to read configuration files.
- Used Hive to analyze the partitioned data and compute various metrics for reporting.
- Used a Kafka producer to ingest the raw data into Kafka topics and ran the Spark Streaming app to process clickstream events (a minimal sketch of this pattern follows this list).
- Performed data analysis and predictive data modeling and explored clickstream event data with Spark SQL.
- Used Spark SQL, as part of the Apache Spark big data framework, to process structured Shipment, POS, Consumer, Household, Individual Digital Impressions, and Household TV Impressions data.
- Created Data Frames from different data sources such as existing RDDs, structured data files, JSON datasets, Hive tables, and external databases.
- Used Airflow, a scheduling tool, to define workflows and when they should run.
- Followed the Agile methodology to improve the efficiency of data delivery.
- Used Jenkins for doing automated builds, tests, and deployments.
- Loaded terabytes of raw data at different levels into Spark RDDs for computation to generate the output response.
- Used Git to keep track of changes and to roll them back when necessary.
- Designed and developed Tableau visualization solutions.
- Responsibility includes platform specification and redesign of load processes as well as projections of future platform growth.
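The Kafka-to-Spark clickstream flow above can be illustrated with a minimal PySpark sketch. It uses Structured Streaming as a stand-in for the original Spark Streaming app; the broker address, topic name, event schema, and console sink are hypothetical placeholders, and the spark-sql-kafka connector is assumed to be available.

```python
# Minimal sketch: consume clickstream events from Kafka and compute a windowed
# metric with PySpark Structured Streaming. All names here are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream-sketch").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
       .option("subscribe", "clickstream")                 # placeholder topic
       .load())

# Parse the Kafka value payload into typed columns
events = (raw
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Example metric: page views per 5-minute window
page_views = (events
              .groupBy(window(col("event_time"), "5 minutes"), col("page"))
              .count())

(page_views.writeStream
 .outputMode("complete")
 .format("console")   # in practice the sink could be S3 or a Hive table
 .start()
 .awaitTermination())
```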
Environment: MapReduce, HDFS, Hive, Python, Scala, Kafka, Spark, Spark SQL, Oracle, SQL, Sqoop, Zookeeper, AWS EMR, AWS S3, Data Pipeline, Airflow, Jenkins, GIT, JIRA, Unix/Linux, Agile.
Confidential
Big Data Engineer
Responsibilities:
- Involved in Big Data requirement analysis and in developing and designing solutions for ETL and Business Intelligence platforms.
- Installed Kafka on the Hadoop cluster and configured producers and consumers in Java to establish a connection from the source to HDFS with popular hashtags.
- Loaded real-time data from various data sources into HDFS using Kafka.
- Worked on reading multiple data formats on HDFS using Python.
- Optimized MapReduce jobs to use HDFS efficiently by applying various compression mechanisms.
- Implemented Spark using Python (PySpark) and SparkSQL for faster testing and processing of data.
- Loaded the data into Spark RDDs and performed in-memory data computation.
- Involved in converting Hive/SQL queries into Spark transformations using APIs such as Spark SQL, Data Frames, and Python (a minimal sketch of this conversion follows this list).
- Analyzed the SQL scripts and designed the solution to be implemented using Python.
- Used dbt, which allows modular SQL queries, to transform the data within the warehouse.
- Explored Spark by improving the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, Data Frames, Pair RDDs, and Spark on YARN.
- Performed transformations, cleaning and filtering on imported data using Spark Data Frame API, Hive, MapReduce, and loaded final data into Hive.
- Involved in converting MapReduce programs into Spark transformations using Spark RDDs in Python.
- Developed Spark scripts using Python and shell commands as per the requirements.
- Worked with NoSQL databases like HBase, creating tables to load large sets of semi-structured data coming from source systems.
- Design and develop the HBase target schema.
- Used the Oozie workflow scheduler to manage Hadoop jobs with control flows.
- Worked on visualizing the reports using Tableau.
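A minimal sketch of the Hive/SQL-to-Spark conversion referenced above, assuming a hypothetical Hive table and columns; it shows the general pattern, not the project's actual queries.

```python
# Minimal sketch: express a Hive/SQL aggregation as PySpark DataFrame
# transformations. The table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Equivalent of: SELECT region, SUM(amount) AS total FROM sales
#                WHERE amount > 0 GROUP BY region
sales = spark.table("sales")                     # hypothetical Hive table
totals = (sales
          .filter(F.col("amount") > 0)           # example cleaning/filter step
          .groupBy("region")
          .agg(F.sum("amount").alias("total")))

totals.write.mode("overwrite").saveAsTable("sales_totals_by_region")
```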
Environment: Apache Spark, PySpark, HDFS, Python, dbt, Java, MapReduce, NoSQL, Kafka, Hive, HBase, YARN, Sqoop, SQL, Oozie, Cloudera Manager, Zookeeper, Spark SQL, Tableau.
Confidential
Data Engineer
Responsibilities:
- Designed and built full end-to-end Data Warehouse infrastructure from the ground up on Redshift for large-scale data, handling thousands of records every day.
- Implemented and managed ETL solutions and automated operational processes.
- Designed and developed ETL integration patterns using Python on Spark.
- Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
- Wrote various data normalization jobs for new data ingested into Redshift (a minimal load sketch follows this list). Worked on optimizing volumes, created multiple VPC instances, and created new accounts, roles, and groups.
- Implemented Spark RDD transformations to map business analysis requirements and applied actions on top of the transformations.
- Worked on creating various types of indexes on different collections to get good performance in MongoDB.
- Built S3 buckets and managed policies for S3 buckets, used S3 bucket and Glacier for storage and backup on AWS.
- Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
- Integrated services like GitHub, AWS Code Pipeline, Jenkins, and AWS Elastic Beanstalk to create a deployment pipeline.
- Created monitors, alarms, and notifications for EC2 hosts using CloudWatch.
- Implemented a new project build framework using Jenkins as the build tool.
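A minimal sketch of one way to load a normalized extract from S3 into Redshift with the COPY command, as referenced in the normalization bullet above; the cluster endpoint, credentials, bucket, table, and IAM role are hypothetical placeholders.

```python
# Minimal sketch: load a normalized CSV extract from S3 into Redshift using
# COPY. Connection details, bucket, table, and IAM role are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="analytics", user="etl_user", password="***")

copy_sql = """
    COPY public.events_normalized
    FROM 's3://example-bucket/normalized/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # Redshift pulls the files directly from S3
```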
Environment: AWS, S3, Redshift, Kinesis Firehose, Kinesis Data Streams, CloudWatch, Python, PySpark, MySQL, Shell Scripts, Lambda, MongoDB, GIT, Apache Spark, Docker.
Confidential
Data Engineer
Responsibilities:
- Involved in installation and configuration of Cloudera Distribution Hadoop platform.
- Extracted, transformed, and loaded (ETL) data collected from the Azure cloud and from multiple federated data sources (JSON, relational databases) using Data Frames in Spark.
- Utilized Spark SQL to extract and process data by parsing it with Datasets or RDDs in HiveContext, using transformations and actions.
- Extended the capabilities of Data Frames using User Defined Functions in Scala.
- Resolved missing fields in Data Frame rows using filtering and imputation.
- Integrated visualizations into a Spark application using Databricks and popular visualization libraries.
- Improved application performance using Azure Search and SQL query optimization.
- Trained analytical models with Spark ML and Python.
- Worked on Hive, developing external tables, managed tables, and the pipeline for smooth ETL processing.
- Performed pre-processing on a dataset prior to training, including standardization and normalization.
- Created processing pipelines including transformations, estimators, and evaluation of analytical models (a minimal sketch follows this list).
- Working knowledge of using GIT for project dependency / build / deployment.
- Evaluated model accuracy by dividing data into training and test datasets and computing metrics using evaluators.
- Computed using Spark MLlib functionality that wasn't present in Spark ML by converting Data Frames to RDDs and applying RDD transformations and actions.
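A minimal sketch of a Spark ML pipeline with a train/test split and an evaluator, in the spirit of the bullets above; the table name, feature columns, and the choice of Logistic Regression are hypothetical.

```python
# Minimal sketch: Spark ML pipeline (assembler -> scaler -> model), trained on
# a train split and scored on a test split. Names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("ml-pipeline-sketch").getOrCreate()
df = spark.table("training_data")   # hypothetical table with a binary 'label'

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, scaler, lr])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

predictions = model.transform(test)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
print(f"Test AUC: {auc:.3f}")
```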
Environment: Spark, Spark MLlib, Python, JSON, Spark ML, Hive, Sqoop, HBase, MySQL, Scala, Shell Scripting, Azure, Tableau, GIT.
Confidential
Data Analyst
Responsibilities:
- Worked with the BI team in gathering report requirements and used Sqoop to export data into HDFS and Hive.
- Used Google Cloud Platform (GCP) for efficient data capture.
- Involved in the phases of analytics listed below, including data collection, using R, Python, and Jupyter Notebook.
- Analyzed existing internal and external data for classification errors and defined criteria for missing values (a minimal sketch follows this list).
- Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Worked closely with the claims processing team to obtain patterns in filing of fraudulent claims.
- Developed MapReduce programs to extract and transform the data sets; results were exported back to the RDBMS using Sqoop.
- Exported the required data to the RDBMS using Sqoop to make it available to the claims processing team to assist in processing claims.
- Developed MapReduce programs to parse the raw data and store the refined data in partitioned tables in the Enterprise Data Warehouse (EDW).
- Adept in statistical programming languages like R and Python, as well as Big Data technologies like Hadoop and Hive.
- Delivered high availability and performance.
- Created tables in Hive and loaded the structured data.
- Was responsible for importing the data (mostly log files) from various sources into HDFS using Flume.
- Managed and reviewed Hadoop log files.
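A minimal pandas sketch of defining missing-value criteria and cleaning a claims extract, as referenced above; the file name, sentinel codes, and column names are hypothetical placeholders.

```python
# Minimal sketch: flag missing values in a claims extract and apply simple
# cleaning rules with pandas. File, sentinel codes, and columns are placeholders.
import pandas as pd

claims = pd.read_csv("claims_extract.csv")   # hypothetical extract

# Criteria for missing values: blanks and sentinel codes count as missing
claims = claims.replace({"": pd.NA, "UNKNOWN": pd.NA, -999: pd.NA})

# Report the share of missing values per column
missing_report = claims.isna().mean().sort_values(ascending=False)
print(missing_report.head())

# Drop rows missing the claim identifier; impute missing amounts with the median
claims = claims.dropna(subset=["claim_id"])
claims["claim_amount"] = claims["claim_amount"].fillna(claims["claim_amount"].median())
```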
Environment: HDFS, Pig, Hive, MapReduce, GCP, Java, Linux, Jupyter Notebook, HBase, Sqoop, RDBMS, R, Eclipse, Data Mining, Cloudera, Python.