
Data Engineer Resume


Minnetonka, MN

SUMMARY

  • Professional with 8+ years of experience as a Data Engineer and in analytical programming with Python. Experienced as an AWS Cloud Engineer (Administrator) working with AWS services including IAM, EC2, VPC, AMI, SNS, RDS, SQS, EMR, Lambda, Glue, Athena, DynamoDB, CloudWatch, Auto Scaling, S3, and Route 53. Worked in various Linux server environments from DEV all the way to PROD, along with cloud-powered strategies embracing Amazon Web Services (AWS).
  • Good knowledge of web services using gRPC and GraphQL protocols.
  • Used gRPC and GraphQL as a data gateway.
  • Strong experience in CI (Continuous Integration) / CD (Continuous Delivery) software development pipeline stages such as Commit, Build, Automated Tests, and Deploy, using Bogie pipelines in Jenkins.
  • Experience using analytic data warehouses such as Snowflake.
  • Experience using Databricks to handle analytical processes from ETL through data modeling, leveraging familiar tools, languages, and skills via interactive notebooks or APIs.
  • Experience with Apache Airflow to author workflows as directed acyclic graphs (DAGs), visualize batch and real-time data pipelines running in production, monitor progress, and troubleshoot issues when needed (a brief illustrative DAG sketch follows this summary).
  • Experience with Quantum frameworks to easily ingest, process, and act on batch and streaming data using Apache Spark.
  • Worked with Docker containers, combining them with the workflow to keep it lightweight.
  • Experience tuning EMR to requirements for importing and exporting data using stream-processing platforms such as Kafka.
  • Experience developing and maintaining applications written for Amazon Simple Storage Service (S3), AWS Elastic MapReduce, and AWS CloudWatch.
  • Improved performance and optimized existing Hadoop algorithms with Spark using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Uploaded and processed terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop.
  • Experienced in moving data from different sources using Kafka producers and consumers and in preprocessing data.
  • Proficient in data warehousing and data mining concepts and in ETL transformations from source to target systems.
  • Experience building servers on AWS, including importing the necessary volumes and launching EC2 instances.
  • Created security groups, auto scaling, load balancers, Route 53, and SNS as per the architecture.
  • Experience setting up lifecycle policies to back up data from AWS S3 to AWS Glacier; worked with various AWS, EC2, and S3 CLI tools.
  • Expertise in DevOps, release engineering, configuration management, cloud infrastructure, and automation, including Amazon Web Services (AWS), Apache Maven, Jenkins, GitHub, and Linux.
  • Experienced in creating user/group accounts and federated users and managing access to user/group accounts using the AWS IAM service.
  • Set up databases in AWS using RDS, storage using S3 buckets, and configured instance backups to an S3 bucket.
  • Expertise in querying RDBMSs such as PostgreSQL, MySQL, and SQL Server, using SQL for data integrity.
  • Experienced in working on big data integration and analytics based on Hadoop and Kafka.
  • Excellent understanding and knowledge of Hadoop Distributed File System data modeling, architecture, and design principles.
  • Experience in Hadoop cluster performance tuning by gathering and analyzing the existing infrastructure.
  • Constructed a Kafka broker with the proper configuration for the organization's big data needs.
  • Good experience in shell scripting, SQL Server, UNIX, and Linux, with knowledge of the version control software GitHub.
  • Strong team player with the ability to work independently and in a team, adapt to a rapidly changing environment, and a commitment to learning; possess excellent communication, project management, documentation, and interpersonal skills.
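
A minimal sketch of how a batch pipeline of the kind described above could be authored as an Airflow DAG; the DAG id, schedule, and task callables are hypothetical placeholders, not taken from any specific project.

    # Hypothetical Airflow DAG for a daily batch pipeline (Airflow 2.x import paths assumed).
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        # pull raw data from the source system (placeholder)
        pass

    def transform():
        # apply business transformations (placeholder)
        pass

    def load():
        # write curated data to the target warehouse (placeholder)
        pass

    with DAG(
        dag_id="daily_batch_pipeline",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load_task = PythonOperator(task_id="load", python_callable=load)

        # dependencies: extract -> transform -> load
        extract_task >> transform_task >> load_task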

TECHNICAL SKILLS

Big Data Eco System: Hadoop, Kafka, Sqoop, Zookeeper, Spark, Avro, Parquet and Snappy

Languages: Python, Scala, SQL, Linux shell scripting

Databases: Oracle, DB2, SQL Server, MySQL, PL/SQL, NoSQL, RDS, HBase, PostgreSQL

AWS: EC2, S3, EMR, DynamoDB, Athena, AWS Data Pipeline, AWS Lambda, CloudWatch, SNS, SQS

Microservices Tools: gRPC, GraphQL

Virtualization Tools: Docker

Operating Systems: UNIX, Linux, Windows

J2EE Technologies: Servlets, JDBC

CI/CD, Build Tools: Jenkins, Artifactory, Maven

IDE/Programming Tools: NetBeans, PyCharm

Libraries and Tools: PySpark, Psycopg2, PySpells, PyArrow, Pandas, MySQL, Boto3, Jira, Scrum

PROFESSIONAL EXPERIENCE

Confidential, Minnetonka, MN

Data Engineer

Responsibilities:

  • Crafted highly scalable and resilient cloud architectures that address customer business problems and accelerate the adoption of AWS services for clients.
  • Built application and database servers using AWS EC2, created AMIs, and used RDS for PostgreSQL.
  • Carried out deployments and builds on various environments using the continuous integration tool Jenkins; designed the project workflows/pipelines using Jenkins as the CI tool.
  • Used Terraform to express infrastructure as code when building EC2, Lambda, and RDS resources.
  • Built analytical warehouses in Snowflake and queried data in staged files by referencing metadata columns in a staged file.
  • Performed continuous data loads using Snowpipe with appropriate file sizing and loaded structured and semi-structured data into Snowflake using web interfaces.
  • Involved in designing APIs for networking and cloud services; leveraged Spark (PySpark) to manipulate unstructured data and apply text mining on users' table utilization data.
  • Designed Data Quality Framework to perform schema validation and data profiling on Spark (PySpark).
  • Implemented advanced procedures like text analytics and processing using the in-memory computing capabilities like Apache Spark written in Scala.
  • Used Pandas API to put the data as time series and tabular format for easy timestamp data manipulation and retrieval.
  • Integration of web portal and users associated with S3 bucket. Used Amazon S3 to backup database instances periodically to save snapshots of data.
  • Configured Kafka brokers for the project's Kafka cluster and streamed the data to Spark Structured Streaming, using case classes to obtain structured data by schema.
  • Implemented Spark inEMRfor processing Enterprise Data across our Data Lake in AWS System.
  • Fine-tuned EC2 for long-running Spark applications to achieve better parallelism and executor memory for more caching.
  • Experience working with Docker Hub, creating Docker images, and handling multiple images, primarily for middleware installations and domain configuration.
  • Developed Git hooks for local repository code-commit and remote repository code-push functionality and worked on GitHub.
  • Developed Airflow Workflow to schedule batch and real-time data from source to target.
  • Backed up AWS PostgreSQL to S3 via a daily job run on EMR using DataFrames (an illustrative sketch follows this list).
  • Worked on ETL processing consisting of data sourcing, mapping, transformation, conversion, and loading.
  • Responsible for Technical architecture and creation of technical specs & designing of ETL processes like mapping, source, target and staging databases.
  • Knowledge of cloud-based DAGs and Apache Airflow.
  • Explored DAGs, their dependencies, and logs using Airflow pipelines for automation
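
A minimal sketch, under assumed connection settings and names, of the kind of daily EMR job referenced above that backs up a PostgreSQL table to S3 with Spark DataFrames; the host, credentials, table, and bucket are hypothetical.

    # Hypothetical sketch: daily EMR job snapshotting a PostgreSQL table to S3 as Parquet.
    # Host, credentials, table, and bucket are placeholders; the PostgreSQL JDBC driver
    # must be available on the Spark classpath.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("postgres-to-s3-backup").getOrCreate()

    # Read the source table over JDBC
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://example-host:5432/appdb")
        .option("dbtable", "public.orders")
        .option("user", "backup_user")
        .option("password", "******")
        .option("driver", "org.postgresql.Driver")
        .load()
    )

    # Write a date-partitioned Parquet snapshot to S3
    df.write.mode("overwrite").parquet("s3://example-backup-bucket/orders/dt=2021-01-01/")

    spark.stop()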

Environment: Python, AWS Lambda, SNS, SQS, EMR, EC2, CloudWatch, RDS, Spark, Linux, Shell Scripting, GitHub, Jira, Oracle BPM.

Confidential, Houston, TX

Data Engineer

Responsibilities:

  • Worked as a Data Engineer to review business requirements and compose source-to-target data mapping documents
  • Involved in Agile development methodology; active member in scrum meetings
  • Involved in data profiling and merging data from multiple data sources
  • Involved in Big data requirement analysis, develop and design solutions for ETL and Business Intelligence platforms
  • Designed 3NF data models for ODS, OLTP systems and dimensional data models using Star and Snowflake Schemas
  • Worked on Snowflake environment to remove redundancy and load real time data from various data sources into HDFS using Kafka
  • Developed data warehouse model in Snowflake for over 100 datasets
  • Designing and implementing a fully operational production grade large scale data solution on Snowflake Data Warehouse
  • Work with structured/semi-structured data ingestion and processing on AWS using S3, Python. Migrate on-premises big data workloads to AWS
  • Designed the data aggregations on Hive for ETL processing on Amazon EMR to process data as per business requirement
  • Involved in migration of data from existing RDBMS to Hadoop using Sqoop for processing data, evaluate performance of various algorithms/models/strategies based on real-world data sets
  • Implemented data validation using MapReduce programs to remove unnecessary records before moving data into Hive tables
  • Created Hive tables for loading and analyzing data and developed Hive queries to process data and generate data cubes for visualizing
  • Extracted data from HDFS using Hive, Presto and performed data analysis using Spark with Scala, PySpark and feature selection and created nonparametric models in Spark
  • Handled importing data from various data sources, performed transformations using Hive, Map Reduce, and loaded data into HDFS
  • Worked on an Enterprise Messaging Bus with the Kafka-Tibco connector; published queues were abstracted using Spark DStreams, and XML and JSON data was parsed in Hive.
  • Designed and configured Kafka cluster to accommodate heavy throughput of 1 million messages per second. Used Kafka producer 0.6.3 API's to produce messages
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the data in Parquet format in HDFS
  • Used Kafka and Kafka brokers, initiated the Spark context, and processed live streaming information with RDDs; used Kafka to load data into HDFS and NoSQL databases.
  • Subscribed to Kafka topics with the Kafka consumer client and processed the events in real time using Spark.
  • Developed Spark Structured Streaming to read data from Kafka in real-time and batch modes, apply different modes of change data capture (CDC), and then load the data into Hive (an illustrative streaming sketch follows this list)
  • Developed and Configured Kafka brokers to pipeline server logs data into spark streaming
  • Used Apache Kafka to aggregate web log data from multiple servers and make them available in Downstream systems for analysis
  • Integrated AWS Kinesis with an on-premises Kafka cluster
  • Implemented data ingestion and handling clusters in real time processing using Kafka
  • Developed Sqoop and Kafka Jobs to load data from RDBMS, External Systems into HDFS and HIVE
  • Experienced in writing live Real-time Processing and core jobs using Spark Streaming with Kafka as a data pipe-line system
  • Integrated Apache Storm with Kafka to perform web analytics. Uploaded click-stream data from Kafka to HDFS, Hbase and Hive by integrating with Storm
  • Created Streamsets pipeline for event logs using Kafka, Streamsets Data Collector and Spark Streaming in cluster mode by customizing with mask plugins, filters and distributed existing Kafka topics across applications using Streamsets Control Hub.
  • Developed spark code and spark-SQL/streaming for faster testing and processing of data
  • Developed Python scripts to automate the ETL process using Apache Airflow and CRON scripts in the Unix operating system as well
  • Captured unstructured data that was otherwise not used and stored it in HDFS and HBase / MongoDB. Scrape data using Beautiful Soup and saved data into MongoDB (JSON format)
  • Worked with Apache Airflow and Genie to automate job on EMR
  • Worked on AWS S3 buckets and intra cluster file transfer between PNDA and s3 securely
  • Used Amazon EC2 command line interface along with Python to automate repetitive work
  • Design & Implementation of Data Mart, DBA coordination, DDL & DML generation & usage
  • Provide data architecture support to enterprise data management efforts, such as development of enterprise data model and master and reference data, as well as support to projects, such as development of physical data models, data warehouses and data marts
  • Created Databricks notebooks using SQL and Python and automated notebooks using jobs.
  • Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
  • Worked extensively on the migration of different data products from Oracle to Azure
  • Spun up HDInsight clusters and used Hadoop ecosystem tools such as Kafka, Spark, and Databricks for real-time streaming analytics, and Sqoop, Pig, Hive, and Cosmos DB for batch jobs
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks
  • Worked with data governance, data quality, data lineage, and data architects to design various models and processes
  • Independently coded new programs and designed tables to load and test programs effectively for given POCs using Big Data/Hadoop
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS
  • Worked on Apache Spark with Python to develop and execute Big Data Analytics and Machine learning applications, executed machine learning use cases under Spark ML and MLlib
  • Built and analyzed datasets using SAS, and Python, designed data models and data flow diagrams using Erwin and MS Visio
  • Used Kibana, an open-source plugin for Elasticsearch, for analytics and data visualization.
  • Used pandas, NumPy, seaborn, SciPy, matplotlib, Scikit-learn, NLTK in Python for developing various machine learning algorithms for predictive modeling utilizing R and Python
  • Implemented a Python-based distributed random forest via Python streaming
  • Utilized machine learning algorithms such as linear regression, multivariate regression, Naive Bayes, Random Forests, K-means, & KNN for data analysis
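
A minimal sketch, with assumed broker addresses, topic, and schema, of reading Kafka events with Spark Structured Streaming and landing them as Parquet, along the lines of the streaming work described above.

    # Hypothetical sketch: Spark Structured Streaming job consuming a Kafka topic.
    # Brokers, topic, schema, and paths are placeholders; requires the spark-sql-kafka package.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

    event_schema = StructType([
        StructField("user_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", LongType()),
    ])

    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
        .option("subscribe", "clickstream")
        .option("startingOffsets", "latest")
        .load()
    )

    # Kafka delivers bytes; cast the value to a string and parse the JSON payload by schema
    events = (
        raw.selectExpr("CAST(value AS STRING) AS json")
        .select(from_json(col("json"), event_schema).alias("e"))
        .select("e.*")
    )

    query = (
        events.writeStream.format("parquet")
        .option("path", "hdfs:///data/clickstream/")
        .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
        .start()
    )
    query.awaitTermination()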

Environment: Python, SQL, CSV/XML Files, Oracle, JSON, Cassandra, MongoDB, AWS, Azure, Databricks, Snowflake, Hadoop, Hive, MapReduce, Scala, Spark, J2EE, Agile, Apache Avro, Apache Maven, Airflow, Kafka, MLlib, regression, Docker, Tableau, Git, Jenkins.

Confidential, Irving, TX

Data Engineer

Responsibilities:

  • As a Data Engineer, provided technical expertise and aptitude to Hadoop technologies as they relate to the development of analytics.
  • Responsible for the planning and execution of big data analytics, predictive analytics, and machine learning initiatives
  • Very good hands-on experience in advanced big data technologies such as the Spark ecosystem (Spark SQL, MLlib, SparkR, and Spark Streaming), Kafka, and predictive analytics (MLlib, R ML packages including the H2O ML library)
  • Designed and developed spark jobs for performing ETL on large volumes of medical membership and claims data
  • Created Airflow Scheduling scripts in Python
  • Involved in importing real-time data to Hadoop using Kafka and implemented an Oozie job to run daily
  • Developed applications of Machine Learning, Statistical Analysis, and Data Visualizations with challenging data Processing problems
  • Compiled data from various sources public and private databases to perform complex analysis and data manipulation for actionable results.
  • Designed and developed Natural Language Processing models for sentiment analysis.
  • Worked on Natural Language Processing with NLTK module of python for application development for automated customer response.
  • Demonstrated experience in design and implementation of Statistical models, Predictive models, enterprise data model, metadata solution and data lifecycle management in both RDBMS, Big Data environments.
  • Used predictive modeling with tools in SAS, SPSS, and Python.
  • Applied concepts of probability, distributions, and statistical inference to the given dataset to unearth interesting findings through comparisons, t-tests, F-tests, R-squared, p-values, etc.
  • Applied linear regression, multiple regression, ordinary least squares, mean-variance analysis, the law of large numbers, logistic regression, dummy variables, residuals, the Poisson distribution, Bayes, Naive Bayes, fitting functions, etc. to data with the help of the Scikit, SciPy, NumPy, and Pandas modules of Python.
  • Applied clustering algorithms i.e. Hierarchical, K-means with help of Scikit and SciPy.
  • Developed visualizations and dashboards using ggplot2, Tableau
  • Worked on development of data warehouse, Data Lake and ETL systems using relational and non-relational tools like SQL, No SQL.
  • Built and analyzed datasets using R, SAS, MATLAB, and Python (in decreasing order of usage).
  • Applied linear regression in Python and SAS to understand the relationship between different attributes of dataset and causal relationship between them
  • Performed complex pattern recognition of financial time series data and forecasting of returns through ARMA and ARIMA models and exponential smoothing for multivariate time series data
  • Used Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Wrote Hive queries for data analysis to meet the business requirements.
  • Expertise in Business Intelligence and data visualization using Tableau.
  • Expert in Agile and Scrum Process.
  • Validated the Macro-Economic data (e.g. Blackrock, Moody's etc.) and predictive analysis of world markets using key indicators in Python and machine learning concepts like regression, Bootstrap Aggregation and Random Forest.
  • Worked on setting up AWS EMR clusters to process monthly workloads
  • Was involved in writing PySpark user-defined functions (UDFs) for various use cases and applied business logic wherever necessary in the ETL process
  • Wrote Spark SQL and Spark scripts (PySpark) in the Databricks environment to validate the monthly account-level customer data stored in S3 (an illustrative UDF sketch follows this list)
  • Worked in large-scale database environments like Hadoop and MapReduce, with working mechanism of Hadoop clusters, nodes, and Hadoop Distributed File System (HDFS).
  • Interfaced with large-scale database system through an ETL server for data extraction and preparation.
  • Identified patterns, data quality issues, and opportunities and leveraged insights by communicating opportunities with business partners.
  • Performed Source System Analysis, database design, data modeling for the warehouse layer using MLDM concepts and package layer using Dimensional modeling.
  • Created ecosystem models (e.g., conceptual, logical, physical, canonical) required for supporting services within the enterprise data architecture (a conceptual data model for defining the major subject areas used, an ecosystem logical model for defining standard business meaning for entities and fields, and an ecosystem canonical model for defining the standard messages and formats to be used in data integration services throughout the ecosystem)
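
A minimal sketch, with assumed column names and an assumed validation rule, of a PySpark UDF of the kind used above to validate monthly account-level data before loading.

    # Hypothetical sketch: PySpark UDF flagging invalid account records during ETL validation.
    # Column names, the business rule, and the S3 paths are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import BooleanType

    spark = SparkSession.builder.appName("monthly-account-validation").getOrCreate()

    @udf(returnType=BooleanType())
    def is_valid_account(account_id, balance):
        # illustrative rule: account id present and balance non-negative
        return account_id is not None and balance is not None and balance >= 0

    accounts = spark.read.parquet("s3://example-bucket/monthly/accounts/")

    validated = accounts.withColumn(
        "is_valid", is_valid_account(col("account_id"), col("balance"))
    )

    # Keep only valid rows for downstream loading
    validated.filter(col("is_valid")).drop("is_valid").write.mode("overwrite").parquet(
        "s3://example-bucket/monthly/accounts_validated/"
    )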

Environment: Python (Scikit-Learn/SciPy/NumPy/Pandas), Spark, Airflow, Machine Learning, AWS, MS Azure, Cassandra, Avro, HDFS, GitHub, Hive, Pig, Linux, SAS, SPSS, MySQL, Bitbucket, Eclipse, XML, PL/SQL, SQL connector, JSON, Tableau, Jenkins.

Confidential

Data Specialist

Responsibilities:

  • Wrote scripts in Python for automation of testing Framework jobs.
  • Used a multi-threading factory model to distribute the learning process and back-testing into various worker processes (an illustrative sketch follows this list).
  • Designed Power BI data visualization utilizing cross tabs, maps, scatter plots, pie, bar and density Charts
  • Utilized Power Query in Power BI to Pivot and Un-Pivot the data model for data cleansing
  • Provided continued maintenance and development of bug fixes for the existing and new Power BI Reports.
  • Experienced in embedding Power BI reports into Salesforce and web pages using Power BI URL parameters
  • Developed Power BI reports and Dashboards from multiple data sources
  • Experienced in managing organization visuals and Embed Codes in the Power BI Admin Portal.
  • Created Drill Down Reports, Drill Through Report by Region.
  • Created workspace and content packs for business users to view the developed reports.
  • Experienced in setting up and managing Power BI Premium capacities using the Power BI Admin Portal capacity settings.
  • Experienced in creating and managing workspaces using the Power BI Admin Portal workspace and tenant settings.
  • Worked as a team lead and developer creating mappings using Informatica PowerCenter and BDM
  • Implemented several DAX functions for various fact calculations for efficient data visualization in Power BI.
  • Utilized the Power BI gateway to keep dashboards and reports up to date with on-premises data sources.
  • Experienced in managing users, admins, and groups in the office 365 admin center.
  • Experienced in creating and managing Power BI data gateways using the Power Platform admin center
  • Experienced in creating paginated reports using Power BI Report Builder and in migrating SSRS, Tableau, and Cognos reports to Power BI
  • Created Complex ETL Packages using SSIS to extract data from staging tables to partitioned tables with incremental load.
  • Designed and normalized the databases, have written T-SQL Queries and created different objects like Tables, Views, Stored Procedures, User defined functions and Indexes.
  • Implemented Copy activity, Custom Azure Data Factory Pipeline Activities for On-cloud ETL processing
  • Transformed and loaded data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse.
  • Used the integration runtime in Azure Data Factory to connect to on-premises SQL Server
  • Created data integration and technical solutions for Azure Data Lake Analytics, Azure Data Lake Storage, Azure Data Factory, Azure SQL databases and Azure SQL Data Warehouse
  • Used Copy Activity in Azure Data Factory to copy data among data stores located on-premises and in the cloud
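
A minimal sketch, with hypothetical back-test logic and a hypothetical parameter grid, of distributing back-testing runs across worker processes as described above.

    # Hypothetical sketch: fanning back-test runs out to a pool of worker processes.
    # The back-test function and parameter grid are placeholders.
    from multiprocessing import Pool

    def run_backtest(params):
        # placeholder for one back-test over a single parameter combination
        window, threshold = params
        score = window * 0.1 + threshold  # stand-in for a real performance metric
        return {"window": window, "threshold": threshold, "score": score}

    if __name__ == "__main__":
        param_grid = [(w, t) for w in (10, 20, 50) for t in (0.01, 0.02)]

        # distribute the parameter combinations across four worker processes
        with Pool(processes=4) as pool:
            results = pool.map(run_backtest, param_grid)

        best = max(results, key=lambda r: r["score"])
        print("best configuration:", best)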

Environment: Python, Django, HTML5/CSS, Postgres, Azure, SNS, SQS, MySQL, JavaScript, Eclipse, Linux, Shell Scripting, jQuery, GitHub, Jira, PySpark, Bootstrap, Jenkins, Power BI

Confidential

Data Engineer

Responsibilities:

  • Participated in design discussions with data architects and application architects.
  • Supported data process and worked closely with data architects to build data flows for predictive analytics for revenue cycle risks
  • Built data flows across Redshift, QuickSight, and MADlib to forecast risk from revenue delays
  • Adjusting the data processes to support the Web App configuration and desired performance
  • Aligning the data engineering related work with the on-going sprints and future sprints based on the priority of the functionalities that are dependent on the data processes
  • Worked on loading CSV/TXT/DAT files using Scala in the Spark framework, processing the data by creating Spark DataFrames and RDDs and saving the files in Parquet format in HDFS to load into fact tables using the ORC reader.
  • Worked on various applications using the Python-integrated IDEs Eclipse, PyCharm, and NetBeans.
  • Designed and developed an entire module called CDC (change data capture) in Python and deployed it in AWS Glue using the PySpark library and Python (a simplified CDC sketch follows this list)
  • Built database models, views, and APIs using Python for interactive web-based solutions.
  • Used Python scripts to update the content in database and manipulate files.
  • Wrote and executed several complex SQL queries in AWS glue for ETL operations in Spark data frame using SparkSQL.
  • Automated most of the daily tasks using Python scripting.
  • Performed job functions using Spark API's in Scala for real time analysis and for fast querying purposes.
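
A simplified sketch of the change-data-capture idea described above, expressed in plain PySpark with Spark SQL; the key column, other columns, and paths are hypothetical, and a real AWS Glue job would additionally use the Glue job context and bookmarks.

    # Hypothetical sketch: simplified change data capture (CDC) with PySpark and Spark SQL.
    # Paths, the key column, and the timestamp column are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("simple-cdc").getOrCreate()

    existing = spark.read.parquet("s3://example-bucket/curated/customers/")
    incoming = spark.read.parquet("s3://example-bucket/raw/customers_delta/")

    existing.createOrReplaceTempView("existing")
    incoming.createOrReplaceTempView("incoming")

    # Keep the latest version of each customer_id across the old snapshot and the new delta
    merged = spark.sql("""
        SELECT customer_id, name, email, updated_at
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
            FROM (
                SELECT * FROM existing
                UNION ALL
                SELECT * FROM incoming
            ) unioned
        ) ranked
        WHERE rn = 1
    """)

    merged.write.mode("overwrite").parquet("s3://example-bucket/curated/customers_merged/")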

Environment: Python, Django, HTML5, XML, JavaScript, Linux, MS SQL Server, NoSQL, Amazon S3, Jenkins, Git, GitHub, JIRA, AWS Services.

Confidential

Healthcare Data Engineer

Responsibilities:

  • Gathered, validated, and analyzed patient’s data daily to understand the trends of the hospital.
  • Maintained the hospital's records on a regular basis; spreadsheets were used to perform the initial cleaning procedures on the data (an illustrative cleaning sketch follows this list).
  • Migrated databases from Oracle to SQL Server to preserve the historic data. Performed querying with SELECT statements and worked on JOINs for merging several tables.
  • Prepared monthly status reports and scorecards visualizing patient outflow and revenue data for presentation to management.
  • Responded to ad hoc data reporting requests in a timely fashion and interacted with the dentists to gather requirements and answered questions related to patient’s data.
  • Converted data into actionable insights by predicting and modeling future outcomes.
  • Used Python for creating graphics, XML processing, data exchange, and business logic implementation.
  • Utilized in-depth technical experience in LAMP and other leading-edge products and technologies, in conjunction with industry and business skills, to deliver solutions to customers.
  • Developed multiple spark batch jobs in Scala using Spark SQL and performed transformations using many APIs and update master data in Cassandra database as per the business requirement.
  • Wrote Spark Scala scripts, creating multiple UDFs, a Spark context, a Cassandra SQL context, and multiple APIs and methods that support DataFrames, RDDs, DataFrame joins, and Cassandra table joins, and finally wrote/saved the DataFrames/RDDs to the Cassandra database.
  • As part of the POC, migrated the data from source systems to another environment using Spark and Spark SQL.
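
A minimal pandas sketch, with hypothetical file, sheet, and column names, of the kind of initial spreadsheet cleaning and daily trend check described above.

    # Hypothetical sketch: initial cleaning of daily patient records exported to a spreadsheet.
    # The file name, sheet name, and column names are placeholders.
    import pandas as pd

    patients = pd.read_excel("daily_patients.xlsx", sheet_name="visits")

    # Basic cleaning: normalize column names, drop exact duplicates, coerce dates
    patients.columns = [c.strip().lower().replace(" ", "_") for c in patients.columns]
    patients = patients.drop_duplicates()
    patients["visit_date"] = pd.to_datetime(patients["visit_date"], errors="coerce")
    patients = patients.dropna(subset=["patient_id", "visit_date"])

    # Simple trend check: visits per day to spot gaps in scheduling
    daily_counts = patients.groupby(patients["visit_date"].dt.date).size()
    print(daily_counts.tail())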

Environment: Python, Linux, HTML5, XML, JavaScript, jQuery, MS SQL Server, NoSQL, Jenkins, MongoDB, Beautiful Soup, Eclipse, Git, GitHub, JIRA.

Confidential

Healthcare Data Engineer

Responsibilities:

  • Performed root cause analysis and provided a solution for a hospital that had been suffering a large revenue loss over the past 2 years.
  • Data for the past 2 years was collected and loaded into SQL Server to perform root cause analysis; an improper scheduling system that resulted in patients missing appointments was concluded to be the cause of the revenue loss.
  • A data warehouse that enables tracking of patient movements and dentists' availability was built and visualized using Tableau.
  • As the final outcome, the hospital has been able to reduce missed appointments by 90% and is gradually recovering the revenue with the help of this data warehouse.
  • Designed the front end of the application using Python, HTML, CSS, AJAX, JSON and JQuery. Worked on backend of the application, mainly using Active Records.
  • Involved in the design, development and testing phases of application using AGILE methodology.
  • Developed and designed an API (Restful Web Service).
  • Used the Python language to develop web-based data retrieval systems.
  • Designed and maintained databases using Python and developed a Python-based API (RESTful web service) using Flask, SQLAlchemy, and PostgreSQL (an illustrative sketch follows this list).
  • Developed web sites using Python, XHTML, CSS, and JavaScript.
  • Developed and designed e-mail marketing campaigns using HTML and CSS.
  • Tested and implemented applications built using Python.
  • Developed and tested many features for dashboard using Python, ROBOT framework, Bootstrap, CSS, and JavaScript.
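
A minimal sketch, with a hypothetical model and route, of a RESTful endpoint built with Flask and SQLAlchemy on PostgreSQL along the lines of the API work described above; the Flask-SQLAlchemy extension is assumed here as the integration layer.

    # Hypothetical sketch: a small Flask RESTful service backed by PostgreSQL via SQLAlchemy.
    # The database URL, model, and route are placeholders.
    from flask import Flask, jsonify
    from flask_sqlalchemy import SQLAlchemy

    app = Flask(__name__)
    app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://user:password@localhost/appdb"
    db = SQLAlchemy(app)

    class Appointment(db.Model):
        id = db.Column(db.Integer, primary_key=True)
        patient_name = db.Column(db.String(120), nullable=False)
        scheduled_for = db.Column(db.DateTime, nullable=False)

    @app.route("/appointments/<int:appointment_id>", methods=["GET"])
    def get_appointment(appointment_id):
        # fetch one appointment or return 404 if it does not exist
        appointment = Appointment.query.get_or_404(appointment_id)
        return jsonify({
            "id": appointment.id,
            "patient_name": appointment.patient_name,
            "scheduled_for": appointment.scheduled_for.isoformat(),
        })

    if __name__ == "__main__":
        app.run(debug=True)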

Environment: Python, Mod Python, Perl, Linux, PHP, MySQL, NoSQL, JavaScript, Ajax, Shell Script, HTML, CSS
