
Data Engineer Resume


Greenwood Village, CO

SUMMARY

  • Accomplished IT professional with 8 years of experience, specializing in Big Data systems, Data Acquisition, Ingestion, Modeling, Storage, Analysis, Integration, and Data Processing.
  • A dedicated engineer with strong problem-solving, troubleshooting, and analytical skills who actively participates in understanding and delivering business requirements.
  • Experience in Big Data analytics and data manipulation using Hadoop ecosystem tools: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Kafka, Flume, Sqoop, Oozie, Avro, ZooKeeper, AWS, and Spark integration with Cassandra.
  • Collaborated closely with business, production support, and engineering teams to dig deep into the data, enable effective decision making, and support analytics initiatives.
  • Strong Hadoop and platform support experience with major Hadoop distributions: Cloudera, Hortonworks, Amazon EMR, and Azure HDInsight.
  • Excellent knowledge of Hadoop architecture and its core concepts: distributed file systems, parallel processing, high availability, fault tolerance, and scalability.
  • Extensive working experience with Big Data frameworks: Hadoop (HDFS, MapReduce, YARN), Spark, Kafka, Hive, Impala, HBase, Sqoop, Pig, Airflow, Oozie, ZooKeeper, Ambari, Flume, and NiFi.
  • Proficient at writing MapReduce jobs and UDFs to gather, analyze, transform, and deliver data according to business requirements.
  • Hands-on experience building real-time data streaming solutions using Apache Spark (Spark SQL, DataFrames, Spark Streaming), Kafka, and Apache Storm.
  • Proficient in building PySpark and Scala applications for interactive analysis, batch processing, and stream processing.
  • Strong working experience with SQL and NoSQL databases, data modeling, and data pipelines. Involved in end-to-end development and automation of ETL pipelines using SQL and Python.
  • Significant experience with AWS cloud services (EMR, EC2, RDS, EBS, S3, Lambda, Glue, Elasticsearch, Kinesis, SQS, DynamoDB, Redshift, and ECS).
  • Experience with Azure cloud components (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, and Cosmos DB).
  • Experienced in writing Spark scripts in Python, Scala, and SQL for development and analysis.
  • Worked with various ingestion services for batch and real-time data processing using Spark Streaming, Confluent Kafka, Storm, Flume, and Sqoop.
  • Broad development experience with Bash scripting, T-SQL, and PL/SQL scripts.
  • Extensive experience using Sqoop to ingest data from RDBMSs: Oracle, MS SQL Server, Teradata, PostgreSQL, and MySQL.
  • Experience with NoSQL databases such as HBase, Cassandra, MongoDB, DynamoDB, and Cosmos DB, and with their integration into Hadoop clusters.
  • Leveraged diverse file formats like Parquet, Avro, ORC and Flat files.
  • Knowledge of developing highly scalable and resilient RESTful APIs, ETL solutions, and third-party product integrations as part of an enterprise platform.
  • Deep experience building ETL pipelines between several source systems and the Enterprise Data Warehouse using Informatica PowerCenter, SSIS, SSAS, and SSRS.
  • Experience in all stages of Data Warehouse development: requirements gathering, design, development, implementation, testing, and documentation.
  • Solid knowledge of dimensional data modeling with Star and Snowflake schemas for fact and dimension tables using Analysis Services.
  • Worked in both Agile and Waterfall environments, using Git and SVN version control systems.
  • Experience designing interactive dashboards and reports and performing analysis and visualization using Tableau, Power BI, Arcadia, and Matplotlib.

TECHNICAL SKILLS

Data Modeling Tools: Erwin Data Modeler, ER Studio v17

Programming Languages: SQL, PL/SQL, UNIX

Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile

Cloud Platform: AWS, Azure, Google Cloud

Databases: Oracle 12c/11g, Teradata R15/R14

OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9

ETL/Data warehouse Tools: Informatica 9.6/9.1 and Tableau

Operating System: Windows, Unix, Sun Solaris, Mac OS

Big Data Tools: Hadoop Ecosystem, MapReduce

PROFESSIONAL EXPERIENCE

Data Engineer

Confidential, Greenwood Village, CO

Responsibilities:

  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.
  • Strong experience leading multiple Azure Big Data and data transformation implementations in the Banking and Financial Services, High Tech, and Utilities industries.
  • Implemented large Lambda architectures using Azure Data platform capabilities such as Azure Data Lake, Azure Data Factory, HDInsight, Azure SQL Server, Azure ML, and Power BI.
  • Prepared the complete data mapping for all the migrated jobs using SSIS.
  • Designed SSIS Packages to transfer data from flat files to SQL Server using Business Intelligence Development Studio.
  • Extensively used SSIS transformations such as Lookup, Derived Column, Data Conversion, Aggregate, Conditional Split, SQL Task, Script Task, and Send Mail Task.
  • Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using SQL activities.
  • Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
  • Undertook data analysis and collaborated with the downstream analytics team to shape the data according to their requirements.
  • Experienced in performance tuning of Spark applications: setting the right batch interval, choosing the correct level of parallelism, and tuning memory.
  • Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services (VSTS).
  • Designed and Developed Real time Stream processing Application using Spark, Kafka, Scala, and Hive to perform Streaming ETL and apply Machine Learning.
  • Created partitioned and bucketed Hive tables in Parquet format with Snappy compression, then loaded data into the Parquet tables from Avro-backed Hive tables (see the sketch after this list).
  • Ran Hive scripts through Hive, Impala, and Hive on Spark, and some through Spark SQL.
  • Used Azure Kubernetes Service (AKS) to deploy a managed Kubernetes cluster in Azure, built AKS clusters from the Azure portal and with the Azure CLI, and used template-driven deployment options such as Azure Resource Manager templates and Terraform.
  • Used Kubernetes to deploy, scale, load balance, and manage Docker containers with multiple namespaced versions.
  • Designed strategies for optimizing all aspects of the continuous integration, release, and deployment processes using container and virtualization techniques such as Docker and Kubernetes. Built Docker containers for microservices projects and deployed them to Dev.
  • Collected JSON data from an HTTP source and developed Spark APIs to perform inserts and updates in Hive tables.
  • Used Azure Data Factory with the SQL API and MongoDB API to integrate data from MongoDB, MS SQL, and cloud sources (Blob Storage, Azure SQL DB, Cosmos DB).
  • Responsible for resolving issues and troubleshooting the performance of the Hadoop cluster.
  • Involved in designing and developing tables in HBase and storing aggregated data from Hive Table.
  • Used Jira for bug tracking and Bitbucket to check in and check out code changes.
  • Utilized machine learning algorithms such as linear regression, multivariate regression, PCA, K-means, and KNN for data analysis.
  • Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, designing and developing POCs using Scala, Spark SQL, and the MLlib libraries.
  • Performed day-to-day Git support for different projects and was responsible for maintaining the Git repositories and access control strategies.
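A minimal sketch of the Hive loading pattern above, written in PySpark with hypothetical database, table, and column names; it covers the partitioned Parquet target with Snappy compression and a dynamic-partition insert from an Avro-backed table:

    from pyspark.sql import SparkSession

    # Spark session with Hive support; all object names below are hypothetical.
    spark = (SparkSession.builder
             .appName("avro-to-parquet-load")
             .enableHiveSupport()
             .getOrCreate())

    # Target table: partitioned by event_date, stored as Parquet with Snappy compression.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.usage_parquet (
            customer_id BIGINT,
            event_type  STRING,
            payload     STRING
        )
        PARTITIONED BY (event_date STRING)
        STORED AS PARQUET
        TBLPROPERTIES ('parquet.compression' = 'SNAPPY')
    """)

    # Allow dynamic partitioning, then copy rows from the existing Avro-backed table.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
    spark.sql("""
        INSERT INTO TABLE analytics.usage_parquet PARTITION (event_date)
        SELECT customer_id, event_type, payload, event_date
        FROM analytics.usage_avro
    """)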

Environment: Hadoop 2.x, Hive v2.3.1, Spark v2.1.3, Databricks, Lambda, Glue, Azure, ADF, Blob, cosmos DB, Python, PySpark, Java, Scala, SQL, Sqoop v1.4.6, Kafka, Airflow v1.9.0, Oozie, HBase, Oracle, Teradata, Cassandra, MLlib, Tableau, Maven, Git, Jira.

Data engineer/Big Data Developer

Confidential, Rensselaer, NY

Responsibilities:

  • Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
  • Implemented a proof of concept deploying the product to an AWS S3 bucket and Snowflake.
  • Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehousing, and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful, valuable information for better decision making.
  • Developed Scala scripts and UDFs using both DataFrames/Spark SQL and RDDs in Spark for data aggregation, queries, and writing back into the S3 bucket.
  • Experience in data cleansing and data mining.
  • Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
  • Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
  • Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation, and used the Spark engine and Spark SQL for data analysis, handing results to the data scientists for further analysis.
  • Involved in migrating data from an on-prem Cloudera cluster to AWS EC2 instances deployed in an EMR cluster, and developed an ETL pipeline to extract logs, store them in an AWS S3 data lake, and process them with PySpark.
  • Prepared scripts in Python and Scala to automate the ingestion process as needed from various sources such as APIs, AWS S3, Teradata, and Snowflake.
  • Designed and developed Spark workflows in Scala to pull data from AWS S3 buckets and Snowflake and apply transformations to it.
  • Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
  • Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
  • Created Python scripts to read CSV, JSON, and Parquet files from S3 buckets and load the data into AWS S3, DynamoDB, and Snowflake.
  • Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 bucket, or to HTTP requests through Amazon API Gateway.
  • Migrated data from AWS S3 bucket to Snowflake by writing custom read/write snowflake utility function using Scala.
  • Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the Confidential data lake in an AWS S3 bucket.
  • Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse, and created DAGs to run in Airflow.
  • Troubleshot and maintained ETL/ELT jobs running in Matillion.
  • Created DAGs using the Email Operator, Bash Operator, and Spark Livy operator to execute jobs on an EC2 instance (a minimal sketch follows this list).
  • Deployed the code to EMR via CI/CD using Jenkins.
  • Extensively used Code cloud for code check-in and checkout for version control.
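A minimal sketch of a daily Airflow DAG like the ones described above, assuming Airflow 2.x-style imports; the schedule, script path, and email address are illustrative placeholders:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.email import EmailOperator

    # Hypothetical daily load: submit a PySpark job, then send a completion email.
    default_args = {
        "owner": "data-engineering",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
    }

    with DAG(
        dag_id="s3_to_snowflake_daily",
        default_args=default_args,
        start_date=datetime(2021, 1, 1),
        schedule_interval="0 6 * * *",  # once a day at 06:00
        catchup=False,
    ) as dag:

        load_to_snowflake = BashOperator(
            task_id="load_to_snowflake",
            bash_command="spark-submit /opt/jobs/s3_to_snowflake.py --run-date {{ ds }}",
        )

        notify_success = EmailOperator(
            task_id="notify_success",
            to="data-team@example.com",
            subject="s3_to_snowflake_daily succeeded for {{ ds }}",
            html_content="Daily S3-to-Snowflake load completed.",
        )

        load_to_snowflake >> notify_success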

Environment: Agile Scrum, MapReduce, Snowflake, Pig, Spark, Scala, Hive, Kafka, Python, Airflow, JSON, Parquet, CSV, Code cloud, AWS.

Big Data Developer

Confidential, Fort Worth, TX

Responsibilities:

  • Worked with the Hortonworks distribution. Installed, configured, and maintained a Hadoop cluster based on the business requirements.
  • Involved in end-to-end implementation of ETL pipelines using Python and SQL for high-volume data analysis; also reviewed use cases before onboarding them to HDFS.
  • Responsible for loading, managing, and reviewing terabytes of log files using the Ambari web UI.
  • Used Sqoop to move data between relational databases and HDFS. Ingested data from MS SQL, Teradata, and Cassandra databases.
  • Performed ad-hoc queries using Hive joins and bucketing techniques for faster data access.
  • Used NiFi to automate data flow between disparate systems. Designed dataflow models and target tables to derive relevant metrics from different sources.
  • Developed Bash scripts to fetch log files from an FTP server and executed Hive jobs to parse them.
  • Implemented various Hive queries for analytics. Created external tables, optimized Hive queries, and improved cluster performance by 30%.
  • Enhanced existing Python modules. Worked on writing APIs to load the processed data into HBase tables.
  • Migrated ETL tasks to Pig scripts to apply joins, aggregations, and transformations.
  • Worked with Jenkins build and continuous integration tools. Involved in writing Groovy scripts to automate the Jenkins pipeline's integration and delivery service.
  • Used Power BI as a front-end BI tool and MS SQL Server as a back-end database to plan and create dashboards, workbooks, and complex aggregate calculations.
  • Used Jenkins for CI/CD and SVN for version control.
  • Used Informatica as an ETL tool to create source/target definitions, mappings and sessions to extract, transform and load data into staging tables from various sources.
  • Designed and Developed Informatica processes to extract data from internal check issue systems.
  • Used Informatica PowerExchange to extract data from one of the EIC's operational systems, Datacom.
  • Extensive experience building and publishing customized interactive reports and dashboards and scheduling reports using Tableau Desktop and Tableau Server.
  • Extensive experience with the Tableau administration tools, Tableau interactive dashboards, and the Tableau suite.
  • Developed Tableau visualizations and dashboards using Tableau Desktop and published the same on Tableau Server.
  • Used Informatica PowerCenter for extraction, transformation, and loading (ETL) of data from heterogeneous source systems into the target database.
  • Analyzed data stored in S3 buckets using SQL and PySpark, stored the processed data in Redshift, and validated data sets by implementing Spark components (see the sketch after this list).
  • Worked as an ETL developer and Tableau developer, widely involved in designing, developing, and debugging ETL mappings using the Informatica Designer tool, and created advanced chart types, visualizations, and complex calculations to manipulate data using Tableau Desktop.
  • Used the Custom SQL feature in Tableau Desktop to create complex, performance-optimized dashboards.
  • Connected Tableau to various databases and performed live data connections, query auto-updates on data refresh, and so on.
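A minimal sketch of the S3-to-Redshift step above, in PySpark with hypothetical bucket, table, and connection details; it assumes the Redshift JDBC driver is available on the Spark classpath:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("s3-to-redshift").getOrCreate()

    # Read raw click data from S3 (paths are placeholders) and aggregate it.
    clicks = spark.read.parquet("s3a://example-bucket/raw/clicks/")
    daily = (clicks
             .groupBy("click_date", "campaign_id")
             .agg(F.count(F.lit(1)).alias("click_count")))

    # Write the processed data to a Redshift table over JDBC.
    (daily.write
          .format("jdbc")
          .option("url", "jdbc:redshift://example-cluster:5439/analytics")
          .option("dbtable", "public.daily_clicks")
          .option("user", "etl_user")
          .option("password", "<from a secrets manager>")
          .option("driver", "com.amazon.redshift.jdbc42.Driver")
          .mode("append")
          .save())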

Hadoop Developer

Confidential

Responsibilities:

  • Participated in requirement discussions and designed the solution.
  • Estimated the Hadoop cluster requirements.
  • Responsible for choosing the Hadoop components (Hive, Pig, MapReduce, Sqoop, Flume, etc.).
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Built the Hadoop cluster and ingested data using Sqoop.
  • Imported streaming logs to HDFS through Flume.
  • Used Flume to collect, aggregate, and store web log data from different sources such as web servers, mobile, and network devices, and pushed it to HDFS.
  • Developed use cases and technical prototypes for implementing Hive and Pig.
  • Worked in analyzing data using Hive, Pig and custom MapReduce programs in Java.
  • Implemented partitioning, dynamic partitions, and buckets in Hive.
  • Installed and configured Hive, Sqoop, Flume, Oozie on the Hadoop cluster.
  • Involved in scheduling Oozie workflow engine to run multiple Hive and Pig jobs.
  • Tuned the Hadoop clusters and monitored memory management and MapReduce jobs.
  • Responsible for Cluster maintenance, Adding and removing cluster nodes, Cluster Monitoring and Troubleshooting.
  • Developed a custom framework for solving the small-files problem in Hadoop (a simplified sketch of the idea follows this list).
  • Deployed and administered a 70-node Hadoop cluster and administered two smaller clusters.
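The original framework was a custom Hadoop/MapReduce solution; purely as a rough illustration of the compaction idea, here is a PySpark-flavored sketch with hypothetical HDFS paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

    src = "hdfs:///data/raw/events/dt=2016-05-01"        # directory with many small log files
    dst = "hdfs:///data/compacted/events/dt=2016-05-01"  # rewritten as a few large files

    # Read all the small text files, then collapse them into a fixed, small
    # number of partitions so the rewrite produces a handful of large files
    # instead of thousands of tiny ones.
    logs = spark.read.text(src)
    logs.coalesce(8).write.mode("overwrite").text(dst)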

Environment: Map Reduce, HBase, HDFS, Hive, Pig, Java, SQL, Cloudera Manager

Data Analyst

Confidential

Responsibilities:

  • Understood the data visualization requirements from the business users.
  • Wrote SQL queries to extract data from the sales data marts as per the requirements.
  • Developed Tableau data visualizations using scatter plots, geographic maps, pie charts, bar charts, and density charts.
  • Designed and deployed rich graphic visualizations with drill-down and drop-down menu options and parameters using Tableau.
  • Created action filters, parameters and calculated sets for preparing dashboards and worksheets in Tableau.
  • Explored traffic data from databases, joined it with transaction data, and presented and wrote reports for every campaign, providing suggestions for future promotions.
  • Extracted data using SQL queries and transferred it to Microsoft Excel and Python for further analysis (see the sketch after this list).
  • Performed data cleaning, merging, and exporting of the dataset in Tableau Prep.
  • Carried out data processing and cleaning techniques to reduce text noise and dimensionality in order to improve the analysis.
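A minimal sketch of the SQL-to-Python hand-off above, assuming a hypothetical connection string, table, and column names (and that pandas, SQLAlchemy, and openpyxl are installed):

    import pandas as pd
    from sqlalchemy import create_engine

    # Pull campaign revenue from the sales data mart; all names are placeholders.
    engine = create_engine("postgresql://analyst:password@warehouse-host:5432/sales_mart")

    query = """
        SELECT campaign_id,
               order_date,
               SUM(order_amount) AS revenue
        FROM   fact_orders
        GROUP  BY campaign_id, order_date
    """
    df = pd.read_sql(query, engine)

    # Light cleanup before handing off: drop rows without a campaign and normalize dates.
    df = df.dropna(subset=["campaign_id"])
    df["order_date"] = pd.to_datetime(df["order_date"])

    # Export for further analysis in Excel.
    df.to_excel("campaign_revenue.xlsx", index=False)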
