
Data Engineer Resume


San Jose, CA

SUMMARY

  • Big Data professional with 7+ years of combined experience in the fields of Data Applications, Big Data implementations and Java/J2EE technologies.
  • Experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.
  • Expert in providing ETL solutions for any type of business model.
  • Implemented Integration solutions for cloud platforms with Informatica Cloud.
  • High exposure to Big Data technologies and the Hadoop ecosystem, with an in-depth understanding of MapReduce and Hadoop infrastructure.
  • Expertise in writing end-to-end data processing jobs to analyze data using MapReduce, Spark and Hive.
  • Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames and RDDs, and knowledge of Spark MLlib (see the PySpark sketch after this list).
  • Involved in building data models and dimensional modeling with 3NF, Star and Snowflake schemas for OLAP and operational data store (ODS) applications.
  • Extensive knowledge of developing Spark Streaming jobs with RDDs (Resilient Distributed Datasets) using Scala, PySpark and the Spark shell.
  • Experience with the Alteryx platform, including data preparation, data blending, and the creation of data models and data sets using Alteryx.
  • Experienced in data manipulation using Python for loading and extraction as well as with Python libraries such as NumPy, SciPy and Pandas for data analysis and numerical computations.
  • Experience in analyzing data from multiple sources and creating reports with interactive dashboards in Power BI; extensive knowledge of designing reports, scorecards, and dashboards using Power BI.
  • Experienced in using Pig scripts for transformations, event joins, filters and pre-aggregations before storing the data in HDFS.
  • Strong knowledge of Hive analytical functions, extending Hive functionality by writing custom UDFs.
  • Expertise in writing MapReduce jobs in Python to process large structured, semi-structured and unstructured data sets and store them in HDFS.
  • Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
  • Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
  • Hands-on experience with Amazon Web Services (AWS), using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.
  • Hands-on experience with SQL and NoSQL databases such as Snowflake, HBase, Cassandra and MongoDB.
  • Hands-on experience setting up workflows with Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.
  • Strong experience in working with UNIX/LINUX environments, writing shell scripts.
  • Excellent knowledge of J2EE architecture, design patterns and object modeling, with comprehensive experience in web-based applications using J2EE frameworks such as Spring, Hibernate, Struts and JMS.
  • Worked with various file formats, including delimited text files, clickstream log files, Apache log files, Avro files, JSON files and XML files.
  • Experienced in working with Agile and Waterfall SDLC methodologies.
  • Strong analytical, presentation, communication and problem-solving skills, with the ability to work independently as well as in a team and to follow the best practices and principles defined for the team.
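
A minimal PySpark sketch of the kind of end-to-end DataFrame processing job described above; the input path, column names and output path are illustrative assumptions rather than project specifics.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Start (or reuse) a Spark session for the job.
    spark = SparkSession.builder.appName("daily_event_counts").getOrCreate()

    # Read semi-structured JSON input from HDFS (path and schema are assumed).
    events = spark.read.json("hdfs:///data/raw/events/")

    # Aggregate with the DataFrame API: number of events per type per day.
    daily_counts = (events
                    .filter(F.col("event_type").isNotNull())
                    .groupBy("event_type", F.to_date("event_ts").alias("event_date"))
                    .count())

    # Write the curated result back to HDFS as Parquet for downstream Hive/BI use.
    daily_counts.write.mode("overwrite").parquet("hdfs:///data/curated/event_counts/")
    spark.stop()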

TECHNICAL SKILLS

Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Sqoop, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm

Hadoop Distributions: Cloudera and Hortonworks

Programming Languages & Frameworks: Scala, Java/J2EE (Spring, Hibernate, JDBC), JSON, HTML, CSS

Scripting Languages: JavaScript, jQuery, Python, Shell scripting (bash, sh)

Databases: Oracle, MySQL, SQL Server, PostgreSQL, HBase, Snowflake, Cassandra, MongoDB

Operating Systems: Linux, Windows, Ubuntu, Unix

Analytics Tools: Tableau, Microsoft SSIS, SSAS and SSRS

Data Warehousing & BI: Star Schema, Snowflake schema, SAS, SSIS and Splunk

ETL Tools: Informatica PowerCenter, Talend

Cloud Services: AWS (EC2, EMR, S3, Redshift, Lambda, Athena), Azure

Python Libraries: NumPy, Pandas, PySpark

PROFESSIONAL EXPERIENCE

Confidential, San Jose CA

Data Engineer

Responsibilities:

  • Prepared the ETL design document, covering the database structure, change data capture, error handling, and restart and refresh strategies.
  • Worked with different feed formats such as JSON, CSV, XML and DAT, and implemented the data lake concept.
  • Developed Informatica design mappings using various transformations.
  • Used the Azure Databricks distribution for Hadoop workloads, Blob Storage for raw file storage and virtual machines for Kafka.
  • Used Azure Functions to perform data validation, filtering, sorting and other transformations for every data change in a DynamoDB table and to load the transformed data into another data store.
  • Programmed ETL functions between Oracle and Amazon Redshift.
  • Maintained end-to-end ownership of data analysis, framework development, implementation and communication for a range of customer analytics projects.
  • Gained good exposure to the IRI end-to-end analytics service engine and the new big data platform (Hadoop loader framework, big data Spark framework, etc.).
  • Used a Kafka producer to ingest raw data into Kafka topics and ran the Spark Streaming app to process clickstream events (see the streaming sketch after this list).
  • Performed data analysis and predictive data modeling.
  • Migrated data from Azure Blob Storage to Snowflake by writing a custom read/write Snowflake utility function in Scala (see the Snowflake utility sketch after this list).
  • Designed and developed Spark workflows in Scala to pull data from Azure Blob Storage and Snowflake and apply transformations to it.
  • Installed and configured Apache Airflow for Azure Blob Storage and the Snowflake data warehouse, and created DAGs to run in Airflow.
  • Explored clickstream event data with Spark SQL.
  • Architected and implemented in production a big data MapR Hadoop solution for digital media marketing using telecom, shipment, point-of-sale (POS), exposure and advertising data related to consumer product goods.
  • Used Spark SQL, as part of the Apache Spark big data framework, to process structured shipment, POS, consumer, household, individual digital impression and household TV impression data.
  • Implemented user provisioning, password resets, and group creation and mapping with Azure identity management; installed and configured the feature for user provisioning and day-to-day identity administration.
  • Created DataFrames from different data sources such as existing RDDs, structured data files, JSON datasets, Hive tables and external databases.
  • Managed user access and login security for Azure IAM applications.
  • Evaluated and worked with Azure Data Factory as an ETL tool to process business-critical data into aggregated Hive tables; deployed and developed big data applications such as Spark, Hive, Kafka and Flink in the Azure cloud.
  • Loaded terabytes of raw data at different levels into Spark RDDs for computation to generate the output response.
  • Responsibilities included platform specification and redesign of load processes, as well as projections of future platform growth.
  • Coordinated deployments to the QA and PROD environments.
  • Used Python to automate Hive jobs and read configuration files.
  • Used Spark for fast data processing, with both the Spark shell and a Spark standalone cluster.
  • Used Hive to analyze partitioned data and compute various metrics for reporting.
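
A hedged sketch of the Kafka-to-Spark clickstream flow above, shown here with PySpark Structured Streaming; the broker address, topic name, event schema and output paths are assumptions, and the spark-sql-kafka connector is expected on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("clickstream_streaming").getOrCreate()

    # Assumed clickstream event schema.
    click_schema = StructType([
        StructField("user_id", StringType()),
        StructField("page", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    # Subscribe to the (assumed) Kafka topic that the producer writes raw events to.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "clickstream")
           .load())

    # Kafka values arrive as bytes; cast to string and parse the JSON payload.
    clicks = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(F.from_json("json", click_schema).alias("event"))
              .select("event.*"))

    # Land the parsed events in mounted Blob storage for downstream batch jobs.
    query = (clicks.writeStream
             .format("parquet")
             .option("path", "/mnt/raw/clickstream/")
             .option("checkpointLocation", "/mnt/checkpoints/clickstream/")
             .start())
    query.awaitTermination()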
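
A hedged PySpark sketch of a reusable Snowflake read/write helper along the lines of the Scala utility mentioned above, using the Spark Snowflake connector; the connection options, table names and Blob storage path are placeholders.

    # Assumes a running SparkSession `spark` and the spark-snowflake connector on the classpath.
    SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"

    # Placeholder connection options; in practice these come from a secrets store.
    sf_options = {
        "sfURL": "<account>.snowflakecomputing.com",
        "sfUser": "<user>",
        "sfPassword": "<password>",
        "sfDatabase": "ANALYTICS",
        "sfSchema": "PUBLIC",
        "sfWarehouse": "LOAD_WH",
    }

    def read_snowflake(spark, table):
        """Read a Snowflake table into a Spark DataFrame."""
        return (spark.read.format(SNOWFLAKE_SOURCE)
                .options(**sf_options)
                .option("dbtable", table)
                .load())

    def write_snowflake(df, table, mode="overwrite"):
        """Write a Spark DataFrame to a Snowflake table."""
        (df.write.format(SNOWFLAKE_SOURCE)
           .options(**sf_options)
           .option("dbtable", table)
           .mode(mode)
           .save())

    # Example: pull raw Parquet files from Blob storage, de-duplicate, land in Snowflake.
    # raw = spark.read.parquet("wasbs://raw@<account>.blob.core.windows.net/events/")
    # write_snowflake(raw.dropDuplicates(), "RAW_EVENTS")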

Environment: MapReduce, HDFS, Hive, Python, Scala, Kafka, Spark, Spark SQL, Oracle, Informatica 9.6, SQL, MapR, Sqoop, Zookeeper, Azure Blob Storage, Azure Databricks, Azure Virtual Machines, Data Pipeline, Jenkins, Git, JIRA, Unix/Linux, Agile Methodology, Scrum.

Confidential, New York, NY

Sr. Data Engineer

Responsibilities:

  • Responsible for the execution of big data analytics, predictive analytics and machine learning initiatives.
  • Implemented a proof of concept deploying the product in an AWS S3 bucket and Snowflake.
  • Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehousing and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability and performance, and to provide meaningful and valuable information for better decision-making.
  • Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation, queries and writing results back into the S3 bucket.
  • Experience in data cleansing and data mining.
  • Wrote, compiled and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
  • Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
  • Wrote Spark applications for data validation, cleansing, transformation and custom aggregation, and used the Spark engine and Spark SQL for data analysis, providing the results to data scientists for further analysis.
  • Prepared scripts in Python and Scala to automate the ingestion process as needed from various sources such as APIs, AWS S3, Teradata and Snowflake.
  • Designed and developed Spark workflows in Scala to pull data from the AWS S3 bucket and Snowflake and apply transformations to it.
  • Implemented Spark RDD transformations to map business analysis logic and applied actions on top of the transformations.
  • Automated the resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
  • Created Python scripts to read CSV, JSON and Parquet files from S3 buckets and load them into AWS S3, DynamoDB and Snowflake.
  • Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 bucket, or to HTTP requests through Amazon API Gateway (see the Lambda sketch after this list).
  • Migrated data from the AWS S3 bucket to Snowflake by writing a custom read/write Snowflake utility function in Scala.
  • Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the Confidential data lake S3 bucket.
  • Profiled structured, unstructured and semi-structured data across various sources to identify patterns and implemented data quality metrics using the necessary queries or Python scripts, depending on the source.
  • Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse, and created DAGs to run in Airflow.
  • Created a DAG using the EmailOperator, BashOperator and Spark Livy operator to execute jobs on an EC2 instance (see the DAG sketch after this list).
  • Deployed the code to EMR via CI/CD using Jenkins.
  • Extensively used Code Cloud for code check-ins and checkouts for version control.
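
A minimal sketch of the style of AWS Lambda function described above, triggered by S3 object-created events and recording each object in DynamoDB; the table name and audit fields are illustrative assumptions.

    import json
    import urllib.parse

    import boto3

    dynamodb = boto3.resource("dynamodb")
    s3 = boto3.client("s3")

    TABLE_NAME = "ingest_audit"  # assumed DynamoDB table

    def lambda_handler(event, context):
        """Record each newly created S3 object in a DynamoDB audit table."""
        table = dynamodb.Table(TABLE_NAME)
        records = event.get("Records", [])

        for record in records:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            # Look up the object's size and write one audit row per object.
            head = s3.head_object(Bucket=bucket, Key=key)
            table.put_item(Item={
                "object_key": key,
                "bucket": bucket,
                "size_bytes": head["ContentLength"],
            })

        return {"statusCode": 200, "body": json.dumps({"processed": len(records)})}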
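
A hedged Airflow 2.x sketch of a DAG combining the BashOperator and EmailOperator mentioned above (the Livy step is shown as a plain spark-submit); the schedule, script path and e-mail address are assumptions.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.email import EmailOperator

    with DAG(
        dag_id="s3_to_snowflake_daily",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        # Submit the Spark load job (assumed script path) on the EC2 edge node.
        load_snowflake = BashOperator(
            task_id="spark_load_snowflake",
            bash_command="spark-submit --master yarn /opt/jobs/s3_to_snowflake.py",
        )

        # Notify the team once the daily load completes.
        notify = EmailOperator(
            task_id="notify_team",
            to="data-team@example.com",
            subject="s3_to_snowflake_daily succeeded",
            html_content="Daily S3 to Snowflake load finished.",
        )

        load_snowflake >> notify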

Environment: Agile Scrum, MapReduce, Snowflake, Pig, Spark, Scala, Hive, Kafka, Python, Airflow, JSON, Parquet, CSV, Code Cloud, AWS.

Confidential, Dallas TX

Data Engineer

Responsibilities:

  • Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance and auto-scaling with AWS CloudFormation.
  • Supported continuous storage in AWS using Elastic Block Store, S3 and Glacier.
  • Created Volumes and configured Snapshots for EC2 instances.
  • Used the DataFrame API in Scala to work with distributed collections of data organized into named columns, and developed predictive analytics using the Apache Spark Scala APIs.
  • Developed Scala scripts using DataFrames/SQL/Datasets and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
  • Developed Hive queries to pre-process the data required for running the business process.
  • Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
  • Implemented a generalized solution model using AWS SageMaker.
  • Extensive expertise using the core Spark APIs and processing data on an EMR cluster.
  • Worked on ETL migration by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Data Catalog and can be queried from Athena (see the Athena sketch after this list).
  • Programmed in Hive, Spark SQL, Java, C# and Python to streamline incoming data, build data pipelines that yield useful insights, and orchestrate the pipelines.
  • Worked on an ETL pipeline to source these tables and deliver the calculated ratio data from AWS to the data mart (SQL Server) and the Credit Edge server (see the JDBC delivery sketch after this list).
  • Experience in using and tuning relational databases (e.g. Microsoft SQL Server, Oracle, MySQL) and columnar databases (e.g. Amazon Redshift, Microsoft SQL Data Warehouse).
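
A minimal boto3 sketch of querying a Glue-catalogued table from Athena, along the lines of the serverless pipeline above; the region, database, table and results location are placeholders.

    import time

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")  # assumed region

    def run_athena_query(sql, database="analytics_db", output="s3://example-athena-results/"):
        """Start an Athena query against a Glue-catalogued database and poll until it finishes."""
        query_id = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output},
        )["QueryExecutionId"]

        while True:
            status = athena.get_query_execution(QueryExecutionId=query_id)
            state = status["QueryExecution"]["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                return query_id, state
            time.sleep(2)

    # Example: count the rows landed by the Lambda pipeline (table name is hypothetical).
    # print(run_athena_query("SELECT count(*) FROM curated_events"))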
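
A hedged PySpark sketch (the project's scripts were written in Scala) of aggregating source tables on EMR and delivering the calculated ratio to the SQL Server data mart over JDBC; the S3 path, columns, connection string and target table are assumptions, and the Microsoft JDBC driver must be on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("ratio_to_datamart").getOrCreate()

    # Source data on S3 (path and columns are assumed).
    loans = spark.read.parquet("s3://example-bucket/curated/loans/")

    # Calculate a simple ratio per portfolio.
    ratios = (loans.groupBy("portfolio_id")
              .agg((F.sum("defaulted_amount") / F.sum("total_amount")).alias("default_ratio")))

    # Deliver the result to the SQL Server data mart over JDBC.
    (ratios.write.format("jdbc")
     .option("url", "jdbc:sqlserver://datamart.example.com:1433;databaseName=risk")
     .option("dbtable", "dbo.portfolio_default_ratio")
     .option("user", "etl_user")
     .option("password", "****")
     .mode("overwrite")
     .save())

    spark.stop()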

Environment: Hortonworks, Hadoop, HDFS, AWS Glue, AWS Athena, EMR, Pig, Sqoop, Hive, NoSQL, HBase, Shell Scripting, Scala, Spark, Spark SQL, AWS, SQL Server, Tableau.

Confidential

Data Analyst

Responsibilities:

  • Participated in business analysis, talking to business users and determining the entities and attributes for the data model.
  • Gathered, analyzed, documented and translated application requirements into data models, and supported standardization of documentation and the adoption of standards and practices related to data and applications.
  • Identified and determined physical attributes and their relationships through cross-analysis of functional areas.
  • Identified and analyzed source data coming from Oracle, SQL server and flat files.
  • Converted data from PDF to XML using a Python script in two steps: from raw XML to processed XML, and from processed XML to CSV files.
  • Managed and analyzed data and generated reports from SQL Server into Microsoft Excel for creating pivot tables and VLOOKUPs.
  • Accessed data stored in a SQLite3 data file (DB) using Python, extracted the metadata, tables and table data, and converted the tables to their respective CSV files (see the sketch after this list).
  • Extensively used Pandas, NumPy, Seaborn, Matplotlib, scikit-learn, SciPy and NLTK with Python while working with various machine learning algorithms.
  • Performed complex pattern recognition on automotive time series data and forecast demand using the Facebook Prophet and ARIMA models and exponential smoothing for multivariate time series data.
  • Used Natural Language Processing (NLP) to pre-process the data, determine the number of words and topics in the emails, and form clusters of words.
  • Performed data cleaning, feature scaling and feature engineering using the Pandas and NumPy packages in Python.
  • Implemented visualizations and views such as combo charts, stacked bar charts, Pareto charts, donut charts, geographic maps, sparklines and crosstabs.
  • Designed, developed, and maintained daily and monthly summary, trending, and benchmark reports repository in Tableau Desktop.
  • Built and published customized interactive reports and dashboards, with report scheduling, using Tableau Server.
  • Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
  • Provided ad-hoc analysis and hosted recurring reports; used the data blending feature in Tableau.
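
A minimal sketch of the SQLite-to-CSV extraction described above; the database file name is an assumption for illustration.

    import sqlite3

    import pandas as pd

    conn = sqlite3.connect("DB.sqlite3")

    # Read the table list from SQLite's own metadata, then dump each table to its own CSV.
    tables = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type='table';", conn)["name"]

    for table in tables:
        df = pd.read_sql_query(f"SELECT * FROM {table};", conn)
        df.to_csv(f"{table}.csv", index=False)

    conn.close()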
