Data Engineer Resume
Columbus, IN
SUMMARY
- 7 years of total IT experience, including 4 years as a Data Engineer building data pipelines for ingesting and transforming data.
- Hands-on experience in writing complex SQL queries to extract, transform, and load (ETL) data from databases.
- Good knowledge of Big Data applications and implementation of end-to-end streaming solutions using Spark.
- Knowledge of design and data modeling for OLTP and OLAP databases, with strong problem-solving and analytical skills.
- Strong hands-on experience in data cleaning and exploration using various libraries in Python and Scala.
- Experience in data load management, importing and exporting data using Sqoop and Flume.
- Experience in scheduling and monitoring jobs using Oozie, Hue, and Appworx.
- Worked on real-time data integration using Kafka, Spark Streaming, and HBase.
- Experience working with Structured Streaming, Accumulators, broadcast variables, various levels of caching, and optimization techniques in Spark.
- Hands-on experience writing code in Scala, building JARs with Maven, and deploying them on Databricks clusters.
- Developed highly scalable Spark applications using Spark Core, DataFrames, Spark SQL, and the Spark Streaming API in Scala.
- Worked on setting up and configuring the ELK Stack for error log capture and management.
- Solid experience working with CSV, text, Avro, Parquet, ORC, and JSON data formats.
- Experience working with the Hive data warehouse tool: creating tables, distributing data with static and dynamic partitioning and bucketing, and optimizing HiveQL queries (see the sketch after this summary).
- Worked on installing, configuring, and monitoring Apache Airflow for running both batch and streaming workflows.
- Strong experience in writing SQL queries.
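A minimal sketch of the Hive partitioning and bucketing approach referenced in this summary, expressed through Spark SQL in Scala; the table names, columns, partition column, and bucket count are illustrative assumptions, not project specifics:

    import org.apache.spark.sql.SparkSession

    object HivePartitioningSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-partitioning-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Hive table partitioned by date and bucketed by customer_id (names are illustrative).
        spark.sql(
          """CREATE TABLE IF NOT EXISTS sales_partitioned (
            |  order_id    BIGINT,
            |  customer_id BIGINT,
            |  amount      DOUBLE)
            |PARTITIONED BY (order_date STRING)
            |CLUSTERED BY (customer_id) INTO 32 BUCKETS
            |STORED AS PARQUET""".stripMargin)

        // Enable dynamic partitioning so partition values are derived from the data.
        spark.sql("SET hive.exec.dynamic.partition = true")
        spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

        // Dynamic-partition insert from a staging table; the partition column goes last in the SELECT.
        spark.sql(
          """INSERT OVERWRITE TABLE sales_partitioned PARTITION (order_date)
            |SELECT order_id, customer_id, amount, order_date
            |FROM sales_staging""".stripMargin)

        spark.stop()
      }
    }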
TECHNICAL SKILLS
Programming: Python, Scala, Java, R, JavaScript, C
Big Data: HDFS, MapReduce, HIVE, Apache Spark, Kafka, Nifi, Airflow, Databricks
Databases: MySQL, SQL/PL-SQL, Microsoft SQL Server, Redshift, Cassandra, HBase
BI/Analytics Tools: Tableau, Kibana, Grafana, D3.js, Shiny, Plotly, MS Excel
Scripting/ Web Languages: JavaScript, HTML5, CSS3, XML, SQL, JSON, Shell
ETL Tools: Appworx, Sqoop, Oozie, Hue
Office Tools: MS-Office, MS-Project, Visio, Confluence, Jira, Asana
Software Life Cycles: Waterfall and Agile models
Utilities/Tools: Eclipse, Tomcat, JUnit, SVN, Log4j, ANT, Maven, GitLab, Bitbucket, IntelliJ IDE, Postman
Cloud Platforms: Microsoft Azure, AWS
PROFESSIONAL EXPERIENCE
Confidential, Columbus, IN
DATA ENGINEER
Responsibilities:
- Developed pipelines to process data in near real-time
- Played a key role in migrating the framework environments to the latest Databricks Runtime, 7.3 LTS
- Developed a solution to read and store data in flattened JSON format to overcome schema drift challenges
- Designed and implemented an in-house feature store (reusable functions) used to triangulate engine condition based on engine sensor and servicing data
- Worked on Structured Streaming to read encrypted messages from Amazon SQS
- Migrated workloads from the traditional spark-submit framework on Azure HDInsight to Databricks, first to DBR 5.5 LTS and later to DBR 7.3.
- Upgraded to Delta Lake: migrated Hive tables from Parquet to Delta format in the Azure Data Lake Storage Gen2 environment, significantly improving overall query performance for the team (see the sketch after this role).
- Implemented Structured Streaming: built an end-to-end Structured Streaming solution for a product, replacing an existing batch data pipeline with a near real-time pipeline from the raw layer to the feature layer.
- Appworx to Databricks setup: carried out a POC to execute API-based calls from Appworx to Databricks.
- Identified and resolved challenges related to the management model, server scale-up, and master-slave network issues.
- Databricks workspace setup: set up a No Public IP (NPIP) Databricks workspace for product teams.
- Airflow setup: installed and configured Apache Airflow for workflow management and created workflows (DAGs) in Python.
Environment: Apache Spark, Databricks, Microsoft Azure, Scala, SQL, Python, Hive
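A minimal sketch of the Parquet-to-Delta migration described in this role, assuming a Databricks runtime with Delta Lake available; the ADLS Gen2 path, table name, and partition column are illustrative assumptions:

    import io.delta.tables.DeltaTable
    import org.apache.spark.sql.SparkSession

    object ParquetToDeltaSketch {
      def main(args: Array[String]): Unit = {
        // On Databricks a session already exists; getOrCreate simply reuses it.
        val spark = SparkSession.builder().getOrCreate()

        // Illustrative ADLS Gen2 location of an existing Parquet dataset partitioned by order_date.
        val path = "abfss://lake@storageaccount.dfs.core.windows.net/raw/sales"

        // Convert the Parquet files in place to Delta format.
        DeltaTable.convertToDelta(spark, s"parquet.`$path`", "order_date STRING")

        // Re-register the table so downstream queries read the Delta table.
        spark.sql(s"CREATE TABLE IF NOT EXISTS sales_delta USING DELTA LOCATION '$path'")

        // Compact small files to improve query performance (OPTIMIZE is available on Databricks).
        spark.sql(s"OPTIMIZE delta.`$path`")
      }
    }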
Confidential, Irving, TX
DATA ENGINEER
Responsibilities:
- System of Insights framework: as part of the S.O.I. team, worked on developing and maintaining frameworks for data ingestion and transformation.
- Spark ETL pipelines: developed ETL pipelines to ingest transactional data, transform it, and move it through a real-time processing pipeline into the data warehouse for analysis.
- Developed pipelines to process data in near real-time
- Worked on Spark Structured Streaming to develop a live streaming data pipeline with Kafka topics as the source and insights written to Cassandra; the incoming data arrived in JSON/XML format and was stored in Cassandra (see the sketch after this role).
- Performed data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
- Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce extraction jobs.
- Handled large datasets during the ingestion process itself using partitioning, Spark in-memory capabilities, broadcast variables, and effective, efficient joins and transformations.
- Designed and developed a system to collect data from multiple portals using Kafka and process it using Spark.
- Setup and development in Cassandra: involved in optimizing Cassandra keyspaces for low latency and high fault tolerance.
- Involved in developing the Insight Store data model for Cassandra, which was used to store the transformed data.
- Development in Hive: involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.
- ELK Stack development: set up and configured the ELK Stack for error log capture and management.
Environment: Apache Spark, Kafka, Scala, SQL, Python, Hive, Cassandra, HBase, ELK, Grafana, AWS
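A minimal sketch of the Kafka-to-Cassandra Structured Streaming pipeline described in this role, assuming the DataStax spark-cassandra-connector is on the classpath; the topic, message schema, keyspace, and table names are illustrative assumptions:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    object KafkaToCassandraSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-to-cassandra-sketch")
          .config("spark.cassandra.connection.host", "127.0.0.1")
          .getOrCreate()
        import spark.implicits._

        // Expected shape of the JSON messages on the topic (illustrative).
        val schema = new StructType()
          .add("txn_id", StringType)
          .add("account_id", StringType)
          .add("amount", DoubleType)
          .add("event_time", TimestampType)

        // Read the Kafka topic and parse the JSON payload into columns.
        val transactions = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "transactions")
          .load()
          .select(from_json($"value".cast("string"), schema).as("t"))
          .select("t.*")

        // Write each micro-batch to Cassandra through the DataFrame writer.
        val query = transactions.writeStream
          .foreachBatch { (batch: DataFrame, _: Long) =>
            batch.write
              .format("org.apache.spark.sql.cassandra")
              .options(Map("keyspace" -> "insights", "table" -> "transactions"))
              .mode("append")
              .save()
          }
          .option("checkpointLocation", "/tmp/checkpoints/transactions")
          .start()

        query.awaitTermination()
      }
    }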
Confidential
DATA ENGINEER-BIG DATA DEVELOPER
Responsibilities:
- Worked with the BI team on Big Data Hadoop cluster implementation and data integration, developing large-scale system software.
- Processed incoming files using the Spark native API.
- Used the Spark Streaming and Spark SQL APIs to process the files.
- Developed Spark scripts using Scala shell commands as per the requirements.
- Processed schema-oriented and non-schema-oriented data using Scala and Spark.
- Developed a Flume ETL job to handle data from an HTTP source with HDFS as the sink.
- Collected JSON data from the HTTP source and developed Spark APIs to perform inserts and updates in Hive tables.
- Created Hive tables and involved in data loading and writing Hive UDFs.
- Developed Spark scripts to import large files from Amazon S3 buckets.
- Developed Spark core and Spark SQL scripts using Scala for faster data processing.
- Developed a Kafka consumer API in Scala for consuming data from Kafka topics (see the sketch after this role).
- Developed Spark jobs using Scala in the test environment for faster real-time analytics and used Spark SQL for querying.
- Designed and developed a system to collect data from multiple portals using Kafka and process it using Spark.
- Designed and developed an automated process for data movement and purging using shell scripting.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation, queries, and writing data back into the OLTP system through Sqoop.
- Handled large datasets during the ingestion process itself using partitioning, Spark in-memory capabilities, broadcast variables, and effective, efficient joins and transformations.
- Developed Spark code using Python and Spark SQL/Streaming for faster data processing.
Environment: HDFS, Scala, Spark, Cloudera Manager, Sqoop, PL/SQL, MySQL, Windows, HBase.
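A minimal sketch of the kind of Kafka consumer described in this role, written in Scala against the plain Kafka clients API; the broker address, consumer group, and topic name are illustrative assumptions:

    import java.time.Duration
    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}

    object KafkaConsumerSketch {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "portal-ingestion")
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringDeserializer")
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringDeserializer")
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

        val consumer = new KafkaConsumer[String, String](props)
        consumer.subscribe(Collections.singletonList("portal-events"))

        try {
          // Poll in a loop and hand each record to downstream processing.
          while (true) {
            val records = consumer.poll(Duration.ofMillis(500))
            records.forEach { record =>
              println(s"offset=${record.offset()} key=${record.key()} value=${record.value()}")
            }
          }
        } finally {
          consumer.close()
        }
      }
    }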
Confidential
SOFTWARE ENGINEER - JAVA DEVELOPER
Responsibilities:
- Involved in analysis and design phase of Software Development Life cycle (SDLC).
- Used JMS to pass messages as payload to track statuses, milestones and states in the workflows.
- Involved in reading and generating PDF documents using iText, and merging PDFs dynamically.
- Involved in the software development life cycle: coding, testing, and implementation.
- Worked in the healthcare domain.
- Used Java Message Service (JMS) for loosely coupled, reliable, and asynchronous exchange of patient treatment information among J2EE components and legacy systems.
- Developed MDBs using JMS to exchange messages between different applications using MQ Series.
- Involved in working with J2EE Design patterns (Singleton, Factory, DAO, and Business Delegate) and Model View Controller Architecture with JSF and Spring DI.
- Involved in Content Management using XML.
- Developed a standalone module to transform 837 XML files into the database using a SAX parser.
- Installed, Configured and administered WebSphere ESB v6.x
- Worked on Performance tuning of WebSphere ESB in different environments on different platforms.
- Configured and Implemented web services specifications in collaboration with offshore team.
- Involved in creating dashboard charts (business charts) using FusionCharts.
- Involved in creating reports for most of the business criteria.
- Involved in configuring WebLogic servers, data sources (DSs), JMS queues, and deployments.
- Involved in creating queues, MDBs, and workers to accommodate the messaging needed to track the workflows.
- Created Hibernate mapping files, sessions, transactions, and Query/Criteria objects to fetch data from the database.
- Enhanced the design of an application by utilizing SOA.
- Generated unit test cases with the help of internal tools.
- Used JNDI for connection pooling.
- Developed ANT scripts to build and deploy projects onto the application server.
- Involved in implementing CruiseControl as the continuous build tool using Ant.
- Used StarTeam for version control.
Environment: Java multithreading, JDBC, Hibernate, Struts, Collections, Maven, Subversion, JUnit, SQL, JSP, SOAP, Servlets, Spring, Oracle, XML, PuTTY and Eclipse.