Data Engineer Resume
Austin, TX
SUMMARY:
- 5 years of professional experience in the IT industry, including Hadoop ecosystem implementation, maintenance, and Big Data analysis operations.
- Experience with Big Data infrastructure including the Hadoop framework, Spark, Python, HDFS, MapReduce, Hive, Pig, YARN, HBase, Oozie, ZooKeeper, Flume, and Sqoop.
- Working experience with the Cloudera Data Platform using VMware Player in a CentOS 6 Linux environment; strong experience with Hadoop distributions such as Cloudera and Hortonworks.
- Experience developing Spark jobs in Scala and Python for faster, near-real-time analytics, using Spark SQL for querying.
- Experience ingesting streaming data into Hadoop clusters using Kafka.
- Experience with data file formats such as Text, Avro, Parquet, RC, and ORC, and compression codecs such as Snappy and bzip2.
- Extended Hive core functionality with custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs), and User Defined Aggregating Functions (UDAFs).
- Experienced in using ZooKeeper and Oozie operational services for coordinating clusters and scheduling workflows.
- Experience with HBase, a column-oriented NoSQL database, and its integration with Hadoop clusters.
- Migrated data from Oracle and MS SQL Server into HDFS using Sqoop and imported flat files of various formats into HDFS; imported data from Teradata using Sqoop jobs, with Sqoop for full loads and Kafka for incremental loads.
- Experience with RDBMSs such as SQL Server, MySQL, and Oracle, as well as data warehouses.
- Strong knowledge of core Spark components including RDDs, the DataFrame and Dataset APIs, DStreams, in-memory processing, DAG scheduling, data partitioning, and tuning.
- Performed optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Expertise in developing Spark applications with the Spark Core, Spark SQL, and Spark Streaming APIs in Scala and Python, deployed on YARN in both client and cluster mode; used the Spark DataFrame API to move data from HDFS to AWS S3 (a sketch of this pattern follows this summary).
- Involved in converting HBase/Hive/SQL queries into Spark transformations using RDDs and Scala.
- Hands-on experience with version control tools such as SVN, Bitbucket, and GitLab.
- Experience with SDLC models such as Agile Scrum and Waterfall under CMMI guidelines.
- Determined, committed and hardworking individual with strong communication, interpersonal and organizational skills.
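A minimal, illustrative PySpark sketch of the HDFS-to-S3 pattern referenced above: read a dataset from HDFS, query it with Spark SQL, and write the result to S3 as Parquet. The paths, bucket, view, and column names are hypothetical placeholders, not details of any specific engagement.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs_to_s3_example").getOrCreate()

# Read a source dataset from HDFS (ORC assumed here; Avro/Parquet work the same way).
orders = spark.read.orc("hdfs:///data/landing/orders")
orders.createOrReplaceTempView("orders")

# Query the registered view with Spark SQL.
daily_totals = spark.sql("""
    SELECT order_date, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_date
""")

# Write the aggregated result to S3 as Snappy-compressed Parquet.
(daily_totals.write
    .mode("overwrite")
    .option("compression", "snappy")
    .parquet("s3a://example-bucket/curated/daily_totals/"))
```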
TECHNICAL SKILLS:
Hadoop: HDFS, MapReduce, Yarn, Spark, Kafka, PIG, HIVE, Sqoop, Flume, Oozie, Impala, HBase, Hue, Zookeeper
Programming Languages: C, Java, Python, Scala, SQL, HiveQL
Scripting and Markup: JavaScript, jQuery, REST, JSON, Pig Latin
Web/Application Server: IIS 7.0/6.0/5.0
IDE: Eclipse, Visual Studio 2005/2008/2012/2015/2017
Frameworks: Struts, Hibernate, Spring MVC
Databases: Oracle 11g/12c, MySQL, MS SQL Server 2016/2014/2012/2008/2005, Teradata, IBM DB2
Distributed Platform: Cloudera, Hortonworks, MapR
NOSQL Technologies: Cassandra, HBase, MongoDB
Operating systems: Windows 7/Vista/XP & Windows Server 2008/2005/2000 & Unix
Methodologies: Agile, UML, Waterfall
Design Patterns: MVVM, MVC 4/5, Dependency Injection
Version Control: CVS, GIT, TFS
PROFESSIONAL EXPERIENCE
Confidential -Austin, TX
Data Engineer
Responsibilities:
- Collaborated with business stakeholders to understand requirements and develop big data solutions that meet business needs.
- Worked on establishing processes to receive, ingest, standardize, transform, and operationalize data across the various Data Lake zones.
- Worked on recreating OBIEE dashboards and ad-hoc reporting capability, sourcing from the Oracle EDW; dashboards identified for recreation included Financials, EIS, Manager, and Supply Chain.
- Used StreamSets to ingest data from the Oracle EDW into HDFS as flat files.
- Created StreamSets jobs that assign table names and add headers and user-defined fields, which are later used to identify unique records.
- Built utilities, user-defined functions, and frameworks to better enable data flow patterns.
- Created PySpark scripts to read data from landing zones and move it downstream with transformations and aggregations per requirements (a sketch of this flow follows the Environment line below).
- Worked with multiple data formats on HDFS; converted Hive/SQL queries into Spark transformations using Spark DataFrames/RDDs in Python.
- Designed 3NF data models for ODS and OLTP systems and dimensional data models for OLAP using star and snowflake schemas.
- Created a data quality zone (DQZ) using PySpark for data from each input stream to track the quality of data flowing through each level, with notifications triggered on any discrepancy.
- Involved in all phases of data collection, data cleaning, model development, validation, and visualization.
- Involved in performance tuning of jobs, identifying and resolving performance issues.
- Developed implementation plans for code deployment and supported production deployments.
- Scheduled and monitored jobs using Tidal.
Environment: Cloudera Hadoop, StreamSets, HDFS, Hive, Shell Scripting, Spark 2.2, Linux (CentOS), Eclipse, Git.
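The sketch below illustrates the landing-zone-to-downstream PySpark flow with a simple data-quality check referenced above. The paths, schema, delimiter, threshold, and notification hook are hypothetical placeholders, not the actual project code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("landing_to_curated").getOrCreate()

# Read delimited flat files dropped into the landing zone (e.g. by StreamSets).
raw = (spark.read
       .option("header", "true")
       .option("delimiter", "|")
       .csv("hdfs:///datalake/landing/finance/"))

# Standardize and filter, then aggregate per requirements.
valid = (raw
         .withColumn("amount", F.col("amount").cast("double"))
         .filter(F.col("amount").isNotNull()))
curated = valid.groupBy("account_id").agg(F.sum("amount").alias("total_amount"))

# Simple quality check: alert if too many rows fail validation.
raw_count, valid_count = raw.count(), valid.count()
if raw_count > 0 and (raw_count - valid_count) / raw_count > 0.05:
    # Placeholder for the notification hook (e.g. email or ticket creation).
    print(f"DQ alert: {raw_count - valid_count} of {raw_count} rows failed validation")

curated.write.mode("overwrite").parquet("hdfs:///datalake/curated/finance/")
```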
Confidential -Dallas, TX
Spark Developer
Responsibilities:
- Involved in the analysis and architecture phases of the project; contributed to the overall architecture, frameworks, and patterns for processing and storing large data volumes.
- Worked with big data tools such as Hive, HDFS, Impala, and Spark 2.2; used the RDD, DataFrame, and SQL APIs in Spark 2.2.0.
- Developed Python scripts and UDFs using both the DataFrame/SQL and RDD APIs in Spark 2.2.0 for data aggregation and queries, writing data back to RDBMSs through Sqoop (a sketch of this pattern follows the Environment line below).
- Improved the performance of existing Hadoop algorithms using Spark Context, Spark SQL, and Spark on YARN with Scala.
- Handled importing data from various sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from Oracle into HDFS using Sqoop.
- Optimized existing queries for better performance using bucketing and partitioning.
- Used shell and Python scripting to orchestrate jobs.
Environment: Cloudera Hadoop, HDFS, Hive, Sqoop, Shell Scripting, Spark, AWS EMR, Linux (CentOS), MapReduce, Scala, Eclipse, SBT.
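A minimal sketch of the aggregation-plus-export pattern above: a Spark 2.x job applies a Python UDF, aggregates with the DataFrame API, and stages the result on HDFS for a Sqoop export. The table, column, connection string, and paths are hypothetical placeholders.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("aggregate_and_export")
         .enableHiveSupport()
         .getOrCreate())

# Example Python UDF: normalize a region code before grouping.
normalize_region = F.udf(lambda r: (r or "UNKNOWN").strip().upper(), StringType())

sales = spark.table("warehouse.sales")  # Hive table assumed to exist
summary = (sales
           .withColumn("region", normalize_region(F.col("region")))
           .groupBy("region")
           .agg(F.count("*").alias("orders"),
                F.sum("amount").alias("revenue")))

# Stage as delimited text so Sqoop can export it to the target RDBMS, e.g.:
#   sqoop export --connect jdbc:oracle:thin:@//db-host:1521/ORCL \
#     --table SALES_SUMMARY --export-dir /staging/sales_summary \
#     --input-fields-terminated-by ','
summary.write.mode("overwrite").option("sep", ",").csv("/staging/sales_summary")
```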
Confidential -Minneapolis, MN
Spark Scala Developer
Responsibilities:
- Worked closely with source system analysts and architects to identify attributes and convert business requirements into technical requirements.
- Experienced in the design and deployment of Hadoop clusters and various Big Data analytics tools, including Pig, Hive, HBase, Oozie, Sqoop, Kafka, Spark, and Impala.
- Implemented POCs to migrate iterative MapReduce programs to Spark transformations using Scala.
- Hands-on experience with AWS (Amazon Web Services), using Elastic MapReduce (EMR) and creating S3 buckets to store data.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
- Optimized MapReduce jobs to use HDFS efficiently by applying gzip, LZO, and Snappy compression.
- Experience configuring, designing, implementing, and monitoring Kafka clusters and connectors.
- Worked on physical transformations of the data model, creating tables, indexes, joins, views, and partitions.
- Created near-real-time data streaming solutions using Spark Streaming and Kafka, persisting the data in Cassandra (a sketch of this pipeline follows the Environment line below).
- Involved in configuring and developing Kafka producers, consumers, topics, and brokers using Java.
- Implemented CRUD operations using CQL on Cassandra; used the Spark-Cassandra connector to load data to and from Cassandra.
Environment: Cloudera Hadoop, HDFS, Pig, Hive, Flume, Sqoop, Shell Scripting, Spark, AWS EMR, Linux (CentOS), Kafka, HBase, MapReduce, Scala, Eclipse, SBT.
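An illustrative sketch of the Kafka-to-Cassandra streaming pipeline referenced above, written here in PySpark with the DStream API and the Spark-Cassandra connector (the project code was in Scala). The topic, keyspace, table, and broker addresses are hypothetical, and the Kafka integration and Cassandra connector packages must be supplied to spark-submit.

```python
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

spark = SparkSession.builder.appName("kafka_to_cassandra").getOrCreate()
ssc = StreamingContext(spark.sparkContext, batchDuration=10)

# Direct DStream over the Kafka topic (hypothetical topic and broker list).
stream = KafkaUtils.createDirectStream(
    ssc, topics=["events"],
    kafkaParams={"metadata.broker.list": "broker1:9092"})

def save_batch(rdd):
    # Parse each message value as JSON and append the micro-batch to Cassandra.
    if not rdd.isEmpty():
        df = spark.read.json(rdd.map(lambda kv: kv[1]))
        (df.write
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace="analytics", table="events")
           .mode("append")
           .save())

stream.foreachRDD(save_batch)
ssc.start()
ssc.awaitTermination()
```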
Confidential, New York City, NY
Hadoop Developer
Responsibilities:
- Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and buckets.
- Developed custom InputFormats in MapReduce jobs to handle custom file formats and convert them into key-value pairs.
- Involved in creating Hive tables, loading data, writing Hive queries, and creating partitions and buckets for optimization; created Hive external tables using the Accumulo connector.
- Implemented business logic by writing UDFs in Java and used various UDFs from Piggybank and other sources.
- Worked with MongoDB database concepts such as locking, transactions, indexes, sharding, replication, and schema design.
- Worked with HDFS file formats such as Avro and SequenceFile and compression formats such as Snappy and bzip2.
- Developed Oozie workflows to orchestrate a series of data-cleansing scripts, such as removing personal information or merging many small files into a handful of large, compressed files, using Pig pipelines in the data preparation stage.
- Implemented Spark RDD transformations to express business analysis logic and applied actions on top of the transformations (a sketch follows the Environment line below).
- Used ZooKeeper for centralized configuration, SVN for version control, Maven for project management, Jira for internal bug/defect management, and MapReduce for batch processing.
- Implemented helper classes that access HBase directly from Java using the Java API to perform CRUD operations.
Environment: Hadoop Framework, MapReduce, Hive, Sqoop, Pig, HBase, Cassandra, Apache Kafka, Storm, Flume, Oozie, Maven, Jenkins, Java (JDK 1.6), UNIX Shell Scripting, Oracle 11g/12c.
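A minimal sketch of expressing a business rule as Spark RDD transformations followed by an action, as referenced above; the input path and record layout are hypothetical placeholders.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd_transformations_example")

# Each input line: customer_id|state|order_amount
lines = sc.textFile("hdfs:///data/orders/part-*")

order_totals = (lines
    .map(lambda line: line.split("|"))        # transformation: parse fields
    .filter(lambda f: len(f) == 3 and f[2])   # transformation: drop bad records
    .map(lambda f: (f[1], float(f[2])))       # transformation: (state, amount)
    .reduceByKey(lambda a, b: a + b))          # transformation: sum per state

# Action: materialize the result on the driver.
for state, total in order_totals.collect():
    print(state, total)
```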
Confidential
Hadoop Developer
Responsibilities:
- Exported the analyzed data from HDFS to relational databases (MySQL, Oracle) using Sqoop.
- Along with the infrastructure team, involved in designing and developing a Kafka- and Storm-based data pipeline.
- Implemented a messaging system for different data sources using Apache Kafka and configured high-level consumers for online and offline processing (a sketch of a consumer follows the Environment line below).
- Prepared ETL standards and naming conventions and wrote ETL flow documentation for the Stage, ODS, and Mart layers.
- Used Hive join queries to join multiple tables of a source system and load the results into Elasticsearch.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs that run independently based on time and data availability.
- Developed MapReduce jobs in Java for data cleansing and preprocessing; implemented Combiners and Partitioners to optimize MapReduce algorithms and improve application performance.
- Expertise in writing Hadoop jobs to analyze data using HiveQL (queries), Pig Latin (data flow language), and custom MapReduce programs in Java; used HCatalog to access Hive table metadata from MapReduce.
- Worked in an Agile development environment using the Kanban methodology; actively involved in daily scrums and other design-related meetings.
Environment: Apache Hadoop 0.20.203, Cloudera Manager (CDH3), HDFS, Java MapReduce, Eclipse, Hive, Pig, Sqoop, SQL, Oracle 11g, YARN, Kafka.
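An illustrative sketch of a consumer-group ("high-level") Kafka consumer, written here with the kafka-python client for consistency with the other examples (the project consumers were built on Apache Kafka's own client APIs). The topic, group, and broker names are hypothetical placeholders.

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clickstream",                          # topic
    bootstrap_servers=["broker1:9092"],
    group_id="online-processing",           # consumer group for coordinated consumption
    auto_offset_reset="earliest",
    enable_auto_commit=True,
    value_deserializer=lambda v: v.decode("utf-8"))

for message in consumer:
    # Hand each record to the online-processing path; offline processing would
    # read the same topic under a different group_id.
    print(message.topic, message.partition, message.offset, message.value)
```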
Confidential
Java J2EE Developer
Responsibilities:
- Worked on the complete software development life cycle, including gathering new requirements, redesigning and implementing business-specific functionality, testing, and assisting in project deployments.
- Worked on front-end development and enhancements of the application using AngularJS, Bootstrap, HTML, CSS, JavaScript, JavaBeans, and jQuery.
- Developed the application using the MVC (Model-View-Controller) architecture with Spring MVC and Hibernate, with Oracle as the back end; experience with data sources such as Oracle, IBM DB2, and SQL Server.
- Responsible for the overall layout design and color scheme of the web site using HTML, XHTML, and CSS3.
- Used core Java concepts such as Collections, Generics, exception handling, I/O, and concurrency to develop business logic.
- Involved in analyzing application performance, gathered thread dumps, and tuned the application using JProfiler.
Environment: Java 1.6/J2EE, HTML, DHTML, JavaScript, AJAX, jQuery, Servlets, JSP, JSON, Oracle WebLogic Application Server 10.3, JAXB, WSDL, SOAP, Spring 3.2, MVC, IoC, AOP, Hibernate 3.5, JAX-RS, CXF, JMS, RAD 8.0, JUnit, SVN, SoapUI, JNDI, Oracle, Apache Axis, Maven, JProfiler, etc.