Data Engineer Resume

San Jose, CA

SUMMARY:

  • 7+ years of professional experience in the development and testing of Extraction, Transformation and Loading (ETL) of data using Apache Hive, HBase, IBM Big SQL, Java MapR, ZooKeeper, Apache Pig, IBM Big Integrate, Spark SQL, Spark (Python/Scala) and Informatica PowerCenter.
  • Worked on multiple platforms, databases and tools including Hadoop BigInsights 2.7/4.2, JSqsh, Perl, Ruby/Cucumber, Teradata, Oracle, Netezza and SQL Server.
  • In-depth knowledge of Hadoop architecture and cluster setup, along with programming in MR, Hive, Pig, Spark SQL, Spark (Scala/Python), BigSQL, HBase, Sqoop, Infowork, ZooKeeper, YARN and Knox; working knowledge of Hadoop-related infrastructure.
  • In-depth knowledge of UDAFs, Hive SerDes such as Regex and COBOL, and storage formats such as Parquet, ORC and Avro.
  • In-depth knowledge of data ingestion tools on Hadoop such as Sqoop, Flume and Kafka. Extensively used Sqoop import/export over REST services and the Knox gateway.
  • In-depth knowledge of REST APIs; extensively used applications such as WebHDFS and WebHCat/Templeton. In-depth knowledge of YARN job submission.
  • Hands-on experience with Hadoop GUI tools such as IBM Big Integrate and Informatica PowerCenter Big Data Edition.
  • Developed a Java Script wrapped in a Perl module to connect to ZooKeeper.
  • Performed POCs on Spark, Hive on Spark, Spark SQL, and Spark with Python/Scala.
  • Developed a synchronization process between prod and DR systems and migrated multi-TB HDFS data across clusters using DistCp.
  • Troubleshot performance issues with YARN, MapReduce, Hive and HBase.
  • Involved in architectural decisions to build a data lake and defined big data ingest strategies, including simple, automated Sqoop scripts using REST APIs and Pig transformations.
  • In-depth knowledge of developing Ruby automation scripts to test data on HDFS and Hive.
  • Worked on a telematics solution that ingested vendor files into HDFS, transformed them with Hive scripts and compacted the data in Hive; the summarized data was loaded into HBase tables.

TECHNICAL SKILLS:

  • Big Data, HADOOP & Advanced Analytics
  • ETL and DWH
  • Insurance Claims Analytics
  • Banking Fraud Analytics
  • Project and Client Management
  • Data Analytics
  • Predictive Analysis
  • Hadoop and Map Reduce
  • C
  • C++
  • Java (Initial Level)
  • Python
  • SQL, DB2, PIG, HIVE, SQOOP, KAFKA, FLUME
  • SAS
  • Tableau

PROFESSIONAL EXPERIENCE:

Confidential, San Jose, CA

Data Engineer

Responsibilities:

  • Involved in creating a data lake by extracting customer data from various sources, including Excel files, databases and server logs, into HDFS.
  • Developed Sqoop jobs to import data in Avro file format from RDBMS to HDFS and created Hive tables on top of it.
  • Preprocessed the log data using Hive UDFs and loaded the data into Hive tables.
  • Performed inter/intra-cluster copying using distributed copy (DistCp).
  • Developed Spark jobs performing transformations and actions on RDDs using Scala.
  • Implemented Spark SQL to access Hive tables from Spark for faster data processing.
  • Worked on converting queries written in Hive into Spark applications.
  • Improved the performance of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
  • Worked with various HDFS file formats like Avro, Sequence File, Parquet and various compression formats like Snappy, bzip2.
  • Involved in performance tuning of Spark Jobs using Cache and Persist.
  • Configured EC2 servers for Auto Scaling and Elastic Load Balancing, and used AWS services such as EC2 and S3 for small data sets.
  • Created and managed policies for S3 buckets and used Glacier as low-cost storage for data archiving and long-term backup.
  • Good understanding of NoSQL databases such as HBase and MongoDB.
  • Good understanding of Kafka architecture, i.e., topics, consumers, producers, brokers, partitions and clusters.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
  • Used Spark Streaming to collect data from Kafka in near real time, perform the necessary transformations and aggregations on the fly to build the common learner data model, and persist the data in a NoSQL store (HBase).
  • Worked with NoSQL databases like HBase to create tables and store the data.
  • Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
  • Worked with both the distributed Hadoop and Teradata teams to build marketing experimentation reports based on user behaviour insights for Confidential’s Data Warehouse business models.
  • Created SQL queries on seller- and buyer-level data to report behaviour insights for monthly and yearly analysis.
  • Worked on Hive to load data from Teradata using Sqoop and to build daily incremental live analytic reports over items at the data warehouse level.
  • Created Infowork templates.
  • Created batch processes using FastLoad, FastExport, MLoad, Unix shell and Teradata SQL to transfer, clean up and summarize data.
  • Worked with the data modeling team to create Hive tables that improve the performance of the queries run on them.

Environment: HDFS, Map Reduce, Sqoop, Oozie, Pig, Hive, Flume, LINUX, Java, Eclipse, PL/SQL, UNIX Shell Scripting, Python, Scala, Spark, Teradata, Alation, Slack.

Confidential, Charlotte, NC

Application Developer

Responsibilities:

  • Actively involved in interaction with business partners and technology personnel in order to understand requirements and business road map.
  • Involved in creating a data lake by extracting customer data from various sources, including Excel files, databases and server logs, into HDFS.
  • Developed Sqoop jobs to import data in Avro file format from RDBMS to HDFS and created Hive tables on top of it.
  • Preprocessed the log data using Hive UDFs and loaded the data into Hive tables.
  • Performed inter/intra-cluster copying using distributed copy (DistCp).
  • Developed Spark jobs performing transformations and actions on RDDs using Scala.
  • Implemented Spark SQL to access Hive tables from Spark for faster data processing.
  • Worked on converting queries written in Hive into Spark applications.
  • Improved the performance of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
  • Worked with various HDFS file formats like Avro, Sequence File, Parquet and various compression formats like Snappy, bzip2.
  • Involved in performance tuning of Spark Jobs using Cache and Persist.
  • Configured EC2 servers for Auto Scaling and Elastic Load Balancing, and used AWS services such as EC2 and S3 for small data sets.
  • Created and managed policies for S3 buckets and used Glacier as low-cost storage for data archiving and long-term backup.
  • Good understanding of NoSQL databases such as HBase and MongoDB.
  • Good understanding of Kafka architecture, i.e., topics, consumers, producers, brokers, partitions and clusters.
  • Performed ad hoc statistical, data mining and machine learning analysis; designed and developed advanced predictive analysis models using Python.
  • Performed goodness-of-fit tests for various distributions to identify the distribution that best models the claims data.

Environment: HDFS, Map Reduce, Sqoop, Oozie, Pig, Hive, Flume, LINUX, Java, Eclipse, PL/SQL, UNIX Shell Scripting, Python, Scala.

Confidential, Irving, TX

Hadoop Developer

Responsibilities:

  • Involved in loading and transforming large sets of Structured and Semi-Structured data and analyzed them by running Hive queries and Pig scripts.
  • Extracted the data from MySQL into HDFS using Sqoop.
  • Created and ran Sqoop jobs with incremental load to populate Hive external tables.
  • Used Flume to collect and aggregate the log data from web servers and store it in HDFS.
  • Designed and developed Pig scripts to parse raw data from several data sources into baseline data.
  • Developed Pig Latin scripts to extract data from the web server output files into the required schema and store it in HBase for faster access.
  • Developed Hive scripts for end user / analyst requirements to perform ad hoc analysis.
  • Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
  • Solved performance issues in Hive and Pig scripts through an understanding of joins, grouping and aggregation.
  • Worked on the Hue interface for querying the data in the Hive and Pig editors.
  • Experience in using Sequence File, Avro, Text File and Parquet formats.
  • Developed an Oozie workflow for scheduling and orchestrating the ETL process.
  • Involved in Hadoop Cluster capacity planning of Data Nodes and Name Nodes.
  • Exported the processed data from HDFS to RDBMS using Sqoop export.
  • Involved in Agile methodologies, daily Scrum meetings and Sprint planning.

Environment: Hadoop, HDFS, Pig, Hive, MapReduce, Sqoop, LINUX, and Big Data

Confidential

Hadoop Developer

Responsibilities:

  • Designed and Developed data migration from legacy systems to Hadoop environment.
  • Imported the data from RDBMS to HDFS and Hive, and performed incremental imports using Sqoop jobs in various formats such as Avro, Text and Parquet.
  • Performed various Sqoop operations such as eval, import, export and job.
  • Imported log data from FTP servers using SFTP protocol into HDFS.
  • Created external and managed tables in Hive, loaded the data into the tables and ran Hive queries, which execute internally as MapReduce jobs.
  • Pre-processed log data in Pig-Latin by parsing using regular expressions.
  • Involved in processing XML and JSON file formats, partitioned the data and implemented bucketing in Hive for performance optimization.
  • Used the JSON, XML and Avro SerDes packaged with Hive for serialization and deserialization to parse the contents.
  • Loaded data with complex data types such as Maps, Arrays and Structs into Hive tables.
  • Experience in building Pig Latin scripts to extract, transform and load data onto HDFS.
  • Experience in writing custom UDFs for Hive and Pig to extend their functionality.
  • Developed a workflow in Oozie to automate the task of loading the data into HDFS using Sqoop and processing it with Hive.
  • Exported the analyzed data to relational databases using Sqoop so that the BI team could generate reports and visualizations.
  • Designed and developed ad-hoc dashboards for business leaders/product owners.
  • Used Cloudera Manager to pull metrics on various cluster features such as JVM usage and running map and reduce tasks.

Environment: Java, HTML, Java Script, CSS, Oracle, JDBC, Swing and Eclipse.
