
Data Engineer Resume


Austin, TX

SUMMARY

  • 6 years of experience across Big Data domains, including the Hadoop ecosystem, Spark Streaming, and Java.
  • Expertise in handling terabytes of structured and unstructured data in large cluster environments.
  • Strong development skills in Java 7, Python 3.5.0, Scala 2.11.12, and SQL/NoSQL.
  • Two years of hands-on Big Data experience with the Cloudera 4 and Hortonworks 2.6.5 distributions and the Hadoop environment, including Hadoop 2.8.3, Hive 1.2.2, Sqoop 1.4.7, Flume 1.5.0.1, HBase 2.0.0, Apache Spark 2.2.1, and Kafka 1.3.2.
  • Comfortable with installation and configuration of Hadoop Ecosystem Components.
  • Well acquainted with HDFS 2, YARN and MapReduce programming paradigms
  • Experienced in implementing data handling and storage on HDFS 2
  • Experienced in creating Hive 1.2.2 tables, loading data, running basic and advanced queries, and partitioning and bucketing Hive-stored data
  • Experienced in querying databases using HiveQL and CQL
  • Extensive practical experience importing and exporting data between HDFS 2 and relational database systems using Sqoop 1.4.7, and ingesting data into HDFS with Flume 1.5.0.1.
  • Experience migrating MapReduce programs to Spark RDD transformations for improved performance
  • Well versed in using Scala 2.11.12 for performing Spark operations
  • Proficient in the Spark ecosystem, using Spark SQL with Scala 2.11.12 to query data in formats such as text, Avro, and Parquet (see the Spark SQL sketch after this list)
  • Experience working with NoSQL databases like Cassandra 2.2, HBase 2.0.0
  • Strong experience working with Hadoop and Spark platforms such as Hortonworks 2.6.5 and Databricks 2.4.2
  • Well versed in working with Hadoop 2.8.3 in standalone, pseudo distributed and distributed modes
  • Experience in working with Amazon Web Services using EC2 for computations and S3 as a storage mechanism
  • Experience working with Machine Learning, Data Science and Advanced analytics
  • Strong knowledge of and experience with the design and analysis of ML/data science algorithms such as Classification, Association Rules, Clustering, and Regression
  • Neural network libraries: TensorFlow r1.8.0, Keras 2.2.1
  • Descriptive, Predictive and Prescriptive analytics, Machine Learning (ML), Deep Learning (DL), Natural Language Processing (NLP), Text Analytics, Data Mining, Unstructured Data Parsing and Sentiment Analysis
  • Implemented MLlib algorithms and tested different models using the Spark machine learning APIs
  • Experience working with R 3.5.1
  • Comprehensive knowledge of Core Java Concepts and Collections Framework, Object Oriented Design and Exception Handling
  • Good working knowledge of Eclipse IDE 4.7 for developing and debugging Java applications
  • Hands-on working experience with Brain Computer Interfaces - Emotiv technology for Data Mining on Brainwaves
  • Solid programming experience with Python 3.5.0, R 3.5.1, Java 7, Scala 2.11.12, C# 7.0, HTML 5, CSS 3, JavaScript 6
  • Hands-on experience with Python libraries like Matplotlib 2.2.2, NumPy, SciPy, Pandas
  • Comfortable with Agile, Waterfall, and Scrum software development methodologies
  • Experience working with source and version control systems like Git 2.12, GitHub
  • Experience in testing applications using JUnit 4.12
  • Experience in identifying actors and use cases and creating UML and E-R diagrams.
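
A minimal Spark SQL sketch in Scala (Spark 2.2 API), illustrating the Parquet querying mentioned above; the HDFS path and the event_type column are illustrative assumptions rather than details from any project listed here:

```scala
// Minimal sketch: query a Parquet dataset with Spark SQL from Scala (Spark 2.2 API).
// The HDFS path and the event_type column are illustrative assumptions.
import org.apache.spark.sql.SparkSession

object ParquetQueryExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ParquetQueryExample")
      .getOrCreate()

    // Read the Parquet data into a DataFrame and expose it to SQL.
    val events = spark.read.parquet("hdfs:///data/events.parquet")
    events.createOrReplaceTempView("events")

    // Count events per type with a plain Spark SQL query.
    spark.sql(
      "SELECT event_type, COUNT(*) AS cnt FROM events GROUP BY event_type ORDER BY cnt DESC"
    ).show(20)

    spark.stop()
  }
}
```

The same read/view/query pattern applies to text and Avro sources by swapping the reader (for example spark.read.text, or an Avro data source package).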

TECHNICAL SKILLS

Programming Languages: Java 7, Python 3.5.0, R 3.5.1, Scala 2.11.12, C

Operating Systems: Windows 7, 8.1, 10; Linux; xv6 - Unix; Android

Web Programming & Scripting Languages: HTML 5, CSS 3, JavaScript 6

Databases: MySQL 5.0, SQL Server 2015

Software: Eclipse IDE 4.7, PuTTY, VMware, VirtualBox, Microsoft Excel 2015, Microsoft Access 2015, Microsoft Word 2015, Visual Studio 2015, NetBeans IDE 8.2, MATLAB 2015, RStudio 1.1.456, Anaconda 5.1.0, RapidMiner 7.2, Knime 3.5, PyCharm 2.0, Emotiv, WEKA

Big Data Platforms: Cloudera 4, Hortonworks 2.6.5, Amazon AWS (EC2, S3), Databricks 2.4.2

Methodologies: Agile, Waterfall

SCM Tools: Git 2.12, GitHub

NoSQL Databases: HBase 2.0.0, Cassandra 2.2

Frameworks: Hadoop 2.8.3, Apache Spark 2.2.1

Hadoop Ecosystem: HDFS 2, MapReduce, Hive 1.2.2, Pig 0.13

ETL: Sqoop 1.4.7, Flume 1.5.0.1, Kafka 1.3.2

Scheduling: Oozie 4.2.0

PROFESSIONAL EXPERIENCE

Confidential - Austin, TX

Data Engineer

Responsibilities:

  • Used Hadoop to perform data cleaning, pre-processing, and flattening.
  • Worked in a multi-cluster Hadoop ecosystem environment for implementation and a pseudo-distributed environment for testing.
  • Developed a data pipeline using Flume to load data from Confidential's website directly into HDFS.
  • Used Amazon S3 to store the clickstream data generated by the Confidential website.
  • Used SparkSQL for performing data filtering, attribute reduction and data sampling.
  • Handled large datasets during the ingestion process using partitioning, Spark in-memory capabilities, broadcast variables, efficient joins, and transformations.
  • Implemented Spark transformations for data wrangling.
  • Imported data from sources such as HDFS and HBase into Spark RDDs.
  • Created Spark RDD transformations and actions to implement business analysis.
  • Developed Spark programs in Scala to perform data transformations, create DataFrames, write Spark SQL queries, and implement Spark Streaming for windowed streaming applications (see the streaming sketch after this list).
  • Created Scala queries that helped market analysts spot emerging trends by comparing fresh data with tables and historical metrics.
  • Worked with Spark's MLlib component to apply classification, regression, and clustering techniques and to build predictive models based on a machine learning workflow (an MLlib pipeline sketch follows the Environment line for this role).
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS.
  • Worked on Reporting tools like Tableau to connect with SparkSQL for generating daily reports.
  • Used Databricks as the development environment for processing the stored data.
  • Followed Agile/Scrum methodology for project management and used Git for source code tracking.
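
A minimal sketch of the windowed streaming piece referenced above (DStream API, Spark 2.2), assuming Flume lands clickstream files in an HDFS directory and the page URL sits in the third tab-separated field; the paths, batch interval, and window sizes are illustrative, not the project's actual values:

```scala
// Minimal windowed Spark Streaming sketch (DStream API, Spark 2.2): watch the HDFS
// directory Flume writes clickstream files into and count hits per page over a
// sliding window. Paths and the log layout are illustrative assumptions.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ClickstreamWindowCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ClickstreamWindowCounts")
    val ssc = new StreamingContext(conf, Seconds(30))   // 30-second micro-batches
    ssc.checkpoint("hdfs:///checkpoints/clickstream")    // recommended with windowed DStreams

    // Every new file landed by Flume becomes part of the stream.
    val lines = ssc.textFileStream("hdfs:///flume/clickstream/")

    // Count page hits over a 10-minute window, sliding every minute.
    val pageHits = lines
      .map(_.split("\t"))
      .filter(_.length > 2)
      .map(fields => (fields(2), 1L))
      .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(600), Seconds(60))

    pageHits.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```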

Environment: Hadoop 2.8.3, HDFS 2, MapReduce, Hive 1.2.2, Kafka, Oozie 4.2.0, Amazon S3, Databricks 2.4.2, Tableau, and Apache Spark 2.2.1
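
A minimal sketch of the kind of MLlib workflow described in this section, using the spark.ml Pipeline API to assemble features, fit a classifier, and evaluate it; the table and column names (training_data, clicks, time_on_page, label) are illustrative assumptions:

```scala
// Minimal MLlib (spark.ml) sketch: feature assembly, classification, and evaluation.
// The input table and column names are illustrative assumptions.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object PredictiveModelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PredictiveModelSketch").getOrCreate()

    val data = spark.table("training_data")          // label plus raw feature columns
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

    // Combine raw columns into a single feature vector expected by the classifier.
    val assembler = new VectorAssembler()
      .setInputCols(Array("clicks", "time_on_page"))
      .setOutputCol("features")
    val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

    val model = new Pipeline().setStages(Array(assembler, lr)).fit(train)

    // Default metric for the binary evaluator is area under the ROC curve.
    val auc = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .evaluate(model.transform(test))
    println(s"Test AUC = $auc")

    spark.stop()
  }
}
```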

Confidential - Houston, TX

Data Engineer

Responsibilities:

  • Loaded and transformed terabytes of structured, semi-structured, and unstructured data.
  • Used Kafka for streaming data from the ground stations and air traffic control stations into HDFS.
  • Decoded Automatic Dependent Surveillance-Broadcast (ADS-B) data, which arrived in an encoded format, using the Compact Position Reporting (CPR) algorithm.
  • Wrote Hadoop jobs to analyze data using HiveQL (queries) and Pig Latin (data flow language).
  • Developed Pig scripts to process unstructured data and create structured data for using with Hive.
  • Created Hive managed and external tables and processed data using HiveQL.
  • Designed and implemented Hive queries and functions for evaluation, filtering, loading and storing of data.
  • Used Hive for aggregating the data obtained from FAA and ADS-B.
  • Continuously integrated data into the Hive data warehouse by creating tables, distributing data through partitioning and bucketing, and writing optimized HiveQL queries.
  • Designed Hive UDFs in Java to create customized filtering functions that retain only the relevant data (see the UDF sketch after this list).
  • Wrote Hive queries to convert the processed data into multiple file formats, including XML, JSON, and CSV.
  • Used compression codecs such as Snappy and Gzip for effective data compression.
  • Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
  • Exported the processed data to the traditional RDBMS using Sqoop for visualization and to generate reports for the BI team.
  • Used Agile methodology for project management and Git for source code control.
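
The UDFs on this project were written in Java; the sketch below shows the same pattern in Scala (the single example language used in this document), since a Hive simple UDF is just a JVM class with an evaluate method. The ADS-B column semantics and the altitude bounds are illustrative assumptions:

```scala
// Sketch of a Hive simple UDF (org.apache.hadoop.hive.ql.exec.UDF) that keeps only
// ADS-B records with a non-empty ICAO address and a plausible altitude.
// Column semantics and the altitude range are illustrative assumptions.
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.{BooleanWritable, IntWritable, Text}

class IsRelevantAdsbRecord extends UDF {
  def evaluate(icao: Text, altitudeFt: IntWritable): BooleanWritable = {
    if (icao == null || icao.toString.trim.isEmpty || altitudeFt == null)
      return new BooleanWritable(false)
    // Keep only records whose altitude falls in a physically plausible range.
    val alt = altitudeFt.get
    new BooleanWritable(alt >= 0 && alt <= 60000)
  }
}
```

After packaging the class into a JAR, it would be registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION and applied in a WHERE clause to drop irrelevant records.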

Environment: Hadoop 2.8.3, HDFS 2, Kafka 1.3.2, Oozie 4.2.0, Pig 0.13, Hive 1.2.2, Sqoop 1.4.7

Confidential

Fraud Analyst

Responsibilities:

  • Installed Hadoop (MapReduce, HDFS) and AWS components, and developed multiple Hive-based MapReduce jobs for data cleaning and pre-processing.
  • Implemented solutions for ingesting data from various sources and processing it using Big Data technologies such as Hive, Sqoop, HBase, and MapReduce.
  • Analyzed data on the Hadoop cluster using tools such as Hive, Sqoop, Spark, and Spark SQL.
  • Handled data import and export to and from HDFS, Hive using Sqoop.
  • Implemented data transfers between local/external file system and RDBMS to HDFS.
  • Improved data organization using techniques such as Hive partitioning and bucketing.
  • Used Hive queries to analyze the data.
  • Used the Spark context and Spark SQL to optimize data analysis (see the sketch after this list).
  • Used Spark RDDs to cache the data and perform in-memory computations on it.
  • Stored the data in Amazon S3.
  • Used Anaconda as the development environment for building the model.
  • Used WEKA for initial visualizations to identify outliers, and RapidMiner for analyzing and interpreting the data.
  • Used agile methodology throughout the project lifecycle.
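
A minimal sketch of the Spark SQL analysis pattern referenced above: load transaction data from S3 into a DataFrame, cache it for repeated in-memory queries, and flag suspicious activity with SQL. The bucket, schema, and the "many large transactions in one day" rule are illustrative assumptions, not the project's actual fraud logic:

```scala
// Minimal Spark SQL sketch: cache S3 transaction data and flag accounts with an
// unusual number of large transactions in a single day. Bucket, schema, and
// thresholds are illustrative assumptions.
import org.apache.spark.sql.SparkSession

object SuspiciousActivitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SuspiciousActivitySketch").getOrCreate()

    val txns = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3a://example-bucket/transactions/")
    txns.cache()                                  // keep the working set in memory
    txns.createOrReplaceTempView("txns")

    // Accounts with more than five transactions over 10,000 on the same day.
    val flagged = spark.sql(
      """SELECT account_id, to_date(txn_ts) AS txn_day, COUNT(*) AS big_txns
        |FROM txns
        |WHERE amount > 10000
        |GROUP BY account_id, to_date(txn_ts)
        |HAVING COUNT(*) > 5""".stripMargin)
    flagged.show(50)

    spark.stop()
  }
}
```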

Environment: Hadoop 2.8.3, Hive 1.2.2, Apache Spark 2.2.1, Amazon S3, Anaconda 5.1.0, WEKA, and RapidMiner 7.2

Confidential

Java Programmer-Big Data Developer

Responsibilities:

  • Involved in the complete project life cycle, from design discussions to production deployment.
  • Worked on major Hadoop ecosystem components, including Hive, HBase, Scala, Sqoop, and Flume.
  • Worked in a multi-cluster Hadoop ecosystem environment.
  • Used Oozie and Zookeeper for workflow scheduling and monitoring a cluster.
  • Implemented solutions for ingesting data from various sources and processing it using Big Data technologies such as Hive, Sqoop, HBase, and MapReduce.
  • Worked on Big Data Integration and Analytics based on Hadoop, Spark, Kafka.
  • Developed data pipeline using Flume, Sqoop and MapReduce to ingest data into HDFS for analysis.
  • Designed and developed a daily process to do incremental import of raw data from DB2 into Hive tables using Sqoop.
  • Extensively used HiveQL to query data in Hive tables and to load data into them.
  • Effectively used Sqoop to transfer data from databases (SQL, Oracle) to HDFS, Hive.
  • Uploaded click stream data from Kafka to HDFS, HBase and Hive by integrating with Storm.
  • Designed Hive external tables with dynamic partitioning and bucketing, using a shared metastore instead of Derby.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala (see the migration sketch after this list).
  • Worked extensively with Sqoop for importing and exporting the data from HDFS to Relational database systems/mainframe and vice-versa.
  • Loaded data into HDFS.
  • Enabled concurrent access to Hive tables by configuring shared/exclusive locks.
  • Implemented transformations using Scala and SQL for faster testing and processing of data.
  • Streamed data in real time using Kafka.
  • Used Oozie Operational Services for batch processing and scheduling workflows dynamically.
  • Worked on creating end-to-end data pipeline orchestration using Oozie.
  • Populated HDFS and Cassandra with massive amounts of data using Apache Kafka.
  • Developed Hive scripts and Unix shell scripts for all ETL loading processes and for converting files to Parquet in HDFS.
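
A minimal sketch of the MapReduce-to-Spark migration referenced above: a classic mapper/reducer count collapses into a few RDD transformations in Scala. The input path and record layout (comma-separated, key in the first field) are illustrative assumptions:

```scala
// Minimal sketch of migrating a MapReduce count job to Spark RDD transformations.
// Input path and record layout are illustrative assumptions.
import org.apache.spark.{SparkConf, SparkContext}

object CountByKeyMigration {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CountByKeyMigration"))

    // map() replaces the Mapper: emit (key, 1) per record.
    // reduceByKey() replaces the shuffle + Reducer: sum the counts per key.
    val counts = sc.textFile("hdfs:///data/raw/records/")
      .map(_.split(","))
      .filter(_.nonEmpty)
      .map(fields => (fields(0), 1L))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///data/out/record_counts")
    sc.stop()
  }
}
```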

Environment: HDFS 2, YARN, MapReduce, Hive 1.2.2, Impala, Oracle, Spark 2.2.1, Sqoop 1.4.7, Oozie 4.2.0, MySQL 5.0

Confidential

Java Developer

Responsibilities:

  • Worked in multi-cluster and pseudo-distributed Hadoop ecosystem environments.
  • Populated HDFS and HBase with massive amounts of data using Apache Kafka (see the producer sketch after this list).
  • Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
  • Supported code/design analysis, strategy development and project planning.
  • Created reports for the BI team, using Sqoop to export data from HDFS and Hive.
  • Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Debugged MapReduce jobs using the MRUnit framework and optimized MapReduce performance.
  • Used Oozie for scheduling workflows dynamically.
  • Collaborated with the infrastructure, network, database, application and BI teams to ensure data quality and availability.
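
A minimal sketch of the Kafka producer side of the ingestion referenced above; the broker address, topic name, and payload are illustrative assumptions:

```scala
// Minimal Kafka producer sketch (kafka-clients API). Broker, topic, and payload
// are illustrative assumptions.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EventProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // Key each message by record id so related events land in the same partition.
      producer.send(new ProducerRecord[String, String]("ingest-events", "record-1", "{\"status\":\"ok\"}"))
      producer.flush()
    } finally {
      producer.close()
    }
  }
}
```

A separate consumer job would then read the topic and write the records into HDFS and HBase, as described above.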

Environment: HBase 2.0.0, MapReduce, Sqoop 1.4.7, HDFS 2, Hive 1.2.2, Java 7, Kafka 1.3.2.
