
Big Data Engineer Resume


Austin, TX

SUMMARY

  • 7+ years of total IT experience in big data development and data analysis.
  • Experience in the design and development of applications using Hadoop ecosystem components such as HDFS, Hive, Spark, Sqoop, Scala, Kafka, Apache NiFi, HBase, and YARN.
  • Experience with the Hadoop distributions HDP 2.6.x and CDH 5.x.
  • Experience developing Spark Streaming applications to consume real-time transactions via Kafka topics.
  • Experience building applications using Spark Core, Spark SQL, DataFrames, and Spark Streaming.
  • Experience importing data from relational databases (Oracle and SQL Server) into the Hadoop data lake using Sqoop.
  • Experience with the job scheduling tool Oozie.
  • Experienced in AWS services: S3, EC2, RDS, and EMR.
  • Experience developing Spark applications using DataFrames and Datasets; transformed data using PySpark and Spark SQL.
  • Knowledge of the NoSQL databases HBase, MongoDB, and Cassandra.
  • Experience with real-time messaging systems such as Kafka to ingest streaming data into Hadoop.
  • Worked with bug tracking tools such as Remedy and Jira.
  • Experience developing Spark batch applications to ingest data into a common data lake.
  • Experience importing and exporting data using Sqoop between RDBMS and HDFS.
  • Experience working with Agile and Waterfall methodologies.
  • Highly motivated and detail-oriented, with the ability to work independently or as part of a team, and excellent networking and communication skills with stakeholders at all levels, including executives, application developers, business users, and customers.

TECHNICAL SKILLS

Hadoop/Big Data: HDFS, Hive, MapReduce, Spark, Sqoop, HBase, Kafka, Oozie, NiFi, Impala, Hue, Storm.

NoSQL Databases: Spanner, HBase, MapR-DB

Languages: Python, Scala, Core Java, Unix Shell scripts, SQL

Web/Application Server: Apache Tomcat

Databases/ETL: Oracle, DB2, SQL Server, MySQL, DataStage, Teradata

IDEs: Eclipse, IntelliJ

Other Tools & Packages: CAWA, Bitbucket, JUnit, Maven, Ant, GitHub, StreamSets Data Collector, Grafana, Tableau.

SDLC Methodologies: Agile, Waterfall

Operating Systems: Linux, UNIX, Windows

Office Tools: MS Office (Word, PowerPoint)

PROFESSIONAL EXPERIENCE

Confidential - Austin TX

Big Data Engineer

Responsibilities:

  • Imported and exported data using Sqoop to move data between Oracle 11g and HDFS on a regular basis.
  • Involved in requirements analysis, design, development, and testing of the application.
  • Worked on enhancing the performance and optimizing the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Spark RDDs.
  • Followed a parameterized approach for schema details, file locations, and delimiter details to make the code efficient and reusable.
  • Developed a PySpark application to consume data from Apache Kafka topics and publish it to HDFS and HBase (see the streaming sketch after this list).
  • Worked on DStreams (Discretized Streams), RDDs (Resilient Distributed Datasets), DataFrames, and Spark SQL to build the Spark Streaming application.
  • Involved in converting Hive queries into Spark transformations using Spark RDDs.
  • Used Spark SQL, Spark RDDs, and Spark DataFrames to load JSON data into Hive tables.
  • Involved in the data ingestion process through DataStage to load data into HDFS from Mainframes, Teradata, DB2.
  • Used Apache NiFi for data ingestion and to load data into Kafka topics.
  • Developed ELT workflows using Nifi to load data into Hive and Teradata.
  • Used the Python subprocess module to call PySpark jobs (see the launcher sketch after this list).
  • Developed Spark code in the Spark SQL environment for faster testing and processing of data, loading the data into Spark RDDs and performing in-memory computations to generate the output response.
  • Used Hue and Cloudera Manager to monitor Spark jobs.
  • Worked on an AWS POC to modernize the streaming pipeline using AWS Kinesis, Lambda, S3, and Redshift.
  • Developed and maintained system documentation and runbooks.
  • Led end-user training to increase and drive technology adoption among business users.
  • Automated Sqoop jobs using UNIX shell scripts.
  • Worked on Tableau for reporting on top of Hive.
  • Worked in Agile and used JIRA to maintain project stories.
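
A minimal sketch of the kind of PySpark streaming consumer described above, using the Spark Streaming Kafka 0.8 direct-stream connector available on Spark 2.x; the topic name, broker list, and HDFS path are illustrative placeholders, not values from the project:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils  # spark-streaming-kafka-0-8 package

    sc = SparkContext(appName="KafkaToHdfs")
    ssc = StreamingContext(sc, 30)  # 30-second micro-batches

    # Topic name and broker list are hypothetical placeholders
    stream = KafkaUtils.createDirectStream(
        ssc, ["txn-topic"],
        {"metadata.broker.list": "broker1:9092,broker2:9092"})

    # Each Kafka record arrives as a (key, value) pair; keep only the value
    values = stream.map(lambda kv: kv[1])

    # Write each micro-batch to a timestamped directory under this prefix
    values.saveAsTextFiles("hdfs:///data/lake/txns/batch")

    ssc.start()
    ssc.awaitTermination()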
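And a minimal sketch of launching a PySpark job from Python with the subprocess module, as in the bullet above; the script path and arguments are hypothetical:

    import subprocess

    # Submit a PySpark job to YARN; all paths and arguments are placeholders
    cmd = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "/apps/etl/ingest_transactions.py",
        "--run-date", "2018-06-01",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError("Spark job failed:\n" + result.stderr)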

Environment: Cloudera Hadoop, HDFS, YARN, Hive, Spark, PySpark, Spark SQL, HBase, Sqoop, MS SQL Server, Oracle, SQL/NoSQL, Linux, Python

Confidential - Austin, TX

Big Data Developer

Responsibilities:

  • Teamed up with architects to design the Spark model for the generic ETL framework.
  • Implemented Spark with YARN to perform analytics on data in Hive.
  • Developed the Extract process using Spark 2.0.
  • Created libraries to connect to multiple databases such as DB2, SQL Server, Oracle, MongoDB, PostgreSQL, and HBase, and to invoke the Spark session along with some UDFs.
  • Imported data from multiple databases (DB2, SQL Server, Oracle, MongoDB) and files.
  • Created DataFrames as result sets for the extracted data.
  • Applied filters and developed the Spark MapReduce jobs to process the data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Implemented the code in Scala, utilizing DataFrames and the Spark SQL API for faster processing of data.
  • Created multiple case classes to generate data based on the object model.
  • Converted the data into the required JSON structure using Jackson to load it into MongoDB, HBase, or PostgreSQL.
  • As part of the transformation, read the data from MongoDB as JSON and applied explode on the DataFrame to flatten it (see the sketch after this list).
  • Developed code using Scala APIs to compare the performance of Spark with Hive, and shell scripts for the Sqoop jobs.
  • Used StructType and arrays of structs to read different schemas into DataFrames.
  • Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for high volumes of data.
  • Worked on Grafana for real-time visualizations.
  • Gained good experience with ETL tools such as IBM DataStage and Talend.
  • Developed expert knowledge of MongoDB, NoSQL data modeling, tuning, and indexing.
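
The flattening step above was implemented in Scala on this project; the equivalent logic, sketched here in PySpark for consistency with the rest of this resume, reads MongoDB documents through the MongoDB Spark connector and uses explode to emit one row per array element. The URI, collection, and column names are illustrative assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("FlattenMongoDocs").getOrCreate()

    # Read a MongoDB collection via the MongoDB Spark connector;
    # the URI, database, and collection are placeholders
    df = (spark.read.format("mongo")
          .option("uri", "mongodb://host:27017/sales.orders")
          .load())

    # explode() produces one row per element of the line_items array,
    # flattening the nested JSON structure
    flat = (df.select(col("order_id"), explode(col("line_items")).alias("item"))
              .select("order_id", "item.sku", "item.qty", "item.price"))

    flat.write.mode("overwrite").parquet("hdfs:///data/flat/orders")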

Environment: Cloudera Hadoop, HDFS, YARN, Hive, Spark, PySpark, Spark SQL, Sqoop, Kafka, DB2, SQL Server, Oracle, MongoDB, DataStage, PostgreSQL, HBase, Linux, Python

Confidential - Minneapolis, MN

Big Data Developer

Responsibilities:

  • Involved in requirements analysis, design, development, and testing of the application.
  • Configured Kafka Connect JDBC with SAP HANA and MapR Streams for both real-time streaming and batch process.
  • Created MapR Event Streams and Kafka topics.
  • Worked on Attunity Replicate to load data from SAP ECC into Apache Kafka topics.
  • Developed a Spark Streaming application in Python to stream data from MapR Event Streams and Apache Kafka topics into Hive and MapR-DB, and to stream data from one topic to another within MapR Event Streams.
  • Worked on DStreams (Discretized Streams), RDDs (Resilient Distributed Datasets), DataFrames, and Spark SQL to build the Spark Streaming application.
  • Involved in creating SQL queries to extract data and perform joins on tables in SAP HANA and MySQL.
  • Worked with NoSQL databases like HBase, creating tables to load large sets of semi-structured data coming from source systems.
  • Implemented partitioning, dynamic partitions, and bucketing in Hive for efficient data access (see the DDL sketch after this list).
  • Used Hue and MapR Control System (MCS) to monitor and troubleshoot Spark jobs.
  • Developed Sqoop scripts to move data from MapR-FS to SAP HANA.
  • Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing it with Pig.
  • Installed and configured Kafka Connect JDBC in AWS EC2 instance.
  • Created stored procedures in MySQL to improve data handling and ETL Transactions.
  • Worked on data validation using Hive and wrote Hive UDFs.
  • Managed Linux and Windows virtual servers on AWS EC2.
  • Built statistical models on AWS EMR by uploading data to S3 and creating instances on EC2.
  • Configured SAP HANA source connector with SAP HANA as source and Apache Kafka topic as target for real time streaming and batch processing.
  • Provisioned, installed, and configured SAP HANA enterprise edition on AWS cloud EC2 instance.
  • Developed a streaming application to stream data from MapR-ES to HBase.
  • Streamed data from Apache Kafka topics to the time-series database OpenTSDB.
  • Built dashboards and visualizations on top of MapR-DB and Hive using Oracle Data Visualizer Desktop, and built real-time visualizations on top of OpenTSDB using Grafana.
  • Automated the ETL processes using UNIX shell scripts.
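
A sketch of the Hive partitioning and bucketing described above, issued here through a Hive-enabled Spark session; the database, table, and column names are illustrative assumptions:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("HivePartitioning")
             .enableHiveSupport()
             .getOrCreate())

    # Partition by load date and bucket by customer id so queries that
    # filter on txn_date or join on customer_id scan less data.
    # Table and column names are placeholders.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales.transactions (
            txn_id      STRING,
            customer_id STRING,
            amount      DOUBLE
        )
        PARTITIONED BY (txn_date STRING)
        CLUSTERED BY (customer_id) INTO 32 BUCKETS
        STORED AS ORC
    """)

    # Dynamic partitioning routes each row to its txn_date partition
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT INTO TABLE sales.transactions PARTITION (txn_date)
        SELECT txn_id, customer_id, amount, txn_date
        FROM staging.transactions
    """)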

Environment: MapR 6.0, Apache Kafka 1.0.0, Hive 2.1, HBase 1.1.8, Hue, MapR-DB, MapR-FS, Spark 2.1.0, Python, AWS, SAP HANA, Sqoop, Oozie, Pig, IntelliJ, Kafka Connect Framework, DbVisualizer, Oracle Data Visualizer Desktop, StreamSets Data Collector, MapR-ES, MySQL, Git.

Confidential - Medina, OH

Big Data Developer

Responsibilities:

  • Worked on enhancing the performance and optimizing the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Spark RDDs.
  • Worked on MySQL for identifying required tables and views to export into HDFS.
  • Loaded data from MySQL to HDFS to development cluster for validation and cleansing.
  • Created Apache Kafka topics.
  • Configured StreamSets Data Collector with Apache Kafka to stream real-time data from different sources (databases and files) into Kafka topics.
  • Developed streaming application to stream data from Kafka topics to Hive using Spark, Python.
  • Processed large amounts of structured and semi-structured data using MapReduce programs.
  • Worked on real-time and batch processing of data sources using Apache Spark, Elasticsearch, Spark Streaming, and Apache Kafka.
  • Created scripts for importing data into HDFS/Hive using Sqoop from DB2.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Spark SQL.
  • Worked on migrating Pig scripts and MapReduce programs to the Spark DataFrame API and Spark SQL to improve performance (see the sketch after this list).
  • Conducted POCs for real-time streaming of data from MySQL to Hive and HBase.
  • Worked on SequenceFiles, RCFiles, map-side joins, bucketing, and partitioning for Hive performance enhancement and storage improvement.
  • Handled importing and exporting of large data sets between various data sources and HDFS using Sqoop, performed transformations using Hive, and loaded data into HDFS.
  • Built dashboards and visualizations on top of Hive using Tableau, and published those reports to Tableau Online accounts and to the browser using iframes.
  • Monitored the Hadoop cluster through Cloudera Manager and implemented alerts based on error messages.
  • Loaded data from UNIX file system to HDFS.
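
A sketch of the kind of Pig-to-DataFrame migration mentioned above: a Pig LOAD/FILTER/GROUP/SUM pipeline re-expressed with the Spark DataFrame API. The HDFS paths and column names are illustrative assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("PigToDataFrame").getOrCreate()

    # Equivalent of a Pig script that LOADs a tab-delimited file,
    # FILTERs bad rows, GROUPs by key, and SUMs a measure.
    # Paths and columns are placeholders.
    df = spark.read.csv("hdfs:///data/raw/clicks",
                        sep="\t",
                        schema="user_id STRING, clicks INT")

    result = (df.filter(F.col("clicks") > 0)
                .groupBy("user_id")
                .agg(F.sum("clicks").alias("total_clicks")))

    result.write.mode("overwrite").parquet("hdfs:///data/curated/clicks")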

Environment: Cloudera, Apache Kafka, HDFS, Python, Hive, Spark, Spark SQL, Pig, MapReduce, Sqoop, IntelliJ, Tableau, StreamSets Data Collector, UNIX, MySQL, Git.

Confidential

Java Developer

Responsibilities:

  • Involved in the implementation of design using vital phases of the Software development life cycle.
  • Involved in design, development and testing of the application.
  • Implemented the object-oriented programming concepts for validating the columns of the import file.
  • Used a DOM parser to parse the XML files.
  • Implemented a complex back-end component to compute counts quickly against a large MySQL database (about 40 million rows) using Java multithreading.
  • Worked in Agile development following the Scrum process, with sprints and daily stand-up meetings.
  • Developed front-end screens using JSP, HTML, jQuery, JavaScript, and CSS.
  • Participated in OOAD, domain modeling, and system architecture.
  • Used WinSCP to transfer files between local and remote systems.
  • Created test cases for unit testing before QA releases.
  • Worked closely with the QA team to coordinate fixes.

Environment: Java, Core Java, Apache Tomcat, Maven, JavaScript, RESTful Web Services, WebLogic, JBoss, Eclipse IDE, Apache CXF, FTP, HTML, CSS.
