Big Data Engineer Resume
Austin, TX
SUMMARY
- 7+ years of total IT experience in Big Data development and data analysis.
- Experience in the design and development of applications using Hadoop ecosystem components such as HDFS, Hive, Spark, Sqoop, Scala, Kafka, Apache NiFi, HBase, and YARN
- Experience with Hadoop distributions HDP 2.6.x and CDH 5.x
- Experience in developing Spark Streaming applications to consume real-time transactions via Kafka topics
- Experience building applications using Spark Core, Spark SQL, DataFrames, and Spark Streaming
- Experience importing data from RDBMS sources (Oracle, SQL Server) into the Hadoop data lake using Sqoop
- Experience with the job scheduling tool Oozie
- Experienced in AWS services: S3, EC2, RDS, and EMR
- Experience in developing Spark applications using DataFrames and Datasets; transformed data using PySpark and Spark SQL
- Knowledge of NoSQL databases: HBase, MongoDB, and Cassandra
- Experience with real-time messaging systems such as Kafka to ingest streaming data into Hadoop
- Worked with bug-tracking tools such as Remedy and Jira
- Experience developing Spark batch applications to ingest data into a common data lake
- Experience in importing and exporting data using Sqoop from RDBMS to HDFS and vice-versa
- Experience working with Agile and Waterfall methodologies
- Highly motivated and detail-oriented; able to work independently or as part of a team, with excellent networking and communication skills across all levels of stakeholders, including executives, application developers, business users, and customers
TECHNICAL SKILLS
Hadoop/Big Data: HDFS, Hive, MapReduce, Spark, Sqoop, HBase, Kafka, Oozie, NiFi, Impala, Hue, Storm
NoSQL Databases: Spanner, HBase, MapR-DB
Languages: Python, Scala, Core Java, Unix Shell scripts, SQL
Web/Application Server: Apache Tomcat
Databases/ETL: Oracle, DB2, SQL Server, MySQL, DataStage, Teradata
IDEs: Eclipse, IntelliJ
Other Tools & Packages: CAWA, Bitbucket, JUnit, Maven, Ant, GitHub, StreamSets Data Collector, Grafana, Tableau
SDLC Methodologies: Agile, Waterfall
Operating Systems: Linux, UNIX, Windows
Office Tools: MS Office (Word, PowerPoint)
PROFESSIONAL EXPERIENCE
Confidential - Austin, TX
Big Data Engineer
Responsibilities:
- Imported and exported data using Sqoop to move data between Oracle 11g and HDFS on a regular basis.
- Involved in requirements analysis, design, development, and testing of the application.
- Worked on enhancing performance and optimizing existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Spark RDDs.
- Followed a parameterized approach for schema details, file locations, and delimiter details to make the code efficient and reusable.
- Developed a PySpark application to consume data from Apache Kafka topics and publish it to HDFS and HBase (see the sketch after this list).
- Worked on DStreams (Discretized Streams), RDDs (Resilient Distributed Datasets), DataFrames, and Spark SQL to build the Spark Streaming application.
- Involved in converting Hive queries into Spark transformations using Spark RDDs.
- Used Spark SQL, Spark RDDs, and Spark DataFrames to load JSON data into Hive tables.
- Involved in the data ingestion process through DataStage to load data into HDFS from Mainframes, Teradata, DB2.
- Used Apache NiFi for data ingestion and to load data into Kafka topics.
- Developed ELT workflows using NiFi to load data into Hive and Teradata.
- Used the Python subprocess module to invoke PySpark jobs.
- Developed Spark code in the Spark SQL environment for faster testing and processing of data.
- Loaded data into Spark RDDs and performed in-memory computations to generate the output response.
- Used Hue and Cloudera Manager to monitor Spark jobs.
- Worked on an AWS POC to modernize the streaming pipeline using Kinesis, Lambda, S3, and Redshift.
- Developed and maintained system documentation and runbooks.
- Led end-user training to drive technology adoption among business users.
- Wrote UNIX shell scripts to automate Sqoop jobs.
- Worked on Tableau for reporting on top of Hive.
- Worked in Agile and used Jira to maintain project stories.
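A minimal PySpark sketch of the Kafka-to-HDFS consumer pattern referenced above, shown here with Structured Streaming; the broker, topic, schema, and path names are illustrative placeholders, and the HBase sink is omitted for brevity:

    # Run with the spark-sql-kafka connector package available to spark-submit.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Assumed transaction payload; the real message schema differs.
    schema = StructType([
        StructField("txn_id", StringType()),
        StructField("amount", StringType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
           .option("subscribe", "transactions")                 # placeholder topic
           .load())

    # Parse the JSON value column into typed fields.
    parsed = (raw.select(from_json(col("value").cast("string"), schema).alias("txn"))
                 .select("txn.*"))

    (parsed.writeStream
           .format("parquet")
           .option("path", "hdfs:///data/lake/transactions")        # placeholder HDFS path
           .option("checkpointLocation", "hdfs:///checkpoints/txn")
           .start()
           .awaitTermination())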
Environment: Cloudera Hadoop, HDFS, YARN, Hive, Spark, PySpark, Spark SQL, HBase, Sqoop, MS SQL Server, Oracle, SQL/NoSQL, Linux, Python
Confidential - Austin, TX
Big Data Developer
Responsibilities:
- Teamed up with architects to design the Spark model for a generic ETL framework.
- Implemented Spark with YARN to perform analytics on data in Hive.
- Developed the Extract process using Spark 2.0.
- Created libraries to connect to multiple databases such as DB2, SQL Server, Oracle, MongoDB, PostgreSQL, and HBase, and to invoke the Spark session along with custom UDFs
- Imported data from multiple databases (DB2, SQL Server, Oracle, MongoDB) and files.
- Created DataFrames as result sets for the extracted data.
- Applied filters and developed Spark/MapReduce jobs to process the data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Implemented the code in Scala, using DataFrames and the Spark SQL API for faster data processing.
- Created multiple case classes to generate the data based on the object model.
- Converted the data into the required JSON structure using Jackson4j to load it into MongoDB, HBase, or PostgreSQL.
- As part of the transformation, read data from MongoDB as JSON and applied explode on the DataFrame to flatten the data (see the sketch after this list).
- Developed code using Scala APIs to compare the performance of Spark with Hive, and shell scripts for the Sqoop jobs.
- Used StructType and struct-of-array types to read the different schemas into DataFrames.
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for large data volumes.
- Worked on Grafana for real-time visualizations.
- Gained good experience with ETL tools such as IBM DataStage and Talend.
- Applied expert knowledge of MongoDB and NoSQL data modeling, tuning, and indexing.
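The framework above was written in Scala; purely as an illustration, the PySpark sketch below shows the explode-based flattening of nested JSON described in the transformation step (file, column, and field names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, col

    spark = SparkSession.builder.appName("flatten-json").getOrCreate()

    # Hypothetical export from MongoDB: each order document carries an "items" array.
    orders = spark.read.json("orders.json")

    # explode() turns each array element into its own row, flattening the document.
    flat = (orders
            .select(col("order_id"), explode(col("items")).alias("item"))
            .select("order_id", "item.sku", "item.qty"))

    flat.show()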
Environment: Cloudera Hadoop, HDFS, YARN, Hive, Spark, PySpark, Spark SQL, HBase, Sqoop, Kafka, DB2, SQL Server, Oracle, MongoDB, DataStage, PostgreSQL, Linux, Python
Confidential - Minneapolis, MN
Big Data Developer
Responsibilities:
- Involved in requirements analysis, design, development, and testing of the application.
- Configured Kafka Connect JDBC with SAP HANA and MapR Streams for both real-time streaming and batch processing.
- Created MapR Event Streams and Kafka topics.
- Worked on Attunity Replicate to load data from SAP ECC into Apache Kafka topics.
- Developed a Spark Streaming application in Python to stream data from MapR Event Streams and Apache Kafka topics to Hive and MapR-DB, and to stream data from one topic to another within MapR Event Streams (see the sketch after this list).
- Worked on DStreams (Discretized Streams), RDDs (Resilient Distributed Datasets), DataFrames, and Spark SQL to build the Spark Streaming application.
- Involved in creating SQL queries to extract data and perform joins on tables in SAP HANA and MySQL.
- Worked with NoSQL databases like HBase, creating tables to load large sets of semi-structured data coming from source systems.
- Implemented partitioning, dynamic partitions, and bucketing in Hive for efficient data access.
- Used Hue and MapR Control System (MCS) to monitor and troubleshoot Spark jobs.
- Developed Sqoop scripts to move data from MapR-FS to SAP HANA.
- Developed workflows in Oozie to automate loading data into HDFS and pre-processing it with Pig.
- Installed and configured Kafka Connect JDBC on an AWS EC2 instance.
- Created stored procedures in MySQL to improve data handling and ETL Transactions.
- Worked on data validation using Hive and wrote Hive UDFs.
- Managed Linux and Windows virtual servers on AWS EC2.
- Built statistical models on AWS EMR by uploading data to S3 and creating instances on EC2.
- Configured the SAP HANA source connector with SAP HANA as the source and an Apache Kafka topic as the target for real-time streaming and batch processing.
- Provisioned, installed, and configured SAP HANA Enterprise Edition on an AWS EC2 instance.
- Developed a streaming application to stream data from MapR-ES to HBase.
- Streamed data from Apache Kafka topics to the time-series database OpenTSDB.
- Built dashboards and visualizations on top of MapR-DB and Hive using Oracle Data Visualizer Desktop. Built real-time visualizations on top of OpenTSDB using Grafana.
- Wrote UNIX shell scripts to automate ETL processes.
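A simplified PySpark (DStream) sketch of the Kafka-to-Hive streaming flow described above; broker, topic, and table names are placeholders, and the MapR-DB and MapR-ES sinks are omitted:

    from pyspark import SparkContext
    from pyspark.sql import SparkSession
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils   # requires the spark-streaming-kafka package

    sc = SparkContext(appName="kafka-to-hive")
    ssc = StreamingContext(sc, batchDuration=10)      # 10-second micro-batches
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    stream = KafkaUtils.createDirectStream(
        ssc, ["sap_orders"],                          # placeholder topic
        {"metadata.broker.list": "broker1:9092"})     # placeholder broker

    def save_batch(time, rdd):
        # Each record is a (key, value) pair; keep the JSON value payload.
        if not rdd.isEmpty():
            df = spark.read.json(rdd.map(lambda kv: kv[1]))
            df.write.mode("append").saveAsTable("staging.sap_orders")   # placeholder Hive table

    stream.foreachRDD(save_batch)
    ssc.start()
    ssc.awaitTermination()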
Environment: MapR 6.0, Apache Kafka 1.0.0, Hive 2.1, HBase 1.1.8, Hue, MapR-DB, MapR-FS, Spark 2.1.0, Python, AWS, SAP HANA, Sqoop, Oozie, Pig, IntelliJ, Kafka Connect Framework, DB Visualizer, Oracle Data Visualizer Desktop, StreamSets Data Collector, MapR-ES, MySQL, Git.
Confidential - Medina, OH
Big Data Developer
Responsibilities:
- Worked on enhancing performance and optimizing existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Spark RDDs.
- Worked on MySQL for identifying required tables and views to export into HDFS.
- Loaded data from MySQL into HDFS on the development cluster for validation and cleansing.
- Created Apache Kafka topics.
- Configured StreamSets Data Collector with Apache Kafka to stream real-time data from different sources (databases and files) into Kafka topics.
- Developed a streaming application using Spark and Python to stream data from Kafka topics to Hive.
- Processed large amounts of structured and semi-structured data using MapReduce programs.
- Worked on real-time and batch processing of data sources using Apache Spark, Elasticsearch, Spark Streaming, and Apache Kafka.
- Created scripts for importing data into HDFS/Hive using Sqoop from DB2.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Spark SQL.
- Worked on migrating Pig scripts and MapReduce programs to the Spark DataFrames API and Spark SQL to improve performance.
- Conducted POCs for real-time streaming of data from MySQL to Hive and HBase.
- Worked on SequenceFiles, RCFiles, map-side joins, bucketing, and partitioning for Hive performance enhancement and storage improvement (see the sketch after this list).
- Handled importing and exporting of large data sets between various data sources and HDFS using Sqoop, and performed transformations using Hive.
- Built dashboards and visualizations on top of Hive using Tableau, published the reports to Tableau Online, and embedded them in the browser via iframes.
- Monitored the Hadoop cluster through Cloudera Manager and implemented alerts based on error messages.
- Loaded data from UNIX file system to HDFS.
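As a sketch of the Hive partitioning work noted above, the PySpark snippet below creates a partitioned table and performs a dynamic-partition insert; database, table, and column names are illustrative, and bucketing and file-format tuning are omitted:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partitioning")
             .enableHiveSupport()
             .getOrCreate())

    # Allow Hive to derive partition values from the data itself.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_part (
            order_id STRING,
            amount   DOUBLE
        )
        PARTITIONED BY (order_date STRING)
        STORED AS ORC
    """)

    # Dynamic-partition insert: one partition per distinct order_date in the staging table.
    spark.sql("""
        INSERT INTO TABLE sales_part PARTITION (order_date)
        SELECT order_id, amount, order_date
        FROM sales_staging
    """)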
Environment: Cloudera, Apache Kafka, HDFS, Python, Hive, Spark, Spark SQL, Pig, MapReduce, Sqoop, IntelliJ, Tableau, StreamSets Data Collector, UNIX, MySQL, Git.
Confidential
Java Developer
Responsibilities:
- Involved in implementing the design across all key phases of the software development life cycle.
- Involved in design, development and testing of the application.
- Applied object-oriented programming concepts to validate the columns of the import file.
- Used a DOM parser to parse the XML files.
- Implemented a complex back-end component using Java multi-threading to return counts quickly against a large MySQL database (about 40 million rows).
- Worked in Agile development following the Scrum process, with sprints and daily stand-up meetings.
- Developed front-end screens using JSP, HTML, jQuery, JavaScript, and CSS.
- Participated in OOAD, domain modeling, and system architecture.
- Used WinSCP to transfer files from the local system to other systems.
- Wrote test cases for unit testing before the QA release.
- Worked closely with the QA team and coordinated on fixes.
Environment: Java, Core Java, Apache Tomcat, Maven, JavaScript, RESTful Web Services, WebLogic, JBoss, Eclipse IDE, Apache CXF, FTP, HTML, CSS.