Data Engineer Resume
TX
SUMMARY
- 10 years of professional IT experience in Big Data environments and the Hadoop ecosystem, with good experience in Spark, SQL and Java development.
- Hands-on experience across the Hadoop ecosystem, including extensive experience with Big Data technologies like HDFS, MapReduce, YARN, Spark, Sqoop, Hive, Pig, Impala, Oozie, Oozie Coordinator, ZooKeeper, Apache Cassandra and HBase.
- Experience in using various tools like Sqoop, Flume, Kafka, NiFi and Pig to ingest structured, semi-structured and unstructured data into the cluster.
- Proficient with the Apache Spark ecosystem, including Spark and Spark Streaming, using Scala and Python.
- Designed both time-driven and data-driven automated workflows using Oozie and used ZooKeeper for cluster coordination.
- Experience with Hadoop clusters using Cloudera’s CDH and Hortonworks HDP.
- Developed highly optimized Spark applications to perform various data cleansing, validation, transformation and summarization activities according to the requirement.
- Built data pipelines consisting of Spark, Hive, Sqoop and custom-built input adapters to ingest, transform and analyze operational data.
- Developed Spark jobs and Hive Jobs to summarize and transform data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark Data Frames and Python.
- Experience in working with structured data using HiveQL, join operations, Hive UDFs, partitions, bucketing and internal/external tables.
- Expertise in writing MapReduce jobs in Java and Python to process large structured, semi-structured and unstructured data sets and store them in HDFS.
- Experience working with Python, UNIX and shell scripting.
- Experience in Extraction, Transformation and Loading (ETL) of data from multiple sources like Flat files and Databases.
- Good knowledge of cloud integration with AWS using Elastic MapReduce (EMR), Simple Storage Service (S3), EC2, Redshift and Microsoft Azure.
- Experience with the complete Software Development Life Cycle (SDLC) process, which includes requirement gathering, analysis, designing, developing, testing, implementing and documenting.
- Hands-on experience with Spark architecture and its integrations like Spark SQL and the DataFrame and Dataset APIs.
- Worked on Spark to improve the performance of existing processing in Hadoop using SparkContext, Spark SQL, DataFrames and RDDs.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL and Python.
- Hands-on experience accessing Hive tables from Spark, performing transformations and creating DataFrames on Hive tables.
- Used Spark Structured Streaming to perform necessary transformations.
- Expertise in converting MapReduce programs into Spark transformations using Spark RDDs.
- Strong understanding of AWS components such as EC2 and S3
- Performed Data Migration to GCP
- Experience in Implementing Continuous Delivery pipeline with Maven, Ant, Jenkins and AWS.
- Exposure to CI/CD tools - Jenkins for Continuous Integration, Ansible for continuous deployment.
- Worked with waterfall and Agile methodologies.
TECHNICAL SKILLS
Big Data Technologies: HDFS, MapReduce, Pig, Hive, Sqoop, Oozie, Scala, Spark, Kafka, NiFi, Airflow, Flume, Snowflake, Ambari, Hue
Hadoop Frameworks: Cloudera CDH, Hortonworks HDP, MapR
Database: Oracle 10g/11g, PL/SQL, MySQL, MS SQL Server 2012, DB2
Language: C, C++, Java, Scala, Python
AWS Components: IAM, S3, EMR, EC2, Lambda, Route 53, CloudWatch, SNS
Methodologies: Agile, Waterfall
Build Tools: Maven, Gradle, Jenkins.
NoSQL Databases: HBase, Cassandra, MongoDB, DynamoDB
IDE Tools: Eclipse, NetBeans, IntelliJ
Modelling Tools: Rational Rose, StarUML, Visual Paradigm for UML
BI Tools: Tableau
Operating System: Windows 7/8/10, Vista, UNIX, Linux, Ubuntu, Mac OS X
PROFESSIONAL EXPERIENCE
Confidential, TX
Data Engineer
Responsibilities:
- Developed custom aggregate functions using Spark SQL and performed interactive querying.
- Implemented Apache Airflow for authoring, scheduling, and monitoring Data Pipelines.
- Developed Python code to gather data from HBase (Cornerstone) and designed the solution for implementation in PySpark.
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
- Implemented a Kafka model that pulls the latest records into Hive external tables.
- Imported data into HDFS from various SQL databases and files using Sqoop and from streaming systems using Storm into Big Data Lake.
- Moved data between GCP and Azure using Azure Data Factory.
- Loaded all data sets from source CSV files into Hive using Spark and into Cassandra using Spark/PySpark.
- Implemented a Continuous Delivery pipeline with Docker, GitHub and AWS.
- Wrote Pig scripts to store the data into HBase
- Exported the analysed data to Teradata using Sqoop for visualization and to generate reports for the BI team.
- Used Spark Streaming to collect data from Kafka in near real time, perform the necessary transformations and aggregations on the fly to build the common learner data model, and persist the data in a NoSQL store (HBase).
- Migrated the computational code in HQL to PySpark.
- Completed data extraction, aggregation and analysis in HDFS using PySpark and stored the required data in Hive.
- Used Apache Kafka to develop a data pipeline that handles logs as a stream of messages using producers and consumers.
- Experienced in fact and dimensional modelling (star schema, snowflake schema), transactional modelling and SCD (slowly changing dimensions).
- Wrote complex MapReduce programs.
- Experience creating conditional tasks and using trigger rules to implement joins at specific points in an Apache Airflow DAG.
- Installed and configured Hive and wrote Hive UDFs.
- Involved in HDFS maintenance and administration through the Hadoop Java API.
- Loaded data into the cluster from dynamically generated files using Flume and from RDBMS using Sqoop.
- Sound knowledge in programming Spark using Scala.
- Experience using Apache Airflow to run tasks in parallel, including creating and configuring a database in Postgres or MySQL.
- Involved in writing Java APIs for interacting with HBase.
- Worked on a NiFi data pipeline to process large data sets and configured lookups for data validation and integrity.
- Involved in writing Flume and Hive scripts to extract, transform and load data into the database.
- Used HBase as the data storage
- Developed MapReduce programs to cleanse the data in HDFS obtained from heterogeneous data sources. Processed metadata files into AWS S3 and an Elasticsearch cluster.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Experienced in Importing and exporting data into HDFS and Hive using Sqoop.
- Participated in development/implementation of Cloudera Hadoop environment.
- Used Amazon web services (AWS) like EC2 and S3 for small data sets.
- Built NiFi data pipelines in a Docker container environment during the development phase.
- Implemented AWS services to provide a variety of computing and networking services to meet the needs of applications
- Populated HDFS and HBase with huge amounts of data using Apache Kafka.
- Clustered the NiFi pipeline on EC2 nodes, integrated with Spark and Postgres running on other instances using SSL handshakes, in QA and production environments with the DevOps team.
- Experienced in running queries using Impala and used BI tools to run ad-hoc queries directly on Hadoop.
- Installed and configured Hive, wrote Hive UDFs, and used MapReduce and JUnit for unit testing.
- Experienced in working with various kinds of data sources such as Teradata and Oracle. Successfully loaded files from Teradata to HDFS, and loaded data from HDFS to Hive and Impala.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs, which run independently based on time and data availability.
- Extensively used Terraform to create GCP projects, Kubernetes clusters, other GCP resources and AWS Route 53 DNS records, and to deploy Helm charts.
- Used Cloud Shell in GCP to configure the Dataproc and Storage services.
- Developed MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the EDW.
- Deployed Spark jobs on Amazon EMR and ran them on AWS clusters.
- Strong experience in implementing data warehouse solutions in Amazon Web Services (AWS) Redshift; worked on various projects to migrate data from on-premises databases to AWS Redshift, RDS and S3.
- Leveraged cloud and GPU computing technologies, such as AWS and GCP, for automated machine learning and analytics pipelines.
- Monitored and managed the Hadoop cluster using Apache Ambari.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, and managing and reviewing data backups and Hadoop log files.
- Involved in implementing security on the Hortonworks Hadoop cluster using Kerberos, working with the operations team to move from a non-secured cluster to a secured cluster.
Environment: Java, Hadoop, Hive, Pig, Sqoop, Scala, Kafka, Flume, HBase, Python, NiFi, PySpark, AWS, Hortonworks, Oracle 10g/11g/12c, Teradata, Cassandra, HDFS, Data Lake, Spark, MapReduce, Ambari, Cloudera, Tableau, Snappy, ZooKeeper, NoSQL, Shell Scripting, Ubuntu, Solr.
Confidential, Bellevue, WA
Big data Engineer
Responsibilities:
- Developed real time data processing applications by using Scala and Python and implemented Apache Spark Streaming from various streaming sources like Kafka and JMS.
- Developed ELT processes for files from Ab Initio in GCP, with Dataprep and PySpark as the compute.
- Used PySpark and Hive to analyze sensor data and cluster users based on their behaviour in the events.
- Wrote live real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Developed Shell, Perl and Python scripts to automate and provide Control flow to Pig scripts.
- Worked with AWS services like EMR and EC2 for fast and efficient processing of Big Data.
- Involved in loading data from Linux file systems, servers and Java web services using Kafka producers and partitions.
- Applied Kafka custom encoders for custom input format to load data into Kafka Partitions.
- Implemented a POC with Hadoop and extracted data with Spark into HDFS.
- Used Spark SQL with Scala for creating data frames and performed transformations on data frames.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Developed code to read the data stream from Kafka and send it to the respective bolts through the respective streams.
- Developed Spark applications using Scala for easy Hadoop transitions.
- Implemented applications with Scala along with Akka and the Play framework.
- Optimized the code using PySpark for better performance.
- Worked on Spark streaming using Apache Kafka for real time data processing.
- Used Apache Airflow in a GCP environment to build data pipelines using various Airflow operators.
- Developed MapReduce jobs using the MapReduce Java API and HiveQL.
- Developed UDF, UDAF and UDTF functions and used them in Hive queries.
- Implemented Snowpipe, stages and file upload to the Snowflake database using the COPY command.
- Developed scripts and batch jobs to schedule a bundle (group of coordinators) consisting of various Hadoop programs using Oozie.
- Experienced in optimizing Hive queries and joins to handle different data sets.
- Involved in ETL, Data Integration and Migration by writing Pig scripts.
- Integrated Hadoop with Solr and implemented search algorithms.
- Experience in Storm for handling real-time processing.
- Hands-on experience working with the Hortonworks distribution.
- Worked hands-on with NoSQL databases like MongoDB for a POC storing images and URIs.
- Designed and implemented a MongoDB database and an associated RESTful web service.
- Wrote test cases and implemented test classes using MRUnit and mocking frameworks.
- Developed Sqoop scripts to extract data from MySQL and load it into HDFS.
- Very capable at using AWS utilities such as EMR, S3 and CloudWatch to run and monitor Hadoop/Spark jobs on AWS.
- Experience processing large volumes of data and executing processes in parallel using Scala functionality.
- Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
- Used Talend tool to create workflows for processing data from multiple source systems.
Environment: MapReduce, HDFS, Sqoop, Java, PySpark, Linux, Oozie, Hadoop, Pig, Hive, Solr, Spark Streaming, Kafka, Storm, Spark, Scala, Akka, Python, MongoDB, Hadoop Cluster, Amazon Web Services, Talend.
Confidential, PA
Spark Developer
Responsibilities:
- Involved in requirements analysis and designed an object-oriented domain model.
- Implemented test scripts to support test driven development and continuous integration.
- Experience importing and exporting data into HDFS and Hive using Sqoop.
- Developed MapReduce programs to clean and aggregate the data.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
- Developed new listeners for producers and consumers for both RabbitMQ and Kafka.
- Used microservices with Spring Boot, interacting through a combination of REST and Apache Kafka message brokers.
- Involved in writing queries in Spark SQL using Scala. Worked with Splunk to analyze and visualize data.
- Worked on a large-scale Hadoop YARN cluster for distributed data processing and analysis using Databricks connectors, Spark Core, Spark SQL, KSQL, Sqoop, Pig, Hive, Impala and NoSQL databases.
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation and aggregation from multiple file formats, analysing and transforming the data.
- Experience migrating SQL databases to Azure Data Lake and Azure SQL Database.
- Worked on integrating Apache Kafka with Spark Streaming process to consume data from external REST APIs and run custom functions.
- Worked across all SDLC phases, including requirements, specification, design, implementation and testing.
- Developed Spring and Hibernate data layer components for application.
- Developed profile view web pages (add, edit) using HTML, CSS, jQuery and JavaScript.
- Built the application using Maven scripts.
- Developed the mechanism for logging and debugging with Log4j.
- Involved in developing database transactions through JDBC.
- Used Git for version control.
- Created RESTful services using the Dropwizard framework for various web services involving both JSON and XML.
- Responsible for troubleshooting issues in the execution of MapReduce jobs by inspecting and reviewing log files.
- Used Oracle as the database and Toad for query execution, and was involved in writing SQL scripts and PL/SQL code for procedures and functions.
- Developed front-end applications that interact with the mainframe applications using J2C connectors.
- Hands-on experience in exporting the results into relational databases using Sqoop for visualization and to generate reports for the BI team.
- Designed, developed and implemented JSPs in the presentation layer for submission, application and reference implementation.
- Deployed Web, presentation, and business components on Apache Tomcat Application Server.
- Involved in post-production support and testing, and used JUnit for unit testing of the module.
- Worked in Agile methodology
Environment: HDFS, Hive, Sqoop, Java, Core Java, Maven, HTML, CSS, JavaScript, Git, MapR, JUnit, Agile, Log4j, SQL
Confidential, TX
Hadoop Developer
Responsibilities:
- Developed multiple MapReduce jobs in Java for data cleaning and pre-processing and assisted with data capacity planning and node forecasting.
- Involved in the design and ongoing operation of several Hadoop clusters, and configured and deployed the Hive Metastore using MySQL and the Thrift server.
- Implemented and operated on-premises Hadoop clusters from the hardware to the application layer including compute and storage.
- Uploaded and processed more than 30 terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop and Flume.
- Designed custom deployment and configuration automation systems to allow for hands-off management of clusters via Cobbler, FUNC, and Puppet.
- Prepared complete descriptive documentation, based on the knowledge transfer, about the Phase-II Talend job design and goals, and prepared documentation about the support and maintenance work to be followed in Talend.
- Deployed the company's first Hadoop cluster running Cloudera's CDH2 to a 44-node cluster storing 160 TB and connecting via 1 Gb Ethernet.
- Debugged and resolved major issues with Cloudera Manager by working with the Cloudera team.
- Modified reports and Talend ETL jobs based on the feedback from QA testers and Users in development and staging environments.
- Handled importing other enterprise data from different data sources into HDFS using Sqoop and performing transformations using Hive, MapReduce and then loading data into HBase tables.
- Involved in Cluster Maintenance and removal of nodes using Cloudera Manager.
- Collaborated with application development teams to provide operational support, platform expansion, and upgrades for Hadoop Infrastructure including upgrades to CDH3.
- Participated in Hadoop development Scrum, and installed and configured Cognos 8.4/10 and Talend ETL on single- and multi-server environments.
Environment: Apache Hadoop, Cloudera, Pig, Hive, Talend, MapReduce, Sqoop, UNIX, Cassandra, Java, Linux, Oracle 11gR2, UNIX Shell Scripting, Kerberos
Confidential
Java Developer
Responsibilities:
- Implemented Multi-Threaded Environment and used most of the interfaces under the collection framework by using Core Java Concepts.
- Developed graphical user interfaces using JSF, JSP, HTML, DHTML, Angular, CSS and JavaScript, and developed Python scripts for financial data coming from SQL Developer based on the specified requirements.
- Implemented several Java/J2EE design patterns like Spring MVC, Singleton, Spring Dependency Injection and Data Transfer Object.
- Used JAX-WS (SOAP) for producing web services, wrote programs to consume the web services using SOA with the CXF framework, and developed a few web pages using JSP, JSTL, HTML, CSS, JavaScript, Ajax and JSON.
- Implemented business logic, data exchange, XML processing and created graphics using Python and Django.
- Wrote code to fetch data from web services using jQuery AJAX via JSON responses and update the HTML pages, and developed high-traffic web applications using HTML, CSS, JavaScript, jQuery, Bootstrap, Ext JS, AngularJS, Node.js and React.js.
- Wrote SQL queries and created PL/SQL functions/procedures/packages optimized for APEX, improving performance and response times of APEX pages and reports.
- Used the jQuery library, Node.js and AngularJS to create powerful dynamic web pages and web applications using their advanced and cross-browser functionality.
- Used Java Server Pages for content layout and presentation with Python, and extracted and loaded data using Python scripts and PL/SQL packages.
- Worked with various JavaScript frameworks like Backbone.js, AngularJS and Ember.js.
- Wrote code with object-oriented Python, Flask, SQL, Beautiful Soup, httplib2, Jinja2, HTML/CSS, Bootstrap, jQuery, Linux, Sublime Text and Git.
- Developed GUI using JSP, Struts, HTML3, CSS3, XHTML, jQuery, Swing and JavaScript to simplify the complexities of the application.
- Wrote and executed various MySQL database queries from Python using the Python-MySQL connector and the MySQLdb package, generated Python Django forms to record data of online users, and used PyTest for writing test cases.
Environment: Python, Django, Java, JSF MVC, Spring IOC, APEX, Ruby on Rails, Spring JDBC, Hibernate, ActiveMQ, Log4j, Ant, MySQL, JDK 1.6, J2EE, JSP, Servlets, HTML, LDAP, Salesforce, Mule ESB, JDBC, MongoDB, DAO, EJB 3.0, PL/SQL, React.js, WebSphere, Eclipse, AngularJS, and CVS.