
Spark/Big Data Engineer Resume


NY

SUMMARY:

  • 5 years of professional IT experience in Data Warehousing, ETL (Extract, Transform, and Load) and Data Analytics.
  • Extensive experience in Informatica Power Center 10.x/9.x/8.x. 3 years as Lead Developer, responsible for supporting enterprise level ETL architecture.
  • 2.5 years of experience in Hadoop 2.0. Led development of enterprise-level solutions utilizing Hadoop utilities such as Spark, MapReduce, Sqoop, Pig, Hive, HBase, Phoenix, Oozie, Flume, streaming jars, custom SerDes, etc. Worked on proofs of concept with Kafka and Storm.
  • Experience with Hortonworks, Cloudera, and Amazon EMR Hadoop distributions.
  • Worked on Java EE 7 and 8. Developed ETL/Hadoop-related Java code, created RESTful APIs using the Spring Boot framework, developed web apps using Spring MVC and JavaScript, and developed coding frameworks.
  • Well-versed in relational database management systems (RDBMS) including Oracle, MS SQL Server, MySQL, Teradata, DB2, Netezza, and MS Access. More than 5 years of experience in Teradata.
  • Proficient in SQL, T-SQL, BTEQ, and PL/SQL (stored procedures, functions, triggers, cursors, and packages).
  • Extensive experience in developing UNIX shell scripts, Perl, Windows batch scripts, JavaScript, and PowerShell to automate ETL processes.
  • Exposure to NoSQL databases such as MongoDB, HBase, and Cassandra. Created Java apps to handle data in MongoDB and HBase. Used Phoenix to create SQL layer on HBase.
  • Experience with Talend’s Data Integration, ESB, MDM and Big Data tools.
  • Exposure to HL7’s FHIR specifications and the related Java API, HAPI. Created FHIR APIs/web services to store and manage resources in MongoDB.
  • Healthcare domain knowledge including Facets, CareAdvance, Care Analyzer, HL7, EDI, NCPDP, EMR, HEDIS, NCQA, URAC, etc.
  • Hands on experience in various open source Apache technologies such as NiFi, Hadoop, Avro, ORC, Parquet, Spark, HBase, Phoenix, Kite, Drill, Presto, Drools, Talend, Airflow, Falcon, Flume, Ranger, Ambari, Kafka, Oozie, ZooKeeper, Karaf, Camel, JMeter, etc.
  • Experience in Elasticsearch and MDM solutions.
  • Worked on message-oriented architectures with RabbitMQ and Kafka as message broker options. Used Talend ESB to exchange messages with AMQP and JMS clients.
  • Well-versed in version control and CI/CD tools such as SVN, Git, SourceTree, and Bitbucket.
  • Experience in Amazon Web Services (AWS) products S3, EC2, EMR, and RDS.
  • Strong experience in the design and development of Business Intelligence solutions using data modeling, dimensional modeling, ETL processes, data integration, OLAP, and client/server applications.
  • Extensive experience in agile software development methodology.

PROFESSIONAL EXPERIENCE:

Confidential, NY

Spark/Big Data Engineer

Responsibilities:

  • Designed a data workflow model to create a data lake in the Hadoop ecosystem so that reporting tools like Tableau can plug in to generate the necessary reports.
  • Created Source-to-Target Mappings (STMs) for the required tables by understanding the business requirements for the reports.
  • Developed PySpark and Spark SQL code on Amazon EMR to perform the transformations defined in the STMs (a minimal sketch follows this list).
  • Created Hive tables on HDFS, stored in Parquet format, to hold the data processed by Apache Spark on the Cloudera Hadoop cluster.
  • Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats including XML, JSON, CSV, and other compressed formats.
  • Loaded log data directly into HDFS using Flume.
  • Leveraged AWS S3 as the storage layer for HDFS.
  • Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark.
  • Used Bitbucket as the code repository and routinely used Git commands such as clone, push, and pull.
  • Monitored the jobs run on the Hadoop cluster using the Hadoop ResourceManager.
  • Used Confluence to store the design documents and the STMs.
  • Met with business and engineering teams on a regular basis to keep requirements in sync and deliver on them.
  • Used Jira to track the stories worked on under the Agile methodology.
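
A minimal sketch of the PySpark/Spark SQL transformation job described above: read raw JSON from S3 on EMR, apply STM-style transformations, and persist the result as a Parquet-backed Hive table that Tableau can query. Bucket, database, table, and column names are placeholder assumptions, not taken from the actual project.

    # Minimal PySpark sketch: read raw JSON from S3, apply STM-style transformations,
    # and write a Parquet-backed Hive table. All names and paths below are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("stm-curated-load")       # hypothetical job name
             .enableHiveSupport()
             .getOrCreate())

    # Raw JSON landed in S3 (path is an assumption, not the actual project location)
    raw = spark.read.json("s3://example-data-lake/raw/claims/")

    # STM-style transformations: rename, cast, derive, and filter columns
    curated = (raw
               .withColumnRenamed("clm_id", "claim_id")
               .withColumn("service_date", F.to_date("svc_dt", "yyyy-MM-dd"))
               .withColumn("paid_amount", F.col("paid_amt").cast("decimal(12,2)"))
               .filter(F.col("claim_status") == "FINAL"))

    # Persist as a Parquet-backed Hive table for reporting tools such as Tableau
    (curated.write
            .mode("overwrite")
            .format("parquet")
            .partitionBy("service_date")
            .saveAsTable("curated.claims"))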

Environment: Spark, PySpark, Spark SQL, Hive, Pig, Flume, IntelliJ IDEA, AWS CLI, AWS EMR, AWS S3, REST APIs, shell scripting, Git

Confidential, NY

Hadoop Developer

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop
  • Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster
  • Setup and benchmarked Hadoop/HBase clusters for internal use
  • Developed simple to complex MapReduce jobs using the Java programming language, as well as equivalent jobs implemented using Hive and Pig.
  • Optimized MapReduce jobs to use HDFS efficiently by applying various compression mechanisms.
  • Handled imports of data from various sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
  • Analyzed the data by performing Hive queries (HiveQL) and running Pig scripts (Pig Latin) to study customer behavior.
  • Used UDFs to implement business logic in Hadoop.
  • Used Impala to read, write and query the Hadoop data in HBase.
  • Developed programs in Spark for the application to process data faster than standard MapReduce programs.
  • Implemented business logic by writing UDFs in Java and used various UDFs from Piggybank and other sources.
  • Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
  • Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
  • Installed Oozie workflow engine to run multiple Hive and Pig jobs.
  • Experience with Storm for the real-time processing of data.
  • Used Solr to navigate through data sets in the HDFS storage.
  • Loaded log data directly into HDFS using Flume.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data.
  • Stored Solr indexes in HDFS.
  • Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
  • Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats including XML, JSON, CSV, and other compressed formats (an illustrative sketch follows this list).
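
A minimal sketch of the kind of aggregation job described above. The jobs in this role were written in Java; this illustrative stand-in uses Python with Hadoop Streaming, and the input layout (CSV lines of customer_id,event_type,timestamp) is an assumption.

    #!/usr/bin/env python3
    # Hadoop Streaming sketch (Python stand-in for the Java MapReduce jobs described above).
    # Counts events per customer from CSV lines assumed to look like: customer_id,event_type,timestamp
    import sys

    def mapper():
        # Emit "customer_id<TAB>1" for every well-formed record
        for line in sys.stdin:
            parts = line.strip().split(",")
            if len(parts) >= 2:
                print(f"{parts[0]}\t1")

    def reducer():
        # Streaming delivers mapper output sorted by key; sum counts per customer_id
        current_key, count = None, 0
        for line in sys.stdin:
            key, _, value = line.rstrip("\n").partition("\t")
            if key != current_key and current_key is not None:
                print(f"{current_key}\t{count}")
                count = 0
            current_key = key
            count += int(value or 0)
        if current_key is not None:
            print(f"{current_key}\t{count}")

    if __name__ == "__main__":
        # Run via hadoop-streaming.jar, e.g. -mapper "events.py map" -reducer "events.py reduce"
        mapper() if sys.argv[1:2] == ["map"] else reducer()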

Environment: Hadoop, MapReduce, HDFS, Hive, Spark, Pig, Java (JDK 1.6), SQL, Cloudera Manager, Sqoop, Storm, Solr, Mahout, Flume, Oozie, Eclipse

Confidential, CA

Big Data Engineer

Responsibilities:

  • Developed a big data web application in Scala using Agile methodology, taking advantage of Scala’s combination of functional and object-oriented programming.
  • Worked with different data sources such as HDFS, Hive, and Teradata for Spark to process.
  • Used Spark to process the data before ingesting it into HBase; both batch and real-time Spark jobs were created using Scala.
  • Used HBase as the database to store application data, as HBase offers high scalability, distributed NoSQL storage, a column-oriented model, and real-time data querying.
  • Used Kafka, a publish-subscribe messaging system, to ingest data into the application for Spark to process by creating topics with producers and consumers, and created Kafka topics for application and system logs (an illustrative sketch of this ingest pattern follows this list).
  • Utilized the Play framework to build web applications that combine easily with Akka.
  • Configured ZooKeeper to coordinate and support the distributed applications, as it offers high throughput and availability with low latency.
  • Created and updated Terraform scripts to provision the infrastructure and Consul scripts to enable service discovery for the application’s systems.
  • Configured Nginx to serve the static content of the web pages, reducing the load on the web server for static content.
  • Wrote SQL queries to perform CRUD operations on PostgreSQL, saving, updating, and deleting rows in tables using Play Slick.
  • Performed database migrations as needed.
  • Used SBT to build the Scala project.
  • Involved in creating and updating stories for each sprint in Agile and suggested the technical direction for each story.
  • Demoed the application to customers once a month, explaining new features, answering questions that arose from the discussions, and taking suggestions to improve the user experience.
  • Created and updated Jenkins jobs to build pipelines that deploy the application to environments such as develop, QA, and Production.
  • Used Git commands extensively for code check-in.

Environment: Spark, Scala, Python, IntelliJ IDEA, Kafka, Play Framework, Slick, PostgreSQL, AWS CLI, Terraform, Consul, SBT, HBase, Akka.

Confidential, CA

Hadoop Developer

Responsibilities:

  • Used Sqoop to extract data from Oracle and MySQL databases into HDFS.
  • Developed workflows in Oozie for business requirements to extract the data using Sqoop.
  • Developed MapReduce (YARN) jobs for cleaning, accessing, and validating the data.
  • Wrote MapReduce jobs using Pig Latin and optimized the existing Hive and Pig scripts.
  • Used Hive and Impala to query the data in HBase.
  • Wrote Hive scripts in HiveQL to de-normalize and aggregate the data (an illustrative sketch follows this list).
  • Indexed documents in HDFS using Solr Hadoop connectors.
  • Automated the workflows using shell scripts (Bash) to export data from databases into Hadoop.
  • Used the JUnit framework for unit testing of the application.
  • Wrote Hive queries to meet the business requirements.
  • Developed product profiles using Pig and commodity UDFs.
  • Designed workflows by scheduling Hive processes for log file data, which was streamed into HDFS using Flume.
  • Developed schemas to handle reporting requirements using Tableau.
  • Actively participated in weekly meetings with the technical teams to review the code.
  • Involved in loading data from UNIX file system to HDFS.
  • Implemented test scripts to support test driven development and continuous integration.
  • Responsible for managing data coming from different sources.
  • Deep and thorough understanding of ETL tools and how they can be applied in a big data environment.
  • Participated in the requirement gathering and analysis phase of the project, documenting business requirements by conducting workshops/meetings with various business users.
  • Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
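
The de-normalization and aggregation work above was done in HiveQL scripts; purely for illustration, the sketch below issues the same style of query through PySpark’s Hive support, with table and column names (web_logs, customers, etc.) assumed rather than taken from the project.

    # Illustrative sketch: a HiveQL-style de-normalization/aggregation issued through PySpark.
    # The original work used Hive scripts directly; all table and column names are placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-denormalize-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Join the log fact table to a customer dimension and aggregate into a daily summary
    daily_summary = spark.sql("""
        SELECT c.customer_region,
               l.log_date,
               COUNT(*)            AS page_views,
               SUM(l.bytes_served) AS total_bytes
        FROM   web_logs  l
        JOIN   customers c ON l.customer_id = c.customer_id
        GROUP  BY c.customer_region, l.log_date
    """)

    # Persist the de-normalized result as a Hive table for downstream reporting (e.g., Tableau)
    daily_summary.write.mode("overwrite").saveAsTable("reporting.daily_page_views")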

Environment: Hadoop, MapReduce, HiveQL, Hive, HBase, Sqoop, Solr, Cassandra, Flume, Tableau, Impala, Oozie, MySQL, Oracle, Java, Unix Shell, YARN, Pig Latin.

Confidential, OH

Java/J2EE Developer

Responsibilities:

  • Involved in all the phases of SDLC including Requirements Collection, Design & Analysis of the Customer Specifications, Development and Customization of the Application.
  • Developed JSP, JSF and Servlets to dynamically generate HTML and display the data to the client side.
  • Extensively used JSP tag libraries; used Spring Security for authentication and authorization.
  • Designed and developed Application based on Struts Framework using MVC design pattern.
  • Used Struts Validator framework for client-side validations.
  • Used Spring Core for dependency injection/Inversion of control (IOC).
  • Used the Hibernate framework for persistence to the Oracle database.
  • Wrote and debugged the Ant scripts for building the entire web application.
  • Used XML to transfer the application data between client and server.
  • Used XSLT style sheets for XML data transformations.
  • Developed web services in Java and gained experience with SOAP and WSDL.
  • Used Log4j for logging Errors.
  • Used Maven as the build tool.
  • Used Spring Batch for scheduling and maintenance of batch jobs.
  • Deployed the application in various environments: DEV, QA, and Production.
  • Used JDBC for data retrieval from the database for various inquiries.
  • Performed purification of the application database entries using Oracle 10g.
  • Used CVS as source control.
  • Created Application Property Files and implemented internationalization.
  • Used JUnit to write repeatable tests, mainly for unit testing.
  • Involved in the complete development lifecycle under the Agile methodology and tested the application in each iteration.
  • Wrote complex SQL and HQL queries to retrieve data from the Oracle database.
  • Involved in fixing System testing issues and UAT issues.
  • Responsible for the logical dimensional data model and used ETL skills to load the dimensional physical layer from various sources including DB2, SQL Server, Oracle, flat files, etc.
  • Successfully collaborated with business users to capture & define business requirements and contribute to defining the data warehouse architecture (data models, data analysis, data sourcing and data integrity).
  • Analyzed source data for potential data quality issues and addressed these issues in ETL procedures.
  • Developed technical design documents and mappings specifications to build Informatica Mappings to load data into target tables adhering to the business rules.
  • Designed, developed, tested, maintained, and organized complex Informatica mappings, sessions, and workflows.
  • Completed technical documentation to ensure the system was fully documented.
  • Designed and developed ETL with change data capture (CDC) using PowerExchange 9.1 in a mainframe DB2 environment.
  • Created registrations and data maps for the mainframe source.
  • Demonstrated in-depth understanding of Data Warehousing (DWH) and ETL concepts and ETL loading strategy.
  • Worked with SAP Data Services for data quality and data integration.
  • Created a Unix script to identify hanging CDC workflows.
  • Participated in developing PL/SQL procedures and Korn shell scripts to automate daily and nightly loads.
  • Created sequential/concurrent sessions and batches for the data loading process and used pre- and post-session SQL scripts to implement business logic.
  • Extensively used pmcmd commands at the command prompt and executed Unix shell scripts to automate workflows and populate parameter files (an illustrative sketch follows this list).
  • Developed complex mappings using varied transformation logic such as connected and unconnected Lookup, Router, Aggregator, Joiner, and Update Strategy transformations.
  • Worked with mapping, session, worklet, and workflow variables and parameters, running workflows via RHEL Unix shell scripts.
  • Created Informatica PowerExchange restart tokens and enabled recovery for all real-time sessions.
  • Worked on data warehouses with sizes from 2-3 Terabytes.
  • Worked on Teradata utilities like FLOAD, MLOAD and TPUMP to load data to stage and DWH.
  • Worked on Data Modeling using Star/Snowflake Schema Design, Data Marts, Relational and Dimensional Data Modeling, Slowly Changing Dimensions, Fact and Dimensional tables, Physical and Logical data modeling using Erwin.
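
A minimal sketch, assuming placeholder service, folder, and workflow names, of the kind of automation described above: populating a PowerCenter parameter file and starting a workflow with pmcmd. The original automation used Unix shell scripts rather than Python, and the pmcmd options shown are the commonly used ones and should be verified against the installed PowerCenter version.

    #!/usr/bin/env python3
    # Sketch of parameter-file population plus a pmcmd workflow start.
    # The real automation was shell-based; all names, paths, and credentials are placeholders.
    import subprocess
    from datetime import date

    PARAM_FILE = "/opt/infa/params/wf_daily_load.param"   # placeholder path

    def write_parameter_file(load_date: str) -> None:
        # PowerCenter-style parameter file: a [Folder.WF:workflow] header followed by $$ parameters
        with open(PARAM_FILE, "w") as fh:
            fh.write("[DWH_FOLDER.WF:wf_daily_load]\n")
            fh.write(f"$$LOAD_DATE={load_date}\n")

    def start_workflow() -> int:
        # pmcmd startworkflow with commonly used options (verify against your PowerCenter version);
        # in practice credentials would come from a secured variable or vault, not literals
        cmd = [
            "pmcmd", "startworkflow",
            "-sv", "IntegrationService",
            "-d", "Domain_Dev",
            "-u", "etl_user", "-p", "secret",
            "-f", "DWH_FOLDER",
            "-paramfile", PARAM_FILE,
            "-wait",
            "wf_daily_load",
        ]
        return subprocess.run(cmd).returncode

    if __name__ == "__main__":
        write_parameter_file(date.today().isoformat())
        raise SystemExit(start_workflow())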
