
Data Engineer (Spark and Scala)/Hadoop Admin Resume


Raleigh, NC

SUMMARY

  • 7+ years of IT experience in architecture, analysis, design, development, implementation, maintenance, and support, including developing strategic methods for deploying big data technologies to efficiently solve Big Data processing requirements.
  • Experience in test and production environments across business domains such as Financial Services, Insurance, and Banking.
  • Experience in writing distributed Scala code for efficient big data processing.
  • 3 years of experience in Big Data using the Hadoop and Spark frameworks and related technologies such as HDFS, HBase, MapReduce, Hive, Pig, Flume, Oozie, Sqoop, and ZooKeeper.
  • Experience in data analysis using Hive, Pig Latin, HBase, and custom MapReduce programs in Java.
  • Experience in writing custom UDFs in Java to extend Hive and Pig functionality.
  • Experience in writing MapReduce programs in Java for data cleansing and preprocessing.
  • Excellent understanding of Hadoop and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and ResourceManager (YARN).
  • Experience in working with Flume to load log data from multiple sources directly into HDFS.
  • Experience in working with message broker services such as Kafka and Amazon SQS.
  • Experience in real-time data analysis using Spark Streaming and Storm.
  • Worked with different file formats such as flat files, SequenceFile, Avro, and Parquet.
  • Extensive experience with Informatica (ETL tool) for data extraction, transformation, and loading.
  • Extensive experience in building data warehouses/data marts using Informatica PowerCenter (9.0/8.x/7.x).
  • Worked on web services within Informatica and established successful connections to external web service calls.
  • Reporting experience with tools such as Business Objects Crystal Reports XI and XI R2.
  • Around six months of experience designing and developing ETL processes using BO Data Integrator.
  • Experience in design and development of enterprise data warehouse infrastructure processes for extracting, transforming, and loading data.
  • Analyzed source systems and business requirements; identified and documented business rules for decision support systems.
  • Extensively used SQL and PL/SQL for developing procedures, functions, packages, and triggers.
  • Broad experience with reengineering concepts and tools, along with an end-to-end understanding of business concepts.
  • Involved in the analysis of new data feeds and in solution scoping.
  • Experience in developing UNIX shell scripts for automating ETL processes.
  • Experience in developing XML/XSD/XSLT for Informatica source XML files as well as input XML for web service calls.
  • Strong communication skills and demonstrated ability to work effectively with business users.
  • Experience with the Maestro scheduler for scheduling nightly jobs.
  • Experience with Agile methodologies.

TECHNICAL SKILLS

Hadoop Ecosystem: MapReduce, HDFS, Hive, Pig, Sqoop, ZooKeeper, Oozie, Flume, HBase, Spark, Kafka

Language: C, C++, Java, J2EE, Python, Scala, UML

Web Technologies: JavaScript, JSP, Servlets, JDBC, Unix/Linux Shell Scripting, Python, HTML, XML

Methodologies: Waterfall, Agile/Scrum.

Databases: Oracle, MySQL, HBase

Application/Web server: Apache Tomcat, WebSphere and JBoss.

IDEs: Eclipse, NetBeans

ETL & Reporting Tools: Informatica, SAP Business Objects, Tableau

Cloud Infrastructures: Amazon Web Services.

PROFESSIONAL EXPERIENCE

Confidential, Raleigh, NC

Data Engineer (Spark and Scala)/Hadoop Admin

Responsibilities:

  • Developed Spark SQL scripts for data ingestion from Oracle into Spark clusters and for the relevant data joins (see the first sketch after this list).
  • Experience building distributed, high-performance systems using Spark and Scala.
  • Experience developing Scala applications for loading/streaming data into NoSQL databases (MongoDB) and HDFS.
  • Designed distributed algorithms for identifying trends in data and processing them effectively.
  • Used Spark and Scala to develop machine learning algorithms that analyze clickstream data.
  • Experience in developing machine learning code using Spark MLlib.
  • Used Spark SQL for pre-processing, cleaning, and joining very large data sets.
  • Experience in creating a data lake using Spark that feeds downstream applications.
  • Designed and developed Scala workflows to pull data from cloud-based systems and apply transformations to it.
  • Installed and configured multiple nodes on a fully distributed Hadoop cluster.
  • Involved in Hadoop cluster administration, including commissioning and decommissioning nodes, cluster capacity planning, balancing, performance tuning, monitoring, and troubleshooting.
  • Configured the Fair Scheduler to provide service-level agreements for multiple users of a cluster.
  • Implemented Hadoop NameNode HA to make the Hadoop services highly available.
  • Developed a cron job to store NameNode metadata on an NFS-mounted directory.
  • Worked on installing Hadoop ecosystem components such as Sqoop, Pig, Hive, Oozie, and HCatalog.
  • Involved in HDFS maintenance and administration through the Hadoop Java API.
  • Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
  • Proficient in writing Flume and Hive scripts to extract, transform, and load data into the database.
  • Responsible for maintaining, managing, and upgrading Hadoop cluster connectivity and security.
  • Worked on developing machine learning algorithms for analyzing clickstream data using Spark and Scala.
  • Performed database migrations from traditional data warehouses to Spark clusters.
  • Created data workflows and pipelines for migrating data and analyzing trends using Spark MLlib.
  • Set up the entire project on the Amazon Web Services cloud and tuned all algorithms for best performance.
  • Analyzed streaming data and identified important trends for further analysis using Spark Streaming and Storm.
  • Collected and aggregated large amounts of web log data from sources such as web servers and mobile and network devices using Apache Kafka, and stored the data in HDFS for analysis.
  • Experience configuring spouts and bolts in various Apache Storm topologies and validating data in the bolts.
  • Used Spark Streaming to collect data from Kafka in near real time and perform the necessary transformations and aggregations on the fly to build the common learner data model, persisting the data in a NoSQL store (see the second sketch after this list).
  • Populated HDFS and Cassandra with huge amounts of data using Apache Kafka.
  • Batch-loaded data into NoSQL storage such as MongoDB.
  • Implemented Spark RDD transformations and actions to migrate MapReduce algorithms.
  • Used Git to check in and check out code changes.
  • Used Jira for bug tracking.
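
A minimal Spark/Scala sketch of the Oracle-to-Spark SQL ingestion and join pattern referenced above. The connection URL, credentials, table, and path names are illustrative placeholders, not the project's actual values.

```scala
import org.apache.spark.sql.SparkSession

object OracleIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("oracle-ingest-sketch")
      .getOrCreate()

    // Pull the source table over JDBC; the Oracle thin driver must be on the classpath.
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCLPDB")  // placeholder host/service
      .option("dbtable", "SALES.ORDERS")                          // placeholder table
      .option("user", sys.env("ORACLE_USER"))
      .option("password", sys.env("ORACLE_PASSWORD"))
      .option("fetchsize", "10000")
      .load()

    // Join against reference data already stored as Parquet on HDFS.
    val customers = spark.read.parquet("hdfs:///data/reference/customers")
    val enriched  = orders.join(customers, Seq("customer_id"), "left")

    enriched.write.mode("overwrite").parquet("hdfs:///data/curated/orders_enriched")
    spark.stop()
  }
}
```

For larger tables, the JDBC read could be parallelized with the standard partitionColumn/lowerBound/upperBound/numPartitions options.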
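
A hedged sketch of the Kafka-to-Spark Streaming flow described above, using the Kafka 0.10 direct-stream integration. The broker address, topic name, and CSV field layout are assumptions, and the per-batch write to the NoSQL store is only indicated in a comment rather than wired to a specific connector.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object ClickstreamStreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("clickstream-streaming-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka-broker:9092",     // placeholder broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "clickstream-consumers",
      "auto.offset.reset"  -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("web-clicks"), kafkaParams))

    // Lightweight per-batch aggregation: count click events per page.
    val pageCounts = stream
      .map(record => (record.value().split(",")(1), 1L)) // assumes CSV with the page in field 2
      .reduceByKey(_ + _)

    // In the real pipeline each batch would be persisted to the NoSQL store
    // (e.g. via a MongoDB or Cassandra Spark connector); here it is only logged.
    pageCounts.foreachRDD(rdd => rdd.take(20).foreach(println))

    ssc.start()
    ssc.awaitTermination()
  }
}
```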

Environment: Scala, Apache Spark, AWS, Spark MLlib, Spark SQL, PostgreSQL, Hive, MongoDB, Apache Storm, Kafka, Git, Jira

Confidential, Des Moines, IA

Hadoop Developer/Hadoop Admin

Responsibilities:

  • Wrote Apache Pig scripts to process HDFS data.
  • Created Hive tables to store the processed results in a tabular format.
  • Developed Sqoop scripts to enable interaction between Pig and the MySQL database.
  • Involved in requirements gathering, design, development, and testing.
  • Wrote script files for processing data and loading it to HDFS.
  • Stored and retrieved data using HQL in Hive.
  • Developed UNIX shell scripts for creating reports from Hive data.
  • Worked on data ingestion using Kafka, data pipeline architecture, data cleansing, ETL, processing, and some visualization; enabled CDH to consume data from the customer's enterprise tools (sources such as RabbitMQ, IBM MQ, and RDBMSs).
  • Developed use cases with Hive, Pig, Spark, and Spark Streaming; implemented MapReduce jobs to discover interesting patterns in the data.
  • Installed and configured Hadoop cluster in Development, Testing and Production environments.
  • Performed both major and minor upgrades to the existing CDH cluster.
  • Responsible for monitoring and supporting Development activities.
  • Responsible for administering and maintaining applications on a daily basis; prepared the system design document covering all functional implementations.
  • Installed various Hadoop ecosystem components and Hadoop daemons.
  • Installed and configured Sqoop and Flume.
  • Involved in data modeling sessions to develop models for Hive tables.
  • Studied the existing enterprise data warehouse setup and provided design and architecture suggestions for converting it to Hadoop using MapReduce, Hive, Sqoop, and Pig Latin.
  • Developed Java MapReduce programs that included custom data types, input formats, record readers, etc.
  • Involved in writing Flume and Hive scripts to extract, transform, and load data into the database.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Converted ETL logic to Hadoop mappings.
  • Extensive hands-on experience with Hadoop file system commands for file-handling operations.
  • Worked on SequenceFiles, RC files, map-side joins, bucketing, and partitioning for Hive performance and storage improvements (see the sketch after this list).
  • Worked with Sqoop import and export to handle large data set transfers between the DB2 database and HDFS.
  • Used Sentry to control access to databases/data-sets.
  • Worked on Hadoop cluster security and tuned the cluster to meet the necessary performance standards.
  • Configured backups and performed NameNode recoveries from previous backups.
  • Experienced in managing and analyzing Hadoop log files.
  • Provided documentation on the architecture, deployment, and all details the customer would require to run the CDH cluster as part of the "delivery document(s)".
  • Worked with RDBMSs (MySQL and PostgreSQL), with some experience supporting them as backends for the Hive metastore, Cloudera Manager components, Oozie, etc.
  • Provided subject matter expertise on Linux to support running CDH/Hadoop optimally on the underlying OS.
  • Trained customers/partners when required.
  • Analyzed customer requirements and identified how the Hadoop ecosystem could be leveraged to implement them, how CDH could fit into the current infrastructure, where Hadoop could complement existing products, etc.
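
A brief sketch of the partitioning and bucketing approach mentioned above, expressed here through Spark SQL against the Hive metastore for illustration; the database, table, and column names and the storage format are placeholders, and bucketing is noted only in a comment since that would be applied in Hive itself.

```scala
import org.apache.spark.sql.SparkSession

object HivePartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-partition-sketch")
      .enableHiveSupport() // talk to the Hive metastore
      .getOrCreate()

    // Partition by load date so queries can prune whole directories of data.
    // In Hive itself the table would additionally be bucketed, e.g.
    // CLUSTERED BY (customer_id) INTO 32 BUCKETS, to speed up joins and sampling.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS analytics.transactions (
        |  txn_id      BIGINT,
        |  customer_id BIGINT,
        |  amount      DECIMAL(12,2)
        |)
        |PARTITIONED BY (load_date STRING)
        |STORED AS ORC""".stripMargin)

    // The load_date predicate restricts the scan to a single partition.
    val daily = spark.sql(
      """SELECT customer_id, SUM(amount) AS total_amount
        |FROM analytics.transactions
        |WHERE load_date = '2016-03-01'
        |GROUP BY customer_id""".stripMargin)

    daily.show(20)
    spark.stop()
  }
}
```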

Environment: Cloudera Hadoop Framework, MapReduce, Hive, Pig, HBase, Business Objects, Platfora, HParser, Java, Python, UNIX Shell Scripting.

Confidential, Omaha, Nebraska

ETL Developer

Responsibilities:

  • Hands-on experience with star schema modeling, snowflake modeling, fact and dimension tables, and physical and logical data modeling using Erwin.
  • Hands-on experience with dimensional and relational data modeling.
  • Experience in administration activities such as creating and managing repositories, users, user groups, and folders, and working with the administrator functions of Repository Manager.
  • Extensive experience in data warehouse, data mart, and business intelligence work using OLAP/DSS tools.
  • Implemented Ralph Kimball design strategies for metadata.
  • Involved in data modeling, developing logical/physical models and ER diagrams using Erwin.
  • Extensive experience in the client/server technology area with Oracle Database, SQL Server, and PL/SQL for back-end development of packages, stored procedures, functions, and triggers.
  • Involved in the complete software development life cycle (SDLC) of data warehousing and decision support systems.
  • Hands-on experience integrating Informatica with Salesforce data.
  • Expertise in UML, Rational Unified Process (RUP), and Rational Rose.
  • Good knowledge of and experience with PostgreSQL and Salesforce data.
  • Knowledge of Teradata utilities (SQL, BTEQ/BTEQWin, FastLoad, MultiLoad, FastExport, TPump, Queryman, etc.).
  • Expertise in implementing complex business rules by creating complex mappings/mapplets, shortcuts, and reusable transformations and by partitioning sessions.
  • Worked in an onsite/offshore model and led the team.
  • Worked with Agile methodology.
  • Extensive experience in data analysis to improve overall efficiency and support decision-making.
  • Effective problem-solver; organized; team player.

Environment: Informatica, PostgreSQL, Oracle, Teradata, Netezza, DataStage, DB2, Java, PowerCenter, Informatica IDQ, Informatica MDM, UNIX Shell Scripting, Python.

Confidential, Des Moines, IA

ETL Developer

Responsibilities:

  • Converted business rules into ETL technical specifications.
  • Designed Informatica interfaces.
  • Primarily responsible for code reviews.
  • Suggested design changes for DI HUB processes.
  • Came up with the design plan and prepared the ETL design document.
  • Created unit test cases and planned the unit testing approach.
  • Analyzed the data for defects and fixed the defects.
  • Created mappings using Mapping Designer to load data from various sources, using transformations such as Source Qualifier, Expression, Lookup, Aggregator, Update Strategy, Joiner, Normalizer, Filter, Router, and Union.
  • Designed mapplets using Mapplet Designer and used those mapplets for reusable business logic.
  • Implemented the error handling strategy.
  • Created reusable transformations and worklets.
  • Came up with DDL scripts to create Postgres and Salesforce objects in the Oracle DB.
  • Performed a POC for the Liferay, Canvas, and Salesforce ETL move.
  • Came up with high-level and low-level design documents.
  • Primarily responsible for ETL design, coding, and testing strategy.
  • Suggested design changes.
  • Implemented the audit approach.
  • Involved in data analysis.
  • Worked on Postgres and Salesforce connection issues.
  • Worked on extracting Salesforce data to stage on an incremental basis (see the sketch after this list).
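
A hedged illustration of the incremental (delta) extract pattern mentioned above, written as a Spark/Scala sketch purely for illustration; the actual extraction was built in Informatica against Salesforce. The JDBC source standing in for Salesforce, the staging target, the watermark value, and all names are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_date

object IncrementalExtractSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("incremental-extract-sketch")
      .getOrCreate()

    // Watermark from the previous load: only rows modified after this point are pulled.
    // In practice this would come from a control/audit table, not be hard-coded.
    val lastLoaded = "2015-06-01 00:00:00"

    // Placeholder JDBC source standing in for the Salesforce connection.
    val delta = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://source-db:5432/source")   // placeholder URL
      .option("dbtable",
        s"(SELECT * FROM accounts WHERE last_modified > '$lastLoaded') AS delta")
      .option("user", sys.env("SOURCE_USER"))
      .option("password", sys.env("SOURCE_PASSWORD"))
      .load()

    // Append only the changed rows to a staging table, tagged with the load date.
    delta.withColumn("load_date", current_date())
      .write
      .format("jdbc")
      .option("url", "jdbc:postgresql://stage-db:5432/staging")   // placeholder stage DB
      .option("dbtable", "stg_accounts")
      .option("user", sys.env("STAGE_USER"))
      .option("password", sys.env("STAGE_PASSWORD"))
      .mode("append")
      .save()

    spark.stop()
  }
}
```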

Environment: Informatica PowerCenter 9.x, Workflow Designer, DataStage, PowerCenter Designer, Repository Manager, Microsoft SQL Server Management Studio 8.0, SSIS, SSRS, Windows Scripting, Windows NT/2000/XP.

Confidential

ETL Developer

Responsibilities:

  • Involved in requirements review sessions and worked with the data modeler on design changes.
  • Converted business rules into ETL technical specifications.
  • Designed Informatica interfaces.
  • Provided requirements to the team, clarified them, and managed the team.
  • Primarily responsible for design and code reviews.
  • Performed performance tuning and suggested design changes for existing hpXr processes.
  • Came up with the design plan and prepared the ETL design document.
  • Created process and data flow diagrams.
  • Created unit test cases and planned the unit testing approach.
  • Analyzed the data for defects and fixed the defects.
  • Created mappings using Mapping Designer to load data from various sources, using transformations such as Source Qualifier, Expression, Lookup, Aggregator, Update Strategy, Joiner, Normalizer, Filter, Router, and Union.
  • Designed mapplets using Mapplet Designer and used those mapplets for reusable business logic.
  • Implemented the error handling strategy.
  • Created reusable transformations and worklets.
  • Wrote shell scripts to perform pre-session and post-session operations.
  • Coordinated with the middleware team to migrate Informatica objects into the test environment.
  • Performed impact analysis for the hpXr upgrade and testing.
  • Performed impact analysis for the Facets upgrade and testing.
  • Involved in Tidal scheduling sessions and came up with various test cases.
  • Automated the ETL applications using the Tidal tool.
  • Created SCRs and CRs for ETL artifacts.

Environment: Informatica PowerCenter 8.x, Workflow Designer, PowerCenter Designer, Repository Manager, Oracle 10g, SQL, PL/SQL, SQL*Loader, UNIX Shell Scripting, Windows NT/2000/XP, Visio, Business Objects, PVCS, Tidal Enterprise Scheduler, Informatica Data Quality (IDQ), Informatica Data Explorer (IDE), Toad, SQL Developer.
