Sr. Big Data Architect Resume
Minneapolis, MN
SUMMARY
- Around 10 years of experience with strong emphasis on Design, Development, Implementation, Testing and Deployment of Software Applications.
- 4+ years of comprehensive IT experience in Big Data and Big Data Analytics, including Hadoop, HDFS, MapReduce, YARN, the Hadoop ecosystem and shell scripting.
- 5+ years of development experience using Java, J2EE, JSP and Servlets.
- Highly capable of processing large sets of structured, semi-structured and unstructured data and supporting Big Data applications.
- Hands-on experience with Hadoop ecosystem components such as MapReduce (processing), HDFS (storage), YARN, Sqoop, Pig, Hive, HBase, Oozie, ZooKeeper and Spark for data storage and analysis.
- Expertise in transferring data between Hadoop and structured storage in an RDBMS such as MySQL, Oracle, Teradata and DB2 using Sqoop.
- Experienced in distributed computing architectures such as AWS products (e.g. EC2, Redshift, EMR and Elasticsearch), Hadoop, Python and Spark, with effective use of MapReduce, SQL and Cassandra to solve big data problems.
- Experience with NoSQL databases such as MongoDB, HBase and Cassandra.
- Hands-on experience in installing, configuring and using Apache Hadoop ecosystem components including HDFS, MapReduce, Pig, Hive, HBase, Apache Crunch, ZooKeeper, Sqoop, Hue, Scala, Solr, Git, Maven, Avro, JSON and Chef.
- Experience with Apache Spark clusters and stream processing using Spark Streaming.
- Expertise in moving large volumes of log, streaming event and transactional data using Flume.
- Good experience in data manipulation and system management using Python scripts.
- Excellent understanding of Hadoop architecture and components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MRv1 and MRv2 (YARN).
- Experience in developing MapReduce jobs in Java for data cleaning and preprocessing.
- Expertise in writing Pig Latin and Hive scripts and extending their functionality with User Defined Functions (UDFs).
- Expertise in organizing data layouts in Hive using partitions and bucketing (see the sketch after this list).
- Expertise in preparing interactive data visualizations from different sources using Tableau.
- Hands-on experience developing Oozie workflows that execute MapReduce, Sqoop, Pig, Hive and shell script actions.
- Experience working with Cloudera Hue Interface and Impala.
- Hands-on experience developing Solr indexes using the MapReduce Indexer Tool, and working with the Akka, Play and Scalaz libraries.
- Experience designing and developing POCs using Scala, Spark SQL and the MLlib libraries, deployed on YARN clusters.
- Expertise in Object-Oriented Analysis and Design (OOAD) using UML and various design patterns.
- Experience in Java, JSP, Servlets, EJB, WebLogic, WebSphere, Hibernate, Spring, JBoss, JDBC, RMI, JavaScript, Ajax, jQuery, XML and HTML.
- Fluent with core Java concepts such as I/O, multi-threading, exceptions, regular expressions, data structures and serialization.
- Extensive experience in Java and J2EE technologies including Servlets, JSP, JSF, JDBC, JavaScript, ExtJS, Spring, Hibernate and JUnit testing.
- Performed unit testing with the JUnit framework and used Log4j to monitor error logs.
- Experience in process Improvement, Normalization/De-normalization, Data extraction, cleansing and Manipulation.
- Experience converting requirement specifications and source system understanding into conceptual, logical and physical data models and data flow diagrams (DFDs).
- Expertise in working with transactional databases such as Oracle, SQL Server, MySQL and DB2.
- Expertise in developing SQL queries and stored procedures, and excellent development experience with Agile methodology.
- Ability to adapt to evolving technology, with a strong sense of responsibility and accomplishment.
- Excellent leadership, interpersonal, problem-solving and time management skills.
- Excellent communication skills, both written (documentation) and verbal (presentation).
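A minimal sketch in Scala of the Hive-style partitioned and bucketed layout mentioned above; the table, column names and HDFS path are illustrative assumptions rather than project specifics:

```scala
// Sketch: write a dataset with a Hive-style partitioned and bucketed layout.
// Table, columns and path below are hypothetical.
import org.apache.spark.sql.SparkSession

object PartitionedLayoutSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioned-layout-sketch")
      .enableHiveSupport()            // assumes a configured Hive metastore
      .getOrCreate()

    val sales = spark.read
      .option("header", "true")
      .csv("/data/staging/sales")     // hypothetical HDFS staging path

    // Partition by date for partition pruning; bucket by customer_id for joins.
    sales.write
      .partitionBy("order_date")
      .bucketBy(16, "customer_id")
      .sortBy("customer_id")
      .format("parquet")
      .saveAsTable("analytics.sales_bucketed")

    spark.stop()
  }
}
```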
TECHNICAL SKILLS
Technology: Hadoop Ecosystem, J2SE/J2EE, Oracle.
Big Data Technologies: Hadoop, HDFS, MapReduce, Hive, Pig, HBase, Spark, Scala, Impala, Kafka, Hue, Sqoop, Oozie, Flume, Zookeeper, Cassandra, Cloudera CDH5, Python, PySpark, Solr and Hortonworks.
DBMS/Databases: Oracle, MySQL, SQL Server, DB2, MongoDB, Teradata, HBase, Cassandra.
Programming Languages & Web Technologies: C, C++, Java (J2SE), XML, JSP/Servlets, Struts, Spring, HTML, JavaScript, jQuery, Web Services.
Big Data Ecosystem: HDFS, MapReduce, Oozie, Hive, Pig, Sqoop, Flume, Zookeeper, HBase, Storm, Kafka, Spark, Scala.
Methodologies: Agile, Waterfall.
NOSQL Databases: Cassandra, MongoDB, HBase.
Version Control Tools: SVN, CVS, VSS, PVCS.
Reporting Tools: Crystal Reports, SQL Server Reporting Services and Data Reports, Business Intelligence and Reporting Tool (BIRT)
PROFESSIONAL EXPERIENCE
Confidential, Minneapolis MN
Sr. Big Data Architect
Responsibilities:
- Implemented solutions for ingesting data from various sources and processing the data at rest utilizing Big Data technologies such as Hadoop, the MapReduce framework, HBase and Hive.
- Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehouse and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability and performance, and to provide meaningful and valuable information for better decision-making.
- Used Sqoop to efficiently transfer data between databases and HDFS and used Flume to stream the log data from servers.
- Designed AWS architecture, cloud migration, AWS EMR, DynamoDB, Redshift and event processing using Lambda functions.
- Implemented an enterprise-grade platform (MarkLogic) for ETL from mainframe to NoSQL (Cassandra).
- Involved in developing Spark code using Scala and Spark SQL for faster testing and processing of data, and in optimizing it using SparkContext, Spark SQL, pair RDDs and Spark on YARN.
- Responsible for importing log files from various sources into HDFS using Flume and Worked on tools Flume, Storm and Spark.
- Involved in AWS, implementing solutions using services like (EC2, S3, RDS, Redshift, and VPC).
- Imported millions of rows of structured data from relational databases using Sqoop, processed them with Spark and stored the data in HDFS in CSV format.
- Identified query duplication, complexity and dependencies to minimize migration effort. Technology stack: Oracle, Hortonworks HDP cluster, Attunity Visibility, Cloudera Navigator Optimizer, AWS Cloud and DynamoDB.
- Developed a Spark Streaming application to pull data from the cloud into Hive tables and used Spark SQL to process large volumes of structured data.
- Wrote programs in Scala using Spark and worked on migrating MapReduce programs to Spark using Scala.
- Assigned names to columns using Scala case classes (see the sketch after this list) and implemented a Spark GraphX application to analyze guest behavior for data science segments.
- Automated cloud deployments using Chef, Python and AWS CloudFormation templates.
- Involved in analyzing and optimizing RDDs by controlling partitions for the given data, and experienced in writing business analytics scripts using Hive SQL.
- Implemented continuous integration & deployment (CICD) through Jenkins for Hadoop jobs.
- Wrote Hadoop jobs for analyzing data using Hive and Pig, accessing text, sequence and Parquet files.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Developed automation scripting in Python using Chef to deploy and manage applications.
- Extensively worked with Spark Streaming (version 1.5.2) through the core Spark API in Scala and Java to transform raw data from several data sources into baseline data.
- Worked with different Hadoop distributions: Cloudera (CDH3 & CDH4), Hortonworks (HDP) and MapR.
- Developed a prototype for Big Data analysis using Spark, RDDs, DataFrames and the Hadoop ecosystem with CSV, JSON and Parquet files on HDFS.
- Used Hive to analyze data ingested into HBase by using Hive-HBase integration and compute various metrics for reporting on the dashboard.
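A minimal sketch of the case-class approach to naming columns on Sqoop-imported CSV data referenced above; the record layout, path and view name are assumptions for illustration:

```scala
// Sketch: name the columns of Sqoop-imported CSV data with a Scala case class
// and query it with Spark SQL. Record layout and paths are hypothetical.
import org.apache.spark.sql.SparkSession

case class Guest(guestId: Long, segment: String, visits: Int)

object GuestIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("guest-ingest-sketch").getOrCreate()
    import spark.implicits._

    // CSV files landed in HDFS by a Sqoop import (path is an assumption).
    val guests = spark.read
      .csv("/data/sqoop/guests")
      .map(row => Guest(row.getString(0).toLong, row.getString(1), row.getString(2).toInt))

    guests.createOrReplaceTempView("guests")
    spark.sql("SELECT segment, COUNT(*) AS cnt FROM guests GROUP BY segment").show()

    spark.stop()
  }
}
```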
Environment: Big Data, Spark, YARN, Hive, Pig, Scala, Python, Chef, Hadoop, AWS, DynamoDB, Kibana, Cloudera, AWS EMR, AWS S3, JDBC, Redshift, NoSQL, Sqoop, MySQL, Cassandra, MongoDB, HBase, Data Warehouse, ETL and Hadoop Framework.
Confidential, Irving TX
Sr. Hadoop/Big Data Architect
Responsibilities:
- Performed performance tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log files.
- Involved in the complete Big Data flow of the application, from data ingestion upstream to HDFS through processing and analyzing the data in HDFS, and contributed to low-level design for MapReduce, Hive, Impala and shell scripts to process data.
- Handled Hive queries using Spark SQL integrated with the Spark environment and implemented in Scala.
- Used the Spark Streaming API with Kafka to build live dashboards (see the sketch after this list); worked on transformations and actions on RDDs, Spark Streaming, pair RDD operations, checkpointing and SBT.
- Implemented a POC to migrate MapReduce jobs to Spark RDD transformations using the Scala IDE for Eclipse.
- Creating Hive tables to import large data sets from various relational databases using Sqoop and export the analyzed data back for visualization and report generation by the BI team.
- Installing and configuring Hive, Sqoop, Flume, Oozie on the Hadoop clusters and involved in scheduling Oozie workflow engine to run multiple Hive and Pig jobs.
- Developed a process for batch ingestion of CSV files and Sqoop imports from different sources, and generated views on the data sources using shell scripting and Python.
- Integrated a shell script to create collections/morphlines and Solr indexes on top of table directories using the MapReduce Indexer Tool within the batch ingestion framework.
- Extended Hive and Pig core functionality with custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs) and User Defined Aggregating Functions (UDAFs) written in Python.
- Involved in converting Hive/SQL queries into Spark transformations using Spark SQL, Python and Scala.
- Configured the Message Driven Beans (MDB) for messaging to different clients and agents who are registered with the system.
- Involved in the end-to-end process of Hadoop jobs that used technologies such as Sqoop, Pig, Hive, MapReduce, Spark and shell scripts (for scheduling a few jobs); extracted and loaded data into a data lake environment (Amazon S3) using Sqoop, where it was accessed by business users and data scientists.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Improved the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
- Worked with teams in setting up AWS EC2 instances by using different AWS services like S3, EBS, Elastic Load Balancer, and Auto scaling groups, VPC subnets and CloudWatch.
- Developed Hive Scripts to create the views and apply transformation logic in the Confidential Database.
- Involved in the design of Data Mart and Data Lake to provide faster insight into the Data.
- Used the StreamSets Data Collector tool and created data flows for one of the streaming applications.
- Wrote Python code using the HappyBase library to connect to HBase and also used HAWQ for querying.
- Used Kafka as a data pipeline between JMS (producer) and a Spark Streaming application (consumer), and implemented partitioning, dynamic partitions and buckets in Hive.
- Involved in the development of a Spark Streaming application for one of the data sources using Scala and Spark by applying transformations.
- Developed web services and web service clients using both SOAP and REST implementations, and designed and developed web-based applications using Hibernate, XML, EJB and SQL to set up new web services.
- Worked on Apache Spark, writing Python applications to parse and convert .txt and .xls files.
- Developed a script in Scala to read all the Parquet tables in a database and parse them as JSON files; worked with JNI, Swing, HawtJNI, JNA, BridJ and JNAerator.
- Designed and Maintained Oozie workflows to manage the flow of jobs in the cluster.
- Configured ZooKeeper for cluster coordination services and distributed messaging with Spray and RabbitMQ.
- Developed a unit test script to read a Parquet file for testing PySpark on the cluster.
- Involved in exploring new technologies such as AWS, Apache Flink and Apache NiFi that can increase business value.
- Designed and implemented ETL processes using Informatica, and worked extensively with Sqoop to import and export data between HDFS and relational database systems/mainframes, loading data into HDFS.
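A minimal sketch of a Spark Streaming consumer reading from Kafka, as referenced in the live-dashboard bullet above; the broker, topic and output path are assumptions:

```scala
// Sketch: consume a Kafka topic with the Spark Streaming direct approach and
// persist micro-batches to HDFS. Broker, group id, topic and path are hypothetical.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-stream-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",        // assumption
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "dashboard-consumer",  // assumption
      "auto.offset.reset"  -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Persist each micro-batch of message values for downstream Hive tables.
    stream.map(_.value).foreachRDD { rdd =>
      if (!rdd.isEmpty()) rdd.saveAsTextFile(s"/data/streams/events/${System.currentTimeMillis}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```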
Environment: Hadoop, HDFS, MapReduce, Hive, HBase, Zookeeper, Impala, Java (JDK 1.6), Cloudera, Hortonworks, Oracle, SQL Server, UNIX Shell Scripting, Flume, Oozie, Scala, Spark, ETL, Sqoop, Python, Kafka, PySpark, AWS, S3, MongoDB, SQL, XML.
Confidential, Burlington NJ
Sr. Hadoop Admin/Developer
Responsibilities:
- Responsible for Writing MapReduce jobs to perform operations like copying data on HDFS and defining job flows on EC2 server, load and transform large sets of structured, semi-structured and unstructured data.
- Developed a process for Sqooping data from multiple sources such as SQL Server, Oracle and Teradata, and was responsible for creating the source-to-destination field mapping document.
- Developed a shell script to create staging, landing tables with the same schema like the source and generate the properties which are used by Oozie jobs.
- Developed Oozie workflows for executing Sqoop and Hive actions, and worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Performance optimizations on Spark/Scala. Diagnose and resolve performance issues.
- Responsible for developing Python wrapper scripts which will extract specific date range using Sqoop by passing custom properties required for the workflow.
- Developed scripts to run Oozie workflows, capture the logs of all jobs that run on cluster and create a metadata table which specifies the execution times of each job.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
- Extracted the feeds from social media sites such as Facebook, Twitter using Python scripts.
- Configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS.
- Involved in creating a Data Lake by extracting the customer's Big Data from various data sources into Hadoop HDFS, including data from Excel, flat files, Oracle, SQL Server, MongoDB, Cassandra, HBase, Teradata and Netezza, as well as log data from servers.
- Developed MapReduce (YARN) jobs for cleaning, accessing and validating the data and Installed Oozie workflow engine to run multiple Hive and Pig jobs.
- Developed Python scripts and UDFs using both DataFrames/SQL and RDDs/MapReduce in Spark for data aggregation and queries, and wrote data back into an RDBMS through Sqoop.
- Performed data synchronization between EC2 and S3, Hive stand-up, and AWS profiling.
- Worked on the implementation of a log producer in Scala that watches application logs, transforms incremental logs and sends them to a Kafka- and ZooKeeper-based log collection platform (see the sketch after this list).
- Developed Hive scripts for performing transformation logic and also loading the data from staging zone to final landing zone.
- Worked on Parquet File format to get a better storage and performance for publish tables and involved in loading transactional data into HDFS using Flume for Fraud Analytics.
- Developed Python utility to validate HDFS tables with source tables and designed and developed UDF'S to extend the functionality in both PIG and HIVE.
- Imported and exported data between MySQL and HDFS using Sqoop on a regular basis.
- Responsible for developing multiple Kafka producers and consumers from scratch as per the software requirement specifications.
- Developed MapReduce Jobs in data cleanup, validating and to perform ETL and wrote Hive/Impala queries for ad-hoc reporting, summarizations and ETL.
- Automated all the jobs for pulling data from FTP server to load data into Hive tables using Oozie workflows and Involved in using CA7 tool to setup dependencies at each level (Table Data, File and Time).
- Involved in developing Spark code using Scala and Spark SQL for faster testing and processing of data, and in optimizing it using SparkContext, Spark SQL, pair RDDs and Spark on YARN.
- Migrated the required data from Oracle and MySQL into HDFS using Sqoop and imported various formats of flat files into HDFS.
- Developed scripts to automate routine DBA tasks using Linux Shell scripts and Python.
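A minimal sketch of a Kafka log producer in Scala, as referenced above; the broker, topic and log path are assumptions, and the real pipeline would watch the log directory rather than read a single file:

```scala
// Sketch: a small Kafka producer in Scala that publishes log lines to a topic.
// Broker address, topic name and log path are hypothetical.
import java.util.Properties
import scala.io.Source
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")  // assumption
    props.put("key.serializer",   "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("acks", "all")

    val producer = new KafkaProducer[String, String](props)
    try {
      // Read application log lines and send each one to the "app-logs" topic.
      for (line <- Source.fromFile("/var/log/app/application.log").getLines()) {
        producer.send(new ProducerRecord[String, String]("app-logs", line))
      }
    } finally {
      producer.flush()
      producer.close()
    }
  }
}
```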
Environment: Hadoop, HDFS, MapReduce, Hive, HBase, Kafka, Zookeeper, Oozie, Impala, Java (JDK 1.6), Cloudera, Hortonworks, Oracle, Teradata, SQL Server, MySQL, Python, UNIX Shell Scripting, ETL, Flume, Scala, Spark, Sqoop, AWS, S3, EC2, YARN.
Confidential
Hadoop Administrator/Developer
Responsibilities:
- Responsible for managing, analyzing and transforming petabytes of data, and for quick validation checks on FTP file arrival from the S3 bucket to HDFS.
- Responsible for analyzing large data sets and derive customer usage patterns by developing new MapReduce programs.
- Involved in creation of Hive tables and loading data incrementally into the tables using Dynamic Partitioning and Worked on Avro Files, JSON Records.
- Involved in using Pig for data cleansing and developed Pig Latin scripts to extract the data from web server output files to load into HDFS.
- Worked on distributed frameworks such as Apache Spark and Presto in Amazon EMR and Redshift, and interacted with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
- Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
- Developed customized Hive UDFs and UDAFs in Java, set up JDBC connectivity with Hive, and developed and executed Pig scripts and Pig UDFs.
- Worked on Hive by creating external and internal tables, loading it with data and writing Hive queries.
- Performed offline analysis on HDFS and sent the results to MongoDB databases to update the information in existing tables; the move from Hadoop to MongoDB was done using MapReduce and Hive/Pig scripts connecting through the Mongo-Hadoop connectors.
- Involved in the development and use of UDTFs and UDAFs for decoding log record fields and conversions, generating minute buckets for specified time intervals, and a JSON field extractor.
- Responsible for debugging and optimizing Hive scripts, implementing deduplication logic in Hive using a rank key function (UDF) (see the sketch after this list), and developing Pig and Hive UDFs to analyze complex data and find specific user behavior.
- Experienced in writing Hive Validation Scripts which are used in validation framework (for daily analysis through graphs and presented to business users).
- Developed workflow in Oozie to automate the tasks of loading data into HDFS and pre-processing with Pig and Hive.
- Involved in Cassandra database schema design and pushed data to Cassandra databases using the bulk load utility.
- Responsible for creating Dashboards on Tableau Server and generated reports for hive tables in different scenarios using Tableau
- Responsible for Scheduling using Active Batch jobs and Cron jobs and involved in Jar builds that can be triggered by commits to Github using Jenkins.
- Explored new tools for data tagging, such as Tealium (POC report).
- Actively updated the upper management with daily updates on the progress of project that include the classification levels that were achieved on the data.
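A minimal sketch of the rank-key deduplication idea referenced above, expressed here with Spark window functions in Scala rather than the Hive UDF used on the project; table and column names are assumptions:

```scala
// Sketch: keep only the most recent record per key using a rank over a window.
// Table and column names below are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object DedupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dedup-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val events = spark.table("raw.user_events")   // hypothetical Hive table

    // Rank rows within each (user_id, event_type) key, newest first.
    val byKey = Window.partitionBy("user_id", "event_type").orderBy(col("event_ts").desc)

    events
      .withColumn("rk", row_number().over(byKey))
      .filter(col("rk") === 1)    // retain only the top-ranked (latest) record
      .drop("rk")
      .write.mode("overwrite")
      .saveAsTable("curated.user_events_dedup")

    spark.stop()
  }
}
```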
Environment: Hadoop, Map Reduce, HDFS, Pig, Hive, HBase, Zookeeper, Oozie, Impala, Cassandra, Java (jdk1.6), Cloudera, Oracle 11g/10g, Windows NT, UNIX Shell Scripting, Tableau, Tealium, AWS, S3, SQL, Python.
Confidential
Sr. Java Developer
Responsibilities:
- Developed detail design document based on design discussions and involved in designing the database tables and java classes used in the application.
- Involved in development, Unit testing and system integration testing of the travel network builder side of application.
- Involved in design, development and building the travel network file system to be stored in NAS drives.
- Set up the Linux environment to interact with the Route Smart library (.so) file and NAS drive file operations using JNI.
- Implemented and configured Hudson as the continuous integration server and Sonar for maintaining code quality and removing redundant code.
- Extensively worked with Hibernate Query Language (HQL) to store and retrieve the data from Oracle database.
- Developed Java web applications using JSP, Servlets, Struts, Hibernate, Spring, REST web services and SOAP.
- Provided support in all phases of the software development life cycle (SDLC), quality management systems and project life cycle processes; utilized databases such as MySQL and followed HTTP and WSDL standards to design REST/SOAP-based web APIs using XML, JSON, HTML and DOM technologies.
- Involved in migrating the existing distributed JSP framework to the Struts framework; designed and researched the Struts MVC framework.
- Designed graphical user interface (GUI) applications using HTML, JSP, JavaScript (jQuery), CSS and AJAX.
- Worked with Route-smart C++ code to interact with Java application using SWIG and Java Native interfaces.
- Developed the user interface for requesting a travel network build using JSP and Servlets.
- Built business logic so users can specify which version of the travel network files to use for the solve process.
- Used Spring Data Access Object to access the data with data source and build an independent property sub-system to ensure that the request always picks the latest set of properties.
- Implemented thread Monitor system to monitor threads. Used JUnit to do the Unit testing around the development modules.
- Wrote SQL queries and procedures for the application, interacted with third party ESRI functions to retrieve map data and building and Deployment of JAR, WAR, EAR files on dev, QA servers.
- Provided bug fixing (Log4j for logging) and testing support after development, and prepared requirements and research to move the map data onto the Hadoop framework for future use.
Environment: Java 1.6.21, J2EE, Oracle 10g, Log4j 1.17, Windows 7 and Red Hat Linux, Subversion, Spring 3.1.0, Icefaces 3, ESRI, Weblogic 10.3.5, Eclipse Juno, JUnit 4.8.2, Maven 3.0.3, Hudson 3.0.0 and Sonar 3.0.0, HTML, CSS, JSON, JSP, jQuery, JavaScript.
Software Programmer
Confidential
Responsibilities:
- Involved in the analysis & design of the application using Rational Rose and developed the various action classes to handle the requests and responses.
- Designed and created Java Objects, JSP pages, JSF, JavaBeans and Servlets to achieve various business functionalities and created validation methods using JavaScript and Backing Beans.
- Involved in writing client side validations using JavaScript, CSS.
- Involved in the design of the Referential Data Service module to interface with various databases using JDBC.
- Used Hibernate framework to persist the employee work hours to the database.
- Developed classes and interface with underlying web services layer and prepared documentation and participated in preparing user's manual for the application.
- Prepared Use Cases, Business Process Models and Data flow diagrams, User Interface models.
- Back-end server-side coding and development using Java data structures from the Collections framework (Set, List, Map), exception handling, Vaadin, Spring with dependency injection, the Struts framework, Hibernate, Servlets, Actions, ActionForms and JavaBeans.
- Responsible for enhancing the UI using HTML, JavaScript, XML, JSP and CSS per the requirements, and providing client-side validation using jQuery.
- Involved in writing application-level code to interact with APIs and web services using AJAX, JSON and XML.
- Wrote numerous JSPs for maintenance and enhancement of the application; worked on the front end using Servlets and JSP and on the back end using Hibernate.
- Gathered & analyzed requirements for EAuto, designed process flow diagrams.
- Defined business processes related to the project and provided technical direction to development workgroup.
- Analyzed the legacy system and the Financial Data Warehouse, and participated in database design sessions and database normalization meetings.
- Managed Change Request Management and Defect Management and managed UAT testing and developed test strategies, test plans, reviewed QA test plans for appropriate test coverage.
- Involved in developing JSPs, action classes, form beans, response beans and EJBs, and extensively used XML for configuration files.
- Developed PL/SQL stored procedures, triggers and performed functional, integration, system and validation testing.
Environment: Java, J2EE, JSP, JCL, DB2, Struts, SQL, PL/SQL, Eclipse, Oracle, Windows XP, HTML, CSS, JavaScript, and XML.