Sr. Spark AWS / Big Data Developer Resume
Dallas, TX
SUMMARY
- 10+ years of professional software development experience with technical expertise in all phases of the Software Development Life Cycle (SDLC) across various industry sectors, including Big Data analytics frameworks and Java/J2EE technologies.
- 4+ years of industry experience in row key and schema design with NoSQL databases such as MongoDB, HBase, and Cassandra.
- Good experience in building Python REST APIs; knowledgeable in Ab Initio.
- Extensively worked on Spark with Scala on clusters for analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle.
- Excellent Programming skills at a higher level of abstraction using Scala, Java and Python.
- Experience in using DStreams, accumulators, broadcast variables, and RDD caching for Spark Streaming.
- Hands-on experience in developing Spark applications using RDD transformations, Spark Core, Spark MLlib, Spark Streaming, and Spark SQL.
- Knowledge of Ab Initio design and configuration experience in Ab Initio ETL.
- Experience with Azure transformation projects and Azure architecture decision making; architected and implemented ETL and data movement solutions using Azure Data Factory (ADF) and SSIS.
- Optimized Hive queries using best practices and appropriate parameters, and with technologies such as Hadoop, YARN, Python, and PySpark.
- Developed the company's internal CI system, providing a comprehensive API for CI/CD.
- Strong experience and knowledge of real-time data analytics using Spark Streaming, Kafka, and Flume.
- Working knowledge of Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as a storage mechanism.
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for large volumes of data.
- Developed Spark applications in Python using DataFrames and the Spark SQL API for faster data processing (a representative sketch follows this summary).
- Developed PySpark and Spark SQL code to process data in Apache Spark on Amazon EMR, performing the necessary transformations based on the STMs developed.
- Ran Apache Hadoop, CDH, and MapR distributions, and Elastic MapReduce (EMR) on EC2.
- Expertise in developing Pig Latin scripts and Hive Query Language.
- Developed customized UDFs and UDAFs in Java to extend Hive and Pig core functionality.
- Created Hive tables to store structured data into HDFS and processed it using HiveQL.
- Experience in validating and cleansing data using Pig statements, and hands-on experience in developing Pig macros.
- Working knowledge of installing and maintaining Cassandra by configuring the cassandra.yaml file per business requirements, and performed reads/writes using Java JDBC connectivity.
- Written multiple MapReduce jobs using the Java API, Pig, and Hive for data extraction, transformation, and aggregation from multiple file formats, including Parquet, Avro, XML, JSON, CSV, and ORC, as well as compression codecs such as Zip, Snappy, and LZO.
- Strong hands-on experience in AWS Glue.
- Identified improvements to enhance CI/CD.
- Good experience in optimizing MapReduce algorithms using mappers, reducers, combiners, and partitioners to deliver the best results for large datasets.
- Good knowledge of build and logging tools such as Maven, Ant, and Log4j.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Hands-on experience using various Hadoop distributions: Cloudera (CDH 4/CDH 5), Hortonworks, MapR, IBM BigInsights, Apache, and Amazon EMR.
- Expertise in the analysis, design, and development of custom solutions/applications using the Microsoft Azure technology stack, primarily Virtual Machines, Azure Data Factory, and Azure Databricks.
- Development-level experience in Microsoft Azure, PowerShell, Python, Azure Data Factory, and Databricks.
- Experienced in writing ad hoc queries using Cloudera Impala and in using Impala analytical functions.
- In-depth understanding of Hadoop architecture and its components, such as HDFS, the MapReduce programming paradigm, and the YARN architecture.
- Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
- Proficient in developing, deploying, and managing Solr from development to production.
- Used project management services like JIRA for tracking issues and GitHub for code reviews, and worked with version control tools like CVS, Git, and SVN.
- Hands-on knowledge of core Java concepts such as exceptions, collections, data structures, I/O, multithreading, and serialization/deserialization for streaming applications.
- Hands-on experience with NoSQL databases like HBase, Cassandra, and MongoDB.
- Experience in the design, development, and implementation of client/server web-based applications using JSTL, jQuery, JavaScript, JavaBeans, JDBC, Struts, PL/SQL, SQL, HTML, CSS, PHP, XML, and AJAX, with a bird's-eye view of the React JavaScript library.
- Experience in maintaining an Apache Tomcat, MySQL, LDAP, and web service environment.
- Designed ETL workflows in Tableau and deployed data from various sources to HDFS.
- Performed clustering, regression, and classification using machine learning libraries Mahout and Spark MLlib.
- Good experience with use-case development and software methodologies like Agile and Waterfall.
- Proven ability to manage all stages of project development, with strong problem-solving and analytical skills and the ability to make balanced, independent decisions.
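For illustration only, the sketch below shows the general shape of the PySpark DataFrame / Spark SQL work referenced in this summary. It is a minimal, hedged example: the bucket paths, column names, and schema are hypothetical and not drawn from any specific project.

```python
# Illustrative PySpark batch job: read JSON, transform with the DataFrame API,
# query with Spark SQL, and persist Parquet. All paths/columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("sample-transform")
         .getOrCreate())

# Read semi-structured JSON events (hypothetical location and schema).
events = spark.read.json("s3://example-bucket/raw/events/")

# DataFrame API: filter, derive a date column, and aggregate.
daily = (events
         .filter(F.col("status") == "OK")
         .withColumn("event_date", F.to_date("event_ts"))
         .groupBy("event_date", "device_id")
         .agg(F.count("*").alias("event_count")))

# The same work can also be expressed through the Spark SQL API.
events.createOrReplaceTempView("events")
top_devices = spark.sql("""
    SELECT device_id, COUNT(*) AS event_count
    FROM events
    WHERE status = 'OK'
    GROUP BY device_id
    ORDER BY event_count DESC
    LIMIT 10
""")
top_devices.show()

# Persist results as Parquet for downstream consumers.
daily.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_counts/")
spark.stop()
```

A job like this would typically be packaged and submitted with spark-submit on an EMR or on-premises YARN cluster.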
TECHNICAL SKILLS
Hadoop Technologies and Distributions: HDP, Cloudera
Hadoop Ecosystem: HDFS, Hive, Pig, Sqoop, Oozie, Flume, Spark, Zookeeper, MapReduce, Spark SQL, Spark Streaming and Spark MLlib
NoSQL Databases: HBase, Cassandra
Programming: C, C++, Python, Java, Scala, PL/SQL, SBT, Maven
RDBMS: ORACLE, MySQL, SQL Server
Web Development: HTML, JSP, Servlets, JavaScript, CSS, XML
IDE: Eclipse 4.x, NetBeans, Microsoft Visual Studio
Operating Systems: Linux (Red Hat, CentOS), Windows XP/7/8 and z/OS (Mainframes)
Web Servers: Apache Tomcat
Cluster Management Tools: Cloudera Manager, Hortonworks Ambari and Hadoop Security Tools
PROFESSIONAL EXPERIENCE
Confidential, Dallas TX
Sr. Spark AWS /Big Data Developer
Responsibilities:
- Developed Spark applications using Scala and Java and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Developed PySpark and Spark SQL code to process data in Apache Spark on the data lake, AWS Glue, and Amazon EMR, performing the necessary transformations based on the STMs developed.
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Strong hands-on experience in AWS Glue.
- Developed the company's internal CI system, providing a comprehensive API for CI/CD.
- Involved in Ab Initio design and configuration for Ab Initio ETL and data mapping.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions for building a common learner data model that gets data from Kafka in near real time and persists it to Cassandra (a representative sketch follows this list).
- Developed Kafka consumer API in Scala for consuming data from Kafka topics.
- Optimized Hive queries using best practices and appropriate parameters, and with technologies such as Hadoop, YARN, Python, and PySpark.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Hadoop, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Consumed XML messages using Kafka and processed the XML files using Spark Streaming to capture UI updates; developed Spark applications in Python using DataFrames and the Spark SQL API for faster data processing.
- Identified improvements to enhance CI/CD.
- Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files.
- Loaded DStream data into Spark RDDs and performed in-memory data computation to generate the output response.
- Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
- Expertise in the analysis, design, and development of custom solutions/applications using the Microsoft Azure technology stack, primarily Virtual Machines, Azure Data Factory, and Azure Databricks.
- Development-level experience in Microsoft Azure, PowerShell, Python, Azure Data Factory, and Databricks.
- Experience with Azure transformation projects and Azure architecture decision making; architected and implemented ETL and data movement solutions using Azure Data Factory (ADF) and SSIS.
- Experienced in writing live real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
- Migrated an existing on-premises application to AWS; used AWS services like EC2 and S3 for small-dataset processing and storage, and maintained the Hadoop cluster on AWS EMR.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Good understanding of Cassandra architecture, replication strategy, gossip, snitches etc.
- Designed column families in Cassandra, ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra per the business requirement.
- Used the DataStax Spark Cassandra Connector to load data to and from Cassandra.
- Worked from scratch on Kafka configuration, including managers and brokers.
- Experienced in creating data models for clients' transactional logs; analyzed data from Cassandra tables for quick searching, sorting, and grouping using the Cassandra Query Language (CQL).
- Tested cluster performance using the cassandra-stress tool to measure and improve reads/writes.
- Used HiveQL to analyze partitioned and bucketed data, and executed Hive queries on Parquet tables stored in Hive to perform data analysis meeting the business specification logic.
- Used Apache Kafka to aggregate web log data from multiple servers and make it available in downstream systems for data analysis and engineering roles.
- Worked on implementing Kafka security and boosting its performance.
- Experience in using Avro, Parquet, RCFile and JSON file formats, developed UDF in Hive and Pig.
- Developed Custom Pig UDF in Java and used UDFs from PiggyBank for sorting and preparing the data.
- Developed custom loaders and storage classes in Pig to work with several data formats such as JSON, XML, and CSV, and generated bags for processing in Pig.
- Developed Sqoop and Kafka jobs to load data from RDBMS and external systems into HDFS and Hive.
- Developed Oozie coordinators to schedule Pig and Hive scripts to create Data pipelines.
- Written several MapReduce jobs using the Java API; also used Jenkins for continuous integration.
- Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Hive, Pig, and MapReduce cluster access for new users.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
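The sketch below illustrates the Kafka-to-Cassandra pipeline shape described above. It is a hedged example using Structured Streaming rather than the DStream API mentioned in the bullets; the topic, broker, keyspace, table, schema, and host names are hypothetical, and it assumes the spark-sql-kafka-0-10 and DataStax spark-cassandra-connector packages are supplied via --packages at submit time.

```python
# Sketch: consume JSON events from Kafka and persist them to Cassandra per
# micro-batch. All names (topic, keyspace, table, columns) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = (SparkSession.builder
         .appName("kafka-to-cassandra")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

schema = StructType([
    StructField("learner_id", StringType()),
    StructField("score", DoubleType()),
    StructField("event_ts", TimestampType()),
])

# Read events from a Kafka topic in near real time.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "learner-events")
       .load())

# Kafka values arrive as bytes; cast and parse the JSON payload.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("e"))
          .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # DataStax connector DataFrame write, applied to each micro-batch.
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="learning", table="learner_events")
     .mode("append")
     .save())

query = (parsed.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "/tmp/checkpoints/learner-events")
         .start())
query.awaitTermination()
```

With the DStream API referenced above, the equivalent flow would typically use a Kafka direct stream and the connector's RDD save support instead of foreachBatch.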
Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Java (JDK SE 6, 7), Scala, shell scripting, Linux, MySQL, Oracle Enterprise DB, Solr, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, Docker, SOAP, Cassandra and Agile methodologies.
Confidential, Dallas Texas
Sr. Spark/AWS Developer
Responsibilities:
- Wrote a Spark job that reads sensor data from Kafka, applies a machine learning model, and persists the data with predictions in a Cassandra database.
- Worked on building a Spark workflow (with multiple Spark sources and sinks) that helps data analysts run the predictive maintenance workflow.
- Extensively worked with Kafka for reading and producing data from different sources such as IoT devices, sensors, and OBD devices.
- Worked with different classification, regression, and clustering algorithms.
- Maintained the models in HDFS, made predictions on the live streaming sensor data, and stored the predicted values for visualization purposes (see the sketch after this list).
- Implemented multiple use cases with Spark Core, Spark SQL, Spark Streaming, and Spark ML.
- Migrated from the Azure platform to the AWS platform per business requirements.
- Implemented ProcessBuilder-based Spring services and controllers to trigger spark-submit (start and stop jobs) through REST calls.
- Worked on the ELK stack for log monitoring, analyzing Spark logs and logs from different sensor devices.
- Worked on ETL phases of the data such as data cleansing, massaging, and cleanup, filtering the data useful for model building.
- Involved in different phases of model-building activities, such as verifying the performance of a model by comparing different features of a problem statement.
- Involved in performance tuning of the model by verifying model results over time, working with different attributes and features to get better performance, and in model-rebuilding activities.
- Worked with Avro, Parquet, and other file formats; involved in Amazon EMR cluster management and the data lake.
- Deployed Spark jobs, monitored their status in the cluster, and analyzed the EMR logs after job processing completed.
- Good knowledge of Microsoft Azure ML Studio; delivered multiple POCs to clients for different problem statements.
- Good knowledge of the AWS S3 file system; stored the models in S3 buckets and ran predictions from there.
- Good knowledge of visualization tools like Kibana and Tableau.
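The sketch below is a minimal illustration of the scoring step described above: a previously trained Spark ML PipelineModel is loaded from HDFS and applied to sensor readings, and the predictions are written to Cassandra. Paths, keyspace, table, and column names are hypothetical, and the Cassandra write assumes the DataStax connector is on the classpath.

```python
# Hedged scoring sketch: load a persisted PipelineModel, score sensor data,
# and persist predictions for visualization. All names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = (SparkSession.builder
         .appName("sensor-scoring")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

# Trained model persisted in HDFS (hypothetical path).
model = PipelineModel.load("hdfs:///models/predictive_maintenance/v1")

# Batch of sensor readings to score (hypothetical location and schema).
readings = spark.read.parquet("hdfs:///data/sensors/incoming/")

# Apply the fitted pipeline; "prediction" is the default output column name.
predictions = (model.transform(readings)
               .select("device_id", "event_ts", "prediction"))

# Persist the predicted values (requires the spark-cassandra-connector).
(predictions.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="maintenance", table="predictions")
 .mode("append")
 .save())

spark.stop()
```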
Environment: JDK 1.8, Spring with REST API, Spring Boot, Spring Data JPA, MySQL, Hadoop 2.x, HDFS, MapReduce, Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Cassandra, Elasticsearch, ELK stack, Bitbucket, Gradle, Oozie.
Confidential, PA
Hadoop/Spark Developer
Responsibilities:
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using python (PySpark).
- Developed Spark jobs using Scala on top of Yarn/MRv2 for interactive and Batch Analysis.
- Experienced in querying data using Spark SQL on top of the Spark engine for faster dataset processing.
- Worked on implementing the Spark Framework, a Java-based web framework.
- Worked with and learned a great deal from AWS cloud services like EC2, S3, EBS, RDS, and VPC.
- Implemented Elasticsearch on the Hive data warehouse platform.
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for large volumes of data.
- Worked with Elastic MapReduce and set up Hadoop environments on AWS EC2 instances.
- Written Java code to format XML documents and uploaded them to the Solr server for indexing.
- Optimized HiveQL scripts by using execution engines such as Tez.
- Worked on ad hoc queries, indexing, replication, load balancing, and aggregation in MongoDB.
- Processed web server logs by developing multi-hop Flume agents using the Avro sink, loaded them into MongoDB for further analysis, and also extracted files from MongoDB through Flume and processed them.
- Expert knowledge of MongoDB and NoSQL data modeling, tuning, and disaster recovery backup; used it for distributed storage and processing with CRUD operations.
- Extracted and restructured data into MongoDB using the import and export command-line utility tools.
- Experience in setting up a fan-out workflow in Flume to design a V-shaped architecture that takes data from many sources and ingests it into a single sink.
- Experience in creating, dropping, and altering tables at runtime without blocking updates and queries, using HBase and Hive.
- Hands-on experience with NoSQL databases like HBase, Cassandra, and MongoDB.
- Experience working with different join patterns; implemented both map-side and reduce-side joins.
- Wrote Flume configuration files for importing streaming log data into HBase with Flume.
- Imported several transactional logs from web servers with Flume to ingest the data into HDFS, using Flume and the spooling directory source for loading data from the local file system (LFS) to HDFS.
- Installed and configured Pig and wrote Pig Latin scripts to convert data from text files to Avro format.
- Created partitioned Hive tables and worked on them using HiveQL (a representative sketch follows this list).
- Loaded data into HBase using bulk load and non-bulk load.
- Worked on the continuous integration tool Jenkins and automated JAR builds at the end of each day.
- Worked with Tableau; integrated Hive with Tableau Desktop reports and published them to Tableau Server.
- Developed MapReduce programs in Java for parsing the raw data and populating staging Tables.
- Experience in setting up the whole app stack; set up and debugged Logstash to send Apache logs to AWS Elasticsearch.
- Used Zookeeper to coordinate the servers in clusters and to maintain the data consistency.
- Experienced in designing RESTful services using Java-based APIs like Jersey.
- Used Oozie operational services for batch processing and scheduling workflows dynamically.
- Supported setting up the QA environment and updating configurations for implementing scripts with Pig, Hive, and Sqoop.
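The sketch below illustrates querying a partitioned Hive table through Spark SQL, as referenced in the bullet above. The table name, partition column, and filter values are hypothetical, and it assumes Spark is built with Hive support and can reach the Hive metastore.

```python
# Hedged sketch: Spark SQL over a partitioned Hive table (hypothetical names).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partition-query")
         .enableHiveSupport()
         .getOrCreate())

# Filtering on the partition column (load_date) lets Spark prune partitions
# and scan only the relevant directories instead of the whole table.
daily_sales = spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales_partitioned
    WHERE load_date = '2016-03-01'
    GROUP BY region
""")

daily_sales.show()
spark.stop()
```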
Environment: HDP 2.3, Hadoop, HDFS, Hive, MapReduce, AWS EC2, Solr, Impala, MySQL, Oracle, Sqoop, Flume, Spark, SQL, Talend, Python, PySpark, YARN, Pig, Oozie, Linux (Ubuntu), Scala, Ab Initio, Tableau, Maven, Jenkins, Java (JDK 1.6), Cloudera, JUnit, Agile methodologies
Confidential
Big Data Hadoop Developer
Responsibilities:
- Analyzed and wrote Hadoop MapReduce jobs using the Java API, Pig, and Hive.
- Exported data from HDFS to Teradata using Sqoop on a regular basis.
- Wrote scripts to automate application deployments and configurations; monitored YARN applications.
- Wrote MapReduce programs to clean and preprocess data coming from different sources.
- Implemented various output formats such as SequenceFile and Parquet in MapReduce programs.
- Also implemented multiple output formats in the same program to match the use cases.
- Used Pig to apply transformations, cleaning, and deduplication to data from raw data sources.
- Installed Oozie workflows to run multiple Hive jobs.
- Implemented test scripts to support Test-Driven Development (TDD) and continuous integration.
- Converted text files to Avro and then to Parquet format so the files could be used with other Hadoop ecosystem tools (a representative sketch follows this list).
- Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data.
- Exported the analyzed data to HBase using Sqoop to generate reports for the BI team.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
- Designed, developed, and implemented JSPs in the presentation layer for the Submission, Application, and Reference implementations.
- Developed JavaScript for client-side data entry validation and front-end validation.
- Deployed web, presentation, and business components on the Apache Tomcat application server.
- Developed PL/SQL procedures for different use case scenarios.
- Involved in post-production support and testing; used JUnit for unit testing of the module.
- Participated in the requirements gathering and analysis phase of the project, documenting the business requirements by conducting workshops/meetings with various business users.
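The sketch below shows the file-format conversion described above, expressed in PySpark rather than the MapReduce-era tooling actually used; the input path, delimiter, and column names are hypothetical, and the intermediate Avro write assumes the spark-avro package is available.

```python
# Hedged sketch: delimited text -> (optional) Avro -> Parquet for downstream
# Hadoop ecosystem tools. All paths and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("text-to-parquet").getOrCreate()

# Delimited text input (hypothetical layout).
records = (spark.read
           .option("delimiter", "|")
           .option("header", "false")
           .csv("hdfs:///data/raw/transactions/")
           .toDF("txn_id", "account_id", "amount", "txn_date"))

# Optional intermediate Avro copy for tools that expect Avro
# (requires the spark-avro package on the classpath).
records.write.mode("overwrite").format("avro").save("hdfs:///data/avro/transactions/")

# Columnar Parquet output for downstream consumers.
records.write.mode("overwrite").parquet("hdfs:///data/parquet/transactions/")

spark.stop()
```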
Environment: Hadoop 1.0.4, Python, MapReduce, HDFS, Hive 0.10, Pig, Hue, Spark, Kafka, Oozie, Core Java, Eclipse, HBase, Flume, Cloudera Manager, Greenplum DB, IDMS, VSAM, SQL*Plus, Toad, PuTTY, Windows NT, UNIX shell scripting, Linux 5, Pentaho Big Data, YARN, HAWQ, Spring XD, Java SDK 1.6
Confidential
Software Engineer
Responsibilities:
- Involved in requirements analysis and the design of an object-oriented domain model.
- Involved in detailed documentation and wrote functional specifications for the module.
- Involved in development of the application with Java and J2EE technologies.
- Developed and maintained an elaborate services-based architecture utilizing open-source technologies like Hibernate ORM and the Spring Framework.
- Developed server-side services using Java multithreading, Struts MVC, EJB, Spring, and web services (SOAP, WSDL, Axis).
- Responsible for developing the DAO layer using Spring MVC and configuration XMLs for Hibernate, and for managing CRUD operations (insert, update, and delete).
- Designed, developed, and implemented JSPs in the presentation layer for the Submission, Application, and Reference implementations.
- Developed JavaScript for client-side data entry validation and front-end validation.
- Deployed web, presentation, and business components on the Apache Tomcat application server.
- Developed PL/SQL procedures for different use case scenarios.
- Involved in post-production support and testing; used JUnit for unit testing of the module.
Environment: Java/J2EE, JSP, XML, Spring Framework, Hibernate, Eclipse (IDE), JavaScript, Ant, SQL, PL/SQL, Oracle, Windows, UNIX, SOAP, Jasper Reports.