
Sr. Spark Developer/Hadoop Resume

Austin, TX

SUMMARY:

  • 8+ years of IT experience across a variety of industries, including hands-on experience with Hadoop technologies such as HDFS, MapReduce, Pig Latin, Hive, HBase, Sqoop, Oozie, Flume, and Zookeeper, as well as Apache Spark application development both on-prem and in the cloud.
  • Expertise in coding in multiple languages, including Python, Java, Scala, and Unix shell scripting.
  • Well versed in Amazon Web Services (AWS) cloud services such as EC2, S3, EMR, and DynamoDB.
  • Integrated visualizations into Spark applications using Databricks and visualization libraries such as ggplot and matplotlib.
  • Developed PySpark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation across multiple file formats (a minimal sketch follows this list).
  • Good working knowledge of Hadoop architecture and its components, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, Secondary NameNode, and MapReduce.
  • Experience handling different file formats such as text, SequenceFile, and Avro data files using different SerDes in Hive.
  • Proficient with columnar file formats such as RCFile, ORC, and Parquet.
  • Experience working with the MapReduce framework and the Spark execution model.
  • Hands-on experience programming with Resilient Distributed Datasets (RDDs), DataFrames, and the Dataset API.
  • Experienced with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
  • Experienced in writing custom Hive UDFs to incorporate business logic into Hive queries.
  • Experience in process improvement, normalization/denormalization, data extraction, data cleansing, and data manipulation in Hive.
  • Experience loading data files from HDFS into Hive for reporting.
  • Experience writing Sqoop commands to import data from relational databases into HDFS.
  • Experience with SQL Server and Oracle databases and with writing queries.
  • Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
  • Extensively used Pig for data cleansing.
  • Developed Pig UDFs to pre-process data for analysis.
  • Developed Oozie workflows to automate loading data into HDFS and pre-processing it with Pig.
  • Used Python scripts to build an Autosys workflow to automate tasks across three zones in the cluster.
  • Experienced in working with business teams to gather and fully understand requirements.
  • Designed and created data extracts supporting reporting in Power BI, Tableau, and other visualization tools.
  • Knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions, and of data warehouse tools for data analysis.
  • Experience in database design, data analysis, and SQL programming.
  • Hands-on experience designing REST-based microservices using Spring Boot and Spring Data JPA.
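
A minimal PySpark sketch of the kind of Spark SQL extraction, transformation, and aggregation work described above; the file paths, column names, and table names are illustrative assumptions, not actual project artifacts.

# Minimal sketch: read two source formats, join, aggregate, write Parquet.
# Paths, schemas, and names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("multi-format-aggregation").getOrCreate()

# Extraction from multiple file formats.
orders = spark.read.option("header", True).csv("/mnt/raw/orders.csv")
customers = spark.read.json("/mnt/raw/customers.json")

# Transformation: type casting and a join on a shared key.
orders = orders.withColumn("amount", F.col("amount").cast("double"))
enriched = orders.join(customers, "customer_id")

# Aggregation: total and average spend per customer segment.
summary = (enriched.groupBy("segment")
           .agg(F.sum("amount").alias("total_spend"),
                F.avg("amount").alias("avg_spend")))

summary.write.mode("overwrite").parquet("/mnt/curated/spend_by_segment")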

TECHNICAL SKILLS:

Big Data Technologies: HDFS, MapReduce, YARN, Hive, Pig, Pentaho, HBase, Oozie, Zookeeper, Sqoop, Cassandra, Spark, Scala, Storm, Flume, Kafka, Avro, Parquet, Snappy.

NoSQL Databases: HBase, Cassandra, MongoDB, Neo4j, Redis.

Cloud Services: Amazon AWS, Google Cloud.

Languages: C, C++, Java, Scala, Python, HTML, SQL, PL/SQL, Pig Latin, HiveQL, UNIX, JavaScript, Shell Scripting.

ETL Tools: Informatica, IBM DataStage, Talend.

Application Servers: WebLogic, WebSphere, JBoss, Tomcat.

Databases: Oracle, MySQL, DB2, Teradata, Microsoft SQL Server.

Operating Systems: UNIX, Windows, iOS, Linux.

Build Tools: Jenkins, Maven, ANT, Azure.

Frameworks: MVC, Struts, Spring, Hibernate.

Version Controls: Subversion, Git, Bitbucket, GitHub.

Methodologies: Agile, Waterfall.

PROFESSIONAL EXPERIENCE:

Sr. Spark Developer/Hadoop

Confidential, Austin, TX

Responsibilities:

  • Responsible for mapping Hive tables and designing the data transformations to move data to AWS Redshift.
  • Worked as part of the AWS build team to create, configure, and manage S3 buckets (storage).
  • Experience with AWS EC2, EMR, Lambda, and CloudWatch.
  • Involved in moving collections from MongoDB to Hive through the Spark MongoDB connector.
  • Involved in running and scheduling spark-submit jobs through Airflow and Korn shell scripts through Autosys.
  • Involved in performance tuning of Spark SQL and spark-submit jobs.
  • Developed Hive tables on data using different storage formats and compression techniques.
  • Optimized data sets with partitioning and bucketing in Hive and performance tuning of Hive queries.
  • Created RDDs in Spark and extracted data from the data warehouse onto Spark RDDs.
  • Worked with Spark SQL and created RDDs using the PySpark SparkContext and SparkSession.
  • Developed a data pipeline using Kafka, Sqoop, Hive, and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
  • Developed Sqoop scripts for importing data into and exporting data from HDFS and Hive.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS (see the sketch after this list).
  • Created a Lambda script to spin up spot and transient EMR clusters to run scheduled Spark Autosys jobs and send alerts using SNS.
  • Integrated Glue jobs with Spark and created crawlers for each data ingestion over S3 data.
  • Responsible for building BI queries from Redshift/PostgreSQL and making the data available in S3 for downstream users.
  • Used Spark and Spark SQL with the Scala API to read Parquet data and create tables in Hive.
  • Worked with Hive through the PySpark and Scala Spark APIs to create tables, load terabytes of historical data, and build a NiFi ETL process for daily updates from BAC S3; used Spark Core to join the data for reporting and for detecting fraudulent activities.
  • Used the DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyze data from Cassandra tables for quick searching, sorting, and grouping.
  • Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster and tested HDFS, Hive, Pig, and MapReduce cluster access for users.
  • Generated various reports using Power BI and Tableau based on client specifications.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
  • Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
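
A hypothetical sketch of the Kafka-to-Parquet flow mentioned above, assuming Spark 2.x with the spark-streaming-kafka-0-8 integration package; the topic, broker, and HDFS path names are illustrative, not taken from the project.

# Consume a Kafka feed with Spark Streaming, convert each micro-batch RDD
# to a DataFrame, and append it to HDFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

spark = SparkSession.builder.appName("realtime-feed-to-parquet").getOrCreate()
ssc = StreamingContext(spark.sparkContext, batchDuration=60)

# Direct stream from Kafka; broker list and topic name are assumptions.
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["customer_events"],
    kafkaParams={"metadata.broker.list": "broker1:9092,broker2:9092"})

def persist_batch(rdd):
    # Each Kafka record is a (key, value) pair; keep the JSON value,
    # build a DataFrame, and append it in Parquet format.
    if not rdd.isEmpty():
        df = spark.read.json(rdd.map(lambda kv: kv[1]))
        df.write.mode("append").parquet("hdfs:///data/raw/customer_events")

stream.foreachRDD(persist_batch)
ssc.start()
ssc.awaitTermination()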

Environment: Hadoop 2.4.x, MapReduce, Hive, Impala, Pig, Oozie, Kafka, Sqoop, Scala, HBase, Java, Maven, AWS, EMR, S3, Spark, PySpark, Spark SQL, Apache NiFi.

Spark Developer/AWS/Hadoop

Confidential, Orlando, FL

Responsibilities:

  • Used partitioning, bucketing, map-side joins, vectorization, and parallel execution to optimize Hive queries, decreasing execution time from hours to minutes.
  • Developed a PySpark data pipeline, running on a transient cluster, to process NPT data that lands in partitioned S3 buckets, transform it, and store the results back in S3.
  • Created and automated Scala Spark job execution from a Step Functions state machine provisioned through CloudFormation and scheduled from Control-M.
  • For streaming Kafka-Spark jobs, maintained long-running EMR clusters with the required applications, where event-based proxy data is processed into S3 and a Scala Spark job then transforms the data and pushes it back to S3 and the Redshift database.
  • Built PySpark SQL jobs to join multiple tables, create analytical datasets on S3, and then move the data into Redshift.
  • Developed Apache Spark applications using Scala and PySpark to process data from various streaming sources.
  • Worked on Scala programming, developing Spark Streaming jobs that integrate with Kafka to build streaming data platforms.
  • Experienced in building Apache Spark pipelines for CDC data by spinning up EMR clusters with transient and spot instance specifications using CloudFormation scripts.
  • Developed Sqoop scripts to import and export data from MySQL and handled incremental and updated changes into HDFS and Hive tables.
  • Worked on Amazon EC2 and EMR instances to retrieve and store data in Amazon S3, and wrote Athena SQL queries to unload data from Redshift to S3.
  • Worked on creating Glue catalogs for the S3 data lake and transforming the data.
  • Built NiFi and Kafka flows for streaming data, transformed the data using PySpark, and loaded it into S3.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark SQL, RDDs, and DataFrames in Python.
  • Developed an AWS Lambda function, using an automated Python script, to trigger Spark code whenever an XML object arrived in the S3 bucket (see the sketch after this list).
  • Implemented Oozie workflows using Sqoop, Hive, and shell actions and the Oozie coordinator to automate tasks.
  • Responsible for writing data monitoring scripts in Python for data validation.
  • Responsible for maintaining each project's documentation from start to finish.
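
A minimal sketch of an S3-triggered Lambda handler of the kind described above, assuming the Spark code is submitted as an EMR step; the cluster ID, code bucket, and script path are hypothetical placeholders.

# Lambda handler: when an XML object lands in S3, submit a Spark step to an
# existing EMR cluster via boto3. Names below are illustrative assumptions.
import boto3

emr = boto3.client("emr")

def handler(event, context):
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    if not key.endswith(".xml"):
        return {"skipped": key}

    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
        Steps=[{
            "Name": f"process-{key}",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit",
                         "s3://my-code-bucket/jobs/process_xml.py",
                         f"s3://{bucket}/{key}"],
            },
        }])
    return {"step_ids": response["StepIds"]}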

Environment: Spark, Java, NiFi, Python, Scala, Glue, Tez, Hive, Teradata, PostgreSQL, Cloudera Hadoop, YARN, Sqoop, Oozie, SNS, EC2, EMR, S3, PySpark, Redshift, CloudWatch, Data Pipeline, Shell, Bamboo, Maven.

Hadoop/Spark Developer

Confidential

Responsibilities:

  • Responsible for building a customer-centric data lake in Hadoop to serve as the analysis and data science platform.
  • Responsible for building scalable distributed data solutions on the Cloudera Hadoop distribution.
  • Used Sqoop and Kafka for migrating data and incremental imports into HDFS and Hive from various other data sources.
  • Modeled and built Hive tables to combine and store structured and unstructured sources of data for the best possible access.
  • Integrated the Cassandra file system with Hadoop using MapReduce to perform analytics on Cassandra data.
  • Used Cassandra to store billions of records, enabling faster, more efficient querying, aggregation, and reporting.
  • Developed Spark jobs using the Python (PySpark) APIs.
  • Migrated Python programs into Spark jobs for various processes.
  • Involved in job management and developed job-processing scripts using Oozie workflows.
  • Implemented ETL pipelines to ingest data from the traditional EDW into Hadoop.
  • Implemented optimization techniques in Hive such as partitioning tables, denormalizing data, and bucketing.
  • Used Spark SQL to create structured data with DataFrames and to query Hive and other data sources.
  • Involved in migrating MapReduce programs into Spark transformations using PySpark, Spark SQL, Scala, and Java.
  • Supported data scientists with data and platform setup for their analysis and migrated their finished products to production.
  • Worked on cleansing and extracting meaningful information from clickstream data using Spark and Hive.
  • Involved in performance tuning of Spark applications, setting the right level of parallelism and tuning memory.
  • Used optimization techniques in Spark such as data serialization and broadcasting (see the sketch after this list).
  • Optimized existing Hadoop algorithms using Spark, Spark SQL, and DataFrames.
  • Implemented a POC persisting clickstream data with Apache Kafka.
  • Developed a POC on transferring data from different data sources into HDFS using Kafka producers, consumers, and brokers.
  • Implemented data pipelines to move processed data from Hadoop to RDBMS and NoSQL databases.
  • Followed Agile & Scrum principles in developing the project.
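
A minimal sketch of the broadcast-join optimization mentioned above; the database and table names are hypothetical, not actual project objects.

# Broadcast the small dimension table so the large fact table is not shuffled.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (SparkSession.builder
         .appName("clickstream-enrichment")
         .enableHiveSupport()
         .getOrCreate())

clicks = spark.table("analytics.clickstream_raw")   # large fact table
customers = spark.table("analytics.customer_dim")   # small dimension table

enriched = clicks.join(broadcast(customers), "customer_id")

# Data serialization would typically be tuned at submit time, e.g.
# --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
enriched.write.mode("overwrite").saveAsTable("analytics.clickstream_enriched")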

Environment: Hadoop 2.6.5, HDFS, Spark 2.0.2, Spark SQL, Sqoop, Hive, Apache Kafka, Python, Scala 2.11, PySpark, Cassandra, Oozie, Cloudera (CDH5), Java JDK 1.8.

ETL Spark Developer/Hadoop

Confidential

Responsibilities:

  • Extracted data from different flat files and database tables, transformed the data based on user requirements using Informatica PowerCenter, and loaded it into targets by scheduling sessions.
  • Worked with Informatica PowerCenter tools: Source Analyzer, Data Warehousing Designer, Mapping/Mapplet Designer, and Transformation Designer. Developed Informatica mappings and tuned them for better performance.
  • Worked extensively with Sqoop to import data into HDFS from data sources such as Oracle, Teradata, and SQL Server.
  • Involved in creating Hive tables, loading data into HDFS, and analyzing data using Hive queries.
  • Used Tableau to connect to Hive for generating daily reports.
  • Implemented partitioning, dynamic partitions, and buckets in Hive and read/wrote data to/from HDFS.
  • Created UDFs in Python scripts in the Spark environment for data processing (see the sketch after this list).
  • Developed PySpark scripts for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
  • Developed DataFrames, Datasets, and RDDs in Scala using higher-order functions such as map, flatMap, and filter.
  • Created Spark jobs in Scala in Eclipse and built/packaged them with SBT.
  • Experience using Avro, Parquet, ORC, and JSON file formats and UDFs in Hive.
  • Converted existing Hadoop MapReduce jobs to Spark using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Improved the performance of Hadoop jobs written with Sqoop and worked on tuning Hive queries and Spark jobs.
  • Worked with BAs to understand requirements and modified/developed database objects for the backend and frontend; created and modified Oracle procedures, functions, packages, ref cursors, views, etc.
  • Converted existing BO reports to Tableau dashboards.
  • Developed interfaces using UNIX shell scripts to automate bulk load and update processes.
  • Developed Tableau data visualizations using crosstabs, heat maps, and whisker charts.
  • Utilized Tableau Server to publish and share reports with business users.
  • Implemented partitioning, dynamic partitions, and buckets in Hive for data warehouse systems.
  • Created logical and physical data models using Erwin.
  • Created DFDs and database documentation; also provided production/UAT support for issues.
  • Imported and exported data through SQL*Loader and performed catalog data analysis.
  • Used Oracle bulk binding, hints, indexes, table partitioning, etc. for performance improvement.
  • Followed programming best practices to avoid future performance issues.
  • Created SQL scripts for Prod/UAT/QA deployment.
  • Used JIRA for Agile project management.
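
A minimal sketch of a Python UDF used for data cleansing in Spark, as referenced above; the column and table names are hypothetical examples.

# Register a Python UDF, apply it to a Hive table, and aggregate the result.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("udf-cleansing")
         .enableHiveSupport()
         .getOrCreate())

@udf(returnType=StringType())
def normalize_region(value):
    # Trim whitespace, upper-case free-text region codes, default unknowns.
    return value.strip().upper() if value else "UNKNOWN"

orders = spark.table("staging.orders")
cleaned = orders.withColumn("region_cd", normalize_region(col("region_cd")))

summary = cleaned.groupBy("region_cd").count()
summary.write.mode("overwrite").saveAsTable("reporting.orders_by_region")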

Environment: Cloudera 4.4, Spark, Hive, Pig, Scala, Jira, Agile, Unix, MapReduce, Teradata BTEQ, Python, Sqoop, Oozie, Eclipse, Informatica PowerCenter 7.x/8.x, Oracle 11g, Excel files, flat files, Autosys, HP Quality Center.

ETL Developer

Confidential

Responsibilities:

  • Gathered requirements, analyzed them, and wrote the design documents.
  • Prepared high-level logical data models using Erwin, then translated the models into physical models using the forward engineering technique.
  • Involved in data mapping specifications to create and execute detailed system test plans. The data mapping specifies what data will be extracted from an internal data warehouse, transformed, and sent to an external entity.
  • Analyzed business requirements, system requirements, and data mapping requirement specifications, and was responsible for documenting functional and supplementary requirements in Quality Center.
  • Set up environments to be used for testing and defined the range of functionality to be tested per technical specifications.
  • Involved with data profiling for multiple sources and answered complex business questions by providing data to business users.
  • Worked with data investigation, discovery, and mapping tools to scan every single data record from many sources.
  • Delivered files in various formats (e.g., Excel, tab-delimited text, comma-separated text, and pipe-delimited text).
  • Performed data reconciliation between integrated systems.
  • Generated ad hoc SQL queries using joins, database connections, and transformation rules to fetch data from the Teradata database.
  • Utilized Oozie workflows to run Pig and Hive jobs; extracted files from MongoDB through Sqoop, placed them in HDFS, and processed them.
  • Performed metrics reporting, data mining, and trend analysis in a helpdesk environment using Access.
  • Created and monitored workflows using Workflow Designer and Workflow Monitor.
  • Developed Java beans and utility classes for interacting with the database using JDBC.
  • Involved in extensive data validation by writing complex SQL queries, in back-end testing, and in working through data quality issues.
  • Identified and recorded defects with the information required for issues to be reproduced by the development team.

Environment: PL/SQL, Business Objects XI R2, ETL tools Informatica 9.5/8.6/9.1, Oracle 11g, Teradata V2R12/R13.10, Teradata SQL Assistant 12.0, DB2, Java, Business Objects, SQL, SQL Server 2000/2005, UNIX, Shell Scripting, Quality Center 8.2.
