
Spark Developer Resume


Piscataway, NJ

SUMMARY:

  • Extensive experience working as a Big Data Developer and BI Developer across multiple industries, including finance, insurance, and healthcare.
  • Worked with different Hadoop distribution platforms, including AWS S3/EMR/EC2 and Cloudera CDH.
  • Solid data engineering skills across the Hadoop ecosystem, including HDFS, MapReduce, Hive, HBase, Impala, Spark (Spark Core, SparkSQL, Spark Streaming), Kafka, Sqoop, Flume, Phoenix, and Oozie.
  • Solid programming skills in both object-oriented and functional languages, including Java, Python, Scala, C#, R, and MATLAB.
  • Proficient in writing SQL across multiple dialects, including HiveQL, Impala SQL, SparkSQL, T-SQL, and MySQL.
  • Hands-on experience creating Hive tables (external and internal), including enabling partitioning (dynamic partitioning) and bucketing.
  • Extensive experience creating T-SQL objects such as UDFs, stored procedures, triggers, and common table expressions (CTEs).
  • Hands-on experience batch-loading data between RDBMS and Hive using Sqoop.
  • Hands-on experience creating Hive UDFs/UDAFs/UDTFs in Java to perform custom data transformations.
  • Experience writing MapReduce programs in Java and Scala within the Spark context.
  • Implemented data cleansing and transformation using Spark RDD operations such as map, reduce, filter, fold, reduceByKey, and combineByKey, and optimized RDD transformation tasks via the DAG (Directed Acyclic Graph); see the sketch after this list.
  • Familiar with conversions between Spark RDDs, DataFrames, and Datasets via the SparkSQL API.
  • Implemented large-scale ETL (Extract, Transform, Load) workflows from various sources to various destinations (AWS Redshift, S3, Microsoft SQL Server, Oracle, Hadoop HDFS, flat files) using SQL Server Integration Services (SSIS) and Informatica.
  • Experience working with various data formats such as ORC, Parquet, Avro, SequenceFile, JSON, and XML.
  • Experience developing scalable solutions using NoSQL databases such as HBase and Cassandra.
  • Deep understanding of EDW architecture (snowflake and star schemas) and OLTP/OLAP data modeling.
  • Knowledge of machine learning algorithms such as linear regression, logistic regression, SVM (support vector machines), random forest classifiers, and k-means clustering.
  • Experience developing drill-down, drill-through, and cascaded-parameter business reports, dashboards, and graphs using SQL Server Reporting Services (SSRS) and Power BI.
  • Familiar with software development tools such as JIRA, Git, and TFS.
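
A minimal sketch of the RDD cleansing and aggregation pattern referenced above (not from any specific engagement): it parses a hypothetical CSV of (symbol, price) pairs and computes an average price per symbol with map, filter, and combineByKey. The input/output paths and field layout are assumptions for illustration.

    // Minimal sketch of RDD-based cleansing and aggregation.
    // The paths and the (symbol, price) layout are hypothetical.
    import org.apache.spark.sql.SparkSession

    object RddCleansingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("rdd-cleansing-sketch").getOrCreate()
        val sc = spark.sparkContext

        val lines = sc.textFile("hdfs:///data/trades.csv")        // hypothetical path

        val avgPriceBySymbol = lines
          .map(_.split(","))                                      // raw line -> fields
          .filter(_.length == 2)                                  // drop malformed rows
          .map(f => (f(0).trim, f(1).trim.toDouble))              // (symbol, price)
          .combineByKey(                                          // running (sum, count) per key
            (p: Double) => (p, 1L),
            (acc: (Double, Long), p: Double) => (acc._1 + p, acc._2 + 1L),
            (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2))
          .mapValues { case (sum, n) => sum / n }                 // average per symbol

        avgPriceBySymbol.saveAsTextFile("hdfs:///out/avg_price")  // hypothetical path
        spark.stop()
      }
    }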

TECHNICAL SKILLS:

Hadoop/Spark Ecosystem: Hadoop 2.x, MapReduce, Spark 2.x, Pig 0.12, Hive 0.14, Sqoop 1.4.6, Flume 1.6.0, Kafka 0.9.x, YARN, Mesos, Zookeeper 3.4.x, Impala 1.2+, HBase 0.96.x, Cassandra 2.x, Oozie 3.0+

Development Tools: Git, SVN, JIRA, Jenkins

SQL Databases: Oracle 11g, MS SQL Server 2016, MySQL 5.x

Programming Languages: Java, Scala, Python, C#, Unix/Bash shell, T-SQL

ETL/BI Tools: SQL Server Integration Services 2016 (SSIS), SQL Server Reporting Services 2016 (SSRS), Informatica 8.x, Power BI

Cloud Platforms: Amazon Web Services (AWS) S3/EC2/EMR, Cloudera CDH

IDEs: IntelliJ, Eclipse, Microsoft Visual Studio, Spyder, PyCharm

PROFESSIONAL EXPERIENCE:

Confidential, Piscataway, NJ

Spark Developer

Responsibilities:

  • Ingested daily data from asset management systems such as FIS and Quantifi through JDBC and Apache Kafka.
  • Performed data transformations using RDD operations such as map, filter, reduce, reduceByKey, and combineByKey.
  • Used the Spark 2.1.x DataFrame API to generate ad-hoc queries.
  • Performance-tuned Spark applications using the Catalyst optimizer, caching, partition rebalancing, and broadcast variables.
  • Involved in establishing real-time data pipelines using the Kafka producer API and Spark Streaming; see the sketch after this list.
  • Delivered real-time market-order data from different sources into the Kafka messaging system.
  • Developed Spark scripts in the Scala shell as per requirements.
  • Created HBase tables to load unstructured and semi-structured datasets.
  • Deeply involved in various phases of the Software Development Life Cycle (SDLC), including requirements gathering, design, analysis, and code development.
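
A minimal sketch of a Kafka-to-Spark-Streaming pipeline like the one above, using the spark-streaming-kafka-0-10 direct stream API; the broker address, topic name, and "symbol,qty,price" message layout are assumptions for illustration.

    // Consume market-order messages from Kafka and count orders per
    // symbol in each 10-second micro-batch.
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object OrderStreamSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("order-stream-sketch")
        val ssc = new StreamingContext(conf, Seconds(10))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "broker1:9092",            // hypothetical broker
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "orders-consumer",                  // hypothetical group id
          "auto.offset.reset" -> "latest")

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent,
          Subscribe[String, String](Seq("market-orders"), kafkaParams)) // hypothetical topic

        stream
          .map(record => (record.value.split(",")(0), 1L))  // assume "symbol,qty,price"
          .reduceByKey(_ + _)                               // orders per symbol per batch
          .print()

        ssc.start()
        ssc.awaitTermination()
      }
    }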

Environment: HDFS 2.x, Hive 2.0, Scala 2.12.x, Spark 2.1, Kafka 0.11, MySQL 5.7, YARN, JIRA 6.4, Linux, Git

Confidential, East Setauket, NY

Big Data Developer

Responsibilities:

  • Collaborated closely with business analysts, data architects, data scientists, and medical practitioners on functional requirement documents.
  • Worked on AWS EMR 5.0.x within an Agile development cycle.
  • Built NiFi applications with built-in processors to ingest log messages into Kafka topics.
  • Utilized Kafka, Flume, and Spark Streaming DStreams to build real-time ETL pipelines that analyze data and load it into HDFS and Cassandra.
  • Created internal and external Hive tables configured with dynamic partitioning and bucketing to boost query performance.
  • Developed UDFs, UDAFs, and UDTFs in SparkSQL to process, transform, and aggregate data in ETL; see the sketch after this list.
  • Implemented Kafka producers, created custom partitioners, configured brokers, and implemented high-level consumers for the data platform.
  • Used Cassandra CQL with the Java API to manipulate data in Cassandra tables.
  • Utilized SparkSQL DataFrames to process large sets of structured data in different formats (text, SequenceFile, Avro, JSON, XML, Parquet, ORC).
  • Developed Bash shell scripts to automate incremental load jobs from OLTP systems into EMR.
  • Scheduled Oozie workflows to run Spark jobs and monitored workloads.
  • Involved in migrating HiveQL queries to Impala for better query performance.
  • Performed unit testing for functional validation following test-driven development (TDD) with ScalaTest.
  • Assisted machine learning applications by providing data-quality solutions.
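
A minimal sketch of a SparkSQL scalar UDF like those mentioned above: a hypothetical gender-code normalizer applied in a SQL aggregation. The input path, Parquet format, and column names are assumptions for illustration.

    // Register a scalar UDF and use it in a SparkSQL query.
    import org.apache.spark.sql.SparkSession

    object SparkSqlUdfSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("udf-sketch").getOrCreate()

        // Hypothetical patient-event records stored as Parquet on HDFS.
        val events = spark.read.parquet("hdfs:///data/events")

        // Scalar UDF: normalize free-text gender codes before aggregation.
        spark.udf.register("normalize_gender", (raw: String) =>
          Option(raw).map(_.trim.toLowerCase) match {
            case Some("m") | Some("male")   => "M"
            case Some("f") | Some("female") => "F"
            case _                          => "U"
          })

        events.createOrReplaceTempView("events")
        spark.sql(
          """SELECT normalize_gender(gender) AS gender, COUNT(*) AS n
            |FROM events
            |GROUP BY normalize_gender(gender)""".stripMargin
        ).show()
      }
    }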

Environment: AWS EMR, HDFS 2.5.x, Hive 2.0, Spark 2.1.x, Cassandra 3.0.x, Kafka 0.9, Oracle 10g, Oozie 4.1.x, Java 8, Impala 2.0, Scala 2.12.x, YARN, GitHub, Eclipse, shell scripting, Linux, ScalaTest, Git

Confidential

Hadoop Developer

Responsibilities:

  • Created Sqoop jobs to import/export daily data between HDFS and RDBMS (MySQL).
  • Developed Hadoop/Spark applications in the Cloudera environment to automate deployment, configuration, and data integration across different systems.
  • Developed UDFs, UDAFs, and UDTFs in Java to transform data formats in Hive tables.
  • Used Flume to collect and push log data from different log servers.
  • Implemented HBase coprocessors (observers) for event-based analysis.
  • Utilized the HBase Java API to create tables and column families and to perform DML operations on data; see the sketch after this list.
  • Implemented various data transformations using the Spark RDD and SparkSQL DataFrame APIs.
  • Developed shell scripts to perform data profiling on ingested data with the help of Hive.
  • Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting.
  • Exported data from Hadoop HDFS into Power BI using Power Query.
  • Created reports, dashboards, and graphs in Power BI, including writing complex DAX functions and Power Pivot models.
  • Wrote calculated columns and measure queries in Power BI to display critical KPIs according to business needs.
  • Involved in managing nodes on the Hadoop cluster and monitoring job performance using Cloudera CDH.
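
A minimal sketch of HBase Java API usage like that above, written in Scala against the Connection/Table client; the table name, column family, and row-key scheme are all hypothetical.

    // Write and read one cell through the HBase client API.
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
    import org.apache.hadoop.hbase.util.Bytes

    object HBaseClientSketch {
      def main(args: Array[String]): Unit = {
        val conf = HBaseConfiguration.create()        // reads hbase-site.xml from classpath
        val connection = ConnectionFactory.createConnection(conf)
        try {
          val table = connection.getTable(TableName.valueOf("web_logs")) // hypothetical table
          val rowKey = Bytes.toBytes("user42_20170101T000000")           // hypothetical key

          // Write one cell: d:url = "/home".
          val put = new Put(rowKey)
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("url"), Bytes.toBytes("/home"))
          table.put(put)

          // Read it back.
          val result = table.get(new Get(rowKey))
          val url = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("url")))
          println(s"url = $url")

          table.close()
        } finally {
          connection.close()
        }
      }
    }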

Environment: Cloudera CDH, HDFS 2.1.x, Hive 2.0, Spark 2.1.x, Cassandra 3.0.x, Kafka 0.9, Oracle 10g, Zookeeper 3.4.10, Oozie 4.1.x, Java 8, Scala 2.12.x, YARN, GitHub, Eclipse, shell scripting, Linux, TFS

Confidential, Indianapolis, IN

BI Developer

Responsibilities:

  • Worked with business stakeholders, application developers, and production teams across functional units to identify business needs and discuss solution options.
  • Created SQL stored procedures, temp tables, and views for report development.
  • Extensively used joins, sub-queries, and common table expressions (CTEs) for complex queries involving multiple tables from different databases. Optimized the database by creating various clustered and non-clustered indexes and indexed views.
  • Developed over 20 SSIS packages to perform ETL from different OLTP systems into the data warehouse.
  • Implemented SCD types 1 and 2 (slowly changing dimensions) for incremental loading of dimension tables using SSIS lookup transformations and T-SQL MERGE statements; see the sketch after this list.
  • Developed over 30 SSRS reports with visualization features including stacked bar charts, pie charts, line charts, and conditional formatting.
  • Applied drill-down, drill-through, cross-tab, and cascaded-parameter methods to highlight general information, surface hidden detail records, and enable user interaction that fits the business logic.
  • Customized each SSRS report with user-defined parameters, filtering, grouping, and conditional formatting, starting from both blank tabs and drop-down lists.
  • Used snapshot and caching options to improve Report Server performance.
  • Involved in troubleshooting and performance tuning with the QA team.
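
A minimal sketch of the SCD type-2 expiry step referenced above, issued as a T-SQL MERGE over JDBC from Scala (keeping one language across these sketches; the original work used SSIS lookups and T-SQL directly). The connection string, credentials, and table/column names are hypothetical, and the follow-up insert of the new current row is omitted for brevity.

    // Expire changed dimension rows via a T-SQL MERGE, SCD type-2 style.
    import java.sql.DriverManager

    object Scd2MergeSketch {
      def main(args: Array[String]): Unit = {
        val url = "jdbc:sqlserver://localhost:1433;databaseName=dw" // hypothetical
        val conn = DriverManager.getConnection(url, "user", "password")
        try {
          val merge =
            """MERGE dbo.DimCustomer AS tgt
              |USING dbo.StgCustomer AS src
              |  ON tgt.CustomerKey = src.CustomerKey AND tgt.IsCurrent = 1
              |WHEN MATCHED AND tgt.City <> src.City THEN
              |  UPDATE SET tgt.IsCurrent = 0, tgt.EndDate = GETDATE()
              |WHEN NOT MATCHED BY TARGET THEN
              |  INSERT (CustomerKey, City, StartDate, EndDate, IsCurrent)
              |  VALUES (src.CustomerKey, src.City, GETDATE(), NULL, 1);""".stripMargin
          val rows = conn.createStatement().executeUpdate(merge)
          println(s"$rows row(s) affected") // a second INSERT of the new row version would follow
        } finally {
          conn.close()
        }
      }
    }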

Environment: SQL Server 2008, Oracle Database 10g, T-SQL, SSMS, SSRS, SharePoint, SSIS, Windows, SVN
