-
Hello everybody. Thanks for maintaining this amazing project! :D On our team we use some custom properties to track user/app information. Something like:

I think I saw an example that used a post-processing filter to add this information. But now I see that most of what I need could be done either with a dispatcher or by overriding a simple method:

Source: #35 (comment)

I also noticed that declarative configuration of properties will be added in an upcoming release. What's the preferred way? (Is anybody using post-processing filters for that?)
-
**LineageDispatcher**

The goal of the dispatcher is to send the lineage data to the server. Currently we support a Kafka dispatcher, an HTTP dispatcher, and some others for debugging and testing, like the logging dispatcher and the console dispatcher. Typically you would implement your own dispatcher when you need another method of sending the data, like RabbitMQ, or to store the data in S3 for some reason.

**PostProcessingFilter**

The goal of the filter is to change the lineage data before it is sent to the dispatcher. This might mean filtering out some sensitive data or adding additional information. Filters have access to the original Spark `LogicalPlan` and the `SparkSession` as a means of extracting metadata.

**UserExtraMetaDataProvider**

**Declarative way for adding user extra metadata**

This is planned, but it may take some time before it's ready. Currently, a custom `PostProcessingFilter` can be used:

```scala
import org.apache.commons.configuration.Configuration

import za.co.absa.spline.harvester.ExtraMetadataImplicits._
import za.co.absa.spline.harvester.HarvestingContext
import za.co.absa.spline.harvester.postprocessing.PostProcessingFilter
import za.co.absa.spline.producer.model.v1_1._

class MyExtraAppendingPostProcessingFilter(conf: Configuration) extends PostProcessingFilter {

  override def processExecutionEvent(event: ExecutionEvent, ctx: HarvestingContext): ExecutionEvent =
    event.withAddedExtra(Map("foo" -> "bar"))

  override def processExecutionPlan(plan: ExecutionPlan, ctx: HarvestingContext): ExecutionPlan =
    plan.withAddedExtra(Map("foo" -> "bar"))

  override def processReadOperation(op: ReadOperation, ctx: HarvestingContext): ReadOperation =
    op.withAddedExtra(Map("foo" -> "bar"))

  override def processWriteOperation(op: WriteOperation, ctx: HarvestingContext): WriteOperation =
    op.withAddedExtra(Map("foo" -> "bar"))

  override def processDataOperation(op: DataOperation, ctx: HarvestingContext): DataOperation =
    op.withAddedExtra(Map("foo" -> "bar"))
}
```
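For the agent to pick the filter up, it also needs to be registered in the agent configuration (e.g. in `spline.properties`, or as `spark.spline.*`-prefixed properties on the Spark conf). A minimal sketch, assuming the named-component convention from the spline-spark-agent README; the filter name `myFilter` and the package `com.example` are placeholders:

```properties
# Register a filter under an arbitrary name...
spline.postProcessingFilter=myFilter
# ...and point that name at the implementing class (fully qualified)
spline.postProcessingFilter.myFilter.className=com.example.MyExtraAppendingPostProcessingFilter
```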
-
Perfect answer. Thanks so much!
-
To be able to use it I had to add the constructor. Full code:

```scala
package za.co.absa.spline.harvester.postprocessing

import org.apache.commons.configuration.Configuration

import scala.util.matching.Regex

import za.co.absa.commons.CaptureGroupReplacer
import za.co.absa.commons.config.ConfigurationImplicits.ConfigurationRequiredWrapper
import za.co.absa.spline.harvester.ExtraMetadataImplicits._
import za.co.absa.spline.harvester.HarvestingContext
import za.co.absa.spline.producer.model.v1_1._

class CustomFilter extends PostProcessingFilter {

  // Auxiliary constructor: the agent instantiates the filter with a
  // Configuration argument, so this overload must exist
  def this(conf: Configuration) = this()

  override def processExecutionEvent(event: ExecutionEvent, ctx: HarvestingContext): ExecutionEvent =
    event.withAddedExtra(Map("foo" -> "bar"))

  override def processExecutionPlan(plan: ExecutionPlan, ctx: HarvestingContext): ExecutionPlan =
    plan.withAddedExtra(Map("foo" -> "bar"))

  override def processReadOperation(op: ReadOperation, ctx: HarvestingContext): ReadOperation =
    op.withAddedExtra(Map("foo" -> "bar"))

  override def processWriteOperation(op: WriteOperation, ctx: HarvestingContext): WriteOperation =
    op.withAddedExtra(Map("foo" -> "bar"))

  override def processDataOperation(op: DataOperation, ctx: HarvestingContext): DataOperation =
    op.withAddedExtra(Map("foo" -> "bar"))
}
```
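Rather than discarding the `Configuration`, the constructor can also be used to make the appended metadata configurable. A sketch under the assumption that Spline and Commons Configuration are on the classpath; the class name `ConfigurableExtraFilter` and the property name `myFilter.extra.team` are made up for illustration:

```scala
import org.apache.commons.configuration.Configuration
import za.co.absa.spline.harvester.ExtraMetadataImplicits._
import za.co.absa.spline.harvester.HarvestingContext
import za.co.absa.spline.harvester.postprocessing.PostProcessingFilter
import za.co.absa.spline.producer.model.v1_1._

// Hypothetical filter that reads the value to append from the agent
// configuration instead of hard-coding it
class ConfigurableExtraFilter(conf: Configuration) extends PostProcessingFilter {

  // getString(key, default) is Commons Configuration API;
  // "myFilter.extra.team" is an illustrative property name
  private val team: String = conf.getString("myFilter.extra.team", "unknown")

  override def processExecutionEvent(event: ExecutionEvent, ctx: HarvestingContext): ExecutionEvent =
    event.withAddedExtra(Map("team" -> team))

  override def processExecutionPlan(plan: ExecutionPlan, ctx: HarvestingContext): ExecutionPlan =
    plan.withAddedExtra(Map("team" -> team))

  // Pass the remaining entities through unchanged
  override def processReadOperation(op: ReadOperation, ctx: HarvestingContext): ReadOperation = op

  override def processWriteOperation(op: WriteOperation, ctx: HarvestingContext): WriteOperation = op

  override def processDataOperation(op: DataOperation, ctx: HarvestingContext): DataOperation = op
}
```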
-
I used to put it in the config props of the cluster. What do I need to put in order to also capture the Databricks notebook name and workspace URL?

```python
sc._jvm.za.co.absa.spline.harvester.SparkLineageInitializer.enableLineageTracking(spark._jsparkSession)
```
-
As was said before, there is no out-of-the-box support for capturing the Databricks notebook name or workspace URL. You still need to obtain it yourself and pass it into Spline in one of the available ways, depending on your convenience (e.g. using a custom `PostProcessingFilter` as shown above).
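One possible way to do that yourself, sketched below: read the notebook path and workspace URL from the Databricks notebook context and stash them in the Spark conf, where a custom filter (or any other mechanism) can pick them up. This only runs inside a Databricks notebook (where `dbutils` is injected by the runtime), and the `spline.databricks.*` property names are made up for illustration:

```python
# Databricks-only sketch: fetch the current notebook path
# from the notebook context via dbutils
notebook_path = (
    dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
)

# The workspace URL is exposed as a Spark conf property on Databricks clusters
workspace_url = spark.conf.get("spark.databricks.workspaceUrl")

# Stash both values in the Spark conf under illustrative property names,
# so a custom PostProcessingFilter can read and attach them as extras
spark.conf.set("spline.databricks.notebookPath", notebook_path)
spark.conf.set("spline.databricks.workspaceUrl", workspace_url)
```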