This repository has been archived by the owner on May 8, 2024. It is now read-only.
Replies: 2 comments
- @ninpnin You could probably answer this best.
Answer selected by Lauler
- I was browsing through the repo trying to get a sense of how it works. As I understand it, you train a text-based classifier on curated data in input/curation. Here are some questions I had:
I was hoping to explore using both image and text to segment the protocols. However, it is difficult to use the supplied images without either bounding-box information for the annotations, or a link back to the source document's textbox URIs so that bounding boxes can be retrieved from the original OCR output.
Is the bounding-box info included somewhere in riksdagen-corpus? If it isn't, my request would be to consider including it when preprocessing the protocols. It would be very useful to include it at least for the manually curated train/eval data, since this information already exists for the scanned and OCR'd protocols. Including bounding boxes would let people make use of the image modality.
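To illustrate what I mean by retrieving bounding boxes from the original OCR: a minimal sketch of pulling per-block coordinates out of ALTO XML, the layout format commonly produced by OCR pipelines for scanned documents. The sample document and the block IDs below are hypothetical, not taken from riksdagen-corpus itself.

```python
# Sketch: map each ALTO TextBlock ID to its pixel bounding box.
# ALTO encodes position via the HPOS, VPOS, WIDTH, HEIGHT attributes.
import xml.etree.ElementTree as ET

SAMPLE_ALTO = """<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="block_1" HPOS="120" VPOS="200" WIDTH="800" HEIGHT="150"/>
        <TextBlock ID="block_2" HPOS="120" VPOS="400" WIDTH="800" HEIGHT="600"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""

NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

def textblock_bboxes(alto_xml: str) -> dict:
    """Return {TextBlock ID: (x, y, width, height)} in page pixels."""
    root = ET.fromstring(alto_xml)
    return {
        block.get("ID"): tuple(
            int(block.get(attr)) for attr in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
        )
        for block in root.iterfind(".//alto:TextBlock", NS)
    }

print(textblock_bboxes(SAMPLE_ALTO))
```

If the preprocessed protocols kept a link from each annotated segment back to its TextBlock ID, a lookup like this would be enough to recover the image regions.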