Train model to Get Treatment Date #12583
Replies: 3 comments 1 reply
-
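I've been thinking about this a bit, but it's hard to give one-size-fits-all advice, and you'll need to experiment a bit with your specific data/task.
In our experience, this kind of task often doesn't work well if you try to model it solely as NER, at least not with spaCy's built-in ner component. ner can find dates pretty well, but it easily mixes up different types of dates when a lot of context is required. It is still worth trying ner on its own first: if it works well, you don't need a more complicated solution.
We'd usually suggest dividing this into two steps:
* find all date spans
* classify these date spans into treatment vs. other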
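The task of finding dates is typically well-suited for a combination of rule-based patterns ("April 5, 2003") and NER ("the last day of May 2015", if this kind of date even shows up in this kind of report?). I think there are a number of existing Python libraries for rule-based date finding. Matcher patterns with spaCy are sometimes hard to write for dates because of the default tokenization around punctuation/numbers with and without whitespace, so writing regex patterns over the document text may be easier, and you can use Doc.char_span to convert regex matches into spans to add to doc.ents.
For classifying date spans into treatment date vs. other, you could consider a spancat_singlelabel component that classifies the phrase or sentence surrounding a date, which I think would work better than trying to classify the date spans on their own (similar to how NER struggles). Basically you'd want to model this as text classification for Treatment Date: ##/##/#### rather than token classification for ##/##/####. It would probably also make sense to include some rule-based methods to find the easy cases like Treatment Date: DATE.
If you do use spancat, you'd need a custom suggester that suggests spans around DATE ents to the spancat_singlelabel component, which might look a bit like this (note: untested!):

    from typing import Iterable, Optional, cast

    from thinc.api import Ops, get_current_ops
    from thinc.types import Ints1d, Ragged
    from spacy.pipeline.spancat import Suggester
    from spacy.tokens import Doc
    from spacy.util import registry


    @registry.misc("date_suggester.v1")
    def build_date_suggester(candidates_key: str) -> Suggester:
        # candidates_key is not used in this sketch
        def date_suggester(docs: Iterable[Doc], *, ops: Optional[Ops] = None) -> Ragged:
            if ops is None:
                ops = get_current_ops()
            spans = []
            lengths = []
            for doc in docs:
                length = 0
                for ent in doc.ents:
                    if ent.label_ == "DATE":
                        # date spans +/- 3 tokens (you could also consider sentence boundaries or
                        # some other method to create spans around date, and you might also want to
                        # stop when this span runs into nearby dates)
                        spans.append([max(0, ent.start - 3), min(len(doc), ent.end + 3)])
                        length += 1
                lengths.append(length)
            lengths_array = cast(Ints1d, ops.asarray(lengths, dtype="i"))
            if len(spans) > 0:
                output = Ragged(ops.asarray(spans, dtype="i"), lengths_array)
            else:
                # no DATE ents in the whole batch: return an empty array of spans
                output = Ragged(ops.xp.zeros((0, 0), dtype="i"), lengths_array)
            return output

        return date_suggester

You'd need to annotate your training data with spans created in the exact same way, and for training you'd need to add any components that set the DATE entities (entity_ruler, span_ruler, ner, etc.) to annotating_components in the [training] section of the config.
And I'm not really sure how well this would work with spancat. The textcat component might actually work better than spancat, which focuses more on the first and last tokens in the spans. There's a related example in the Healthsea project (developed prior to spancat, and part of the motivation behind spancat) that uses textcat on clauses, with a very nice blog post describing the details: https://explosion.ai/blog/healthsea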
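For the rule-based side of date finding, one option is a regex over the document text whose matches are converted to entity spans with Doc.char_span. The sketch below is untested against real records and makes a few assumptions: the pattern `DATE_RE` only covers numeric dates, the helper name `add_regex_dates` is made up for illustration, and a blank English pipeline stands in for whatever pipeline you actually use.

```python
import re

import spacy
from spacy.util import filter_spans

# Hypothetical pattern: numeric dates such as 05/04/2023 or 05.04.2023.
DATE_RE = re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b")


def add_regex_dates(doc):
    """Add regex date matches to doc.ents as DATE entities."""
    spans = []
    for match in DATE_RE.finditer(doc.text):
        # alignment_mode="expand" snaps the match to token boundaries, which
        # helps when the tokenizer does not split exactly at the match edges.
        span = doc.char_span(
            match.start(), match.end(), label="DATE", alignment_mode="expand"
        )
        if span is not None:
            spans.append(span)
    # doc.ents must not overlap, so keep the longest non-overlapping spans.
    doc.ents = filter_spans(list(doc.ents) + spans)
    return doc


nlp = spacy.blank("en")
doc = add_regex_dates(nlp("Treatment Date: 05/04/2023. Date of Birth: 01.02.1960"))
date_texts = [ent.text for ent in doc.ents if ent.label_ == "DATE"]
```

Because the spans are expanded to token boundaries, an entity can come out slightly wider than the regex match; check a few examples from your own data before relying on the exact text.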
-
Wow, how helpful! Thank you so much, Adriane! I have a lot to digest here!
-
I am so appreciative of this feedback. I am really thinking aloud here, and I have some further thoughts. A set of medical records (the ones I want to parse by treatment) is usually arranged in date order, either descending or ascending. I am wondering whether that could help in finding the treatment date for any given page. We know that a given page could have a host of dates. Some could be eliminated from the running for treatment date, like ones styled "Date of Birth: ##/##/####" or "Date of Accident: ##/##/####". In other words, try to pare down the possible candidates for treatment date on a given page. Now, since the page number for this collection of medical records is correlated, either ascending or descending, with treatment date, could we create pairs of "Date, PageNumber"? Then we could look for a pattern that the "Date, PageNumber" pairs in that record set follow. That is, is "Date" consistently going down or going up over the page number sequence 1 to n for that given PDF file?
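One way to make the "Date, PageNumber" idea concrete is to pick at most one candidate date per page so that the picked dates never decrease as the page number goes up. The sketch below is plain Python rather than anything spaCy provides, and it assumes the candidates have already been parsed into `datetime.date` objects and that obvious non-treatment dates (date of birth, date of accident) were filtered out beforehand. The function name `pick_monotonic_dates` is made up for illustration.

```python
from datetime import date


def pick_monotonic_dates(candidates_by_page):
    """Pick at most one candidate date per page so the picks never decrease.

    candidates_by_page: list (in page order) of lists of datetime.date.
    Returns one date or None per page, covering as many pages as possible
    (a longest non-decreasing chain, found by dynamic programming).
    """
    # Flatten to (page_index, date) pairs in page order.
    items = [(p, d) for p, dates in enumerate(candidates_by_page) for d in dates]
    best_len = [1] * len(items)  # longest chain ending at items[i]
    prev = [None] * len(items)   # back-pointer to the previous chain element
    for i, (page_i, date_i) in enumerate(items):
        for j in range(i):
            page_j, date_j = items[j]
            if page_j < page_i and date_j <= date_i and best_len[j] + 1 > best_len[i]:
                best_len[i] = best_len[j] + 1
                prev[i] = j
    picks = [None] * len(candidates_by_page)
    if not items:
        return picks
    # Walk back from the end of the longest chain.
    i = max(range(len(items)), key=best_len.__getitem__)
    while i is not None:
        page, picked = items[i]
        picks[page] = picked
        i = prev[i]
    return picks


pages = [
    [date(2023, 1, 10)],
    [date(2023, 3, 5), date(2023, 1, 24)],
    [date(2023, 2, 14)],
    [date(2023, 3, 5), date(2023, 2, 1)],
]
picks = pick_monotonic_dates(pages)
```

For a file in descending order, reverse the page list before calling the function. If a date of birth is left among the candidates it will fit almost any ascending chain, which is why the filtering step matters.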
-
I am trying to figure out a way to train a spaCy model to find the date on a medical treatment record that is the date of treatment, so that I can label it as "treatment date". Any given medical record could have lots of dates. The date of treatment could be formatted differently depending on the medical record source. Some are in the form Treatment Date: ##/##/#### or Date of Visit: ##/##/#### or Encounter Date: ##.##.####, and obviously there are forms that might be a surprise to me.
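Before training anything, the labeled forms listed above can be picked up with a plain regex baseline. This is only a sketch: the label list `TREATMENT_LABELS`, the numeric date pattern, and the function name are assumptions made for illustration, and real records will need more labels and date formats.

```python
import re

# Hypothetical label list; extend it per record source.
TREATMENT_LABELS = ["Treatment Date", "Date of Visit", "Encounter Date"]

LABELED_DATE_RE = re.compile(
    r"(?P<label>" + "|".join(re.escape(label) for label in TREATMENT_LABELS) + r")"
    r"\s*:\s*"
    r"(?P<date>\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4})",
    flags=re.IGNORECASE,
)


def find_labeled_treatment_dates(text):
    """Return (label, date string, character offset of the date) per match."""
    return [
        (m.group("label"), m.group("date"), m.start("date"))
        for m in LABELED_DATE_RE.finditer(text)
    ]


sample = (
    "Date of Birth: 01/02/1960\n"
    "Encounter Date: 05.04.2023\n"
    "Treatment date: 5/4/2023"
)
matches = find_labeled_treatment_dates(sample)
```

The character offsets make it possible to map each match back onto a spaCy Doc later; the dates that this baseline misses are the ones a trained component would need to handle.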
I am editing this post (perhaps not the thing to do) because I have been thinking about the issue further. As I note above, I am trying to identify each set of medical records in a collection that represents a distinct visit, or "patient encounter" date. But I have really left out the most obvious characteristic: each visit record is going to follow a distinct pattern. For example, I am looking now at medical records in which each encounter record has the same header and the same sections going down the page.
I am wondering if spaCy could determine the pattern for a set of medical records from a given source and make it easy to identify the discrete visits. Then looking for the actual treatment date might be easy.
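As a rough illustration of the "same header on every encounter record" idea, the sketch below groups pages by a normalized header signature and starts a new visit at each page that carries the most common header. It is a plain-Python heuristic, not a spaCy feature, and it assumes the text of each page is already available as a string; the function names and the choice of the first two non-empty lines as the header are assumptions.

```python
import re
from collections import Counter


def header_signature(page_text, n_lines=2):
    """Normalize the first non-empty lines of a page so that pages sharing a
    layout get the same signature even when dates and numbers differ."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()][:n_lines]
    return tuple(re.sub(r"\d", "#", ln).lower() for ln in lines)


def split_into_visits(pages):
    """Start a new visit at every page whose header signature is the most
    common one in the file; other pages continue the current visit."""
    if not pages:
        return []
    signatures = [header_signature(p) for p in pages]
    visit_header, _ = Counter(signatures).most_common(1)[0]
    visits = []
    for index, signature in enumerate(signatures):
        if signature == visit_header or not visits:
            visits.append([index])
        else:
            visits[-1].append(index)
    return visits


pages = [
    "Acme Clinic\nEncounter Date: 01/10/2023\nChief Complaint: knee pain",
    "Lab results continued\nGlucose 98",
    "Acme Clinic\nEncounter Date: 02/14/2023\nChief Complaint: follow-up",
    "Acme Clinic\nEncounter Date: 03/05/2023\nAssessment: improving",
]
visits = split_into_visits(pages)
```

Each inner list holds the page indices of one visit. Once pages are grouped this way, the treatment date only has to be found once per visit, typically on the page that carries the header.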