Ingest Large File of Training Data to Create Doc #12461
jon2718
started this conversation in Help: Best practices
-
You can read in the TSV or JSONL, process the records into Doc objects as desired, and then write them to disk as a set of DocBin files. You can control the sentence boundaries as part of reading the raw text and creating the Doc objects you write into each DocBin. Then, with your directory of DocBin files, depending on what you want to do, for example if you need to train models, you can write your own custom loader that lazily loads them from disk rather than having spaCy read a single DocBin into memory.
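As a concrete sketch of the first step, the snippet below streams a JSONL file, tokenizes each pre-split sentence, and writes shards of DocBin files with the sentence boundaries fixed via `sent_starts`. The record layout (`{"sentences": [...]}`), the shard size, and the helper names are assumptions for illustration, not something specified in this thread; adapt the parsing to your own TSV/JSONL format.

```python
# Sketch: stream JSONL -> Doc objects with preset sentence boundaries -> sharded DocBin files.
# The {"sentences": [...]} record layout and the file naming are assumptions, not from the thread.
import json
from pathlib import Path

import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")  # tokenizer only; no pipeline components that would re-predict sentences


def doc_from_sentences(sentences):
    """Build one Doc whose sentence boundaries are exactly the given sentences."""
    words, spaces, sent_starts = [], [], []
    for sent in sentences:
        sent_doc = nlp.make_doc(sent)              # tokenize, nothing else
        for i, token in enumerate(sent_doc):
            words.append(token.text)
            spaces.append(bool(token.whitespace_))
            sent_starts.append(i == 0)             # True only on each sentence's first token
    return Doc(nlp.vocab, words=words, spaces=spaces, sent_starts=sent_starts)


def jsonl_to_docbins(in_path, out_dir, docs_per_file=1000):
    """Stream the JSONL file and write a directory of smaller .spacy shards."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    doc_bin = DocBin(attrs=["SENT_START"])         # keep the gold sentence boundaries
    shard = 0
    with open(in_path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            record = json.loads(line)
            doc_bin.add(doc_from_sentences(record["sentences"]))
            if i % docs_per_file == 0:             # flush a shard so memory stays bounded
                doc_bin.to_disk(out_dir / f"part-{shard:05d}.spacy")
                doc_bin = DocBin(attrs=["SENT_START"])
                shard += 1
    if len(doc_bin) > 0:                           # write the final partial shard
        doc_bin.to_disk(out_dir / f"part-{shard:05d}.spacy")
```

For the training side, one way to read such a directory lazily is a custom corpus reader registered in spaCy's `readers` registry; the registry name and directory layout below are again illustrative assumptions, and augmentation/limit options are left out.

```python
# Sketch of a lazily-loading corpus reader; "lazy_docbin_corpus.v1" is a made-up name.
from pathlib import Path
from typing import Callable, Iterable, Iterator

import spacy
from spacy.language import Language
from spacy.tokens import DocBin
from spacy.training import Example


@spacy.registry.readers("lazy_docbin_corpus.v1")
def create_lazy_reader(path: str) -> Callable[[Language], Iterable[Example]]:
    def read_examples(nlp: Language) -> Iterator[Example]:
        # Open one shard at a time so only a single DocBin is in memory at once.
        for shard in sorted(Path(path).glob("*.spacy")):
            doc_bin = DocBin().from_disk(shard)
            for reference in doc_bin.get_docs(nlp.vocab):
                # The stored doc is the gold reference; the predicted side is a
                # fresh, unannotated doc over the same text.
                yield Example(nlp.make_doc(reference.text), reference)

    return read_examples
```

In a training config you would then point `[corpora.train]` at this reader (`@readers = "lazy_docbin_corpus.v1"` plus a `path` setting) instead of the built-in `spacy.Corpus.v1`.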
-
Hi,
I am loading a large dataset from a TSV or JSONL file into a spaCy Doc. Is there a recommended way to do this efficiently from a file? Most examples I've seen use only small example texts.
Also, how does spaCy handle datasets that cannot fit into memory?
Finally, I already have my sentence boundaries defined, since they are simply lines in my TSV/JSONL file. How can I impose those sentence boundaries? That is, I don't want spaCy to infer them.
Thanks in advance.
J