Ingest Large File of Training Data to Create Doc #12461
jon2718
started this conversation in Help: Best practices
-
You can read in the TSV or JSONL, process the records into Doc objects as desired, and then write them to disk as a set of DocBin files. You can control the sentence boundaries as part of reading the raw text and creating the Doc objects you write into each DocBin. Then, with your directory of DocBin files, depending on what you want to do, for example if you need to train models, you can write your own custom loader that lazily loads them from disk rather than having spaCy read a single DocBin into memory.
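As a concrete sketch of the first step, the snippet below streams a JSONL file, tokenizes each pre-split sentence, and writes shards of DocBin files with the sentence boundaries fixed via `sent_starts`. The record layout (`{"sentences": [...]}`), the shard size, and the helper names are assumptions for illustration, not something specified in this thread; adapt the parsing to your own TSV/JSONL format.

```python
# Sketch: stream JSONL -> Doc objects with preset sentence boundaries -> sharded DocBin files.
# The {"sentences": [...]} record layout and the file naming are assumptions, not from the thread.
import json
from pathlib import Path

import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")  # tokenizer only; no pipeline components that would re-predict sentences


def doc_from_sentences(sentences):
    """Build one Doc whose sentence boundaries are exactly the given sentences."""
    words, spaces, sent_starts = [], [], []
    for sent in sentences:
        sent_doc = nlp.make_doc(sent)              # tokenize, nothing else
        for i, token in enumerate(sent_doc):
            words.append(token.text)
            spaces.append(bool(token.whitespace_))
            sent_starts.append(i == 0)             # True only on each sentence's first token
    return Doc(nlp.vocab, words=words, spaces=spaces, sent_starts=sent_starts)


def jsonl_to_docbins(in_path, out_dir, docs_per_file=1000):
    """Stream the JSONL file and write a directory of smaller .spacy shards."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    doc_bin = DocBin(attrs=["SENT_START"])         # keep the gold sentence boundaries
    shard = 0
    with open(in_path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            record = json.loads(line)
            doc_bin.add(doc_from_sentences(record["sentences"]))
            if i % docs_per_file == 0:             # flush a shard so memory stays bounded
                doc_bin.to_disk(out_dir / f"part-{shard:05d}.spacy")
                doc_bin = DocBin(attrs=["SENT_START"])
                shard += 1
    if len(doc_bin) > 0:                           # write the final partial shard
        doc_bin.to_disk(out_dir / f"part-{shard:05d}.spacy")
```

For the training side, one way to read such a directory lazily is a custom corpus reader registered in spaCy's `readers` registry; the registry name and directory layout below are again illustrative assumptions, and augmentation/limit options are left out.

```python
# Sketch of a lazily-loading corpus reader; "lazy_docbin_corpus.v1" is a made-up name.
from pathlib import Path
from typing import Callable, Iterable, Iterator

import spacy
from spacy.language import Language
from spacy.tokens import DocBin
from spacy.training import Example


@spacy.registry.readers("lazy_docbin_corpus.v1")
def create_lazy_reader(path: str) -> Callable[[Language], Iterable[Example]]:
    def read_examples(nlp: Language) -> Iterator[Example]:
        # Open one shard at a time so only a single DocBin is in memory at once.
        for shard in sorted(Path(path).glob("*.spacy")):
            doc_bin = DocBin().from_disk(shard)
            for reference in doc_bin.get_docs(nlp.vocab):
                # The stored doc is the gold reference; the predicted side is a
                # fresh, unannotated doc over the same text.
                yield Example(nlp.make_doc(reference.text), reference)

    return read_examples
```

In a training config you would then point `[corpora.train]` at this reader (`@readers = "lazy_docbin_corpus.v1"` plus a `path` setting) instead of the built-in `spacy.Corpus.v1`.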
-
Hi,
I am loading a large dataset from a TSV or JSONL file into a spaCy Doc. Is there a recommended way to do this efficiently from a file? Most examples I've seen use only small example texts.
Also, how does spaCy handle datasets that cannot fit into memory?
Finally, I already have my sentence boundaries defined, since they are simply lines in my TSV/JSONL file. How can I impose those sentence boundaries? That is, I don't want spaCy to infer them.
Thanks in advance.
J