
Training set caching / corpus representation caching #86

Open
johann-petrak opened this issue Oct 8, 2018 · 0 comments
Add a parameter (perhaps just something to be used as an "algorithmParameter") to enable training set caching: whatever corpus representation the chosen algorithm uses gets saved to the data directory (under a name specific to the type of representation) after the training set is complete, but before training set finalization and the training itself are done.

If caching is enabled and a cache file already exists, it should be read in before document processing starts, to initialize the instance list; the documents then add to that instance list. Caching should probably also save the feature information and compare it to the feature info used for the new documents, throwing an error if there is a mismatch.
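A minimal sketch of what such caching could look like. The function names, the pickle-based file format, and the shape of the feature info are all hypothetical, not existing LearningFramework API; the point is only to show the load-or-initialize flow and the feature-info mismatch check described above:

```python
import os
import pickle


def load_or_init_cache(cache_path, feature_info):
    """Load a cached instance list if a cache file exists, else start empty.

    Raises ValueError if the cached feature info does not match the
    feature info configured for the new documents.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            cached = pickle.load(f)
        if cached["feature_info"] != feature_info:
            raise ValueError(
                "Cached feature info does not match current feature configuration"
            )
        return cached["instances"]
    return []


def save_cache(cache_path, instances, feature_info):
    """Save the instance list plus feature info after the training set is
    complete, but before finalization and training."""
    with open(cache_path, "wb") as f:
        pickle.dump({"instances": instances, "feature_info": feature_info}, f)
```

On a re-train, `load_or_init_cache` would be called before processing documents, newly processed documents would append to the returned list, and `save_cache` would overwrite the cache with the enlarged list.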

Rationale: this helps in at least two situations:

  • one wants to re-train a model on the same corpus without changing the features. Without caching, the time-consuming extraction has to run again.
  • one wants to re-train a model on a slightly larger corpus after adding a few new instances. So far this was only possible by adding the documents to the corpus and converting the whole corpus; with caching, just the new documents can be added to the cache each time.

Note: caching does not make sense for out-of-memory (on-disk) corpus representations, like the dense JSON representation used for PyTorch/Keras, since the saved corpus representation already is a kind of cache. The difference/problem is the metadata: in some cases, adding to the corpus would require updating the metadata, so in order to support this properly, we need two functions in the python data backend:

  • combine data files into one large data file: this is essentially just appending
  • combine meta files into one meta file: this needs to combine the stats wherever possible and accept that not everything can be combined properly otherwise. The most important parts to combine are word counts, min/max, and instance counts. Averages can be combined by using the number of instances as weights.
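The two functions above could look roughly like this in the Python backend. The metadata field names (`n_instances`, `word_counts`, `min`, `max`, `mean`) are assumptions for illustration, not the backend's actual meta file schema:

```python
def append_data_file(src_path, dst_path):
    """Combine data files: append the lines of src to dst."""
    with open(src_path) as fin, open(dst_path, "a") as fout:
        for line in fin:
            fout.write(line)


def combine_meta(meta1, meta2):
    """Combine two metadata dicts: sum word counts, merge min/max,
    and combine averages weighted by the number of instances."""
    n1, n2 = meta1["n_instances"], meta2["n_instances"]
    words = set(meta1["word_counts"]) | set(meta2["word_counts"])
    return {
        "n_instances": n1 + n2,
        "word_counts": {
            w: meta1["word_counts"].get(w, 0) + meta2["word_counts"].get(w, 0)
            for w in words
        },
        "min": min(meta1["min"], meta2["min"]),
        "max": max(meta1["max"], meta2["max"]),
        # weighted average: exact, unlike e.g. medians or variances,
        # which cannot be recombined from summaries alone
        "mean": (meta1["mean"] * n1 + meta2["mean"] * n2) / (n1 + n2),
    }
```

This illustrates why the weighted-average trick works for means but "live with not being able to combine everything" applies to stats like medians, which would need the raw values.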