
Training set caching / corpus representation caching #86

Open
johann-petrak opened this issue Oct 8, 2018 · 0 comments
Add a parameter (perhaps just something to be used as an "algorithmParameter") to enable training set caching: whatever corpus representation the chosen algorithm uses gets saved to the data directory (under a name specific to the type of representation) after the training set is complete, but before training set finalization and the training itself are done.

If caching is enabled and a cache file already exists, it should be read in before document processing starts, to initialize the instance list; the documents then add to that instance list. Caching should probably also save the feature information and compare it to the feature info used for the new documents, throwing an error if there is a mismatch.
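A minimal sketch of what such caching could look like. The function names, the pickle-based file format, and the shape of the feature info are all hypothetical, not existing LearningFramework API; the point is only to show the load-or-initialize flow and the feature-info mismatch check described above:

```python
import os
import pickle


def load_or_init_cache(cache_path, feature_info):
    """Load a cached instance list if a cache file exists, else start empty.

    Raises ValueError if the cached feature info does not match the
    feature info configured for the new documents.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            cached = pickle.load(f)
        if cached["feature_info"] != feature_info:
            raise ValueError(
                "Cached feature info does not match current feature configuration"
            )
        return cached["instances"]
    return []


def save_cache(cache_path, instances, feature_info):
    """Save the instance list plus feature info after the training set is
    complete, but before finalization and training."""
    with open(cache_path, "wb") as f:
        pickle.dump({"instances": instances, "feature_info": feature_info}, f)
```

On a re-train, `load_or_init_cache` would be called before processing documents, newly processed documents would append to the returned list, and `save_cache` would overwrite the cache with the enlarged list.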

Rationale: this helps in at least two situations:

  • one wants to re-train a model on the same corpus without changing the features. Without caching, the time-consuming extraction has to run again.
  • one wants to re-train a model on a slightly larger corpus after adding a few new instances. So far this was only possible by adding the documents to the corpus and converting the whole corpus; with caching, just the new documents can be added to the cache each time.

Note: caching does not make sense for out-of-memory (on-disk) corpus representations, like the dense JSON representation used for PyTorch/Keras, since the saved corpus representation already is a kind of cache. The difference/problem is the metadata: in some cases, adding to the corpus would require updating the metadata, so in order to support this properly, we need two functions in the python data backend:

  • combine data files into one large data file: this is essentially just appending
  • combine meta files into one meta file: this needs to combine the stats wherever possible and accept that not everything can be combined properly otherwise. The most important parts to combine are word counts, min/max, and instance counts. Averages can be combined by using the number of instances as weights.
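The two functions above could look roughly like this in the Python backend. The metadata field names (`n_instances`, `word_counts`, `min`, `max`, `mean`) are assumptions for illustration, not the backend's actual meta file schema:

```python
def append_data_file(src_path, dst_path):
    """Combine data files: append the lines of src to dst."""
    with open(src_path) as fin, open(dst_path, "a") as fout:
        for line in fin:
            fout.write(line)


def combine_meta(meta1, meta2):
    """Combine two metadata dicts: sum word counts, merge min/max,
    and combine averages weighted by the number of instances."""
    n1, n2 = meta1["n_instances"], meta2["n_instances"]
    words = set(meta1["word_counts"]) | set(meta2["word_counts"])
    return {
        "n_instances": n1 + n2,
        "word_counts": {
            w: meta1["word_counts"].get(w, 0) + meta2["word_counts"].get(w, 0)
            for w in words
        },
        "min": min(meta1["min"], meta2["min"]),
        "max": max(meta1["max"], meta2["max"]),
        # weighted average: exact, unlike e.g. medians or variances,
        # which cannot be recombined from summaries alone
        "mean": (meta1["mean"] * n1 + meta2["mean"] * n2) / (n1 + n2),
    }
```

This illustrates why the weighted-average trick works for means but "live with not being able to combine everything" applies to stats like medians, which would need the raw values.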