genre

Public repository for documents and code associated with the project "Understanding Genre in a Collection of a Million Volumes," supported by the NEH and ACLS 2013-15.

Some of these materials were originally developed in other repositories, so you'll find that subfolders often include links pointing back to the original repo. This should only matter if you want to trace development history.

The workflow we followed moves from browser -> munging-> features -> pages -> confidencefilter.

But probably most users of this repo will be primarily interested in the utilities subfolder, which contains Python utilities that can be used to align these genre predictions with HTRC feature files or HathiTrust zip files.

For fuller documentation, see the interim project report.

1700-1899features and 1900-1922features

Training data we used to create two models. The 1900-1922 folder actually contains a lot of earlier training data; we overlapped the two models.

browser

A GUI we used to create page-level training data. Saves data in ARFF format. Written by Michael L. Black and Boris Capitanu.

confidencefilter

After page-level predictions had been produced by the code in /pages, we used these Python scripts to train models of their likely accuracy at a volume level, and used those confidence metrics to filter datasets of drama, fiction, and poetry.

features

Python code that extracted features at a page level. This really needs to be refactored; it's probably the least portable part of the workflow right now.

munging

This helpfully-named repository contains a range of Python scripts we used, for instance, to select features or to evaluate the performance of cross-validated models.

pages

Contains the Java code actually used for page-level classification.

report

Contains the LaTeX files for the interim project report.

utilities

Contains Python utilities for extracting tabular metadata from MARCxml, aligning page-level predictions with pages in a HathiTrust zip file, and so on. A very early working version of the pagealigner utility is now up, but let's call it a very early alpha! Better version on the way within a week.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

genre

1700-1899features and 1900-1922features

browser

confidencefilter

features

munging

pages

report

utilities

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
1700-1899features		1700-1899features
1900-1922features		1900-1922features
browser		browser
confidencefilter		confidencefilter
data		data
features		features
munging		munging
pages		pages
report		report
utilities		utilities
.gitignore		.gitignore
LICENSE		LICENSE
blogpost.txt		blogpost.txt
readme.md		readme.md

License

tedunderwood/genre

Folders and files

Latest commit

History

Repository files navigation

genre

1700-1899features and 1900-1922features

browser

confidencefilter

features

munging

pages

report

utilities

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages