- Install Go from http://golang.org
- Clone this repo into $GOHOME/src/github.com/Kingsford-Group
- cd biblint/biblint
- go build
You can do step two via go get github.com/Kingsford-Group/biblint/biblint (you need to list the path to the biblint command, which is why biblint appears twice at the end).
Currently, biblint has no external dependencies beyond Go and its standard library.
BibLint's clean command tries to format the bib file in a consistent way. It tries to correct common mistakes, and removes information that is not part of the "citation".
Note that clean does NOT guarantee no data loss. In fact, the typical situation is that data will be lost (e.g. abstracts and many other fields are removed from the database).
Usage:
biblint clean in.bib > out.bib
Specifically, clean does the following:
-
@preamble entries are moved to the top of the file (in the order they appear in the file)
-
@string entries immediately follow any preamble entries. They are listed alphabetically sorted by the symbol they define
-
Entries follow, sorted in reverse chronological order (i.e. by year, by default). Use the -sort option to change the field that is sorted on, and the -reverse option to reverse the order (e.g. -sort journal -reverse false will sort alphabetically by journal string). Note that the default is reverse=true, so if you want alphabetical order, you must turn off reverse. biblint tries to be minimally smart about sorting: symbols are expanded to their defined value (recursively, up to depth 10), strings are compared ignoring {}, and if an int and a string are compared, the int is converted to a string and the comparison is done as strings.
-
Fields that are empty are removed
-
Non-blessed fields are removed. A field is blessed if it is a required or known optional field for any entry type or one of "key", "note", "url", "doi", "pmc", "pmid", "keywords", "issn", "isbn". Note that "abstract" tags are removed. Use -blessed f1,f2,f3... to add additional blessed fields.
-
Titles that end with [[:lower:]]\. have the terminating "." removed.
-
Pages entries that look like NUMBER -[-] NUMBER are changed to NUMBER--NUMBER
-
Pages that are aaaa--bb, where len(a) > len(b), are replaced by aaaa--aabb
-
Exact duplicates are removed. Exact dups are those that have the same entry type, same key, the same fields, and the same exact values for each field
-
If every field of entry A also appears in entry B with the same value, then A will be deleted, since B carries at least as much information. (If A and B have the same fields, one of them will be deleted arbitrarily.) This is only caught if the entries have either the same key or the same title
-
{} is used to delimit fields
-
If an entire field value is itself braced (on top of its delimiters), the extra braces are removed. This can be wrong, but the more common problem is that someone has double braced every field to avoid dealing with BibTeX quirks.
-
Individual words in title or booktitle entries that are in strange case will be surrounded by {}. Specifically, {} surrounds any word with a " or that has "sTrange" case (an uppercase letter anyplace except the first non-punctuation character that is not preceded by a hyphen). This won't brace things like "(Strange" or "Hyphenated-Word", but will brace "mRNA".
-
If an author field ends with \set\s*al.? it is replaced by " and others".
-
Author names in the "author" field are always given as von Last, First or von Last, Jr., First (names in the "editor" field are not changed)
-
Plain integer values are unquoted
-
If a month field is {Jan} or {January}, it will be converted to the predefined symbol "jan"
-
If the value of a field uniquely matches the definition of a symbol, it will be replaced by the symbol
-
Consecutive, unbraced whitespace will be replaced by a single space character
-
Non-quoted whitespace is removed from the start and end of any value
-
Missing commas after "tag=value" pairs are added
-
If an entry contains duplicate "tag=" entries, the first one is kept (as in BibTeX) with a warning
-
Lowercase, non-"small" words are capitalized in journal titles (as long as they are outside {} regions). Small words are "the", "a", "an", "but", "for", "and", "or", "nor", "to", "from", "on", "in", "of", "at", "by". (This list is likely to grow.)
-
@comment lines and non-entry text are removed
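The two page-range rules in the list above can be illustrated with a short Go sketch. This is not biblint's actual code: the function name normalizePages and the exact regular expression are assumptions made for illustration, and whitespace around the dash is treated as optional.

```go
package main

import (
	"fmt"
	"regexp"
)

// pageRange matches NUMBER, one or two dashes, NUMBER, with optional
// whitespace around the dashes. The exact pattern biblint uses is not
// documented here, so this one is an assumption.
var pageRange = regexp.MustCompile(`^\s*(\d+)\s*--?\s*(\d+)\s*$`)

// normalizePages is a hypothetical helper that applies both rules:
// "NUMBER - NUMBER" becomes "NUMBER--NUMBER", and an abbreviated end
// page such as "1234--56" is expanded to "1234--1256".
func normalizePages(pages string) string {
	m := pageRange.FindStringSubmatch(pages)
	if m == nil {
		return pages // not a plain numeric range; leave untouched
	}
	first, last := m[1], m[2]
	if len(first) > len(last) {
		// Borrow the missing leading digits from the first page.
		last = first[:len(first)-len(last)] + last
	}
	return first + "--" + last
}

func main() {
	for _, p := range []string{"100 - 110", "1234--56", "e12"} {
		fmt.Printf("%q -> %q\n", p, normalizePages(p))
	}
}
```

Values that do not look like a plain numeric range, such as "e12", are returned unchanged in this sketch.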
The check command looks for problems that can't be fixed by clean.
Usage:
biblint check in.bib
Specifically, it will report the following problems:
-
A lone, white-space-surrounded - instead of ---
-
"et al" in an author list
-
Non-ASCII characters anyplace
-
Years that are not integers
-
Use of undefined symbols
-
Duplicate defined symbols
-
Duplicate keys
-
Page ranges x--y where y < x
-
Missing required fields for each entry type
-
Fields that have an odd number of un-escaped (with \) dollar signs
-
@string definitions that define the same thing
-
Last names that are all uppercase, all lowercase, or empty (trying to catch last names resulting from the common mistake of writing author = Smith J H, which is parsed by BibTeX as first name = "Smith", last name = "J H".)
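As an illustration of one of the checks above, the dollar-sign rule could be sketched in Go roughly as follows. The function name hasOddDollars is hypothetical, and the handling of a doubled backslash (treated as an escaped backslash, so a following $ still counts) is an assumption rather than documented biblint behavior.

```go
package main

import "fmt"

// hasOddDollars reports whether value contains an odd number of dollar
// signs that are not escaped with a backslash. Hypothetical helper; the
// treatment of `\\$` below is an assumption.
func hasOddDollars(value string) bool {
	count := 0
	escaped := false
	for _, r := range value {
		switch {
		case escaped:
			// The previous rune was a backslash, so this rune is
			// escaped and never counted.
			escaped = false
		case r == '\\':
			escaped = true
		case r == '$':
			count++
		}
	}
	return count%2 == 1
}

func main() {
	for _, v := range []string{`$x$ and $y$`, `cost is $5`, `cost is \$5`} {
		fmt.Printf("%q odd=%v\n", v, hasOddDollars(v))
	}
}
```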
Errors are reported grouped by key in the following format:
Key "salmon":
2105: key "salmon" is defined more than once
1178:volume: missing required field "volume" in article
Each group starts with Key followed by the key in quotes. Each error takes one of two forms:
LINE: message
LINE:TAG: message
where LINE is the line number of the entry (the line the @ appears on) and TAG, if present, is the tag within the entry that contains the error.
The line number is given for each message because, in the case of duplicate keys, errors can be reported for any of those entries. The key is "" if the error doesn't involve an entry.
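If you want to post-process check output, the two error forms can be parsed with a few lines of Go. The sketch below is not part of biblint: the Report type and parseReport function are hypothetical, and it assumes that a message always follows its colon with a space while a TAG never starts with one, as in the example above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Report is a hypothetical container for one error line.
type Report struct {
	Line    int    // line of the entry's "@"
	Tag     string // empty for the "LINE: message" form
	Message string
}

// parseReport parses "LINE: message" or "LINE:TAG: message".
// Assumption: the message is separated from the preceding colon by a
// space, which is how the two forms are told apart here.
func parseReport(s string) (Report, error) {
	s = strings.TrimSpace(s)
	num, rest, ok := strings.Cut(s, ":")
	if !ok {
		return Report{}, fmt.Errorf("no colon in %q", s)
	}
	line, err := strconv.Atoi(num)
	if err != nil {
		return Report{}, fmt.Errorf("bad line number in %q", s)
	}
	r := Report{Line: line}
	if !strings.HasPrefix(rest, " ") {
		// No space after the first colon: a TAG comes next.
		tag, msg, ok := strings.Cut(rest, ":")
		if !ok {
			return Report{}, fmt.Errorf("missing message in %q", s)
		}
		r.Tag, rest = tag, msg
	}
	r.Message = strings.TrimSpace(rest)
	return r, nil
}

func main() {
	lines := []string{
		`2105: key "salmon" is defined more than once`,
		`1178:volume: missing required field "volume" in article`,
	}
	for _, l := range lines {
		r, err := parseReport(l)
		fmt.Printf("%+v %v\n", r, err)
	}
}
```

Group header lines such as Key "salmon": do not start with a number, so parseReport returns an error for them; a caller would handle those separately.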
The dups command tries to find duplicate entries by looking for pairs of entries that look like they have the same title. Usage:
biblint dups in.bib
This finds entries where the titles map to the same string, once case, punctuation, and small words are removed. It reports the dups by key and title, but does not remove or modify the entries.
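The title mapping described above might look something like the following Go sketch. It is an illustration only: titleKey is a hypothetical name, the small-word list is borrowed from the journal-title rule of clean (dups may use a different list), and turning punctuation into word breaks rather than deleting it is a design assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// smallWords is copied from the list given for the clean command;
// whether dups uses exactly this list is an assumption.
var smallWords = map[string]bool{
	"the": true, "a": true, "an": true, "but": true, "for": true,
	"and": true, "or": true, "nor": true, "to": true, "from": true,
	"on": true, "in": true, "of": true, "at": true, "by": true,
}

// titleKey maps a title to a comparison key: lowercase, punctuation
// replaced by spaces, small words dropped, single spaces between words.
func titleKey(title string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(title) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		} else {
			b.WriteRune(' ') // punctuation and braces act as separators
		}
	}
	var kept []string
	for _, w := range strings.Fields(b.String()) {
		if !smallWords[w] {
			kept = append(kept, w)
		}
	}
	return strings.Join(kept, " ")
}

func main() {
	a := titleKey("The Salmon of {RNA}-seq.")
	b := titleKey("salmon: RNA seq")
	fmt.Println(a == b, a)
}
```

Two entries whose titles produce the same key would then be reported as a candidate pair.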
To clean up a bib file, a set of steps to take are:
- Run biblint clean in.bib > tmp.bib
- Fix any errors reported by clean in tmp.bib; go back to the first step until there are no errors
- Run biblint dups tmp.bib
- Remove or fix any true dups reported by dups in tmp.bib
- Run biblint check tmp.bib
- Fix any errors reported by check in tmp.bib
- mv in.bib bad.bib && mv tmp.bib in.bib
The way to merge two or more files is to cat them and then run the cleaning pipeline above. If the bib files contain true duplicates or one file contains strictly more information than the other for an entry, the dups or less informative entries will be removed by clean.
Use -quiet to suppress printing of the banner.
The biblint parser accepts some bib syntax that is not officially supported by BibTeX. This is done for a combination of reasons: sometimes the bib file can still be parsed correctly, and sometimes rejecting non-BibTeX syntax would complicate the parser too much. For example:
-
Commas separating tag=value pairs in an entry are optional; they will be added by clean if they are missing
-
BibTeX allows both {} and () to delimit string, preamble, and entry types (but not key values, which must be either {} or ""). That is, you can say @article(key,title="foo"). We also allow both () and {} but we also allow {) and (}. We will convert all these to {}.
-
@comment in BibTeX comments to the end of the line. This is what we do as well. We strip all comments from the output .bib file. Someday, it might be nice to preserve @comment comments (but not non-entry junk) in the output.
-
We do not yet support the # string concatenation operator.
-
A title of the form "strange {title here" will be converted to strange {title here}. That is, an unmatched opening { will be closed at the end of a string. Escaping with \{ doesn't stop this from happening.
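The brace-closing behavior in the last item can be pictured with a small Go sketch. This is a guess at the effect, not the parser's real code: closeUnmatchedBraces is a hypothetical name, and ignoring a stray closing brace is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// closeUnmatchedBraces appends one "}" for every "{" that is still
// open at the end of s. Backslash escapes are deliberately ignored,
// mirroring the note above that \{ does not prevent the closing.
func closeUnmatchedBraces(s string) string {
	depth := 0
	for _, r := range s {
		switch r {
		case '{':
			depth++
		case '}':
			if depth > 0 {
				depth--
			}
			// A stray "}" with nothing open is ignored (assumption).
		}
	}
	return s + strings.Repeat("}", depth)
}

func main() {
	fmt.Println(closeUnmatchedBraces("strange {title here"))
	fmt.Println(closeUnmatchedBraces(`escaped \{ brace`))
}
```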