Releases: bitextor/warc2text

v1.3.1

07 Feb 14:55

v1.3.0

06 Feb 16:32

What's Changed

  • Fail when WARC file, tagfilters or urlfilters can't be opened by @ZJaume in #55
  • Replace boost::json with nlohmann::json (adds the --encoding-errors handling option and no longer produces invalid UTF-8) by @ZJaume in #57
  • EasyBuild configs and installation instructions in the README by @nvanva in #60
  • Filter by HTTP status code by @ZJaume in #61
  • Recover after a WARC file fails to be opened by @ZJaume in #63
  • Add detected encoding to the metadata by @ZJaume in #64
  • Fix HTML missing in JSONL stdout when skipping extraction by @ZJaume in #66
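
Several of the changes above concern the JSONL output (valid UTF-8, the HTML field being present on stdout). As a rough illustration of how a downstream consumer might check for a missing field, here is a minimal Python sketch. The key names `url` and `html` and the helper names are assumptions made for this example only; they are not taken from the warc2text documentation, and real records may use different keys.

```python
import json
from typing import Dict, Iterable, Iterator, List


def iter_records(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def missing_field(records: Iterable[Dict], field: str) -> List[int]:
    """Return the 1-based positions of records that lack `field`."""
    return [i for i, rec in enumerate(records, start=1) if field not in rec]


# Hypothetical sample lines; the keys are illustrative, not warc2text's schema.
sample = [
    '{"url": "http://example.com/a", "html": "<p>a</p>"}',
    '',
    '{"url": "http://example.com/b"}',
]

print(missing_field(iter_records(sample), "html"))
```

In practice the lines would come from a file or a pipe rather than an in-memory list; the functions accept any iterable of strings.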

Full Changelog: v1.2.0...v1.3.0

v1.2.0

02 Feb 14:41

What's Changed

  • Add --robotspass shunt for records related to robots.txt by @jelmervdl in #43
  • Add --jsonl option by @jelmervdl in #35
  • warc2html changes by @ZJaume in #50
  • ZSTD compression and compression level support by @ZJaume in #51
  • Move JSONL output to --stdout and allow file-based output with JSONL by @ZJaume in #52

Full Changelog: v1.1.0...v1.2.0

v1.1.0: Merge pull request #36 from jelmervdl/fasttext-option

01 Aug 13:09
eac887e

Changes:

  • Add option to use a FastText model as a language identifier
  • Records identified by CLD2 as Unknown are classified as unk instead of being dropped.
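
The second change can be pictured as a fallback label rather than a filter. The following Python sketch only illustrates that idea; it is not the tool's implementation, and the function name, the `(text, lang)` record shape, and the use of `None` for an unidentified language are assumptions for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

UNKNOWN_LABEL = "unk"  # label named in the release note


def group_by_language(
    records: Iterable[Tuple[str, Optional[str]]]
) -> Dict[str, List[str]]:
    """Group texts by language code, keeping unidentified ones under 'unk'."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text, lang in records:
        # A missing language falls back to the unknown label instead of
        # the record being discarded.
        groups[lang or UNKNOWN_LABEL].append(text)
    return dict(groups)


# Hypothetical input: None stands for "language not identified".
grouped = group_by_language([("hola", "es"), ("???", None), ("hi", "en")])
print(grouped)
```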

v1.0.0

01 Aug 13:08
673e371
Paragraph indexes now start at 1