Releases: bitextor/warc2text
v1.3.1
v1.3.0
What's Changed
- Fail when WARC file, tagfilters or urlfilters can't be opened by @ZJaume in #55
- Replacing boost::json with nlohmann::json (added --encoding-errors handling option and not producing invalid utf8 anymore) by @ZJaume in #57
- EasyBuild configs and installation instructions in the README by @nvanva in #60
- Filter by http status code by @ZJaume in #61
- Recover after a WARC file fails to be opened by @ZJaume in #63
- Add detected encoding to the metadata by @ZJaume in #64
- Fix html missing in JSONL stdout when skipping extraction by @ZJaume in #66
New Contributors
Full Changelog: v1.2.0...v1.3.0
v1.2.0
What's Changed
- Add --robotspass shunt for records related to robots.txt by @jelmervdl in #43
- Add --jsonl option by @jelmervdl in #35
- warc2html changes by @ZJaume in #50
- ZSTD compression and compression level support by @ZJaume in #51
- Move JSONL output to --stdout and allow file-based output with JSONL by @ZJaume in #52
Full Changelog: v1.1.0...v1.2.0
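Since this release can write JSONL to stdout, a downstream consumer can process the records one line at a time. The following is a minimal sketch of such a consumer, not part of warc2text itself. The field names "url" and "lang" and the helper count_by_key are hypothetical and used only for illustration; check the README for the keys warc2text actually emits.

```python
import io
import json
from collections import Counter


def count_by_key(stream, key):
    """Count JSON Lines records by the value stored under `key`.

    `stream` is any iterable of text lines (for example sys.stdin).
    Records that lack `key` are counted under "(missing)".
    """
    counts = Counter()
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        counts[record.get(key, "(missing)")] += 1
    return counts


# Hypothetical sample input; real records will have different keys.
sample = io.StringIO(
    '{"url": "http://example.com/a", "lang": "en"}\n'
    '{"url": "http://example.com/b", "lang": "en"}\n'
    '{"url": "http://example.com/c"}\n'
)
result = count_by_key(sample, "lang")
```

In a pipeline, sys.stdin would replace the in-memory sample so that the records are read as they arrive.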
v1.1.0: Merge pull request #36 from jelmervdl/fasttext-option
Changes:
- Add option to use a FastText model as a language identifier
- Records identified by CLD2 as Unknown are classified as unk instead of being dropped.