🚀 Get millions of labeled images in just a few hours^* 🚀

This release adds image extraction and new publishers, updates existing ones, and fixes several bugs.

^*In our test, crawling 1 million images that each include at least a caption or description took 1 hour and 20 minutes. The test ran on a machine with 10 Gbit/s bandwidth, with the CC-NEWS crawl running with 50 processes. Results may vary based on the use case and bandwidth.
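
To put the footnote's figures in perspective, the arithmetic can be sketched as below. This is a rough back-of-the-envelope calculation, not a benchmark: the even split across processes and the linear extrapolation in `estimated_hours` (a hypothetical helper) are assumptions.

```python
# Figures taken from the footnote above.
images = 1_000_000
elapsed_seconds = (1 * 60 + 20) * 60  # 1 h 20 min = 4800 s
processes = 50

# Overall throughput, and throughput per process assuming an even split.
images_per_second = images / elapsed_seconds
images_per_process_per_second = images_per_second / processes


def estimated_hours(target_images: int, rate: float = images_per_second) -> float:
    """Naive linear extrapolation (assumption); real runs depend on bandwidth and use case."""
    return target_images / rate / 3600


print(f"{images_per_second:.1f} images/s overall")                    # 208.3
print(f"{images_per_process_per_second:.2f} images/s per process")   # 4.17
print(f"{estimated_hours(5_000_000):.2f} h for 5 million images")    # 6.67
```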

Image Extraction

Thanks to @addie9800, Fundus now provides image extraction for most of our publishers. Image links and metadata are parsed automatically for each crawled article, allowing users to retrieve millions of labeled images in just a few hours. Parsed images include the caption, description, author, and various image versions (sorted by size).

Language distribution of one million crawled images, excluding languages with fewer than 1000 images

See our supported publishers overview to check whether a given publisher is covered.
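
As an illustration of the image metadata described in this section, here is a minimal, self-contained sketch of how such data could be modeled and filtered. It does not use Fundus itself: `ImageVersion`, `LabeledImage`, `labeled_only`, and all field names are hypothetical stand-ins, and the ascending sort order is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ImageVersion:
    """One rendition of an image (hypothetical, not a Fundus class)."""
    url: str
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height


@dataclass
class LabeledImage:
    """Hypothetical container for the metadata listed in this section."""
    versions: List[ImageVersion]
    caption: Optional[str] = None
    description: Optional[str] = None
    author: Optional[str] = None

    def __post_init__(self) -> None:
        # Keep versions sorted by size, smallest first (direction is an assumption).
        self.versions = sorted(self.versions, key=lambda v: v.area)

    @property
    def largest(self) -> ImageVersion:
        # Raises IndexError if the image has no versions.
        return self.versions[-1]

    @property
    def is_labeled(self) -> bool:
        # Same criterion as the footnote: at least a caption or a description.
        return bool(self.caption or self.description)


def labeled_only(images: List[LabeledImage]) -> List[LabeledImage]:
    """Keep only images that carry a caption or a description."""
    return [img for img in images if img.is_labeled]
```

With a model like this, building a labeled dataset comes down to filtering with `labeled_only` and reading `largest.url` together with the caption or description of each remaining image.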

New Publishers for `it`, `ch`, `jp`, `es`, `dk`, `tz`, `be`

With this major release, Fundus now offers support for 124 publishers from 22 different countries.

`IT`

Initial support for Italian publishers, starting with La Repubblica by @ruggsea in #670
Add CorriereDellaSera by @addie9800 in #677
Support for 2 new Italian newspapers - Corriere della Sera & Il Giornale by @ruggsea in #700

`CH`

Add 20 Minuten by @MaxDall in #673

`JP`

Add Taipei Times by @MaxDall in #674
Add AsahiShimbun by @MaxDall in #682
Add ChunichiShimbun and TokyoShimbun by @MaxDall in #683
Add MainichiShimbun by @MaxDall in #685
Add Nikkei by @MaxDall in #686
Add SankeiShimbun by @MaxDall in #688
Add NikkanGeadai by @MaxDall in #689

`ES`

Add El Mundo by @MaxDall in #675
Add ABC by @addie9800 in #681
Add LaVanguardia by @addie9800 in #684

`DK`

Add DK by @addie9800 in #696

`TZ`

Add Tanzanian Publishers by @addie9800 in #691

`BE`

Add BE by @addie9800 in #697

Update Publishers

Update FreiePresse by @addie9800 in #663
Fix Metro by @addie9800 in #665
Update BoersenZeitung parser by @MaxDall in #666
Update BBC by @addie9800 in #668
Layout Change SRF by @addie9800 in #680
Add parser v1_1 - iNews by @addie9800 in #693
Update Dagbladet by @addie9800 in #695

Bug fixes

Reraise exceptions in main thread when error handling is set to raise by @MaxDall in #662
Fix a bug returning None for empty values in xpath_search by @MaxDall in #671
Add IST to tzinfo by @MaxDall in #690
Fix article serialization for images by @MaxDall in #703

Improvements

Add octet-stream to decompressor by @MaxDall in #660

New Contributors

@ruggsea made their first contribution in #670

Full Changelog: v0.4.6...v0.5.0
