
v0.5.0

@MaxDall released this 15 Feb 15:40
· 13 commits to master since this release
fa4342a

🚀 Get millions of labeled images in just a few hours* 🚀

This release adds image extraction and new publishers, updates existing ones, and fixes several bugs.

*In testing, crawling 1 million images that each had at least a caption or description took 1 hour and 20 minutes. The test ran on a machine with 10 Gbit/s of bandwidth, using the CC-NEWS crawl with 50 processes. Results may vary with the use case and available bandwidth.
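To put the figures above in perspective, here is a rough back-of-the-envelope calculation of the implied throughput. It uses only the numbers stated in the note; the variable names are illustrative, and the even split across processes is an assumption rather than a measured result.

```python
# Figures taken from the test described above.
images = 1_000_000
elapsed_seconds = 1 * 3600 + 20 * 60  # 1 hour 20 minutes = 4800 s
processes = 50

# Overall throughput, and the share per process assuming an even split.
per_second = images / elapsed_seconds
per_process = per_second / processes

print(f"{per_second:.0f} images/s overall, {per_process:.1f} images/s per process")
```

That works out to roughly 208 images per second overall, or about 4.2 per second per process under the even-split assumption.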

Image Extraction

Thanks to @addie9800, Fundus now provides image extraction for most of our publishers. For each crawled article, Fundus automatically parses image links and metadata, allowing users to retrieve millions of labeled images in just a few hours. Parsed images include the caption, description, author, and various image versions (sorted by size).
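As a minimal sketch of how such metadata could be turned into labeled image pairs, the example below uses hypothetical `Image` and `ImageVersion` dataclasses that mirror the fields named above. These are stand-ins for illustration only, not the actual Fundus classes, and the `width` field and `labeled_pairs` helper are assumptions; consult the Fundus documentation for the real attribute names.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ImageVersion:
    # Hypothetical stand-in for one rendition of an image.
    url: str
    width: int


@dataclass
class Image:
    # Hypothetical stand-in carrying the metadata fields listed above.
    versions: List[ImageVersion]
    caption: Optional[str] = None
    description: Optional[str] = None
    authors: List[str] = field(default_factory=list)


def labeled_pairs(images: List[Image]) -> List[Tuple[str, str]]:
    """Return (url, label) pairs using the widest version of each labeled image."""
    pairs = []
    for image in images:
        # Prefer the caption, fall back to the description.
        label = image.caption or image.description
        if not label or not image.versions:
            continue  # skip images without a label or without any version
        widest = max(image.versions, key=lambda version: version.width)
        pairs.append((widest.url, label))
    return pairs


sample = [
    Image(
        versions=[
            ImageVersion("https://example.com/a-320.jpg", 320),
            ImageVersion("https://example.com/a-1280.jpg", 1280),
        ],
        caption="A harbor at dusk",
        authors=["Jane Doe"],
    ),
    Image(versions=[ImageVersion("https://example.com/b.jpg", 640)]),  # no label
]
pairs = labeled_pairs(sample)
print(pairs)
```

Picking the widest version with `max` avoids relying on the ordering of the versions list, so the sketch behaves the same whether the list is sorted ascending or descending.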

[Figure: images-log-scale] Language distribution of one million crawled images, excluding languages with fewer than 1000 entries.

See our list of supported publishers to find out which ones are covered.

New Publishers for it, ch, jp, es, dk, tz, be

With this major release, Fundus now offers support for 124 publishers from 22 different countries.

IT

  • Initial support for Italian publishers, starting with La Repubblica by @ruggsea in #670
  • add CorriereDellaSera by @addie9800 in #677
  • Support for 2 new italian newspapers - Corriere della Sera & Il Giornale by @ruggsea in #700

CH

JP

ES

DK

TZ

BE

Update Publishers

Bug fixes

  • Reraise exceptions in main thread when error handling is set to raise by @MaxDall in #662
  • Fix a bug returning None for empty values in xpath_search by @MaxDall in #671
  • Add IST to tzinfo by @MaxDall in #690
  • Fix article serialization for images by @MaxDall in #703

Improvements

New Contributors

Full Changelog: v0.4.6...v0.5.0