Get millions of labeled images in just a few hours*
This release adds image extraction and new publishers, updates existing ones, and fixes several bugs.
*Testing involved crawling 1 million images, each with at least a caption or description, which took 1 hour and 20 minutes. This was done on a machine with 10 Gbit/s bandwidth, using the CC-NEWS crawl running with 50 processes. Results may vary based on the use case and bandwidth.
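As a rough back-of-the-envelope reading of the figures above, the reported run works out to about 208 images per second overall, or roughly 4 images per second per process. The variable names below are illustrative only and are not part of Fundus.

```python
# Throughput implied by the test run described above.
total_images = 1_000_000
duration_seconds = (1 * 60 + 20) * 60  # 1 hour 20 minutes
processes = 50

images_per_second = total_images / duration_seconds
images_per_second_per_process = images_per_second / processes

print(f"{images_per_second:.1f} images/s overall")
print(f"{images_per_second_per_process:.2f} images/s per process")
```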
Image Extraction
Thanks to @addie9800, Fundus now provides image extraction for most of our publishers. Each crawled article automatically parses image links and metadata, allowing users to retrieve millions of labeled images in just a few hours. Parsed images include the caption, description, author, and various image versions (sorted by size).
Language distribution of one million crawled images, excluding languages with fewer than 1000 entries
Check out our list of supported publishers for the full overview.
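To make the shape of the parsed image data more concrete, here is a minimal sketch of how such records could be modelled and filtered. The `Image` and `ImageVersion` classes, their field names, and the smallest-first sort order are assumptions for illustration; they are not the Fundus API, so consult the Fundus documentation for the actual attribute names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ImageVersion:
    # Hypothetical: one rendition of an image at a given resolution.
    url: str
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height


@dataclass
class Image:
    # Hypothetical container mirroring the fields named in the release notes.
    versions: List[ImageVersion]
    caption: Optional[str] = None
    description: Optional[str] = None
    authors: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Keep versions sorted by size (assumed order: smallest first).
        self.versions = sorted(self.versions, key=lambda v: v.area)

    @property
    def is_labeled(self) -> bool:
        # An image counts as labeled if it has a caption or a description.
        return bool(self.caption or self.description)

    @property
    def largest(self) -> ImageVersion:
        return self.versions[-1]


images = [
    Image(
        versions=[
            ImageVersion("https://example.com/a_large.jpg", 1200, 800),
            ImageVersion("https://example.com/a_small.jpg", 300, 200),
        ],
        caption="A harbour at dusk",
        authors=["Jane Doe"],
    ),
    Image(versions=[ImageVersion("https://example.com/b.jpg", 640, 480)]),
]

# Keep only labeled images and report their largest version.
labeled = [img for img in images if img.is_labeled]
for img in labeled:
    print(img.caption, "->", img.largest.url)
```

Filtering on caption or description mirrors the "at least a caption or description" criterion used in the test run above.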
New Publishers for IT, CH, JP, ES, DK, TZ, BE
With this major release, Fundus now supports 124 publishers from 22 different countries.
IT
- Initial support for Italian publishers, starting with La Repubblica by @ruggsea in #670
- Add `CorriereDellaSera` by @addie9800 in #677
- Support for 2 new Italian newspapers - Corriere della Sera & Il Giornale by @ruggsea in #700
CH
JP
- Add `Taipei Times` by @MaxDall in #674
- Add `AsahiShimbun` by @MaxDall in #682
- Add `ChunichiShimbun` and `TokyoShimbun` by @MaxDall in #683
- Add `MainichiShimbun` by @MaxDall in #685
- Add `Nikkei` by @MaxDall in #686
- Add `SankeiShimbun` by @MaxDall in #688
- Add `NikkanGeadai` by @MaxDall in #689
ES
- Add `El Mundo` by @MaxDall in #675
- Add `ABC` by @addie9800 in #681
- Add `LaVanguardia` by @addie9800 in #684
DK
- Add `DK` by @addie9800 in #696
TZ
- Add Tanzanian Publishers by @addie9800 in #691
BE
- Add `BE` by @addie9800 in #697
Update Publishers
- Update `FreiePresse` by @addie9800 in #663
- Fix `Metro` by @addie9800 in #665
- Update `BoersenZeitung` parser by @MaxDall in #666
- Update BBC by @addie9800 in #668
- Layout Change `SRF` by @addie9800 in #680
- Add parser `v1_1` - `iNews` by @addie9800 in #693
- Update `Dagbladet` by @addie9800 in #695
Bug fixes
- Reraise exceptions in main thread when error handling is set to `raise` by @MaxDall in #662
- Fix a bug returning `None` for empty values in `xpath_search` by @MaxDall in #671
- Add `IST` to tzinfo by @MaxDall in #690
- Fix article serialization for `images` by @MaxDall in #703
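For readers unfamiliar with the problem behind the first fix above: an exception raised inside a Python worker thread does not propagate to the main thread on its own; by default it is only reported through `threading.excepthook`. The sketch below shows one generic way to collect worker exceptions and re-raise them in the main thread. It is not the Fundus implementation; `run_in_threads` and the `"suppress"` value are hypothetical names chosen for illustration.

```python
import threading
from queue import Queue
from typing import Callable, List


def run_in_threads(
    tasks: List[Callable[[], None]], error_handling: str = "raise"
) -> List[Exception]:
    """Run tasks in worker threads and collect any exceptions they raise.

    Hypothetical helper: with error_handling="raise", the first collected
    exception is re-raised in the calling (main) thread.
    """
    errors: "Queue[Exception]" = Queue()

    def worker(task: Callable[[], None]) -> None:
        try:
            task()
        except Exception as exc:
            # Hand the exception over to the main thread instead of losing it.
            errors.put(exc)

    threads = [threading.Thread(target=worker, args=(task,)) for task in tasks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    collected: List[Exception] = []
    while not errors.empty():
        collected.append(errors.get())

    if collected and error_handling == "raise":
        raise collected[0]
    return collected


def ok() -> None:
    pass


def broken() -> None:
    raise ValueError("parsing failed")


try:
    run_in_threads([ok, broken], error_handling="raise")
except ValueError as exc:
    caught = str(exc)
    print("re-raised in main thread:", caught)

# With any other setting (hypothetical "suppress"), errors are only returned.
suppressed = run_in_threads([ok, broken], error_handling="suppress")
print("collected without raising:", suppressed)
```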
Improvements
New Contributors
Full Changelog: v0.4.6...v0.5.0