license: odc-by
language:
- en
pretty_name: BBC News from FineWeb
size_categories:
- 10K<n<100K
This dataset provides a filtered subset of BBC News articles from the realnewslike subset of the FineWeb dataset, containing approximately 77k articles from BBC News domains.
- Curated by: Louis Maddox (@permutans on HuggingFace and X/Twitter)
- License: ODC-BY (inherited from FineWeb)
- Language: English
- Repository: https://huggingface.co/datasets/permutans/fineweb-bbc-news
- Source Dataset: HuggingFaceFW/fineweb
- Paper: https://arxiv.org/abs/2406.17557 (FineWeb paper)
Suitable for text analysis and NLP tasks focused on news content, particularly when working with BBC News articles. The dataset provides cleaned article text without metadata like bylines or publication dates.
This dataset should not be used as a comprehensive archive of BBC News content, as it contains only the articles captured in FineWeb's crawl (from CommonCrawl between 2013 and 2024). It should not be assumed to contain all articles from any given time period.
Example format:
{
'url': 'news.bbc.co.uk/news/article-path',
'text': 'Article content...'
}
- url: URL of the article with query parameters removed
- text: Full article text content
- Contains approximately 77k articles
- No validation split in current version
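As a rough illustration of working with records that have the two fields described above, the following sketch counts articles per host. The `Article` type, the helper names, and the sample records are hypothetical; in practice the records would come from the dataset itself (for example via the `datasets` library), which is not exercised here.

```python
from collections import Counter
from typing import Iterable, TypedDict
from urllib.parse import urlsplit


class Article(TypedDict):
    """Hypothetical type mirroring the two documented fields."""
    url: str
    text: str


def host_of(url: str) -> str:
    """Return the lowercased host of a URL, tolerating a missing scheme."""
    if "://" not in url:
        # urlsplit only recognises the host when the URL starts with "//"
        url = "//" + url
    return urlsplit(url).netloc.lower()


def articles_per_host(records: Iterable[Article]) -> Counter:
    """Count how many records fall under each host."""
    return Counter(host_of(record["url"]) for record in records)


# Made-up sample records in the documented format (not real dataset rows).
sample = [
    {"url": "news.bbc.co.uk/news/article-path", "text": "Article content..."},
    {"url": "https://www.bbc.co.uk/news/another-path", "text": "More content..."},
]
print(articles_per_host(sample))
```

A per-host count like this is one simple way to start looking at how the different BBC News domains are represented.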
Created to provide an easily accessible dataset of BBC News articles while offering a focused view into the FineWeb dataset's coverage of major news sources. It enables analysis of FineWeb's completeness and motivates investigation of alternative data acquisition methods.
- Filtered from FineWeb's dated subsets (i.e. not default subset nor sample subsets)
- Limited to domains: news.bbc.co.uk, www.bbc.co.uk/news, www.bbc.com/news
- URL cleaning: removed query parameters
- Regional news content excluded due to sparse coverage in source data
- No modifications to article text content
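The domain filter and URL cleaning steps listed above could be sketched roughly as follows. This is not the code used to build the dataset: the function names and the path-prefix matching rule are assumptions, and the exclusion of regional news is not implemented because the source does not say how it was done.

```python
from urllib.parse import urlsplit

# Host -> required path prefix, taken from the domain list above.
# An empty prefix means any path on that host is accepted (an assumption).
ALLOWED_HOSTS = {
    "news.bbc.co.uk": "",
    "www.bbc.co.uk": "/news",
    "www.bbc.com": "/news",
}


def clean_url(url: str) -> str:
    """Drop the query string and leave the rest of the URL unchanged."""
    return urlsplit(url)._replace(query="").geturl()


def is_bbc_news(url: str) -> bool:
    """Check whether a full URL (with scheme) falls under the listed domains."""
    parts = urlsplit(url)
    prefix = ALLOWED_HOSTS.get(parts.netloc.lower())
    if prefix is None:
        return False
    if prefix == "":
        return True
    # Require "/news" or "/news/..." so that paths like "/newsround" do not match.
    return parts.path == prefix or parts.path.startswith(prefix + "/")


urls = [
    "https://www.bbc.co.uk/news/uk-123?ocid=x",
    "https://www.bbc.co.uk/sport/football",
]
kept = [clean_url(u) for u in urls if is_bbc_news(u)]
print(kept)
```

Matching on a path boundary rather than a bare string prefix is a design choice made here for illustration; the original filter may have been looser.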
Article texts contain only the main content body, without bylines or metadata.
- No validation split in current version
- Original publication dates not available (FineWeb timestamps are crawl dates)
- Section/index pages not yet filtered out from article pages
- Regional news content explicitly excluded due to sparse coverage
- Relationship between news.bbc.co.uk and bbc.co.uk/news domains needs investigation
- Coverage may be incomplete compared to full BBC News archive
Users should be aware that this represents a subset of BBC News content which appears to be from around 2019 and earlier. For applications requiring comprehensive coverage or accurate publication dates, additional data sources should be considered.
- Potential expansion using fineweb dataset for more recent content
- Addition of publication dates through targeted crawling
- Filtering to distinguish between section pages and article pages
- Creation of validation split
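Until a validation split exists, one possible approach is a deterministic split keyed on the article URL, so that the same article always lands in the same split. The sketch below is only an assumption about how this might be done; the function name and the 5% default fraction are hypothetical and not part of the dataset.

```python
import hashlib


def assign_split(url: str, validation_fraction: float = 0.05) -> str:
    """Map a URL to 'train' or 'validation' using a stable hash of the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Turn the first 8 bytes into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "validation" if bucket < validation_fraction else "train"


print(assign_split("https://www.bbc.co.uk/news/uk-123"))
```

Because the assignment depends only on the URL, it stays stable across runs and machines, and it does not change if more articles are added later.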
Please cite the original FineWeb dataset when using this data. A reference to this dataset would be welcome but is not necessary, as I consider it a derivative work.
Louis Maddox (@permutans)