Sites with no articles in Sept-Dec 2018 #193

Closed
dongpng opened this issue May 2, 2019 · 1 comment
Labels
config Issue with site configurations

dongpng (Collaborator) commented May 2, 2019

Hi,

We used the following query to find websites in the database with no articles in our time period (Sept 1, 2018 - Dec 31, 2018):


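-- Date window for the search. The upper bound is compared with "<" below,
-- so articles published on 2018-12-31 itself fall outside the window.
DECLARE @MinDate DATE = '20180901',
        @MaxDate DATE = '20181231';

-- Sites that have no article with a publication date inside the window.
SELECT DISTINCT [site_name]
FROM [dbo].[articles_v5]
WHERE [site_name] NOT IN (
        SELECT [site_name]
        FROM [dbo].[articles_v5]
        WHERE [publication_datetime] >= @MinDate
          AND [publication_datetime] < @MaxDate);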

Most of these sites were dead or didn't have any content in the period. However, there are 3 sites that should have articles:

  • aljazeera.com
  • huffingtonpost.com
  • notallowedto.com

notallowedto.com has hidden pagination. Scraping could be done using https://notallowedto.com/news/page1, https://notallowedto.com/news/page2, https://notallowedto.com/news/page3
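
The page URLs above could be enumerated along these lines. This is only a minimal sketch: it assumes the numbering simply continues (page4, page5, ...), the names `BASE_URL` and `page_urls` are hypothetical, and no request is actually sent. A real scraper would also need a stop condition (for example the first missing or empty page), which is not known here.

```python
from itertools import count, islice
from typing import Iterator

# URL pattern taken from the three example pages; that it continues
# beyond page3 is an assumption.
BASE_URL = "https://notallowedto.com/news/page{n}"


def page_urls(start: int = 1) -> Iterator[str]:
    """Yield listing-page URLs page<start>, page<start+1>, ... without end."""
    for n in count(start):
        yield BASE_URL.format(n=n)


# Take only the first three URLs; the generator itself never terminates.
first_three = list(islice(page_urls(), 3))
print(first_three)
```

The generator is left unbounded on purpose, so the caller decides when to stop, e.g. with `islice` as above or by breaking out of a loop once a page returns no articles.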

jemrobinson added the config (Issue with site configurations) label May 17, 2019

edwardchalstrey1 (Collaborator) commented May 24, 2019

  • aljazeera.com: James has fixed the issue already
  • huffingtonpost.com: I think the problem is that the website's index pages only go back as far as page 17, which currently contains articles from 2019-05
  • notallowedto.com: see Problematic site: notallowedto.com #147
