Cannot crawl pressherald.com #258

Open
edwardchalstrey1 opened this issue Jun 11, 2019 · 0 comments
Labels
problematic-site Site is broken or structure has changed

Comments

@edwardchalstrey1
Collaborator

Site from list 1 #239 - PR

Unable to crawl the site: pages are fetched but no items are scraped, and the log shows various response errors (404s), e.g.:

2019-06-11 15:31:15 [scrapy.core.engine] INFO: Spider opened
2019-06-11 15:31:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-06-11 15:31:15 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-06-11 15:32:15 [scrapy.extensions.logstats] INFO: Crawled 371 pages (at 371 pages/min), scraped 0 items (at 0 items/min)
2019-06-11 15:33:15 [scrapy.extensions.logstats] INFO: Crawled 724 pages (at 353 pages/min), scraped 0 items (at 0 items/min)
2019-06-11 15:34:15 [scrapy.extensions.logstats] INFO: Crawled 1032 pages (at 308 pages/min), scraped 0 items (at 0 items/min)
2019-06-11 15:35:01 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://www.pressherald.com/2018/04/07/police-say-westbrook-armed-robbery-likely-linked-to-others/bquimby@pressherald.com>: HTTP status code is not handled or not allowed
2019-06-11 15:35:15 [scrapy.extensions.logstats] INFO: Crawled 1341 pages (at 309 pages/min), scraped 0 items (at 0 items/min)
2019-06-11 15:36:15 [scrapy.extensions.logstats] INFO: Crawled 1647 pages (at 306 pages/min), scraped 0 items (at 0 items/min)
2019-06-11 15:36:58 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://www.pressherald.com/2018/11/03/world-war-i-sacrifices-of-mainers-go-digital/digitalmaine.com>: HTTP status code is not handled or not allowed
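
Both 404 URLs above end in something that looks like an email address or a bare domain appended to the article path. One plausible reading (an assumption, not confirmed in this issue) is that the articles contain links such as `href="bquimby@pressherald.com"` with no `mailto:` or `https://` scheme, so they get resolved as relative paths. A minimal sketch of that behaviour using only the standard library is below; `is_suspect_href` and its regexes are hypothetical helpers for illustration, not part of this project or of Scrapy.

```python
import re
from urllib.parse import urljoin, urlparse

# Article page taken from the first 404 in the log above.
PAGE = ("https://www.pressherald.com/2018/04/07/"
        "police-say-westbrook-armed-robbery-likely-linked-to-others/")

# Hypothetical heuristics: a scheme-less email address or bare domain.
_BARE_EMAIL = re.compile(r"^[^/\s@]+@[^/\s@]+\.[A-Za-z]{2,}$")
_BARE_DOMAIN = re.compile(r"^(?:[A-Za-z0-9-]+\.)+(?:com|org|net|gov|edu)$")


def is_suspect_href(href: str) -> bool:
    """Return True if href looks like an email or domain missing its scheme."""
    href = href.strip()
    # Links with a scheme (https:, mailto:) or an explicit path are fine.
    if urlparse(href).scheme or href.startswith(("/", "#", "?")):
        return False
    return bool(_BARE_EMAIL.match(href) or _BARE_DOMAIN.match(href))


# A scheme-less href is treated as a relative path and joined to the page URL,
# which reproduces the shape of the 404 URL in the log.
joined = urljoin(PAGE, "bquimby@pressherald.com")
print(joined)
print(is_suspect_href("bquimby@pressherald.com"))
print(is_suspect_href("digitalmaine.com"))
```

If this reading is right, such links could be skipped before they are followed. Note that it would only account for the 404 lines; it does not by itself explain why 0 items are scraped from the pages that do load.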
edwardchalstrey1 added the problematic-site label Jun 11, 2019