Cannot crawl kansascity.com #255

Open
edwardchalstrey1 opened this issue Jun 11, 2019 · 0 comments
Labels
problematic-site Site is broken or structure has changed

Collaborator

edwardchalstrey1 commented Jun 11, 2019

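This site, from list 1 (#239), may have an anti-scraping measure in place: all three download attempts for the section page in the log below fail with ResponseNeverReceived (connection lost), and no page is crawled. See PR #256.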

2019-06-11 14:16:18 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: misinformation)
2019-06-11 14:16:18 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.7.2 (default, Dec 29 2018, 00:00:04) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b  26 Feb 2019), cryptography 2.6.1, Platform Darwin-18.2.0-x86_64-i386-64bit
2019-06-11 14:16:18 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'misinformation', 'CONCURRENT_REQUESTS': 8, 'FEED_EXPORT_ENCODING': 'utf-8', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'misinformation.spiders', 'SPIDER_MODULES': ['misinformation.spiders'], 'URLLENGTH_LIMIT': 850, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1464.0 Safari/537.36'}
2019-06-11 14:16:18 [scrapy.extensions.telnet] INFO: Telnet Password: 38df1f9601e13043
2019-06-11 14:16:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2019-06-11 14:16:19 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'misinformation.middlewares.JSLoadButtonMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'misinformation.middlewares.CloudFlareMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'misinformation.middlewares.DelayedRetryMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-06-11 14:16:19 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-06-11 14:16:19 [scrapy.middleware] INFO: Enabled item pipelines:
['misinformation.pipelines.ArticleJsonFileExporter']
2019-06-11 14:16:19 [scrapy.core.engine] INFO: Spider opened
2019-06-11 14:16:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-06-11 14:16:19 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-06-11 14:17:14 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.kansascity.com/news/politics-government/>
Traceback (most recent call last):
  File "/anaconda3/envs/misinfo-dev/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2019-06-11 14:17:14 [scrapy.core.engine] INFO: Closing spider (finished)
2019-06-11 14:17:14 [ScattergunSpider] INFO: Spider closed: kansascity.com (finished)
2019-06-11 14:17:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 3,
 'downloader/request_bytes': 924,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 6, 11, 13, 17, 14, 513341),
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'memusage/max': 77508608,
 'memusage/startup': 77504512,
 'retry/count': 2,
 'retry/max_reached': 2,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 2,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2019, 6, 11, 13, 16, 19, 834908)}
2019-06-11 14:17:14 [scrapy.core.engine] INFO: Spider closed (finished)
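
To make the stats dump above easier to read at a glance, here is a minimal sketch that summarizes a Scrapy stats dict and flags runs where no request ever received a response. The helper name `summarize_download_failures` is hypothetical and not part of this project. The stat keys are the ones printed in the log, plus `downloader/response_count`, which Scrapy records when a response arrives and which is absent from this run.

```python
def summarize_download_failures(stats):
    """Summarize download outcomes from a Scrapy stats dict.

    Hypothetical helper for illustration; not part of the misinformation project.
    """
    requests = stats.get("downloader/request_count", 0)
    # Absent from the dump above, so it defaults to 0 for this run.
    responses = stats.get("downloader/response_count", 0)
    exceptions = stats.get("downloader/exception_count", 0)

    # Collect per-exception-type counters, stripping the common key prefix.
    prefix = "downloader/exception_type_count/"
    exception_types = {
        key[len(prefix):]: value
        for key, value in stats.items()
        if key.startswith(prefix)
    }

    return {
        "requests": requests,
        "responses": responses,
        "exceptions": exceptions,
        "exception_types": exception_types,
        "retries_exhausted": stats.get("retry/max_reached", 0) > 0,
        # True when every request ended in an exception and none got a response.
        "no_response_at_all": (
            requests > 0 and responses == 0 and exceptions == requests
        ),
    }


# Values copied from the stats dump in this issue.
stats = {
    "downloader/exception_count": 3,
    "downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived": 3,
    "downloader/request_count": 3,
    "retry/count": 2,
    "retry/max_reached": 2,
}

summary = summarize_download_failures(stats)
print(summary)
```

For this run the summary reports three requests, zero responses, and exhausted retries. That means the connection was dropped before any HTTP response arrived, rather than the server returning an explicit status such as 403 or 429. This is consistent with the site blocking the crawler, but does not prove it.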
@edwardchalstrey1 edwardchalstrey1 added the problematic-site Site is broken or structure has changed label Jun 11, 2019