
Scrapy Crawler Spider Doesn't Follow Links

For this, I used the example in the Scrapy CrawlSpider documentation: http://doc.scrapy.org/en/latest/topics/spiders.html. I want to get links from a web page and follow them to parse a table with

Solution 1:

Scrapy is misinterpreting the content type of the start URL.

You can verify this by using scrapy shell:

$ scrapy shell 'http://www.euroleague.net/main'
2013-11-18 16:39:26+0900 [scrapy] INFO: Scrapy 0.21.0 started (bot: scrapybot)
...

AttributeError: 'Response' object has no attribute 'body_as_unicode'

See my previous answer about the missing body_as_unicode attribute. Note that the server does not set any Content-Type header.

CrawlSpider ignores non-HTML responses, so the responses are never processed and no links are followed.

I would suggest opening an issue on GitHub, as I think Scrapy should be able to handle this case transparently.

As a workaround, you could override the CrawlSpider parse method, create an HtmlResponse from the response object passed in, and pass that to the superclass parse method.

Solution 2:

Prepend "www" to the entries in allowed_domains.
