The spider has a built-in cache. It is helpful during development, when you need to scrape the same documents many times to check results and fix bugs. You can also crawl a whole website, store it in the cache, and then work only with the cache.
Keep in mind that if the website is large (millions of pages), working with the cache can be slower than working with the live website. The bottleneck is the disk I/O of the machine that hosts the cache storage.
Also keep in mind that the spider cache is very simple:
- it caches only GET requests
- it does not allow to differentiate documents with same URL but
- it does not support max-age and other cache headers
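To make these limitations concrete, here is a minimal sketch of a cache that behaves this way. It is not the spider's actual implementation: ``SimpleUrlCache`` and its ``save``/``load`` methods are hypothetical names used only for illustration. The key is the URL alone, only GET responses are stored, and entries never expire.

```python
class SimpleUrlCache:
    """Toy in-memory cache keyed by URL only (hypothetical, for illustration)."""

    def __init__(self):
        self._storage = {}

    def save(self, method, url, body):
        # Only GET responses are stored; any other method is skipped.
        if method.upper() != 'GET':
            return False
        # The URL is the whole key, so a later response for the same URL
        # overwrites the earlier one. No expiry time is recorded.
        self._storage[url] = body
        return True

    def load(self, method, url):
        # Non-GET requests never hit the cache.
        if method.upper() != 'GET':
            return None
        return self._storage.get(url)


cache = SimpleUrlCache()
cache.save('GET', 'http://example.com/', b'<html>first</html>')
cache.save('POST', 'http://example.com/form', b'<html>result</html>')
```

In this sketch the GET response can be loaded back at any later time, while the POST response was never stored.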
Spider Cache Backends
You can choose which storage backend to use for the cache: MongoDB, MySQL, or PostgreSQL.
.. code:: python

    bot = ExampleSpider()
    bot.setup_cache(backend='mongo', database='some-database')
    bot.run()

In this example the spider is configured to use MongoDB as the cache storage. The database name is "some-database" and the collection name is "cache".
All arguments except backend, database and use_compression are passed to the database connection constructor. This lets you set the host name, port, authorization arguments, and other connection options, in addition to the database name.
Example of a custom host name and port for a MongoDB connection:
.. code:: python

    bot = SomeSpider()
    bot.setup_cache(backend='mongo', port=7777, host='mongo.localhost')

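The argument routing described above can be sketched as follows. This is only an illustration under stated assumptions, not the library's code: ``split_cache_options`` is a hypothetical helper that separates the three cache-level arguments from everything else, which would be forwarded to the database connection constructor.

```python
def split_cache_options(**kwargs):
    """Separate cache-level options from connection options (hypothetical helper)."""
    # These three names are the ones the text says are NOT forwarded
    # to the database connection constructor.
    cache_keys = ('backend', 'database', 'use_compression')
    cache_options = {k: v for k, v in kwargs.items() if k in cache_keys}
    connection_options = {k: v for k, v in kwargs.items() if k not in cache_keys}
    return cache_options, connection_options


cache_options, connection_options = split_cache_options(
    backend='mongo',
    database='some-database',
    host='mongo.localhost',
    port=7777,
)
```

Here ``connection_options`` ends up holding only ``host`` and ``port``, which is what the connection constructor would receive.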
Cache compression is enabled by default: every document placed in the cache is compressed with the gzip library. Compression reduces the disk space required to store the cache at the cost of a slightly higher CPU load.
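The trade-off can be seen with Python's standard ``gzip`` module. The sketch below is independent of the spider and uses a made-up, highly repetitive HTML document, so the size reduction it shows is not representative of real pages. The ``use_compression`` argument mentioned earlier presumably toggles this behaviour, but that is an assumption here.

```python
import gzip

# A made-up, repetitive HTML document standing in for a cached page.
html = b'<html><body>' + b'<p>Hello, spider cache!</p>' * 200 + b'</body></html>'

# Compress before storing, decompress after loading.
compressed = gzip.compress(html)
restored = gzip.decompress(compressed)

# Print original and compressed sizes in bytes.
print(len(html), len(compressed))
```

The round trip is lossless, and the CPU cost is paid once on each write and once on each read.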