python - Scrapy - parse a page to extract items - then follow and store item url contents


I have a question about how to do a certain thing in Scrapy. I have a spider that crawls listing pages of items. Every time a listing page with items is found, the parse_item() callback is called, which extracts the item data and yields the items. So far so good, it works great.

But each item has, among other data, a URL with more details on that item. I want to follow that URL and store the fetched contents of the item's URL in an item field (url_contents).
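For reference, that means the item needs a field for those contents; a minimal sketch of the item definition (the real ExampleItem has more fields):

from scrapy.item import Item, Field

class ExampleItem(Item):
    title = Field()
    url_contents = Field()  # will hold the fetched body of the item's detail URL
    # ... other fields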

And I'm not sure how to organize the code to achieve that, since the two links (the listings link and the particular item link) are followed differently, with callbacks called at different times, yet they have to be correlated in the same item processing.

My code so far looks like this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
# ExampleItem and ExampleLoader are the project's Item and ItemLoader classes

class MySpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/?q=example",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='),
                               restrict_xpaths='//div[@class="pagination"]'),
             callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('item\/detail', )), follow=False),
    )

    def parse_item(self, response):
        main_selector = HtmlXPathSelector(response)
        xpath = '//h2[@class="title"]'

        sub_selectors = main_selector.select(xpath)

        for sel in sub_selectors:
            item = ExampleItem()
            l = ExampleLoader(item=item, selector=sel)
            l.add_xpath('title', 'a[@title]/@title')
            # ...
            yield l.load_item()

After some testing and thinking, I found a solution that works for me. The idea is to use the first rule, which gives you the listings of items, and, very importantly, to add follow=True to that rule.
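So the rules end up roughly like this (a sketch based on the rules above; the second rule can be dropped, since the requests for the item detail pages are now made from parse_item() itself):

rules = (
    # Same link extractor as before, but with follow=True so pagination
    # links keep being followed while parse_item() handles each listing page.
    Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='),
                           restrict_xpaths='//div[@class="pagination"]'),
         callback='parse_item', follow=True),
)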

And in parse_item() I yield a Request instead of an item, after loading the item. The request is for the item's detail URL, and I have to send the loaded item along with that request to its callback. The callback does its job with the response and yields the item there.

So the end of parse_item() looks like this:

itemloaded = l.load_item()

# fill url contents
url = sel.select(item_url_xpath).extract()[0]
request = Request(url, callback = lambda r: self.parse_url_contents(r))
request.meta['item'] = itemloaded

yield request

And parse_url_contents() looks like this:

def parse_url_contents(self, response):
    item = response.request.meta['item']
    item['url_contents'] = response.body
    yield item
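Putting it all together, the spider ends up roughly like this (a condensed sketch of the pieces above; Request is scrapy.http.Request, item_url_xpath stands for the actual XPath of the detail link, and passing self.parse_url_contents directly is equivalent to the lambda):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
# ExampleItem and ExampleLoader as before

class MySpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/?q=example"]

    item_url_xpath = 'a/@href'  # placeholder: XPath of the item's detail link

    rules = (
        Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='),
                               restrict_xpaths='//div[@class="pagination"]'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        main_selector = HtmlXPathSelector(response)
        for sel in main_selector.select('//h2[@class="title"]'):
            l = ExampleLoader(item=ExampleItem(), selector=sel)
            l.add_xpath('title', 'a[@title]/@title')
            # ... other fields ...
            item = l.load_item()

            # follow the item's detail URL, carrying the loaded item along
            url = sel.select(self.item_url_xpath).extract()[0]
            request = Request(url, callback=self.parse_url_contents)
            request.meta['item'] = item
            yield request

    def parse_url_contents(self, response):
        item = response.request.meta['item']
        item['url_contents'] = response.body
        yield item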

If anyone has another (better) approach, let me know.

Stefan

