I have a question about how to do something in Scrapy. I have a spider that crawls listing pages of items. Every time a listing page with items is found, the parse_item() callback is called to extract the items' data and yield the items. So far so good, everything works great.
But each item has, among other data, a URL with more details on that item. I want to follow that URL and store the fetched contents of the item's URL in another item field (url_contents).
I'm not sure how to organize the code to achieve that, since the two links (the listings link and one particular item link) are followed differently, with callbacks called at different times, yet I have to correlate them in the same item processing.
My code so far looks like this:
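    # Legacy (pre-1.0) Scrapy import paths, matching the classes used below.
    # ExampleItem and ExampleLoader come from the project's own modules.
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector

    class MySpider(CrawlSpider):
        name = "example.com"
        allowed_domains = ["example.com"]
        start_urls = [
            "http://www.example.com/?q=example",
        ]

        rules = (
            # Pagination links: each listing page is handed to parse_item().
            Rule(SgmlLinkExtractor(allow=(r'example\.com', 'start='),
                                   deny=('sort=',),
                                   restrict_xpaths='//div[@class="pagination"]'),
                 callback='parse_item'),
            # Item detail links.
            Rule(SgmlLinkExtractor(allow=(r'item\/detail',)), follow=False),
        )

        def parse_item(self, response):
            main_selector = HtmlXPathSelector(response)
            xpath = '//h2[@class="title"]'
            sub_selectors = main_selector.select(xpath)

            # One item per title heading on the listing page.
            for sel in sub_selectors:
                item = ExampleItem()
                l = ExampleLoader(item=item, selector=sel)
                l.add_xpath('title', 'a[@title]/@title')
                ......
                yield l.load_item()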
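After some testing and thinking, I found a solution that works for me. The idea is to use only the first rule, the one that gives you the listings of items, and also, very importantly, to add follow=True to that rule, because a rule that has a callback does not follow links by default.
In parse_item() you have to yield a request instead of an item, but only after you load the item. The request goes to the item's detail URL, and you have to pass the loaded item along to that request's callback. There you do your work with the response, and that is where you yield the item.
So the end of parse_item() will look like this: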
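    # Needs: from scrapy.http import Request
    itemloaded = l.load_item()

    # fill url contents
    url = sel.select(item_url_xpath).extract()[0]
    # The bound method can be passed directly; no lambda wrapper is needed.
    request = Request(url, callback=self.parse_url_contents)
    # Attach the loaded item so the detail-page callback can finish it.
    request.meta['item'] = itemloaded

    yield request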
And parse_url_contents() will look like this:
    def parse_url_contents(self, response):
        # Pick up the item that parse_item() attached to the request.
        item = response.request.meta['item']
        item['url_contents'] = response.body
        yield item
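To make the flow easier to see, here is a small stand-alone sketch of the same idea that runs without Scrapy. ToyRequest, ToyResponse, crawl() and the FAKE_SITE pages are stand-ins invented for this example, not Scrapy classes, and the toy engine only assumes that a callback can yield either a request or a finished item. It shows the half-built item travelling in the request's meta dict and being completed in the second callback.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List

# Stand-in for the site: URL -> body. These pages are made up for the demo.
FAKE_SITE: Dict[str, str] = {
    "http://www.example.com/?q=example": "listing",
    "http://www.example.com/item/detail/1": "details of item 1",
    "http://www.example.com/item/detail/2": "details of item 2",
}


@dataclass
class ToyRequest:
    """Toy counterpart of a request: a URL, a callback, and a meta dict."""
    url: str
    callback: Callable[["ToyResponse"], Iterator]
    meta: dict = field(default_factory=dict)


@dataclass
class ToyResponse:
    """Toy counterpart of a response; keeps a reference to its request."""
    body: str
    request: ToyRequest


def parse_item(response: ToyResponse) -> Iterator[ToyRequest]:
    # Pretend the listing page contained two items with detail links.
    for n in (1, 2):
        item = {"title": f"item {n}"}
        url = f"http://www.example.com/item/detail/{n}"
        request = ToyRequest(url, callback=parse_url_contents)
        request.meta["item"] = item  # the half-built item rides along
        yield request


def parse_url_contents(response: ToyResponse) -> Iterator[dict]:
    item = response.request.meta["item"]  # same dict that parse_item built
    item["url_contents"] = response.body
    yield item


def crawl(start_url: str) -> List[dict]:
    """Toy engine: 'download' each request, run its callback, collect items."""
    pending = [ToyRequest(start_url, callback=parse_item)]
    items = []
    while pending:
        request = pending.pop(0)
        response = ToyResponse(FAKE_SITE[request.url], request)
        for result in request.callback(response):
            if isinstance(result, ToyRequest):
                pending.append(result)  # follow it later
            else:
                items.append(result)    # a finished item
    return items
```

Calling crawl("http://www.example.com/?q=example") returns two items, each with both its title and its url_contents filled in, even though the two fields were set in different callbacks at different times.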
If anyone has another (better) approach, let us know.
Stefan