I'm trying to find the first 30 TED videos (name of the video and URL) using the following BeautifulSoup script:
import urllib2
from BeautifulSoup import BeautifulSoup

total_pages = 3
page_count = 1
count = 1
url = 'http://www.ted.com/talks?page='

while page_count < total_pages:
    page = urllib2.urlopen("%s%d") %(url, page_count)
    soup = BeautifulSoup(page)
    link = soup.findAll(lambda tag: tag.name == 'a' and tag.findParent('dt', 'thumbnail'))
    outfile = open("test.html", "w")
    print >> outfile, """<html>
    <head>
    <title>TED talks index</title>
    </head>
    <body>
    <br><br><center>
    <table cellpadding=15 cellspacing=0 style='border:1px solid #000;'>"""
    print >> outfile, "<tr><th style='border-bottom:2px solid #e16543; border-right:1px solid #000;'><b>###</b></th><th style='border-bottom:2px solid #e16543; border-right:1px solid #000;'>Name</th><th style='border-bottom:2px solid #e16543;'>URL</th></tr>"
    ted_link = 'http://www.ted.com/'
    for anchor in link:
        print >> outfile, "<tr style='border-bottom:1px solid #000;'><td style='border-right:1px solid #000;'>%s</td><td style='border-right:1px solid #000;'>%s</td><td>http://www.ted.com%s</td></tr>" % (count, anchor['title'], anchor['href'])
        count = count + 1
    print >> outfile, """</table>
    </body>
    </html>"""
    page_count = page_count + 1
The code looks alright to me, minus two things:
count doesn't seem to get incremented. The script goes through and finds only the first page's content, i.e. the first ten videos, not all thirty. Why?
This bit of code gives me a lot of errors, and I don't know how else to implement what I want here logically (the part with urlopen("%s%d")):

Code:
total_pages = 3
page_count = 1
count = 1
url = 'http://www.ted.com/talks?page='

while page_count < total_pages:
    page = urllib2.urlopen("%s%d") %(url, page_count)
First, simplify the loop and eliminate a few variables that amount to boilerplate cruft in this case:
for pagenum in xrange(1, 4):  # the 4 is annoying; write 3 + 1 if you like.
    url = "http://www.ted.com/talks?page=%d" % pagenum
    # do stuff with url
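As for the errors you were getting from urlopen("%s%d"): the literal string "%s%d" isn't a URL, and the % formatting was being applied to whatever urlopen returned rather than to the string itself. If you did want to keep your original variables, the formatting has to happen inside the call, roughly:

page = urllib2.urlopen("%s%d" % (url, page_count))  # build the URL string first, then open it

But the xrange version above avoids the issue entirely.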
Next, let's open the output file outside of the loop instead of reopening it on each iteration. That's why you only saw 10 results: they were talks 11-20, not the first ten as you thought. (It would have been 21-30, except you looped on page_count < total_pages and so only processed the first two pages.)
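To see the reopening problem in isolation, here's a minimal illustration (demo.txt is just a throwaway name): mode "w" truncates the file on open, so each pass wipes whatever the previous one wrote.

for pagenum in (1, 2):
    outfile = open("demo.txt", "w")   # "w" truncates demo.txt
    print >> outfile, "results for page %d" % pagenum
    outfile.close()
# demo.txt ends up containing only "results for page 2"

Your script does the same thing with test.html, once per page.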
And gather all the links at once, then write the output afterwards. I've stripped out the HTML styling, which makes the code easier to follow; use CSS instead, possibly in an inline <style> element, or add it back if you like.
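For example, just as a sketch (the rules below are my guess at reproducing your original borders, and out is the open output file from the rewrite that follows), you could print a small <style> block into the head:

# Sketch only: the original inline border styling redone as CSS.
# Print this right after the <title> line, inside <head>.
print >>out, """<style type="text/css">
table { border: 1px solid #000; border-collapse: collapse; }
th, td { border-right: 1px solid #000; padding: 15px; }
th { border-bottom: 2px solid #e16543; }
tr { border-bottom: 1px solid #000; }
</style>"""

Here's the full rewrite: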
import urllib2
from cgi import escape  # important!
from BeautifulSoup import BeautifulSoup

def is_talk_anchor(tag):
    return tag.name == "a" and tag.findParent("dt", "thumbnail")

links = []
for pagenum in xrange(1, 4):
    soup = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks?page=%d" % pagenum))
    links.extend(soup.findAll(is_talk_anchor))

out = open("test.html", "w")

print >>out, """<html><head><title>TED talks index</title></head>
<body>
<table>
<tr><th>#</th><th>Name</th><th>URL</th></tr>"""

for x, a in enumerate(links):
    print >>out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td></tr>" % (x + 1, escape(a["title"]), escape(a["href"]))

print >>out, "</table>"

# Or, as an ordered list:
print >>out, "<ol>"
for a in links:
    print >>out, """<li><a href="http://www.ted.com%s">%s</a></li>""" % (escape(a["href"], True), escape(a["title"]))
print >>out, "</ol>"

print >>out, "</body></html>"
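One last small thing: the rewrite leaves out open and relies on the interpreter to flush it at exit. If you extend the script or run it from an interactive session, close the file explicitly once you're done writing:

out.close()  # flush buffered output so test.html is complete on disk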