python - Beautiful Soup and While Statement


I'm trying to find the first 30 TED videos (name of the video and its URL) using the following BeautifulSoup script:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    total_pages = 3
    page_count = 1
    count = 1

    url = 'http://www.ted.com/talks?page='

    while page_count < total_pages:
        page = urllib2.urlopen("%s%d") % (url, page_count)
        soup = BeautifulSoup(page)
        link = soup.findAll(lambda tag: tag.name == 'a' and tag.findParent('dt', 'thumbnail'))
        outfile = open("test.html", "w")
        print >> outfile, """<head>
               <head>
                       <title>TED Talks Index</title>
               </head>

               <body>

               <br><br><center>

               <table cellpadding=15 cellspacing=0 style='border:1px solid #000;'>"""
        print >> outfile, "<tr><th style='border-bottom:2px solid #e16543; border-right:1px solid #000;'><b>###</b></th><th style='border-bottom:2px solid #e16543; border-right:1px solid #000;'>Name</th><th style='border-bottom:2px solid #e16543;'>URL</th></tr>"
        ted_link = 'http://www.ted.com/'
        for anchor in link:
            print >> outfile, "<tr style='border-bottom:1px solid #000;'><td style='border-right:1px solid #000;'>%s</td><td style='border-right:1px solid #000;'>%s</td><td>http://www.ted.com%s</td></tr>" % (count, anchor['title'], anchor['href'])
        count = count + 1
        print >> outfile, """</table>
                       </body>
                       </html>"""
        page_count = page_count + 1

The code looks all right minus two things:

  1. count doesn't seem to be incremented. The script goes through and finds only the first page's content, i.e. the first ten videos, not thirty. Why?

  2. This bit of code gives me a lot of errors. I don't know how else to implement what I want here logically (with urlopen("%s%d")):

Code:

    total_pages = 3
    page_count = 1
    count = 1

    url = 'http://www.ted.com/talks?page='

    while page_count < total_pages:
        page = urllib2.urlopen("%s%d") % (url, page_count)

First, simplify the loop and eliminate a few variables, which amount to boilerplate cruft in this case:

    for pagenum in xrange(1, 4):  # The 4 is annoying; you can write 3 + 1 if you like.
      url = "http://www.ted.com/talks?page=%d" % pagenum
      # do stuff with url
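As a sketch of why this rewrite also cures the urlopen errors: in the original, urllib2.urlopen("%s%d") tried to fetch the literal string "%s%d" as a URL, and the % formatting was then applied to whatever urlopen returned. The string must be formatted first, and only the finished URL handed to urlopen. (urlopen itself is commented out below so the sketch runs without network access; the variable names are illustrative.)

```python
url = 'http://www.ted.com/talks?page='
page_count = 2

# Wrong: urlopen would be called on the literal URL "%s%d" and fail,
# and the % operator would then hit urlopen's return value.
# page = urllib2.urlopen("%s%d") % (url, page_count)

# Right: format first, then open the finished URL.
full_url = "%s%d" % (url, page_count)
print(full_url)  # http://www.ted.com/talks?page=2
# page = urllib2.urlopen(full_url)

# Aside: xrange/range upper bounds are exclusive, which is why the
# loop above needs 4 to cover pages 1 through 3.
pages = list(range(1, 4))
print(pages)  # [1, 2, 3]
```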

But let's open the file outside of the loop, instead of reopening it on each iteration. That's why you saw only 10 results: they were talks 11-20, not the first ten as you thought. (It would've been 21-30, except you looped on page_count < total_pages, so you only processed the first two pages.)
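The truncation behaviour is easy to demonstrate: mode "w" discards a file's previous contents on every open(), so only the last iteration's output survives. A minimal sketch using a throwaway temp file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.html")

# Reopening in "w" mode truncates the file each time through the loop,
# so only the final page's rows remain afterwards.
for pagenum in (1, 2, 3):
    outfile = open(path, "w")
    outfile.write("rows for page %d\n" % pagenum)
    outfile.close()

print(open(path).read())  # rows for page 3
```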

And let's gather all the links at once, then write the output afterwards. I've stripped out the HTML styling, which makes the code easier to follow; instead, use CSS, possibly in an inline <style> element, or add the styling back if you like.

    import urllib2
    from cgi import escape  # Important!
    from BeautifulSoup import BeautifulSoup

    def is_talk_anchor(tag):
      return tag.name == "a" and tag.findParent("dt", "thumbnail")

    links = []
    for pagenum in xrange(1, 4):
      soup = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks?page=%d" % pagenum))
      links.extend(soup.findAll(is_talk_anchor))

    out = open("test.html", "w")

    print >>out, """<html><head><title>TED Talks Index</title></head>
    <body>
    <table>
    <tr><th>#</th><th>Name</th><th>URL</th></tr>"""

    for x, a in enumerate(links):
      print >>out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td></tr>" % (x + 1, escape(a["title"]), escape(a["href"]))

    print >>out, "</table>"

    # Or, as an ordered list:
    print >>out, "<ol>"
    for a in links:
      print >>out, """<li><a href="http://www.ted.com%s">%s</a></li>""" % (escape(a["href"], True), escape(a["title"]))
    print >>out, "</ol>"

    print >>out, "</body></html>"
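A note on why that from cgi import escape is marked "Important!": talk titles and hrefs can contain characters like &, <, or quotes, which would corrupt the generated HTML if written raw. cgi.escape is the Python 2 spelling; the sketch below uses html.escape, its Python 3 counterpart, which escapes quotes by default where cgi.escape needed escape(s, True). The sample title and href are made up for illustration.

```python
from html import escape  # Python 3 home of what cgi.escape did in Python 2

title = 'Tom & Jerry: <live>'
href = '/talks/foo?a=1&b=2'

# Escaped text is safe to embed in element content...
print(escape(title))  # Tom &amp; Jerry: &lt;live&gt;

# ...and, with quotes escaped too, in attribute values.
print('<a href="http://www.ted.com%s">%s</a>' % (escape(href), escape(title)))
```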
