I like to use Beautiful Soup in combination with urllib2 to parse HTML from Python scripts. The problem is that I spend about 30 minutes relearning how to use it every time I start a new project. So, for my own use (and maybe yours), here are my quick syntax tips.
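I always start off the same way: after the imports, two lines of code to snag and objectify the HTML:
import urllib2
from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 module name; with version 4 use: from bs4 import BeautifulSoup
html = urllib2.urlopen("http://hackaday.com/comments/feed").read()
soup = BeautifulSoup(html)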
From there it’s a matter of working with the ‘soup’ data object. This one gets an RSS feed of comments. They are partitioned into <item> tags, which you can traverse like this:
soup('item')[0]
Here soup('item') is a list, and [0] picks out one entry from it (this is item 0). But you can also iterate through the list using:
for item in soup('item'):
    print(item)  # each item is its own little tree
From there just walk through the tree hierarchy. Here’s how you can get the publish date (the string surrounded by <pubdate> tags) for the item. Notice that you need to index the pubdate in order to access its string data:
soup('item')[0]('pubdate')[0].string
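To try the indexing without hitting the network, here's a minimal sketch that runs the same calls against a tiny made-up feed. It assumes Beautiful Soup 4 (the bs4 package) and Python's built-in html.parser, which lowercases tag names. The SAMPLE_FEED string and everything in it are invented for illustration, not real feed data.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded feed (made-up data).
SAMPLE_FEED = """
<rss><channel>
<item><description>First comment</description><pubDate>Mon, 01 Jan 2024 10:00:00 +0000</pubDate></item>
<item><description>Second comment</description><pubDate>Tue, 02 Jan 2024 11:30:00 +0000</pubDate></item>
</channel></rss>
"""

# html.parser lowercases tag names, so <pubDate> is matched as 'pubdate'.
soup = BeautifulSoup(SAMPLE_FEED, "html.parser")

# Calling a tag returns a list of matches; [0] picks the first one.
first_date = soup('item')[0]('pubdate')[0].string

# The same lookup repeated for every item in the feed.
all_dates = [item('pubdate')[0].string for item in soup('item')]

print(first_date)
print(all_dates)
```

Each call like soup('item') hands back a list, which is why every step down the tree needs its own index before you can reach .string.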
The part that always confuses me is the need for the index. It identifies which tag you’re accessing in case there are multiple matches in this part of the tree. You can get the number of tags found by wrapping the call in the len function:
len(soup('item'))
For this feed it should always return 15, because that’s the number of comments WordPress is set to publish in the RSS feed.
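Since indexing an empty result raises an IndexError, checking the count before indexing can save some head scratching. The following is a rough sketch, assuming Beautiful Soup 4 (bs4) with html.parser. The helper name first_string is hypothetical, and the sample markup is made up, with the second item deliberately missing its date.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second item has no pubDate on purpose.
SAMPLE_FEED = """
<item><pubDate>Mon, 01 Jan 2024 10:00:00 +0000</pubDate></item>
<item><description>no date here</description></item>
"""

soup = BeautifulSoup(SAMPLE_FEED, "html.parser")


def first_string(tag, name):
    """Return the string of the first <name> tag under `tag`, or None if no such tag exists."""
    matches = tag(name)
    if len(matches) == 0:
        return None  # nothing to index, so skip the [0]
    return matches[0].string


item_count = len(soup('item'))
dates = [first_string(item, 'pubdate') for item in soup('item')]

print(item_count)
print(dates)
```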
There are other ways to do this using soup.findAll, but I find this one usually works the best.
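For comparison, here's a small sketch of the explicit method next to the call shortcut. It assumes Beautiful Soup 4 (bs4), where the method is spelled find_all (findAll is the Beautiful Soup 3 name), and it uses invented sample markup rather than the real feed.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
SAMPLE_FEED = "<item><pubDate>one</pubDate></item><item><pubDate>two</pubDate></item>"
soup = BeautifulSoup(SAMPLE_FEED, "html.parser")

shortcut = soup('item')            # calling the object directly
explicit = soup.find_all('item')   # the same search, spelled out

same_results = shortcut == explicit

# find() returns only the first match, so no index is needed.
first_date = soup.find('item').find('pubdate').string

print(same_results)
print(first_date)
```

The shortcut saves typing, while find() trades the list for a single tag when only the first match matters.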