Skip to content Skip to sidebar Skip to footer

Parsing Web Page's Search Results With Python

I recently started working on a program in python which allows the user to conjugate any verb easily. To do this, I am using the urllib module to open the corresponding conjugation

Solution 1:

When I wrote parsers I've had problems with bs, in some cases, it didn't find that found lxml and vice versa, because of broken html. Try to use lxml.html.

Solution 2:

Your problem may be with encoding. I think that bs4 works with utf-8 and you have a different encoding set on your machine as default(an encoding that contains spanish letters). So urllib requests the page in your default encoding,thats okay so data is there in the source, it even prints out okay, but when you pass it to utf-8 based bs4 that characters are lost. Try looking for setting a different encoding in bs4 and if possible set it to your default. This is just a guess though, take it easy.

I recommend using regular expressions. I have used them for all my web crawlers. If this is usable for you depends on the dynamicity of the website. But that problem is there even when you use bs4. You just write all your re manually and let it do the magic. You would have to work with the bs4 similiar way when looking foor information you want.

Post a Comment for "Parsing Web Page's Search Results With Python"