Apostrophes Are Printing Out As â\x80\x99

import requests
from bs4 import BeautifulSoup
import re

source_url = requests.get('http://www.nytimes.com/pages/business/index.html')
div_classes = {'class': ['ledeStory', 'story

Solution 1:

In ISO 8859-1 and related code sets (there are many of them), â has code point 0xE2, while 0x80 and 0x99 fall in the non-printing C1 control range, which is why Python shows them as \x80 and \x99. When you interpret the three bytes 0xE2, 0x80, 0x99 as a UTF-8 encoding, the character is U+2019, RIGHT SINGLE QUOTATION MARK (which is ’ or &rsquo; in HTML, as distinct from ' or &apos;; you may or may not be able to spot the difference).
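The byte-level story above can be checked in a few lines of Python 3. This is only a minimal sketch; the variable names are illustrative and do not come from the question's code.

```python
# The three bytes that UTF-8 uses for U+2019, RIGHT SINGLE QUOTATION MARK.
raw = b"\xe2\x80\x99"

# Decoded correctly as UTF-8: one character.
as_utf8 = raw.decode("utf-8")
print(len(as_utf8), hex(ord(as_utf8)))    # 1 0x2019

# Decoded wrongly as ISO 8859-1: three characters, one per byte.
as_latin1 = raw.decode("latin-1")
print(len(as_latin1), ascii(as_latin1))   # 3 '\xe2\x80\x99'

# The first of those three characters is U+00E2, i.e. the letter a-circumflex.
print(as_latin1[0] == "\u00e2")           # True
```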

I see a few possible sources for your difficulties, any one or more of which could apply:

  1. Your terminal is not set up to interpret UTF-8.
  2. Your source code should use ' (U+0027, APOSTROPHE).
  3. You're using Python 2.x rather than Python 3.x and it is having issues because of the use of Unicode (UTF-8). Against this (as Cory Madden pointed out), the code ends with print(h4), which is Python 3 syntax, so this probably isn't the issue.

It may be simplest to change the quotation mark into an ASCII apostrophe.
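If the text comes from a scraped page rather than your own source file, the same idea can be applied after decoding. The sketch below is one possible way to do it, not part of the original answer: the helper name asciify_quotes and the choice of which characters to fold are assumptions.

```python
# Map common typographic quotes to their ASCII equivalents.
# Which characters to fold is an assumption; extend the table as needed.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
})


def asciify_quotes(text: str) -> str:
    """Return text with curly quotes replaced by ASCII ' and "."""
    return text.translate(QUOTE_MAP)


print(asciify_quotes("It\u2019s a \u201ctest\u201d"))   # It's a "test"
```

This only helps once the text has been decoded correctly. Applied to a string that still contains â\x80\x99, it changes nothing, because those are three different characters.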

On the other hand, if you are analyzing HTML from elsewhere, you may have to consider how your script is going to handle UTF-8. Using quote marks from the Unicode U+20xx range is a very common choice; maybe your scraper needs to handle it?

Solution 2:

I have come across the same problem while scraping data with requests, then parsing it with BeautifulSoup.

The following solution works well for me (r is the response object returned by requests.get):

soup = BeautifulSoup(r.content.decode('utf-8'),"lxml")

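If this doesn't work, appending .encode('latin1').decode('utf-8') to the string returned by .get_text() or .text also solves the issue, because it turns the mis-decoded characters back into the original bytes and then decodes those bytes as UTF-8.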
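The round trip can be wrapped in a small helper so that text which was decoded correctly in the first place is left alone. This is a hedged sketch: fix_mojibake is a hypothetical name, and it assumes the damage came from reading UTF-8 bytes as Latin-1, as in the question.

```python
def fix_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-Latin-1 mix-up; return text unchanged if that fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not Latin-1 mojibake (or only partly damaged): leave the text alone.
        return text


broken = "It\u00e2\u0080\u0099s"    # what a wrong Latin-1 decode produces
print(fix_mojibake(broken) == "It\u2019s")        # True
print(fix_mojibake("It\u2019s") == "It\u2019s")   # True: correct text is untouched
```

Falling back to the original string is a design choice. An alternative is to let the exception propagate so that unexpected encodings are noticed early.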
