Why Is requests.get() Returning the Wrong Page Content?
Solution 1:
Please try this.
Python code:
from bs4 import BeautifulSoup
import requests
import re

upc_codes = ['USC4215', 'USC4225', 'USC12050']

def retrunh1(upc):
    payload = {'search': upc}
    # The product page returns a JavaScript stub that redirects the
    # browser; pull the redirect target out of it with a regex.
    r = requests.get('https://pbejobbers.com/product', params=payload)
    matches = re.search(r'document\.location\.href="(.*)=1";', r.text, re.M | re.S)
    url = matches[1]
    # Follow the redirect chain manually.
    response = requests.get(url)
    for resp in response.history:
        r = requests.post(resp.headers['Location'])
    soup = BeautifulSoup(r.content, 'html.parser')
    print(soup.prettify())

if __name__ == '__main__':
    for upc in upc_codes:
        retrunh1(upc)
Output:
<div class="page-area-container"><div class="middlebar"><div class="middlebar__left"><a class="logo" href="/"><img alt="PBE Jobbers" class="logo-img" src="/bundles/pjfrontend/pbejobbers/images/logo/pbe-logo.svg?version=9d4c5d60"/></a></div>
...
</div>
...
</div>
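The key step in the code above is the regular expression that extracts the redirect target from the interstitial JavaScript stub the site returns. A minimal, self-contained illustration — the response body below is made up (including the `no_html` parameter) just to match the shape of the stub:

```python
import re

# Hypothetical interstitial response body, shaped like the one the
# site returns before redirecting the browser.
body = 'document.location.href="https://pbejobbers.com/product/search?search=USC4215&no_html=1";'

# Capture everything up to the trailing =1, as in the solution above.
matches = re.search(r'document\.location\.href="(.*)=1";', body, re.M | re.S)
url = matches[1]
print(url)  # the redirect URL, minus the trailing =1
```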
Solution 2:
The JavaScript probably populates the HTML portion of the page dynamically once the browser starts executing it, so urllib can't download the complete source.

Your Python script needs to use a headless-browser framework such as Selenium to load the page as a browser would, and then extract what you need.

As others mentioned, please do not violate the site's terms of service, especially if the data is private or behind a login page.
Solution 3:
When I manually search for USC4215, the URL is https://pbejobbers.com/product/search?search=USC4215&_rand=0.35863039778309025

The website appends a random secret _rand to deter robot web-crawling, so you need to make the request with a valid random secret to receive a response.

In practice, the secret is usually generated together with a set of cookies. If you open Inspect ==> Network ==> Doc and press Ctrl + R to refresh the page, you can see the network traffic as you make another request — precisely what your HTTP request and response contents are.
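As a sketch of what the browser does, the search URL with its _rand value can be reproduced with the standard library alone (the parameter names are taken from the URL observed above):

```python
import random
from urllib.parse import urlencode

# Build the search URL the way the browser does, appending a fresh
# random float as the _rand parameter (names from the observed URL).
upc = 'USC4215'
params = urlencode({'search': upc, '_rand': random.random()})
url = f'https://pbejobbers.com/product/search?{params}'
print(url)
```

Note that, as described above, the server likely only accepts the value together with the matching cookies, so requesting this URL without the right cookie jar may still not return the real page.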