Why Is requests.get() Returning the Wrong Page Content?
Solution 1:
Please try this.
Python code:
from bs4 import BeautifulSoup
import requests
import re

upc_codes = ['USC4215', 'USC4225', 'USC12050']

def retrunh1(upc):
    payload = {'search': upc}
    # The product page returns a JavaScript stub that redirects the
    # browser; pull the redirect target out of it with a regex.
    r = requests.get('https://pbejobbers.com/product', params=payload)
    matches = re.search(r'document\.location\.href="(.*)=1";', r.text, re.M | re.S)
    url = matches[1]
    # Follow the redirect chain manually.
    response = requests.get(url)
    for resp in response.history:
        r = requests.post(resp.headers['Location'])
    soup = BeautifulSoup(r.content, 'html.parser')
    print(soup.prettify())

if __name__ == '__main__':
    for upc in upc_codes:
        retrunh1(upc)
Output:
<div class="page-area-container"><div class="middlebar"><div class="middlebar__left"><a class="logo" href="/"><img alt="PBE Jobbers" class="logo-img" src="/bundles/pjfrontend/pbejobbers/images/logo/pbe-logo.svg?version=9d4c5d60"/></a></div>
...
</div>
...
</div>
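The key step in the code above is the regular expression that extracts the redirect target from the interstitial JavaScript stub the site returns. A minimal, self-contained illustration — the response body below is made up (including the `no_html` parameter) just to match the shape of the stub:

```python
import re

# Hypothetical interstitial response body, shaped like the one the
# site returns before redirecting the browser.
body = 'document.location.href="https://pbejobbers.com/product/search?search=USC4215&no_html=1";'

# Capture everything up to the trailing =1, as in the solution above.
matches = re.search(r'document\.location\.href="(.*)=1";', body, re.M | re.S)
url = matches[1]
print(url)  # the redirect URL, minus the trailing =1
```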
Solution 2:
The JavaScript probably populates the HTML portion of the page dynamically once the browser starts executing it, so urllib can't download the complete source.

Your Python script needs to use a headless-browser framework such as Selenium to load the page as a browser would, and then extract what you need.

As others mentioned, please do not violate the site's terms of service, especially if the data is private or behind a login page.
Solution 3:
When I manually search for USC4215, the URL is https://pbejobbers.com/product/search?search=USC4215&_rand=0.35863039778309025

The website appends a random secret _rand to deter robot web-crawling, so you need to make the request with a valid random secret to receive a response.

In practice, the secret is usually generated together with a set of cookies. If you open Inspect ==> Network ==> Doc and press Ctrl + R to refresh the page, you can see the network traffic as you make another request — precisely what your HTTP request and response contents are.
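As a sketch of what the browser does, the search URL with its _rand value can be reproduced with the standard library alone (the parameter names are taken from the URL observed above):

```python
import random
from urllib.parse import urlencode

# Build the search URL the way the browser does, appending a fresh
# random float as the _rand parameter (names from the observed URL).
upc = 'USC4215'
params = urlencode({'search': upc, '_rand': random.random()})
url = f'https://pbejobbers.com/product/search?{params}'
print(url)
```

Note that, as described above, the server likely only accepts the value together with the matching cookies, so requesting this URL without the right cookie jar may still not return the real page.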