Different result from pythonanywhere than local script

Hi, I'm baffled by this problem. I'm using Beautiful Soup to get information from builtwith.com, and I'm getting a completely different HTML page on PythonAnywhere (it claims there are no results). When I run the same script on my local machine, it works perfectly. I've already tried using headers to make it seem like a browser visit.

Main code in question:

import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent so the request looks like a normal browser visit
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

url_to_request = input('Enter the URL you would like to scrape > ')
r = requests.get('https://builtwith.com/' + url_to_request, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
# find_all is the current name; findAll is the older alias
allH3 = soup.find_all('h3')

Except on PythonAnywhere, url_to_request is received by an API call. I've already tested to make sure that it goes to the right URL.

Thanks in advance for your help.

You could try to debug this with wget or curl.

Sorry, I'm not knowledgeable enough to know how to do that without some instructions.

Open a Bash console and type the following:

curl https://builtwith.com/your/request/url

Press enter and see what the result is.

Alternatively, you should print r.text to see what is being returned.
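To make that comparison easier, here is a minimal sketch of a helper that summarizes a response so the local and PythonAnywhere runs can be put side by side. The name describe_response and the preview_chars parameter are made up for illustration; the helper only assumes the object has the status_code, url, headers and text attributes that a requests response provides.

```python
def describe_response(r, preview_chars=300):
    """Return a short, printable summary of a requests-style response.

    `r` only needs .status_code, .url, .headers and .text, so a real
    requests.Response object works here.
    """
    lines = [
        f"status: {r.status_code}",
        f"final url: {r.url}",
        f"content-type: {r.headers.get('Content-Type', '(missing)')}",
        f"body length: {len(r.text)}",
        "body preview:",
        r.text[:preview_chars],
    ]
    return "\n".join(lines)

# Usage with the script from the question (not run here):
#   r = requests.get('https://builtwith.com/' + url_to_request, headers=headers)
#   print(describe_response(r))
```

If the status code or body length differs between the two environments, that usually narrows things down faster than reading the parsed h3 tags.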

So, curl returns the right result (the same as the local script). Weird thing: the Python script occasionally works. I ran it a few times, and the first time it returned what I expected. On subsequent runs it didn't.

That definitely sounds like you are getting rate-limited by builtwith.com.
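One way to check the rate-limiting theory is to space the requests out and see whether the expected page comes back after a wait. The sketch below is only that, a test under assumptions: fetch_with_backoff, fetch and looks_limited are hypothetical names, and how to recognize the "no results" page is left to the caller because the thread doesn't show what that page contains. It won't get around a limit that is tied to a paid plan.

```python
import time

def fetch_with_backoff(fetch, looks_limited, max_attempts=4,
                       base_delay=2.0, sleep=time.sleep):
    """Call fetch() until looks_limited(result) is False, waiting longer each time.

    fetch:         zero-argument callable returning the page text
    looks_limited: callable that returns True for the "no results" page
    Returns the last result, which may still be a limited page if every
    attempt was limited.
    """
    result = None
    for attempt in range(max_attempts):
        result = fetch()
        if not looks_limited(result):
            return result
        # Wait 2s, 4s, 8s, ... but don't sleep after the final attempt
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return result

# Usage with the script from the question (not run here; the marker
# string is a placeholder for whatever the limited page really says):
#   text = fetch_with_backoff(
#       lambda: requests.get('https://builtwith.com/' + url_to_request,
#                            headers=headers).text,
#       lambda page: 'no results' in page.lower(),
#   )
```

If the page comes back correctly after a longer wait, that supports the rate-limiting explanation.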

Yeah, I ran some more tests and it appears that's the case. Is there any way around this?

My only other option is to build the builtwith functionality myself (which I was attempting to do before). The problem is that the only way I've found to do this reliably is to capture all the network requests that the target page sends, and from what I've heard, the only way to do that is to spoof a browser visit through a proxy and grab the info. I tried doing that with Selenium here and found that it didn't work.

Yes, unfortunately. If builtwith limits how frequently you can access their site unless you pay them, then paying is most likely what you will have to do!