
How to fix "ValueError at /search/ read of closed file"?

Hey! Last night I upgraded my account to "Hacker", because "Beginner" did not give access to external sites. But I still cannot download external internet sites. Why? Here's my application: http://tauguard.pythonanywhere.com/search/?q=http://www.booking.com/hotel/ua/boutique-hotel-kavalier.ru.html

Hi there,

Can you try closing all your consoles, logging out and logging back in, and then restarting your web app?

harry, I closed all the consoles, logged out and logged back in. But the error is still there: http://tauguard.pythonanywhere.com/search/?q=http://www.booking.com/hotel/ua/boutique-hotel-kavalier.ru.html :(

Did you restart your web app?

Yes

Can you rewrite your code so that we can see the specific error that's happening here? It's hard to tell whether the problem is happening because of a proxy setting, or because of something else...

p=MyOpener().open(page).read()

For example, it could be because whichever URL library you're using has a bug. Try swapping it out for requests?

import requests
response = requests.get(page)
# requests exposes the HTTP status as `status_code` (there is no `status` attribute)
assert response.status_code == 200
p = response.text

I tried to use your code. Here's what happened: http://tauguard.pythonanywhere.com/search/?q=http://www.booking.com/hotel/ua/boutique-hotel-kavalier.ru.html

OK, so that's telling us the GET request isn't returning a 200, which means there's a problem. Try just printing the response text?

assert False, "status was {}, response text was {}".format(response.status_code, response.text)
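A note on the debugging trick above: dumping the whole response body into an assertion message can make the error page hard to read. A minimal sketch of a helper that truncates the body first is below. The name `summarize_response` and the `limit` parameter are hypothetical, not from this thread, and the status code and body are passed in as plain values so it does not depend on any particular HTTP library.

```python
def summarize_response(status_code, text, limit=200):
    """Format a status code and a (possibly truncated) body for a debug message."""
    # Keep at most `limit` characters of the body so the message stays readable.
    body = text if len(text) <= limit else text[:limit] + "..."
    return "status was {}, response text was {!r}".format(status_code, body)


print(summarize_response(400, "Bad Request"))
```

It could then be used as the assertion message, e.g. `assert False, summarize_response(code, body)`, where `code` and `body` come from whatever response object you have.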

So, the result: http://tauguard.pythonanywhere.com/search/?q=http://www.booking.com/hotel/ua/boutique-hotel-kavalier.ru.html

So, status code 400 means "bad request" (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.4.1)
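If you don't want to look the codes up by hand, Python's standard library can translate them. This is only a small sketch; the helper name `describe_status` is made up for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus only knows the registered codes.
        return "{} (unknown status code)".format(code)
    return "{} {}".format(status.value, status.phrase)


print(describe_status(400))
print(describe_status(200))
```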

Could it be that booking.com are trying to block scrapers, so they are preventing access if they think it's not coming from a "real" web browser?

Maybe. But this code is working on my computer (locally):

from urllib.request import FancyURLopener
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'   
p=MyOpener().open(page).read()
print(p)

What do I do?

It looks like your local version uses a custom user-agent (the mozilla/5.0 etc etc line). Try adding that to the requests call:

requests.get(page, headers={'User-agent': 'Mozilla/5.0 etc etc'})

http://tauguard.pythonanywhere.com/search/?q=http://www.booking.com/hotel/ua/boutique-hotel-kavalier.ru.html Oh, an error. Maybe I followed your advice incorrectly?

Sorry for my English. What I want to say is that I'm getting errors with this code:

import requests
page = "http://www.booking.com/reviewlist.ru.html?pagename=" + your_hotel_pagename + ";cc1=" + your_hotel_cc1 + ";rows=100"
response = requests.get(page, headers={'User-agent': 'Mozilla/5.0 etc etc'})  # was: requests.get(page)
assert False, "status_code was {}, response text was {}".format(response.status_code, response.text)
# assert response.status_code == 200
p = response.text
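As an aside, the string concatenation that builds `page` in the snippet above could be factored into a small helper, which makes it easier to print and check the URL before requesting it. This is just a sketch: `build_review_url` is a hypothetical name, its parameters mirror `your_hotel_pagename` and `your_hotel_cc1`, and the `;`-separated query format is simply copied from the code in this thread.

```python
from urllib.parse import quote


def build_review_url(pagename, cc1, rows=100):
    """Assemble the review-list URL from a hotel page name and country code."""
    # quote() escapes characters that are not safe in a URL; plain
    # names like "boutique-hotel-kavalier" pass through unchanged.
    return ("http://www.booking.com/reviewlist.ru.html"
            "?pagename={};cc1={};rows={}".format(quote(pagename), quote(cc1), rows))


print(build_review_url("boutique-hotel-kavalier", "ua"))
```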

Ah, you misunderstood me. Put the full user-agent header in:

response = requests.get(page, headers={
    'User-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
})
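To double-check which header will actually go out, one option is to build the request without sending it and inspect it. The sketch below uses the standard library's `urllib.request.Request` rather than requests, purely so it can run without any network access; `build_request` is a hypothetical helper name and `http://www.example.com/` is a placeholder URL.

```python
from urllib.request import Request

USER_AGENT = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) "
              "Gecko/20071127 Firefox/2.0.0.11")


def build_request(url, user_agent=USER_AGENT):
    """Build (but do not send) a GET request carrying a custom User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.example.com/")
# urllib stores header names via str.capitalize(), hence "User-agent" here.
print(req.get_method(), req.full_url)
print(req.get_header("User-agent"))
```

If the printed value is still a placeholder like 'Mozilla/5.0 etc etc', the full string has not been passed through.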

I still get the error :(

http://tauguard.pythonanywhere.com/search/?q=http://www.booking.com/hotel/ua/boutique-hotel-kavalier.ru.html

Even if I write it like this, there is still an error:

response = requests.get(page, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})

Hm. My only theory is that booking.com have blocked our IP then...

Is there an API you can use, instead of trying to scrape their home page?

harry, I just need to download a page, such as this one:

http://www.booking.com/reviewlist.ru.html?pagename=boutique-hotel-kavalier;cc1=ua

into a variable :)

Looks to me like booking.com is blocking non-browser requests. That's a pretty good indication that they don't want people scraping their site.

I don't know all the intricacies of Python, but on my computer this code works in the console. The console isn't a browser, is it?

from urllib.request import FancyURLopener
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'   
p=MyOpener().open(page).read()
print(p)

I know. But try it in a PythonAnywhere console and you'll see it doesn't work. booking.com are blocking the request somehow. I just don't think you're going to be able to get it to work on PythonAnywhere in the same way as it does on your PC.

So your choices are either to try and find an API to get reviews from booking.com, or to try another host. I suppose you could also try to use some sort of VPN...

Ultimately, though, as Glenn says, it looks like booking.com don't want people to scrape their website in that way -- you'll probably find it's against their Terms and Conditions. And, really, we can't help you to break other sites' Ts & Cs...