How to fix "ValueError at /search/ read of closed file"? : Forums : PythonAnywhere

Hi! Last night I upgraded to a "Hacker" account because the "Beginner" account did not give access to external sites. But I still cannot download pages from external sites. Why? Here's my application: http://tauguard.pythonanywhere.com/search/?q=http://www.booking.com/hotel/ua/boutique-hotel-kavalier.ru.html

deleted-user-453714 | 13 posts | Nov. 19, 2014, 5:12 a.m. | permalink

Hi there,

Can you try closing all your consoles, logging out and logging back in, and then restarting your web app?

harry | 2710 posts | PythonAnywhere staff | Nov. 19, 2014, 12:08 p.m. | permalink

harry, I closed all the consoles, logged out and logged back in. But the error is still there: http://tauguard.pythonanywhere.com/search/?q=http://www.booking.com/hotel/ua/boutique-hotel-kavalier.ru.html :(

deleted-user-453714 | 13 posts | Nov. 19, 2014, 12:34 p.m. | permalink

Did you restart your web app?

harry | 2710 posts | PythonAnywhere staff | Nov. 19, 2014, 12:49 p.m. | permalink

Yes

deleted-user-453714 | 13 posts | Nov. 19, 2014, 12:58 p.m. | permalink

Can you rewrite your code so that we can see the specific error that's happening here? It's hard to tell whether the problem is happening because of a proxy setting, or because of something else...

p=MyOpener().open(page).read()

For example, it could be because whichever url library you're using has a bug. Try swapping it out for requests?

import requests
response = requests.get(page)
assert response.status_code == 200
p = response.text
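
A minimal sketch of one way to surface the specific error, using only the standard library rather than requests. The function name `fetch_or_explain` and its `opener` parameter are hypothetical, introduced here for illustration, and the 10-second timeout is an assumption:

```python
import urllib.error
import urllib.request


def fetch_or_explain(url, opener=urllib.request.urlopen):
    """Return the page text, or a short description of what went wrong.

    `opener` is a hypothetical hook so the function can be exercised
    without network access; by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 400 or 403.
        # HTTPError is a subclass of URLError, so it has to be caught first.
        return "HTTP error {}: {}".format(exc.code, exc.reason)
    except urllib.error.URLError as exc:
        # No usable answer at all: DNS failure, refused connection, proxy problem.
        return "Could not reach server: {}".format(exc.reason)
```

Returning the message instead of raising makes it easy to show the result on the page while debugging; it is not meant as a final error-handling design.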

harry | 2710 posts | PythonAnywhere staff | Nov. 19, 2014, 1:16 p.m. | permalink

I tried to use your code. Here's what happened: http://tauguard.pythonanywhere.com/search/?q=http://www.booking.com/hotel/ua/boutique-hotel-kavalier.ru.html

deleted-user-453714 | 13 posts | Nov. 19, 2014, 1:41 p.m. | permalink

OK, so that's telling us the GET request isn't returning a 200, which means there's a problem. Try just printing the response text?

assert False, "status was {}, response text was {}".format(response.status_code, response.text)

harry | 2710 posts | PythonAnywhere staff | Nov. 19, 2014, 2:06 p.m. | permalink

So, the result: http://tauguard.pythonanywhere.com/search/?q=http://www.booking.com/hotel/ua/boutique-hotel-kavalier.ru.html

deleted-user-453714 | 13 posts | Nov. 19, 2014, 5:02 p.m. | permalink

So, status code 400 means "bad request" (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.4.1)
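
If it helps while debugging, the standard library can translate a numeric status into its standard phrase. This is only a sketch; `describe_status` is a hypothetical helper name, not part of any library:

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code followed by its standard reason phrase."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know about.
        return "{} (not a standard HTTP status code)".format(code)
    return "{} {}".format(status.value, status.phrase)


print(describe_status(400))  # prints "400 Bad Request"
```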

Could it be that booking.com are trying to block scrapers, so they are preventing access if they think it's not coming from a "real" web browser?

harry | 2710 posts | PythonAnywhere staff | Nov. 19, 2014, 5:07 p.m. | permalink

Maybe. But this code is working on my computer (locally):

from urllib.request import FancyURLopener
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'   
p=MyOpener().open(page).read()
print(p)

What do I do?

deleted-user-453714 | 13 posts | Nov. 19, 2014, 5:17 p.m. | permalink

It looks like your local version uses a custom user-agent (the mozilla/5.0 etc etc line). Try adding that to the requests call:

requests.get(page, headers={'User-agent': 'Mozilla/5.0 etc etc'})

harry | 2710 posts | PythonAnywhere staff | Nov. 19, 2014, 5:23 p.m. | permalink

http://tauguard.pythonanywhere.com/search/?q=http://www.booking.com/hotel/ua/boutique-hotel-kavalier.ru.html

Oh, an error. Maybe I'm following your advice incorrectly?

page = "http://www.booking.com/reviewlist.ru.html?pagename="+your_hotel_pagename+";cc1="+your_hotel_cc1+";rows=100"
import requests
response = requests.get(page, headers={'User-agent': 'Mozilla/5.0 etc etc'})  # requests.get(page)
assert False, "status_code was {}, response text was {}".format(response.status_code, response.text)
#assert response.status_code == 200
p = response.text

deleted-user-453714 | 13 posts | Nov. 19, 2014, 5:31 p.m. | permalink

Sorry for my English. I meant to say that I may have made some mistakes.

deleted-user-453714 | 13 posts | Nov. 19, 2014, 5:33 p.m. | permalink

page = "http://www.booking.com/reviewlist.ru.html?pagename="+your_hotel_pagename+";cc1="+your_hotel_cc1+";rows=100"
import requests
response = requests.get(page, headers={'User-agent': 'Mozilla/5.0 etc etc'})  # requests.get(page)
assert False, "status_code was {}, response text was {}".format(response.status_code, response.text)
#assert response.status_code == 200
p = response.text
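
A small sketch of building the same review URL with the values escaped first, in case a page name ever contains a character such as `;` or a space. The helper name `build_review_url` is hypothetical, and escaping each value is an assumption about what the site expects rather than something confirmed in this thread:

```python
from urllib.parse import quote


def build_review_url(pagename, cc1, rows=100):
    """Build the semicolon-separated review URL used in this thread."""
    return (
        "http://www.booking.com/reviewlist.ru.html"
        "?pagename={};cc1={};rows={}".format(
            quote(pagename, safe=""),  # percent-encode anything unsafe
            quote(cc1, safe=""),
            int(rows),
        )
    )


page = build_review_url("boutique-hotel-kavalier", "ua")
```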

deleted-user-453714 | 13 posts | Nov. 19, 2014, 5:34 p.m. | permalink

Ah, you misunderstood me. Put the full user-agent header in:

response = requests.get(page, headers={
    'User-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
})
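
For comparison, the same header can be set with the standard library alone. `FancyURLopener` has been deprecated since Python 3.3, so this sketch uses `urllib.request.Request` instead; the names `BROWSER_UA` and `build_browser_like_request` are hypothetical, chosen only for this example:

```python
import urllib.request

# The user-agent string from the locally working code in this thread.
BROWSER_UA = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) "
    "Gecko/20071127 Firefox/2.0.0.11"
)


def build_browser_like_request(url, user_agent=BROWSER_UA):
    """Create a Request carrying a browser-style User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Building the request does not touch the network; the fetch would be:
# p = urllib.request.urlopen(build_browser_like_request(page)).read()
```

Note that a matching header only changes what the client claims to be; a site can still refuse requests for other reasons, such as the address they come from.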

harry | 2710 posts | PythonAnywhere staff | Nov. 19, 2014, 6:11 p.m. | permalink

Still get the error :(

http://tauguard.pythonanywhere.com/search/?q=http://www.booking.com/hotel/ua/boutique-hotel-kavalier.ru.html

deleted-user-453714 | 13 posts | Nov. 19, 2014, 7:15 p.m. | permalink

Even if I write it like this, there is still an error:

response = requests.get(page, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})

deleted-user-453714 | 13 posts | Nov. 19, 2014, 7:18 p.m. | permalink

Hm. My only theory is that booking.com have blocked our IP then...

Is there an API you can use, instead of trying to scrape their home page?

harry | 2710 posts | PythonAnywhere staff | Nov. 20, 2014, 10:55 a.m. | permalink

harry, I just need to download a page, such as this one:

http://www.booking.com/reviewlist.ru.html?pagename=boutique-hotel-kavalier;cc1=ua

into a variable :)

deleted-user-453714 | 13 posts | Nov. 20, 2014, 1:23 p.m. | permalink

Looks to me like booking.com is blocking non-browser requests. That's a pretty good indication that they don't want people scraping their site.

glenn | 9498 posts | PythonAnywhere staff | Nov. 20, 2014, 1:44 p.m. | permalink

I don't know all the intricacies of Python, but on my computer this code works in the console. The console is not a browser, is it?

from urllib.request import FancyURLopener
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'   
p=MyOpener().open(page).read()
print(p)

deleted-user-453714 | 13 posts | Nov. 20, 2014, 2:55 p.m. | permalink

I know. But try it in a PythonAnywhere console and you'll see it doesn't work. booking.com are blocking the request somehow. I just don't think you're going to be able to get it to work on PythonAnywhere in the same way as it does on your PC.

So your choices are either to try and find an API to get reviews from booking.com, or to try another host. I suppose you could also try to use some sort of VPN...

Ultimately, though, as Glenn says, it looks like booking.com don't want people to scrape their website in that way -- you'll probably find it's against their Terms and Conditions. And, really, we can't help you to break other sites' Ts & Cs...

harry | 2710 posts | PythonAnywhere staff | Nov. 20, 2014, 3:46 p.m. | permalink