
Python session request works locally, but not on PythonAnywhere

After successfully updating code to account for changes in a website (which now attempts to verify that a page is running JavaScript, etc.)*, I am attempting to transfer this update to PythonAnywhere. This partially works on PA (I am able to log in to the website remotely), but subsequent session requests do not work as they do locally (i.e. without the JavaScript validation interruption).

*Particular validation cookies are retrieved for logging in and persist to allow other network requests without javascript checks. This works when run from my computer (and again, the login portion works on PA), but the subsequent session request does not work on PA.

The request is made within a thread launched from a scheduled task.

To avoid posting excessive detail, I’ll await any instruction or questions. If there are any potentially helpful details that I could post, please let me know. Thank you.

afaik there is limited support for threads and subprocesses; they may be killed at any time if left dangling

@irtusb that is correct for webapps. However, it sounds like @Epere063 is talking about a scheduled task, which should run any threading/subprocessing fine.

When you say that you are able to login to a website but cannot make requests afterwards, are you using selenium? Or some sort of python requests/urllib library? What exactly is returned when you make requests afterwards? And you mentioned that this code works if you run it on your local machine?

Python requests are being used and are currently preferable to Selenium if there is a way to make it work.

To confirm, yes, the same session/request workflow is run in PyCharm on a mac (10.13.6) with the intended results.

Testing further, I retried this particular example (posted below) as its own scheduled task multiple times, and it unexpectedly worked each time. However, again, when transferring the code to the main function/thread launched from a scheduled task, I get the same results (javascript validation page).
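One way to narrow this down might be to call the same function both directly and from a worker thread within a single scheduled task, so that the thread is the only thing that changes. The sketch below is only an illustration: `run_direct_and_threaded` is a hypothetical helper, and `func` stands in for whatever function makes the session request.

```python
import threading


def run_direct_and_threaded(func, *args, **kwargs):
    """Call func once in the current thread and once in a worker thread.

    Returns a dict with the keys "direct" and "threaded". If a call raises,
    the exception object is stored in place of the return value.
    """
    results = {}

    def worker():
        try:
            results["threaded"] = func(*args, **kwargs)
        except Exception as exc:  # keep the error so it can be inspected
            results["threaded"] = exc

    try:
        results["direct"] = func(*args, **kwargs)
    except Exception as exc:
        results["direct"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()  # wait for the worker so it is never left dangling
    return results
```

Since the request reportedly works the first time only, the order of the two calls may itself affect the outcome, so it could be worth swapping them on a second run.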

A minimum working example is posted below followed by the 'javascript validation' response text returned when the request doesn't work. Let me know if other details are needed.

Note: when attempting the main request by itself (i.e. even without the preliminary request for IDs), it appears to work the first time only. So, if it works when testing, it would need to be run again to be sure the IDs are being validated.

If running the example, you'll know that it works if "-- listings" is printed after JSON results. "-- HTML following error" with the error HTML would be printed otherwise.

MWE:

*As this may have been needlessly cluttering this post, it was removed (but can be posted or sent privately upon request)

Response when not working (side note: validation IDs still show in session.cookies when this occurs):

<!DOCTYPE html>
<html>

<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=https://www.hubzu.com/distil_r_captcha.html?requestId=f06abe3b-f8ea-481b-969b-0ecb42465ecb&httpReferrer=%2FsearchResult%2Fcounty%2F1072431%2Ffl%2Fbroward-county%3FsrchBtnClk%3D1%26searchBy%3DBroward%2BCounty%2BFL%26pageNumber%3D3%26ajaxhtml%3Dtrue" />
<script type="text/javascript">
    (function(window){
        try {
            if (typeof sessionStorage !== 'undefined'){
                sessionStorage.setItem('distil_referrer', document.referrer);
            }
        } catch (e){}
    })(window);
</script>
<script type="text/javascript" src="/gxgmeitaklswiexy.js" defer></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#sfyaazqtacfrrw{display:none!important}</style><script type="text/javascript" src="https://n-cdn.areyouahuman.com/play/ZJFYkJE5SICN8qC78YCEaQsMw8PeMdzQFeIwtPBN?AYAH_P1=f06abe3b-f8ea-481b-969b-0ecb42465ecb&AYAH_P2=F3F9BF55-80F2-3663-B46E-FFBD19395FE5&AYAH_P3=18886A2C-D2F1-3CA0-B17D-4A9340440B2A&AYAH_F1=2968&AYAH_F2=12495"></script>

<noscript><img src="https://n-cdn.areyouahuman.com/noscript/ZJFYkJE5SICN8qC78YCEaQsMw8PeMdzQFeIwtPBN?AYAH_P1=f06abe3b-f8ea-481b-969b-0ecb42465ecb&AYAH_P2=F3F9BF55-80F2-3663-B46E-FFBD19395FE5&AYAH_P3=18886A2C-D2F1-3CA0-B17D-4A9340440B2A&AYAH_F1=2968&AYAH_F2=12495"></noscript></head>
<body>
<div id="distilIdentificationBlock">&nbsp;</div>
</body>
</html>
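For logging purposes, a response like the one above could be recognised by strings that appear in it. This is a minimal sketch: the helper name is hypothetical, and the two markers are simply taken from the HTML posted above, so they may not cover every form the validation page can take.

```python
# Markers copied from the validation response posted above (an assumption:
# other variants of the page may use different identifiers).
VALIDATION_MARKERS = (
    "distilIdentificationBlock",
    "distil_r_captcha",
)


def looks_like_validation_page(body: str) -> bool:
    """Return True if the response text contains a known validation marker."""
    return any(marker in body for marker in VALIDATION_MARKERS)
```

With `requests`, the text to check would be `response.text`.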

I'm afraid we can't give any insights into why a particular site decides that they need a verification page. However, if they're giving you some authentication page to check whether you're human, it probably means that they're not happy with you scraping their site.

For verification/accuracy for other viewers, I believe the following is a pertinent clarification:

'Insights can't be given into why a particular site decides that they need a verification page even when PythonAnywhere is the only apparent culprit variable [1]'

[1] -- The validation page only arises when code is run from PythonAnywhere.

Side note: There is presently no reason to assume more than the face value of the javascript validation (as stated on the page). As certain sites' sales may even be facilitated by certain robotic processes, it's usually not automatically assumed that data is being stolen -- they are probably happy when you follow what is delineated in the fine print.

I don't think this is a PythonAnywhere thing. Sounds more like

  1. different library version
  2. the IP being checked more stringently because of too many accesses within a certain time period etc

It is also possible that other people on PythonAnywhere are accessing the same site from the same PythonAnywhere IP, so you are getting fewer access attempts before being asked to validate again.

But in general, I would definitely agree with Glenn that any JS validation/check (eg: please fill in a captcha) is there to guard against scraping.
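To check the first possibility (different library versions), something like the following could be run both locally and on PythonAnywhere and the two outputs compared. It is a sketch that assumes Python 3.8 or later (for `importlib.metadata`); `environment_report` is a hypothetical helper and the default package names are only a guess at what matters here.

```python
import platform
from importlib import metadata


def environment_report(packages=("requests", "urllib3")):
    """Return the Python version and the installed version of each package.

    Packages that are not installed are reported as None.
    """
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report
```

Printing `environment_report()` in each environment gives two small dicts that can be compared by eye.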

Understood. While the same module versions are being used, what’s left are the underlying environmental variables (any differences [apparently aside from IPs]* in what is being sent to the website in the request) that are different within PythonAnywhere [1].

*Key here is that the request works as launched as its own scheduled task but not when launched from within a thread started from a scheduled task [so it doesn’t initially appear to be the IP].
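If the difference does lie in what is being sent, one way to look for it might be to record the request headers in each context and diff them. With `requests`, the headers actually sent are available as `response.request.headers`. The helper below is a hypothetical sketch that only compares two header mappings; it assumes headers (including the `Cookie` header) are where the difference would show up, which may not be the case.

```python
def diff_headers(headers_a, headers_b):
    """Return {name: (value_a, value_b)} for every header that differs.

    A header missing on one side is reported with None for that side.
    """
    # HTTP header names are case-insensitive, so compare on lowercase keys.
    a = {key.lower(): value for key, value in headers_a.items()}
    b = {key.lower(): value for key, value in headers_b.items()}
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

For example, `headers_a` could be saved from the run that works as its own scheduled task and `headers_b` from the run inside the thread.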

[1] — As finding and making use of these differences may be untenable, it’s understood that another path forward may be in order.

Thank you for all of your reviews. If they arise, any other details would be appreciated. Otherwise, it appears that there may be options to navigate from here.

Re: JS validation, agreed: it’s clear that it is one of the purposes. My point was that it should be treated case-by-case, which I thought might be conveyed better than simply ‘They’re probably not happy with you scraping their site.’ If they can prevent anyone from easily taking the data they’ve collated or straining their resources, they should do that of course. It's not my goal to do either.