Selenium Parser Code not working properly

I have a free account.

I installed Django on pythonanywhere.com.

I put a folder with the parser files in the Django project. I'm parsing a whitelisted site.

Django is meant to be the front end that displays the parser results.

I am using Selenium Chromedriver.

The code works up to this line:

browser.get(page_url)

After that I see:

[2022-09-06 09:00:07] urllib3.connectionpool | DEBUG | _make_request: 452 | http://localhost:39509 "POST /session/d***************************44d/url HTTP/1.1" 500 195
[2022-09-06 09:00:08] urllib3.connectionpool | DEBUG | _make_request: 452 | http://localhost:39509 "DELETE /session/d***************************44d HTTP/1.1" 200 14

And so it goes on and on.

What can be the problem?

Is the site blocking itself from being parsed?

Or could the problem be that I'm running the browsers in threads, in this function:

import logging
from concurrent.futures import ThreadPoolExecutor, wait

logger = logging.getLogger(__name__)

def thread_pool(func, url, pages, parse_func, write_to_db_func):
    futures = []
    with ThreadPoolExecutor() as executor:
        # Submit one task per (first_page, last_page) range
        for page in pages:
            futures.append(
                executor.submit(func, url, page[0], page[1], parse_func, write_to_db_func))
            logger.debug(f'ThreadPoolExecutor take in work pages: {url} | {page[0]}-{page[1]}')
    # Wait for all submitted tasks to finish
    wait(futures)

But I can see that the thread pool does start, because this line:

logger.debug(f'ThreadPoolExecutor take in work pages: {url} | {page[0]}-{page[1]}')

was executed:

[2022-09-06 08:59:57] __main__ | DEBUG | thread_pool: 92 | ThreadPoolExecutor take in work pages: https://******************************* | 1-3

Thanks!

Threading currently does not work in web app code on PythonAnywhere, and we also don't recommend using Selenium in web app code. The whitelisted domains are usually APIs, so it should be easier to talk to them directly instead of scraping. But I see you're hitting localhost instead?
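
If threads are the blocker, one option is to process the page ranges one after another instead of submitting them to a pool. This is only a minimal sketch: `run_sequentially` and `fake_scrape` are hypothetical names, and it assumes `func` takes the same five arguments as in the `thread_pool` function above.

```python
import logging

logger = logging.getLogger(__name__)

def run_sequentially(func, url, pages, parse_func, write_to_db_func):
    """Call func once per (first_page, last_page) range, with no threads."""
    results = []
    for first_page, last_page in pages:
        logger.debug('Processing pages: %s | %s-%s', url, first_page, last_page)
        results.append(func(url, first_page, last_page, parse_func, write_to_db_func))
    return results

# Hypothetical stand-in for the real scraping function, for illustration only.
def fake_scrape(url, first, last, parse_func, write_to_db_func):
    rows = [parse_func(f'{url}?page={n}') for n in range(first, last + 1)]
    write_to_db_func(rows)
    return len(rows)

stored = []
counts = run_sequentially(
    fake_scrape, 'https://example.com', [(1, 3), (4, 5)], str.upper, stored.extend)
```

This is slower than a pool, since each range waits for the previous one, but it does not depend on threads at all.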

Ok. Thanks!

So the problem is in the threads.

Why doesn't the code stop at the line after ThreadPoolExecutor? Instead it goes further: the browser starts and even tries to connect to the page.

I don't understand the point about localhost. These are console messages on pythonanywhere.com. I guess http://localhost:39509 is pythonanywhere.com?
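
On the localhost question: Selenium drives Chrome through a separate chromedriver process, and the Python client talks to that process over HTTP on a local port. So http://localhost:39509 in the log is most likely chromedriver running on the same machine, not pythonanywhere.com and not the target site. In the WebDriver protocol, `POST /session/<id>/url` is the "navigate to a URL" command, so a 500 there suggests the driver reported an error for `browser.get(...)`. The log alone does not say why. A small sketch that pulls those fields out of the log line quoted above (the `summarize` helper is hypothetical):

```python
import re

# Matches the request summary urllib3 logs at DEBUG level, e.g.
# http://localhost:39509 "POST /session/<id>/url HTTP/1.1" 500 195
REQUEST_RE = re.compile(
    r'(?P<base>https?://\S+) "(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)

def summarize(line):
    """Return (base URL, method, path, status) from a urllib3 debug line, or None."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    return match["base"], match["method"], match["path"], int(match["status"])

sample = (
    '[2022-09-06 09:00:07] urllib3.connectionpool | DEBUG | _make_request: 452 | '
    'http://localhost:39509 "POST /session/d***************************44d/url HTTP/1.1" 500 195'
)
base, method, path, status = summarize(sample)

# A 5xx answer to POST .../url means the driver failed the navigate command.
navigate_failed = method == "POST" and path.endswith("/url") and status >= 500
```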

You could consider moving it out of your web app, as described at https://help.pythonanywhere.com/pages/AsyncInWebApps/, but that would not work on a free account.
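
As a rough illustration of what "moving it out of the web app" can look like, here is a generic job-table sketch. It is not taken from the linked help page. The web app only records a request, and a separate script does the slow scraping. The table layout and the names `init_db`, `enqueue` and `process_pending` are assumptions for illustration, and in a Django project the table would more naturally be a model.

```python
import sqlite3

def init_db(conn):
    """Create the hypothetical jobs table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY, url TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending', result TEXT)"
    )

def enqueue(conn, url):
    """Called from the web app: record the request and return immediately."""
    cur = conn.execute("INSERT INTO jobs (url) VALUES (?)", (url,))
    conn.commit()
    return cur.lastrowid

def process_pending(conn, scrape):
    """Called from a separate script, outside the web app."""
    rows = conn.execute(
        "SELECT id, url FROM jobs WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for job_id, url in rows:
        result = scrape(url)  # the slow part, e.g. the Selenium parser
        conn.execute(
            "UPDATE jobs SET status = 'done', result = ? WHERE id = ?",
            (result, job_id),
        )
    conn.commit()
    return len(rows)

# In-memory database and a stand-in scrape function, just to show the flow.
conn = sqlite3.connect(":memory:")
init_db(conn)
job_id = enqueue(conn, "https://example.com/page/1")
processed = process_pending(conn, lambda url: f"scraped {url}")
status, result = conn.execute(
    "SELECT status, result FROM jobs WHERE id = ?", (job_id,)
).fetchone()
```

The Django view would then only read finished rows to display them, so a page request never waits on a browser session.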