ERROR: Article `download()` failed with HTTPSConnectionPool(host='apnews.com', port=443): Max retries exceeded with url: /99cbd726c093675111ecf130fab26b8d (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 403 Forbidden'))

Hello,

I have a Python Flask application deployed on PythonAnywhere. The web app provides news feeds to mobile users, and it uses the GNews library. Link to the GNews documentation: GNews

Through this library, the web app scrapes news websites and aggregates the results in JSON format. To get an article's image, the web app calls get_full_article and reads top_image from the result. However, the image field does not contain a value like 'https://path/to/image.jpg'; it contains 'NA' instead. The GNews library does return the article link itself without any problem, as https://path/to/news_article. Please compare the image and url fields in the response below.
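For reference, the image lookup works roughly like this. It is a minimal sketch, not the app's actual code: `top_image_or_na` and `fetch_article` are hypothetical names, and I am assuming that `get_full_article` returns an object with a `top_image` attribute, or None when the download fails.

```python
def top_image_or_na(fetch_article, url):
    """Return the article's top image URL, or 'NA' if it cannot be fetched.

    fetch_article is any callable that takes a URL and returns an object
    with a top_image attribute (assumed here to be GNews().get_full_article).
    """
    try:
        article = fetch_article(url)
    except Exception:
        # Treat any download or parse failure as "no image".
        return "NA"
    # A failed download may yield None, and top_image may be an empty string.
    image = getattr(article, "top_image", None)
    return image or "NA"


# Intended use inside the app (requires the gnews package and network access):
#   from gnews import GNews
#   image = top_image_or_na(GNews().get_full_article, news_url)
```

With this kind of fallback, a blocked download shows up only as 'NA' in the JSON, which matches what the API returns.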

Please do check the link to the response: http://newsfocus.pythonanywhere.com/api/v1/all_news/1

Here is the expected response compared with the actual one:

Expectation:

{
    "cat_id": 1,
    "description": "Afternoon brief: 48th GST Council meet begins, tax evasion to be discussed Hindustan Times",
    "id": 61,
    "image": "https://images.hindustantimes.com/img/2022/12/17/1600x900/The-GST-Council-began-its-48th-meeting-virtually-o_1671261841593_1671261841593_1671261872877_1671261872877.jpg",
    "pub_date": "17 Dec",
    "publisher_name": "Hindustan Times",
    "publisher_url": "https://www.hindustantimes.com",
    "title": "Afternoon brief: 48th GST Council meet begins, tax evasion to be discussed - Hindustan Times",
    "url": "https://www.hindustantimes.com/india-news/afternoon-brief-48th-gst-council-meet-begins-tax-evasion-among-issues-to-be-discussed-and-all-the-latest-news-101671260618173.html"
},

Reality:

{
    "cat_id": 1,
    "description": "Afternoon brief: 48th GST Council meet begins, tax evasion to be discussed Hindustan Times",
    "id": 61,
    "image": "NA",
    "pub_date": "17 Dec",
    "publisher_name": "Hindustan Times",
    "publisher_url": "https://www.hindustantimes.com",
    "title": "Afternoon brief: 48th GST Council meet begins, tax evasion to be discussed - Hindustan Times",
    "url": "https://www.hindustantimes.com/india-news/afternoon-brief-48th-gst-council-meet-begins-tax-evasion-among-issues-to-be-discussed-and-all-the-latest-news-101671260618173.html"
},

When I run the app from the Bash console with the command python run.py, I see this error:

ERROR: Article `download()` failed with HTTPSConnectionPool(host='apnews.com', port=443): Max retries exceeded with url: /99cbd726c093675111ecf130fab26b8d (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 403 Forbidden'))
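To see which hosts are being refused, the error text can be parsed. This is only a sketch that assumes the message format shown above; `blocked_host` and `HOST_PATTERN` are hypothetical names of my own, not part of GNews or requests.

```python
import re

# Matches the host and port named in a connection-pool error message.
HOST_PATTERN = re.compile(r"HTTPSConnectionPool\(host='([^']+)', port=(\d+)\)")


def blocked_host(message):
    """Return (host, port) if the message looks like a proxy 403, else None."""
    if "ProxyError" not in message or "403 Forbidden" not in message:
        return None
    match = HOST_PATTERN.search(message)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# The message from the console output above, split across lines for readability.
message = (
    "HTTPSConnectionPool(host='apnews.com', port=443): Max retries exceeded "
    "with url: /99cbd726c093675111ecf130fab26b8d (Caused by ProxyError("
    "'Cannot connect to proxy.', OSError('Tunnel connection failed: 403 Forbidden'))"
)
print(blocked_host(message))
```

In this case the refused host is apnews.com on port 443, so the request is being rejected by the proxy before it ever reaches the news site.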

After looking into this further and browsing this forum, I understood that I have to report the GNews scraping library to PythonAnywhere, with a link to the library's documentation, so that it can be allowed to scrape websites.

I request that PythonAnywhere staff allow the GNews library to download articles. Here is the link to the documentation:

https://github.com/ranahaani/GNews

P.S. If you don't see the 'NA' value in the image field, it is probably because I ran the web app locally and uploaded the resulting app.db file to PythonAnywhere. Please treat the problem described above as the actual behaviour.

newsfocus | 4 posts | Dec. 17, 2022, 9:32 a.m.