
Temporary Failure in Name Resolution

I'm running into this error with an 'Always-On' task. I NEVER see it if I run the script as a normal console task, but once it appears in an 'Always-On' task it's permanent (at least until I stop and restart the task, after which it usually disappears for a while).

My script accesses the Path of Exile API, which is rate-limited, but I'm pretty sure this isn't an error at their end, and I have loads of sleep commands to ensure I never trip the rate limits (I've run this script in consoles without any errors, and on my local PC for DAYS without any errors).

The Python exception was HTTPSConnectionPool(host='api.pathofexile.com', port=443): Max retries exceeded with url: /character-window/get-characters?accountName=ohnpeat&realm=pc (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f903e233050>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
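One way to tell a DNS failure apart from a problem at the API end is to resolve the hostname directly before making the request, and to retry with a growing delay when resolution fails. The following is only a minimal sketch: the function name `resolve_with_retry`, its parameters, and the retry counts are made up for illustration and are not part of the original script. On Linux, `[Errno -3]` normally corresponds to `socket.EAI_AGAIN`, which the standard library raises as `socket.gaierror`.

```python
import socket
import time


def resolve_with_retry(host, port=443, attempts=5, base_delay=1.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Return True as soon as `host` resolves, False after `attempts` failures.

    `resolver` and `sleep` are injectable only so the function can be
    exercised without touching the network (hypothetical design choice).
    """
    for attempt in range(attempts):
        try:
            resolver(host, port)
            return True
        except socket.gaierror as exc:
            # "Temporary failure in name resolution" ends up here.
            print(f"DNS lookup {attempt + 1}/{attempts} for {host} failed: {exc}")
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
    return False


if __name__ == "__main__":
    if resolve_with_retry("api.pathofexile.com"):
        print("Name resolution is working; safe to call the API.")
    else:
        print("Name resolution is still failing; skipping this cycle.")
```

If the lookup keeps failing inside the task while the same call succeeds from a console, that points at the environment the task runs in rather than at the target server.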

There are some older topics which suggest this is an Amazon issue perhaps - just not sure why it only seems to affect "Always On" tasks?

Ideas?

It's very hard to debug network problems like that, as everything happens outside.

I suspect this is a configuration error within PA itself and nothing to do with "the Internet" or the target server.

Example: I can start an Always-On job and it'll work for a while. If I pause it and restart, it will throw that error endlessly. If I pause the job and restart again, it usually works.

I've also seen that error after a script is restarted by the 'Always-On' processing.

Add to that the fact I've NEVER seen this error when running in a normal console (or on my own PC) so...

I agree, this does sound like a problem with running your code in the context of the always-on task specifically, rather than a general network issue.

We did have a problem that had similar symptoms a year or so back, but we've put things in place so that we get alerted if it happens again, and those alerts have not been coming through. The logs that should show up problems like that are also clear. But that doesn't mean that this couldn't be a new problem with similar symptoms that doesn't show up on our side in the same way.

Could you give us some timelines about when it has happened? Perhaps then we can trawl the logs and see if there's anything there.

The logs I have here aren't time-stamped "per line", so I'd be guessing. I do know I had one incident in the 5-10 mins prior to posting my last message, but what I'll do is make a note the next time it fails and post it here ASAP.
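For per-line timestamps, Python's standard `logging` module can prefix every message with the time it was written, which makes it much easier to match a failure against server-side logs. This is a rough sketch only: the helper name `make_logger`, the logger name, and the format string are assumptions for illustration, not something taken from the script discussed here.

```python
import logging
import sys


def make_logger(name="always_on_task", stream=sys.stdout):
    """Return a logger that prefixes each line with a timestamp."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Avoid attaching a second handler if this is called more than once.
    if not logger.handlers:
        handler = logging.StreamHandler(stream)
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        ))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_logger()
    log.info("Fetched character list")
    log.error("Request failed: temporary failure in name resolution")
```

Note that `asctime` uses the local time of the machine running the task, so it is worth stating the timezone when reporting times.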

I've restarted it twice "just to see" and it's been OK, so the failure might not show up instantly, but I've made a note!

Sorry, I should have posted here yesterday -- someone else in another thread (which I saw after posting here previously) was having a similar problem, and had some timestamped logs. It turned out that there was indeed a network-related issue on one of our nodes, which was not triggering the expected alerts, and manifested itself differently in the logs.

We were able to track it down, fix it, and sort out the monitoring systems so that it should be picked up if it happens in the future, but please do let us know if it happens again.

Ah - excellent - many thanks for that!

Waking this thread up: the problem reappeared today (8/2/2021) between 11:05am and 12:15pm according to my 'Always-On' task log.

Just FYI

@trjp, thanks for letting us know.