[warning: I am a python newbie]
A script harvests data from an API. It needs something like 18 hours of wall-clock time but only about 1:30 of CPU time to run, since it spends most of its time waiting for the API to return data.
I am trying to improve the wall-clock time using Python's multiprocessing module.
import multiprocessing as mp
[snip]
while True:
    ## create a pool and process this batch of requests in parallel
    pool = mp.Pool(processes=r)
    results = [pool.apply_async(_processURL, args=(theURLs, x)) for x in range(r)]
    try:
        someVar = [p.get() for p in results]
    except:
        pass
    ## close the pool for this iteration
    pool.close()
    if (some condition): return
[snip]
def _processURL(theURLs, x):
    [do stuff]
This works great on my machine. On PA, however, the CPU time goes through the roof (3:30+). It's not entirely clear to me why the same amount of work is significantly more costly when run through mp. It may be a function of the number of parallel processes: I currently launch requests in batches of 20, which could be stupid on a dual-core machine (which is what I believe the setup at PA to be).
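From what I've read, the usual advice for I/O-bound work like this is threads rather than processes, since the workers spend their time waiting on the network anyway. Below is a minimal sketch of what I understand that to look like: _processURL and theURLs are from my script above, while fetch_batches, NUM_WORKERS and stop_condition are placeholder names of my own, and I haven't measured whether this actually helps.

from multiprocessing.pool import ThreadPool  ## same interface as mp.Pool, but uses threads

NUM_WORKERS = 8  ## placeholder value; threads are cheap when they mostly sit waiting on I/O

def fetch_batches(theURLs, batch_size):
    ## hypothetical wrapper: create the pool once and reuse it for every batch,
    ## instead of building a fresh pool on each pass through the loop
    pool = ThreadPool(processes=NUM_WORKERS)
    try:
        while True:
            results = [pool.apply_async(_processURL, args=(theURLs, x))
                       for x in range(batch_size)]
            someVar = [p.get() for p in results]  ## let errors surface, no bare except
            if stop_condition(someVar):  ## placeholder for my real stopping test
                return someVar
    finally:
        pool.close()
        pool.join()

If that's the wrong direction and the better fix is simply a small mp.Pool (say processes=2 on a dual-core box) that gets reused rather than rebuilt every iteration, I'd be glad to hear that too.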
If anyone has suggestions on how to run mp scripts effectively, please educate me. I've spent quite some time on Stack Overflow and ended up, if anything, more confused: packages and implementations are evolving fast enough that last year's guidelines appear to be obsolete.