Forums

Problems with lxml and requests.get

Hello,

Great service! I've just signed up today and have a 'Hacker' paid account.

I am trying to automate a task to check a website to see if anything has changed.

It seems that, for whatever reason, when I try to access the website using either lxml or requests, the request hangs / times out on PA.

I tried a half dozen other sites and was able to get a response of some kind from each; see below:

r = requests.get("http://www.pythonanywhere.com/terms")
print 'REQUESTS HTTP response code for the url ', 'python anywhere', ' => ', r.status_code

r = requests.get("http://www.cnn.com")
print 'REQUESTS HTTP response code for the url ', 'cnn', ' => ', r.status_code

page = requests.get("http://www.politico.com")
print 'REQUESTS HTTP response code for the url ', 'Politico', ' => ', page.status_code

page = requests.get("http://www.reddit.com")
print 'REQUESTS HTTP response code for the url ', 'Reddit', ' => ', page.status_code

page = requests.get("http://www.ticketmaster.com")
print 'REQUESTS HTTP response code for the url ', 'Ticket Master', ' => ', page.status_code

page = requests.get("http://www.apeconcerts.com")
print 'REQUESTS HTTP response code for the url ', 'APE Concerts', ' => ', page.status_code

results in...

01:15 ~ $ python ticktateScraperv02.py
Completed Yesterday Module
REQUESTS HTTP response code for the url  python anywhere  =>  200
REQUESTS HTTP response code for the url  cnn  =>  200
REQUESTS HTTP response code for the url  Politico  =>  200
REQUESTS HTTP response code for the url  Reddit  =>  200
REQUESTS HTTP response code for the url  Ticket Master  =>  403

...and then nothing until I do a KeyboardInterrupt:

^CTraceback (most recent call last):
  File "ticktateScraperv02.py", line 64, in <module>
    page = requests.get( "http://www.apeconcerts.com" )
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 55, in get
    return request('get', url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 383, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 486, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 334, in send
    timeout=timeout
  File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py", line 480, in urlopen
    body=body, headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py", line 313, in _make_request
    httplib_response = conn.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1051, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 415, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
KeyboardInterrupt

The code works fine from my local machine, by the way.

Thoughts?

Thanks!

Sincerely, Michael

Hm. Is it always the same sites that hang? Could it be that "apeconcerts.com" has detected too many suspicious/scraping requests from our servers, and it's put our IPs into some kind of blocklist? What do their terms + conditions say about scraping?
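One thing that would at least stop the script blocking forever: pass a timeout to requests.get, e.g. requests.get(url, timeout=10), which raises requests.exceptions.Timeout instead of hanging. Under the hood this is just a socket timeout. Here's a self-contained sketch (Python 3, stdlib only; the throwaway local server is purely illustrative, standing in for a remote host that silently drops your traffic):

```python
import socket
import threading
import time

# Throwaway local server that accepts a connection but never sends a
# byte, mimicking a remote host that silently drops scraping traffic.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

def hold_silently():
    conn, _ = server.accept()
    time.sleep(3)  # keep the connection open, say nothing
    conn.close()

threading.Thread(target=hold_silently, daemon=True).start()

# The timeout here applies to each blocking socket operation.
client = socket.create_connection(("127.0.0.1", port), timeout=1)
try:
    client.recv(1024)  # with no timeout this would block indefinitely
    timed_out = False
except socket.timeout:
    timed_out = True
finally:
    client.close()

print(timed_out)  # True
```

With a timeout set, your loop over sites would move on (or log the failure) instead of sitting in recv() until you hit Ctrl-C.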

Harry,

Thanks for the response. Yeah, it is the same sites (or at least, this one). APE doesn't have any public terms & conditions about scraping (it's a regional concert promotion company), but that could be the case, I suppose. I left the code running in a console overnight and it's still hung.

I actually switched over to lxml and it is able to access the site, but now I'm running into an error using XPath's count(). Again, this works fine locally, but it throws an error on PA.

import lxml
import lxml.etree
import lxml.html
from lxml import html

url = "http://www.apeconcerts.com"
parser = lxml.etree.HTMLParser(encoding='utf-8')
tree = lxml.etree.parse(url, parser)

print "got the tree"
print tree

count = tree.xpath("count(//*[@id='main']/div[2]/div)")

This results in the error:

15:42 ~ $ python ticktateScraperv03.py
Completed Yesterday Module
got the tree
<lxml.etree._ElementTree object at 0x7f0efd512fc8>
Traceback (most recent call last):
  File "ticktateScraperv03.py", line 81, in <module>
    count = tree.xpath("count(//*[@id='main']/div[2]/div)")
  File "lxml.etree.pyx", line 2111, in lxml.etree._ElementTree.xpath (src/lxml/lxml.etree.c:57604)
  File "lxml.etree.pyx", line 1780, in lxml.etree._ElementTree._assertHasRoot (src/lxml/lxml.etree.c:54277)
AssertionError: ElementTree not initialized, missing root

My current hypothesis is that there is ambiguity in what the root is (which, as I understand it, would be the 'html' tag). So I think I either need to write out the full XPath or somehow indicate that the html tag is the root?

I am planning to put some time against this later this evening.

Any thoughts prior to that would be very much appreciated!

Sincerely, Michael

PS - If not clear, I'm a Python beginner (esp. web) and a PA total beginner, so I appreciate the help.

No probs. My guess is that the "missing root" is lxml's unhelpful way of telling you it couldn't actually load anything from that URL... you could do a print(tree) or something similar to confirm if that's the case.

In the meantime, I think we might be blocked from scraping that particular site. You could contact the site administrators and ask them if they do sometimes block scraping requests?
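By the way, to make that check explicit in code: tree.getroot() tells you whether anything was actually parsed. A quick sketch with an inline document (parsing from BytesIO here just to keep it self-contained; with a failed fetch from a URL, getroot() can come back None even though parse() handed you a tree object):

```python
import io
import lxml.etree

parser = lxml.etree.HTMLParser(encoding="utf-8")
# Known-good inline document; substitute your URL fetch here.
tree = lxml.etree.parse(io.BytesIO(b"<html><body><p>hi</p></body></html>"), parser)

root = tree.getroot()
if root is None:
    raise RuntimeError("no document parsed - the fetch probably failed")
print(root.tag)  # html
```

That turns the cryptic "ElementTree not initialized, missing root" into an error you control.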

Harry - thanks. Yes, that looks like what is happening.

It does return an ElementTree object, but when I run this:

url = "http://www.apeconcerts.com"
parser = lxml.etree.HTMLParser(encoding='utf-8')
tree = lxml.etree.parse(url, parser)
print "got the tree"
print tree
treeString = lxml.etree.tostring(tree)
print treeString

results in...

got the tree
<lxml.etree._ElementTree object at 0x7f22e5467a70>
None

And then the file continues to run, and when it hits the next xpath() call it fails:

Traceback (most recent call last):
  File "ticktateScraperv03.py", line 67, in <module>
    testpath = tree.xpath("//html/body/main/div/div/div[1]/h1/a/text()")
  File "lxml.etree.pyx", line 2111, in lxml.etree._ElementTree.xpath (src/lxml/lxml.etree.c:57604)
  File "lxml.etree.pyx", line 1780, in lxml.etree._ElementTree._assertHasRoot (src/lxml/lxml.etree.c:54277)
AssertionError: ElementTree not initialized, missing root

Thanks for the help!

Yep, "None" sure looks like it didn't get anything back from the server. I guess talking to APE directly is the only thing we can recommend right now?
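In the meantime, one pattern worth trying once the fetch itself works: download the page yourself (with requests, where you control timeouts and headers) and hand the string to lxml, rather than letting lxml fetch the URL. On a good document the count() XPath behaves as expected; note that it returns a float, not an int. A sketch against a stand-in document (the real page's layout may differ):

```python
import lxml.html

# Stand-in for content fetched separately, e.g. with
# requests.get(url, timeout=10).content
doc = b"""<html><body>
  <div id="main">
    <div>header</div>
    <div><div>event 1</div><div>event 2</div></div>
  </div>
</body></html>"""

root = lxml.html.fromstring(doc)

# count() comes back as a float
count = root.xpath("count(//*[@id='main']/div[2]/div)")
print(count)  # 2.0
```

That also keeps the "did the fetch work?" question separate from the "is my XPath right?" question, which is exactly the confusion this thread ran into.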