Monday, December 28, 2009

Element.text and other ElementTree Annoyances

I have a love/hate relationship with ElementTree. It generally makes processing and generating XML very easy. But, some of the design decisions feel like they were meant to frustrate, rather than help, the programmer:

  • Many __str__ and __repr__ methods return near-useless strings like <Element ElementName at 7fb1d0f63e60>. Would methods that specify attributes and text/tail properties really be so difficult to define? Even "def __str__(self): return tostring(self)" would be an improvement.
  • Element() cannot specify text. The Element factory only lets you specify the tag name and attributes. There is no argument you can pass to specify the text or tail. See, for example, a discussion about setting the text property. I see no point in forcing the programmer to write the extra line of code.
  • Interfaces. ElementTree hides the Element class and provides a factory which returns an object which implements the _ElementInterface. There are other languages in which I can see this being a useful practice. But, python does not have sufficient language support and I find that this half-hearted attempt at abstraction simply makes the module more difficult to use. Python already provides plenty of tools to hide "magic" which don't interrupt programmer intuitions. Why not use those?
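
The first two complaints are easy to work around with a couple of small wrappers. The following is only a sketch: element and describe are hypothetical helper names, not part of ElementTree, and it is written for Python 3, where tostring accepts encoding="unicode".

```python
import xml.etree.ElementTree as ET


def element(tag, text=None, tail=None, **attrib):
    """Hypothetical helper: build an Element and set text/tail in one call."""
    elem = ET.Element(tag, attrib)
    elem.text = text
    elem.tail = tail
    return elem


def describe(elem):
    """Hypothetical helper: a more informative alternative to repr(elem)."""
    return ET.tostring(elem, encoding="unicode")


title = element("title", text="Hello", lang="en")
print(describe(title))  # prints <title lang="en">Hello</title>
```

One limitation of this design: because text and tail are keyword arguments, XML attributes with those names cannot be passed through **attrib and would have to be set with elem.set() afterwards.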

Wednesday, December 23, 2009

Asynchronous HTTP Request

Note (4/2/11): Please see my recent post detailing asynchronous HTTP requests using Twisted.

Note (3/13/11): I originally wrote this post while looking for callback-style HTTP request functionality in python. I made the mistake of thinking that "callback-style" is the same as "asynchronous". The following details my efforts to achieve a callback-style HTTP request using urllib2. The final (updated) code example illustrates how to use threads to achieve asynchronicity. I'd recommend using a thread pool if you plan more than just a handful of requests. And, as others have noted, Twisted is really the best python framework for asynchronous programming. Also, I'd like to thank the commenters for pointing out my mistakes; I'm sorry for not realizing my errors sooner.

You might think it would be easy to write python code to achieve a callback-style web request. It ought to be as simple as providing a url and callback function to some python library routine, no? Well, technically, it is that simple. But somehow, the documentation makes the task surprisingly difficult.

One option, of course, is Twisted. But, reading through the (sparse, fractured) documentation made me think there had to be something easier. This led me to urllib2. The short answer is that, yes, urllib2 does what I want. But, the documentation is sufficiently backwards that it took me over an hour to figure out how to accomplish the task.

Making a simple HTTP request with urllib2 is easy and the documentation reflects that: use urlopen. The return value of urlopen provides the response and additional information in a file-like object. The problem is how to achieve the same result in a callback-style manner. One would think urlopen could simply take an additional handler object which is called with the response as its only argument when the request completes. Ha! build_opener looked vaguely promising as it accepted handler(s). This led me to create a class which inherited from BaseHandler and defined protocol_response. No dice. And, as I later realized, protocol_response takes three arguments (self, req, response), not two, and its name depends on the protocol (for HTTP, the method is http_response). Of course, at that point, I was at a loss as to how the protocol name was determined (the BaseHandler documentation ignored this issue). And, the examples were useless since they all used standard handlers. Next, I tried inheriting from HTTPHandler, overriding http_response with a method that simply prints the url, info and response text. This almost worked. It successfully retrieved the web page and printed it. But, then, it raised the following exception:

Traceback (most recent call last):
  File "./webtest.py", line 14, in <module>
    o.open('http://www.google.com/')
  File "/usr/lib/python2.6/urllib2.py", line 389, in open
    response = meth(req, response)
  File "/usr/lib/python2.6/urllib2.py", line 496, in http_response
    code, msg, hdrs = response.code, response.msg, response.info()
AttributeError: 'NoneType' object has no attribute 'code'

After much searching, I finally realized that I had failed to return a response-like object from my http_response method. This seems like an odd requirement for a callback method. And, it could have been easily clarified in the documentation with an example.

Alas, after all that, I was able to use urllib2 to successfully make an asynchronous HTTP request, so I can't complain too much. Here's the code for anyone who's interested:

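#!/usr/bin/env python

import urllib2
import threading

class MyHandler(urllib2.HTTPHandler):
    # urllib2 calls <protocol>_response on each handler; for HTTP that is
    # http_response.
    def http_response(self, req, response):
        print "url: %s" % (response.geturl(),)
        print "info: %s" % (response.info(),)
        for line in response:
            print line
        # Must return a response-like object; returning None causes the
        # AttributeError shown above.
        return response

o = urllib2.build_opener(MyHandler())
# Run the blocking open() call on its own thread so the main thread continues.
t = threading.Thread(target=o.open, args=('http://www.google.com/',))
t.start()
print "I'm asynchronous!"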

Update (3/12/11): My comment before the sample code indicated that the sample code was asynchronous. But, it wasn't. I've updated it to be asynchronous. When originally writing this post, I intended the example code to show the urllib2 handler approach.
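
For readers on Python 3, where urllib2 was folded into urllib.request, the thread-based approach can be wrapped in a small callback-style helper. This is a minimal sketch and not the code from the original post: fetch_with_callback and its (url, body) callback signature are hypothetical names chosen for illustration, and it starts one thread per request, so a thread pool would be a better fit for more than a handful of requests.

```python
import threading
import urllib.request  # Python 3 home of the old urllib2 functionality


def fetch_with_callback(url, callback):
    """Fetch url on a worker thread, then call callback(final_url, body).

    Hypothetical helper: the callback runs on the worker thread, not on
    the caller's thread, and body is a bytes object.
    """
    def worker():
        # urlopen blocks, but only this worker thread waits on it.
        # Any exception raised here ends the worker thread; the caller
        # is not notified in this sketch.
        with urllib.request.urlopen(url) as response:
            callback(response.geturl(), response.read())

    thread = threading.Thread(target=worker)
    thread.start()
    return thread  # the caller may join() this if it needs to wait
```

A call such as fetch_with_callback('http://www.google.com/', on_page), where on_page is any two-argument function, returns immediately while the request proceeds in the background.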

Thursday, December 17, 2009

Reworking the GIL

The title of this post is stolen from an email describing the steps its author has taken to address the GIL issues David Beazley raised in his talk on the Global Interpreter Lock. The proposed changes certainly won't turn Python into a completely thread-friendly language (the GIL is not going away any time soon), but it sounds like they will greatly reduce thread overhead and make threaded code perform as if it were running on a single-core machine, which is what one would expect with a global interpreter lock.

Thursday, December 3, 2009

Twisted Annoyances

While Twisted is generally an excellent network library, it certainly has its quirks.
  • Exception Trapping: by default, the reactor will trap all exceptions. See ReactorBase.runUntilCurrent for the code that implements this horrible behavior. This breaks intuitions most developers have for how exceptions are supposed to work. For example, the python tutorial says:
    The last except clause may omit the exception name(s), to serve as a wildcard. Use this with extreme caution, since it is easy to mask a real programming error in this way!
    Twisted does an excellent job of masking real programming errors. I'm surprised that there is no option to turn off this behavior.
  • callFromThread: calling a reactor method (e.g. protocol.writeData) from a non-reactor thread doesn't work unless the call is wrapped with reactor.callFromThread.
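
Both quirks are easier to see in a toy model. The sketch below is not Twisted and does not use its API: ToyLoop, call_from_thread and errors are hypothetical names for a stdlib-only, queue-based event loop. It assumes the same basic shape as a reactor (a single thread runs every callback), which is why other threads may only enqueue work, and why the loop traps exceptions so that one failing callback does not stop it.

```python
import queue
import threading


class ToyLoop:
    """Hypothetical single-threaded event loop; NOT Twisted's reactor."""

    def __init__(self):
        self._calls = queue.Queue()  # thread-safe hand-off point
        self.errors = []             # exceptions trapped by the loop

    def call_from_thread(self, func, *args):
        """Safe from any thread: only enqueues, never runs func itself."""
        self._calls.put((func, args))

    def stop(self):
        self._calls.put(None)        # sentinel that ends run()

    def run(self):
        while True:
            item = self._calls.get()
            if item is None:
                break
            func, args = item
            try:
                func(*args)          # always executed on the loop thread
            except Exception as exc:
                # Trapped: the loop keeps running, and the error is only
                # visible to code that inspects self.errors.
                self.errors.append(exc)


loop = ToyLoop()
received = []


def worker():
    loop.call_from_thread(received.append, "from worker")
    loop.call_from_thread(lambda: 1 / 0)  # raises inside the loop, trapped
    loop.call_from_thread(loop.stop)


threading.Thread(target=worker).start()
loop.run()  # returns once the worker's stop request has been processed
```

After run() returns, received holds the one message and loop.errors holds the ZeroDivisionError. Nothing was raised to the caller, which is the masking behavior complained about above.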