Monday, December 28, 2009

Element.text and other ElementTree Annoyances

I have a love/hate relationship with ElementTree. It generally makes processing and generating XML very easy. But, some of the design decisions feel like they were meant to frustrate, rather than help, the programmer:

  • Many __str__ and __repr__ methods return near-useless strings like <Element ElementName at 7fb1d0f63e60>. Would methods that specify attributes and text/tail properties really be so difficult to define? Even "def __str__(self): return tostring(self)" would be an improvement.
  • Element() cannot specify text. The Element factory only lets you specify the tag name and attributes. There is no argument you can pass to specify the text or tail. See, for example, a discussion about setting the text property. I see no point in forcing the programmer to write the extra line of code.
  • Interfaces. ElementTree hides the Element class and provides a factory which returns an object which implements the _ElementInterface. There are other languages in which I can see this being a useful practice. But, python does not have sufficient language support and I find that this half-hearted attempt at abstraction simply makes the module more difficult to use. Python already provides plenty of tools to hide "magic" which don't interrupt programmer intuitions. Why not use those?
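
The first two complaints are easy to work around with a couple of small wrappers. The following is only a minimal sketch, written for Python 3's xml.etree.ElementTree; make_element and describe are hypothetical helper names, not part of the library.

```python
import xml.etree.ElementTree as ET


def make_element(tag, text=None, tail=None, **attrib):
    """Hypothetical helper: build an Element and set text/tail in one call."""
    elem = ET.Element(tag, attrib)
    elem.text = text
    elem.tail = tail
    return elem


def describe(elem):
    """Hypothetical helper: a readable stand-in for the default repr."""
    # encoding="unicode" makes tostring return str rather than bytes (Python 3).
    return ET.tostring(elem, encoding="unicode")


title = make_element("title", text="Hello", lang="en")
print(describe(title))
```

Since attributes are passed as keyword arguments, this sketch cannot set an attribute that is itself named text or tail; a real helper would probably take an explicit attrib dict instead.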

Wednesday, December 23, 2009

Asynchronous HTTP Request

Note (4/2/11): Please see my recent post detailing asynchronous HTTP requests using Twisted.

Note (3/13/11): I originally wrote this post while looking for callback-style HTTP request functionality in python. I made the mistake of thinking that "callback-style" is the same as "asynchronous". The following details my efforts to achieve a callback-style HTTP request using urllib2. The final (updated) code example illustrates how to use threads to achieve asynchronicity. I'd recommend using a thread pool if you plan more than just a handful of requests. And, as others have noted, Twisted is really the best python framework for asynchronous programming. Also, I'd like to thank the commenters for pointing out my mistakes; I'm sorry for not realizing my errors sooner.

You might think it would be easy to write python code to achieve a callback-style web request. It ought to be as simple as providing a url and callback function to some python library routine, no? Well, technically, it is that simple. But somehow, the documentation makes the task surprisingly difficult.

One option, of course, is Twisted. But, reading through the (sparse, fractured) documentation made me think there had to be something easier. This led me to urllib2. The short answer is that, yes, urllib2 does what I want. But, the documentation is sufficiently backwards that it took me over an hour to figure out how to accomplish the task.

Accomplishing a simple HTTP request with urllib2 is easy and the documentation reflects that: use urlopen. The return value of urlopen provides the response and additional information in a file-like object. The problem is how to achieve the same result in a callback-style manner. One would think urlopen could simply take an additional handler object which is called with the response as its only argument when the request completes. Ha! build_opener looked vaguely promising as it accepted handler(s). This led me to create a class that inherited from BaseHandler and defined protocol_response. No dice. And, as I later realized, protocol_response takes three arguments (self, req, response), not two, and its name changes depending on the protocol (e.g. http_response). Of course, at that point, I was at a loss as to how the protocol name was determined (the BaseHandler documentation ignored this issue). And, the examples were useless since they all used standard handlers. Next, I tried inheriting from HTTPHandler, overriding http_response with a method that simply prints the url, info and response text. This almost worked. It successfully retrieved the web page and printed it. But, then, it raised the following exception:

Traceback (most recent call last):
  File "./webtest.py", line 14, in <module>
    o.open('http://www.google.com/')
  File "/usr/lib/python2.6/urllib2.py", line 389, in open
    response = meth(req, response)
  File "/usr/lib/python2.6/urllib2.py", line 496, in http_response
    code, msg, hdrs = response.code, response.msg, response.info()
AttributeError: 'NoneType' object has no attribute 'code'

After much searching, I finally realized that I had failed to return a response-like object from my http_response method. This seems like an odd requirement for a callback method. And, it could have been easily clarified in the documentation with an example.

Alas, after all that, I was able to use urllib2 to successfully make an asynchronous HTTP request, so I can't complain too much. Here's the code for anyone who's interested:

#!/usr/bin/env python

import urllib2
import threading

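class MyHandler(urllib2.HTTPHandler):
    # urllib2 calls <protocol>_response on each handler once the response arrives.
    def http_response(self, req, response):
        print "url: %s" % (response.geturl(),)
        print "info: %s" % (response.info(),)
        for l in response:
            print l
        # Must return a response-like object; the rest of the handler chain uses it.
        return response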

o = urllib2.build_opener(MyHandler())
t = threading.Thread(target=o.open, args=('http://www.google.com/',))
t.start()
print "I'm asynchronous!"

Update (3/12/11): My comment before the sample code indicated that the sample code was asynchronous. But, it wasn't. I've updated it to be asynchronous. When originally writing this post, I intended the example code to show the urllib2 handler approach.
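
For more than a handful of requests, the thread pool mentioned in the note at the top might look something like the sketch below. It assumes Python 3 (concurrent.futures and urllib.request rather than urllib2); fetch, fetch_all and the callback signature (url, result, error) are hypothetical names chosen for illustration, not part of any library.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Blocking fetch; returns (final_url, body_bytes)."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl(), response.read()


def fetch_all(urls, callback, fetcher=fetch, max_workers=4):
    """Run fetcher(url) on a thread pool; call callback(url, result, error) per url."""

    def report(url, future):
        # Exactly one of result/error is meaningful for each finished request.
        error = future.exception()
        callback(url, None if error else future.result(), error)

    # Leaving the with-block waits for every submitted request to finish.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for url in urls:
            future = pool.submit(fetcher, url)
            future.add_done_callback(partial(report, url))


# Example use (performs real network requests, so it is left commented out):
# fetch_all(["http://www.google.com/"], lambda url, result, error: print(url, error))
```

Note that the callback usually runs on a worker thread, so it should avoid touching state that is not thread-safe. The fetcher parameter exists only so the blocking call can be swapped out, e.g. for testing without a network.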

Thursday, December 17, 2009

Reworking the GIL

The title of this post is stolen from an email which describes steps its author has taken to address the GIL issues David Beazley raised in his talk on the Global Interpreter Lock. The proposed changes certainly won't turn Python into a completely thread-friendly language (the GIL is not going away any time soon), but it sounds like these changes will greatly reduce thread overhead and make threaded code behave as if it were running on a single-core machine, which is what one would expect with a global interpreter lock.

Thursday, December 3, 2009

Twisted Annoyances

While Twisted is generally an excellent network library, it certainly has its quirks.
  • Exception Trapping: by default, the reactor will trap all exceptions. See ReactorBase.runUntilCurrent for the code that implements this horrible behavior. This breaks intuitions most developers have for how exceptions are supposed to work. For example, the python tutorial says:
    The last except clause may omit the exception name(s), to serve as a wildcard. Use this with extreme caution, since it is easy to mask a real programming error in this way!
    Twisted does an excellent job of masking real programming errors. I'm surprised that there is no option to turn off this behavior.
  • callFromThread: calling a reactor method (e.g. protocol.writeData) from a non-reactor thread doesn't work without wrapping it with this method.

Tuesday, November 17, 2009

Python Coil

Coil is a nice configuration language for Python created by Itamar Turner-Trauring and currently maintained by Michael Marineau. It is used here at ITA. It is much less verbose than XML but is very readable and minimizes duplication. I really wish there were a Debian package for it!

Update (11/18/09): Michael M. informed me of the correct history :)

Wednesday, October 28, 2009

Block Scoping

Unlike languages like Java, C++ and Perl, Python does not have block scoping. I.e. if you define a variable inside a loop in Python, it will still be in scope after that loop and will override previous bindings to that name. Python instead delimits scope at the levels of module, class and function. I defer to the authoritative source for the gory details. Note that according to the definitions used therein, Python does have block-level scoping, but that is only if you define block delimiters to be modules, classes and functions :-)

Python scoping is a drawback in the sense that if you are used to (Java/C++/Perl) block scoping, you are likely to accidentally introduce bugs as a result of using the same variable at different block levels. I've introduced a few such bugs. On the other hand, Python scoping eliminates the need to pre-define/declare variables that are set inside a loop or if block but needed afterward. So, even though I've been burned by this style of scoping, I find that it helps me write better (i.e. more readable) code.
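
Both sides of that trade-off fit in a few lines. This is just an illustrative sketch; the function names are made up for the example.

```python
def loop_variable_leaks():
    i = "before the loop"
    for i in range(3):
        last_square = i * i
    # Both names are still bound here, and the loop has rebound i.
    return i, last_square


def no_predeclaration_needed(n):
    if n < 0:
        label = "negative"
    else:
        label = "non-negative"
    # label was first assigned inside the if/else, yet it is visible here.
    return label


print(loop_variable_leaks())         # (2, 4)
print(no_predeclaration_needed(-5))  # negative
```

The first function is the bug-prone case: the original value of i is silently lost. The second is the convenient case: no declaration of label is needed before the if block.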

Tuesday, October 27, 2009

The Global Interpreter Lock

Python technically has threading capabilities. And, it can work quite well if the threads are I/O-bound. However, Python threading doesn't work so well when threads are CPU-bound. The following hour-long video explains why. Read the slides if you are impatient.

http://blip.tv/file/2232410
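
The CPU-bound case is easy to try for yourself. Below is a rough sketch in the spirit of the countdown experiment from the talk, written for Python 3; the function names and the value of N are arbitrary choices for illustration. On a standard CPython build with the GIL, the threaded run is typically no faster than the sequential one, but exact timings depend on the machine and interpreter version.

```python
import threading
import time


def countdown(n):
    """Pure-Python busy loop: CPU-bound, so it holds the GIL while running."""
    while n > 0:
        n -= 1


def timed(func):
    """Return the wall-clock seconds taken by func()."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


def run_sequential(n):
    countdown(n)
    countdown(n)


def run_threaded(n):
    # Same total work as run_sequential, split across two threads.
    threads = [threading.Thread(target=countdown, args=(n,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    N = 2000000
    print("sequential: %.2fs" % timed(lambda: run_sequential(N)))
    print("threaded:   %.2fs" % timed(lambda: run_threaded(N)))
```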

One observation that David Beazley makes is that only the "main" thread can deal with signals like Control-C. However, if this thread is blocked via a join(), the signal will not get handled. So, it may be worth creating a thread separate from the "main" thread to spawn and join threads. Haven't yet tested this, though...

Monday, October 19, 2009

Controlling Printing in Numpy

Numpy has numpy.set_printoptions for controlling the printing of arrays. See the doc for full details. Now that I know about it, I'll be using settings that display fewer digits of precision, allow a larger linewidth and don't summarize until the array is substantially larger:
import numpy

numpy.set_printoptions(precision=4,
                       threshold=10000,
                       linewidth=150)