Wednesday, December 1, 2010

Half-closing a TCP connection in Twisted

loseWriteConnection is the function I had been looking for all day. In retrospect, it was obvious---just look at the ITCPTransport manual page. But, at first I didn't know what I was looking for---I was just confused as to why netcat wasn't working as expected.

I was trying to get server status information, which required sending a simple command to the server. When I used a custom netcat-like utility, it worked, but when I used netcat or python/twisted, it didn't. At first, I thought the special utility might have been sending an extra EOF-like character, but some testing eliminated that possibility. Then, I thought it might be a line-feed issue. Nope. Finally, I realized the problem---netcat and python/twisted weren't half-closing the write connection after sending the command. How did I come to this conclusion? I tried the netcat -q option and immediately got back the server status information (before the specified timeout).

Earlier, I had tried to (half-)close the connection with python/twisted using ITransport.loseConnection. But, after fully realizing the half-close issue and making additional loseConnection attempts, I concluded that loseConnection fully closes the connection, losing the response. Next, I found _closeWriteConnection, which sounded like it would do exactly what I wanted. The source even looked like it would work, but for whatever reason it didn't. Finally, I was pointed to loseWriteConnection, which closes the write side of the connection while still allowing the server response to be read.
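
The same half-close idea can be sketched without Twisted, using only the standard socket module. This is an illustration and not the code from the project above: the toy server, the helper names serve_once and send_command, and the "status" command are all made up for the example. The server only replies once it sees end-of-input, which mimics the behavior that caused the confusion.

```python
import socket
import threading

def serve_once(listener):
    """Accept one client, read until EOF, then reply with the byte count."""
    conn, _ = listener.accept()
    with conn:
        received = b""
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # empty read: the client shut down its write side
                break
            received += chunk
        conn.sendall(b"got %d bytes" % len(received))

def send_command(address, command):
    """Send a command, half-close, and return everything the server sends back."""
    with socket.create_connection(address) as sock:
        sock.sendall(command)
        sock.shutdown(socket.SHUT_WR)  # half-close: no more writes, reads still work
        reply = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection
                break
            reply += chunk
        return reply

# Toy server on an ephemeral loopback port (hypothetical setup for the demo).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
server = threading.Thread(target=serve_once, args=(listener,))
server.start()

reply = send_command(listener.getsockname(), b"status\n")
server.join()
listener.close()
print(reply)
```

Without the shutdown call, the server in this sketch would wait forever for more input and the client would wait forever for a reply, which is roughly what plain netcat was doing.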

Friday, September 24, 2010

Running Tests

For a python project I worked on, we used the standard python unittest module and placed test classes within an if __name__=='__main__': block at the bottom of each module. This makes tests easy to run and has the advantage of keeping the testing code close to the source code. But, as I've learned, there's a better way to do it.

The major drawback of the above framework is a lack of control over tests. One cannot selectively run tests from within a module, nor can test results be compiled in a nice way (since screen-scraping is the only option). I've since learned about nose, which is a "test runner." Instead of wrapping unit test classes in an if __name__=='__main__': block, you simply place test classes somewhere in your source code hierarchy. Options include alongside the module code, or in "test" files within a "test" directory. To run tests, you simply run nosetests with arguments specifying which tests you want to run. This could be the root directory of your source code tree, or a list of python module names. The selection can be refined further by using nose attributes.

Tuesday, July 13, 2010

An absolutely relative import

Part of the "What's New" documentation for python 2.5 describes how to make use of absolute imports. After reading this, you might find the following example confusing. I sure was confused after trying it.

Create string.py:

    import string
    a = 1

Create main.py:

    from __future__ import absolute_import
    import string
    print string.a

Both scripts should be placed in the same directory. Run main.py:

    $ python main.py

You'll see main.py print "1", the value set by string.py. A reading of the python documentation might lead you to believe that this behavior is incorrect---it should instead import the standard library string module and raise an AttributeError. This interpretation is correct except that, by default, python includes the script directory in the list of "absolute" import paths. So, the easy fix is to delete this entry, which conveniently is always found at the beginning of sys.path. The revised main.py is:

    from __future__ import absolute_import
    import sys
    sys.path = sys.path[1:]
    import string
    print string.a

I appreciate that python has moved to a cleaner import system. But, leaving the script/current directory in the list of "absolute" import paths seems like a huge oversight.

What's especially ridiculous about the default behavior is that if you have a module with the same name as a standard library module, import the standard library module, and include unittests at the bottom, the unittests won't work because the import will behave differently depending on whether the module is imported or run as a script. This is the problem that initially brought me down this path...

Update 9/23: After talking with different people about this issue, I've learned that it's easy to think that sys.path.remove('.') is the right thing to do here. It's not. The default local path inserted by python may be a full path or an empty string, in which case sys.path.remove('.') won't fix the problem. Trying to remove all local directory entries is also incorrect, since the user may genuinely want to include the local directory in the search path.
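
A small sketch of the difference, working on plain lists rather than the real sys.path so it is safe to run anywhere. The helper names and the example paths are hypothetical; the point is only that the entry python inserts is identified by its position, not by its spelling.

```python
def remove_dot(path):
    """The tempting fix: drop a literal '.' entry if one exists."""
    cleaned = list(path)
    if "." in cleaned:
        cleaned.remove(".")
    return cleaned

def drop_first(path):
    """The fix from the post: drop whatever entry sits at the front."""
    return list(path[1:])

# Hypothetical search paths: the first entry is the one python inserted.
as_full_path = ["/home/me/project", "/usr/lib/python"]
as_empty_string = ["", "/usr/lib/python"]

print(remove_dot(as_full_path))     # unchanged: there is no '.' to remove
print(remove_dot(as_empty_string))  # unchanged as well
print(drop_first(as_full_path))     # only the library path remains
print(drop_first(as_empty_string))  # only the library path remains
```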

Wednesday, July 7, 2010

jsonlib

For a project I worked on at ITA, we decided to use pickle for internal object serialization/communication. Pickle certainly makes coding simple, but I've occasionally wondered whether we made the best choice. I found this article comparing deserialization libraries to be interesting. The two main competing camps appear to be json and Google's protocol buffers. Protocol buffers is reportedly slow in python because the implementation is pure python and not optimized for speed. One json library, jsonlib, sounds like the right way to go, as it reportedly provides faster speeds and more compact storage than pickle.

Tuesday, June 8, 2010

Returning an exit status with Twisted

When I had a need for returning an exit status from a Twisted process, my first instinct was to look for a reactor.stop argument. In fact, there have been multiple requests for such, e.g. tickets #718 and #2182. But, then, I realized that reactor.stop doesn't stop the reactor, it merely initiates the shutdown process. The reactor is not shut down until reactor.run exits. This realization made it clear what I should do to return a specific exit code---simply add

    sys.exit(code)
immediately after reactor.run.

Monday, March 22, 2010

More ElementTree Annoyances

  • Cannot serialize int. I can see the value in not automatically serializing every possible object with a __str__ method. But, not converting an int? C'mon!
  • Cannot serialize None. Wouldn't None be the perfect value to indicate "don't serialize this attribute"?
I'm generally a fail-fast-and-loudly kind of guy, but I also don't like having to write more code when it's obvious what I mean. These seem like two cases where I think the tradeoff is in favor of writing less code...

Examples:

>>> import xml.etree.ElementTree as et
>>> et.tostring(et.Element('Foo', attrib={ 'a': 1}))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 1009, in tostring
    ElementTree(element).write(file, encoding)
  File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 663, in write
    self._write(file, self._root, encoding, {})
  File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 698, in _write
    _escape_attrib(v, encoding)))
  File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 830, in _escape_attrib
    _raise_serialization_error(text)
  File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 777, in _raise_serialization_error
    "cannot serialize %r (type %s)" % (text, type(text).__name__)
TypeError: cannot serialize 1 (type int)
>>> et.tostring(et.Element('Foo', attrib={ 'a': None}))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 1009, in tostring
    ElementTree(element).write(file, encoding)
  File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 663, in write
    self._write(file, self._root, encoding, {})
  File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 698, in _write
    _escape_attrib(v, encoding)))
  File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 830, in _escape_attrib
    _raise_serialization_error(text)
  File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 777, in _raise_serialization_error
    "cannot serialize %r (type %s)" % (text, type(text).__name__)
TypeError: cannot serialize None (type NoneType)
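
One way to get the behavior wished for above is a thin wrapper that converts values to strings and skips None before the element is built. This is only a sketch: element_with_attrs is a hypothetical helper, not part of ElementTree, and the encoding="unicode" argument assumes a Python 3 interpreter.

```python
import xml.etree.ElementTree as et

def element_with_attrs(tag, attrib):
    """Build an Element, stringifying values and dropping None attributes."""
    clean = {}
    for name, value in attrib.items():
        if value is None:
            continue  # treat None as "don't serialize this attribute"
        clean[name] = str(value)
    return et.Element(tag, attrib=clean)

# 'a' is converted from int to str, 'b' is skipped entirely.
xml = et.tostring(element_with_attrs('Foo', {'a': 1, 'b': None}),
                  encoding="unicode")
print(xml)
```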

Monday, February 8, 2010

__missing__

According to the python documentation:

If a subclass of dict defines a method __missing__(), if the key key is not present, the d[key] operation calls that method with the key key as argument. The d[key] operation then returns or raises whatever is returned or raised by the __missing__(key) call if the key is not present. No other operations or methods invoke __missing__(). If __missing__() is not defined, KeyError is raised. __missing__() must be a method; it cannot be an instance variable. For an example, see collections.defaultdict.
This is, at least, incomplete: the value returned by __missing__ is not stored in the dictionary, so a __missing__ that is meant to supply defaults must assign the value itself as well as return it. This is made clear in the documentation for collections.defaultdict:
If default_factory is not None, it is called without arguments to provide a default value for the given key, this value is inserted in the dictionary for the key, and returned.
Surprisingly, the __missing__ method is not mentioned in the special method names section of the python documentation.
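
A short sketch of the difference, using two hypothetical dict subclasses (the class names are made up for illustration). The first only returns a default; the second also stores it.

```python
class ReturnOnly(dict):
    def __missing__(self, key):
        return []  # handed back to the caller, but never stored

class StoreAndReturn(dict):
    def __missing__(self, key):
        value = self[key] = []  # stored, so the next lookup finds it
        return value

a = ReturnOnly()
a["x"].append(1)  # appends to a throwaway list
b = StoreAndReturn()
b["x"].append(1)  # appends to the list kept in the dict

print(a)
print(b)
```

With ReturnOnly the appended item is lost, because every lookup of a missing key produces a fresh list that the dict never sees.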

Thursday, February 4, 2010

collections.defaultdict

collections.defaultdict is nice, especially when counting things. But, defaultdict only lets you use zero-argument constructors. Pffft! Fortunately, it's easy to write a defaultdict which passes arguments to the constructor:

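class defaultdict2(dict):
    """A dict that fills in missing keys by calling factory(*factArgs)."""
    def __init__(self, factory, factArgs=(), dictArgs=()):
        dict.__init__(self, *dictArgs)
        self.factory = factory
        self.factArgs = factArgs
    def __missing__(self, key):
        # Store the new value so later lookups return the same object.
        value = self[key] = self.factory(*self.factArgs)
        return value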

Update 2/8/10: added "return" line to __missing__ per discussion in this post on __missing__.
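
An alternative that avoids the subclass, offered as a sketch rather than a replacement: functools.partial can bind the arguments up front, which leaves the standard defaultdict with the zero-argument callable it expects. The make_row factory here is hypothetical.

```python
from collections import defaultdict
from functools import partial

def make_row(width, fill):
    """Hypothetical factory that needs arguments."""
    return [fill] * width

# partial(make_row, 3, 0) is callable with no arguments.
table = defaultdict(partial(make_row, 3, 0))
table["a"][1] = 5  # the row for "a" is created on first access

print(dict(table))
```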

Wednesday, February 3, 2010

Kid Template Recompilation

I'm involved in a project which uses the TurboGears framework for serving web pages. The templating language we use is Kid. Recently, we ran into a problem where web pages did not correspond to the installed templates. After a bit of detective work, we suspected that TurboGears/Kid was not using the templates, but rather stale, compiled versions of old templates (.pyc files). Some Kid mailing list discussion confirmed our suspicions. The problem is that Kid only recompiles if the mtime of the source (.kid) file is after the mtime of the corresponding compiled (.pyc) file. In contrast, Python recompiles unless the mtime stored in the .pyc file exactly matches the mtime of the source (.py) file.

My understanding is that, ideally, Python would use a one-way hash of the source and only use the compiled file if there is an exact match. The exact mtime comparison is nearly as good in practice and much, much faster. But, the mtime inequality comparison is a poor approximation of the ideal and only works when you can guarantee that (1) the system clock is perfect and never changes timezone (e.g. no switch between EDT and EST), and (2) mtimes are always updated to "now" whenever contents or locations are changed (i.e. even "mv" must update the mtime, and rsync -a is right out). I don't know of any OS which provides these guarantees. The good news is that there is no disagreement on the existence of the problem; so, this is likely to be fixed in a future version of Kid.
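
The two checks can be compared in a few lines. This is a simplified model of the rules described above, not the actual Kid or Python code; the function names and the timestamps are made up for the example.

```python
def stale_by_inequality(source_mtime, compiled_mtime):
    """Kid-style rule: recompile only if the source is newer than the compiled file."""
    return source_mtime > compiled_mtime

def stale_by_exact_match(source_mtime, recorded_mtime):
    """Python-style rule: recompile unless the source mtime matches the one recorded."""
    return source_mtime != recorded_mtime

# Hypothetical timeline, in arbitrary time units.
old_source_mtime = 200  # mtime of the template the compiled file was built from
compiled_mtime = 250    # when the compiled file was written
new_source_mtime = 100  # replacement template copied in with its older mtime preserved

# The template has changed, so both rules should report "stale".
print(stale_by_inequality(new_source_mtime, compiled_mtime))     # False
print(stale_by_exact_match(new_source_mtime, old_source_mtime))  # True
```

The inequality rule misses the change because the replacement file looks older than the compiled one, which is the situation a copy that preserves mtimes produces.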

Tuesday, January 19, 2010

numpy.dot

I should have known. numpy.dot doesn't work with sparse matrices. What's worse is that it happily accepts a sparse matrix as an argument and yields some convoluted array of sparse matrices. What I should be doing is x.dot(y) where x is a scipy.sparse.sparse.spmatrix and y is a numpy.ndarray.

Note that I'm using the Debian stable versions of these packages: numpy 1.1.0 and scipy 0.6.0.

Friday, January 8, 2010

urllib2.HTTPErrorProcessor

With code similar to what I posted in Asynchronous HTTP Request, I was occasionally getting empty responses to my requests. When I added urllib2.HTTPErrorProcessor to the inheritance list for MyHandler, the problem went away. My guess is that the server was generating 503 Service Unavailable responses and my client code wasn't handling them. How one was supposed to know to do this from the documentation, I am unsure. I'm guessing that if the server might provide a redirect for your url, you would also want to inherit from urllib2.HTTPRedirectHandler.