Someone jokingly asked me how I would sort a million 32-bit integers in 2 megabytes of RAM, using Python. Taking up the challenge, I learned something about buffered I/O.
Obviously this is a joke question -- the data alone would take up 4 megabytes, assuming binary encoding! But there's a possible interpretation: given a file containing a million 32-bit integers, how would you sort them with minimal memory usage? This would have to be some kind of merge sort, where small chunks of the data are sorted in memory and written to a temporary file, and then the temporary files are merged into the eventual output area.
Here is my solution. I'll annotate it in a minute.
NOTE: All my examples use Python 3.0. The main difference in this case is the use of file.buffer
to access the binary stream underlying the text stream file.
#!/usr/bin/env python3.0
import sys, array, tempfile, heapq
assert array.array('i').itemsize == 4

def intsfromfile(f):
    while True:
        a = array.array('i')
        a.fromstring(f.read(4000))
        if not a:
            break
        for x in a:
            yield x

iters = []
while True:
    a = array.array('i')
    a.fromstring(sys.stdin.buffer.read(40000))
    if not a:
        break
    f = tempfile.TemporaryFile()
    array.array('i', sorted(a)).tofile(f)
    f.seek(0)
    iters.append(intsfromfile(f))

a = array.array('i')
for x in heapq.merge(*iters):
    a.append(x)
    if len(a) >= 1000:
        a.tofile(sys.stdout.buffer)
        del a[:]
if a:
    a.tofile(sys.stdout.buffer)

On my Google desktop (a 3-year-old PC running a Googlified Linux, rating about 34,000 Python 3.0 pystones) this took about 5.4 seconds to run, with an input file containing exactly 1,000,000 32-bit random integers. That's not so bad, given that a straightforward in-memory sort of the same input took about 2 seconds:

#!/usr/bin/env python3.0
import sys, array
a = array.array('i', sys.stdin.buffer.read())
a = list(a)
a.sort()
a = array.array('i', a)
a.tofile(sys.stdout.buffer)

Back to the merge-sort solution. The first three lines are obvious:

#!/usr/bin/env python3.0
import sys, array, tempfile, heapq
assert array.array('i').itemsize == 4

The first line says we're using Python 3.0. The second line imports the modules we're going to need. The third line makes it break on those 64-bit systems where the 'i' typecode doesn't represent a 32-bit int; I am making no attempt to write this code portably.
Then we define a little helper: a generator that reads binary integers from a file and yields them one at a time:
def intsfromfile(f):
    while True:
        a = array.array('i')
        a.fromstring(f.read(4000))
        if not a:
            break
        for x in a:
            yield x

This is where the performance tuning of the algorithm takes place: it reads up to 1000 integers at a time, and yields them one by one. I had originally written this without buffering -- it would just read 4 bytes from the file, convert them to an integer, and yield the result. But that ran about four times slower! Note that we can't use

a.fromfile(f, 1000)

because the fromfile() method complains bitterly when there aren't enough values in the file, and I want the code to adapt automatically to however many integers are in the file. (It turns out we write about 10,000 integers to a typical temp file.)

Next we have the input loop. This repeatedly reads a chunk of 10,000 integers from the input file, sorts them in memory, and writes them to a temporary file. We then add an iterator over that temporary file, using the above intsfromfile() function, to the list of iterators that we'll use in the subsequent merge phase.
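As an aside, fromfile()'s partial-read behavior is easy to demonstrate in isolation. This is my own minimal sketch (not from the original post), using io.BytesIO as a stand-in for a real file and the modern tobytes() spelling of tostring():

```python
import array, io

# A "file" holding only 5 ints (20 bytes); tobytes() is the modern
# name for what Python 3.0 called tostring().
f = io.BytesIO(array.array('i', range(5)).tobytes())
a = array.array('i')
try:
    a.fromfile(f, 1000)  # ask for far more items than are available
except EOFError:
    pass  # fromfile() raises, but keeps the items it did read
print(list(a))  # → [0, 1, 2, 3, 4]
```

The items that were available are still inserted into the array before the exception is raised, which is why handling it would be possible but clumsier than just looping over read().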
iters = []
while True:
    a = array.array('i')
    a.fromstring(sys.stdin.buffer.read(40000))
    if not a:
        break
    f = tempfile.TemporaryFile()
    array.array('i', sorted(a)).tofile(f)
    f.seek(0)
    iters.append(intsfromfile(f))

Note that for an input containing a million values, this creates 100 temporary files each containing 10,000 values.
Finally we merge all these files (each of which is sorted) together. The heapq module has a really nice function for this purpose: heapq.merge(iter1, iter2, ...) returns an iterator that yields the input values in order, assuming each input itself yields its values in order (as is the case here).
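For instance (a toy illustration, not part of the original post):

```python
import heapq

# Each input iterable is already sorted; merge() yields a single sorted
# stream lazily, without pulling everything into memory at once.
merged = list(heapq.merge([1, 4, 9], [2, 3, 10], [5, 6]))
print(merged)  # → [1, 2, 3, 4, 5, 6, 9, 10]
```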
a = array.array('i')
for x in heapq.merge(*iters):
    a.append(x)
    if len(a) >= 1000:
        a.tofile(sys.stdout.buffer)
        del a[:]
if a:
    a.tofile(sys.stdout.buffer)

This is another place where buffered I/O turned out to be essential: writing each individual value to a file as soon as it is available slows the algorithm down about twofold. Thus, by simply adding input and output buffering, we gained a tenfold speed-up!
And that's the main moral of the story.
Another lesson is praise for the heapq module, which contains the iterator merge functionality needed in the output phase. Also, let's not forget the utility of the array module for managing binary data.

And finally, let this example remind you that Python 3.0 is not so different from Python 2.5!
21 comments:
The formatting for the code examples is really crappy: the indentation is wrong for some lines, and there's not enough of it for most others. Ironically, for Python, it is hard to read.
> let this example remind you that Python 3.0 is not so different from Python 2.5
Except that Python 2.5 doesn't have heapq.merge :-)
Paul, I think that is just the way blogger handled the code.
The only thing I noticed is the imports are all on one line, which is against PEP 8.
I'd never heard of heapq... that's pretty sweet.
I really appreciated your post, especially the use of generators, and heapq.
Indeed, that would be awesome if there was some kind of plugins in blogspot to write formatted code.
Maybe the question was asked jokingly, but this is a very common task in database implementation. RDBMSs have always had to be able to sort tables on disk that are arbitrarily larger than RAM. What you've described is a nice two-phase multiway merge sort (efficiently and in very good Pythonic style, of course :), which I imagine is taught in most database systems classes.
Dave
(P.S. Hi Guido!)
You could use an embeddable pastebin (such as gist) and get syntax highlighting for free: http://gist.github.com/18896 . Beautiful python code deserves to be shown nicely :)
I cleaned up the comments. Embarrassingly, they were a copy-paste error when I moved the text from my Artima plaintext edit buffer to Blogger. Thanks for pointing it out. No thanks for flaming.
PS. Python 2.6 does have heapq.merge()...
Guido, my post wasn't meant as a flame if that is what you are referring to. I was just messing around. Actually, I've been playing with python since 1.5(.4 I think... but I can't remember) and I had never seen PEP 8 until maybe a month ago, so it was fresh in my mind as I read your code.
I was poking fun, when, in reality I have a habit of just putting all my imports on one line also.
@tyler: No, I was thinking of paul smith.
Nice example of making the most out of a limited system.
If this is an "interview" question, then another approach is to clarify how many distinct 32-bit values there are to sort. If sufficiently low, then track their frequency as each is read and discarded, and then output that many of them in sorted order.
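The frequency-tracking idea the commenter describes might look like this (my own hypothetical sketch using collections.Counter; the function name is mine):

```python
from collections import Counter

def sort_few_distinct(values):
    # Count each distinct value as it is read and discarded; memory is
    # proportional to the number of distinct values, not the input length.
    counts = Counter(values)
    out = []
    for v in sorted(counts):
        out.extend([v] * counts[v])
    return out

print(sort_few_distinct([3, 1, 3, 2, 1, 3]))  # → [1, 1, 2, 3, 3, 3]
```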
Christophe Meessen suggests an algorithm described here:
http://news.ycombinator.com/item?id=341037
It requires only two passes over the million integers and uses nothing more than the 2MB buffer. No temporary files or anything. The complexity is O(n log n).
Now, what if you have to sort a million unique integers between 0 and 10,000,000? :-)
@Paolo Bonzini: Unique? Makes me think of a bitmap...
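A bitmap over the range 0..10,000,000 needs only 1.25 MB, comfortably inside the 2 MB budget. A sketch of that idea (my own illustration, not from the thread):

```python
def bitmap_sort(values, limit=10_000_000):
    # One bit per possible value: 10,000,000 bits is only 1.25 MB.
    # Correct only because every value is unique and in range(limit).
    bits = bytearray(limit // 8 + 1)
    for v in values:
        bits[v >> 3] |= 1 << (v & 7)
    # Scanning the bitmap in index order emits the values sorted.
    return [v for v in range(limit) if bits[v >> 3] & (1 << (v & 7))]

print(bitmap_sort([9, 5, 1_000_000, 3]))  # → [3, 5, 9, 1000000]
```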
@olivier: that's a cool way of doing it, but not for Python, where an int takes 12 bytes and the list takes another 4 bytes per value. Given some other overhead you'd end up having to make a lot of passes over the full input (the same number as would be feasible with my version -- I used 100 but it could likely be less), which would probably be slower. It's interesting that a heap features in both algorithms. I'm not convinced that it's the fastest way to sort though, despite the O(N log N) complexity. Python's built-in sort is wicked.
For nice code listings you can use google's own client side code highlighting lib:
google-code-prettify
My post on how to set it up:
Code highlighting on blogger
@union: I followed your recipe, and I verified that all the elements are indeed in the page, but it doesn't work. What am I missing?