Sorting a million 32-bit integers in 2MB of RAM using Python

Someone jokingly asked me how I would sort a million 32-bit integers in 2 megabytes of RAM, using Python. Taking up the challenge, I learned something about buffered I/O.

Obviously this is a joke question -- the data alone would take up 4 megabytes, assuming binary encoding! But there's a possible interpretation: given a file containing a million 32-bit integers, how would you sort them with minimal memory usage? This would have to be some kind of merge sort, where small chunks of the data are sorted in memory and written to a temporary file, and then the temporary files are merged into the eventual output area.

Here is my solution. I'll annotate it in a minute.

NOTE: All my examples use Python 3.0. The main difference in this case is the use of file.buffer to access the binary stream underlying the text stream file.
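
To illustrate the difference in a nutshell (a minimal sketch: in Python 3.0, sys.stdout is a text stream that accepts str, while sys.stdout.buffer is the underlying binary stream that accepts bytes):

import sys
sys.stdout.write('a line of text\n')       # text stream: accepts str
sys.stdout.buffer.write(b'\x01\x02\x03')   # binary stream: accepts bytes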

#!/usr/bin/env python3.0
import sys, array, tempfile, heapq
assert array.array('i').itemsize == 4

def intsfromfile(f):
    while True:
        a = array.array('i')
        a.fromstring(f.read(4000))
        if not a:
            break
        for x in a:
            yield x

iters = []
while True:
    a = array.array('i')
    a.fromstring(sys.stdin.buffer.read(40000))
    if not a:
        break
    f = tempfile.TemporaryFile()
    array.array('i', sorted(a)).tofile(f)
    f.seek(0)
    iters.append(intsfromfile(f))

a = array.array('i')
for x in heapq.merge(*iters):
    a.append(x)
    if len(a) >= 1000:
        a.tofile(sys.stdout.buffer)
        del a[:]
if a:
    a.tofile(sys.stdout.buffer)

On my Google desktop (a 3-year-old PC running a Googlified Linux, rating about 34,000 Python 3.0 pystones) this took about 5.4 seconds to run, with an input file containing exactly 1,000,000 32-bit random integers. That's not so bad, given that a straightforward in-memory sort of the same input took about 2 seconds:
#!/usr/bin/env python3.0
import sys, array
a = array.array('i', sys.stdin.buffer.read())
a = list(a)
a.sort()
a = array.array('i', a)
a.tofile(sys.stdout.buffer)

Back to the merge-sort solution. The first three lines are obvious:
#!/usr/bin/env python3.0
import sys, array, tempfile, heapq
assert array.array('i').itemsize == 4
The first line says we're using Python 3.0. The second line imports the modules we're going to need. The third line makes the program break on those 64-bit systems where the 'i' typecode doesn't represent a 32-bit int; I am making no attempt to write this code portably.
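
For instance, the sizes depend on the platform's C compiler; on a typical 64-bit Linux build you might see something like this:

>>> import array
>>> array.array('i').itemsize    # C int: 4 bytes on most platforms
4
>>> array.array('l').itemsize    # C long: often 8 bytes on 64-bit Linux
8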

Then we define a little helper that is a generator which reads integers from a file and yields them one at a time:
def intsfromfile(f):
    while True:
        a = array.array('i')
        a.fromstring(f.read(4000))
        if not a:
            break
        for x in a:
            yield x

This is where the performance tuning of the algorithm takes place: it reads up to 1,000 integers at a time, and yields them one by one. I had originally written this without buffering -- it would just read 4 bytes from the file, convert them to an integer, and yield the result -- but that ran about four times slower! Note that we can't use a.fromfile(f, 1000), because the fromfile() method complains bitterly when there aren't enough values in the file, and I want the code to adapt automatically to however many integers are in the file. (It turns out we write about 10,000 integers to a typical temp file.)
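
For completeness, fromfile() could be tamed by catching its complaint: per the array documentation, a short read raises EOFError, but the items that were available are still appended to the array. A hypothetical variant (a sketch, not the code benchmarked above):

def intsfromfile_eager(f):
    while True:
        a = array.array('i')
        try:
            a.fromfile(f, 1000)    # raises EOFError on a short read...
        except EOFError:
            pass                   # ...but keeps the items it did read
        if not a:
            break
        for x in a:
            yield x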

Next we have the input loop. This repeatedly reads a chunk of 10,000 integers from the input file, sorts them in memory, and writes them to a temporary file. We then add an iterator over that temporary file, using the above intsfromfile() function, to a list of iterators that we'll use in the subsequent merge phase.

iters = []
while True:
    a = array.array('i')
    a.fromstring(sys.stdin.buffer.read(40000))
    if not a:
        break
    f = tempfile.TemporaryFile()
    array.array('i', sorted(a)).tofile(f)
    f.seek(0)
    iters.append(intsfromfile(f))

Note that for an input containing a million values, this creates 100 temporary files each containing 10,000 values.
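
To spell out the arithmetic: each read(40000) returns 40,000 bytes, i.e. 10,000 four-byte integers, and 1,000,000 / 10,000 = 100 chunks. Even the temporary list that sorted() builds -- roughly 16 bytes per value (see my note on int and list overhead in the comments below), so about 160KB per chunk -- stays comfortably inside the 2MB budget.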

Finally we merge all these files (each of which is sorted) together. The heapq module has a really nice function for this purpose: heapq.merge(iter1, iter2, ...) returns an iterator that yields the input values in order, assuming each input itself yields its values in order (as is the case here).
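
A tiny interactive illustration of what merge() does (a toy example, not part of the program):

>>> import heapq
>>> list(heapq.merge([1, 3, 5], [2, 4, 6], [0, 7]))
[0, 1, 2, 3, 4, 5, 6, 7]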

a = array.array('i')
for x in heapq.merge(*iters):
    a.append(x)
    if len(a) >= 1000:
        a.tofile(sys.stdout.buffer)
        del a[:]
if a:
    a.tofile(sys.stdout.buffer)

This is another place where buffered I/O turned out to be essential: Writing each individual value to a file as soon as it is available slows the algorithm down about twofold. Thus, by simply adding input and output buffering, we gained a tenfold speed-up!

And that's the main moral of the story.

Another lesson is praise for the heapq module, which contains the iterator merge functionality needed in the output phase. Also, let's not forget the utility of the array module for managing binary data.

And finally, let this example remind you that Python 3.0 is not so different from Python 2.5!

21 comments:

Paul Smith said...

The formatting for the code examples is really crappy: the indentation is wrong for some lines, and there's not enough of it for most others. Ironically, for Python, it is hard to read.

isagalaev said...

> let this example remind you that Python 3.0 is not so different from Python 2.5

Except that Python 2.5 doesn't have heapq.merge :-)

Tyler said...

Paul, I think that is just the way blogger handled the code.

The only thing I noticed is the imports are all on one line, which is against PEP 8.

I'd never heard of heapq... that's pretty sweet.

PouleJapon said...

I really appreciated your post, especially the use of generators, and heapq.

Indeed, it would be awesome if Blogspot had some kind of plugin for writing formatted code.

David said...

Maybe the question was asked jokingly, but this is a very common task in database implementation. RDBMSs have always had to be able to sort tables on disk that are arbitrarily larger than RAM. What you've described is a nice two-phase multiway merge sort (efficiently and in very good Pythonic style, of course :), which I imagine is taught in most database systems classes.

Dave
(P.S. Hi Guido!)

rbonvall said...

You could use an embeddable pastebin (such as gist) and get syntax highlighting for free: http://gist.github.com/18896 . Beautiful python code deserves to be shown nicely :)

Guido van Rossum said...

I cleaned up the comments. Embarrassingly, they were a copy-paste error when I moved the text from my Artima plaintext edit buffer to Blogger. Thanks for pointing it out. No thanks for flaming.

Guido van Rossum said...

PS. Python 2.6 does have heapq.merge()...

Tyler said...

Guido, my post wasn't meant as a flame if that is what you are referring to. I was just messing around. Actually, I've been playing with python since 1.5(.4 I think... but I can't remember) and I had never seen PEP 8 until maybe a month ago, so it was fresh in my mind as I read your code.

I was poking fun when, in reality, I have a habit of putting all my imports on one line too.

Guido van Rossum said...

@tyler: No, I was thinking of paul smith.

Christian Witts said...

Nice example of making the most out of a limited system.

Ralph said...

If this is an "interview" question, then another approach is to clarify how many distinct 32-bit values there are to sort. If sufficiently low, then track their frequency as each is read and discarded, and then output that many of them in sorted order.
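
A minimal sketch of the frequency-counting idea Ralph describes (hypothetical code, assuming the number of distinct values is small enough for the counts dict to fit in 2MB):

import sys, array
counts = {}
while True:
    a = array.array('i')
    a.fromstring(sys.stdin.buffer.read(40000))  # read and discard in chunks
    if not a:
        break
    for x in a:
        counts[x] = counts.get(x, 0) + 1
out = array.array('i')
for value in sorted(counts):                    # distinct values in order
    out.extend([value] * counts[value])         # emit each value count times
    if len(out) >= 1000:
        out.tofile(sys.stdout.buffer)
        del out[:]
if out:
    out.tofile(sys.stdout.buffer)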

Olivier said...

Christophe Meessen suggests an algorithm described here

http://news.ycombinator.com/item?id=341037

It requires only two passes over the million integers and uses nothing more than the 2MB buffer. No temporary files at all.

The complexity is O(n log n).

Paolo Bonzini said...

Now, what if you have to sort a million unique integers between 0 and 10,000,000? :-)

ptn said...

@Paolo Bonzini: Unique? Makes me think of a bitmap...
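
A minimal sketch of the bitmap ptn is alluding to (hypothetical code, assuming the values really are unique and lie in range(10000001)): one bit per possible value is about 1.2MB, well inside the 2MB budget.

import sys, array
bits = bytearray(1250001)                  # 10,000,008 bits, one per value
while True:
    a = array.array('i')
    a.fromstring(sys.stdin.buffer.read(40000))
    if not a:
        break
    for x in a:
        bits[x >> 3] |= 1 << (x & 7)       # set bit number x
out = array.array('i')
for x in range(10000001):
    if bits[x >> 3] & (1 << (x & 7)):      # bit x set: x was in the input
        out.append(x)
        if len(out) >= 1000:
            out.tofile(sys.stdout.buffer)
            del out[:]
if out:
    out.tofile(sys.stdout.buffer)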

Guido van Rossum said...

@olivier: that's a cool way of doing it, but not for Python, where an int takes 12 bytes and the list takes another 4 bytes per value. Given some other overhead you'd end up having to make a lot of passes over the full input (the same number as would be feasible with my version -- I used 100 but it could likely be less), which would probably be slower. It's interesting that a heap features in both algorithms. I'm not convinced that it's the fastest way to sort though, despite the O(N log N) complexity. Python's built-in sort is wicked.

union said...

For nice code listings you can use google's own client side code highlighting lib:

google-code-prettify

My post on how to set it up:

Code highlighting on blogger

Guido van Rossum said...

@union: I followed your recipe, and I verified that all the elements are indeed in the page, but it doesn't work. What am I missing?
