# algorithm - How to sort millions of rows of data in a file with less/meagre memory

(From here)

I attended an interview last week and this question was asked:

How do you sort a billion rows of data in a file with only 640KB of memory in a 8080 processor based machine? No virtual memory, no external disk.

I explicitly asked the interviewer if I could use a hard drive, so I can serialize trees as I sort them and then combine at the end. He said no. I tried many ways, different algorithms. Nothing he agreed.

I gave up and asked him politely, "how would you do that?" He bluntly said, "I would not tell you." (The interview ended right after that. I didn't mean to offend him, as a developer, I got curious. Moreover, it was an instinctive question, just as I would ask anyone at my workplace.)

This interview was for a really big bank.

So, how would anyone approach this problem?

Heapsort would be my reccomendation. It's relatively quick when n is large, and you only have to look at three elements with definite indecies at once.

That being said, my intuition tells me that sorting a billion rows on an 8080 even in C would be unfeasibly slow.

I would not do it in C#, for starters. Are you sure you have this tagged right? This is a C problem, if it can be solved.

640K only gives you 640 * 1024 * 8 bits so there's no way to solve this as framed. Perhaps that's the answer he/she was looking for. These Investment Bank interviews are something of a mindgame sometimes.

If speed is not a requirement, you could bubble sort rows in place in the file. This only requires looking at two rows of data at a time, with no external information or storage required.

Another question to have asked, is "What is the nature of the rows?" If the number of distinct values is low enough, then the answer might be a pigeon hole sort.

For example, say the file to be sorted only contained rows that held a number between 0 and 100 inclusive. Create an array of 101 unsigned 32 bit or 64 bit integers with a value of 0. As you read a row, use it to index the array and increament the count of that element. Once the file is read, start at 0, read the the number of zeros read and spit out that many, go to 1, repeat. Expand the array size as needed to handle the set of numbers coming through. Of course there are limits, say the values that can be seen span from -2e9 to +2e9. That's going to require 4e9 bins, which is not going to fit in 640K of RAM.

If instead the rows are strings, but you are still looking at a small enough set of distinct value, then use an associative array or hash table to hold the counts.

Knuth has a whole section on external sorting; this was commonplace back when there were no hard drives & not much memory, and tape drives were the norm. Look at the wikipedia page, and/or vol. 3 of Knuth's Art of Computer Programming.

I agree with Robusto's comment:

Where do you get the file from if you can't use the drive? It's certainly not going to be held in memory.

Not enough problem definition.

The more I think about this, the more I think merge sort would work very well within the memory window we're given.

Let's say you have x memory available. Divide the billion entries into billion/x + 1 sections and heapsort them (heapsort because no extra memory is required and it's O(2n(log n)) time). When all sections are heapsorted, do a merge sort starting across the first elements of all sections. This will work so long as you have more than sqrt(billion) memory to work with given basic 8080 OS memory usage.

Doing the math, this assumes that each data row is less than 165 bits.

Obviously you have to be able to read and write to the billion row file. The constraint of no external disk means you must restrict yourself to in-place algorithms or make some assumptions about the starting conditions and distribution of data so that you can keep the data sorted as it is added to the file (e.g. use the key as the index and create a large enough file to hold the expected number of keys).

If you must start with an unsorted file and sort it, you can use merge an in-place merge sort operating on very small chunks of the file. Since no constraints are made on the access times of the storage media, it may be very fast.

I'd use the GPU! Even on a fast computer, the GPU is often faster at sorting. And I don't know how big the "rows" are, but it's not hard to find 1GB video cards, so that answers the storage question, too.

Besides, if I had to work on an 8080, I'd definitely want to put the sweetest graphics card I could find on there.

You just have to be ready for the follow-up question: "How do you get an 8080 to talk to a modern PCI Express 2.0 x16 card?". I have discovered a truly marvelous method, but this textarea is too narrow to contain it.

You can find the discussion on a similar problem in Jon Bentley Programming Pearls Column. 1. Here Bentley deals with a problem of sorting millions of area codes which are guaranteed to be unique by using a bitset data-structure.