How to make a program run 80 times faster

In the book Exercises in Programming Style, the author, Cristina Videira Lopes, solves a programming problem in 33 different ways, each an example of a particular style of programming: monolithic, object-oriented, etc.

Being an old geezer of a programmer, I was immediately drawn to style #1, Good Old Times. The constraints that must be dealt with using this style are two:

    "Very small amounts of primary memory, typically orders of magnitude
    smaller than the data that needs to be processed/generated."

    "No identifiers - i.e., no variable names or tagged memory addresses.
    All we have is memory that is addressable with numbers."

The second constraint is really archaic, and even a programmer as old as I am never had to deal with anything so primitive. (Although I did have to use a Basic compiler that only allowed identifiers of the form A, B, ..., Z, A1, A2, ...)

However, the first constraint was very real when I began learning to program. My first programming job introduced me to a Danish computer, the GIER, which had 1K of 40-bit words, i.e., about 5K bytes of main memory. The modern programmer wouldn't know how to begin doing anything in this minuscule amount of memory, but the GIER boasted a full-blown Algol 60 compiler that was extremely fast and which produced object programs with automatic overlays so that a program much larger than 5K could run with ease. See The Design of the GIER Algol Compiler - Part I and Part II for some insight into the creativity and ingenuity that were applied to the problem of not enough main memory, way back over 55 years ago.

Digging into the Python code for program version 1, it quickly became apparent to me that the Good Old Times style presented in the book is just not a viable solution, even in the Good Old Times. Limited main memory forced old-time programmers to develop algorithms which performed decently even in the face of almost no memory. So, I modified the program, keeping in mind the major constraint of not enough memory, and was gratified to see that my solution ran 80 times faster than the original program. Generally, I am not able to increase the performance of a program by quite such a factor (🙂), so it might be useful to see what made it possible in this case.

What is the problem to be solved? To quote the author: "given a text file, we want to produce the list of words in the file and their frequencies, and print them out in decreasing order of frequency."

The text file used is pride-and-prejudice.txt, containing Jane Austen's Pride and Prejudice, available from the Gutenberg Collection. The Linux utility wc reports this about the file:

     13,426 lines of text
    124,588 words
    704,145 bytes

There is one other input file, stop_words.txt, 554 bytes in size, containing 121 common words like "a", "the", etc. Any occurrences of these words in the main input file are to be ignored.

The desired output from the program is a text file containing the 25 most frequently occurring words in the input file, listed in descending order by frequency:

    mr  -  786
    elizabeth  -  635
    very  -  488
    darcy  -  418
    such  -  395
    mrs  -  343
    much  -  329
    more  -  327
    bennet  -  323
    bingley  -  306
    jane  -  295
    miss  -  283
    one  -  275
    know  -  239
    before  -  229
    herself  -  227
    though  -  226
    well  -  224
    never  -  220
    sister  -  218
    soon  -  216
    think  -  211
    now  -  209
    time  -  203
    good  -  201

The original program is organized like this:

    Load the entire stop_words.txt file into primary memory
    Read the main input file one line at a time, and for each line:
        Delimit words in the line, and for each word:
            Look the word up in the stop words structure
                If found, ignore the word
            Otherwise, look the word up in the word_freqs structure
                If not found, add the word to the structure with frequency 1
                If found, increment the frequency by 1
    Read the word_freqs structure one line at a time and build an output 
        structure containing the 25 most frequent words in descending order
        by frequency

Note that once the main input file has been processed, the word_freqs structure contains all the data we need, namely, for each unique word appearing in the file, the word itself plus a count of how many times the word appears.

The big question is, where is this word_freqs structure stored? Clearly it can't be in main memory because main memory is what we are acutely short of. Thus, it must be in secondary memory, e.g., a disk file.

But, now consider what happens when we have delimited a word in the main input file and we want to look it up in the word_freqs structure:

    Seek to the beginning of the word_freqs structure on disk
    Read each entry in word_freqs one at a time and for each:
        Compare the entry to the delimited word
        If no match, keep reading from the word_freqs structure
        If there is a match, increment the frequency and update 
            the entry in the word_freqs structure
    After reading every entry in word_freqs, if a match is not found, 
        then add a new entry to the word_freqs structure with the 
        delimited word and a frequency of 1

Now, pay attention: This process will be done for EVERY non-stop word in the main input file. There are 69,373 stop words in the file, so the number of non-stop words is 124,588 - 69,373 = 55,215. Thus the word_freqs structure on disk is searched 55,215 times, and the searches get longer as more words are added to the structure. And this is all being done on disk, not in main memory. So it is probably going to be slow. And, indeed, it is slow.

What to do about it? We need a solution that does not do so much disk access. Here is an approach:

    When a word in the main input file is delimited, if it is a non-stop
        word, simply write it to a words_file

    So when the main input file has been processed, every non-stop word 
        in it appears in the words_file

    Now we want to sort the words_file
    To achieve this we do a classic external merge sort:  
        Read a chunk from words_file, as much as will fit in main memory
        Sort the chunk in main memory
        Write the chunk to a temp file Ti on disk
    We end up with n files on disk, each sorted;  now merge:
    Read the first record from each of the n files into an array
    Repeat until the array is empty:
        Find the smallest record in the array, say in slot i, and 
            write it to the output file
        Read another record from Ti into slot i of the array
    Now we have words_file.srt, a sorted version of words_file

    Read words-file.srt, counting adjacent equal records to come up 
        with a frequency for each word; write word_freqs, a file with
        distinct words and a count for each word

Processing word_freqs to come up with the 25 most frequent words in descending order is done just as in the original program. The result: A program that does exactly the same job as the original program, but runs at least 80 times faster.

Of course, the classic time-space trade-off is in play here. The original solution uses less memory, but pays for it by running slowly. The new solution pays more in terms of memory, but gains dramatically in speed. The new solution could be tweaked to use less memory, but that is not the point of this exercise. Keep in mind the memory we are talking about here is secondary memory, not main memory. There was never enough main memory in the Good Old Times.


If you want to run the original and new programs and compare the results, follow the directions below to download, unzip, and run the programs per instructions in the README file.


Download

    Download Python programs and data files

Unzip on GNU/Linux

    tar xzf 80-times.tar.gz

Unzip on Windows

If not already on your Windows system, download 7-Zip, and then unzip twice:

    7z x 80-times.tar.gz
    7z x 80-times.tar

README

Follow directions in the README file.