In the book Exercises in Programming Style, the author, Cristina Videira Lopes, solves a programming problem in 33 different ways, each an example of a particular style of programming: monolithic, object-oriented, etc.
Being an old geezer of a programmer, I was immediately drawn to style #1, Good Old Times. This style imposes two constraints:
1. "Very small amounts of primary memory, typically orders of magnitude smaller than the data that needs to be processed/generated."
2. "No identifiers - i.e., no variable names or tagged memory addresses. All we have is memory that is addressable with numbers."
The second constraint is really archaic, and even a programmer as old as I am never had to deal with anything so primitive. (Although I did have to use a Basic compiler that only allowed identifiers of the form A, B, ..., Z, A1, A2, ...)
However, the first constraint was very real when I began learning to program. My first programming job introduced me to a Danish computer, the GIER, which had 1K of 40-bit words, i.e., about 5K bytes of main memory. The modern programmer wouldn't know how to begin doing anything in this minuscule amount of memory, but the GIER boasted a full-blown Algol 60 compiler that was extremely fast and which produced object programs with automatic overlays so that a program much larger than 5K could run with ease. See The Design of the GIER Algol Compiler - Part I and Part II for some insight into the creativity and ingenuity that were applied to the problem of not enough main memory, way back over 55 years ago.
Digging into the Python code for program version 1, it quickly became apparent to me that the Good Old Times style presented in the book is just not a viable solution, even in the Good Old Times. Limited main memory forced old-time programmers to develop algorithms which performed decently even in the face of almost no memory. So, I modified the program, keeping in mind the major constraint of not enough memory, and was gratified to see that my solution ran 80 times faster than the original program. Generally, I am not able to increase the performance of a program by quite such a factor (🙂), so it might be useful to see what made it possible in this case.
What is the problem to be solved? To quote the author: "given a text file, we want to produce the list of words in the file and their frequencies, and print them out in decreasing order of frequency."
The text file used is pride-and-prejudice.txt, containing Jane Austen's Pride and Prejudice, available from the Gutenberg Collection. The Linux utility wc reports this about the file:
13,426 lines of text
124,588 words
704,145 bytes
There is one other input file, stop_words.txt, 554 bytes in size, containing 121 common words like "a", "the", etc. Any occurrences of these words in the main input file are to be ignored.
The desired output from the program is a text file containing the 25 most frequently occurring words in the input file, listed in descending order by frequency:
mr - 786
elizabeth - 635
very - 488
darcy - 418
such - 395
mrs - 343
much - 329
more - 327
bennet - 323
bingley - 306
jane - 295
miss - 283
one - 275
know - 239
before - 229
herself - 227
though - 226
well - 224
never - 220
sister - 218
soon - 216
think - 211
now - 209
time - 203
good - 201
The original program is organized like this:
Load the entire stop_words.txt file into primary memory
Read the main input file one line at a time, and for each line:
    Delimit words in the line, and for each word:
        Look the word up in the stop words structure
        If found, ignore the word
        Otherwise, look the word up in the word_freqs structure:
            If not found, add the word to the structure with frequency 1
            If found, increment the frequency by 1
Read the word_freqs structure one entry at a time and build an output structure containing the 25 most frequent words in descending order by frequency
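The organization above can be sketched in Python. This is a simplified sketch, not the book's code: it uses an in-memory dict for word_freqs purely for brevity (the whole point of the original is that word_freqs does not fit in memory), and the word-delimiting rule is an assumption.

```python
import re

def word_frequencies(lines, stop_words, top_n=25):
    """Count non-stop words in an iterable of text lines; return the
    top_n (word, count) pairs in descending order of frequency."""
    word_freqs = {}  # in the Good Old Times style this structure lives on disk
    for line in lines:
        # Delimit words: runs of two or more letters (an assumed rule)
        for word in re.findall(r"[a-z]{2,}", line.lower()):
            if word in stop_words:
                continue  # occurrences of stop words are ignored
            word_freqs[word] = word_freqs.get(word, 0) + 1
    # Descending by frequency; ties broken alphabetically
    return sorted(word_freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
```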
Note that once the main input file has been processed, the word_freqs structure contains all the data we need, namely, for each unique word appearing in the file, the word itself plus a count of how many times the word appears.
The big question is, where is this word_freqs structure stored? Clearly it can't be in main memory because main memory is what we are acutely short of. Thus, it must be in secondary memory, e.g., a disk file.
But, now consider what happens when we have delimited a word in the main input file and we want to look it up in the word_freqs structure:
Seek to the beginning of the word_freqs structure on disk
Read each entry in word_freqs one at a time, and for each:
    Compare the entry to the delimited word
    If no match, keep reading from the word_freqs structure
    If there is a match, increment the frequency and update the entry in the word_freqs structure
After reading every entry in word_freqs, if a match was not found, add a new entry to the word_freqs structure with the delimited word and a frequency of 1
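That per-word lookup can be sketched as follows. The fixed-width record layout (a 26-byte word field plus a 5-digit count) is my assumption, chosen so that a matched record can be updated in place with a seek and a write:

```python
import os

RECORD = 32  # assumed layout: word padded to 26 bytes, 5-digit count, newline

def bump_word(path, word):
    """One lookup in the on-disk word_freqs file: scan records from the
    start; on a match, rewrite that record in place; otherwise append a
    new record with a count of 1."""
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        while True:
            pos = f.tell()
            rec = f.read(RECORD)
            if len(rec) < RECORD:
                break  # end of file reached: word not found
            if rec[:26].decode().strip() == word:
                count = int(rec[26:31]) + 1
                f.seek(pos)  # back up and update the record in place
                f.write(f"{word:<26}{count:05d}\n".encode())
                return
        f.write(f"{word:<26}{1:05d}\n".encode())  # append a new entry
```

Every call re-scans the file from the beginning, which is exactly the problem: tens of thousands of calls, each scanning a file that keeps growing.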
Now, pay attention: This process will be done for EVERY non-stop word in the main input file. There are 69,373 stop words in the file, so the number of non-stop words is 124,588 - 69,373 = 55,215. Thus the word_freqs structure on disk is searched 55,215 times, and the searches get longer as more words are added to the structure. And this is all being done on disk, not in main memory. So it is probably going to be slow. And, indeed, it is slow.
What to do about it? We need a solution that does not do so much disk access. Here is an approach:
When a word in the main input file is delimited, if it is a non-stop word, simply write it to a words_file
So when the main input file has been processed, every non-stop word in it appears in the words_file
Now we want to sort the words_file; to achieve this we do a classic external merge sort:
    Read a chunk from words_file, as much as will fit in main memory
    Sort the chunk in main memory
    Write the chunk to a temp file Ti on disk
    We end up with n files on disk, each sorted; now merge:
        Read the first record from each of the n files into an array
        Repeat until the array is empty:
            Find the smallest record in the array, say in slot i, and write it to the output file
            Read another record from Ti into slot i of the array
Now we have words_file.srt, a sorted version of words_file
Read words_file.srt, counting adjacent equal records to come up with a frequency for each word; write word_freqs, a file with distinct words and a count for each word
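The chunk-sort-and-merge steps can be sketched with Python's heapq.merge doing the n-way merge. This is an illustrative sketch, not the repository's code; the function name and chunk size are my own:

```python
import heapq
import tempfile

def external_sort(in_path, out_path, chunk_lines=10_000):
    """Classic external merge sort over a file of one word per line:
    sort chunk_lines-sized chunks in memory, spill each sorted run to a
    temp file, then do an n-way merge of the runs."""
    runs = []
    with open(in_path) as f:
        while True:
            # Read as much as "fits in main memory" (here, a line count)
            chunk = [line for _, line in zip(range(chunk_lines), f)]
            if not chunk:
                break
            chunk.sort()  # sort the chunk in main memory
            run = tempfile.TemporaryFile("w+")
            run.writelines(chunk)  # write the sorted run to a temp file
            run.seek(0)
            runs.append(run)
    with open(out_path, "w") as out:
        # n-way merge: repeatedly emit the smallest record among the runs
        out.writelines(heapq.merge(*runs))
    for run in runs:
        run.close()
```

Every pass here is sequential I/O, which is what disks do well; the original program's repeated start-to-finish scans with in-place updates are what they do badly.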
Processing word_freqs to come up with the 25 most frequent words in descending order is done just as in the original program. The result: A program that does exactly the same job as the original program, but runs at least 80 times faster.
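The last two steps, collapsing adjacent equal words in words_file.srt into counts and then picking the 25 most frequent, can be sketched like this. itertools.groupby and heapq.nlargest are modern conveniences; a memory-constrained program would do the same work with a running counter and a small fixed array:

```python
import heapq
from itertools import groupby

def count_and_rank(sorted_words, n=25):
    """sorted_words: words in sorted order, one per item (words_file.srt).
    Collapse runs of equal adjacent words into (word, count) pairs, then
    keep the n most frequent, in descending order by count."""
    pairs = ((word, sum(1 for _ in run)) for word, run in groupby(sorted_words))
    return heapq.nlargest(n, pairs, key=lambda wc: wc[1])
```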
Of course, the classic time-space trade-off is in play here. The original solution uses less memory, but pays for it by running slowly. The new solution pays more in terms of memory, but gains dramatically in speed. The new solution could be tweaked to use less memory, but that is not the point of this exercise. Keep in mind the memory we are talking about here is secondary memory, not main memory. There was never enough main memory in the Good Old Times.
Follow the directions in the README file at this github repository to run the two programs on Linux or Windows. You should see that the new program runs around 80 times faster than the old one.