How to Check Spelling the Old School Unix Way

Before word processors had a spell check function, you had to run your own spell check against a document. And in the very early Unix days, systems didn’t have a dedicated “spell check” program; instead, you chained together a set of commands to do it yourself. Let’s take a look at how to check spelling the “old school Unix” way.

Checking Spelling On The Command Line

These days, we don’t think about the spell checker in our word processor. You may not even “run” a spell check anymore. It’s easier to watch for the red squiggly line to appear under misspelled words; if there’s a red line under it, you fix the spelling.

In the early Unix days, the system provided a dictionary file (/usr/share/dict/words on most Linux systems) that contained a sorted list of words from the dictionary, with each word on a line by itself. To check the spelling of a document, you need to compare all the words in your document against the dictionary file. And to do that, you need to convert your document into a format that looks like the dictionary file: a sorted list of words, with each word on its own line.

The dictionary file is all lowercase, so first you need to convert your document to use lowercase letters. You do this with the cat command to display the file, and the tr command to translate characters from one set to another. In this case, you can ask tr to convert all the uppercase letters A-Z to the lowercase letters a-z:

cat document | tr A-Z a-z
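
As a quick check of this step, the sketch below runs the same tr translation on a sample sentence instead of a file. The sample text and the variable name lowered are made up for illustration:

```shell
# Lowercase a sample sentence with tr (the sample text is illustrative only).
lowered=$(printf 'Early Unix Systems\n' | tr A-Z a-z)
echo "$lowered"
```

The character class form, tr '[:upper:]' '[:lower:]', is an equivalent way to write the same translation.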

While the dictionary includes in-word punctuation like hyphens and apostrophes, the list of words doesn’t include sentence punctuation like periods and question marks. So the next step is to use tr, this time to delete (-d) the characters we don’t want:

cat document | tr A-Z a-z | tr -d ',.:;()?!'
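
To see that only sentence punctuation is removed, here is a small sketch on a made-up sentence. The apostrophe in “didn't” and the hyphen in “well-known” are not in the delete list, so they should survive:

```shell
# Delete sentence punctuation but keep in-word apostrophes and hyphens.
# The sample sentence and the variable name are illustrative only.
cleaned=$(printf "Didn't you check? Yes, the well-known way.\n" | tr A-Z a-z | tr -d ',.:;()?!')
echo "$cleaned"
```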

The dictionary file has each word on a line by itself, so you need to break your document up so each word appears on its own line. The tr command can do this by replacing each space with a “new line” character:

cat document | tr A-Z a-z | tr -d ',.:;()?!' | tr ' ' '\n'
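
One optional variation, not part of the original pipeline: if your document has double spaces, translating each space separately leaves empty lines behind. A minimal sketch, assuming you want those collapsed, adds the -s option so tr squeezes runs of newlines into one:

```shell
# Put each word on its own line. With -s, tr squeezes repeated newlines,
# so the double space in the sample text does not produce an empty line.
words=$(printf 'check  the spelling\n' | tr -s ' ' '\n')
echo "$words"
```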

Sorting the output is easily done using the Unix sort command. Add the uniq command to clean up the output, to remove any duplicate words. For example, you probably use the word “the” several times in any document. Using sort then uniq will strip out the repeated instances of “the” so you only have one “the” in your output.

cat document | tr A-Z a-z | tr -d ',.:;()?!' | tr ' ' '\n' | sort | uniq
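
Here is the sort and uniq step in isolation, using a short made-up word list. Note that uniq only removes adjacent duplicates, which is why the sort has to come first:

```shell
# sort groups the duplicates together; uniq then keeps one copy of each.
unique=$(printf 'the\ncat\nthe\nhat\nthe\n' | sort | uniq)
echo "$unique"
```

If you prefer, sort -u does both jobs in a single command.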

Now you are ready to compare the list of words from your document with the dictionary file! The standard Unix command comm compares two files line-by-line, and identifies lines that are unique to the first file, unique to the second file, or lines that are common to both. To find the list of misspelled words from your document, you want the list of unique words – words that are found in your document, but not in the dictionary file. Use the -2 option to not print the words unique to the second file, and the -3 option to not display the words that are common to both files. What’s left are the words unique to your document that do not appear in the dictionary; these are misspelled words.

cat document | tr A-Z a-z | tr -d ',.:;()?!' | tr ' ' '\n' | sort | uniq | comm -2 -3 - /usr/share/dict/words

The single hyphen tells comm to read from the “standard input,” which is the output from the previous commands on the command line.
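
To watch comm on its own, this sketch compares a pretend document against a tiny stand-in dictionary. Both word lists are invented for the example and are already in sorted order:

```shell
# Build a tiny, already-sorted stand-in dictionary (hypothetical word list).
dict=$(mktemp)
printf 'a\nlist\nof\nwords\n' > "$dict"

# Sorted, de-duplicated words from a pretend document; "wrods" is a typo.
# -2 hides words only in the dictionary, -3 hides words found in both.
misspelled=$(printf 'a\nlist\nof\nwrods\n' | comm -2 -3 - "$dict")
echo "$misspelled"

rm -f "$dict"
```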

And that’s how to check spelling the “old school Unix” way! Let me demonstrate with a sample document. I’ve intentionally misspelled a few words here:

$ cat document
Early Unix didn't have word procesors like we thikn of them today. Instead,
you wrote a plain text document that might have embedded special commands to
underline text or create a list of bulet points. But how did you check the
spelling of your document?

By running the list of commands, you’ll find this list of misspelled words:

$ cat document | tr A-Z a-z | tr -d ',.:;()?!' | tr ' ' '\n' | sort | uniq | comm -2 -3 - words
bulet
procesors
thikn
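
If you run this often, you might wrap the pipeline in a shell function. The sketch below is one way to do that; the name spellcheck, the two arguments, and the throwaway demo files are all hypothetical, and the function reads the file with input redirection instead of cat:

```shell
# Hypothetical helper: list the words in file $1 that are missing from the
# sorted word list $2. Same pipeline as above, reading the file directly.
spellcheck() {
    tr A-Z a-z < "$1" | tr -d ',.:;()?!' | tr ' ' '\n' | sort | uniq | comm -2 -3 - "$2"
}

# Demo with throwaway files; the tiny word list stands in for a real dictionary.
doc=$(mktemp)
dict=$(mktemp)
printf 'Check the speling, then check the grammer.\n' > "$doc"
printf 'check\ngrammar\nspelling\nthe\nthen\n' | sort > "$dict"

typos=$(spellcheck "$doc" "$dict")
echo "$typos"

rm -f "$doc" "$dict"
```

With a real document, you would call it as spellcheck document words, where words is your sorted dictionary file.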

The key to checking spelling this way is the Unix comm command to compare two sorted lists of words. The two lists do need to be sorted the same way. Your Linux system’s /usr/share/dict/words file may include some uppercase words such as common names or titles or locations. For example, the dictionary file on my Fedora 32 system contains both “Minnesota” (correct capitalization for the U.S. state name) and “minnesota” (all lowercase) on adjacent lines. But the Unix sort command may order uppercase and lowercase letters differently from the order used in the dictionary file. If the two lists are not in the same order, comm will complain that the input file is not correctly sorted. To better match the “old school Unix” method to check spelling, you may first need to sort your system’s dictionary file and save it in a separate file. You can do so like this:

sort /usr/share/dict/words > words
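
Both sort and comm follow the collation order of your current locale, so one way to keep them in agreement is to pin the locale for the whole session. This is a sketch under that assumption, using the C locale (plain byte order) and a small stand-in file in place of /usr/share/dict/words:

```shell
# Force one collation order for both sort and comm. Assumption: the C locale,
# which orders by byte value, so uppercase letters sort before lowercase.
# Note that export affects every later command in this shell session.
export LC_ALL=C

raw=$(mktemp)
sorted=$(mktemp)
# Stand-in for /usr/share/dict/words, deliberately out of order.
printf 'minnesota\nMinnesota\napple\n' > "$raw"
sort "$raw" > "$sorted"

# "appel" is the only word missing from the word list.
missing=$(printf 'appel\napple\nminnesota\n' | sort | comm -2 -3 - "$sorted")
echo "$missing"

rm -f "$raw" "$sorted"
```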

Author: admin