Thoroughly Spell-Checking a PhD Thesis

I have always used aspell to spell-check papers. Due to time pressure, I never bothered to set up a dictionary of the words that aspell does not recognize. As papers usually involve only a few LaTeX files, this was an OK process. For my PhD thesis, however, I needed a more automatic and thorough process: more files are involved, and I had to spell-check multiple times, several weeks or months apart. In this article, I want to share a semi-automatic but thorough process based on aspell and TeXtidote that worked well for me.

Introduction

Spell-checking capabilities are getting better every year. Several companies offer good solutions for web browsers, and good spell checking is built into most office applications, but getting LaTeX projects or files spell-checked still seems problematic. Some of the web tools seem to work partly with Overleaf, but importing a LaTeX project into Overleaf to manually spell-check all files is quite cumbersome.

For me, GNU's aspell has long been the go-to tool to spell-check LaTeX files. For papers, this was usually a good solution, as few files were involved and spell-checking only happened twice: once before submission and once before the camera-ready submission. For my PhD thesis, however, I realized this is not practical, also because I never invested in building a custom dictionary containing the many terms that aspell does not recognize. In this article, I want to share some scripts that combine aspell and TeXtidote for spell and grammar checking.

Requirements

aspell comes pre-installed with many Linux distributions. On Debian/Ubuntu, it can be installed manually via sudo apt-get install aspell. TeXtidote is a bit more involved to install as it relies on Java. However, the official repository includes instructions for Debian systems.
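Before starting a long checking session, it can help to confirm that both tools are actually available. The following is a minimal sketch, assuming both are installed as executables named aspell and textidote on the PATH (the helper name missing_tools is made up for this example):

```python
import shutil


def missing_tools(tools=("aspell", "textidote")):
    """Return the names from `tools` that cannot be found on the PATH."""
    # shutil.which returns None when no executable of that name is found.
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print('Missing tools: %s' % ', '.join(missing))
else:
    print('aspell and textidote found.')
```

If TeXtidote is only available as a .jar file, this check will report it as missing even though it can still be run through Java.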

Usage and Scripts

aspell: The main problem with aspell is that it fails to recognize many scientific terms and sometimes also flags LaTeX commands or bib entries. Moreover, every file needs to be scanned individually, which is tedious for a large project such as a PhD thesis. Nevertheless, aspell is able to not only identify but also correct many basic spelling mistakes. So I tackled both problems using a custom dictionary and a simple Python script that helps me go through all files automatically.

An aspell dictionary looks as follows:

personal_ws-1.1 en 0
ResNet
ResNets
SimpleNet
...

It should not contain empty lines or duplicates. Running aspell with a custom dictionary is pretty simple: aspell check --add-extra-dicts=./dict.txt thesis.tex. However, it will only check the given file. So I thought it would be nice to go through the files sequentially, which allows me to update the dictionary in between. The following Python script does this, while also recording which files have been checked:

import os
import sys
import glob


CHECKED_FILE = 'checked.txt'
DICT_FILE = 'aspell_dict.txt'

def query_yes_no(question, default="yes"):
    """Ask a yes/no question via input() and return the answer.

    "question" is a string that is presented to the user.
    "default" is the presumed answer if the user just hits <Enter>.
            It must be "yes" (the default), "no" or None (meaning
            an answer is required of the user).

    The "answer" return value is True for "yes" or False for "no".
    """
    valid = {"yes": True, "y": True, "ye": True, "no": False, "n": False}
    if default is None:
        prompt = " [y/n] "
    elif default == "yes":
        prompt = " [Y/n] "
    elif default == "no":
        prompt = " [y/N] "
    else:
        raise ValueError("invalid default answer: '%s'" % default)

    while True:
        sys.stdout.write(question + prompt)
        choice = input().lower()
        if default is not None and choice == "":
            return valid[default]
        elif choice in valid:
            return valid[choice]
        else:
            sys.stdout.write("Please respond with 'yes' or 'no' " "(or 'y' or 'n').\n")


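def read_checked_filenames():
    """Read checked filenames; return an empty list on the first run."""
    # CHECKED_FILE does not exist before the first file has been recorded.
    if not os.path.exists(CHECKED_FILE):
        return []
    with open(CHECKED_FILE, 'r') as file:
        lines = [line.strip() for line in file]
    return [line for line in lines if line != '']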


def add_checked_filename(filename):
    """Add filename to checked ones."""
    with open(CHECKED_FILE, 'a') as file:
        file.write(filename + "\n")


if __name__ == '__main__':
    checked_filenames = read_checked_filenames()
    print('Already checked: %d files' % len(checked_filenames))
    for checked_filename in checked_filenames:
        print(checked_filename)

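    last_filename = None
    # recursive=True is required for '**' to descend into all subdirectories.
    for filename in sorted(glob.glob('**/*.tex', recursive=True)):
        if filename not in checked_filenames:
            # Record the previous file only when moving on to the next one,
            # so an interrupted check is offered again on the next run.
            if last_filename is not None:
                add_checked_filename(last_filename)
            last_filename = filename
            if query_yes_no('Check spelling for %s?' % filename):
                command = 'aspell check --add-extra-dicts=./%s %s' % (DICT_FILE, filename)
                print(command)
                os.system(command)
                print('Checked %s.' % filename)
            else:
                print('Ignored %s.' % filename)
    # Record the final file too; otherwise it is offered again on every run.
    if last_filename is not None:
        add_checked_filename(last_filename)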

In practice, I had the dictionary file open on one side and the script running in bash on the other. Between files, I would keep adding all words or acronyms that aspell did not know but that were correct. As a result, after some files, aspell would only stop at actual spelling mistakes. The script can also be restarted at any time and will skip all files recorded in checked.txt.
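Because the dictionary is edited by hand between files, empty lines and duplicates creep in easily. The following is a minimal sketch of a cleanup helper, not part of the original script. It assumes the first non-empty line is the personal_ws-1.1 header and that every other line holds exactly one word; the function names are made up for this example:

```python
def clean_dictionary(lines):
    """Return the header followed by unique, non-empty words in first-seen order."""
    lines = [line.strip() for line in lines]
    lines = [line for line in lines if line != '']
    if not lines:
        return []
    # Assumption: the first non-empty line is the aspell header line.
    header, words = lines[0], lines[1:]
    seen = set()
    unique = []
    for word in words:
        if word not in seen:
            seen.add(word)
            unique.append(word)
    return [header] + unique


def clean_dictionary_file(path):
    """Rewrite the dictionary at `path` without empty lines or duplicates."""
    with open(path, 'r') as file:
        cleaned = clean_dictionary(file.readlines())
    with open(path, 'w') as file:
        file.writelines(line + '\n' for line in cleaned)
```

Calling clean_dictionary_file('aspell_dict.txt') between two files keeps the dictionary tidy. The comparison is case-sensitive, so ResNet and resnet are kept as separate entries.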

TeXtidote: In contrast to aspell, textidote creates an HTML report that highlights not only spelling and grammar mistakes but also LaTeX problems. The latter can be useful, but in my experience it flagged too many custom commands. To instruct textidote to do spell-checking with a custom dictionary, the following command can be used:

textidote --output html --check en --firstlang de --remove-macros eg,ie,cf,wrt,etc,secref,figref,tabref,algref,onedot --dict dict.txt --ignore sh:figref,sh:c:noin,sh:secskip,sh:008,sh:seclen --read-all thesis.tex > thesis.html

Note that the command also allows ignoring custom macros, but this was a bit shaky in practice. The dictionary format follows that of aspell, except for the first (format) line. The HTML report still has to be reviewed manually, but textidote automatically gathers all included LaTeX files, which is nice. Also, using Chrome developer tools, one can quickly search for and jump to spelling errors (usually highlighted in red). The spell-checker was generally a bit more up-to-date compared to aspell.
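Since both tools can share the same word list, the dictionary for textidote can be derived from the aspell one instead of being maintained separately. The following is a minimal sketch, assuming that textidote expects one word per line without the aspell header, and that the files are named aspell_dict.txt and dict.txt as in the script and command above; the function names are made up for this example:

```python
def to_textidote_words(aspell_lines):
    """Drop the aspell header line and return the remaining non-empty words."""
    words = [line.strip() for line in aspell_lines]
    words = [word for word in words if word != '']
    # The aspell header looks like 'personal_ws-1.1 en 0'; skip it if present.
    if words and words[0].startswith('personal_ws-'):
        words = words[1:]
    return words


def write_textidote_dictionary(aspell_path, textidote_path):
    """Write the words from the aspell dictionary to a plain word list."""
    with open(aspell_path, 'r') as file:
        words = to_textidote_words(file.readlines())
    with open(textidote_path, 'w') as file:
        file.writelines(word + '\n' for word in words)
```

Running write_textidote_dictionary('aspell_dict.txt', 'dict.txt') before calling textidote keeps both dictionaries in sync, so words only have to be added in one place.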

Conclusion

While the combination of aspell and textidote still requires some manual work, I found the results to be pretty good in terms of fixing spelling errors and simple grammar mistakes. Also, the work required to check the 100+ LaTeX files and 300+ pages of my PhD thesis was reasonable.

What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.