IAM

NOVEMBER2016

READING

M. F. Porter. An Algorithm for Suffix Stripping. Readings in Information Retrieval, 1997.

Porter presents a simple but fast algorithm for suffix stripping in natural language processing. The so-called Porter algorithm tries to strip all suffixes in order to reduce related words to the same stem. In his original paper, Porter explicitly discusses the use of the algorithm for information retrieval systems.

The algorithm proceeds as follows. Given a word, it is classified according to its consonants and vowels. Indeed, each word can be represented in the form of

$[C](VC)^m[V]$

where C stands for a sequence of consonants and V stands for a sequence of vowels. The words are classified according to $m$. Examples from the paper:

  • $m = 0$: "TR", "EE", "TREE", "Y", "BY".
  • $m = 1$: "TROUBLE", "OATS", "TREES", "IVY".
  • $m = 2$: "TROUBLES", "PRIVATE", "OATEN", "ORRERY".

The algorithm then applies a sequence of rules of the form

(condition) $S_1 \rightarrow S_2$

where the condition usually refers to $m$ and $S_1$, $S_2$ match suffixes on the given word. Examples of such rules are

($m > 1$) EMENT $\rightarrow$

SSES $\rightarrow$ SS

Groups of such rules are applied to the given word in turn; the groups and their rules can be found in the paper.

While the algorithm is easy to implement and fast, Porter mentions several aspects to consider. For example, some words may be reduced to the same stem although their meaning differs (for example "RELATE" and "RELATIVITY" are reduced to the same). However, adapting the rules to resolve this problem in particular cases will cause the algorithm to "fail" in other cases.

An implementation of Porter's algorithm can be found in the NLTK framework.

What is your opinion on this article? Let me know your thoughts on Twitter @davidstutz92 or LinkedIn in/davidstutz92.