Computer science

Approximate string-matching algorithms, part 1

Part 1, Part 2

Table of contents

Introduction

Nowadays computers keep huge amount of data. However the retrieval of textual information is difficult when is misspelled or not exactly known.

This article is devoted to approximate string-matching algorithms. It gives a simplified but clear description of algorithms with examples.

First, we are going to explain the procedures constructing the phonetic codes for the searched text that sound the same, but spelled differently. Then we explain procedures that give different types of textual similarity metric used in a fault-tolerant search. In conclusion, the algorithms are compared so that the reader can choose the right one. See also resources later in this article.

Comments and questions are welcome: please reach me at alexandre.rodichevski@chiappani.it.

Phonetic coding

The procedures discussed here are used to simplify searches in databases when one knows how the text is pronounced but not how it is spelled. For this purpose the phonetic codes are constructed for the searched text, while the database is preventively indexed using those codes. Surnames that sound the same, but are spelled differently, like SMITH and SMYTH, have the same code and can be filed together. The use of phonetic codes reduces matching problems from wrong or different spelling.

Phonetic codes are very useful in order to match lists of either personal names or enterprise names. They are also used for speech recognition and in search engines of databases like the anti-terrorist ones.

Soundex

Soundex builds a phonetic code for people's name. The resulting code is made of one letter and three digits, each digit being one of six consonant sounds. The algorithm was first applied in the U.S. census in 1880.

Procedure
  1. Take the first letter.
  2. Translate remaining characters:
    B, F, P, V → 1
    C, G, J, K, Q, S, X, Z → 2
    D, T → 3
    L → 4
    M, N → 5
    R → 6
  3. Drop adjacent letters having the same code.
  4. Drop non-initial A, E, I, O, U, Y, W and H.
  5. Take first four characters padding with zeros.
Examples
  1. ALEXANDRE → A4E2A536E → A4E2A536E → A42536 → A425.
  2. ALEKSANDER → A4E22A53E6 → A4E2A53E6 → A42536 → A425.

Daitch-Mokotoff Soundex

In 1985 the new Daitch-Mokotoff Soundex algorithm was applied for phonetic coding of Slavic and Yiddish surnames with similar pronunciation and different spelling. The most important improvements are the followings: the code is composed of six characters; there are two different codes when a name has two possible pronunciations; the code is composed of ten figures from 0 to 9.

NYSIIS

The New York State Identification and Intelligence System was developed in 1970 through rigorous empirical analysis. In this algorithm a phonetic code of up to 6 letters is constructed.

Procedure
  1. Translate first characters of name:
    MAC → MCC
    PH → FF
    KN → NN
    PF → FF
    K → C
    SCH → SSS
  2. Translate last characters of name:
    EE → Y
    IE → Y
    DT, RT, RD, NT, ND → D
  3. Translate remaining characters by following rules, incrementing by one character each time:
    EV → AF; else E, I, O, U → A
    Q → G
    Z → S
    M → N
    KN → N; else K → C
    SCH → SSS
    PH → FF
    H → previous character, if previous or next are nonvowel
    W → previous character, if previous is vowel
  4. Check last character:
    if last character is S, remove it
    if last characters are AY, replace with Y
    if last character is A, remove it
  5. Drop second letter of doubled letters.
  6. Take first six characters.
Examples
  1. ALEXANDRE → ALAXANDRA → ALAXANDR → ALAXANDR → ALAXAN
  2. ALEKSANDER → ALACSANDAR → ALACSANDAR → ALACSANDAR → ALACSA

Metaphone

The algorithm phonetically codes words by reducing them to 16 consonant sounds: B, X, S, K, J, T, F, H, L, M, N, P, R, 0, W, Y. Zero represents the "th" sound; X stands for "sh".

Procedure
  1. Drop second letter of doubled letters, except C.
  2. If the word begins with KN, GN, PN, AE, WR, drop first letter.
  3. Drop B at the end of a word after M.
  4. C → X in CIA or CH; C → S in CI, CE or CY; C → K otherwise.
  5. D → J in DGE, DGY or DGI; D → T otherwise.
  6. Drop G in GH and if not at end or before a vowel in GN or GNED; G → J before I or E or Y if not double GG; G → K otherwise.
  7. Drop H after vowel and if no vowel follows.
  8. Drop K after C.
  9. P → F in PH.
  10. Q → K.
  11. S → X in SH or SIO or SIA.
  12. T → X in TIA or TIO; T → 0 in TH; T is dropped in TCH.
  13. V → F.
  14. If the word begins with WH, drop H; drop W if not followed by a vowel.
  15. If the word begins with X, then X → S; X → KS otherwise.
  16. Drop Y if not followed by a vowel.
  17. Z → S.
  18. Vowels are kept only when they are the first letter.
  19. In all the other cases the letters do non change.
Examples
  1. ALEXANDRE → ALEKSANTRE → ALKSNTR
  2. ALEKSANDER → ALEKSANTER → ALKSNTR

Double metaphone

Double metaphone is the improved version of Metaphone. This algorithm phonetically codes words by reducing them to 12 consonant sounds. It returns two keys if a word has two possible pronunciations.

Caverphone

The Caverphone phonetic matching algorithm was created in the Caversham Project at the University of Otago in New Zealand in 2002. The algorithm allows accents present in the study area (southern part of the city of Dunedin, New Zealand).

A new version, Caverphone 2.0, has been built as a more general phonetic match.

Approximate string-matching algorithms
Part 1, Part 2