search

Home  >  Q&A  >  body text

Way to find words that differ by only one consonant in a large word list

<p>I have a list of nearly 5000 "fantasy" words written in ASCII text. Some of these words are as follows: </p> <pre class="brush:php;toolbar:false;">txintoq txiqbal txiqfun txiqwek txiqyal txiyton txonmiq txoqwul txoqxik</pre> <p>I want to design an algorithm that checks/verifies that no two words in a list differ by only one "similar consonant".So I'll define "set of similar consonants" like this (tentatively): </p> <pre class="brush:php;toolbar:false;">zs xj pb td kg</pre> <p><em>There may be 3 or more consonants in a set, but I will only show 2 now. As I learn more about which consonants sound alike in the tones of fantasy languages, I need to further refine this definition. </em></p> <p>Therefore, words like the following will be marked as "requires correction" (because they sound too similar): </p> <pre class="brush:php;toolbar:false;">txindan txintan # Only d/t is different xumaq jumaq # Only x/j is different dolpar dolbar # Only a b/p is different</pre> <p>How do I find these words that differ by only one consonant in my list of ~5,000 words in a <em>relatively efficient</em> way? </p> <p>This is a very naive solution that I currently have in mind, as follows: </p> <pre class="brush:php;toolbar:false;">import fs from 'fs' const terms = fs .readFileSync('term.csv', 'utf-8') .trim() .split(/n /) .map(line => { let [term] = line.split(',') return term }) .filter(x => x) const consonantSets = ` zs xj pb td kg` .split(/n /) .map(x => x.split('')) function computeSimilarTerms( term: string, consonantSets: Array<Array<string>>, ) { const termLetters = term?.split('') ?? [] const newTerms: Array<string> = [] for (const consonantSet of consonantSets) { for (const letter of consonantSet) { for (const letter2 of consonantSet) { if (letter === letter2) { continue } let i = 0 while (i < termLetters.length) { const termLetter = termLetters[i] if (termLetter === letter) { const newTerm = termLetters.concat() termLetters[i] = letter2 newTerms.push(newTerm.join('')) } i } } } } return newTerms } for (const term of terms) { const similarTerms = computeSimilarTerms(term, consonantSets) similarTerms.forEach(similarTerm => { if (terms.includes(similarTerm)) { console.log(term, similarTerm) } }) }</pre> <p>How can this be accomplished with relatively little brute force? And this solution is incomplete because it does not build <em>all possible similar word combinations</em>. So somewhere in the algorithm it should be able to do this. Any ideas? </p>
P粉757640504P粉757640504495 days ago623

reply all(1)I'll reply

  • P粉238433862

    P粉2384338622023-08-16 13:13:46

    Choose one consonant in each group to be the "representative" of that group. Then, build a map that groups words together such that they become identical when their consonants are replaced by their representative consonants.

    Important note: This method only works when the consonant groups form equivalence classes. In particular, consonant similarity must be transitive. If 'bp' is similar, 'bv' is similar, but 'pv' is not similar, then this method is invalid.

    The following is the code for the example in Python; I let you write the JavaScript code.

    • f is a mapping that maps each consonant to its representative consonant;
    • d is a map that maps each representative word to a list of words with this representative.
    bigwordlist = '''dolbar
    dolpar
    jumaq
    txindan
    txintan
    txintoq
    txiqbal
    txiqfun
    txiqwek
    txiqyal
    txinton
    txonmiq
    txoqwul
    txoqxik
    xumaq'''.splitlines()
    
    consonant_groups = '''zs
    xj
    pb
    td
    kg'''.splitlines()
    
    f = {}
    for g in consonant_groups:
        for c in g:
            f[c] = g[0]
    
    print(f)
    # {'z': 'z', 's': 'z', 'x': 'x', 'j': 'x', 'p': 'p', 'b': 'p', 't': 't', 'd': 't', 'k': 'k', 'g': 'k'}
        
    d = {}
    for word in bigwordlist:
        key = ''.join(f.get(c, c) for c in word)
        d.setdefault(key, []).append(word)
    
    print(d)
    # {'tolpar': ['dolbar', 'dolpar'], 'xumaq': ['jumaq', 'xumaq'], 'txintan': ['txindan', 'txintan'], 'txintoq': ['txintoq'], 'txiqpal': ['txiqbal'], 'txiqfun': ['txiqfun'], 'txiqwek': ['txiqwek'], 'txiqyal': ['txiqyal'], 'txinton': ['txinton'], 'txonmiq': ['txonmiq'], 'txoqwul': ['txoqwul'], 'txoqxik': ['txoqxik']}
    

    Finally, we can see which words are similar:

    print([g for g in d.values() if len(g) > 1])
    # [['dolbar', 'dolpar'], ['jumaq', 'xumaq'], ['txindan', 'txintan']]
    

    reply
    0
  • Cancelreply