Unknown Language Coding Problem

This is from the Daily Coding Problem email list. The owners of that list haven’t sent out a problem that caught my imagination in quite a while.

Daily Coding Problem: Problem #1553 [Hard]

This problem was asked by Airbnb.

You come across a dictionary of sorted words in a language you’ve never seen before. Write a program that returns the correct order of letters in this language.

For example, given ['xww', 'wxyz', 'wxyw', 'ywx', 'ywz'], you should return ['x', 'z', 'w', 'y'].


Github repo for my solution. Feel free to look it over, try it and email me (bediger8@gmail.com) if you notice anything.

Here’s how I solved this problem. My solution is idiosyncratic compared to the concise solutions from interview prep websites.

Analysis

I wrote a program that compares characters in adjacent pairs of words to determine lexical “less than” relationships between pairs of characters.

If the first character of an adjacent pair of words is different, unknown language letter word[n][0] is lexically less than unknown language letter word[n+1][0]. The initial letters of the sorted list of words have a “less than” relationship, but realistically not all letters in the list of words will appear at the beginning of a word. Perhaps the unknown language has some letters that never appear in the initial position. The letter ‘X’ is almost in this category for English.

You can derive extra information in some instances. If the first character of an adjacent pair of words is identical, compare the second characters. If the second characters differ, the first word’s second character is lexically less than the second word’s second character. if the second characters are identical, compare third characters, and so on.

You get at most one lexically less than relationship of characters between any pair of adjacent words, but you can’t assume an immediate, parent-child relationship.

Unless the sorted word list is chosen carefully, you might get letters that you don’t have enough information to derive any lexical less than relationship, Consider a sorted list of words ['ab', 'ac', 'bae', 'bc']. You can tell that ‘a’ < ‘b’ when comparing “ac” and “bce”. The word “bce” has an ’e’ character that can’t be compared to any other character. “bce” is the longest word in the sorted list and ’e’ only occurs there.

It’s also possible to have sorted word lists where letters that have a less than relationship, but you can’t tell if there are other letters that sort between them.

The other solutions of this I could find ignored this nicety, producing a topological sort of letters found.

Data Structures

The data structure to keep the characters in will need to allow arbitrary insertions, and deletions. The trick here is that even when a pair of characters have a lexical relationship based on one pair of adjacent words, we don’t know if characters will show up that are lexically between them in pairs of words encountered later. We also have to pick a data structure that allows a “less than” character to have multiple characters “greater than”.

Evolution of data structure

Even though I had an idea of how to find relationships between pairs of letters, the data structure I used evolved as I wrote code and made decisions.

At first glance, I thought a doubly-linked list would work:

type node struct {
    character rune
    next *node
    prev *node
}

After thinking a bit and writing a little code, I realized that a character could have multiple children during the procedure of examining pairs of words, even if after comparing all words, only one child character per character found existed. A slice handles that situation.

type node struct {
    character rune
    children  []*node
}

I also realized that to insert a newly encountered letter, I’d need to have a “parent” pointer.

type node struct {
    character rune
    parent    *node
    children  []*node
}

Then I realized a Go map type would be easier coding than all the slice manipulation required to make a node a child or to remove a child node.

type node struct {
    character rune
    parent    *node
    children  map[rune]*node
}

I keep a root node, the letter determined to be lexically least so far. I wrongly believed that would be the first letter of the first word of the list. That’s not true, a word list of ['ba', 'bb', 'bc'] has a first letter of first word that’s not the first lexically sorted letter. The parent element of the root node is nil, so a few non-nil pointer tests have to exist. I suppose it would be possible to use the non-nil parent pointer, or put a root bool element in the node struct, so as to mark the root node, but that seems error-prone.

In addition, I have a map[rune]*node to track all of the characters discovered. Due to Go’s method system, I added a method through which all per-character structures are obtained.

type nodePool map[rune]*node

func (p nodePool) characterNode(character rune) *node {
    if n, ok := p[character]; ok {
        return n
    }
    n := &node{ 
        character: character,
        children:  make(map[rune]*node),
    }
    p[character] = n
    return n
}

The method characterNode() keeps a single struct node pointer for a given character. Code for getting a struct out of a map type looks like an array access in Go: p[character] in the code above. Using the method looks like pool.characterNode(character), so there’s not a lot of cognitive burden in using a method call.

Interesting Consequence

I have a large sorted list of words from a real language, English. Linux comes with /usr/share/dict/words, or at least it’s easily obtained.

letter order from 25 words

That’s an example of using 25 English language words, chosen randomly from /usr/share/dict/words.

My program can’t find any lexical less than relationship for ‘i’, ‘z’, ‘x’, ‘k’. from those 25 words. For example, the letter ‘k’ appears once in the list of words, at the end of a word. The letter ‘z’ appears a few times, but not in a position where it can be given a less than relationship to another letter. There’s reasons for all these letters to appear without a parent.

My program also finds that ‘a’ < ’s’ and ‘c’ < ’t’, but it doesn’t find any letters that ’s’ or ’t’ are less than. This example shows that a topological sort producing a single list of letters would be misleading to someone trying to translate or decipher the unknown language.

It’s also possible to create sorted lists of words that don’t have a single chain of lexical less than relationships. Consider the list ['abd', 'abe', 'bc', 'bd', 'be'].

  • a < b, based on “abe” sorting before “bc”.
  • d < e, based on “abd” sorting before “abe”.
  • c < d, d < e, based on “bc”, “bd” and “be”

Nowhere in that list can we find where ‘c’ relates to ‘a’ or ‘b’.

I find that even with 500 randomly chosen English words, the program still can’t always find where ‘x’, ‘y’ and ‘z’ sort.

As far as languages with some letters that never appear initially, here’s the results of my program trying to figure out letter sorting order from 200 randomly-chosen English words that don’t have ‘a’, ’e’, ‘i’, ‘o’ or ‘u’ initially:

sort of no-initial-vowels words

Notice the branching after ‘a’, ’e’ and ‘o’. That’s got to happen because there’s not enough comparisons of same-first-letter, different-second-letter to find the actual places of the children of vowels in sort order.

This Problem in Job Interviews

I think this is a decent interview question, albeit not one that’s good for whiteboard coding.

It requires the candidate to think about the problem, and devise a solution. The solution has some subtleties, so the candidate will probably have to modify the data structure, and change the algorithm as coding progresses and information presents itself.

The interviewer would get to see some non-trivial code. There’s opportunities for the candidate to make the code more or less clear by choice of function and variable names, and possibly by doing some small amount of object oriented coding. Moving some code to methods can add clarity to the algorithm, which is a little unusual in small problems like this.

The interviewer could also pose alternate word lists to get candidates to consider corner cases. This would allow an interviewer to see how a candidate refactors, if the corner cases trigger bugs or break the algorithm.

I think any algorithm solving this problem would be hard to contain in a candidate’s head, requiring refactoring and changing code and data structures. That’s difficult on a whiteboard after even a small number of lines of code. My code clocks in at just less than 200 lines, which is certainly too big for a whiteboard.

Does it rate a “[Hard]”? I think it might. You need to develop a data structure and write some code to put together a number of instances of it, correctly. Some of the operations aren’t immediately obvious. When the program comes across a lexical less than, it can’t assume an immediate parent-child relationship. Nor can it use an “already seen this letter” indicator, it has to check if the “greater than” letter already exists in the descendants of the “less than” letter.

Or, if you do it the way most of the online sources do it, you have to have the insight that this is a graph theory problem. Topological sorting is a fairly obscure topic.

One big flaw exists in this problem. A topological sort can produce a straight line list of letters that have a lexically less than relationship. That list may not represent the sorting order of the unknown language, if the list of words doesn’t have enough information about the letters. A programmer that considers the use of an ordered list of letters may decide that a topological sort would just mislead users trying to understand the unknown language.