Random non-repeating sequences

While doing some research for an upcoming paper, I came upon a 2008 blog post describing a special case of the following interesting problem:

Given an alphabet of n symbols, construct a uniformly-selected sequence of m of these symbols, with the sequence containing no duplicate symbols.

[Updated 2009/12/14] The difficulty arises when both m and n are large. The obvious method is a rejection method: repeatedly pick a random symbol from the alphabet and check whether you've picked it already. If not, append it to your target number. The problem is that the duplicate check seems to require log m time even if you code it cleverly, giving an asymptotic performance of O(m log m). Hash tables can help some, but ultimately you're going to waste a lot of time checking for the unlikely case that you've generated a duplicate.

There were various solutions given in the comments, but none of them were optimal. A spoiler follows…

The obvious solution here is to do a partial Fisher-Yates shuffle, selecting and retaining only the first m elements. The problem with this approach is that Fisher-Yates requires that the array be initialized with the entire alphabet. If n is much larger than m the cost of initialization dominates the cost of selection, and the whole thing becomes really inefficient. (There's a variant of Fisher-Yates that can initialize the array on-the-fly, but it only works when shuffling the whole array, so it doesn't help here.)

One way to make partial Fisher-Yates work is to employ a technique for initializing an array, little-known outside of academic circles, that can initialize in an arbitrary way in amortized constant time. The basic idea is that instead of your array containing values, it contains pointers into a stack of value / backpointer pairs. Obviously, any pointers that point outside the stack are invalid. To tell whether a pointer into the stack is valid, you check whether the stack entry's backpointer points back to the original array element. If an array element is invalid, initialize a valid stack entry and push it on top of the stack. This blog entry does a good job of explaining the details. Since you don't actually initialize an array element until you first try to read it, you don't pay for initializing elements you don't read.

This self-initializing array is exactly what is wanted to do a partial Fisher-Yates shuffle. The modified code looks like this:

To make a sequence of m symbols from an alphabet 0..n, disregarding duplicates:
  a ← self-initializing array of n elements
  for i in 0 .. m-1 do
    j ← i + randint (n - i)
    v ← read_or_init a[i] i
    v' ← read_or_init a[j] j
    set a[j] ← v
    set a[i] ← v'
  return a[0]..a[m-1]

where set and read\_or\_init are the obvious self-initializing array primitives.

A little bit complex, perhaps, but should have great performance on a wide range of inputs. Academic CS algorithms stuff for the win! (B)