I have a problem where I really need to be able to use finite automata as the keys to an associative container. Each key should actually represent an equivalence class of automata, so that when I search, I will find an equivalent automaton (if such a key exists), even if that automaton isn't structurally identical.
An obvious last-resort approach is of course to use linear search with an equivalence test for each key checked. I'm hoping it's possible to do a lot better than this.
I've been thinking in terms of trying to impose an arbitrary but consistent ordering, and deriving an ordered comparison algorithm. First principles involve the sets of strings that the automata represent. Evaluate the set of possible first tokens for each automaton, and apply an ordering based on those two sets. If necessary, continue to the sets of possible second tokens, third tokens etc. The obvious problem with doing this naively is that there's an infinite number of token-sets to check before you can prove equivalence.
I've been considering a few vague ideas - minimising the input automata first and using some kind of closure algorithm, or converting back to a regular grammar, some ideas involving spanning trees. I've come to the conclusion that I need to abandon the set-of-tokens lexical ordering, but the most significant conclusion I've reached so far is that this isn't trivial, and I'm probably better off reading up on someone elses solution.
I've downloaded a paper from CiteSeerX - Total Ordering on Subgroups and Cosets - but my abstract algebra isn't even good enough to know if this is relevant yet.
It also occurred to me that there might be some way to derive a hash from an automaton, but I haven't given this much thought yet.
Can anyone suggest a good paper to read? - or at least let me know if the one I've downloaded is a red herring or not?
I believe that you can obtain a canonical form from minimized automata. For any two equivalent automatons, their minimized forms are isomorphic (I believe this follows from Myhill-Nerode theorem). This isomorphism respects edge labels and of course node classes (start, accepting, non-accepting). This makes it easier than unlabeled graph isomorphism.
I think that if you build a spanning tree of the minimized automaton starting from the start state and ordering output edges by their labels, then you'll get a canonical form for the automaton which can then be hashed.
Edit: Non-tree edges should be taken into account too, but they can also be ordered canonically by their labels.
here is a thesis form 1992 where they produce canonical minimized automata: Minimization of Nondeterministic Finite Automata
Once you have the canonical, form you can easily hash it for example by performing a depth first enumeration of the states and transitions, and hashing a string obtained by encoding state numbers (count them in the order of their first appearance) for states and transitions as triples
<from_state, symbol, to_state, is_accepting_final_state>
This should solve the problem.
When a problem seems insurmountable, the solution is often to publicly announce how difficult you think the problem is. Then, you will immediately realise that the problem is trivial and that you've just made yourself look an idiot - and that's basically where I am now ;-)
As suggested in the question, to lexically order the two automata, I need to consider two things. The two sets of possible first tokens, and the two sets of possible everything-else tails. The tails can be represented as finite automata, and can be derived from the original automata.
So the comparison algorithm is recursive - compare the head, if different you have your result, if the same then recursively compare the tail.
The problem is the infinite sequence needed to prove equivalence for regular grammars in general. If, during a comparison, a pair of automata recur, equivalent to a pair that you checked previously, you have proven equivalence and you can stop checking. It is in the nature of finite automata that this must happen in a finite number of steps.
The problem is that I still have a problem in the same form. To spot my termination criteria, I need to compare my pair of current automata with all the past automata pairs that occurred during the comparison so far. That's what has been giving me a headache.
It also turns out that that paper is relevant, but probably only takes me this far. Regular languages can form a group using the concatenation operator, and the left coset is related to the head:tail things I've been considering.
The reason I'm an idiot is because I've been imposing a far too strict termination condition, and I should have known it, because it's not that unusual an issue WRT automata algorithms.
I don't need to stop at the first recurrence of an automata pair. I can continue until I find a more easily detected recurrence - one that has some structural equivalence as well as logical equivalence. So long as my derive-a-tail-automaton algorithm is sane (and especially if I minimise and do other cleanups at each step) I will not generate an infinite sequence of equivalent-but-different-looking automata pairs during the comparison. The only sources of variation in structure are the original two automata and the tail automaton algorithm, both of which are finite.
The point is that it doesn't matter that much if I compare too many lexical terms - I will still get the correct result, and while I will terminate a little later, I will still terminate in finite time.
This should mean that I can use an unreliable recurrence detection (allowing some false negatives) using a hash or ordered comparison that is sensitive to the structure of the automata. That's a simpler problem than the structure-insensitive comparison, and I think it's the key that I need.
Of course there's still the issue of performance. A linear search using a standard equivalence algorithm might be a faster approach, based on the issues involved here. Certainly I would expect this comparison to be a less efficient equivalence test than existing algorithms, as it is doing more work - lexical ordering of the non-equivalent cases. The real issue is the overall efficiency of a key-based search, and that is likely to need some headache-inducing analysis. I'm hoping that the fact that non-equivalent automata will tend to compare quickly (detecting a difference in the first few steps, like traditional string comparisons) will make this a practical approach.
Also, if I reach a point where I suspect equivalence, I could use a standard equivalence algorithm to check. If that check fails, I just continue comparing for the ordering where I left off, without needing to check for the tail language recurring - I know that I will find a difference in a finite number of steps.
If all you can do is == or !=, then I think you have to check every set member before adding another one. This is slow. (Edit: I guess you already know this, given the title of your question, even though you go on about comparison functions to directly compare two finite automata.)
I tried to do that with phylogenetic trees, and it quickly runs into performance problems. If you want to build large sets without duplicates, you need a way to transform to a canonical form. Then you can check a hash, or insert into a binary tree with the string representation as a key.
Another researcher who did come up with a way to transform a tree to a canonical rep used Patricia trees to store unique trees for duplicate-checking.