next up previous contents index
Next: Finite State Transducers Up: Basic Definitions Previous: Basic Definitions

Finite State Automata


A finite-state automaton is a device that can be in one of a finite number of states . In certain conditions, it can switch to another state. This is called a transition . When the automaton starts working (when it is switched on), it can be in one of its initial states. There is also another important subset of states of the automaton: the final states. If the automaton is in a final state when it stops working, it is said to accept its input. The input is a sequence of symbols. The interpretation of the symbols depends on the application; they usually represent events, or can be interpreted as ``the event that a particular data became available''. The symbols must come from a finite set of symbols, called the alphabet . If a particular symbol in a particular state triggers a transition from that state to another one, that transition is labeled with that symbol. The labels of transitions can contain one particular symbol that is not in the alphabet. A transition is labeled with tex2html_wrap_inline4652 (not present in the alphabet) if it can be traversed with no input symbol.

It is convenient to present automata as directed graphs. The vertices denote states. They are portrayed as small circles. The transitions form the edges - arcs with arrows pointing from the source state (the state where the transition originates) to the target state. They are labeled with symbols. Unless it is clear from the context, the initial states have short arrows that point to them from ``nowhere''. The final states are represented as two concentric circles.

Figure 2.1: Sample finite-state automaton

The example automaton (fig. 2.1) contains 6 states labeled A, B, C, D, E and F. The state labeled A is initial. The states A, C, and D are final. The automaton contains no cycles, as it is impossible to leave a state, follow some transitions, and return in this way to the same original state.

Starting in one of the initial states, the automaton processes a sequence of input symbols. In a given state, it checks if the next input symbol matches any of the labels of the transitions that go out of the state. If so, the automaton follows it to the target state of the transition. If tex2html_wrap_inline4652 is among the labels, the transition it labels is followed without reading the input. There may be more than one transition with the same label, or a transition with a matching label and a transition labeled with tex2html_wrap_inline4652. In that case, the decision which path to choose is arbitrary. If there is no transition that matches the input symbol (in particular, if there is no tex2html_wrap_inline4652-transition), the automaton stops and rejects the input. The input is accepted when all input is read and matched by transitions and the automaton is in a final state.

A set of sequences of symbols gathered by pursuing all paths from initial states to the final ones is called the language   accepted by the automaton. E.g. the example automaton (fig. 2.1) accepts the language tex2html_wrap_inline4666. The symbol tex2html_wrap_inline4652, meaning an empty input, is accepted because the initial state A is also the final one. a is accepted, because there is a transition from A to C labeled with a, and C is a final state. The input bc is accepted, because the control in the automaton can go from A to F on a, and then it can follow the tex2html_wrap_inline4652-labeled transition to B, and then a transition labeled c to D, which is a final state. The sequences: {ad, ae, af, ba} are all rejected, because E is not a final state. The language of the automaton in fig. 2.1 is finite, because the automaton contains no cycles.

An automaton that has no tex2html_wrap_inline4652-labeled transitions is called tex2html_wrap_inline4652-free. An automaton that has one initial state, is tex2html_wrap_inline4652-free, and for each state there is at most one transition labeled with the same symbol is called a deterministic finite-state automaton  . The automaton from fig. 2.1 can be transformed into the deterministic automaton from fig. 2.2.

Figure 2.2: Deterministic version of the sample automaton

A state is said to be reachable    from another state iff there is a sequence of transitions leading to the state from that other state. An automaton is said to be start-useful    iff every state is reachable from any of its initial states. An automaton is useful    iff it is start-useful and for every state there is at least one final state reachable from that state. The automaton from the fig. 2.2 is not useful; its useful version is presented in fig. 2.3. That version accepts the same language.

Figure 2.3: Useful version of the sample deterministic automaton

An automaton is said to be acyclic    when it contains no cycles, i.e. it is not possible to arrive at the same state twice when following the transitions. All automata in figures presented in this dissertation are acyclic.

Formally, an automaton  is a 5-tuple tex2html_wrap_inline4678, where Q is a set of states, tex2html_wrap_inline4680 is a set of initial (or starting) states, tex2html_wrap_inline4682 is a set of final states, tex2html_wrap_inline4684 is the alphabet , and tex2html_wrap_inline4686 is a partial mapping tex2html_wrap_inline4688 denoting transitions. We extend the tex2html_wrap_inline4686 mapping to tex2html_wrap_inline4692. The size of the automaton, |M|, is equal to the number of states |Q|. We define the language   accepted by the automaton M to be:


Let the mapping tex2html_wrap_inline4700 be the right language   of a state q in M (the set of all strings over tex2html_wrap_inline4706 that will take one from state q to any final state of M using the extended transition relation tex2html_wrap_inline4710):


The consequence of the Myhill-Nerode theorem ([HU79]) is that among many automata that accept a given language there is one (up to isomorphism) that has a minimal number of states. It is called a minimal automaton  . The automaton in fig. 2.3 is minimal.

The property of minimality of is defined as follows ([Wat93a], [Wat93b], [Wat95]):


We will use, however, an alternative definition of minimality   of tex2html_wrap_inline4712 ([Wat93a], [Wat93b]):


tex2html_wrap_inline4714   is a property specifying that all states can be reached from the start state, and it is defined as follows:


Of particular interest for the NLP community are deterministic acyclic finite state automata (DAFSA)   , because they are best suited for representing dictionaries (of various sorts). Whenever we refer to an automaton (or FSA) throughout this paper, we mean an acyclic, deterministic finite state automaton (DAFSA), unless explicitly stated otherwise. Several definitions given above can be rewritten in a simpler way: a deterministic, finite-state automaton is a 5-tuple tex2html_wrap_inline4716, where Q is a set of states, tex2html_wrap_inline4718 is the starting state, tex2html_wrap_inline4682 is a set of final states, tex2html_wrap_inline4684 is the alphabet , and tex2html_wrap_inline4686 is a partial mapping tex2html_wrap_inline4726 denoting transitions. We extend the tex2html_wrap_inline4686 mapping to tex2html_wrap_inline4730 as in [HU79]. The definitions of the language accepted by an automaton, the right language of a state, and of the tex2html_wrap_inline4732 property for DAFSA become simpler:




It is possible to use alternative definition of an FSA:  tex2html_wrap_inline4734, where the meaning of Q, tex2html_wrap_inline4684, i, and F is the same as above, and E is a relation tex2html_wrap_inline4746 - an alternative expression of the tex2html_wrap_inline4686 function from above. This alternative definition may be more useful in descriptions below.

The algorithms described in this paper operate on classical DAFSA (deterministic acyclic finite-state automata). However, our implementation uses a different kind of automata: finite-state automata with final transitions (FSA-FT)   . We define such automaton as tex2html_wrap_inline4734, where Q is the finite set of states, tex2html_wrap_inline4684 is the alphabet , tex2html_wrap_inline4756 is the set of the initial states of the automaton, tex2html_wrap_inline4758 is a set of transitions, and tex2html_wrap_inline4760 is the set of final transitions. We say that the FSA-FT accepts a word if for each symbol from the input a transition was found, and the last transition it traversed while recognizing the word was final.

We define a transformation of a classical finite-state automaton into a finite-state automaton with final transitions: for every FSA tex2html_wrap_inline4734 we find an FSA-FT tex2html_wrap_inline4764, where tex2html_wrap_inline4766, i' is the new initial state, tex2html_wrap_inline4770, and tex2html_wrap_inline4772.

Let tex2html_wrap_inline4774 to be the transitive closure of F defined by the following recursive relation:

Now we can define the language of the FSA-FT to be:


For every deterministic acyclic finite-state automaton tex2html_wrap_inline4784 that does not accept empty input (written as tex2html_wrap_inline4652), i.e. tex2html_wrap_inline4788, or tex2html_wrap_inline4790, we can find a deterministic acyclic finite-state automaton with final transitions (DAFSA-FT)    tex2html_wrap_inline4792, tex2html_wrap_inline4794, where tex2html_wrap_inline4772. We will call this transformation moving markers . It preserves the number of states and the number of transitions. Because such a transformation exists, and it is defined for the minimal automata  , we can write that tex2html_wrap_inline4798.

Minimal deterministic acyclic finite-state automata with final transitions can indeed be smaller than their classical counterparts (see fig. 2.4). The definition of minimality  for an FSA-FT is the same as for the classical FSA, but the definition of the right language   of a state in FSA-FT must change:


Figure 2.4: Classical minimal deterministic acyclic FSA and the corresponding FSA with final transitions

Note that for DAFSA-FT tex2html_wrap_inline4802. This is different from the classical automata, where tex2html_wrap_inline4804. That difference is very important. It means that the recognition of a word is moved from a final state in the classical automaton to the transition reaching that state in the FSA-FT, resulting in the removal of the empty string (tex2html_wrap_inline4652) from the right language   of that automaton. Let tex2html_wrap_inline4784 be a DAFSA (a classical one), and tex2html_wrap_inline4810 be a DAFSA-FT obtained by the transformation moving markers . We can write:


If we can find two states in the classical FSA that have the same right languages except for the empty string, we can merge the states together in a deterministic acyclic finite-state automaton with final transitions. This makes the resulting automaton smaller, i.e. it has less states and less transitions than the corresponding classical finite-state automaton.

Note that as most implementations of automata have implicit states (only transitions are represented explicitly), this new kind of automata saves space at no cost.

The automata with final transitions can be seen as Mealy's automata where the property of being final is the output label. Therefore, they can be perceived more as a notational convention than a new kind of automata. However in transducers, that property would have to be concatenated with the output labels.

next up previous contents index
Next: Finite State Transducers Up: Basic Definitions Previous: Basic Definitions

Jan Daciuk
Wed Jun 3 14:37:17 CEST 1998

Software at