Present & Past Pattern Matching


Why pattern matching?

In the beginning, computers were solely used for numerical calculations. When the hardware became more powerful in the 1950’s, scientists and scholars devised ways to also mimic higher human faculties in software, such as manipulation of symbolic mathematical formula and translation from one human language to another one. Such efforts could only succeed if a researcher was able to translate textbook algorithms into software in a comfortable and convincing way.

Since textbooks often use notational systems to concisely and precisely describe the algorithmic steps needed to transform a symbolic expression to another symbolic expression, it was desirable to use comparable notational systems in the instructions to the computer, because then it would be relatively easy to check that a computer program followed the steps as described in the textbook.



We will present here a specially devised language in which
the linguist can conveniently “tell” the computer to do
things that he wants it to do.

(Victor H. Yngve in A Programming Language for Mechanical Translation, 1958)

Textbook formulas involve some degree of abstraction. For example, if a textbook tells us that (a+b)^2 can be transformed to a^2+2*a*b+b^2, the human reader, and mutatis mutandis the computer, must understand that a and b are not to be taken literally, but that +, ^ and 2 are. Another way of saying is that (a+b)^2 is a pattern that matches e.g., (x+y)^2 and (1+e)^2, but not (x*y)^3.
A textbook explanation of how the past of English regular verbs is formed could be: [VERB+ed]. The only element in this expression that must be taken literally is ed, whereas VERB is a place-holder for a any regular verb and the remaining symbols have syntactic roles that are agreed on in the used notational system.

Instead of toggling hundreds of switches to insert machine instructions into a computer, we use a programming language in which we can express many problems and their solutions in a comprehensible and succinct way. However, there are some fields where run-of-the-mill programming languages lack the necessary expressive power. In such areas pattern expressions and an ability to match those with data could give a boost in control over the data that can be compared with the boost programming languages already gave. Yet, the idea of pattern matching doesn’t resonate well within the general software programming community. Despite almost sixty years of experience with advanced pattern matching implementations, few flavours of pattern matching did find wider application (regular expressions, query languages), and sometimes once widely known and praised ideas wither away together with the software in which those ideas were realized. Why is that so? Why are pattern matching expressions not a standard ingredient in every programming language in the same way as ‘if … then … else …’ expressions and looping constructs? Why don’t we even have a way to express pattern matching in the inter-lingua of software people, pseudo-code?

What is pattern matching supposed to do?

Programmers have become used to think of the prime carrier of information, text, as a flat sequence of symbols. Just look at a text in a language that you don’t master. The fact that computer memory is organized in a one dimensional address space may have contributed to the prevailing tendency to regard the character “string” as the default data structure for data that has not yet been analysed. So for the first analysis step, a facility to apply a pattern expression to “string” data and find matches is very useful.

When we come to analysis of data that already went through the first analysis step, the picture has changed. Now we no longer look at a string of symbols, but at a more complex structure, such as a tree structure. It is in this stage that we can begin to speak of a representation of an expression in a notational system. It is in this stage that we would like to apply patterns and see whether an algebraic expression can be simplified and check that there is subject-verb agreement in a sentence.


Pattern matching in trees is fundamental in a number of areas, including term rewriting systems functional programming languages, equational logic programming languages, code generation, structured text database systems, transformational programming systems, theorem provers, and computational biology. (Christophe Roudet, 1996)

In this context a pattern is a description of something you want to know about your data. What you want to know can be very complex, and if you want to know different things in different parts of your program, the complexities of the involved patterns may be very different. If a computer language supports pattern matching, you don’t tell the computer how to plan its actions, but leave it to the computer to find out whether the descriptions you gave it match anything. When the computer comes back with results, it has one or two things to tell you. First and foremost, it tells you whether it found what you asked for. Did anything match your description of what you wanted to know? Normally, the answer is “yes” or “no”, or “success” or “failure”, or “true” or “false”, depending on the flavour of pattern matching you are employing. In some cases this is all you wanted to know. Optionally, you may receive excerpts from the subject that match your description. Some patterns are so detailed that a yes/no answer is all you need. Other patterns have place-holders for things that you only can describe vaguely, and that you would like to have a closer look at, if found.

How can a programming language offer pattern matching capability?

Pattern matching aims at hiding algorithmic detail. The standard way to hide algorithmic detail is to divide a program’s functionality in modules. Such modules communicate by means of well-documented programming interfaces, normally a list of functions and procedures with well described, fixed parameter signatures. Indeed, some pattern matching implementations rely on importing a library and calling a library function called “match” or something similar that takes a subject (or target) and a pattern as arguments. Such solutions have to solve the problem of handing a pattern over to a function. As we saw before, patterns can have varying degrees of complexity, so one way to hand over a pattern to a function with a fixed number of parameters is to first serialize the pattern so it fits in one parameter. But then somewhere the serialized pattern has to be parsed to de-serialise it. This means that parsing the source code of your program happens in two stages, not less or more. That again means that mutual recursion is impossible to realize. Mutual recursion means that a pattern might include some “normal” program code expressing some algorithm, from which, in turn, a pattern matching operation might be launched, and so on.

We see that library implementations of pattern matching have the problem that the program has to handle two languages, not one, and that they are not both “first class citizens” in your program, but that one of the languages is embedded in the other language. For that reason, most implementations of pattern matching don’t rely on a library specialised in pattern matching, but rather define pattern matching as an integral part of the programming language. This at once puts pattern matching in a difficult position. Since the designers of popular programming languages didn’t define pattern matching (in the sense described before) as part of the language, pattern matching is stuffed away in specialised niche languages and in pet languages that tend to be regarded as “obscure”.

Underlying concepts

In the next section you’ll see a table comparing a number of pattern matching implementations on a number of criteria. These criteria are the following:

String pattern matching
— Do you see my shirt on the clothesline? —
Can (or must) the subject of the match operation be a single one-dimensional array of characters, for example a piece of text?
Tree pattern matching
— What colour has the shirt of the guy two seats to the left of Mary? —
Can (or must) the subject of the match operation be a tree structure, for example an algebraic expression or an XML document?
Find
— Is this 1) a finch, 2) a dove or 3) another magpie? —
Does the pattern matcher compare the subject or part thereof with a list of (sub-)patterns, one after one, until a matching (sub-)pattern is found? The pattern matcher may be able to keep record of the partial patterns tried thus far. When returning to a list of sub-patterns due to backtracking from a later failure, such a pattern matcher compares untried patterns until a pattern is found that matches or the list is exhausted, in which case the pattern matcher may backtrack to even earlier phases of the match operation.
Search
— Where are the eggs? —
Does the pattern matcher attempt to arrange the pattern and the subject relative to each other in such a way that the pattern matches the subject? Often this can be achieved by scanning the subject until a segment of the subject is found that matches the pattern. If the pattern matcher is able to backtrack, it will keep record of the parts of the subject that have been visited in attempts to match a segment of the pattern, and backtrack if not all segments of the pattern could be matched with a partner in the subject. In that case, the pattern matcher will go back to earlier successfully matched segments of the pattern and try to match those segments with not yet visited segments of the subject.

In the tradition of functional languages, a list is a recursive type. A one-hole context in such a type is a data type “with a hole in it” and is what you need if you want to single out an element for updating anywhere in the list. In a list with n elements a hole can be punched in n positions. Most functional languages limit support for one-hole contexts to the head of the list, thus limiting the use of a pattern to single out an element somewhere in a list. “[…] languages like OCaml, Haskell or Scala don’t give programmers variables that range over arbitrary (one-holed) contexts, only variables that range over values. In other words, in such languages you cannot pattern match at an arbitrary position in a term.” (What’s the difference between term rewriting and pattern matching?)

Non-linear pattern
— Did you see those women wearing the same cocktail dress? —
Can the pattern matcher update the pattern during a pattern match operation due to something found in the subject and thereby solve a non-linear constraint problem? If so, the pattern matcher is said to be able to handle non-linear patterns. An example is pairing arbitrary open and close tags in an XML document. (A close tag can only be paired with an open tag with the same name.)
Unfree data (Commutative pattern matching)
— Let’s see, is this a winning poker hand? —
Can the pattern matcher apply a pattern to a subject that has no specified internal order? Unfree data types have more than one representation for an abstract value. Pattern matching unfree data can be done in a few ways. One way is to first bring the data under a canonical form. Another way is to postpone the reorganization of the data until dictated by a pattern. See Active Patterns by Martin Erwig, 1996
Segment assignment (Associative pattern matching)
— The dishes between the salmon and the stuffed goose are for our guest —
Can the pattern matcher assign not only a single item, but also a series of consecutive items as a single segment to a variable? An example in string pattern matching is the assignment of the sub-string spanned by two markers to a variable. Segment assignment can be of practical use in rewriting systems, for example normalization of algebraic expressions.
Nesting
— Are there any plants in my guarden that aren’t described in the flora? —
Can pattern matching be conditional on the success of an embedded pattern matching operation that can refer to the results found so far in the outer pattern matching operation?
Free input format
— “Can you read this?” {‘And’, ‘this’, ‘?’} —
Can any data be subject to pattern matching, with no a priori syntactic restrictions? Here we don’t consider restrictions on string data (no null bytes, no invalid Unicode code-points) as true syntactic restrictions.
Backtracking
— The word before before is is not the first, but the second before
Does, if the currently evaluated sub-pattern fails to match the subject, the match evaluator re-evaluate an earlier successfully matched sub-pattern with an alternative part of the subject, repeating this until the pattern succeeds or all alternatives have been exhausted?
Rewrite
Some pattern matching systems can collect all the needed parts of a subject from which a new data structure can be composed according to rewrite rules. An example is converting an XML document to HTML. A method to rewrite a subject without the need to copy the parts of the subject that remain unchanged (e.g., by means of segment assignment) is by in-place replacement, overwriting (part of) the subject.

Comparison of pattern matching implementations

task→

language↓

String p.m. Tree p.m. Find
(call-by-pattern, dispatch on head)
Search
(Scan data)
Non-linear pattern, back-references, context sensitive patterns Unfree data (Commutative pattern matching) Segment assignment (Associative pattern matching) Nesting free input format backtrack rewrite
switch no no yes no no no no no no no no
Regular Expressions (concept:56, implemented:60s) yes no yes yes no no yes no yes yes yes
Regular Expressions, Extended (Perl) yes no yes yes yes no yes yes yes yes yes
COMIT (57) no no yes yes yes no yes no no yes yes
Schoonschip (64) no yes yes yes yes no? no yes no no? yes
Snobol (60s) yes no yes yes yes no yes yes yes yes yes
ML (70s), NPL (77), Hope (>77), Miranda (85), Haskell (90) yes? yes yes no no no no yes yes yes yes
Bracmat (88) yes yes yes yes yes no yes yes yes yes yes
TregEx+Tsurgeon no yes yes yes no no no no no ? yes
Egison ? yes yes yes yes yes ? ? ? yes ?
Rascal yes no yes yes yes no ? ? yes yes yes
Tom (2001) yes? yes yes yes yes no yes yes ? yes yes
TrafoLa-H (93) yes yes yes yes yes no no? no? ? ? yes
TREX, RELAX NG (01) no yes yes yes yes? yes no no no yes no
XPath no yes yes yes yes no no no no yes no
XSLT no yes yes yes yes no no no no yes
XQuery no yes yes yes yes no no no
LINQ yes no no yes no no no yes yes no no
Pure yes
Reduce
OmniMark yes yes no yes no yes

More information about the implementations

COMIT

COMIT is the oldest string processing and pattern matching programming language. It is slightly younger than FORTRAN and slightly older than LISP. All three languages were first implemented on IBM 704 systems. The language did not survive for as long as the other two languages, but there is a fine paper that explains the language in detail: “A programming language for mechanical translation”. Many ideas developed in COMIT were copied in SNOBOL. COMIT’s input (entered on punch cards, of course) was structured as a list of tokens, each token optionally annotated with one or more features. Tokens could be exploded in characters, so morphological analysis could be done on single words. But the author had more in mind than text analysis, because “This notational system is convenient and well adapted to a large class of problems including language translation and formal algebraic manipulation.”

Bracmat

GitHub

Documentation

Bracmat online (C-to-JavasScript translation by Emscripten )

Egison

Developed by Satoshi Egi (江木 聡志)

TrafoLa-H


[…] the user needs only specify the shape of the subtree he/she wants to select, not where or how such a subtree is to be found […] (TrafoLa-H reference manual, 1993)

Reinhold Heckmann, Georg Sander, Universität des Saarlandes 1993 TrafoLa-H Reference Manual, chapter 8.

In PROSPECTRA a formal specification is gradually transformed into an optimized executable program by the stepwise application of individual transformation rules, or of transformation scripts systematically invoking such rules. The language TrafoLa-H for writing transformation scripts and rules is designed in a functional style since transformations are functions in some tree domain, and scripts may use higher order functions constructed from transformations and function composition.

The design of the transformation language TrafoLa-H has been based on the functional languages Hope [Burstall et al. 80], SML [Milner 85], and Miranda [Turner 85], which are quite similar with respect to their functional kernel. They all admit only very restricted forms of pattern matching. Patterns may be used to match some fixed region near the root of a term and to select subtrees adjacent to this region. Thus it is neither possible to specify the region for a match, nor to refer to a subtree whose root is far from the root of the whole tree, nor to bind the context of such a subtree to a variable.

In these languages, sequences are represented as trees, and thus their treatment is always biased toward their leftmost item. Patterns only allow to select a fixed number of items at the left end of the sequence. It is impossible to directly access items from the middle or the rear of a sequence. This restricted form of pattern matching seemed to be inappropriate for a language designed for specifying transformations at arbitrary subtrees.

Thus, TrafoLa-H has been developed by increasing the power of patterns compared to Hope, SML and Miranda. Non-determinism is allowed to ease the description – the user needs only specify the shape of the subtree he/she wants to select, not where or how such a subtree is to be found (allthough this is also possible). The resulting set of solutions is handled either by selecting one and discarding all others, or by considering all solutions in an arbitrary order.

[…]

The type discipline of TrafoLa-H is based on the Hindley/Milner theory of type polymorphism ([Hindley 68], [Milner 78]) and is extended with respect to the complex operations and patterns provided by TrafoLa-H. Type polymorphism allows to define functions that work uniformly for arguments of different types. Instead of a fully determined (monomorphic) type, we use a “type pattern” with type variables that describes the structure of the values permitted as arguments. The insertion/extraction operations of TrafoLa-H require that some monomorphic types occur in several versions, i,e., are treated similarly to type variables.

LINQ

[…] general-purpose query facilities […] that apply to all sources of information, not just relational or XML data

XSLT

XSLT is a two-language system. Pattern matching is done by XPath (for XML input) or regular expressions (for text input). XPath expressions can be nested inside XSLT instructions, but not vice versa. You can get around this in XSLT 2.0 using function calls.(Comparing XSLT and XQuery)

XQuery

XQuery is fully composable: any expression can be nested inside any other. (Comparing XSLT and XQuery)