Monarch Documentation


Monarch: create declarative syntax highlighters using JSON

This document describes how to create a syntax highlighter using the Monarch library. This library allows you to specify an efficient syntax highlighter, using a declarative lexical specification (written as a JSON value). The specification is expressive enough to specify sophisticated highlighters with complex state transitions, dynamic brace matching, auto completion, other language embeddings, etc. as shown in the 'advanced' topic sections of this document. On a first read, it is safe to skip any section or paragraph marked as (Advanced) since many of the advanced features are rarely used in most language definitions. – Daan Leijen.

## Creating a language definition

A language definition is basically just a JSON value describing various properties of your language. Recognized attributes are:


ignoreCase

(optional=false, boolean) Is the language case insensitive? The regular expressions in the tokenizer use this to do case (in)sensitive matching, as well as tests in the cases construct.


defaultToken

(optional="source", string) The default token returned if nothing matches in the tokenizer. It can be convenient to set this to "invalid" during development of your colorizer to easily spot what is not matched yet.


brackets

(optional, array of bracket definitions) This is used by the tokenizer to easily define matching braces. See @brackets and bracket for more information. Each bracket definition is an array of 3 elements, or an object, describing the open brace, the close brace, and the token class. The default definition is:

[ ['{','}','delimiter.curly'],
  ['[',']','delimiter.square'],
  ['(',')','delimiter.parenthesis'],
  ['<','>','delimiter.angle'] ]


tokenizer

(required, object with states) This defines the tokenization rules – see the next section for a detailed description.

There are more attributes that can be specified which are described in the advanced attributes section later in this document.
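
To make this concrete, here is a minimal sketch of a complete language definition that combines the attributes above (the rules and token classes are illustrative, not from any real language):

{
  ignoreCase: false,
  defaultToken: 'invalid',

  brackets: [ ['{','}','delimiter.curly'],
              ['(',')','delimiter.parenthesis'] ],

  tokenizer: {
    root: [
      [/[ \t\r\n]+/, 'white'],         // whitespace first, for efficiency
      [/\/\/.*/,     'comment'],       // line comment
      [/[{}()]/,     '@brackets'],     // matched via the brackets list above
      [/\d+/,        'number'],
      [/[a-zA-Z_]\w*/, 'identifier']
    ]
  }
}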

## Creating a tokenizer

The tokenizer attribute describes how lexical analysis takes place, and how the input is divided into tokens. Each token is given a CSS class name which is used to render each token in the editor. Standard CSS token classes include:

identifier         entity           constructor
operators          tag              namespace
keyword            info-token       type
string             warn-token       predefined
string.escape      error-token      invalid
comment            debug-token
comment.doc        regexp
constant           attribute

delimiter .[curly,square,parenthesis,angle,array,bracket]
number    .[hex,octal,binary,float]
variable  .[name,value]
meta      .[content]

Note: The token classes in the third column are currently only highlighted correctly if you include the monarch.css style file (Aug 2012).


A tokenizer consists of an object that defines states. The initial state of the tokenizer is the first state defined in the tokenizer. When a tokenizer is in a certain state, only the rules in that state will be applied. All rules are matched in order, and when the first one matches, its action is used to determine the token class; no further rules are tried. Therefore, it can be important to order the rules in a way that is most efficient, i.e. whitespace and identifiers first.

(Advanced) A state is interpreted as dot (.) separated sub-states. When looking up the rules for a state, the tokenizer first tries the entire state name, and then looks at its parent until it finds a definition. For example, in our example, the states "comment.block" and "comment.foo" would both be handled by the comment rules. Hierarchical state names can be used to maintain complex lexer states, as shown for example in the section on complex embeddings.
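
As a small sketch (with hypothetical rules), the rule below enters the state @comment.block; since no comment.block state is defined, its rules are looked up in the parent comment state:

tokenizer: {
  root: [
    [/\/\*/, 'comment', '@comment.block'],   // enter sub-state 'comment.block'
    ...
  ],

  // also handles 'comment.block', since no more specific state is defined
  comment: [
    [/\*\//, 'comment', '@pop'],
    [/./,    'comment']
  ]
}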


Each state is defined as an array of rules which are used to match the input. Rules can have the following form:

[regex, action]

Shorthand for { regex: _regex_, action: _action_ }

[regex, action, next]

Shorthand for { regex: _regex_, action: _action_ } where _action_ additionally gets { next: _next_ } set.

{regex: regex, action: action }

When _regex_ matches against the current input, then _action_ is applied to set the token class. The regular expression _regex_ can be either a regular expression (using /_regex_/), or a string representing a regular expression. If it starts with a ^ character, the expression only matches at the start of a source line. The `$` can be used to match against the end of a source line.

{ include: state }

Used for nice organization of your rules and expands to all the rules defined in _state_. This is pre-expanded and has no influence on performance. Many samples include the '@whitespace' state for example.
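
For example, here is a sketch of that pattern, where whitespace and comment handling is written once and included into the root state (rules illustrative):

tokenizer: {
  root: [
    { include: '@whitespace' },   // expands to the rules of the whitespace state
    [/\d+/, 'number'],
    ...
  ],

  whitespace: [
    [/[ \t\r\n]+/, 'white'],
    [/\/\/.*/,     'comment']
  ]
}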


An action determines the resulting token class. An action can have the following forms:


_string_

Shorthand for { token: _string_ }.


[ _action1_, ..., _actionN_ ]

An array of N actions. This is only allowed when the regular expression consists of exactly N groups (i.e. parenthesized parts). Due to the way the tokenizer works, you must define the groups in such a way that all groups appear at top level and encompass the entire input. For example, we could define characters with an ASCII code escape sequence as:

[/(')(\\(?:[abnfrt]|[xX][0-9]{2}))(')/, ['string','string.escape','string']]

Note how we used a non-capturing group using (?: ) in the inner group.

{ token: tokenclass }

An object that defines the token class used with CSS rendering. Common token classes are for example 'keyword', 'comment' or 'identifier'. You can use a dot to use hierarchical CSS names, like 'type.identifier' or 'string.escape'. You can also include `$` patterns in a token class that get substituted with the matched input; these patterns are described later in the section on guards. There are also two special token classes:




"@brackets" or "@brackets._tokenclass_"

Signifies that brackets were tokenized. The token class for CSS is determined by the token class defined in the brackets attribute (together with _tokenclass_ if present). Moreover, the bracket attribute is set such that the editor matches the braces (and does auto indentation). For example:

[/[{}()\[\]]/, '@brackets']


"@rematch"

(Advanced) Backs up the input and re-invokes the tokenizer. This of course only works when a state change happens too (or we go into an infinite recursion), so this is usually used in combination with the next attribute. This can be used, for example, when you are in a certain tokenizer state and want to get out when seeing certain end markers, but don't want to consume them while in that state. See also nextEmbedded.

An action object can contain more fields that influence the state of a lexer. The following attributes are recognized:

next: state

(string) If defined it pushes the current state onto the tokenizer stack and makes _state_ the current state. This can be used for example to start tokenizing a block comment:

['/\\*', 'comment', '@comment' ]

Note that this is a shorthand for

{ regex: '/\\*', action: { token: 'comment', next: '@comment' } }

Here the matched /* is given the "comment" token class, and the tokenizer proceeds with matching the input using the rules in state @comment.

There are a few special states that can be used for the next attribute:


"@pop"

Pops the tokenizer stack to return to the previous state. This is used for example to return from block comment tokenizing after seeing the end marker:

['\\*/', 'comment', '@pop']


"@push"

Pushes the current state and continues in the current state. Nice for doing nested block comments when seeing a comment begin marker, i.e. in the @comment state, we can do:

['/\\*', 'comment', '@push']


"@popall"

Pops everything from the tokenizer stack and returns to the top state. This can be used during recovery to 'jump' back to the initial state from a deep nesting level.
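
Putting `@push` and `@pop` together, a sketch of a complete state for nested block comments could look like this:

comment: [
  [/[^\/*]+/, 'comment'],           // consume anything that cannot start or end a comment
  [/\/\*/,    'comment', '@push'],  // nested comment begins: push this state
  [/\*\//,    'comment', '@pop'],   // comment ends: pop back to the previous state
  [/[\/*]/,   'comment']            // a stray '/' or '*' inside the comment
],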

switchTo: state

(Advanced) Switch to _state_ without changing the stack.

goBack: number

(Advanced) Back up the input by _number_ characters.
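
For example, a hypothetical sketch that tokenizes an identifier directly followed by an open parenthesis as a function name, and backs up one character so that the parenthesis itself is handled by a later rule:

[/[a-zA-Z_]\w*\(/, { token: 'identifier.function', goBack: 1 }]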

bracket: kind

(Advanced) The _kind_ can be either '@open' or '@close'. This signifies that a token is either an open or close brace. This attribute is set automatically if the token class is @brackets. The editor uses the bracket information to show matching braces (where an open bracket matches with a close bracket if their token classes are the same). Moreover, when a user opens a new line the editor will do auto indentation on open braces. Normally, this attribute does not need to be set if you are using the brackets attribute and it is only used for complex brace matching. This is discussed further in the next section on advanced brace matching.

nextEmbedded: langId or '@pop'

(Advanced) Signifies to the editor that this token is followed by code in another language specified by the _langId_, e.g. javascript. Internally, our syntax highlighter keeps tokenizing the source until it finds an ending sequence. At that point, you can use nextEmbedded with a '@pop' value to pop out of the embedded mode again. Usually, we need to use a next attribute too, to switch to a state where we can tokenize the foreign code. As an example, here is how we could support CSS fragments in our language:

root: [
  [/<style\s*>/,   { token: 'keyword', bracket: '@open'
                   , next: '@css_block', nextEmbedded: 'text/css' }],

  [/<\/style\s*>/, { token: 'keyword', bracket: '@close' }],
  ...
],

css_block: [
  [/[^"<]+/, ''],
  [/<\/style\s*>/, { token: '@rematch', next: '@pop', nextEmbedded: '@pop' }],
  [/"/, 'string', '@string' ],
  [/</, '']
],

Note how we switch to the css_block state to tokenize the CSS fragment. Inside the CSS block we use the @string state to tokenize CSS strings, such that we never end the CSS block inside a string. When we find the closing tag, we also `"@pop"` the state to get back to normal tokenization. Finally, we need to `"@rematch"` the token (in the `root` state) since the editor ignores our token classes until we actually exit the embedded mode. See also a later section on complex dynamic embeddings. (Bug: you can only start an embedded section if you consume characters at the start of the embedded block (like consuming the `<style>` tag) (Aug 2012))

log: _message_

Used for debugging. Logs `_message_` to the console window in the browser (press F12 to see it). This can be useful to see if a certain action is executing. For example:

[/\d+/, { token: 'number', log: 'found number $0 in state $S0' } ]

{ cases: { _guard1_: _action1_, ..., _guardN_: _actionN_ } }

The final kind of action object is a cases statement. A cases object contains an object where each field functions as a guard. Each guard is applied to the matched input and as soon as one of them matches, the corresponding action is applied. Note that since these are actions themselves, cases can be nested. Cases are used for efficiency: for example, we match for identifiers and then test whether the identifier is possibly a keyword or builtin function:

[/[a-z_\$][a-zA-Z0-9_\$]*/,
  { cases: { '@typeKeywords': 'keyword.type'
           , '@keywords':     'keyword'
           , '@default':      'identifier' } } ]

The guards can consist of:


"@_keywords_"

The attribute _keywords_ must be defined in the language object and consist of an array of strings. The guard succeeds if the matched input matches any of the strings. (Note: all cases are pre-compiled and the list is tested using efficient hash maps.) Advanced: if the attribute refers to a single string (instead of an array) it is compiled to a regular expression which is tested against the matched input.
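
For example, the identifier rule shown earlier relies on attributes like these being defined in the language object (the keyword lists are illustrative):

typeKeywords: ['bool','int','double','string'],
keywords:     ['if','then','else','while','return'],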


"@default"

(or `"@"` or `""`) The default guard that always succeeds.


"@eos"

Succeeds if the matched input has reached the end of the line.


_regex_

If the guard does not start with a `@` (or `$`) character, it is interpreted as a regular expression that is tested against the matched input (the expression needs to match the entire input).
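
For example, a sketch that uses a plain regular expression guard to single out boolean literals among identifiers:

[/[a-z]\w*/, { cases: { '(true|false)': 'keyword.bool'
                      , '@default':     'identifier' } }]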

(Advanced) In general, a guard has the form `[_pat_][_op_]_match_`, with an optional pattern and operator (which default to `$#` and `~` respectively). The pattern can be any of:


$#

(default) The matched input (or the group that matched when the action is an array).


$_n_

The _n_th group of the matched input, or the entire matched input for `$0`.


$S_n_

The _n_th part of the state, i.e. `$S2` returns `foo` in a state `@tag.foo`. Use `$S0` for the full state name.

The above patterns can actually occur in many attributes and are automatically expanded. Attributes where these patterns expand are `token`, `next`, `nextEmbedded`, `switchTo`, and `log`. Also, these patterns are expanded in the `_match_` part of a guard.

The guard operator `_op_` and `_match_` can be any of:

~_regex_ or !~_regex_

(default for `_op_` is `~`) Tests `_pat_` against the regular expression or its negation.

@_attribute_ or !@_attribute_

Tests whether `_pat_` is an element (`@`), or not an element (`!@`), of an array of strings defined by `_attribute_`.

==_str_ or !=_str_

Tests if `_pat_` is equal or unequal to the given string `_str_`.

For example, here is how to check if the second group is not equal to `foo` or `bar`: `$2!~foo|bar`, or if the first captured group equals the name of the current lexer state: `$1==$S0`.

If both `_op_` and `_match_` are empty and there is just a pattern, then the guard succeeds if the pattern is non-empty. This can be used for example to improve efficiency. In the Koka language, an upper case identifier followed by a dot is a module name, but without the following dot it is a constructor. This can be matched in one go using:

[/([A-Z]\w*)(\.?)/, { cases: { '$2'      : ['identifier.namespace','']
                             , '@default': 'identifier.constructor' }} ]

## Advanced: complex brace matching

This section gives some advanced examples of brace matching using the `bracket` attribute in an action. Usually, we can match braces just using the `brackets` attribute in combination with the `@brackets` token class. But sometimes we need more fine-grained control. For example, in Ruby many declarations like `class` or `def` are ended with the `end` keyword. To make them match, we give them all the same token class (`keyword.decl`) and use bracket `@close` for `end` and `@open` for all declarations:

declarations: ['class','def','module', ... ]

tokenizer: {
  root: [
    [/[a-zA-Z]\w*/, { cases: { 'end'          : { token: 'keyword.decl', bracket: '@close' }
                             , '@declarations': { token: 'keyword.decl', bracket: '@open' }
                             , '@keywords'    : 'keyword'
                             , '@default'     : 'identifier' } } ],
    ...
  ]
}

Note that to make _outdentation_ work on the `end` keyword, you would also need to include the `'d'` character in the `outdentTriggers` string.

Another example of complex matching is HTML, where we would like to match a starting tag, like `<div>`, with its ending tag `</div>`. To make an end tag only match its specific open tag, we need to dynamically generate token classes that make the braces match correctly. This can be done using `$` expansion in the token class:

tokenizer: {
  root: [
    [/<(\w+)(>?)/,   { token: 'tag-$1', bracket: '@open' }],
    [/<\/(\w+)\s*>/, { token: 'tag-$1', bracket: '@close' }],
    ...
  ]
}

Note how we captured the actual tag name as a group and used that to generate the right token class. Again, to make outdentation work on the closing tag, you would also need to include the `'>'` character in the `outdentTriggers` string.

A final advanced example of brace matching is Visual Basic, where declarations like `Structure` are matched with end declarations as `End Structure`. Just like HTML, we need to dynamically set token classes so that an `End Enum` does not match with a `Structure`. A tricky part is that we now need to match multiple tokens at once: we match a construct like `End Enum` as one closing token, but non-declaration endings, like `End Foo`, as three tokens:

decls: ["Structure","Class","Enum","Function",...],

tokenizer: {
  root: [
    [/(End)(\s+)([A-Z]\w*)/, { cases: { '$3@decls': { token: 'keyword.decl-$3', bracket: '@close' }
                                      , '@default': ['keyword','white','identifier.invalid'] }}],
    [/[A-Z]\w*/, { cases: { '@decls'  : { token: 'keyword.decl-$0', bracket: '@open' }
                          , '@default': 'constructor' } }],
    ...
  ]
}

Note how we used `$3` to first test if the third group is a declaration, and then use `$3` in the `token` attribute to generate a declaration specific token class (so we match correctly). Also, to make outdentation work correctly, we would need to include all the ending characters of the declarations in the `outdentTriggers` string.

## Advanced: more attributes on the language definition

Here are more advanced attributes that can be defined in the language definition:


tokenPostfix

(optional=`"." + name`, string) Optional postfix attached to all returned tokens. By default this attaches the language name so in the CSS you can refer to your specific language. For example, for the Java language, we could use `identifier.java` to target all Java identifiers specifically in CSS.
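
For example, a sketch for a hypothetical Java definition:

tokenPostfix: '.java',   // a rule returning 'identifier' is rendered with CSS class 'identifier.java'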


start

(optional, string) The start state of the tokenizer. By default, this is the first entry in the tokenizer attribute.


outdentTriggers

(optional, string) Optional string that defines characters that when typed could cause _outdentation_. This attribute is only used when using advanced brace matching in combination with the `bracket` attribute. By default it always includes the last characters of the closing brackets in the `brackets` list. Outdentation happens when the user types a closing bracket word on a line that starts with only white space. If the closing bracket matches an open bracket, the line is indented to the same amount as that bracket. Usually, this causes the bracket to outdent. For example, in the Ruby language, the `end` keyword would match with an open declaration like `def` or `class`. To make outdentation happen though, we would need to include the `d` character in the `outdentTriggers` attribute so it is checked when the user types `end`:

outdentTriggers: 'd',

## Über Advanced: complex embeddings with dynamic end tags

Many times, embedding other language fragments is easy as shown in the earlier CSS example, but sometimes it is more dynamic. For example, in HTML we would like to start embeddings on a `script` tag and `style` tag. By default, the script language is `javascript` but if the `type` attribute is set, that defines the script language mime type. First, we define general tag open and close rules:

[/<(\w+)/,       { token: 'tag.tag-$1', bracket: '@open', next: '@tag.$1' }],
[/<\/(\w+)\s*>/, { token: 'tag.tag-$1', bracket: '@close' } ],

Here, we use the `$1` to capture the open tag name in both the token class and the next state. By putting the tag name in the token class, the brace matching will match and auto indent corresponding tags automatically. Next we define the `@tag` state that matches content within an HTML tag. Because the open tag rule sets the next state to `@tag._tagname_`, this will match the `@tag` state due to dot separation.

tag: [
  [/[ \t\r\n]+/, 'white'],
  [/(type)(\s*=\s*)(['"])([^'"]+)(['"])/,
    [ 'attribute', 'delimiter', 'string',   // todo: should match up quotes properly
      { token: 'string', switchTo: '@tag.$S2.$4' },
      'string' ] ],
  [/(\w+)(\s*=\s*)(['"][^'"]+['"])/, ['keyword', 'delimiter', 'string' ]],
  [/>/, { cases: { '$S2==style' : { token: 'delimiter', switchTo: '@embedded.$S2', nextEmbedded: 'text/css' }
                 , '$S2==script': { cases: { '$S3'     : { token: 'delimiter', switchTo: '@embedded.$S2', nextEmbedded: '$S3' }
                                           , '@default': { token: 'delimiter', switchTo: '@embedded.$S2', nextEmbedded: 'javascript' } } }
                 , '@default'   : { token: 'delimiter', next: '@pop' } } }],
  [/[^>]/, '']   // catch all
],

Inside the `@tag._tagname_` state, we access the _tagname_ through `$S2`. This is used to test if the tag name matches a script or style tag, in which case we start an embedded mode. We also need `switchTo` here since we do not want to get back to the `@tag` state at that point. Also, on a `type` attribute we extend the state to `@tag._tagname_._mimetype_`, which allows us to access the mime type as `$S3` if it was set. This is used to determine the script language (or default to `javascript`). Finally, the script and style rules start an embedded mode and switch to a state `@embedded._tagname_`. The tag name is included in the state so we can scan for exactly a matching end tag:

embedded: [
  [/[^"<]+/, ''],
  [/<\/(\w+)\s*>/, { cases: { '$1==$S2' : { token: '@rematch', next: '@pop', nextEmbedded: '@pop' }
                            , '@default': '' } }],
  [/"/, 'string', '@string' ],
  [/</, '']
],

Only when we find a matching end tag (outside a string), i.e. when `$1==$S2`, do we pop the state and exit the embedded mode. Note that we need `@rematch` since the editor ignores our token classes until we actually exit the embedded mode (and we handle the close tag again in the `root` state).