
Monarch Documentation

Janry · 2017-10-20

Monarch: create declarative syntax highlighters using JSON

This document describes how to create a syntax highlighter using the Monarch library. This library allows you to specify an efficient syntax highlighter, using a declarative lexical specification (written as a JSON value). The specification is expressive enough to specify sophisticated highlighters with complex state transitions, dynamic brace matching, auto completion, other language embeddings, etc. as shown in the 'advanced' topic sections of this document. On a first read, it is safe to skip any section or paragraph marked as (Advanced) since many of the advanced features are rarely used in most language definitions. – Daan Leijen.

Creating a language definition

A language definition is basically just a JSON value describing various properties of your language. Recognized attributes are:

ignoreCase
(optional=false, boolean) Is the language case insensitive? The regular expressions in the tokenizer use this to do case (in)sensitive matching, as well as tests in the cases construct.
defaultToken
(optional="source", string) The default token returned if nothing matches in the tokenizer. It can be convenient to set this to "invalid" during development of your colorizer to easily spot what is not matched yet.
brackets
(optional, array of bracket definitions) This is used by the tokenizer to easily define matching braces. See @brackets and bracket for more information. Each bracket definition is an array of 3 elements, or object, describing the open brace, the close brace, and the token class. The default definition is: [ ['{','}','delimiter.curly'], ['[',']','delimiter.square'], ['(',')','delimiter.parenthesis'], ['<','>','delimiter.angle'] ]
tokenizer
(required, object with states) This defines the tokenization rules – see the next section for a detailed description.

There are more attributes that can be specified which are described in the advanced attributes section later in this document.
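
As a rough illustration of the attributes listed above, a minimal definition might look like the following sketch. The toy language, its keyword list, and the name `myLanguage` are invented for this example; only the attribute names and the token classes come from this document.

```javascript
// Hypothetical toy language; only the attribute names come from the text above.
const myLanguage = {
  ignoreCase: false,          // the default, written out for clarity
  defaultToken: 'invalid',    // makes unmatched input easy to spot during development
  brackets: [
    ['{', '}', 'delimiter.curly'],
    ['(', ')', 'delimiter.parenthesis'],
  ],
  keywords: ['if', 'else', 'while'],   // an extra attribute, referenced as '@keywords'
  tokenizer: {
    // The first state listed is the initial state.
    root: [
      [/[ \t\r\n]+/, 'white'],
      [/[a-z_]\w*/, { cases: { '@keywords': 'keyword', '@default': 'identifier' } }],
      [/\d+/, 'number'],
      [/[{}()]/, '@brackets'],
    ],
  },
};
```

Setting defaultToken to 'invalid' is only meant for development; the remaining constructs used in the rules are explained in the sections that follow.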

Creating a tokenizer

The tokenizer attribute describes how lexical analysis takes place, and how the input is divided into tokens. Each token is given a CSS class name which is used to render each token in the editor. Standard CSS token classes include:

identifier         entity           constructor
operators          tag              namespace
keyword            info-token       type
string             warn-token       predefined
string.escape      error-token      invalid
comment            debug-token
comment.doc        regexp
constant           attribute

delimiter .[curly,square,parenthesis,angle,array,bracket]
number    .[hex,octal,binary,float]
variable  .[name,value]
meta      .[content]

Note: The token classes in the third column are currently only highlighted correctly if you include the monarch.css style file (Aug 2012).

States

A tokenizer consists of an object that defines states. The initial state of the tokenizer is the first state defined in the tokenizer. When a tokenizer is in a certain state, only the rules in that state will be applied. All rules are matched in order and when the first one matches its action is used to determine the token class. No further rules are tried. Therefore, it can be important to order the rules in a way that is most efficient, i.e. whitespace and identifiers first.

(Advanced) A state is interpreted as dot (.) separated sub-states. When looking up the rules for a state, the tokenizer first tries the entire state name, and then looks at its parent until it finds a definition. For example, in our example, the states "comment.block" and "comment.foo" would both be handled by the comment rules. Hierarchical state names can be used to maintain complex lexer states, as shown for example in the section on complex embeddings.
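
The parent lookup described above can be pictured with a small standalone sketch. This is not the library's code; `findRules` and the sample `tokenizer` object are hypothetical names used only to show the idea of falling back from "comment.block" to "comment".

```javascript
// Sketch only: resolve a dotted state name to the nearest defined ancestor.
function findRules(tokenizer, state) {
  let name = state.startsWith('@') ? state.slice(1) : state;
  while (name.length > 0) {
    if (Object.prototype.hasOwnProperty.call(tokenizer, name)) {
      return tokenizer[name];
    }
    const dot = name.lastIndexOf('.');
    if (dot < 0) break;          // no parent left to try
    name = name.slice(0, dot);   // drop the last sub-state
  }
  return null;
}

// Hypothetical tokenizer with a single 'comment' state.
const tokenizer = {
  root: [[/\/\*/, 'comment', '@comment.block']],
  comment: [[/\*\//, 'comment', '@pop'], [/./, 'comment']],
};

const blockRules = findRules(tokenizer, '@comment.block'); // the 'comment' rules
const fooRules = findRules(tokenizer, 'comment.foo');      // the 'comment' rules
const missing = findRules(tokenizer, 'string.double');     // null, nothing defined
```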

Rules

Each state is defined as an array of rules which are used to match the input. Rules can have the following form:

[regex, action]
Shorthand for { regex: _regex_, action: _action_ }
[regex, action, next]
Shorthand for { regex: _regex_, action: _action_{next: _next_} }
{regex: regex, action: action }
When _regex_ matches against the current input, then _action_ is applied to set the token class. The regular expression _regex_ can be either a regular expression (using /_regex_/), or a string representing a regular expression. If it starts with a ^ character, the expression only matches at the start of a source line. The $ can be used to match against the end of a source line. Note that the tokenizer is not called when the end of the line is already reached, and the empty pattern /$/ will therefore never match (but see ['@eos'](https://microsoft.github.io/monaco-editor/monarch.html#@eos) too). Inside a regular expression, you can reference a string attribute named _attr_ as @_attr_, which is automatically expanded. This is used in the standard example to share the regular expression for escape sequences using '@escapes' inside the regular expression for characters and strings. Regular expression primer: common regular expression escapes we use are \d for [0-9], \w for [a-zA-Z0-9_], and \s for [ \t\r\n]. The notation _regex_{n} stands for n occurrences of _regex_. Also, we use (?=_regex_) for non-consuming 'followed by _regex_', (?!_regex_) for 'not followed by', and (?:_regex_) for a non-capturing group (i.e. cannot use $n to refer to it).
{ include: state }
Used for nice organization of your rules and expands to all the rules defined in _state_. This is pre-expanded and has no influence on performance. Many samples include the '@whitespace' state for example.
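
To make the attribute expansion inside regular expressions concrete, here is a minimal sketch of how a reference such as @escapes could be spliced into a pattern. The helper `expandRegex` and the sample `escapes` pattern are assumptions made for this illustration; the library's own expansion may differ in its details.

```javascript
// Sketch only: expand '@attr' references inside a regular expression source.
function expandRegex(lang, regex) {
  const source = typeof regex === 'string' ? regex : regex.source;
  const expanded = source.replace(/@(\w+)/g, (whole, attr) => {
    const value = lang[attr];
    if (typeof value === 'string') return '(?:' + value + ')';
    if (value instanceof RegExp) return '(?:' + value.source + ')';
    return whole;   // unknown attribute: leave the text untouched
  });
  return new RegExp(expanded, lang.ignoreCase ? 'i' : '');
}

// Hypothetical language object with a shared escape-sequence pattern.
const lang = {
  ignoreCase: false,
  escapes: /\\(?:[abfnrtv\\"']|x[0-9A-Fa-f]{2})/,
};

// A character literal: a quote, one escape sequence or plain character, a quote.
const charLiteral = expandRegex(lang, /'(?:@escapes|[^\\'])'/);
```

Wrapping the inserted source in a non-capturing group keeps the numbering of the surrounding capture groups unchanged.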

Actions

An action determines the resulting token class. An action can have the following forms:

string
Shorthand for { token: _string_ }.
[action1,...,actionN]
An array of N actions. This is only allowed when the regular expression consists of exactly N groups (i.e. parenthesized parts). Due to the way the tokenizer works, you must define the groups in such a way that all groups appear at top-level and encompass the entire input. For example, we could define characters with an ascii code escape sequence as: [/(')(\\(?:[abnfrt]|[xX][0-9]{2}))(')/, ['string','string.escape','string']] Note how we used a non-capturing group using (?: ) in the inner group.
{ token: tokenclass }
An object that defines the token class used with CSS rendering. Common token classes are for example 'keyword', 'comment' or 'identifier'. You can use a dot to use hierarchical CSS names, like 'type.identifier' or 'string.escape'. You can also include $ patterns that are substituted with a captured group from the matched input or the tokenizer state. The patterns are described in the [guard section](https://microsoft.github.io/monaco-editor/monarch.html#pattern) of this document. There are some special token classes:
"@brackets" or "@brackets._tokenclass_"
Signifies that brackets were tokenized. The token class for CSS is determined by the token class defined in the [brackets](https://microsoft.github.io/monaco-editor/monarch.html#brackets) attribute (together with _tokenclass_ if present). Moreover, the [bracket](https://microsoft.github.io/monaco-editor/monarch.html#bracket) attribute is set such that the editor matches the braces (and does auto indentation). For example: [/[{}()\[\]]/, '@brackets']
"@rematch"
(Advanced) Backs up the input and re-invokes the tokenizer. This of course only works when a state change happens too (or we go into an infinite recursion), so this is usually used in combination with the next attribute. This can be used for example when you are in a certain tokenizer state and want to get out when seeing certain end markers but don't want to consume them while being in that state. See also [nextEmbedded](https://microsoft.github.io/monaco-editor/monarch.html#nextEmbedded).
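
The requirement on array actions described above (exactly N top-level groups that together cover the whole match) can be checked with a few lines of plain JavaScript. This is a sketch for illustration; `splitByGroups` is a hypothetical helper, not part of the library.

```javascript
// Sketch only: pair each top-level capture group with its token class.
function splitByGroups(regex, actions, input) {
  const match = regex.exec(input);
  if (match === null) return null;
  const groups = match.slice(1);
  if (groups.length !== actions.length) {
    throw new Error('need exactly one action per group');
  }
  // The groups must cover the whole match, otherwise input would be lost.
  if (groups.join('') !== match[0]) {
    throw new Error('groups do not encompass the entire match');
  }
  return groups.map((text, i) => ({ text, token: actions[i] }));
}

// The character rule from the text, applied to the three-token input '\n'.
const charRule = /(')(\\(?:[abnfrt]|[xX][0-9]{2}))(')/;
const tokens = splitByGroups(charRule, ['string', 'string.escape', 'string'], "'\\n'");
```

Because the inner alternation uses (?: ), it adds no group of its own, so the three groups line up with the three token classes.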

An action object can contain more fields that influence the state of a lexer. The following attributes are recognized:

next: state
(string) If defined it pushes the current state onto the tokenizer stack and makes _state_ the current state. This can be used for example to start tokenizing a block comment: ['/\\*', 'comment', '@comment' ] Note that this is a shorthand for { regex: '/\\*', action: { token: 'comment', next: '@comment' } } Here the matched /* is given the "comment" token class, and the tokenizer proceeds with matching the input using the rules in state @comment. There are a few special states that can be used for the next attribute:
"@pop"
Pops the tokenizer stack to return to the previous state. This is used for example to return from block comment tokenizing after seeing the end marker: ['\\*/', 'comment', '@pop']
"@push"
Pushes the current state and continues in the current state. Nice for doing nested block comments when seeing a comment begin marker, i.e. in the @comment state, we can do: ['/\\*', 'comment', '@push']
"@popall"
Pops everything from tokenizer stack and returns to the top state. This can be used during recovery to 'jump' back to the initial state from a deep nesting level.

switchTo: state
(Advanced) Switch to _state_ without changing the stack.
goBack: number
(Advanced) Back up the input by _number_ characters.
bracket: kind
(Advanced) The _kind_ can be either '@open' or '@close'. This signifies that a token is either an open or close brace. This attribute is set automatically if the token class is @brackets. The editor uses the bracket information to show matching braces (where an open bracket matches with a close bracket if their token classes are the same). Moreover, when a user opens a new line the editor will do auto indentation on open braces. Normally, this attribute does not need to be set if you are using the brackets attribute and it is only used for complex brace matching. This is discussed further in the next section on advanced brace matching.
nextEmbedded: langId or '@pop'
(Advanced) Signifies to the editor that this token is followed by code in another language specified by the _langId_, for example javascript. Internally, our syntax highlighter keeps tokenizing the source until it finds an ending sequence. At that point, you can use nextEmbedded with a '@pop' value to pop out of the embedded mode again. Usually, we need to use a next attribute too to switch to a state where we can tokenize the foreign code. As an example, here is how we could support CSS fragments in our language:

root: [
  [/<style\s*>/,   { token: 'keyword', bracket: '@open', next: '@css_block', nextEmbedded: 'text/css' }],
  [/<\/style\s*>/, { token: 'keyword', bracket: '@close' }],
  ...
],
css_block: [
  [/[^"<]+/, ''],
  [/<\/style\s*>/, { token: '@rematch', next: '@pop', nextEmbedded: '@pop' }],
  [/"/, 'string', '@string' ],
  [/</, '']
],

Note how we switch to the css_block state for tokenizing the CSS source. Also inside the CSS we use the @string state to tokenize CSS strings such that we do not stop the CSS block when we find </style> inside a string. When we find the closing tag, we also "@pop" the state to get back to normal tokenization. Finally, we need to "@rematch" the token (in the root state) since the editor ignores our token classes until we actually exit the embedded mode. See also a later section on complex dynamic embeddings. (Bug: you can only start an embedded section if you consume characters at the start of the embedded block (like consuming the <style> tag) (Aug 2012))
log: message
Used for debugging. Logs _message_ to the console window in the browser (press F12 to see it). This can be useful to see if a certain action is executing. For example: [/\d+/, { token: 'number', log: 'found number $0 in state $S0' } ]
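
The effect of next, its special values, and switchTo on the tokenizer stack can be modelled with a plain array. The following is a sketch of that model only, assuming that switchTo replaces the current state in place; `applyTransition` is a hypothetical name and not a library function.

```javascript
// Sketch only: how next / switchTo values could act on a state stack.
// The last element of the array is the current state; index 0 is the initial state.
function applyTransition(stack, action) {
  const result = stack.slice();
  if (action.switchTo) {
    // Replace the current state; the stack depth stays the same.
    result[result.length - 1] = action.switchTo.replace(/^@/, '');
  } else if (action.next === '@pop') {
    if (result.length > 1) result.pop();
  } else if (action.next === '@push') {
    result.push(result[result.length - 1]);
  } else if (action.next === '@popall') {
    result.length = 1;   // back to the initial state
  } else if (action.next) {
    result.push(action.next.replace(/^@/, ''));
  }
  return result;
}

let stack = ['root'];
stack = applyTransition(stack, { token: 'comment', next: '@comment' }); // root, comment
stack = applyTransition(stack, { token: 'comment', next: '@push' });    // root, comment, comment
stack = applyTransition(stack, { token: 'comment', next: '@pop' });     // root, comment
const afterSwitch = applyTransition(stack, { token: '', switchTo: '@string' });
const afterPopAll = applyTransition(['root', 'a', 'b'], { token: '', next: '@popall' });
```

The nested comment case shows why '@push' is useful: each inner comment start adds one level, and each end marker removes exactly one.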

{ cases: { guard1: action1, ..., guardN: actionN } }
The final kind of action object is a cases statement. A cases object contains an object where each field functions as a guard. Each guard is applied to the matched input and as soon as one of them matches, the corresponding action is applied. Note that since these are actions themselves, cases can be nested. Cases are used for efficiency: for example, we match for identifiers and then test whether the identifier is possibly a keyword or builtin function: [/[a-z_\$][a-zA-Z0-9_\$]*/, { cases: { '@typeKeywords': 'keyword.type' , '@keywords': 'keyword' , '@default': 'identifier' } } ] The guards can consist of:
"@keywords"
The attribute _keywords_ must be defined in the language object and consist of an array of strings. The guard succeeds if the matched input matches any of the strings. (Note: all cases are pre-compiled and the list is tested using efficient hash maps). Advanced: if the attribute refers to a single string (instead of an array) it is compiled to a regular expression which is tested against the matched input.

"@default"

(or "@" or "") The default guard that always succeeds.

"@eos"

Succeeds if the matched input has reached the end of the line.
"regex"
If the guard does not start with a @ (or $) character it is interpreted as a regular expression that is tested against the matched input. Note: the _regex_ is prefixed with ^ and postfixed with $ so it must match the matched input entirely. This can be used for example to test for specific inputs. Here is an example from the Koka language which uses this to enter various tokenizer states based on the declaration:

[/[a-z](\w|\-[a-zA-Z])*/,
  { cases:{ '@keywords': {
               cases: { 'alias'              : { token: 'keyword', next: '@alias-type' }
                      , 'struct'             : { token: 'keyword', next: '@struct-type' }
                      , 'type|cotype|rectype': { token: 'keyword', next: '@type' }
                      , 'module|as|import'   : { token: 'keyword', next: '@module' }
                      , '@default'           : 'keyword' }
            }
          , '@builtins': 'predefined'
          , '@default' : 'identifier' }
  }
]

Note the use of nested cases to improve efficiency. Also, the library recognizes simple regular expressions like the ones above and compiles them efficiently. For example, the list of words type|cotype|rectype is tested using a Javascript hashmap/object.
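
The simple guard forms above (attribute lists, the default guard, and whole-input regular expressions) could be evaluated roughly as in the sketch below. It is an illustration under simplifying assumptions: `pickCase` is a hypothetical helper, '@eos' is not handled, and a plain array search stands in for the pre-compiled hash maps the library is said to use.

```javascript
// Sketch only: choose the action of the first guard that accepts the matched text.
function pickCase(lang, cases, matched) {
  for (const [guard, action] of Object.entries(cases)) {
    let ok;
    if (guard === '@default' || guard === '@' || guard === '') {
      ok = true;
    } else if (guard.startsWith('@')) {
      const attr = lang[guard.slice(1)];
      ok = Array.isArray(attr) && attr.includes(matched);
    } else {
      // Regex guards must match the matched input entirely.
      ok = new RegExp('^(?:' + guard + ')$').test(matched);
    }
    if (ok) {
      // An action may itself be a cases object, so recurse.
      return action && action.cases ? pickCase(lang, action.cases, matched) : action;
    }
  }
  return undefined;
}

// Hypothetical word lists, loosely following the Koka example.
const lang = {
  keywords: ['alias', 'struct', 'type', 'cotype', 'rectype', 'fun'],
  builtins: ['print'],
};
const cases = {
  '@keywords': { cases: {
    'alias': { token: 'keyword', next: '@alias-type' },
    'type|cotype|rectype': { token: 'keyword', next: '@type' },
    '@default': 'keyword',
  } },
  '@builtins': 'predefined',
  '@default': 'identifier',
};

const forCotype = pickCase(lang, cases, 'cotype'); // { token: 'keyword', next: '@type' }
const forFun = pickCase(lang, cases, 'fun');       // 'keyword'
const forPrint = pickCase(lang, cases, 'print');   // 'predefined'
const forOther = pickCase(lang, cases, 'typed');   // 'identifier'
```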

(Advanced) In general, a guard has the form [_pat_][_op_]_match_, with an optional pattern, and operator (which are $# and ~ by default). The pattern can be any of:

$#
(default) The matched input (or the group that matched when the action is an array).
$n
The _n_th group of the matched input, or the entire matched input for $0.
$Sn
The _n_th part of the state, for example $S2 returns foo in a state @tag.foo. Use $S0 for the full state name.

The above patterns can actually occur in many attributes and are automatically expanded. Attributes where these patterns expand are token, next, nextEmbedded, switchTo, and log. Also, these patterns are expanded in the _match_ part of a guard.
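
A small sketch may help to see how these patterns expand inside a string such as a token class or a next state. `expandPatterns` is a hypothetical helper written for this illustration; it only mirrors the substitutions described above.

```javascript
// Sketch only: substitute $#, $n and $Sn in a string such as a token class.
function expandPatterns(text, matched, groups, state) {
  const parts = state.replace(/^@/, '').split('.');
  return text.replace(/\$(#|\d+|S\d+)/g, (whole, key) => {
    if (key === '#') return matched;
    if (key[0] === 'S') {
      const n = Number(key.slice(1));
      if (n === 0) return parts.join('.');   // $S0: the full state name
      return parts[n - 1] !== undefined ? parts[n - 1] : '';
    }
    const n = Number(key);
    return groups[n] !== undefined ? groups[n] : '';   // groups[0] is the whole match
  });
}

// A token class built from the third captured group.
const vb = /(End)(\s+)([A-Z]\w*)/.exec('End Enum');
const tokenClass = expandPatterns('keyword.decl-$3', vb[0], vb, 'root');
// → 'keyword.decl-Enum'

// A state name extended with the second state part and the fourth group.
const attr = /(type)(\s*=\s*)(['"])([^'"]+)(['"])/.exec('type="text/javascript"');
const nextState = expandPatterns('@tag.$S2.$4', attr[0], attr, '@tag.script');
// → '@tag.script.text/javascript'

// A log message using the whole match and the full state name.
const message = expandPatterns('found number $0 in state $S0', '42', ['42'], 'root');
// → 'found number 42 in state root'
```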

The guard operator _op_ and _match_ can be any of:

~regex or !~regex
(default for _op_ is ~) Tests _pat_ against the regular expression or its negation.
@attribute or !@attribute
Tests whether _pat_ is an element (@), or not an element (!@), of an array of strings defined by _attribute_.
==str or !=str
Tests if _pat_ is equal or unequal to the given string _str_.

For example, here is how to check if the second group is not equal to foo or bar: $2!~foo|bar, or if the first captured group equals the name of the current lexer state: $1==$S0.

If both _op_ and _match_ are empty and there is just a pattern, then the guard succeeds if the pattern is non-empty. This can be used for example to improve efficiency. In the Koka language, an upper case identifier followed by a dot is a module name, but without the following dot it is a constructor. This can be matched for in one go using:

[/([A-Z](?:[a-zA-Z0-9_]|\-[a-zA-Z])*)(\.?)/,
  { cases: { '$2'      : ['identifier.namespace','keyword.dot']
           , '@default': 'identifier.constructor' }}
]
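
The general guard form and the bare-pattern case above could be parsed and evaluated along the following lines. This is a sketch under stated assumptions, not the library's implementation: `evalGuard` and the `ctx` object (holding the matched text, the groups, and the state name) are hypothetical, and '@eos' is left out.

```javascript
// Sketch only: parse and evaluate a guard of the form [pat][op]match.
function evalGuard(lang, guard, ctx) {
  if (guard === '@default' || guard === '@' || guard === '') return true;
  const parts = ctx.state.replace(/^@/, '').split('.');
  const resolve = (pat) => {
    if (pat === '$#') return ctx.matched;
    if (pat.startsWith('$S')) {
      const n = Number(pat.slice(2));
      return n === 0 ? parts.join('.') : (parts[n - 1] || '');
    }
    return ctx.groups[Number(pat.slice(1))] || '';
  };
  const m = /^(\$#|\$S\d+|\$\d+)?(!~|~|!@|@|==|!=)?(.*)$/.exec(guard);
  const value = resolve(m[1] || '$#');   // the pattern defaults to $#
  // A bare pattern succeeds when it expands to a non-empty string.
  if (m[1] && !m[2] && m[3] === '') return value !== '';
  const op = m[2] || '~';                // the operator defaults to ~
  // Patterns are also expanded inside the match part, as in '$1==$S0'.
  const match = m[3].replace(/\$#|\$S\d+|\$\d+/g, resolve);
  switch (op) {
    case '~':  return new RegExp('^(?:' + match + ')$').test(value);
    case '!~': return !new RegExp('^(?:' + match + ')$').test(value);
    case '@':  return (lang[match] || []).includes(value);
    case '!@': return !(lang[match] || []).includes(value);
    case '==': return value === match;
    default:   return value !== match;   // '!='
  }
}

const lang = { decls: ['Structure', 'Class', 'Enum', 'Function'] };
const endEnum = { matched: 'End Enum', groups: ['End Enum', 'End', ' ', 'Enum'], state: 'root' };
const isDecl = evalGuard(lang, '$3@decls', endEnum);   // true

const withDot = { matched: 'List.', groups: ['List.', 'List', '.'], state: 'root' };
const noDot = { matched: 'List', groups: ['List', 'List', ''], state: 'root' };
const isModule = evalGuard(lang, '$2', withDot);       // true
const isModuleToo = evalGuard(lang, '$2', noDot);      // false
```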

Advanced: complex brace matching

This section gives some advanced examples of brace matching using the bracket attribute in an action. Usually, we can match braces just using the brackets attribute in combination with the @brackets token class. But sometimes we need more fine grained control. For example, in Ruby many declarations like class or def are ended with the end keyword. To make them match, we give them all the same token class (keyword.decl) and use bracket @close for end and @open for all declarations:

declarations: ['class','def','module', ... ]

tokenizer: {
  root: [
    [/[a-zA-Z]\w*/,
      { cases: { 'end'          : { token: 'keyword.decl', bracket: '@close' }
               , '@declarations': { token: 'keyword.decl', bracket: '@open' }
               , '@keywords'    : 'keyword'
               , '@default'     : 'identifier' }
      }
    ],

Note that to make outdentation work on the end keyword, you would also need to include the 'd' character in the outdentTriggers string.

Another example of complex matching is HTML where we would like to match starting tags, like <div> with an ending tag </div>. To make an end tag only match its specific open tag, we need to dynamically generate token classes that make the braces match correctly. This can be done using $ expansion in the token class:

tokenizer: {
  root: [
     [/<(\w+)(>?)/,   { token: 'tag-$1', bracket: '@open'  }],
     [/<\/(\w+)\s*>/, { token: 'tag-$1', bracket: '@close' }],

Note how we captured the actual tag name as a group and used that to generate the right token class. Again, to make outdentation work on the closing tag, you would also need to include the '>' character in the outdentTriggers string.

A final advanced example of brace matching is Visual Basic where declarations like Structure are matched with end declarations as End Structure. Just like HTML we need to dynamically set token classes so that an End Enum does not match with a Structure. A tricky part is that we now need to match multiple tokens at once, and we match a construct like End Enum as one closing token, but non declaration endings, like End Foo, as three tokens:

decls: ["Structure","Class","Enum","Function",...],

tokenizer: {
  root: [
     [/(End)(\s+)([A-Z]\w*)/, { cases: { '$3@decls': { token: 'keyword.decl-$3', bracket: '@close'},
                                         '@default': ['keyword','white','identifier.invalid'] }}],
     [/[A-Z]\w*/, { cases: { '@decls'  : { token: 'keyword.decl-$0', bracket: '@open' },
                             '@default': 'constructor' } }],

Note how we used $3 to first test if the third group is a declaration, and then use $3 in the token attribute to generate a declaration specific token class (so we match correctly). Also, to make outdentation work correctly, we would need to include all the ending characters of the declarations in the outdentTriggers string.
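
The rule that an open bracket only matches a close bracket with the same token class can be pictured with a simple stack. The sketch below is not the editor's algorithm; `matchBrackets` and the token objects are hypothetical and only illustrate why declaration-specific token classes keep End Enum from pairing with Structure.

```javascript
// Sketch only: pair open and close brackets whose token classes are equal.
function matchBrackets(tokens) {
  const open = [];    // indices of open brackets that are not closed yet
  const pairs = [];
  tokens.forEach((tok, i) => {
    if (tok.bracket === '@open') {
      open.push(i);
    } else if (tok.bracket === '@close') {
      const top = open[open.length - 1];
      if (top !== undefined && tokens[top].token === tok.token) {
        pairs.push([open.pop(), i]);
      }
      // Otherwise the close bracket stays unmatched.
    }
  });
  return pairs;
}

// Hypothetical token stream for nested Visual Basic declarations.
const tokens = [
  { text: 'Structure', token: 'keyword.decl-Structure', bracket: '@open' },
  { text: 'Enum', token: 'keyword.decl-Enum', bracket: '@open' },
  { text: 'End Enum', token: 'keyword.decl-Enum', bracket: '@close' },
  { text: 'End Structure', token: 'keyword.decl-Structure', bracket: '@close' },
];
const pairs = matchBrackets(tokens); // [[1, 2], [0, 3]]
```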

Advanced: more attributes on the language definition

Here are more advanced attributes that can be defined in the language definition:

tokenPostfix
(optional="." + name, string) Optional postfix attached to all returned tokens. By default this attaches the language name so in the CSS you can refer to your specific language. For example, for the Java language, we could use .identifier.java to target all Java identifiers specifically in CSS.
start
(optional, string) The start state of the tokenizer. By default, this is the first entry in the tokenizer attribute.
outdentTriggers
(optional, string) Optional string that defines characters that when typed could cause outdentation. This attribute is only used when using advanced brace matching in combination with the bracket attribute. By default it always includes the last characters of the closing brackets in the brackets list. Outdentation happens when the user types a closing bracket word on a line that starts with only white space. If the closing bracket matches an open bracket it is indented to the same amount as that bracket. Usually, this causes the bracket to outdent. For example, in the Ruby language, the end keyword would match with an open declaration like def or class. To make outdentation happen though, we would need to include the d character in the outdentTriggers attribute so it is checked when the user types end: outdentTriggers: 'd',

Über Advanced: complex embeddings with dynamic end tags

Many times, embedding other language fragments is easy as shown in the earlier CSS example, but sometimes it is more dynamic. For example, in HTML we would like to start embeddings on a script tag and style tag. By default, the script language is javascript but if the type attribute is set, that defines the script language mime type. First, we define general tag open and close rules:

[/<(\w+)/,       { token: 'tag.tag-$1', bracket: '@open', next: '@tag.$1' }],
[/<\/(\w+)\s*>/, { token: 'tag.tag-$1', bracket: '@close' } ],

Here, we use the $1 to capture the open tag name in both the token class and the next state. By putting the tag name in the token class, the brace matching will match and auto indent corresponding tags automatically. Next we define the @tag state that matches content within an HTML tag. Because the open tag rule will set the next state to @tag._tagname_, this will match the @tag state due to dot separation.

tag: [
  [/[ \t\r\n]+/, 'white'],
  [/(type)(\s*=\s*)(['"])([^'"]+)(['"])/, [ 'attribute', 'delimiter', 'string', // todo: should match up quotes properly
                                            {token: 'string', switchTo: '@tag.$S2.$4' },
                                            'string'] ],
  [/(\w+)(\s*=\s*)(['"][^'"]+['"])/, ['keyword', 'delimiter', 'string' ]],
  [/>/, { cases: { '$S2==style' : { token: 'delimiter', switchTo: '@embedded.$S2', nextEmbedded: 'text/css'}
                 , '$S2==script': { cases: { '$S3'     : { token: 'delimiter', switchTo: '@embedded.$S2', nextEmbedded: '$S3' },
                                             '@default': { token: 'delimiter', switchTo: '@embedded.$S2', nextEmbedded: 'javascript' } } }
                 , '@default'   : { token: 'delimiter', next: '@pop' } } }],
  [/[^>]/,'']  // catch all
],

Inside the @tag._tagname_ state, we access the _tagname_ through $S2. This is used to test if the tag name matches a script or style tag, in which case we start an embedded mode. We also need switchTo here since we do not want to get back to the @tag state at that point. Also, on a type attribute we extend the state to @tag._tagname_._mimetype_ which allows us to access the mime type as $S3 if it was set. This is used to determine the script language (or default to javascript). Finally, the script and style start an embedded mode and switch to a state @embedded._tagname_. The tag name is included in the state so we can scan for exactly a matching end tag:

embedded: [
  [/[^"<]+/, ''],
  [/<\/(\w+)\s*>/, { cases: { '$1==$S2' : { token: '@rematch', next: '@pop', nextEmbedded: '@pop' },
                              '@default': '' } }],
  [/"/, 'string', '@string' ],
  [/</, '']
],

Only when we find a matching end tag (outside a string), $1==$S2, we pop the state and exit the embedded mode. Note that we need @rematch since the editor is ignoring our token classes until we actually exit the embedded mode (and we handle the close tag again in the @root state).
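
The scan performed by the embedded state can be imitated outside the editor to see where an embedded block would end. The sketch below makes several assumptions: `findEmbeddedEnd` is a hypothetical function, it works on one string instead of line by line, and the string handling is a guess because the @string state is not shown in this document.

```javascript
// Sketch only: find the offset of the end tag that closes an embedded block.
// Returns -1 when no matching end tag is found outside a string.
function findEmbeddedEnd(source, tagName) {
  let pos = 0;
  let inString = false;
  while (pos < source.length) {
    if (inString) {
      // Assumed @string behaviour: skip ahead to the closing double quote.
      const close = source.indexOf('"', pos);
      if (close < 0) return -1;
      pos = close + 1;
      inString = false;
      continue;
    }
    const rest = source.slice(pos);
    let m;
    if ((m = /^[^"<]+/.exec(rest))) {
      pos += m[0].length;                 // plain embedded text
    } else if ((m = /^<\/(\w+)\s*>/.exec(rest))) {
      if (m[1] === tagName) return pos;   // '$1==$S2': matching end tag, left unconsumed
      pos += m[0].length;                 // some other end tag: keep scanning
    } else if (rest[0] === '"') {
      inString = true;
      pos += 1;
    } else {
      pos += 1;                           // a lone '<'
    }
  }
  return -1;
}

const src = 'var s = "</script>"; </style> x</script> tail';
const scriptEnd = findEmbeddedEnd(src, 'script'); // offset of the second </script>
const styleEnd = findEmbeddedEnd(src, 'style');   // offset of </style>
```

Returning the offset without consuming the tag corresponds to the '@rematch' token: the end tag is handled again by the rules of the enclosing state.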
