\n`.

###### `options.allowDangerousHtml`

Whether to allow embedded HTML (`boolean`, default: `false`). See [§ Security][security].

###### `options.allowDangerousProtocol`

Whether to allow potentially dangerous protocols in links and images (`boolean`, default: `false`). URLs relative to the current protocol are always allowed (such as, `image.jpg`). For links, the allowed protocols are `http`, `https`, `irc`, `ircs`, `mailto`, and `xmpp`. For images, the allowed protocols are `http` and `https`. See [§ Security][security].

###### `options.extensions`

Array of syntax extensions ([`Arraya
```
Hello, Venus!
```

An opening curly brace, followed by one or more characters, and then a closing brace. We’ll then look up `planet` in some object and replace the variable with its corresponding value, to get something like `Venus` out.

It looks simple enough, but with markdown there are often a couple more things to think about. For this case, I can see the following:

* Is there a “block” version too?
* Are spaces allowed? Line endings? Should initial and final white space be ignored?
* Balanced nested braces? Superfluous ones such as `{{planet}}` or meaningful ones such as `{a {pla} net}`?
* Character escapes (`{pla\}net}`) and character references (`{pla&#x7d;net}`)?

To keep things as simple as possible, let’s not support a block syntax, see spaces as special, support line endings, or support nested braces. But to learn interesting things, we *will* support character escapes and -references.

Note that this particular case is already solved quite nicely by [`micromark-extension-mdx-expression`][mdx-expression]. It’s a bit more powerful and does more things, but it can be used to solve this case and otherwise serve as inspiration.

##### Setup

Create a new folder, enter it, and set up a new package:

```sh
mkdir example
cd example
npm init -y
```

In this example we’ll use ESM, so add `"type": "module"` to `package.json`:

```diff
@@ -2,6 +2,7 @@
   "name": "example",
   "version": "1.0.0",
   "description": "",
+  "type": "module",
   "main": "index.js",
   "scripts": {
     "test": "echo \"Error: no test specified\" && exit 1"
```

Add a markdown file, `example.md`, with the following text:

```markdown
Hello, {planet}!

{pla\}net} and {pla&#x7d;net}.
```

To check if our extension works, add an `example.js` module, with the following code:

```js
import {promises as fs} from 'node:fs'
import {micromark} from 'micromark'
import {variables} from './index.js'

main()

async function main() {
  const buf = await fs.readFile('example.md')
  const out = micromark(buf, {extensions: [variables]})
  console.log(out)
}
```

While working on the extension, run `node example` to see whether things work. Feel free to add more examples of the variables syntax in `example.md` if needed.

Our extension doesn’t work yet, for one thing because `micromark` is not installed:

```sh
npm install micromark --save-dev
```

…and we need to write our extension. Let’s do that in `index.js`:

```js
export const variables = {}
```

Although our extension doesn’t do anything, running `node example` now somewhat works!

##### Syntax extension

Much in micromark is based on character codes (see [§ Preprocess][preprocess]). For this extension, the relevant codes are:

* `-5` — M-0005 CARRIAGE RETURN (CR)
* `-4` — M-0004 LINE FEED (LF)
* `-3` — M-0003 CARRIAGE RETURN LINE FEED (CRLF)
* `null` — EOF (end of the stream)
* `92` — U+005C BACKSLASH (`\`)
* `123` — U+007B LEFT CURLY BRACE (`{`)
* `125` — U+007D RIGHT CURLY BRACE (`}`)

Also relevant are the content types (see [§ Content types][content-types]). This extension is a *text* construct, as it’s parsed alongside links and such. The content inside it (between the braces) is *string*, to support character escapes and -references.

Let’s write our extension. Add the following code to `index.js`:

```js
const variableConstruct = {name: 'variable', tokenize: variableTokenize}

export const variables = {text: {123: variableConstruct}}

function variableTokenize(effects, ok, nok) {
  return start

  function start(code) {
    console.log('start:', effects, code)
    return nok(code)
  }
}
```

The above code exports an extension with the identifier `variables`. The extension defines a *text* construct for the character code `123`.
The construct has a `name`, so that it can be turned off (optional, see next case), and it has a `tokenize` function that sets up a state machine, which receives `effects` and the `ok` and `nok` states. `ok` can be used when successful, `nok` when not, and so constructs are a bit similar to how promises can *resolve* or *reject*. `tokenize` returns the initial state, `start`, which itself receives the current character code, prints some debugging information, and then returns a call to `nok`.

Ensure that things work by running `node example` and see what it prints.

Now we need to define our states and figure out how variables work. Some people prefer sketching a diagram of the flow. I often prefer writing it down in pseudo-code prose. I’ve also found that test-driven development works well, where I write unit tests for how it should work, then write the state machine, and finally use a code coverage tool to ensure I’ve thought of everything.

In prose, what we have to code looks like this:

* **start**: Receive `123` as `code`, enter a token for the whole (let’s call it `variable`), enter a token for the marker (`variableMarker`), consume `code`, exit the marker token, enter a token for the contents (`variableString`), switch to *begin*
* **begin**: If `code` is `125`, reconsume in *nok*. Else, reconsume in *inside*
* **inside**: If `code` is `-5`, `-4`, `-3`, or `null`, reconsume in *nok*. Else, if `code` is `125`, exit the string token, enter a `variableMarker`, consume `code`, exit the marker token, exit the variable token, and switch to *ok*. Else, consume, and remain in *inside*.

That should be it! Replace `variableTokenize` with the following to include the needed states:

```js
function variableTokenize(effects, ok, nok) {
  return start

  function start(code) {
    effects.enter('variable')
    effects.enter('variableMarker')
    effects.consume(code)
    effects.exit('variableMarker')
    effects.enter('variableString')
    return begin
  }

  function begin(code) {
    return code === 125 ? nok(code) : inside(code)
  }

  function inside(code) {
    if (code === -5 || code === -4 || code === -3 || code === null) {
      return nok(code)
    }

    if (code === 125) {
      effects.exit('variableString')
      effects.enter('variableMarker')
      effects.consume(code)
      effects.exit('variableMarker')
      effects.exit('variable')
      return ok
    }

    effects.consume(code)
    return inside
  }
}
```

Run `node example` again and see what it prints! The HTML compiler ignores things it doesn’t know, so variables are now removed.

We have our first syntax extension, and it sort of works, but we don’t handle character escapes and -references yet. We need to do two things to make that work: a) skip over `\\` and `\}` in our algorithm, b) tell micromark to parse them.

Change the code in `index.js` to support escapes like so:

```diff
@@ -23,6 +23,11 @@ function variableTokenize(effects, ok, nok) {
       return nok(code)
     }
 
+    if (code === 92) {
+      effects.consume(code)
+      return insideEscape
+    }
+
     if (code === 125) {
       effects.exit('variableString')
       effects.enter('variableMarker')
@@ -35,4 +40,13 @@ function variableTokenize(effects, ok, nok) {
     effects.consume(code)
     return inside
   }
+
+  function insideEscape(code) {
+    if (code === 92 || code === 125) {
+      effects.consume(code)
+      return inside
+    }
+
+    return inside(code)
+  }
 }
```

Finally add support for character references and character escapes between braces by adding a special token that defines a content type:

```diff
@@ -11,6 +11,7 @@ function variableTokenize(effects, ok, nok) {
     effects.consume(code)
     effects.exit('variableMarker')
     effects.enter('variableString')
+    effects.enter('chunkString', {contentType: 'string'})
     return begin
   }
 
@@ -29,6 +30,7 @@ function variableTokenize(effects, ok, nok) {
     }
 
     if (code === 125) {
+      effects.exit('chunkString')
       effects.exit('variableString')
       effects.enter('variableMarker')
       effects.consume(code)
```

Tokens with a `contentType` are replaced during *postprocess* (see [§ Postprocess][postprocess]) with the tokens belonging to that content type.
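To watch the states above run without installing anything, here is a minimal sketch that drives the finished tokenizer by hand. The `trace` and `toCode` helpers are hypothetical and not part of micromark: they fake `effects`, `ok`, and `nok` just enough to record what happens. Two simplifications are assumptions of this sketch: `toCode` maps `\n` and `\r` to `-4` and `-5` but does not collapse CRLF into `-3`, and events are not rewound on `nok` the way the real parser discards a failed attempt.

```js
// Hypothetical harness (not micromark API): run the tokenizer over a string.
function trace(input) {
  const events = []
  let index = 0
  let result

  // Stand-in for micromark's `effects`: record tokens, advance on consume.
  const effects = {
    enter(type) { events.push('enter:' + type) },
    exit(type) { events.push('exit:' + type) },
    consume() { index++ }
  }

  let state = variableTokenize(
    effects,
    () => { result = 'ok' },
    () => { result = 'nok' }
  )

  // Feed one code per step until `ok` or `nok` has been called.
  while (result === undefined) {
    state = state(toCode(input, index))
  }

  return {result, consumed: index, events}
}

// Simplified preprocessing: EOF is `null`, LF is `-4`, CR is `-5`.
function toCode(input, index) {
  if (index >= input.length) return null
  const code = input.charCodeAt(index)
  if (code === 10) return -4
  if (code === 13) return -5
  return code
}

// The tokenizer from this section, with escapes and the `chunkString` token.
function variableTokenize(effects, ok, nok) {
  return start

  function start(code) {
    effects.enter('variable')
    effects.enter('variableMarker')
    effects.consume(code)
    effects.exit('variableMarker')
    effects.enter('variableString')
    effects.enter('chunkString', {contentType: 'string'})
    return begin
  }

  function begin(code) {
    return code === 125 ? nok(code) : inside(code)
  }

  function inside(code) {
    if (code === -5 || code === -4 || code === -3 || code === null) {
      return nok(code)
    }

    if (code === 92) {
      effects.consume(code)
      return insideEscape
    }

    if (code === 125) {
      effects.exit('chunkString')
      effects.exit('variableString')
      effects.enter('variableMarker')
      effects.consume(code)
      effects.exit('variableMarker')
      effects.exit('variable')
      return ok
    }

    effects.consume(code)
    return inside
  }

  function insideEscape(code) {
    if (code === 92 || code === 125) {
      effects.consume(code)
      return inside
    }

    return inside(code)
  }
}

console.log(trace('{planet}').result) // ok
console.log(trace('{}').result) // nok (empty variables are rejected in *begin*)
console.log(trace('{pla\\}net}').consumed) // 10 (the escaped brace does not close)
console.log(trace('{planet').result) // nok (EOF before a closing brace)
```

Note how returning `ok` hands control back as a state, whereas `nok(code)` is called right away: the harness loop relies on that difference to know when to stop.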
##### HTML extension

Up next is an HTML extension to replace variables with data. Change `example.js` to use one like so:

```diff
@@ -1,11 +1,12 @@
 import {promises as fs} from 'node:fs'
 import {micromark} from 'micromark'
-import {variables} from './index.js'
+import {variables, variablesHtml} from './index.js'
 
 main()
 
 async function main() {
   const buf = await fs.readFile('example.md')
-  const out = micromark(buf, {extensions: [variables]})
+  const html = variablesHtml({planet: '1', 'pla}net': '2'})
+  const out = micromark(buf, {extensions: [variables], htmlExtensions: [html]})
   console.log(out)
 }
```

And add the HTML extension, `variablesHtml`, to `index.js` like so:

```diff
@@ -52,3 +52,21 @@ function variableTokenize(effects, ok, nok) {
     return inside(code)
   }
 }
+
+export function variablesHtml(data = {}) {
+  return {
+    enter: {variableString: enterVariableString},
+    exit: {variableString: exitVariableString},
+  }
+
+  function enterVariableString() {
+    this.buffer()
+  }
+
+  function exitVariableString() {
+    const id = this.resume()
+    if (id in data) {
+      this.raw(this.encode(data[id]))
+    }
+  }
+}
```

`variablesHtml` is a function that receives an object mapping “variables” to strings and returns an HTML extension. The extension hooks two functions to `variableString`, one when it starts, the other when it ends. We don’t need to do anything to handle the other tokens as they’re already ignored by default.

`enterVariableString` calls `buffer`, which is a function that “stashes” what would otherwise be emitted. `exitVariableString` calls `resume`, which is the inverse of `buffer` and returns the stashed value. If the variable is defined, we ensure it’s made safe (with `this.encode`) and finally output that (with `this.raw`).

##### Further exercises

It works! We’re done!
Of course, it can be better, such as with the following potential features:

* Add support for empty variables
* Add support for spaces between markers and string
* Add support for line endings in variables
* Add support for nested braces
* Add support for blocks
* Add warnings on undefined variables
* Use `micromark-build`, and use `uvu/assert`, `debug`, and `micromark-util-symbol` (see [§ Size & debug][size-debug])
* Add [`mdast-util-from-markdown`][from-markdown] and [`mdast-util-to-markdown`][to-markdown] utilities to parse and serialize the AST

#### Case: turn off constructs

Sometimes you need to turn a default construct off. That’s possible through a syntax extension. Note that not everything can be turned off (such as paragraphs) and even if it’s possible to turn something off, it could break micromark (such as character escapes).

To disable constructs, refer to them by name in an array at the `disable.null` field of an extension:

```js
import {micromark} from 'micromark'

const extension = {disable: {null: ['codeIndented']}}

console.log(micromark('\ta', {extensions: [extension]}))
```

Yields:

```html
<p>a</p>
```

## Architecture

micromark is maintained as a monorepo. Many of its internals, which are used in `micromark` (core) but also useful for developers of extensions or integrations, are available as separate modules. Each module maintained here is available in [`packages/`][packages].

### Overview

The naming scheme in [`packages/`][packages] is as follows:

* `micromark-build` — Small CLI to build dev code into production code
* `micromark-core-commonmark` — CommonMark constructs used in micromark
* `micromark-factory-*` — Reusable subroutines used to parse parts of constructs
* `micromark-util-*` — Reusable helpers often needed when parsing markdown
* `micromark` — Core module

micromark has two interfaces: buffering (maintained in [`micromark/dev/index.js`](https://github.com/micromark/micromark/blob/main/packages/micromark/dev/index.js)) and streaming (maintained in [`micromark/dev/stream.js`](https://github.com/micromark/micromark/blob/main/packages/micromark/dev/stream.js)). The former takes all input at once, whereas the latter uses a Node.js stream to take input in pieces. They thinly wrap how data flows through micromark:

```txt
                                            micromark
+-----------------------------------------------------------------------------------------------+
|            +------------+         +-------+         +-------------+         +---------+       |
| -markdown->+ preprocess +-chunks->+ parse +-events->+ postprocess +-events->+ compile +-html- |
|            +------------+         +-------+         +-------------+         +---------+       |
+-----------------------------------------------------------------------------------------------+
```

### Preprocess

The **preprocessor** ([`micromark/dev/lib/preprocess.js`](https://github.com/micromark/micromark/blob/main/packages/micromark/dev/lib/preprocess.js)) takes markdown and turns it into chunks.

A **chunk** is either a character code or a slice of a buffer in the form of a string. Chunks are used because strings are more efficient storage than character codes, but limited in what they can represent.
For example, the input `ab\ncd` is represented as `['ab', -4, 'cd']` in chunks.

A character **code** is often the same as what `String#charCodeAt()` yields but micromark adds meaning to certain other values.

In micromark, the actual character U+0009 CHARACTER TABULATION (HT) is replaced by one M-0002 HORIZONTAL TAB (HT) and between 0 and 3 M-0001 VIRTUAL SPACE (VS) characters, depending on the column at which the tab occurred. For example, the input `\ta` is represented as `[-2, -1, -1, -1, 97]` and `a\tb` as `[97, -2, -1, -1, 98]` in character codes.

The characters U+000A LINE FEED (LF) and U+000D CARRIAGE RETURN (CR) are replaced by virtual characters depending on whether they occur together: M-0003 CARRIAGE RETURN LINE FEED (CRLF), M-0004 LINE FEED (LF), and M-0005 CARRIAGE RETURN (CR). For example, the input `a\r\nb\nc\rd` is represented as `[97, -3, 98, -4, 99, -5, 100]` in character codes.

The `0` (U+0000 NUL) character code is replaced by U+FFFD REPLACEMENT CHARACTER (`�`).

The `null` code represents the end of the input stream (called *eof* for end of file).

### Parse

The **parser** ([`micromark/dev/lib/parse.js`](https://github.com/micromark/micromark/blob/main/packages/micromark/dev/lib/parse.js)) takes chunks and turns them into events.

An **event** is the start or end of a token amongst other events. Tokens can “contain” other tokens, even though they are stored in a flat list, by entering before and exiting after them.

A **token** is a span of one or more codes. Tokens are most of what micromark produces: the built in HTML compiler or other tools can turn them into different things. Tokens are essentially names attached to a slice, such as `lineEndingBlank` for certain line endings, or `codeFenced` for a whole fenced code.

Sometimes, more info is attached to tokens, such as `_open` and `_close` by `attention` (strong, emphasis) to signal whether the sequence can open or close an attention run.
These fields have to do with how the parser works, which is complex and not always pretty.

Certain fields (`previous`, `next`, and `contentType`) are used in many cases: linked tokens for subcontent. Linked tokens are used because outer constructs are parsed first. Take for example:

```markdown
- *a
  b*.
```

1. The list marker and the space after it are parsed first
2. The rest of the line is a `chunkFlow` token
3. The two spaces on the second line are a `linePrefix` of the list
4. The rest of the line is another `chunkFlow` token

The two `chunkFlow` tokens are linked together and the chunks they span are passed through the flow tokenizer. There the chunks are seen as `chunkContent` and passed through the content tokenizer. There the chunks are seen as a paragraph, tokenized as `chunkText`, and passed through the text tokenizer. Finally, the attention (emphasis) and data (“raw” characters) are parsed there, and we’re done!

#### Content types

The parser starts out with a document tokenizer. *Document* is the top-most content type, which includes containers such as block quotes and lists. Containers in markdown come from the margin and include more constructs on the lines that define them.

*Flow* represents the sections (block constructs such as ATX and setext headings, HTML, indented and fenced code, thematic breaks), which like *document* are also parsed per line. An example is HTML, which has a certain starting condition (such as `