# micromark

[![Build][build-badge]][build]
[![Coverage][coverage-badge]][coverage]
[![Downloads][downloads-badge]][downloads]
[![Size][bundle-size-badge]][bundle-size]
[![Sponsors][sponsors-badge]][opencollective]
[![Backers][backers-badge]][opencollective]
[![Chat][chat-badge]][chat]

The smallest CommonMark compliant markdown parser with positional info and concrete tokens.

## Feature highlights

* [x] **[compliant][commonmark]** (100% to CommonMark)
* [x] **[extensions][]** ([GFM][], [directives][], [frontmatter][], [math][], [MDX.js][mdxjs])
* [x] **[safe][security]** (by default)
* [x] **[small][size]** (smallest CM parser that exists)
* [x] **[robust][test]** (1800+ tests, 100% coverage, fuzz testing)

## When to use this

* If you *just* want to turn markdown into HTML (with maybe a few extensions)
* If you want to do *really complex things* with markdown

See [§ Comparison][comparison] for more info.

## Intro

micromark is a long awaited markdown parser.
It uses a [state machine][cmsm] to parse the entirety of markdown into concrete tokens.
It’s the smallest 100% [CommonMark][] compliant markdown parser in JavaScript.
It was made to replace the internals of [`remark-parse`][remark-parse], the most [popular][] markdown parser.
Its API compiles to HTML, but its parts are made to be used separately, so as to generate syntax trees ([`mdast-util-from-markdown`][from-markdown]) or compile to other output formats.

* to learn markdown, see this [cheatsheet and tutorial][cheat]
* for more about us, see [`unifiedjs.com`][site]
* for updates, see [Twitter][]
* for questions, see [Discussions][chat]
* to help, see [contribute][] or [sponsor][] below

## Contents

* [Install](#install)
* [Use](#use)
* [API](#api)
  * [`micromark(value[, encoding][, options])`](#micromarkvalue-encoding-options)
  * [`stream(options?)`](#streamoptions)
* [Extensions](#extensions)
  * [List of extensions](#list-of-extensions)
  * [`SyntaxExtension`](#syntaxextension)
  * [`HtmlExtension`](#htmlextension)
  * [Extending markdown](#extending-markdown)
  * [Creating a micromark extension](#creating-a-micromark-extension)
* [Architecture](#architecture)
  * [Overview](#overview)
  * [Preprocess](#preprocess)
  * [Parse](#parse)
  * [Postprocess](#postprocess)
  * [Compile](#compile)
* [Examples](#examples)
  * [GitHub flavored markdown (GFM)](#github-flavored-markdown-gfm)
  * [Math](#math)
  * [Syntax tree](#syntax-tree)
* [Markdown](#markdown)
  * [CommonMark](#commonmark)
  * [Grammar](#grammar)
* [Project](#project)
  * [Comparison](#comparison)
  * [Test](#test)
  * [Size & debug](#size--debug)
* [Version](#version)
* [Security](#security)
* [Contribute](#contribute)
* [Sponsor](#sponsor)
* [Origin story](#origin-story)
* [License](#license)

## Install

This package is [ESM only][esm].
In Node.js (version 12.20+, 14.14+, 16.0+, 18.0+), install with [npm][]:

```sh
npm install micromark
```

In Deno with [`esm.sh`][esmsh]:

```js
import {micromark} from 'https://esm.sh/micromark@3'
```

In browsers with [`esm.sh`][esmsh]:

```html
<script type="module">
  import {micromark} from 'https://esm.sh/micromark@3?bundle'
</script>
```

## Use

Typical use (buffering):

```js
import {micromark} from 'micromark'

console.log(micromark('## Hello, *world*!'))
```

Yields:

```html
<h2>Hello, <em>world</em>!</h2>
```

You can pass extensions (in this case [`micromark-extension-gfm`][gfm]):

```js
import {micromark} from 'micromark'
import {gfm, gfmHtml} from 'micromark-extension-gfm'

const value = '* [x] contact@example.com ~~strikethrough~~'

const result = micromark(value, {
  extensions: [gfm()],
  htmlExtensions: [gfmHtml()]
})

console.log(result)
```

Yields:

```html
<ul>
<li><input checked="" disabled="" type="checkbox"> <a href="mailto:contact@example.com">contact@example.com</a> <del>strikethrough</del></li>
</ul>
```

Streaming interface:

```js
import fs from 'fs'
import {stream} from 'micromark/stream'

fs.createReadStream('example.md')
  .on('error', handleError)
  .pipe(stream())
  .pipe(process.stdout)

function handleError(error) {
  // Handle your error here!
  throw error
}
```

## API

`micromark` core has two entries in its export map: `micromark` and `micromark/stream`.

`micromark` exports the following identifier: `micromark`.
`micromark/stream` exports the following identifier: `stream`.
There are no default exports.

The export map supports the endorsed [`development` condition](https://nodejs.org/api/packages.html#packages_resolving_user_conditions).
Run `node --conditions development module.js` to get instrumented dev code.
Without this condition, production code is loaded.
See [§ Size & debug][size-debug] for more info.

### `micromark(value[, encoding][, options])`

Compile markdown to HTML.

##### Parameters

###### `value`

Markdown to parse (`string` or `Buffer`).

###### `encoding`

[Character encoding][encoding] to understand `value` as when it’s a [`Buffer`][buffer] (`string`, default: `'utf8'`).

###### `options.defaultLineEnding`

Value to use for line endings not in `value` (`string`, default: first line ending or `'\n'`).

Generally, micromark copies line endings (`'\r'`, `'\n'`, `'\r\n'`) in the markdown document over to the compiled HTML.
In some cases, such as `> a`, CommonMark requires that extra line endings are added: `<blockquote>\n<p>a</p>\n</blockquote>\n`.

###### `options.allowDangerousHtml`

Whether to allow embedded HTML (`boolean`, default: `false`).
See [§ Security][security].

###### `options.allowDangerousProtocol`

Whether to allow potentially dangerous protocols in links and images (`boolean`, default: `false`).
URLs relative to the current protocol are always allowed (such as, `image.jpg`).
For links, the allowed protocols are `http`, `https`, `irc`, `ircs`, `mailto`, and `xmpp`.
For images, the allowed protocols are `http` and `https`.
See [§ Security][security].

###### `options.extensions`

Array of syntax extensions ([`Array<SyntaxExtension>`][syntax-extension], default: `[]`).
See [§ Extensions][extensions].

###### `options.htmlExtensions`

Array of HTML extensions ([`Array<HtmlExtension>`][html-extension], default: `[]`).
See [§ Extensions][extensions].

##### Returns

`string` — Compiled HTML.

### `stream(options?)`

Streaming interface of micromark.
Compiles markdown to HTML.
`options` are the same as the buffering API above.
Note that some of the work to parse markdown can be done streaming, but in the end buffering is required.

micromark does not handle errors for you, so you must handle errors on whatever streams you pipe into it.
As markdown does not know errors, `micromark` itself does not emit errors.

## Extensions

micromark supports extensions.
There are two types of extensions for micromark: [`SyntaxExtension`][syntax-extension], which change how markdown is parsed, and [`HtmlExtension`][html-extension], which change how it compiles.
They can be passed in [`options.extensions`][option-extensions] or [`options.htmlExtensions`][option-htmlextensions], respectively.

As a user of extensions, refer to each extension’s readme for more on how to use them.
As a (potential) author of extensions, refer to [§ Extending markdown][extending-markdown] and [§ Creating a micromark extension][create-extension].

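As a rough sketch of how the two kinds of extension relate to the two options, both are plain objects. The names `exampleSyntax`, `exampleHtml`, and `neverMatch` are illustrative assumptions and not part of micromark. The tokenizer here always fails, so it only shows the shape of the objects rather than a working construct:

```js
// Hypothetical tokenizer that never matches; a real one sets up a state machine.
function neverMatch(effects, ok, nok) {
  return function start(code) {
    return nok(code)
  }
}

// Syntax extension: hook name (`text`) -> character code (123 is `{`) -> construct.
const exampleSyntax = {
  text: {123: {name: 'example', tokenize: neverMatch}}
}

// HTML extension: `enter` / `exit` -> token name -> handler.
const exampleHtml = {
  enter: {example() {}},
  exit: {example() {}}
}

// The options object as it would be passed to `micromark(value, options)`.
const options = {extensions: [exampleSyntax], htmlExtensions: [exampleHtml]}

console.log(Object.keys(options))
```

Both shapes are worked out step by step in the sections on creating an extension.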
### List of extensions

* [`micromark/micromark-extension-directive`][directives] — support directives (generic extensions)
* [`micromark/micromark-extension-frontmatter`][frontmatter] — support frontmatter (YAML, TOML, etc)
* [`micromark/micromark-extension-gfm`][gfm] — support GFM (GitHub Flavored Markdown)
* [`micromark/micromark-extension-gfm-autolink-literal`](https://github.com/micromark/micromark-extension-gfm-autolink-literal) — support GFM autolink literals
* [`micromark/micromark-extension-gfm-footnote`](https://github.com/micromark/micromark-extension-gfm-footnote) — support GFM footnotes
* [`micromark/micromark-extension-gfm-strikethrough`](https://github.com/micromark/micromark-extension-gfm-strikethrough) — support GFM strikethrough
* [`micromark/micromark-extension-gfm-table`](https://github.com/micromark/micromark-extension-gfm-table) — support GFM tables
* [`micromark/micromark-extension-gfm-tagfilter`](https://github.com/micromark/micromark-extension-gfm-tagfilter) — support GFM tagfilter
* [`micromark/micromark-extension-gfm-task-list-item`](https://github.com/micromark/micromark-extension-gfm-task-list-item) — support GFM tasklists
* [`micromark/micromark-extension-math`][math] — support math
* [`micromark/micromark-extension-mdx`](https://github.com/micromark/micromark-extension-mdx) — support MDX
* [`micromark/micromark-extension-mdxjs`][mdxjs] — support MDX.js
* [`micromark/micromark-extension-mdx-expression`](https://github.com/micromark/micromark-extension-mdx-expression) — support MDX (or MDX.js) expressions
* [`micromark/micromark-extension-mdx-jsx`](https://github.com/micromark/micromark-extension-mdx-jsx) — support MDX (or MDX.js) JSX
* [`micromark/micromark-extension-mdx-md`](https://github.com/micromark/micromark-extension-mdx-md) — support misc MDX changes
* [`micromark/micromark-extension-mdxjs-esm`](https://github.com/micromark/micromark-extension-mdxjs-esm) — support MDX.js import/exports

#### Community extensions

* [`wataru-chocola/micromark-extension-definition-list`](https://github.com/wataru-chocola/micromark-extension-definition-list) — support definition lists

### `SyntaxExtension`

A syntax extension is an object whose fields are typically the names of hooks, referring to where constructs “hook” into.
The fields at such objects are character codes, mapping to constructs as values.

The built in [constructs][] are an example.
See it and [existing extensions][extensions] for inspiration.

### `HtmlExtension`

An HTML extension is an object whose fields are typically `enter` or `exit` (reflecting whether a token is entered or exited).
The values at such objects are names of tokens mapping to handlers.

See [existing extensions][extensions] for inspiration.

### Extending markdown

micromark lets you change markdown syntax, yes, but there are alternatives.
The alternatives are often better.

Over the years, many micromark and remark users have asked about their unique goals for markdown.
Some exemplary goals are:

1. I want to add `rel="nofollow"` to external links
2. I want to add links from headings to themselves
3. I want line breaks in paragraphs to become hard breaks
4. I want to support embedded music sheets
5. I want authors to add arbitrary attributes
6. I want authors to mark certain blocks with meaning, such as tip, warning, etc
7. I want to combine markdown with JS(X)
8. I want to support our legacy flavor of markdown-like syntax

These can be solved in different ways and which solution is best is both subjective and dependent on unique needs.
Often, there is already a solution in the form of an existing remark or rehype plugin.
Respectively, their solutions are:

1. [`remark-external-links`](https://github.com/remarkjs/remark-external-links)
2. [`rehype-autolink-headings`](https://github.com/rehypejs/rehype-autolink-headings)
3. [`remark-breaks`](https://github.com/remarkjs/remark-breaks)
4. custom plugin similar to [`rehype-katex`](https://github.com/remarkjs/remark-math/tree/main/packages/rehype-katex) but integrating [`abcjs`](https://www.abcjs.net)
5. either [`remark-directive`](https://github.com/remarkjs/remark-directive) and a custom plugin or with [`rehype-attr`](https://github.com/jaywcjlove/rehype-attr)
6. [`remark-directive`](https://github.com/remarkjs/remark-directive) combined with a custom plugin
7. combining the existing micromark MDX extensions however you please, such as done by [`mdx-js/mdx`](https://github.com/mdx-js/mdx) or [`xdm`](https://github.com/wooorm/xdm)
8. Writing a micromark extension

Looking at these from a higher level, they can be categorized:

* **Changing the output by transforming syntax trees** (1 and 2)

  This category is nice as the format remains plain markdown that authors are already familiar with and which will work with existing tools and platforms.

  Implementations will deal with the syntax tree ([`mdast`][mdast]) and the ecosystems **[remark][]** and **[rehype][]**.
  There are many existing [utilities for working with that tree][utilities].
  Many [remark plugins][] and [rehype plugins][] also exist.

* **Using and abusing markdown to add new meaning** (3, 4, potentially 5)

  This category is similar to *Changing the output by transforming syntax trees*, but adds a new meaning to certain things which already have semantics in markdown.

  Some examples in pseudo code:

  ````markdown
  * **A list item with the first paragraph bold**

    And then more content, is turned into `<dl>` / `<dt>` / `<dd>` elements

  Or, the title attributes on links or images is [overloaded](/url 'rel:nofollow')
  with a new meaning.

  ```csv
  fenced,code,can,include,data
  which,is,turned,into,a,graph
  ```

  ```js data can="be" passed=true
  // after the code language name
  ```

  HTML, especially comments, could be used as **markers**
  ````

* **Arbitrary extension mechanism** (potentially 5; 6)

  This category is nice when content should contain embedded “components”.
  Often this means it’s required for authors to have some programming experience.
  There are three good ways to solve arbitrary extensions.

  **HTML**: Markdown already has an arbitrary extension syntax.
  It works in most places and authors are already familiar with the syntax, but it’s reasonably hard to implement securely.
  Certain platforms will remove HTML completely, others sanitize it to varying degrees.
  HTML also supports custom elements.
  These could be used and enhanced by client side JavaScript or enhanced when transforming the syntax tree.

  **Generic directives**: although [a proposal][directive-proposal] and not supported on most platforms, directives do work with many tools already.
  They’re not the easiest to author compared to, say, a heading, but sometimes that’s okay.
  They do have potential: they nicely solve the need for an infinite number of potential extensions to markdown in a single markdown-esque way.

  **MDX** also adds support for components by swapping HTML out for JS(X).
  JSX is an extension to JavaScript, so MDX is something along the lines of literate programming.
  This does require knowledge of React (or Vue) and JavaScript, excluding some authors.

* **Extending markdown syntax** (7 and 8)

  Extending the syntax of markdown means:

  * Authors won’t be familiar with the syntax
  * Content won’t work in other places (such as on GitHub)
  * Defeating the purpose of markdown: being simple to author and looking like what it means

  …and it’s hard to do as it requires some in-depth knowledge of JavaScript and parsing.
But it’s possible and in certain cases very powerful.

### Creating a micromark extension

This section shows how to create an extension for micromark that parses “variables” (a way to render some data) and one to turn a default construct off.

> Stuck?
> See [`support.md`][support].

#### Prerequisites

* You should possess an intermediate to high understanding of JavaScript: it’s going to get a bit complex
* Read the readme of [unified][] (until you hit the API section) to better understand where micromark fits
* Read the [§ Architecture][architecture] section to understand how micromark works
* Read the [§ Extending markdown][extending-markdown] section to understand whether it’s a good idea to extend the syntax of markdown

#### Extension basics

micromark supports two types of extensions.
Syntax extensions change how markdown is parsed.
HTML extensions change how it compiles.

HTML extensions are not always needed, as micromark is often used through [`mdast-util-from-markdown`][from-markdown] to parse to a markdown syntax tree.
So instead of an HTML extension a `from-markdown` utility is needed.
Then, a [`mdast-util-to-markdown`][to-markdown] utility, which is responsible for serializing syntax trees to markdown, is also needed.

When developing something for internal use only, you can pick and choose which parts you need.
When open sourcing your extensions, it should probably contain four parts: syntax extension, HTML extension, `from-markdown` utility, and a `to-markdown` utility.

On to our first case!

#### Case: variables

Let’s first outline what we want to make: render some data, similar to how [Liquid](https://github.com/Shopify/liquid/wiki/Liquid-for-Designers) and the like work, in our markdown.
It could look like this:

```markdown
Hello, {planet}!
```

Turned into:

```html
<p>Hello, Venus!</p>
```

An opening curly brace, followed by one or more characters, and then a closing brace.
We’ll then look up `planet` in some object and replace the variable with its corresponding value, to get something like `Venus` out.

It looks simple enough, but with markdown there are often a couple more things to think about.
For this case, I can see the following:

* Is there a “block” version too?
* Are spaces allowed? Line endings? Should initial and final white space be ignored?
* Balanced nested braces? Superfluous ones such as `{{planet}}` or meaningful ones such as `{a {pla} net}`?
* Character escapes (`{pla\}net}`) and character references (`{pla&#x7d;net}`)?

To keep things as simple as possible, let’s not support a block syntax, see spaces as special, support line endings, or support nested braces.
But to learn interesting things, we *will* support character escapes and \-references.

Note that this particular case is already solved quite nicely by [`micromark-extension-mdx-expression`][mdx-expression].
It’s a bit more powerful and does more things, but it can be used to solve this case and otherwise serve as inspiration.

##### Setup

Create a new folder, enter it, and set up a new package:

```sh
mkdir example
cd example
npm init -y
```

In this example we’ll use ESM, so add `"type": "module"` to `package.json`:

```diff
@@ -2,6 +2,7 @@
   "name": "example",
   "version": "1.0.0",
   "description": "",
+  "type": "module",
   "main": "index.js",
   "scripts": {
     "test": "echo \"Error: no test specified\" && exit 1"
```

Add a markdown file, `example.md`, with the following text:

```markdown
Hello, {planet}!

{pla\}net} and {pla&#x7d;net}.
```

To check if our extension works, add an `example.js` module, with the following code:

```js
import {promises as fs} from 'node:fs'
import {micromark} from 'micromark'
import {variables} from './index.js'

main()

async function main() {
  const buf = await fs.readFile('example.md')
  const out = micromark(buf, {extensions: [variables]})
  console.log(out)
}
```

While working on the extension, run `node example` to see whether things work.
Feel free to add more examples of the variables syntax in `example.md` if needed.

Our extension doesn’t work yet, for one because `micromark` is not installed:

```sh
npm install micromark --save-dev
```

…and we need to write our extension.
Let’s do that in `index.js`:

```js
export const variables = {}
```

Although our extension doesn’t do anything, running `node example` now somewhat works!

##### Syntax extension

Much in micromark is based on character codes (see [§ Preprocess][preprocess]).
For this extension, the relevant codes are:

* `-5` — M-0005 CARRIAGE RETURN (CR)
* `-4` — M-0004 LINE FEED (LF)
* `-3` — M-0003 CARRIAGE RETURN LINE FEED (CRLF)
* `null` — EOF (end of the stream)
* `92` — U+005C BACKSLASH (`\`)
* `123` — U+007B LEFT CURLY BRACE (`{`)
* `125` — U+007D RIGHT CURLY BRACE (`}`)

Also relevant are the content types (see [§ Content types][content-types]).
This extension is a *text* construct, as it’s parsed alongside links and such.
The content inside it (between the braces) is *string*, to support character escapes and -references.

Let’s write our extension.
Add the following code to `index.js`:

```js
const variableConstruct = {name: 'variable', tokenize: variableTokenize}

export const variables = {text: {123: variableConstruct}}

function variableTokenize(effects, ok, nok) {
  return start

  function start(code) {
    console.log('start:', effects, code)
    return nok(code)
  }
}
```

The above code exports an extension with the identifier `variables`.
The extension defines a *text* construct for the character code `123`.
The construct has a `name`, so that it can be turned off (optional, see next case), and it has a `tokenize` function that sets up a state machine, which receives `effects` and the `ok` and `nok` states.
`ok` can be used when successful, `nok` when not, and so constructs are a bit similar to how promises can *resolve* or *reject*.
`tokenize` returns the initial state, `start`, which itself receives the current character code, prints some debugging information, and then returns a call to `nok`.

Ensure that things work by running `node example` and see what it prints.

Now we need to define our states and figure out how variables work.
Some people prefer sketching a diagram of the flow.
I often prefer writing it down in pseudo-code prose.
I’ve also found that test driven development works well, where I write unit tests for how it should work, then write the state machine, and finally use a code coverage tool to ensure I’ve thought of everything.

In prose, what we have to code looks like this:

* **start**: Receive `123` as `code`, enter a token for the whole (let’s call it `variable`), enter a token for the marker (`variableMarker`), consume `code`, exit the marker token, enter a token for the contents (`variableString`), switch to *begin*
* **begin**: If `code` is `125`, reconsume in *nok*. Else, reconsume in *inside*
* **inside**: If `code` is `-5`, `-4`, `-3`, or `null`, reconsume in `nok`. Else, if `code` is `125`, exit the string token, enter a `variableMarker`, consume `code`, exit the marker token, exit the variable token, and switch to *ok*. Else, consume, and remain in *inside*.

That should be it!
Replace `variableTokenize` with the following to include the needed states:

```js
function variableTokenize(effects, ok, nok) {
  return start

  function start(code) {
    effects.enter('variable')
    effects.enter('variableMarker')
    effects.consume(code)
    effects.exit('variableMarker')
    effects.enter('variableString')
    return begin
  }

  function begin(code) {
    return code === 125 ? nok(code) : inside(code)
  }

  function inside(code) {
    if (code === -5 || code === -4 || code === -3 || code === null) {
      return nok(code)
    }

    if (code === 125) {
      effects.exit('variableString')
      effects.enter('variableMarker')
      effects.consume(code)
      effects.exit('variableMarker')
      effects.exit('variable')
      return ok
    }

    effects.consume(code)
    return inside
  }
}
```

Run `node example` again and see what it prints!
The HTML compiler ignores things it doesn’t know, so variables are now removed.

We have our first syntax extension, and it sort of works, but we don’t handle character escapes and -references yet.
We need to do two things to make that work: a) skip over `\\` and `\}` in our algorithm, b) tell micromark to parse them.

Change the code in `index.js` to support escapes like so:

```diff
@@ -23,6 +23,11 @@ function variableTokenize(effects, ok, nok) {
       return nok(code)
     }

+    if (code === 92) {
+      effects.consume(code)
+      return insideEscape
+    }
+
     if (code === 125) {
       effects.exit('variableString')
       effects.enter('variableMarker')
@@ -35,4 +40,13 @@ function variableTokenize(effects, ok, nok) {
     effects.consume(code)
     return inside
   }
+
+  function insideEscape(code) {
+    if (code === 92 || code === 125) {
+      effects.consume(code)
+      return inside
+    }
+
+    return inside(code)
+  }
 }
```

Finally add support for character references and character escapes between braces by adding a special token that defines a content type:

```diff
@@ -11,6 +11,7 @@ function variableTokenize(effects, ok, nok) {
     effects.consume(code)
     effects.exit('variableMarker')
     effects.enter('variableString')
+    effects.enter('chunkString', {contentType: 'string'})
     return begin
   }

@@ -29,6 +30,7 @@ function variableTokenize(effects, ok, nok) {
     }

     if (code === 125) {
+      effects.exit('chunkString')
       effects.exit('variableString')
       effects.enter('variableMarker')
       effects.consume(code)
```

Tokens with a `contentType` will be replaced by *postprocess* (see [§ Postprocess][postprocess]) by the tokens belonging to that content type.

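To see the states in motion without installing anything, the tokenizer with both diffs applied can be driven by a small stand-in for `effects`. This is only a sketch. The `run` helper and its mock `effects`, `ok`, and `nok` are hypothetical and not micromark’s API: they skip positions, backtracking, and real preprocessing, and they assume the input starts at the opening `{`.

```js
// Tokenizer from the steps above, with escapes and `chunkString` included.
function variableTokenize(effects, ok, nok) {
  return start

  function start(code) {
    effects.enter('variable')
    effects.enter('variableMarker')
    effects.consume(code)
    effects.exit('variableMarker')
    effects.enter('variableString')
    effects.enter('chunkString', {contentType: 'string'})
    return begin
  }

  function begin(code) {
    return code === 125 ? nok(code) : inside(code)
  }

  function inside(code) {
    if (code === -5 || code === -4 || code === -3 || code === null) {
      return nok(code)
    }

    if (code === 92) {
      effects.consume(code)
      return insideEscape
    }

    if (code === 125) {
      effects.exit('chunkString')
      effects.exit('variableString')
      effects.enter('variableMarker')
      effects.consume(code)
      effects.exit('variableMarker')
      effects.exit('variable')
      return ok
    }

    effects.consume(code)
    return inside
  }

  function insideEscape(code) {
    if (code === 92 || code === 125) {
      effects.consume(code)
      return inside
    }

    return inside(code)
  }
}

// Hypothetical harness (NOT micromark's API): feeds one code at a time to the
// current state and records which tokens are entered and exited.
function run(input) {
  const events = []
  let result
  const effects = {
    enter(type) { events.push('enter:' + type) },
    exit(type) { events.push('exit:' + type) },
    consume() {}
  }
  const ok = () => { result = 'ok' }
  const nok = () => { result = 'nok' }
  // Simplified preprocessing: only `\n` is mapped (to -4); `null` marks EOF.
  const codes = Array.from(input, (d) => (d === '\n' ? -4 : d.charCodeAt(0)))
  codes.push(null)
  let state = variableTokenize(effects, ok, nok)
  for (const code of codes) {
    state = state(code)
    if (result) break
  }
  return {result, events}
}

console.log(run('{planet}').result)
console.log(run('{pla\\}net}').result)
console.log(run('{}').result)
console.log(run('{planet').result)
```

The first two inputs end in `ok`; the empty variable and the unclosed one end in `nok`. Unlike this harness, micromark discards the tokens of a construct that ends in `nok`.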
##### HTML extension

Up next is an HTML extension to replace variables with data.
Change `example.js` to use one like so:

```diff
@@ -1,11 +1,12 @@
 import {promises as fs} from 'node:fs'
 import {micromark} from 'micromark'
-import {variables} from './index.js'
+import {variables, variablesHtml} from './index.js'

 main()

 async function main() {
   const buf = await fs.readFile('example.md')
-  const out = micromark(buf, {extensions: [variables]})
+  const html = variablesHtml({planet: '1', 'pla}net': '2'})
+  const out = micromark(buf, {extensions: [variables], htmlExtensions: [html]})
   console.log(out)
 }
```

And add the HTML extension, `variablesHtml`, to `index.js` like so:

```diff
@@ -52,3 +52,19 @@ function variableTokenize(effects, ok, nok) {
     return inside(code)
   }
 }
+
+export function variablesHtml(data = {}) {
+  return {
+    enter: {variableString: enterVariableString},
+    exit: {variableString: exitVariableString},
+  }
+
+  function enterVariableString() {
+    this.buffer()
+  }
+
+  function exitVariableString() {
+    const id = this.resume()
+    if (id in data) {
+      this.raw(this.encode(data[id]))
+    }
+  }
+}
```

`variablesHtml` is a function that receives an object mapping “variables” to strings and returns an HTML extension.
The extension hooks two functions to `variableString`, one when it starts, the other when it ends.
We don’t need to do anything to handle the other tokens as they’re already ignored by default.
`enterVariableString` calls `buffer`, which is a function that “stashes” what would otherwise be emitted.
`exitVariableString` calls `resume`, which is the inverse of `buffer` and returns the stashed value.
If the variable is defined, we ensure it’s made safe (with `this.encode`) and finally output that (with `this.raw`).

##### Further exercises

It works!
We’re done!
Of course, it can be better, such as with the following potential features:

* Add support for empty variables
* Add support for spaces between markers and string
* Add support for line endings in variables
* Add support for nested braces
* Add support for blocks
* Add warnings on undefined variables
* Use `micromark-build`, and use `uvu/assert`, `debug`, and `micromark-util-symbol` (see [§ Size & debug][size-debug])
* Add [`mdast-util-from-markdown`][from-markdown] and [`mdast-util-to-markdown`][to-markdown] utilities to parse and serialize the AST

#### Case: turn off constructs

Sometimes it’s needed to turn a default construct off.
That’s possible through a syntax extension.
Note that not everything can be turned off (such as paragraphs) and even if it’s possible to turn something off, it could break micromark (such as character escapes).

To disable constructs, refer to them by name in an array at the `disable.null` field of an extension:

```js
import {micromark} from 'micromark'

const extension = {disable: {null: ['codeIndented']}}

console.log(micromark('\ta', {extensions: [extension]}))
```

Yields:

```html
<p>a</p>
```

## Architecture

micromark is maintained as a monorepo.
Many of its internals, which are used in `micromark` (core) but also useful for developers of extensions or integrations, are available as separate modules.
Each module maintained here is available in [`packages/`][packages].

### Overview

The naming scheme in [`packages/`][packages] is as follows:

* `micromark-build` — Small CLI to build dev code into production code
* `micromark-core-commonmark` — CommonMark constructs used in micromark
* `micromark-factory-*` — Reusable subroutines used to parse parts of constructs
* `micromark-util-*` — Reusable helpers often needed when parsing markdown
* `micromark` — Core module

micromark has two interfaces: buffering (maintained in [`micromark/dev/index.js`](https://github.com/micromark/micromark/blob/main/packages/micromark/dev/index.js)) and streaming (maintained in [`micromark/dev/stream.js`](https://github.com/micromark/micromark/blob/main/packages/micromark/dev/stream.js)).
The first takes all input at once whereas the latter uses a Node.js stream to take input in separate pieces.
They thinly wrap how data flows through micromark:

```txt
                                            micromark
+-----------------------------------------------------------------------------------------------+
|            +------------+         +-------+         +-------------+         +---------+       |
| -markdown->+ preprocess +-chunks->+ parse +-events->+ postprocess +-events->+ compile +-html- |
|            +------------+         +-------+         +-------------+         +---------+       |
+-----------------------------------------------------------------------------------------------+
```

### Preprocess

The **preprocessor** ([`micromark/dev/lib/preprocess.js`](https://github.com/micromark/micromark/blob/main/packages/micromark/dev/lib/preprocess.js)) takes markdown and turns it into chunks.

A **chunk** is either a character code or a slice of a buffer in the form of a string.
Chunks are used because strings are more efficient storage than character codes, but limited in what they can represent.
For example, the input `ab\ncd` is represented as `['ab', -4, 'cd']` in chunks.

A character **code** is often the same as what `String#charCodeAt()` yields but micromark adds meaning to certain other values.

In micromark, the actual character U+0009 CHARACTER TABULATION (HT) is replaced by one M-0002 HORIZONTAL TAB (HT) and between 0 and 3 M-0001 VIRTUAL SPACE (VS) characters, depending on the column at which the tab occurred.
For example, the input `\ta` is represented as `[-2, -1, -1, -1, 97]` and `a\tb` as `[97, -2, -1, -1, 98]` in character codes.

The characters U+000A LINE FEED (LF) and U+000D CARRIAGE RETURN (CR) are replaced by virtual characters depending on whether they occur together: M-0003 CARRIAGE RETURN LINE FEED (CRLF), M-0004 LINE FEED (LF), and M-0005 CARRIAGE RETURN (CR).
For example, the input `a\r\nb\nc\rd` is represented as `[97, -3, 98, -4, 99, -5, 100]` in character codes.

The `0` (U+0000 NUL) character code is replaced by U+FFFD REPLACEMENT CHARACTER (`�`).

The `null` code represents the end of the input stream (called *eof* for end of file).

### Parse

The **parser** ([`micromark/dev/lib/parse.js`](https://github.com/micromark/micromark/blob/main/packages/micromark/dev/lib/parse.js)) takes chunks and turns them into events.

An **event** is the start or end of a token amongst other events.
Tokens can “contain” other tokens, even though they are stored in a flat list, by entering before and exiting after them.

A **token** is a span of one or more codes.
Tokens are most of what micromark produces: the built in HTML compiler or other tools can turn them into different things.
Tokens are essentially names attached to a slice, such as `lineEndingBlank` for certain line endings, or `codeFenced` for a whole fenced code.

Sometimes, more info is attached to tokens, such as `_open` and `_close` by `attention` (strong, emphasis) to signal whether the sequence can open or close an attention run.
These fields have to do with how the parser works, which is complex and not always pretty.

Certain fields (`previous`, `next`, and `contentType`) are used in many cases: linked tokens for subcontent.
Linked tokens are used because outer constructs are parsed first.
Take for example:

```markdown
- *a
  b*.
```

1. The list marker and the space after it are parsed first
2. The rest of the line is a `chunkFlow` token
3. The two spaces on the second line are a `linePrefix` of the list
4. The rest of the line is another `chunkFlow` token

The two `chunkFlow` tokens are linked together and the chunks they span are passed through the flow tokenizer.
There the chunks are seen as `chunkContent` and passed through the content tokenizer.
There the chunks are seen as a paragraph and seen as `chunkText` and passed through the text tokenizer.
Finally, the attention (emphasis) and data (“raw” characters) are parsed there, and we’re done!

#### Content types

The parser starts out with a document tokenizer.
*Document* is the top-most content type, which includes containers such as block quotes and lists.
Containers in markdown come from the margin and include more constructs on the lines that define them.

*Flow* represents the sections (block constructs such as ATX and setext headings, HTML, indented and fenced code, thematic breaks), which like *document* are also parsed per line.
An example is HTML, which has a certain starting condition (such as `