22 September 2024 – Breaking changes. Where previously baremark.add(…) was used to extend rules, now use baremark().push([…]). (Note the added brackets!) – This makes Baremark smaller and adds more flexibility! All Javascript array methods may now be used to work with the ruleset. (For example, one may now use baremark().unshift([…]) to add a rule to be executed first, which wasn’t previously possible.)
Baremark is a minimal (but extendable) Markdown parser written in Javascript, originally inspired by Van Tigranyan’s Gist Landmark, but with added bugfixes, optimizations, support for labeled links/images, and a little more CommonMark compliance. (Note that Baremark never will be fully CommonMark compliant, as the intent of Baremark is source code brevity above featurefulness.)
It is currently 1984 bytes in size before minification and zipping!
Baremark can be invoked in two ways, and also provides a helper function.
- baremark(MARKDOWN) – Expand MARKDOWN and return the resulting HTML. (Most of the time, this is the only function you need.)
- baremark() – Returns the list of rules used internally by Baremark. This is used to extend the Baremark rules (for supporting your own non-standard Markdown). See below.
- baremark.escape(STRING) – Expands any characters in STRING that are special in Markdown into HTML &#…; entities. This means that further processing will not affect STRING. For example use, see below.
Baremark’s internals are very simple. It consists of a list of rules, which are applied, in order, to the inputted Markdown text. Each rule is passed on exactly as-is to the Javascript replace() string method. Yet, from this simplicity comes remarkable versatility.
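The idea of an ordered rule list can be sketched in a few lines. This is only an illustration under assumptions: the three rules and the name applyRules are made up for the example and are not Baremark’s real ruleset or source code.

```javascript
// Hypothetical mini ruleset (NOT Baremark's real rules): each rule is a
// [pattern, replacement] pair, i.e. the two arguments replace() takes.
const rules = [
    [/&/g, '&amp;'],
    [/\*\*(.+?)\*\*/g, '<b>$1</b>'],
    [/\*(.+?)\*/g, '<i>$1</i>'],
]

// Apply every rule in order, feeding the output of one rule into the next.
const applyRules = (text, ruleset) =>
    ruleset.reduce((acc, rule) => acc.replace(...rule), text)

const out = applyRules('**bold** & *italic*', rules)
console.log(out) // <b>bold</b> &amp; <i>italic</i>
```

Order matters here: if the italic rule ran before the bold rule, it would consume the asterisks of **bold** first.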
Line endings are normalized by the first builtin rule of Baremark. Normalization strips any trailing spaces and tabs, and makes sure all lines end in \n (converting any Windows \r\n and old Mac \r line endings).
This means that your rules need not match trailing space, or, if you want to match trailing space, that your new rule has to be added before the builtin rules (using baremark().unshift()).
Let’s take an example. The below rule turns [#text] into <a id="text"></a>, allowing you to add fragment URL anchors to your text (so that you can put #text into your URL to scroll to that part of the page). – This rule is added to the end of the current ruleset using baremark().push() (meaning that it will be applied after all the previously existing rules).
// Fragment URL anchor: Turns `[#text]` into <a id="text"></a>.
baremark().push([/\[#([^.:\[\]\s]+)\][\t ]*/g, '<a id="$1"></a>'])
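To see what this rule does in isolation, the same pattern and replacement can be handed straight to replace(), without Baremark. The name anchorRule and the sample text are made up for this sketch.

```javascript
// The anchor rule from the text, as a [pattern, replacement] pair.
const anchorRule = [/\[#([^.:\[\]\s]+)\][\t ]*/g, '<a id="$1"></a>']

// Spaces and tabs following the anchor are swallowed by the pattern.
const line = '[#intro] Introduction'.replace(...anchorRule)
console.log(line) // <a id="intro"></a>Introduction
```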
Below is another, more involved, example of a Baremark rule. This one parses the first paragraph of the Markdown input as metadata if possible. (If the first paragraph isn’t formatted like an email or HTTP header then it is left untouched, otherwise it’s removed and the metadata is stored in a variable for later use.) This rule needs to be processed first, before any other rules, and so it is added using baremark().unshift().
First a small Javascript module called baremarkHeaders is created (this acts as a container for the returned metadata). It consists of an internal scope (with the private variable meta hidden in it), and a Javascript array with an extra method get() that can be used to return the metadata after invoking baremark().
// Baremark rule for reading header style metadata. Processes first paragraph
// as metadata if (and only if) it looks like email headers (e.g. 'Author:
// <name>'). After `baremark()` call `baremarkHeaders.get()` to get object
// with metadata values.
const baremarkHeaders = (meta => Object.assign([
    /^(\n*)((\w+:.*\n)+)\n+/,
    (_, nl, txt) => {
        meta = {}
        // Split into lines (keeping each newline), one header per line.
        txt.split(/^/m).forEach(x => {
            const [, name, value] = /^(\w+):\s*(.*)\n/.exec(x)
            meta[name.toLowerCase()] = value
        })
        return nl
    }],
    { get: () => meta })
)()
// Invoking it.
baremark().unshift(baremarkHeaders)        // add rule
const html = baremark(markdown)
const meta = baremarkHeaders.get()         // get metadata
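As a rough check of what the header rule does on its own, it can be applied by hand with replace(). The rule is restated so the sketch is self-contained; the sample input and the names source, body and headers are made up for the example.

```javascript
// The header rule, restated from the text so this example is self-contained.
const baremarkHeaders = (meta => Object.assign([
    /^(\n*)((\w+:.*\n)+)\n+/,
    (_, nl, txt) => {
        meta = {}
        txt.split(/^/m).forEach(x => {
            const [, name, value] = /^(\w+):\s*(.*)\n/.exec(x)
            meta[name.toLowerCase()] = value
        })
        return nl
    }],
    { get: () => meta })
)()

// Sample input (made up for this example).
const source = 'Title: Hello World\nAuthor: Jane Doe\n\nFirst *real* paragraph.\n'

// Apply the rule by hand, the same way a [pattern, replacement] pair is
// handed to replace().
const [pattern, replacement] = baremarkHeaders
const body = source.replace(pattern, replacement)
const headers = baremarkHeaders.get()

console.log(body)    // First *real* paragraph.
console.log(headers) // { title: 'Hello World', author: 'Jane Doe' }
```

Note that the header names are lower-cased, so Title: becomes the key title.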
Also, multiple rules can be added at the same time:
baremark().unshift(
    [/\[#([^.:\[\]\s]+)\][\t ]*/g, '<a id="$1"></a>'],  // hash anchor
    [/\b[a-z]+:\/\/[^ \n<>]*\w/gi, x =>                 // autolink URL
        `<a href="${baremark.escape(x)}">${baremark.escape(x)}</a>`],
)
Above we also use baremark.escape() to prevent the autolinked URL from being further processed by Baremark. This stops Markdown characters (like _) which might occur in the URL from being expanded by later Baremark rules (which would result in HTML tags being inserted, breaking the link).
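The effect of escaping can be illustrated without Baremark. In the sketch below, escapeMarkdown is a hypothetical stand-in for baremark.escape(); that it converts every ASCII punctuation character into a numeric entity is an assumption made for the example, not a statement about Baremark’s actual implementation.

```javascript
// Hypothetical stand-in for baremark.escape() (an assumption about its
// behaviour): turn every ASCII punctuation character into a numeric entity.
const escapeMarkdown = s =>
    s.replace(/[!-\/:-@\[-`{-~]/g, c => `&#${c.charCodeAt(0)};`)

// Autolink rule from the text, using the stand-in instead of baremark.escape().
const autolink = [/\b[a-z]+:\/\/[^ \n<>]*\w/gi, x =>
    `<a href="${escapeMarkdown(x)}">${escapeMarkdown(x)}</a>`]

const escaped = escapeMarkdown('my_file')
console.log(escaped) // my&#95;file

// After escaping, no literal underscore is left for a later rule to pick up.
const linked = 'see http://example.com/a_b now'.replace(...autolink)
```

Browsers decode the entities again when reading the attribute, so the link still points at the original URL.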
Finally, since rules are passed exactly as-is to the Javascript string method replace(), the MDN docs on the subject are recommended reading.
Forgetting the [ and ] around the rules. – If you forget the brackets when adding rules (with baremark().push([…]) or baremark().unshift([…])) you’ll get a very cryptic error message upon running baremark(MARKDOWN).
Uncaught TypeError: r is not iterable
Forgetting the /g flag on the regex. – If you forget this flag, your regex will only be applied once. This is very seldom the right choice and can lead to some hard-to-find errors. (Though, for a counterexample, look at the baremarkHeaders extension above.)
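The difference is easy to demonstrate with plain replace(). The sample string and variable names are made up for this sketch.

```javascript
const source = 'snake_case_name'

// Without /g only the first match is replaced.
const once = source.replace(/_/, '-')

// With /g every match is replaced.
const all = source.replace(/_/g, '-')

console.log(once) // snake-case_name
console.log(all)  // snake-case-name
```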
Each regex is applied to the whole of the Markdown source. – Thus, for inline elements, you need to make sure that you allow single newlines to match inside your Markdown element, but never two newlines after each other (or your element will match across paragraph borders). The rule for **bold**, for example, looks like this:
[/(\*\*|__)(\n?(.+\n)*?.*?)\1/g,'<b>$2</b>']
Notice the (\n?(.+\n)*?.*?) part in the middle? That matches ‘one optional newline’ (\n?), followed by ‘as few as possible, optional, lines that have at least one non-newline character, and end in a newline’ ((.+\n)*?), followed by ‘as few as possible, optional, non-newline characters’ (.*?). – That’s a pretty elaborate way to say that **…** shouldn’t match if there are two newlines next to each other inside it.
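The behaviour described above can be checked by running the quoted bold rule through replace() directly. The names boldRule, wrapped and broken are made up for this sketch.

```javascript
// The bold rule quoted above, as a [pattern, replacement] pair.
const boldRule = [/(\*\*|__)(\n?(.+\n)*?.*?)\1/g, '<b>$2</b>']

// A single newline inside **…** is fine: the element still matches.
const wrapped = '**bold\ntext**'.replace(...boldRule)

// A blank line (two newlines in a row) inside **…** prevents the match,
// so the text is left untouched.
const broken = '**not\n\nbold**'.replace(...boldRule)

console.log(JSON.stringify(wrapped)) // "<b>bold\ntext</b>"
console.log(JSON.stringify(broken))  // "**not\n\nbold**"
```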
These limitations might change in the future.
- Indentation is ignored.
- Fenced code blocks are supported, but not indented ones.
- Blockquotes cannot be nested.
- Lists cannot be nested.
- Autolinks <URL> are not supported.
- Determining what is a paragraph and what is an HTML block is somewhat simplistic.
- Whitespace is not allowed between ]( or ][ in links and images. This is allowed in the CommonMark specification, but can lead to weird errors. (Baremark allows space inside the brackets though, so your line wrapping shouldn’t be too affected.)
Even though much inspiration is taken from CommonMark, Baremark sometimes goes in a slightly different direction (usually to keep the code minimal), see Limitations. Most advanced Markdown features (such as tables) are not supported out-of-the-box, but you may use extensions to add missing functionality.
The first step of Markdown processing normalizes line endings. This is done by trimming off all trailing tabs and spaces, and converting the end-line-character to \n (both Windows \r\n and old Mac \r line endings are supported). This means that subsequent rules can be simplified, as they do not have to factor in line-ending space.
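A minimal sketch of such a normalization step is shown below. It is not Baremark’s own rule (which is a single builtin rule); the two-step form and the name normalize are assumptions chosen for readability.

```javascript
// A sketch of the normalization described above (not Baremark's own rule).
const normalize = text => text
    .replace(/\r\n?/g, '\n')    // Windows \r\n and old Mac \r become \n
    .replace(/[\t ]+$/gm, '')   // strip trailing tabs and spaces on every line

const result = normalize('one  \r\ntwo\t\rthree \n')
console.log(JSON.stringify(result)) // "one\ntwo\nthree\n"
```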
Block elements are paragraph-level stuff, like lists, headings and the like. Most of them are separated from the surrounding blocks by blank lines, though some of them (like blockquotes, lists and label definitions) do not require that.
Paragraphs are any text that is surrounded by blank lines, which isn’t recognized as any other type of block. They are wrapped in a paragraph tag <p>...</p>, and any Markdown contained in the paragraph is also expanded.
The only difference between HTML blocks and paragraphs is that paragraphs are outputted wrapped in paragraph tags <p>...</p>, while HTML blocks are not. (Markdown is expanded in both paragraphs and HTML blocks.)
If one of the following HTML tags <...>, or their corresponding end tags </...>, is found at the beginning of a paragraph, it is considered an HTML block: address, article, aside, blockquote, details, div, dl, fieldset, figcaption, figure, footer, form, h1 … h6, header, hgroup, hr, main, menu, nav, noscript, ol, p, pre, script, search, section, style, table and ul. (These are the tags that close any currently open <p> tag. See also: MDN: The <p> Element).
There are two types of headings: Atx headings, which start with 1–6 hashtags # (optionally also followed by hashtags), and Setext headings, which are underlined with a line of equal signs = or hyphens -.
An Atx style heading starts with 1–6 # and is written on a line of its own. It must be preceded and followed by a blank line. These headings are expanded into HTML tags <h1> to <h6>. A heading may also (optionally) be followed by (any number of) #.
# Heading 1
## Heading 2
### Heading 3
#### Heading 4
##### Heading 5
###### Heading 6
A Setext heading consists of one or more lines of text, underlined by a line consisting of either = or - characters. If the underlining uses = the heading expands into <h1>, while if the underlining consists of - it expands into <h2>. The heading must be preceded by a blank line, but no blank line is required after the underlining.
Heading 1
=========
Blabla...
Heading 2
---------
Blabla...
A blockquote is any paragraph where each line starts with >. Blockquotes cannot be nested, but they may contain lists and span elements.
> Information is noise, unless it’s
> the information you’re looking for.
A bullet list item is any line that starts with -, + or * followed by a space. The different bullet characters may be mixed freely within the same list. If an item is longer than one line, then the subsequent lines must start with a space character. (In this way list items are the only instance where indentation matters in Baremark.)
- bullet one
+ bullet two, which also happens to be
a very long multi-line bullet item
* bullet three
Bullet lists cannot be nested.
A numbered list item is any line that starts with a number, followed by . or ) and then a space. The . and ) forms may be mixed freely within the same list. If an item is longer than one line, then the subsequent lines must start with a space character. (In this way list items are the only instance where indentation matters in Baremark.)
Like most Markdown parsers, Baremark ignores the actual numbers. (The outputted list will always be numbered from 1.)
1. numbered item 1
2. numbered item 2
3. numbered item 3
4. numbered item 4
Numbered lists cannot be nested.
Code blocks start and end with ``` on a line of its own. (Markdown’s indented code blocks are not supported by Baremark.) A code block may contain any kind of preformatted text (not just code).
```
first line of code
second line of code
etc...
```
A dinkus indicates a break in the text. It is sometimes used to mark the end of a chapter, or a scene change, or to separate stanzas in poetry. In HTML it is marked by a horizontal rule by default. (The CommonMark spec calls them “thematic breaks”.)
A dinkus consists of three or more hyphens -, underlines _ or asterisks *, optionally separated by spaces. It must be separated from the surrounding text with blank lines. Examples:
* * *
________________________________________
Span elements are the markup used for formatting text and adding links within block elements. All of a span element must occur within the same block (otherwise you’ll see the literal markup characters in the output).
NOTE: The Markdown bold, italic and underline can be nested in, shall we say, interesting ways. For example *italic **bold-italic* bold** will generate HTML which is technically non-standard, since the resulting HTML tags <i> and <b> will be overlapping (<i>italic <b>bold-italic</i> bold</b> → “italic bold-italic bold”), and not neatly nested, in the way the HTML standard thinks is appropriate. In practice, however, I’ve never seen a browser which fails to render this correctly. When in doubt, I suggest you just avoid using this “feature”. :)
Any ASCII punctuation character may be escaped by preceding it with a backslash. An escaped character is always interpreted literally. So if you want an actual asterisk in your text, you can use \* to indicate that this is an asterisk that is not part of a Markdown tag (like for example an *italic* tag). This feature is CommonMark compliant.
The ASCII punctuation characters are:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
NOTE: Unescaped brackets and parentheses are not allowed in the text of links and images.
Both links and images exist in two types: inline and labeled. The inline form contains the URL right there, in the Markdown element itself, while the labeled form (and its shortcut variant) hides the URL away in a separate label definition, placed elsewhere in the document, where the (often lengthy) URLs don’t wreak havoc with the line wrapping of your text. Links are expanded into the HTML tag <a href="URL" title="TITLE">TEXT</a> and images into <img src="URL" alt="TEXT" title="TITLE">. (Baremark only supports TITLE for labeled links and images.)
The Markdown for inline links and images looks the same, except that a leading exclamation point ! is added for the image form.
NOTE: TEXT may not contain unescaped brackets [] or parentheses (). Put a backslash in front of these characters if you need them.
- Link TEXT may contain Markdown span elements (including images, but not links).
- Image TEXT is a literal string. It is used for the alt attribute of the outputted <img> tag, and browsers use it as a replacement for the image itself (if it fails to load, or if text is read aloud, rather than displayed). It should contain a brief description of the image (avoiding phrasings like “Picture of …”).
URL may contain any valid URL. (You can use relative paths, URL fragments, or other protocols, like mailto: etc.) For images the URL should point to a valid image. This value is not further expanded (all characters, even backslash \, are interpreted literally).
Links:
[Example link](http://example.com/)
[Link with relative path](..)
[Link with fragment URL](#top)
Image:
![Logo](baremark.svg)
Linked image:
[![About us](baremark.svg)](../about.html)
It is often more convenient to use the labeled form of links and images, since this form allows you to move the (often very long) URLs out of the flow of the text, so that they don’t interfere with reading or mess up line wrapping.
NOTE: TEXT and LABEL may not contain unescaped brackets [] or parentheses (). Put a backslash in front of these characters if you need them.
- Link TEXT may contain Markdown span elements (including images, but not links).
- Image TEXT is a literal string. It is used for the alt attribute of the outputted <img> tag, and browsers use it as a replacement for the image itself (if it fails to load, or if text is read aloud, rather than displayed). It should contain a brief description of the image (avoiding phrasings like “Picture of …”).
LABEL names a label definition which contains the link URL (and optionally title) of the link. LABEL is case insensitive and whitespace is normalized (so that a link label may be word wrapped without causing trouble). If LABEL is not defined in the document, then the link will not be expanded, but remain as-is in the output. Optionally, LABEL may be left empty (or dropped altogether), see shortcut links and images.
Links:
[Example link][example]
[Link with relative path][up]
[Link with fragment URL][top]
Image:
![Logotype][logo]
Linked image:
[![About us][logo]][about]
[about]: ../about.html (About Page)
[example]: http://example.com/ (An Example Page)
[logo]: baremark.svg (Our Logo)
[top]: #top
[up]: .. (Go up one page)
When writing labeled links and images, the [LABEL] part at the end of the link/image element can be left empty, [TEXT][] or ![TEXT][], or dropped completely, [TEXT] or ![TEXT]. In these cases TEXT will be used as both link/image text and as LABEL.
Links:
[About][] page.
[About] page.
Images:
![Logo][]
![Logo]
[about]: ../about.html (About Page)
[logo]: baremark.svg (Our Logo)
TEXT is used as-is for the link (or image) text, but when looking up the link URL it is normalized in the same way as LABEL (in labeled links and images). This means that, in the above example, [About] and ![Logo] will be capitalized in the output, while still using the URL and title defined in the lower case label definitions [about]: … and [logo]: ….
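The lookup described above can be sketched as follows. The helper names normalizeLabel and lookup, and the use of a Map, are hypothetical and only illustrate the idea of case-insensitive, whitespace-normalized labels; Baremark’s real implementation may differ.

```javascript
// Hypothetical helper (not part of Baremark's API): lower-case the label and
// collapse every run of whitespace, including newlines, into a single space.
const normalizeLabel = label =>
    label.trim().replace(/\s+/g, ' ').toLowerCase()

// Label definitions stored under their normalized names.
const labels = new Map([
    [normalizeLabel('about'), { url: '../about.html', title: 'About Page' }],
    [normalizeLabel('logo page'), { url: '../logo.html', title: 'The History of Our Logo' }],
])

// Look-ups normalize the link text the same way, so case and line wrapping
// do not matter.
const lookup = text => labels.get(normalizeLabel(text))

const a = lookup('About')        // finds the [about] definition
const b = lookup('Logo\n  Page') // finds the [logo page] definition
const c = lookup('missing')      // undefined: such a link stays unexpanded
```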
A label definition associates a LABEL with a URL and (optionally) a TITLE. The label definitions themselves do not show up in the outputted HTML, but they are used when expanding labeled links and images. Label definitions can be placed anywhere in the document, but are usually placed either after the paragraph where they are used, or at the end of the document.
A label definition is not required to be surrounded by blank lines, but its [LABEL]: part must be the first thing on the line (without indentation). Thereafter the URL is specified, either enclosed in angle brackets <URL> or bare URL. Finally, the optional TITLE may be given in single quotes 'TITLE', double quotes "TITLE" or parentheses (TITLE). They can look like this:
[about]: about.html
[up]: .. (Go up one page)
[example]: http://example.com/ (An Example Page)
[logo page]: <../logo.html> "The History of Our Logo"
[ch5]: #chapter-5 'About Ancient Anthologies'
LABEL cannot contain brackets [] or parentheses () unless they are escaped by backslashes. The value is case insensitive, and whitespace normalized (so that it may be word wrapped without causing trouble).
URL can be written either plainly, or enclosed in angle brackets <>.
TITLE (which is optional) may be surrounded with either parentheses (…), double quotes "…" or single quotes '…'. TITLE is a literal string (any Markdown inside it is not expanded), since it is outputted as an HTML attribute value. This value is typically used by browsers as a hover text for the link or image.
Asterisks * or underlines _ are used to mark italic text. These are expanded into HTML tags <i>...</i>. (See also Note, under “Span Elements”, regarding nesting of bold, italic and underline.)
_italic_ or *italic*
Double asterisks ** or underlines __ are used to mark bold text. These are expanded into HTML tags <b>...</b>. (See also Note, under “Span Elements”, regarding nesting of bold, italic and underline.)
__bold__ or **bold**
Triple underlines ___ (but not asterisks) are used to mark underlined text. These are expanded into HTML tags <u>…</u>. (See also Note, under “Span Elements”, regarding nesting of bold, italic and underline.)
___UNDERLINE___
Double tildes ~~ are used to mark strikethrough text. These are expanded into HTML tags <s>...</s>.
~~STRIKETHROUGH~~
Colons : and straight quotes " are used to mark :"quoted text":. These are expanded into HTML tags <q>...</q>.
:"QUOTE":
NOTE: The quotes produced by most (all?) browsers by the HTML <q> tag cannot be copied and pasted, so I tend to stay away from using this Markdown element.
Backticks ` are used to mark code. CODE is treated as a literal string (Markdown inside it is not further processed, and HTML is escaped so that it shows up as text in the browser) and is then wrapped in the HTML tag <tt>...</tt>. Backslash escapes cannot be used in CODE (they will simply show up as backslashes in the output).
Any number of backticks may be used to start the tag, and the same number of backticks is used to terminate it. If CODE both starts and ends with space, then exactly one space is stripped off of either end. All of which means that to, for example, write a single backtick as CODE, you can use `` ` ``.
`CODE`
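The backtick-counting and space-stripping behaviour can be sketched with a single replace() rule. The regex and the names escapeHtml and codeRule are assumptions made for this illustration, not Baremark’s actual rule. Unlike Baremark, this simplified pattern does not stop at blank lines.

```javascript
// Escape the characters that are special in HTML.
const escapeHtml = s =>
    s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')

// Sketch of a code span rule (an assumption, not Baremark's actual regex):
// a run of N backticks, the content, then a run of exactly N backticks.
const codeRule = [
    /(`+)(?!`)([\s\S]*?[^`])\1(?!`)/g,
    // If the content starts AND ends with a space, strip one from each end.
    (_, ticks, code) =>
        `<tt>${escapeHtml(code.replace(/^ ([\s\S]*) $/, '$1'))}</tt>`,
]

const single = 'Use `a < b` here'.replace(...codeRule)
const tick = 'A backtick: `` ` ``'.replace(...codeRule)

console.log(single) // Use <tt>a &lt; b</tt> here
console.log(tick)   // A backtick: <tt>`</tt>
```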
Baremark was originally based on Van Tigranyan’s Gist Landmark (which is 2060 bytes in size) but it adds several features (such as labeled links & images) and fixes some bugs (see below).
It was shortened by:
- Regexes are expanded by preprocessing (so that they may be written even shorter).
- Tabs for indentation.
- Shortening all variable names to single letter.
- Using fat arrow functions (=>).
- Use of .reduce() instead of for loops.
- Rules is a list-of-lists (instead of a list of objects).
- Single letter groups in regexes changed to character classes (e.g. (-|*) to [-*]).
- Using Javascript template strings (`…`).
- Not allowing space before #, --- and === in headings, or before/after ```.
- Replaced unnecessary .*? with .* (greediness is fine if anchored by \n, since . won’t match newlines).
- Use HTML <b> instead of <strong> and <i> instead of <em>.
- Removing unnecessary spaces in source.
- escape() rewritten.
- Exports escape() method (for use in extensions).
- Handles spaces & tabs at end-of-line a bit more consistently.
- CommonMark compatible dinkus (uses 3 or more underscores _, hyphens - or asterisks *, optionally separated by space, where Landmark requires 5 or more asterisks *, and does not allow spaces between them).
- Support for backslash escapes \X (CommonMark).
- Support for labeled links and images [TEXT][LABEL] ![TEXT][LABEL] (CommonMark).
- Support for shortcut links and images [TEXT] ![TEXT] (CommonMark).
- Fixes atx style headings (# HEADING … ###### HEADING).
- Bold, italics and inline code may span newlines (but not empty lines).
- Bullet lists now require space after * or - (avoids confusion with italics).
- Blockquotes may contain lists.
- Blank line in lists, or between lines with leading > starts new list or blockquote.