Using Tree-sitter for syntax highlighting in Jekyll

While exploring a particular rabbit hole for Maparoni, I came across this challenge:

How can you use a Tree-sitter grammar for syntax highlighting in a Jekyll blog – or, for that matter, in any HTML page using JavaScript?

Jekyll is implemented in Ruby. There is a repo with Ruby bindings for Tree-sitter, but these are out of date (last commit in early 2020) and target a previous version of Tree-sitter. The same question was raised there, and the answer it points to is:

Disable syntax highlighting in Jekyll,
Use Tree-sitter’s JavaScript bindings to parse code blocks, and
Write a JavaScript highlighter that uses that to highlight the code.

This is not well documented and required a lot of trial and error to get it working. Here are my steps, for anyone who faces the same challenge.

Assumptions

The main assumption here is that you have a tree-sitter grammar for a language that isn’t covered by existing syntax highlighters, such as Pygments, Rouge, or Highlight.js. In my case, I wrote¹ a tree-sitter grammar for Maparoni’s formulas. The app uses that for syntax highlighting in its built-in formula editor, and I want to use the same syntax highlighting for the documentation on the website.

The second assumption is that you only care about highlighting code in that language, and not any other language. Handling multiple languages with this method, or layering it on top of another syntax highlighter, are separate issues.

Disable syntax highlighting in Jekyll

Either disable all syntax highlighting by editing your _config.yml:

kramdown:
  highlighter: none
  syntax_highlighter: none

Or on a per-page basis by adding this line:

{::options syntax_highlighter="nil" /}

Add Tree-sitter’s web bindings

First, get the tree-sitter.js and tree-sitter.wasm from the official web bindings. This will provide the main parser and query functionality. Put them in an appropriate place, e.g., your assets/js folder.

Then load the script by adding this to the <head> of the relevant template:

<script type="text/javascript" src="/assets/js/tree-sitter.js"></script>

Add your language’s files

Next, you’ll need two files from the repo of your tree-sitter grammar:

The tree-sitter-MyLanguage.wasm for the language of your choice, i.e., the custom tree-sitter grammar.
The highlights.scm file for the language of your choice, i.e., the queries that the syntax highlighter will need. This is typically in a queries folder.

Also put these into an appropriate place, such as assets/js/tree-sitter-MyLanguage.wasm and assets/js/tree-sitter-MyLanguage/highlights.scm.

Building a highlighter

With that, we can get going. I’ll walk through, step by step, how to build the syntax highlighter.

Configure a Parser object and provide it your specific language:

const Parser = window.TreeSitter;
(async () => {
  await Parser.init();
  const parser = new Parser();
  const MyLanguage = 
    await Parser.Language.load('/assets/js/tree-sitter-MyLanguage.wasm');
  parser.setLanguage(MyLanguage);
  // ...
})(); // invoke the async function immediately

Let’s assume our code sits in an HTML element with the class .language-MyLanguage. We can grab these elements using document.querySelectorAll and then tell the parser to parse them:

// ...
const codeBlocks = document.querySelectorAll('.language-MyLanguage');
codeBlocks.forEach((el) => {
  const tree = parser.parse(el.innerHTML);
  console.log(tree.rootNode.toString());
  // ...
});

That prints the syntax tree for each of your code blocks.

One thing you might encounter here is that characters such as < are returned by innerHTML as &lt;, which the grammar probably won’t handle. So let’s fix that by decoding the HTML:

function htmlDecode(input) {
  var doc = new DOMParser().parseFromString(input, "text/html");
  return doc.documentElement.textContent;
}

We can then use htmlDecode(el.innerHTML) rather than el.innerHTML directly.
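To illustrate what this decoding step does without needing a browser (for instance when trying things out in Node), here is a minimal DOM-free sketch. decodeBasicEntities is a hypothetical helper, not part of the script in this post, and it only handles the five entities listed; the DOMParser version above covers everything the browser knows about.

```javascript
// Hypothetical, DOM-free stand-in for htmlDecode.
// Assumption: only these five entities need decoding.
const ENTITIES = { lt: '<', gt: '>', amp: '&', quot: '"', '#39': "'" };

function decodeBasicEntities(input) {
  // A single pass avoids double-decoding input such as "&amp;lt;".
  return input.replace(/&(lt|gt|amp|quot|#39);/g, (_, name) => ENTITIES[name]);
}
```

In the browser, reading el.textContent instead of el.innerHTML is another way to get the decoded text of an element.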

Now that we have a syntax tree for the code, let’s highlight it. We’ll need to use the Query API from the tree-sitter web bindings, which isn’t well documented at this stage, but we can see how to use it from the test suite.

This is where the highlights.scm comes into play. Let’s grab its contents, tell it to match against the syntax tree, and iterate over the matches:

// ...
let response = await fetch('/assets/js/tree-sitter-MyLanguage/highlights.scm');
let highlights = await response.text();
const query = MyLanguage.query(highlights);

query.matches(tree.rootNode).forEach((match) => {
  console.log(match);
  // ...
});
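To make the shape of those logged matches concrete, here is a small sketch that uses a hand-written stand-in rather than real parser output. It models only the fields this post relies on (captures, and each capture's name and node with text, startIndex and endIndex). exampleMatch and describeMatch are hypothetical names, and the values are invented for illustration.

```javascript
// Hypothetical stand-in for one entry of query.matches(tree.rootNode).
// Assumption: real matches carry more data; only the used fields are shown.
const exampleMatch = {
  pattern: 0,
  captures: [
    { name: 'function', node: { text: 'round', startIndex: 0, endIndex: 5 } },
  ],
};

// Summarize every capture in a match as "name [start, end): text".
function describeMatch(match) {
  return match.captures.map(
    ({ name, node }) =>
      `${name} [${node.startIndex}, ${node.endIndex}): ${node.text}`
  );
}
```

Calling describeMatch on each logged match can be easier to scan in the console than the raw objects.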

Each match pairs a capture name from the query, such as “function”, “keyword” or “constant”, with the syntax node it matched, including that node’s start and end indices. We can use that to build a new HTML string for the code block that wraps each match in a span with a CSS class.

const code = htmlDecode(el.innerHTML);
const tree = parser.parse(code);

var adjusted = "";
var lastEnd = 0;

query.matches(tree.rootNode).forEach((match) => {
  const name = match.captures[0].name;
  const text = match.captures[0].node.text;
  const start = match.captures[0].node.startIndex;
  const end = match.captures[0].node.endIndex;

  if (start < lastEnd) {
    return; // avoid duplicate matches for the same text
  }
  if (start > lastEnd) {
    adjusted += code.substring(lastEnd, start);
  }
  adjusted += `<span class="${name}">${text}</span>`;
  lastEnd = end;
});

if (lastEnd < code.length) {
  adjusted += code.substring(lastEnd);
}

el.innerHTML = adjusted;
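One caveat with the loop above: the decoded text goes back into innerHTML unescaped, so a < in the code may be re-parsed as markup. The following is a minimal sketch of the same span-building logic as a pure function that escapes its output. escapeHtml and renderHighlights are hypothetical names, and the captures argument is assumed to be a plain array of { name, startIndex, endIndex } objects, in document order, derived from the matches.

```javascript
// Escape characters that are significant in HTML text and attributes.
function escapeHtml(text) {
  const entities = {
    '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;',
  };
  return text.replace(/[&<>"']/g, (ch) => entities[ch]);
}

// Build highlighted HTML from decoded source code and a list of captures.
// Assumption: captures are sorted by startIndex, as in the loop above.
function renderHighlights(code, captures) {
  let html = '';
  let lastEnd = 0;
  for (const { name, startIndex, endIndex } of captures) {
    if (startIndex < lastEnd) continue; // skip overlapping captures
    // Unhighlighted text between the previous capture and this one.
    html += escapeHtml(code.substring(lastEnd, startIndex));
    html += `<span class="${escapeHtml(name)}">` +
      escapeHtml(code.substring(startIndex, endIndex)) + '</span>';
    lastEnd = endIndex;
  }
  // Trailing text after the last capture.
  return html + escapeHtml(code.substring(lastEnd));
}
```

Keeping this step free of DOM and parser calls also makes it possible to test it on its own with hand-written captures.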

All that’s left is to provide the relevant CSS for those span classes, such as:

.language-MyLanguage .variable { color: #cc6666; }
.language-MyLanguage .function { color: #81a2be; }
.language-MyLanguage .type { color: #f0c674; }
/* ... */
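If your highlights.scm uses dotted capture names such as function.builtin, a span with class="function.builtin" will not be matched by a plain .function rule. One possible workaround, sketched here under that assumption, is to turn the dots into spaces so that each part becomes its own class. captureToClassNames is a hypothetical helper, not part of the script in this post.

```javascript
// Turn a capture name such as "function.builtin" into a class list
// such as "function builtin", so that `.function` rules still apply.
function captureToClassNames(name) {
  return name.split('.').filter(Boolean).join(' ');
}
```

It would be used in place of the raw capture name when building the span, for example class="${captureToClassNames(name)}".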

And we’re good to go.

See below for the full script. It’s a very simple syntax highlighter and surely has some issues, but it’s working fine for my purposes so far.

Full script

window.onload = function() { highlight(); };

function htmlDecode(input) {
  var doc = new DOMParser().parseFromString(input, "text/html");
  return doc.documentElement.textContent;
}

function highlight() {
  const Parser = window.TreeSitter;
  (async () => {
    await Parser.init();
    const parser = new Parser();
    const MyLanguage = 
      await Parser.Language.load('/assets/js/tree-sitter-MyLanguage.wasm');
    parser.setLanguage(MyLanguage);

    let response = 
      await fetch('/assets/js/tree-sitter-MyLanguage/highlights.scm');
    let highlights = await response.text();
    const query = MyLanguage.query(highlights);

    const codeBlocks = document.querySelectorAll('.language-MyLanguage');
    
    codeBlocks.forEach((el) => {
      const code = htmlDecode(el.innerHTML);
      const tree = parser.parse(code);

      var adjusted = "";
      var lastEnd = 0;

      query.matches(tree.rootNode).forEach((match) => {
        const name = match.captures[0].name;
        const text = match.captures[0].node.text;
        const start = match.captures[0].node.startIndex;
        const end = match.captures[0].node.endIndex;

        if (start < lastEnd) {
          return;
        }
        if (start > lastEnd) {
          adjusted += code.substring(lastEnd, start);
        }
        adjusted += `<span class="${name}">${text}</span>`;
        lastEnd = end;
      });

      if (lastEnd < code.length) {
        adjusted += code.substring(lastEnd);
      }

      el.innerHTML = adjusted;
    });
  })();
}

¹ Quite a challenge in itself! The documentation, various grammars for other languages, and the tree-sitter playground help a lot, though.