Creating a compiler-based Markdown parser in TypeScript - Part 1
Motivation
In my previous article, I wrote about creating a simple Markdown parser in TypeScript. That regular-expression-based parser is easy to understand, but it is limited in the Markdown syntax it can support and the HTML output it can generate. In this article, we'll explore an alternative approach: building a Markdown parser the way a compiler does. This three-part series begins with tokenization. The implementation and ideas are adapted from a blog post by Federico Ramirez at Beezwax, which is well written and easy to follow.
The purpose of this article and series is to document how I interpret the tutorial and to note the key points I have learned. Unlike the original tutorial, which uses Ruby, I'm using TypeScript, which may be helpful for readers more familiar with that language.
What is Tokenization?
Compilation typically involves three steps: tokenization, parsing, and code generation. We will start with tokenization, which transforms a raw string into a list of tokens. The parser then uses those tokens to build an abstract syntax tree (AST), the in-memory representation of the text, from which the HTML output is generated.
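To make the pipeline concrete, here is a sketch of the stages for a small input. The token list follows the code in this article; the AST and HTML shapes are only illustrative previews of parts 2 and 3.

// Raw Markdown:    '_hi_'
// 1. Tokenization: [Token(UNDERSCORE, _), Token(TEXT, hi), Token(UNDERSCORE, _), Token(EOF, )]
// 2. Parsing:      an AST node along the lines of Emphasis(Text('hi'))  (part 2)
// 3. Generation:   '<em>hi</em>'                                        (part 3)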
(The following sections gloss over some trivial details; you can check the codebase for the full reference.)
Token
Token is a class representing a token. It contains the token type, its value, and a few utility methods. The TokenType enum identifies the different token types, and TokenTypeLookup is a reverse lookup table that retrieves a token type from its character value.
export enum TokenType {
  TEXT = 'TEXT',
  NULL = 'NULL',
  EOF = 'EOF',
  UNDERSCORE = 'UNDERSCORE',
  STAR = 'STAR',
  NEWLINE = 'NEWLINE',
  HASH = 'HASH',
}

// TokenType reverse lookup
export const TokenTypeLookup: { [key: string]: TokenType } = {
  '_': TokenType.UNDERSCORE, // inline
  '*': TokenType.STAR, // inline
  '\n': TokenType.NEWLINE,
  '#': TokenType.HASH,
};

export class Token {
  public constructor(public type: string, public value: string) {}

  isNull(): boolean {
    return this.type === TokenType.NULL;
  }

  get length(): number {
    return this.value.length;
  }

  static endOfFile(): Token {
    return new Token(TokenType.EOF, '');
  }

  static null(): Token {
    return new Token(TokenType.NULL, '');
  }

  static text(content: string): Token {
    return new Token(TokenType.TEXT, content);
  }

  public toString(): string {
    return `Token(${this.type}, ${this.value})`;
  }
}
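A few quick examples of the Token helpers in action; the results follow directly from the class above:

Token.text('hello').length;   // 5
Token.null().isNull();        // true
Token.endOfFile().toString(); // 'Token(EOF, )'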
Strategy
The Strategy interface defines an approach for tokenizing a string into a single token (the very first one in the input). The tokenize method scans the raw string, decides what the first token is, and returns it; if no match is found, it returns Token.null(). The two implemented strategies show how this works: special characters are turned into their corresponding tokens, while plain text is consumed until a special character is found. The raw string is processed by repeatedly invoking tokenize until the string is exhausted, as shown in the next section.
export interface Strategy {
  tokenize(raw: string): Token;
}

/**
 * Consume a special character.
 * Return the special character as a token.
 */
export class SpecialCharTokenizationStrategy implements Strategy {
  tokenize(raw: string): Token {
    if (raw.length === 0 || !TokenTypeLookup[raw[0]]) {
      return Token.null();
    }
    const char = raw[0];
    return new Token(TokenTypeLookup[char], char);
  }
}

/**
 * Consume text until a special character is found.
 * Return the text as a token.
 */
export class TextTokenizationStrategy implements Strategy {
  tokenize(raw: string): Token {
    let result = '';
    for (const char of raw) {
      if (TokenTypeLookup[char]) {
        break;
      }
      result += char;
    }
    return Token.text(result);
  }
}
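As a quick illustration (this snippet is not part of the original code), here is how the two strategies behave on sample input:

const special = new SpecialCharTokenizationStrategy();
const text = new TextTokenizationStrategy();

console.log(special.tokenize('_hi_').toString());    // Token(UNDERSCORE, _)
console.log(special.tokenize('hi').toString());      // Token(NULL, ) -- no special character at the start
console.log(text.tokenize('hi _there_').toString()); // Token(TEXT, hi ) -- stops at the underscore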
Tokenizer
Tokenizer performs the tokenization process. It specifies the strategies to use and recursively tokenizes the raw string. tokenizeToArr returns an array of tokens, while tokenizeOneToken returns the first token by iterating through the strategies and calling their tokenize method; if a strategy returns a non-null token, that token is used, otherwise the next strategy is tried. The order of the strategies is crucial, since the first matching strategy wins. Here, SpecialCharTokenizationStrategy precedes TextTokenizationStrategy so that special characters become their own tokens. If the order were reversed, TextTokenizationStrategy would stop at the first special character and return an empty TEXT token; because an empty token consumes zero characters, the input would never shrink and tokenization would never terminate.
export class Tokenizer {
  strategies: Strategy[] = [
    new SpecialCharTokenizationStrategy(), // order matters
    new TextTokenizationStrategy(),
  ];

  /**
   * Tokenize into a TokenList, which is an abstraction over an array of tokens
   * providing some convenience methods
   */
  tokenize(raw: string): TokenList {
    return new TokenList(this.tokenizeToArr(raw));
  }

  /**
   * Tokenize into an array of tokens
   */
  tokenizeToArr(raw: string): Token[] {
    if (raw === '') {
      return [Token.endOfFile()];
    }
    const token = this.tokenizeOneToken(raw);
    const rest = raw.substring(token.length);
    return [token].concat(this.tokenizeToArr(rest));
  }

  tokenizeOneToken(raw: string): Token {
    for (const strategy of this.strategies) {
      const token = strategy.tokenize(raw);
      if (!token.isNull()) {
        return token;
      }
    }
    throw new Error(`No strategy found for ${raw}`);
  }
}
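To see the recursion at work, here is an illustrative trace of tokenizeToArr on a small input:

// tokenizeToArr('_hi_')
// 1. SpecialCharTokenizationStrategy matches '_'   -> Token(UNDERSCORE, _), rest = 'hi_'
// 2. SpecialChar fails, Text consumes up to '_'    -> Token(TEXT, hi),      rest = '_'
// 3. SpecialCharTokenizationStrategy matches '_'   -> Token(UNDERSCORE, _), rest = ''
// 4. The empty string terminates the recursion     -> Token(EOF, )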
TokenList
TokenList is an abstraction over a list of tokens, providing convenience methods such as validate, which helps examine the tokens and match them against Markdown syntax. For instance, the sequence UNDERSCORE, TEXT, UNDERSCORE represents italic text. The validate method checks whether the token list starts with a specified pattern of token types.
export class TokenList {
  constructor(public tokens: Token[]) {}

  /**
   * Make TokenList iterable like an array
   */
  [Symbol.iterator]() {
    return this.tokens[Symbol.iterator]();
  }

  public validateAny(...typeSequences: string[][]) {
    return typeSequences.some((typeSequence) => this.validate(...typeSequence));
  }

  /**
   * Check that the token list starts with exactly this sequence of types
   */
  public validate(...typeSequence: string[]) {
    if (typeSequence.length > this.tokens.length) {
      return false;
    }
    return typeSequence.every((type, i) => type === this.tokens[i].type);
  }

  public validateFrom(start: number, ...typeSequence: string[]) {
    return this.offset(start).validate(...typeSequence);
  }

  /**
   * Grab (and consume) the first n tokens
   */
  public grab(amount: number): Token[] {
    if (amount > this.tokens.length) {
      throw new Error('Invalid amount requested');
    }
    return this.tokens.splice(0, amount);
  }

  /**
   * Get a TokenList starting from the given index
   */
  public offset(start: number): TokenList {
    return start === 0 ? this : new TokenList(this.tokens.slice(start));
  }

  public get length() {
    return this.tokens.length;
  }

  public get first() {
    return this.tokens[0];
  }

  public get second() {
    return this.tokens[1];
  }

  public get third() {
    return this.tokens[2];
  }

  public toString() {
    return (
      '[\n\t' +
      this.tokens.map((token) => token.toString()).join(',\n\t') +
      '\n]'
    );
  }
}
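Here is a hypothetical usage example (not from the original article) showing how validate, validateAny, and grab might be combined during parsing:

const tokens = new Tokenizer().tokenize('_hi_');

// Does the list start with an italic pattern?
tokens.validate('UNDERSCORE', 'TEXT', 'UNDERSCORE'); // true

// Accept either _..._ or *...* as emphasis
tokens.validateAny(
  ['UNDERSCORE', 'TEXT', 'UNDERSCORE'],
  ['STAR', 'TEXT', 'STAR'],
); // true

// Consume the three matched tokens, leaving only EOF behind
const [open, inner, close] = tokens.grab(3);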
Result
The tokenizer is now ready for use via new Tokenizer().tokenize('input text'). A set of unit tests has been written to demonstrate its functionality. Below, we test the tokenizeToArr method, which performs essentially the same work as tokenize:
test('should tokenize empty string to EOF', () => {
  expect(new Tokenizer().tokenizeToArr('')).toEqual([Token.endOfFile()]);
});

test('should tokenize plain text', () => {
  expect(new Tokenizer().tokenizeToArr('Hello World')).toEqual([
    new Token('TEXT', 'Hello World'),
    Token.endOfFile(),
  ]);
});

test('should tokenize underscore', () => {
  expect(new Tokenizer().tokenizeToArr('_Hello World_')).toEqual([
    new Token('UNDERSCORE', '_'),
    new Token('TEXT', 'Hello World'),
    new Token('UNDERSCORE', '_'),
    Token.endOfFile(),
  ]);
});

// ...
Conclusion
This installment of the series has focused on the tokenization process. The next part will delve into parsing: interpreting the tokens and constructing the in-memory representation (the AST) of the text.
References
- The source code of CMark is available on GitHub at the CMark repo
- Writing a Markdown Compiler – Part 1