aozora — a Rust parser for Aozora Bunko notation

aozora is a Rust parser for Aozora Bunko notation (青空文庫記法), the in-text annotation language used in .txt files distributed by Aozora Bunko. It does not handle CommonMark or any Markdown dialect — only the Aozora Bunko notation itself. The implementation ships as five distribution surfaces: a CLI binary, a Rust library, a WASM module, a C ABI driver, and Python bindings via PyO3 / maturin.

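use aozora::Document;

// Explicit ruby: the full-width bar ｜ (U+FF5C) marks where the base text starts.
let source = "｜青梅《おうめ》".to_owned();
let doc = Document::new(source);
let tree = doc.parse();

// Three views of the same parse: rendered HTML, canonical notation, diagnostics.
let html: String = tree.to_html();
let canonical: String = tree.serialize();
let diagnostics = tree.diagnostics();

// Well-formed input round-trips unchanged.
assert_eq!(canonical, "｜青梅《おうめ》");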

What Aozora Bunko is

Aozora Bunko (literally "Blue Sky Library") is a volunteer-run digital library of Japanese-language literature in the public domain. It has been operating since 1997 and currently distributes more than 18,000 works by Sōseki, Akutagawa, Dazai, and several thousand other authors whose copyrights have expired under Japanese law. Volunteers transcribe printed editions into plain text, proofread them, and post the results for free download.

It is not a government agency or a commercial publisher, and no editorial committee curates the collection. The closest English-language analogue is Project Gutenberg, with one important difference: Aozora Bunko texts use a domain-specific in-line annotation syntax to encode features that plain text cannot natively represent, namely ruby (furigana), emphasis dots, vertical or horizontal text orientation, and references to characters outside the basic Japanese character sets. Without that annotation language, a great deal of typographic information from the print edition would be lost.

What the notation looks like

A few examples of the notation Aozora Bunko texts contain, all of which aozora parses:

  • Ruby: ｜青梅《おうめ》 attaches a phonetic gloss to a base text run (the kanji 青梅, a place name, gets the reading "おうめ")
  • Bouten (emphasis dots): ［＃「ここに傍点」に傍点］ is the Japanese equivalent of italics, rendered as dots beside each character in vertical text (above each character in horizontal text)
  • Tate-chu-yoko: ［＃「23」は縦中横］ sets a short horizontal run inside vertical text (typically used for two-digit numbers)
  • Gaiji (out-of-character-set references): ※［＃「魚＋師」、第3水準1-94-37］ refers to a character outside the basic Japanese encodings, often via JIS X 0213 plane-row-cell coordinates
  • Kunten / kaeriten: classical Chinese reading marks
  • Indent containers: ［＃ここから2字下げ］…［＃ここで字下げ終わり］ apply block-scoped indentation
  • Page and section breaks: chapter delimiters
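The plane-row-cell coordinates in a gaiji reference are three decimal fields after the 水準 level marker. The following is a minimal std-only sketch of extracting them from a description such as 第3水準1-94-37. The names `MenKuTen` and `parse_men_ku_ten` are hypothetical and exist only for illustration; they are not part of aozora's API, and the real resolver also has to map the coordinates to Unicode.

```rust
/// JIS X 0213 plane-row-cell coordinates (hypothetical type, for illustration).
#[derive(Debug, PartialEq, Eq)]
struct MenKuTen {
    plane: u8,
    row: u8,
    cell: u8,
}

/// Extracts the trailing `plane-row-cell` triple from a gaiji description
/// such as "第3水準1-94-37". Returns None when the triple is missing,
/// malformed, or out of range.
fn parse_men_ku_ten(desc: &str) -> Option<MenKuTen> {
    // The coordinates follow the "水準" level marker.
    let (_, coords) = desc.rsplit_once("水準")?;
    let mut parts = coords.trim().split('-');
    let plane: u8 = parts.next()?.parse().ok()?;
    let row: u8 = parts.next()?.parse().ok()?;
    let cell: u8 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three fields
    }
    // JIS X 0213 defines two planes of 94 rows by 94 cells.
    let valid = (1..=2).contains(&plane)
        && (1..=94).contains(&row)
        && (1..=94).contains(&cell);
    valid.then_some(MenKuTen { plane, row, cell })
}
```

Calling `parse_men_ku_ten("第3水準1-94-37")` yields plane 1, row 94, cell 37; a description with no coordinates yields `None`, which a caller could turn into a diagnostic.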

aozora recognises every annotation that appears in a real Aozora Bunko .txt source and converts it into a structured AST plus a stream of diagnostics for malformed input.
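To make "structured AST plus a stream of diagnostics" concrete, here is a toy std-only sketch that handles a single feature, explicit ruby. It assumes the full-width bar ｜ (U+FF5C) that Aozora Bunko files use to open a ruby base. `Node` and `parse_ruby` are hypothetical names; aozora's real node types and diagnostics are richer, and this sketch ignores every other annotation.

```rust
/// A toy AST for one notation feature (hypothetical, for illustration).
#[derive(Debug, PartialEq, Eq)]
enum Node<'a> {
    Plain(&'a str),
    Ruby { base: &'a str, reading: &'a str },
}

/// Splits `src` into plain runs and explicit ruby of the form
/// ｜base《reading》. Malformed ruby stays in the plain text and is
/// reported with the byte offset of its opening bar.
fn parse_ruby(src: &str) -> (Vec<Node<'_>>, Vec<String>) {
    const BAR: char = '｜'; // U+FF5C, explicit ruby base start
    const OPEN: char = '《'; // U+300A
    const CLOSE: char = '》'; // U+300B

    let mut nodes = Vec::new();
    let mut diagnostics = Vec::new();
    let mut pos = 0; // start of the not-yet-emitted plain run
    let mut search = 0; // where to look for the next bar

    while let Some(rel) = src[search..].find(BAR) {
        let bar = search + rel;
        let base_start = bar + BAR.len_utf8();
        // Look for 《 after the bar, then 》 after the 《.
        let parsed = src[base_start..].find(OPEN).and_then(|o| {
            let open = base_start + o;
            let reading_start = open + OPEN.len_utf8();
            src[reading_start..]
                .find(CLOSE)
                .map(|c| (open, reading_start, reading_start + c))
        });
        match parsed {
            Some((open, reading_start, close)) => {
                if pos < bar {
                    nodes.push(Node::Plain(&src[pos..bar]));
                }
                nodes.push(Node::Ruby {
                    base: &src[base_start..open],
                    reading: &src[reading_start..close],
                });
                pos = close + CLOSE.len_utf8();
                search = pos;
            }
            None => {
                diagnostics.push(format!("unterminated ruby at byte {bar}"));
                search = base_start; // the bar stays in the plain run
            }
        }
    }
    if pos < src.len() {
        nodes.push(Node::Plain(&src[pos..]));
    }
    (nodes, diagnostics)
}
```

Malformed input is never dropped in this sketch: an unterminated ruby stays in the output as plain text and the problem surfaces only in the diagnostics list, which mirrors the AST-plus-diagnostics split described above.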

Installation and use

Pre-built CLI binaries for Linux x86_64, macOS arm64, and Windows x86_64 are attached to every GitHub Release as aozora-vX.Y.Z-<target>.{tar.gz,zip} archives, with SHA256SUMS alongside.

To build from source:

cargo install --git https://github.com/P4suta/aozora --locked aozora-cli

CLI subcommands cover the common tasks:

aozora check FILE.txt           # lex and report diagnostics
aozora fmt --check FILE.txt     # parse-then-serialize round-trip check
aozora render FILE.txt          # render to HTML on stdout
aozora check -E sjis FILE.txt   # read Shift_JIS, the Aozora Bunko native encoding

All subcommands accept - (or no path) to read from stdin.

The Rust library entry point is the snippet at the top of this article. Document owns a bumpalo arena; tree borrows from it for the lifetime of the Document. WASM, C ABI, and Python bindings have their own minimal examples in the handbook's Bindings chapters.

What is distinctive about the implementation

Three design decisions are worth highlighting. Each has its own dedicated chapter in the handbook's Architecture section.

Markdown is out of scope. Aozora Bunko notation is not CommonMark, GFM, or any other Markdown dialect; it is a separate annotation system designed for transcribing printed Japanese literature. aozora confines itself to that notation. If you want to write Markdown text that also contains Aozora Bunko annotations, the sibling project afm is a Markdown dialect built on top of aozora.

SIMD multi-pattern scanner. Aozora Bunko text is mostly UTF-8 Japanese (3 bytes per codepoint), with the seven trigger characters (｜《》※［］ and the full-width space, each a three-byte UTF-8 sequence) appearing roughly once per kilobyte of source. Scanning for that sparse pattern set efficiently is the dominant cost in lexing, so aozora uses the Teddy algorithm, which originated in Intel's Hyperscan, via the aho-corasick crate. On x86_64 with AVX2 it runs at ~12 GB/s on the corpus benchmark. On targets without AVX2 it falls back via runtime dispatch to a Hoehrmann-style multi-pattern DFA (~3.5 GB/s). The wasm32 target uses memchr's portable multi-pattern path (~1.2 GB/s) until the WebAssembly SIMD proposal stabilises.
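As a point of reference for what the scanner computes, here is a scalar std-only sketch that lists the byte offset of every trigger. It is not Teddy and makes no claim about aozora's internals; it assumes the triggers are the full-width forms used in Aozora Bunko files, and `TRIGGERS` and `scan_triggers` are hypothetical names. The structure, a cheap filter followed by an exact check, is the same shape a SIMD scanner applies to many bytes per instruction.

```rust
/// The seven trigger characters, assumed here to be the full-width forms.
/// Each encodes to exactly three bytes in UTF-8.
const TRIGGERS: [char; 7] = ['｜', '《', '》', '※', '［', '］', '\u{3000}'];

/// Returns the byte offset and character of every trigger in `src`.
/// Scalar reference version: lead-byte filter, then three-byte comparison.
fn scan_triggers(src: &str) -> Vec<(usize, char)> {
    // Pre-encode each trigger once.
    let encoded: Vec<([u8; 3], char)> = TRIGGERS
        .iter()
        .map(|&c| {
            let mut buf = [0u8; 3];
            c.encode_utf8(&mut buf);
            (buf, c)
        })
        .collect();

    let bytes = src.as_bytes();
    let mut hits = Vec::new();
    for (i, &b) in bytes.iter().enumerate() {
        // All seven triggers start with 0xE2, 0xE3, or 0xEF, so most
        // positions are rejected by this one comparison.
        if !matches!(b, 0xE2 | 0xE3 | 0xEF) {
            continue;
        }
        // Verify the full three-byte sequence against each pattern.
        if let Some(window) = bytes.get(i..i + 3) {
            if let Some(&(_, c)) = encoded.iter().find(|(pat, _)| pat.as_slice() == window) {
                hits.push((i, c));
            }
        }
    }
    hits
}
```

The lead-byte filter alone is weak on Japanese text, because 0xE3 also begins every hiragana and katakana character, so most positions still reach the verify step. That is one reason a multi-byte fingerprint approach such as Teddy is attractive for this workload.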

Borrowed-arena AST. A typical Aozora Bunko work parses to roughly 50,000 AST nodes. Allocating each one as an individual Box<Node> makes the allocator the bottleneck. aozora uses a single bumpalo::Bump arena per Document; the parser writes nodes into it in lex order and the tree borrows from the arena for its entire lifetime. On a corpus sweep over the full Aozora Bunko archive the arena variant is 6.4× faster than the equivalent Box<Node> shape, with peak RSS reduced by 30%. Drop is one Bump::reset call regardless of node count.
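The ownership shape described here can be sketched with the standard library alone. The example below swaps bumpalo for a pre-sized `Vec` so that it compiles without dependencies, and its "parser" is a toy that turns each line into a leaf. `Doc`, `Tree`, and `Node` are hypothetical stand-ins, not aozora's types. What it is meant to show: the document owns the source, the tree borrows from it, every node lives in one backing store, and containers refer to children by position rather than through a per-node `Box`.

```rust
use std::ops::Range;

/// Nodes borrow their text from the source instead of owning Strings.
#[derive(Debug, PartialEq, Eq)]
enum Node<'src> {
    Plain(&'src str),
    /// Children occupy a contiguous run of slots in the node store.
    Indent { width: u8, children: Range<usize> },
}

/// Owns the source text (a stand-in for the article's `Document`).
struct Doc {
    source: String,
}

/// All nodes live in one Vec; dropping the tree frees them together.
struct Tree<'src> {
    nodes: Vec<Node<'src>>,
}

impl Doc {
    fn new(source: String) -> Self {
        Doc { source }
    }

    /// Toy parser: one Plain leaf per line, wrapped in a single Indent
    /// container that is written last. The Vec is sized up front, so the
    /// node store is one allocation however many nodes there are.
    fn parse(&self) -> Tree<'_> {
        let mut nodes = Vec::with_capacity(self.source.lines().count() + 1);
        for line in self.source.lines() {
            nodes.push(Node::Plain(line));
        }
        let leaf_count = nodes.len();
        nodes.push(Node::Indent { width: 2, children: 0..leaf_count });
        Tree { nodes }
    }
}

impl<'src> Tree<'src> {
    /// The root is the last node written.
    fn root(&self) -> &Node<'src> {
        self.nodes.last().expect("parse always pushes a root")
    }

    /// Resolves a child range to the nodes it covers.
    fn children(&self, range: &Range<usize>) -> &[Node<'src>] {
        &self.nodes[range.clone()]
    }
}
```

Because `Tree<'_>` borrows from `Doc`, the compiler rejects any use of the tree after the document is dropped, which is the same guarantee the borrowed-arena design relies on.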

Where to go next

  • Handbook: https://p4suta.github.io/aozora/ — notation reference, architecture (borrowed-arena AST, SIMD scanner backends, Shift_JIS + gaiji resolver, Eytzinger-layout sorted-set lookup), performance (PGO pipeline, samply workflow, corpus sweep), and bindings.
  • API reference: https://p4suta.github.io/aozora/api/aozora/ — auto-deployed rustdoc.
  • Related projects:
    • P4suta/afm — CommonMark + GFM + Aozora Bunko notation, a unified Markdown dialect built on top of aozora.
    • P4suta/aozora-tools — authoring tools: formatter, LSP server, tree-sitter grammar, VS Code extension.

Publication to crates.io is gated on the v1.0 API freeze. Until then, depend on a tagged commit (the install chapter of the handbook keeps the current Cargo.toml snippet).
