aozora — a Rust parser for Aozora Bunko notation

aozora is a Rust parser for Aozora Bunko notation (青空文庫記法), the in-text annotation language used in .txt files distributed by Aozora Bunko. It does not handle CommonMark or any Markdown dialect — only the Aozora Bunko notation itself. The implementation ships as five distribution surfaces: a CLI binary, a Rust library, a WASM module, a C ABI driver, and Python bindings via PyO3 / maturin.

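use aozora::Document;

fn main() {
    // Document takes ownership of the source text.
    let source = "|青梅《おうめ》".to_owned();
    let doc = Document::new(source);
    // The tree borrows from `doc`, so `doc` must stay alive while `tree` is in use.
    let tree = doc.parse();

    let html: String = tree.to_html();        // rendered HTML
    let canonical: String = tree.serialize(); // source notation re-emitted from the tree
    let diagnostics = tree.diagnostics();     // problems reported for malformed input

    // Well-formed input round-trips unchanged.
    assert_eq!(canonical, "|青梅《おうめ》");
}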

What Aozora Bunko is

Aozora Bunko (literally "Blue Sky Library") is a volunteer-run digital library of Japanese-language literature in the public domain. It has been operating since 1997 and currently distributes more than 18,000 works by Sōseki, Akutagawa, Dazai, and several thousand other authors whose copyrights have expired under Japanese law. Volunteers transcribe printed editions into plain text, proofread them, and post the results for free download.

It is neither a government agency nor a commercial publisher, and no editorial committee curates the collection. The closest English-speaking analogue is Project Gutenberg, with one important difference: Aozora Bunko texts use a domain-specific inline annotation syntax to encode features that plain text cannot natively represent, namely ruby (furigana), emphasis dots, vertical/horizontal text orientation, and references to characters outside the text's base character set. Without that annotation language, a great deal of typographic information from the print edition would be lost.

What the notation looks like

A few examples of the notation Aozora Bunko texts contain, all of which aozora parses:

  • Ruby: |青梅《おうめ》 attaches a phonetic gloss to a base text run (here the place name 青梅 gets the reading "おうめ").
  • Bouten (emphasis dots): [#「ここに傍点」に傍点] marks emphasis, the Japanese counterpart of italics, rendered as dots beside each character in vertical text (above each character in horizontal text).
  • Tate-chu-yoko: [#「23」は縦中横] sets a short horizontal run inside vertical text (typically a two-digit number).
  • Gaiji (characters outside the character set): ※[#「魚+師」、第3水準1-94-37] refers to a character outside the basic Japanese encodings, often via JIS X 0213 plane-row-cell coordinates.
  • Kunten / kaeriten: reading marks for classical Chinese text.
  • Indent containers: [#ここから2字下げ]… [#ここで字下げ終わり] delimit block-scoped indentation.
  • Page and section breaks: chapter delimiters.
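The gaiji form in the list carries its JIS X 0213 coordinates as a trailing plane-row-cell triple. As a rough illustration of what a resolver has to pull out before any table lookup, here is a minimal standard-library sketch. It is not aozora's resolver: the type MenKuTen and the function parse_men_ku_ten are hypothetical names, and the sketch assumes the coordinates always follow the 水準 marker as they do in the example.

```rust
/// JIS X 0213 plane-row-cell coordinates (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
struct MenKuTen {
    plane: u8,
    row: u8,
    cell: u8,
}

/// Extract the trailing `plane-row-cell` triple from a gaiji annotation body
/// such as `「魚+師」、第3水準1-94-37`. Returns `None` if the shape does not match.
fn parse_men_ku_ten(body: &str) -> Option<MenKuTen> {
    // Assumption: the coordinates follow the level marker `水準`.
    let (_, coords) = body.rsplit_once("水準")?;
    let mut parts = coords.trim().split('-');
    let plane: u8 = parts.next()?.parse().ok()?;
    let row: u8 = parts.next()?.parse().ok()?;
    let cell: u8 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // more than three components
    }
    // JIS X 0213 defines two planes of 94 rows by 94 cells.
    let valid = (1..=2).contains(&plane)
        && (1..=94).contains(&row)
        && (1..=94).contains(&cell);
    valid.then_some(MenKuTen { plane, row, cell })
}

fn main() {
    let got = parse_men_ku_ten("「魚+師」、第3水準1-94-37");
    println!("{got:?}");
}
```

Mapping the triple to a Unicode code point is a separate table lookup that this sketch leaves out.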

aozora recognises every annotation that appears in a real Aozora Bunko .txt source and converts it into a structured AST plus a stream of diagnostics for malformed input.
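To make "structured AST plus diagnostics" concrete, the following is a deliberately tiny sketch that handles only explicit-base ruby and nothing else. It uses the standard library alone and is not how aozora is implemented; Node, parse, and serialize are hypothetical names, the bar is the same | character used in this article's examples, and a diagnostic is reduced to the byte offset of an unterminated bar.

```rust
/// A tiny AST covering only plain text and explicit-base ruby (hypothetical shape).
#[derive(Debug, PartialEq)]
enum Node<'a> {
    Plain(&'a str),
    Ruby { base: &'a str, reading: &'a str },
}

/// Parse `|base《reading》` spans; everything else is plain text.
/// Returns the nodes plus the byte offsets of `|` markers with no complete `《…》` after them.
fn parse(src: &str) -> (Vec<Node<'_>>, Vec<usize>) {
    let mut nodes = Vec::new();
    let mut diagnostics = Vec::new();
    let mut pos = 0; // start of text not yet turned into a node
    let mut search = 0; // where to look for the next `|`
    while let Some(bar) = src[search..].find('|').map(|i| i + search) {
        let after_bar = bar + '|'.len_utf8();
        let ruby = src[after_bar..].find('《').and_then(|open| {
            let open = after_bar + open;
            let reading_start = open + '《'.len_utf8();
            let close = reading_start + src[reading_start..].find('》')?;
            Some((open, reading_start, close))
        });
        match ruby {
            Some((open, reading_start, close)) => {
                if pos < bar {
                    nodes.push(Node::Plain(&src[pos..bar]));
                }
                nodes.push(Node::Ruby {
                    base: &src[after_bar..open],
                    reading: &src[reading_start..close],
                });
                pos = close + '》'.len_utf8();
                search = pos;
            }
            None => {
                // Malformed: leave the `|` in the plain text and report its offset.
                diagnostics.push(bar);
                search = after_bar;
            }
        }
    }
    if pos < src.len() {
        nodes.push(Node::Plain(&src[pos..]));
    }
    (nodes, diagnostics)
}

/// Write the nodes back out in source notation.
fn serialize(nodes: &[Node<'_>]) -> String {
    let mut out = String::new();
    for node in nodes {
        match node {
            Node::Plain(text) => out.push_str(text),
            Node::Ruby { base, reading } => {
                out.push('|');
                out.push_str(base);
                out.push('《');
                out.push_str(reading);
                out.push('》');
            }
        }
    }
    out
}

fn main() {
    let src = "東京都|青梅《おうめ》市";
    let (nodes, diagnostics) = parse(src);
    println!("{nodes:?} {diagnostics:?}");
    // Parse-then-serialize reproduces the input.
    assert_eq!(serialize(&nodes), src);
}
```

Because malformed markers stay in the plain text rather than being dropped, serializing the nodes reproduces the input even when diagnostics are reported.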

Installation and use

Pre-built CLI binaries for Linux x86_64, macOS arm64, and Windows x86_64 are attached to every GitHub Release as aozora-vX.Y.Z-<target>.{tar.gz,zip} archives, with SHA256SUMS alongside.

To build from source:

cargo install --git https://github.com/P4suta/aozora --locked aozora-cli

CLI subcommands cover the common tasks:

aozora check FILE.txt           # lex and report diagnostics
aozora fmt --check FILE.txt     # parse-then-serialize round-trip check
aozora render FILE.txt          # render to HTML on stdout
aozora check -E sjis FILE.txt   # read Shift_JIS, the Aozora Bunko native encoding

All subcommands accept - (or no path) to read from stdin.

The Rust library entry point is the snippet at the top of this article. Document owns a bumpalo arena; tree borrows from it for the lifetime of the Document. WASM, C ABI, and Python bindings have their own minimal examples in the handbook's Bindings chapters.
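The ownership relationship described here can be sketched with stand-in types. The definitions below are not the crate's real ones: they only mirror the method names from the snippet at the top of the article, and the stand-in Document holds just the source text where the real one also owns the arena.

```rust
/// Stand-in for the owner (the real type also owns the arena).
struct Document {
    source: String,
}

/// Stand-in for the parsed view: it only borrows from the `Document`.
struct Tree<'doc> {
    text: &'doc str,
}

impl Document {
    fn new(source: String) -> Self {
        Document { source }
    }

    /// The returned tree is tied to `&self`, so it cannot outlive the document.
    fn parse(&self) -> Tree<'_> {
        Tree { text: &self.source }
    }
}

impl Tree<'_> {
    fn serialize(&self) -> String {
        self.text.to_owned()
    }
}

fn main() {
    let doc = Document::new("|青梅《おうめ》".to_owned());
    let tree = doc.parse();
    // Moving or dropping `doc` here would be a compile error,
    // because `tree` is still used on the next line.
    assert_eq!(tree.serialize(), "|青梅《おうめ》");
}
```

In practice this means keeping the Document in scope for as long as the tree is needed, and converting to owned values such as the String results before letting it go.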

What is distinctive about the implementation

Three design decisions are worth highlighting. Each has its own dedicated chapter in the handbook's Architecture section.

Markdown is out of scope. Aozora Bunko notation is not CommonMark, GFM, or any other Markdown dialect — it is a separate annotation system designed for transcribing print Japanese literature. aozora confines itself to that notation. If you want to write Markdown text that also contains Aozora Bunko annotations, the sibling project afm is a Markdown dialect built on top of aozora.

SIMD multi-pattern scanner. Aozora Bunko text is mostly UTF-8 Japanese (3 bytes per code point), with the seven trigger characters (|《》※[] and full-width space) appearing roughly once per kilobyte of source. Scanning for that sparse pattern set efficiently is the dominant cost in lexing, so aozora uses Intel Hyperscan's Teddy algorithm via aho-corasick. On x86_64 with AVX2 it runs at ~12 GB/s on the corpus benchmark. On targets without AVX2 it falls back via runtime dispatch to a Hoehrmann-style multi-pattern DFA (~3.5 GB/s). The wasm32 target uses memchr's portable multi-pattern path (~1.2 GB/s) until the WebAssembly SIMD proposal stabilises.
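Whatever the backend, the scanner's job is to report where the trigger characters sit in the source. A scalar reference version makes that contract explicit. This is a standard-library sketch and not the Teddy, DFA, or memchr path described above; TRIGGERS and scan_triggers are hypothetical names, and the trigger set is written with the same characters this article uses.

```rust
/// The seven trigger characters named above; U+3000 is the full-width space.
const TRIGGERS: [char; 7] = ['|', '《', '》', '※', '[', ']', '\u{3000}'];

/// Scalar reference scan: return (byte offset, trigger) pairs in source order.
fn scan_triggers(src: &str) -> Vec<(usize, char)> {
    src.char_indices()
        .filter(|(_, c)| TRIGGERS.contains(c))
        .collect()
}

fn main() {
    let src = "※[#「魚+師」、第3水準1-94-37]";
    for (offset, trigger) in scan_triggers(src) {
        println!("{offset}: {trigger}");
    }
}
```

A slow version like this is mainly useful as a test oracle: any faster backend should return exactly the same offsets for the same input.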

Borrowed-arena AST. A typical Aozora Bunko work parses to roughly 50,000 AST nodes. Allocating each one as an individual Box<Node> makes the allocator the bottleneck. aozora uses a single bumpalo::Bump arena per Document; the parser writes nodes into it in lex order and the tree borrows from the arena for its entire lifetime. On a corpus sweep over the full Aozora Bunko archive the arena variant is 6.4× faster than the equivalent Box<Node> shape, with peak RSS reduced by 30%. Drop is one Bump::reset call regardless of node count.
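The idea can be approximated without bumpalo. The sketch below swaps in a simpler technique, a Vec-backed arena with index handles, to show the two properties that matter: nodes borrow their text from the source instead of owning Strings, and all nodes share one backing buffer. A bump allocator hands out references rather than indices, so this is an analogy and not aozora's code; Arena, NodeId, and Node are hypothetical names.

```rust
/// Index handle into the arena, standing in for the reference a bump allocator would return.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NodeId(usize);

/// Nodes borrow their text from the source buffer instead of owning `String`s.
#[derive(Debug, PartialEq)]
enum Node<'src> {
    Plain(&'src str),
    Ruby { base: &'src str, reading: &'src str },
}

/// All nodes live in one contiguous buffer.
struct Arena<'src> {
    nodes: Vec<Node<'src>>,
}

impl<'src> Arena<'src> {
    fn with_capacity(capacity: usize) -> Self {
        Arena { nodes: Vec::with_capacity(capacity) }
    }

    /// Append a node in parse order and return its handle.
    fn alloc(&mut self, node: Node<'src>) -> NodeId {
        self.nodes.push(node);
        NodeId(self.nodes.len() - 1)
    }

    fn get(&self, id: NodeId) -> &Node<'src> {
        &self.nodes[id.0]
    }
}

fn main() {
    let source = String::from("東京都|青梅《おうめ》");
    // One up-front allocation sized for the whole tree.
    let mut arena = Arena::with_capacity(50_000);
    let mut last = None;
    for _ in 0..25_000 {
        // The same slices are reused here; a real parser would slice as it lexes.
        arena.alloc(Node::Plain(&source[0..9]));
        last = Some(arena.alloc(Node::Ruby {
            base: &source[10..16],
            reading: &source[19..28],
        }));
    }
    let last = last.expect("loop ran at least once");
    println!("{} nodes, last = {:?}", arena.nodes.len(), arena.get(last));
    // Dropping `arena` frees the single backing buffer in one deallocation.
}
```

With Box<Node> and owned Strings, the same tree would need at least one heap allocation per node and a matching free on drop, which is the cost the arena design removes.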

Where to go next

  • Handbook: https://p4suta.github.io/aozora/ — notation reference, architecture (borrowed-arena AST, SIMD scanner backends, Shift_JIS + gaiji resolver, Eytzinger-layout sorted-set lookup), performance (PGO pipeline, samply workflow, corpus sweep), and bindings.
  • API reference: https://p4suta.github.io/aozora/api/aozora/ — auto-deployed rustdoc.
  • Related projects:
    • P4suta/afm — CommonMark + GFM + Aozora Bunko notation, a unified Markdown dialect built on top of aozora.
    • P4suta/aozora-tools — authoring tools: formatter, LSP server, tree-sitter grammar, VS Code extension.

Publication to crates.io is gated on the v1.0 API freeze. Until then, depend on a tagged commit (the install chapter of the handbook keeps the current Cargo.toml snippet).
