puml-syntax Shared Syntax + Highlighting Specification
Mirror of docs/specs/puml_syntax_highlighting_spec(1).md. The in-repo file is the source of truth.
One syntax contract for every editor, Markdown renderer, browser surface, LSP, agent pack, and documentation pipeline.
This is the grammar layer. It is not a nice-to-have. It is the reason every surface can make puml feel native instead of bolted on.
Name
Package family:
- Rust/library: puml-syntax
- Tree-sitter grammar: tree-sitter-puml
- VS Code/TextMate grammar artifact: puml.tmLanguage.json
- Highlight query package: @puml/syntax
Language ID:
puml
Accepted aliases:
plantuml
uml-sequence
puml-sequence
File extensions:
.puml
.plantuml
.iuml
Code fences:
```puml
```plantuml
```puml-sequence
```uml-sequence
puml is the canonical fence. The rest are compatibility aliases.
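The alias and extension lists above can be collapsed into one registry lookup. The following is a minimal sketch in Python, not the shipped implementation: the function names are hypothetical, and case-insensitive matching is an assumption this spec does not state.

```python
from pathlib import PurePosixPath
from typing import Optional

CANONICAL_ID = "puml"
# Values copied from the "Accepted aliases" and "File extensions" lists.
ALIASES = {"puml", "plantuml", "uml-sequence", "puml-sequence"}
EXTENSIONS = {".puml", ".plantuml", ".iuml"}


def language_id_for_fence(info_string: str) -> Optional[str]:
    """Return the canonical ID for a fence info string, or None if it is not puml."""
    words = info_string.strip().split()
    if not words:
        return None
    # Assumption: only the first word of the info string names the language.
    return CANONICAL_ID if words[0].lower() in ALIASES else None


def language_id_for_path(path: str) -> Optional[str]:
    """Return the canonical ID for a file path, based on its extension."""
    suffix = PurePosixPath(path).suffix.lower()
    return CANONICAL_ID if suffix in EXTENSIONS else None
```

Every alias resolves to the single canonical ID, so downstream consumers never branch on the alias that was typed.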
Product position
puml-syntax is the shared language surface.
It powers:
- syntax highlighting in editors
- Markdown code fence highlighting
- semantic tokens from the language server
- browser editor highlighting in puml-studio
- VS Code, Cursor, Windsurf, Zed, Neovim, Helix, JetBrains bridges
- Codex and Claude authoring tools that need token-aware repair
- docs, examples, generated screenshots, and fixture visualization
No surface gets its own private grammar. No product owns a shadow parser. Syntax drift is treated as a product bug.
Clarification: Tree-sitter, not Treehouse
The engine referred to throughout this spec is Tree-sitter, not Treehouse.
Tree-sitter is the incremental parser/highlighting engine used by many editors and code hosts. In this stack, Tree-sitter is used for fast syntax highlighting and embedded-language support. It is not the renderer, not the semantic model, and not the source of truth for diagram correctness.
The source of truth remains puml-core.
Non-negotiables
- One syntax taxonomy for all products.
- One canonical language ID: puml.
- TextMate grammar exists because VS Code needs immediate lexical highlighting.
- Tree-sitter grammar exists because modern editors and hosted renderers need robust incremental highlighting.
- LSP semantic tokens exist because regex highlighting cannot resolve aliases, participant identity, includes, lifecycle state, or semantic errors.
- No feature ships unless syntax coverage ships with it.
- No regex-only product surface.
- No editor-specific primitive names.
- No Markdown renderer with a different tokenizer.
- No web editor with a different tokenizer.
- No agent prompt that invents its own syntax categories.
- No syntax highlighter that performs filesystem IO.
- No highlighter expands includes.
- No highlighter runs the preprocessor.
- No highlighter executes directives.
- Highlighting must be safe on hostile input.
- 90% line coverage minimum for syntax tooling code.
- 100% primitive coverage across fixture snapshots.
Architecture religion
Use a layered grammar contract and treat it as law.
puml language spec
-> puml-core parser and AST
-> token taxonomy
-> syntax fixtures
-> TextMate grammar
-> Tree-sitter grammar + queries
-> LSP semantic token legend
-> editor / Markdown / browser consumers
Rules:
- puml-core decides what is valid.
- puml-syntax decides how valid and partially valid source is tokenized.
- TextMate is a lexical bootstrap layer only.
- Tree-sitter is an incremental syntax layer only.
- LSP semantic tokens are the semantic layer.
- TextMate, Tree-sitter, and semantic tokens must share the same taxonomy.
- Every taxonomy change updates every grammar artifact in the same PR.
- Every grammar artifact is tested against the same fixture corpus.
- Syntax highlighting never affects parse or render output.
- Syntax tooling must tolerate malformed source better than the parser.
- Diagnostics come from puml-core/puml-language, not from the highlighter.
- If the TextMate grammar, Tree-sitter grammar, and parser disagree on a fixture, the build fails until the disagreement is explained and snapshotted.
Repository layout
crates/
  puml-core/
  puml-language/
  puml-syntax/
packages/
  puml-syntax/
    package.json
    README.md
    grammars/
      puml.tmLanguage.json
    tree-sitter-puml/
      grammar.js
      tree-sitter.json
      queries/
        highlights.scm
        injections.scm
        locals.scm
        folds.scm
        indents.scm
    themes/
      puml-light.json
      puml-dark.json
    fixtures/
      valid/
      invalid/
      markdown/
      snapshots/
crates/puml-syntax owns Rust-side token classification and semantic token translation.
packages/puml-syntax owns editor-consumable grammar artifacts.
Language scopes
Canonical TextMate root scope:
source.puml
Specific scopes:
comment.line.apostrophe.puml
constant.character.escape.puml
constant.language.directive.puml
constant.numeric.puml
entity.name.participant.puml
entity.name.alias.puml
entity.name.section.puml
invalid.illegal.puml
keyword.control.group.puml
keyword.control.lifecycle.puml
keyword.control.note.puml
keyword.declaration.participant.puml
keyword.other.skinparam.puml
keyword.other.include.puml
markup.heading.title.puml
markup.raw.message.puml
punctuation.definition.comment.puml
punctuation.definition.string.begin.puml
punctuation.definition.string.end.puml
punctuation.separator.comma.puml
storage.type.participant-kind.puml
string.quoted.double.puml
string.unquoted.puml
support.constant.color.puml
support.function.arrow.puml
variable.other.participant-ref.puml
Do not overfit to one theme. Use standard-ish scopes where possible, but make puml-specific scopes available for exact styling.
Semantic token legend
Semantic token types:
namespace
class
type
variable
parameter
property
enumMember
function
method
keyword
modifier
comment
string
number
regexp
operator
decorator
label
participant
action
message
note
group
lifecycle
style
directive
alias
Semantic token modifiers:
declaration
definition
reference
implicit
readonly
defaultLibrary
deprecated
invalid
unresolved
ambiguous
created
destroyed
activated
deactivated
self
found
lost
generated
Mapping rules:
- Explicit participant names: participant declaration.
- Participant aliases in declarations: alias declaration.
- Participant references in messages: participant reference.
- Auto-created participants: participant implicit.
- Unknown references: participant unresolved invalid.
- Ambiguous references: participant ambiguous invalid.
- Message labels: message.
- Notes: note.
- Group labels: group.
- Lifecycle commands: lifecycle keyword.
- Arrows: operator plus self, found, or lost when applicable.
- Unsupported non-sequence syntax: invalid.
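On the wire, LSP semantic tokens are a flat integer array with five values per token: line delta, start-character delta, length, token type index, and a modifier bitset. The sketch below encodes tokens against the legend above. It assumes the legend is registered in exactly the order listed here, which this spec does not state; the `encode` function and its tuple layout are hypothetical.

```python
# Assumption: legend order on the wire matches the order listed in this spec.
TOKEN_TYPES = [
    "namespace", "class", "type", "variable", "parameter", "property",
    "enumMember", "function", "method", "keyword", "modifier", "comment",
    "string", "number", "regexp", "operator", "decorator", "label",
    "participant", "action", "message", "note", "group", "lifecycle",
    "style", "directive", "alias",
]
TOKEN_MODIFIERS = [
    "declaration", "definition", "reference", "implicit", "readonly",
    "defaultLibrary", "deprecated", "invalid", "unresolved", "ambiguous",
    "created", "destroyed", "activated", "deactivated", "self", "found",
    "lost", "generated",
]


def encode(tokens):
    """Encode (line, start, length, type_name, modifier_names) tuples.

    Positions are zero-based. Each token becomes five integers, with the
    line and start character stored relative to the previous token.
    """
    data = []
    prev_line, prev_start = 0, 0
    for line, start, length, type_name, modifiers in sorted(
        tokens, key=lambda token: (token[0], token[1])
    ):
        delta_line = line - prev_line
        # The start is relative only when the token stays on the same line.
        delta_start = start - prev_start if delta_line == 0 else start
        bits = 0
        for name in modifiers:
            bits |= 1 << TOKEN_MODIFIERS.index(name)
        data += [delta_line, delta_start, length, TOKEN_TYPES.index(type_name), bits]
        prev_line, prev_start = line, start
    return data
```

For `Alice -> Bob: hello` on line 3, where Bob is an unknown reference, the first three tokens encode as follows:

```python
tokens = [
    (3, 0, 5, "participant", ["reference"]),
    (3, 6, 2, "operator", []),
    (3, 9, 3, "participant", ["unresolved", "invalid"]),
]
print(encode(tokens))
# [3, 0, 5, 18, 4, 0, 6, 2, 15, 0, 0, 3, 3, 18, 384]
```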
Token taxonomy
Every primitive in the sequence spec must be tokenized.
Document block directives
Required tokens:
@startuml
@enduml
@startuml name
@startuml "Display Name"
Rules:
- Outside-block content is tokenized as comment or source.ignored.puml depending on host support.
- Multiple blocks in one file are highlighted independently.
- Unterminated blocks produce invalid ranges without breaking the rest of the file.
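The block rules above can be sketched as a single line scan. This is an illustrative Python sketch under one stated assumption: a second @startuml before any @enduml closes the previous block as unterminated. That recovery policy, and the `find_diagram_blocks` name, are not defined by this spec.

```python
def find_diagram_blocks(source: str):
    """Return (start_line, end_line, terminated) tuples, zero-based and inclusive."""
    blocks = []
    open_start = None
    lines = source.splitlines()
    for index, line in enumerate(lines):
        words = line.strip().split(maxsplit=1)
        first = words[0] if words else ""
        if first == "@startuml":
            if open_start is not None:
                # Assumed recovery: a new start closes the dangling block.
                blocks.append((open_start, index - 1, False))
            open_start = index
        elif first == "@enduml" and open_start is not None:
            blocks.append((open_start, index, True))
            open_start = None
    if open_start is not None:
        # Unterminated at end of file: report the range, do not raise.
        blocks.append((open_start, len(lines) - 1, False))
    return blocks
```

An unterminated block yields a bounded range flagged `terminated=False`, so later blocks in the same file still highlight normally.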
Comments
Supported:
' full-line comment
Alice -> Bob: hello ' inline comment
Alice -> Bob: "don't break quoted text"
Rules:
- Apostrophe starts a comment outside quoted strings.
- Apostrophe inside quoted strings is text.
- Comments never create semantic tokens.
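A minimal sketch of the apostrophe rule, in Python. It assumes double quotes toggle string state and that there is no escape sequence for a quote inside a string, since this spec does not define one; `comment_start` is a hypothetical helper.

```python
from typing import Optional


def comment_start(line: str) -> Optional[int]:
    """Return the index where a comment begins, or None if the line has none."""
    in_string = False
    for index, char in enumerate(line):
        if char == '"':
            # Assumption: quotes are never escaped inside strings.
            in_string = not in_string
        elif char == "'" and not in_string:
            return index
    return None
```

With an unclosed quote the scan stays in string state to the end of the line, so the rest of the line degrades to text rather than being misread as a comment.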
Participants
Tokenize declarations for:
participant
actor
boundary
control
entity
database
collections
queue
Tokenize modifiers:
as
order
#color
Tokenize participant groups:
box "Frontend"
box "Backend" #e0f2fe
end box
Messages and arrows
Every arrow operator is a first-class operator token:
->
-->
<-
<--
->>
-->>
<<-
<<--
->x
x->
-x
->o
o->
<->
<-->
-[#red]>
-[#ff0000]>
-[#red,dashed]>
-[#red,bold]>
Lifecycle suffixes attached to messages are tokenized separately:
++
--
**
!!
Message label separator:
:
Rules:
- Arrow color/style bracket is part of arrow syntax but exposes nested color/style tokens.
- Empty labels remain valid.
- Multiline escaped labels tokenize \n as escape.
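The bracket rule can be illustrated by splitting a styled arrow into nested tokens. This Python sketch covers only the `-[...]>` shape from the list above. The token names `color` and `style` and the `arrow_style_tokens` helper are illustrative, not the grammar's real capture names.

```python
import re

# Matches only the bracketed form, for example -[#red,dashed]>
STYLED_ARROW = re.compile(r"^-\[(?P<body>[^\]]*)\]>$")


def arrow_style_tokens(arrow: str):
    """Return (kind, text) pairs for the entries inside an arrow style bracket."""
    match = STYLED_ARROW.match(arrow)
    if match is None:
        return []
    tokens = []
    for part in match["body"].split(","):
        part = part.strip()
        if part.startswith("#"):
            tokens.append(("color", part))
        elif part:
            tokens.append(("style", part))
    return tokens
```

A plain arrow such as `->` returns an empty list, so callers can treat it as one operator token.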
Self, found, and lost messages
Tokenize:
Alice -> Alice: internal
[-> Alice: found
Alice ->]: lost
[--> Alice: dashed found
Alice -->>]: async lost
Rules:
- [ and ] are endpoint punctuation tokens.
- Found/lost semantics come from LSP semantic tokens, not TextMate.
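As a rough illustration of how the semantic layer could derive the self, found, and lost modifiers, the Python sketch below classifies a message line by its endpoints. It handles only the `->`, `-->`, `->>`, and `-->>` arrows and compares raw names. A real implementation would resolve aliases first. The pattern and the `arrow_modifiers` name are assumptions for this example.

```python
import re

MESSAGE = re.compile(
    r"""^\s*
    (?P<left>\[|\w+)\s*        # source participant, or [ for a found message
    (?P<arrow>-{1,2}>{1,2})    # subset of arrows handled by this sketch
    \s*(?P<right>\]|\w+)       # target participant, or ] for a lost message
    \s*(?::\s*(?P<label>.*))?$ # optional label after the separator
    """,
    re.VERBOSE,
)


def arrow_modifiers(line: str):
    """Return semantic modifiers for the arrow, or None if not a message line."""
    match = MESSAGE.match(line)
    if match is None:
        return None
    left, right = match["left"], match["right"]
    if left == "[":
        return ["found"]
    if right == "]":
        return ["lost"]
    if left == right:
        # Assumption: name equality stands in for resolved participant identity.
        return ["self"]
    return []
```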
Lifecycle
Tokenize:
activate Alice
activate Alice #ff0000
deactivate Alice
destroy Alice
create Bob
return
return value
autoactivate on
autoactivate off
Notes
Tokenize inline and multiline:
note left of Alice: text
note right of Alice: text
note over Alice, Bob: text
note across: text
note left of Alice
text
end note
hnote over Alice: hex note
rnote over Bob: rounded note
Rules:
- note, hnote, rnote, left, right, over, across, of, and end note are structural tokens.
- Note body is text, not parsed as directives.
- Colors after placement are color tokens.
References
Tokenize:
ref over Alice: external flow
ref over Alice, Bob
external flow
end ref
Groups
Tokenize:
alt condition
else other condition
opt condition
loop condition
par branch
break condition
critical condition
group label
group label [secondary label]
end
Rules:
- else is a branch separator token.
- Nested groups must highlight cleanly even before semantic validation.
Dividers, delays, spacers
Tokenize:
== Initialization ==
...
...5 minutes later...
|||
||45||
Autonumber
Tokenize:
autonumber
autonumber 10
autonumber 10 5
autonumber "<b>[000]"
autonumber stop
autonumber resume
autonumber resume 100
autonumber off
Titles, headers, footers, captions, legends
Tokenize:
title Login flow
title
Login flow
end title
header text
footer text
caption text
legend
text
end legend
legend left
legend right
Styling
Tokenize supported skinparam primitives:
skinparam backgroundColor #ffffff
skinparam sequence {
ArrowColor #111827
LifeLineBorderColor #9ca3af
}
skinparam sequenceArrowColor #111827
Rules:
- Known-but-unsupported skinparams are styled as style directives and receive LSP warnings.
- Unknown garbage inside skinparam sequence {} gets invalid syntax highlighting only when structurally impossible.
Includes and preprocessing
Tokenize:
!include path
!include ./relative/path.puml
!define NAME value
!undef NAME
!theme name
Rules:
- Syntax highlighter does not load includes.
- LSP resolves includes.
- Remote include directives tokenize but receive diagnostics from semantic validation.
Pages
Tokenize:
newpage
newpage Title for next page
TextMate grammar contract
TextMate grammar is required for immediate VS Code highlighting.
File:
packages/puml-syntax/grammars/puml.tmLanguage.json
Rules:
- JSON format only.
- Deterministic key ordering.
- No generated unreadable blob unless the generator is checked in.
- Regexes must be documented with comments in the generator or adjacent source.
- It must highlight partially typed files without waiting for LSP.
- It must not attempt semantic resolution.
- It must not parse includes.
- It must not treat every unquoted word as a participant declaration.
- It must prefer useful degraded highlighting over false precision.
Required grammar tests:
- Full fixture corpus scope snapshots.
- Partial-line editing snapshots.
- Broken quote snapshots.
- Broken block snapshots.
- Markdown fenced block injection snapshots.
- Large-file smoke test with 10,000 message lines.
Tree-sitter grammar contract
Tree-sitter grammar is required for robust cross-editor syntax.
Files:
packages/puml-syntax/tree-sitter-puml/grammar.js
packages/puml-syntax/tree-sitter-puml/tree-sitter.json
packages/puml-syntax/tree-sitter-puml/queries/highlights.scm
packages/puml-syntax/tree-sitter-puml/queries/injections.scm
packages/puml-syntax/tree-sitter-puml/queries/locals.scm
packages/puml-syntax/tree-sitter-puml/queries/folds.scm
packages/puml-syntax/tree-sitter-puml/queries/indents.scm
Tree-sitter scope:
source.puml
Tree-sitter file types:
puml
plantuml
iuml
Required node families:
document
ignored_text
diagram_block
start_directive
end_directive
participant_declaration
participant_kind
participant_reference
alias_declaration
message
arrow
arrow_head
arrow_line
arrow_style
message_label
note
note_body
ref_block
group_block
group_branch
lifecycle_event
divider
delay
spacer
autonumber
title_block
header
footer
caption
legend
skinparam
include_directive
preprocessor_directive
newpage
comment
string
color
number
error
Rules:
- Grammar accepts incomplete source.
- Grammar is optimized for incremental editing, not final semantic validation.
- Query captures map to the same taxonomy as TextMate and LSP semantic tokens.
- locals.scm captures participant declarations and references where possible.
- folds.scm captures diagram blocks, notes, refs, groups, title blocks, legends, and skinparam blocks.
- indents.scm captures notes, refs, groups, boxes, legends, title blocks, and skinparam blocks.
Markdown injection contract
Markdown code fences for puml and compatibility aliases must highlight as source.puml.
Supported host surfaces:
- VS Code Markdown editor
- VS Code Markdown preview support package
- puml-studio embedded Markdown panes
- static docs generated by puml-markdown
- docs examples in README and website
Rules:
- Syntax highlighting is independent from rendering.
- A fence can be highlighted even if rendering fails.
- A rendered diagram can show diagnostics even if highlighting is unavailable.
- Code fences use the same language ID registry as file extensions.
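Locating the fence bodies that should highlight as source.puml can be sketched as a line scan. This Python sketch only handles backtick fences of exactly three characters and ignores nesting, tildes, and indentation rules. The `puml_fences` helper is hypothetical. An unclosed fence is treated as running to the end of the document, which matches CommonMark behaviour.

```python
FENCE = "`" * 3
# Same registry as the language ID aliases.
FENCE_ALIASES = {"puml", "plantuml", "puml-sequence", "uml-sequence"}


def puml_fences(markdown: str):
    """Return (first_body_line, end_line_exclusive) ranges for puml fences."""
    regions = []
    lines = markdown.splitlines()
    open_info = None
    body_start = 0
    for index, line in enumerate(lines):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if open_info is None:
            words = stripped[len(FENCE):].split()
            open_info = words[0] if words else ""
            body_start = index + 1
        elif stripped == FENCE:
            if open_info in FENCE_ALIASES:
                regions.append((body_start, index))
            open_info = None
    if open_info is not None and open_info in FENCE_ALIASES:
        # Unclosed fence: highlight through the end of the document.
        regions.append((body_start, len(lines)))
    return regions
```

The scan never renders or parses the body, which keeps highlighting independent from rendering as required above.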
Error highlighting contract
Malformed source must remain navigable.
Required degraded-highlighting cases:
- missing @enduml
- missing end note
- missing end ref
- missing end
- unclosed quote
- malformed arrow
- malformed color
- malformed skinparam block
- dangling else
- invalid participant alias
- non-sequence syntax inside a diagram block
Rules:
- Highlight errors locally.
- Never flood the entire rest of the file as invalid unless the structure is truly unrecoverable.
- Semantic errors are marked by LSP diagnostics, not by TextMate.
Theme contract
Ship default light and dark highlight themes for screenshots and tests.
Do not force themes in editors. Provide suggested token colors only.
Theme tokens:
puml.directive
puml.participant
puml.alias
puml.arrow
puml.message
puml.note
puml.group
puml.lifecycle
puml.style
puml.include
puml.comment
puml.string
puml.color
puml.error
Fixtures
Every syntax fixture mirrors the renderer fixture suite.
Required fixture groups:
basic/
participants/
arrows/
notes/
groups/
lifecycle/
styling/
structure/
preprocessor/
markdown/
errors/
partial/
large/
Every valid fixture snapshots:
- Parser AST from puml-core.
- Rust token stream from puml-syntax.
- TextMate scope ranges.
- Tree-sitter parse tree.
- Tree-sitter highlight captures.
- LSP semantic token ranges.
Every invalid fixture snapshots:
- Degraded token stream.
- Tree-sitter error nodes.
- TextMate scopes.
- LSP diagnostics.
Drift detection
Add a required test binary:
cargo test -p puml-syntax syntax_contract
It must:
- parse every fixture with puml-core
- tokenize every fixture with puml-syntax
- run TextMate grammar snapshots through a grammar test harness
- run Tree-sitter parse and query snapshots
- compare token categories across all three systems
- fail if a known primitive is missing from any layer
No fixture can be marked “expected mismatch” without a written reason in the snapshot metadata.
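The last two checks reduce to set comparisons. Below is a hedged Python sketch of that logic. The real test lives in the Rust crate, and the metadata field names `expected_mismatch` and `reason` are assumptions made for this example.

```python
def missing_primitives(known_primitives, layers):
    """Map each layer name to the sorted primitives it fails to cover."""
    gaps = {}
    for layer_name, covered in layers.items():
        missing = sorted(set(known_primitives) - set(covered))
        if missing:
            gaps[layer_name] = missing
    return gaps


def unexplained_mismatches(snapshot_metadata):
    """Return fixture names marked as expected mismatches without a written reason."""
    return sorted(
        name
        for name, meta in snapshot_metadata.items()
        if meta.get("expected_mismatch") and not (meta.get("reason") or "").strip()
    )
```

A contract test would fail the build when either function returns a non-empty result.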
Browser editor contract
puml-studio may use one of two highlighting paths:
- LSP semantic tokens through puml-language compiled to WASM.
- Tree-sitter highlighting through tree-sitter-puml compiled for the browser.
Rules:
- Browser syntax highlighting must share the same token taxonomy.
- Browser syntax highlighting must not duplicate grammar rules in ad-hoc TypeScript.
- Browser syntax highlighting must continue working while WASM rendering is loading.
- Highlighting large files must not block the main thread.
Agent tooling contract
Agent-facing packages use syntax metadata for repair loops.
Required capabilities:
- extract participant declarations
- extract aliases
- identify unresolved participant references
- identify diagram block boundaries
- identify message rows
- identify group/note/ref block ranges
- identify syntax-invalid lines
Agents must never rely on prompt-only syntax descriptions when token metadata is available.
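The first two capabilities can be illustrated with a small extraction pass. This Python sketch is a regex approximation for illustration only. Real agent tooling would consume token metadata from puml-syntax rather than re-deriving it. It handles `kind Name` and `kind "Quoted Name" as Alias`, and the output field names are hypothetical.

```python
import re

# Declaration keywords from the Participants section.
KINDS = ("participant", "actor", "boundary", "control",
         "entity", "database", "collections", "queue")
DECLARATION = re.compile(
    r"^\s*(?P<kind>" + "|".join(KINDS) + r")\s+"
    r'(?P<name>"[^"]*"|\w+)'
    r"(?:\s+as\s+(?P<alias>\w+))?"
)


def extract_participants(source: str):
    """Return one dict per declaration with its line, kind, name, and alias."""
    found = []
    for line_number, line in enumerate(source.splitlines()):
        match = DECLARATION.match(line)
        if match:
            found.append({
                "line": line_number,
                "kind": match["kind"],
                "name": match["name"].strip('"'),
                "alias": match["alias"],  # None when no alias is declared
            })
    return found
```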
Security
Syntax tooling handles untrusted text.
Rules:
- No filesystem IO.
- No include expansion.
- No network access.
- No eval.
- No regex with catastrophic backtracking.
- No unbounded recursion.
- No panics on malformed Unicode.
- No raw HTML injection in generated highlighted output.
- No embedding of source text into HTML/SVG without escaping.
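The two escaping rules can be sketched with the standard library alone. This Python example assumes tokens arrive as sorted, non-overlapping character ranges. The `render_html` helper and the CSS class names are hypothetical.

```python
from html import escape


def render_html(source: str, tokens):
    """Wrap (start, end, css_class) ranges in spans, escaping all source text."""
    parts = []
    cursor = 0
    for start, end, css_class in tokens:
        # Text between tokens is escaped too, never copied raw.
        parts.append(escape(source[cursor:start]))
        parts.append(
            '<span class="{}">{}</span>'.format(
                escape(css_class, quote=True), escape(source[start:end])
            )
        )
        cursor = end
    parts.append(escape(source[cursor:]))
    return "".join(parts)
```

Hostile label text such as `<script>` comes out as inert character references because every slice of the source passes through `escape` before it reaches the output.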
Performance
Targets:
- Initial tokenization of a 1,000-message diagram: under 20ms in native tests.
- Incremental Tree-sitter edit response: under 5ms for common single-line edits.
- TextMate tokenization should not exhibit pathological behavior on 10,000-line files.
- Semantic token generation is linear in source size plus include graph size.
Benchmarks:
scripts/bench-syntax.sh
Benchmark cases:
- hello fixture
- 1,000-message fixture
- 10,000-message fixture
- pathological malformed quote fixture
- deeply nested group fixture
- Markdown file with 100 puml fences
Development commands
cargo fmt
cargo clippy --workspace -- -D warnings
cargo test -p puml-syntax
npm test --workspace @puml/syntax
npm run test:tree-sitter --workspace @puml/syntax
npm run test:textmate --workspace @puml/syntax
cargo llvm-cov --package puml-syntax --fail-under-lines 90
Definition of done
- Every sequence primitive has TextMate highlighting.
- Every sequence primitive has Tree-sitter grammar coverage.
- Every sequence primitive has highlight query coverage.
- Every sequence primitive has LSP semantic token mapping.
- Markdown fences highlight as source.puml.
- .puml, .plantuml, and .iuml files activate the language.
- Drift detection test passes across parser, Rust tokenizer, TextMate, Tree-sitter, and LSP semantic tokens.
- Invalid source highlights locally and remains editable.
- No regex catastrophes on large or hostile input.
- No syntax tool performs IO or network calls.
- 90% line coverage passes.
- Default clippy passes with warnings denied.
- The syntax package is usable by VS Code, Markdown renderer, SPA, LSP, and agent pack.
Reference docs checked
- VS Code uses TextMate grammars as its main tokenization engine and semantic tokens as an additional layer.
- VS Code language extensions can define language configuration for comments, brackets, folding, indentation, and word patterns.
- Tree-sitter supports syntax highlighting through grammar repositories and query files.
Reference URLs:
- https://code.visualstudio.com/api/language-extensions/semantic-highlight-guide
- https://code.visualstudio.com/api/language-extensions/syntax-highlight-guide
- https://code.visualstudio.com/api/language-extensions/language-configuration-guide
- https://tree-sitter.github.io/tree-sitter/3-syntax-highlighting.html
- https://tree-sitter.github.io/tree-sitter/using-parsers/queries/1-syntax.html