7.10.3. PEG-03 — CSV Parser
This tutorial builds a CSV parser that demonstrates collection-oriented PEG features. You will learn:
- `*rule` (zero-or-more) and `+rule` (one-or-more) repetition
- `!rule` (negative lookahead)
- `any`, `EOL`, `TS` terminals
- `void?` pattern rules
- `string_` and `double_` built-in terminals
- The comma-separated list idiom
7.10.3.1. Repetition Operators
| Syntax | Description |
|---|---|
| `*rule` | Zero or more; collects into an array |
| `+rule` | One or more; collects into an array |
| `*rule as name` | Bind repeated matches to `name` |
| | Capture repeated text into a string |
| `MB(rule)` | Optional (zero or one) |
When `*rule` or `+rule` is bound with `as`, the result is an array of the rule's return type.
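As a small illustration of binding a repetition, the fragment below sketches a hypothetical `cell_run` rule that collects one or more cells. It assumes the fragment sits inside the same `parse` block as the `cell` rule and uses the `Row` typedef from this tutorial; the rule name is made up for the example.

```daslang
// Hypothetical rule, not part of the tutorial grammar.
var cell_run : Row                // Row is array<Cell>
rule(+cell as cells) {            // cell returns Cell, so cells is array<Cell>
    return <- cells               // move the array out, as the tutorial's row rule does
}
```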
7.10.3.2. Data Types
The parser produces typed rows using a variant and a typedef:
variant Cell {
    text : string
    number : double
}
typedef Row = array<Cell>
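To make the variant concrete, here is a minimal sketch of building and inspecting `Cell` values outside the grammar. The helpers `describe` and `demo_cells` are hypothetical, and the sketch assumes the usual daslang `is` and `as` operators for variants.

```daslang
// Hypothetical helper: render a Cell as text for debugging.
def describe(c : Cell) : string {
    if (c is text) {
        return "text: {c as text}"
    }
    return "number: {c as number}"
}

// Hypothetical demo: fill a Row by hand and print each cell.
def demo_cells() {
    var r : Row
    r |> push(Cell(text = "abc"))      // same constructor form the grammar uses
    r |> push(Cell(number = 2.5lf))    // the lf suffix marks a double literal
    for (c in r) {
        print("{describe(c)}\n")
    }
}
```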
7.10.3.3. The Comma-Separated List Pattern
The canonical PEG idiom for comma-separated lists is:
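list       -> *comma_item last_item
comma_item -> item ","
Each `comma_item` consumes an item together with the comma that follows it, so the repetition stops before the final item, which has no comma and is matched separately. This avoids ambiguity with trailing commas. Applied to CSV rows:
var row : Row
rule(TS, *comma_cell as cells, cell as last) {
    cells |> emplace(last)
    return <- cells
}

var comma_cell : Cell
rule(cell as c, TS, ",", TS) {
    return c
}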
7.10.3.4. The Grammar
def parse_csv(input : string;
              blk : block<(val : array<Row>; err : array<ParsingError>) : void>) {
    parse(input) {
        var csv : array<Row>
        rule(*newline_row as rows, last_row as last, MB(trailing_eol), EOF) {
            rows |> emplace(last)
            return <- rows
        }
        rule(EOF) {
            var empty_rows : array<Row>
            return <- empty_rows
        }

        var newline_row : Row
        rule(row as r, EOL) { return <- r }

        var last_row : Row
        rule(row as r) { return <- r }

        var row : Row
        rule(TS, *comma_cell as cells, cell as last) {
            cells |> emplace(last)
            return <- cells
        }

        var comma_cell : Cell
        rule(cell as c, TS, ",", TS) { return c }

        var cell : Cell
        rule(string_ as text, TS) { return Cell(text = text) }
        rule(double_ as value, TS) { return Cell(number = value) }

        var trailing_eol : bool
        rule(EOL) { return true }
    }
}
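A call site might look like the sketch below. It is an assumption-laden example rather than part of the tutorial: `demo_parse` and the input string are invented, the trailing-block call form is assumed to match the `blk` parameter declared above, and an empty `err` array is assumed to mean success.

```daslang
// Usage sketch; the input and the printing are illustrative only.
def demo_parse() {
    let input = "1, 2.5, \"abc\"\n3, 4"
    parse_csv(input) $(rows : array<Row>; errors : array<ParsingError>) {
        if (length(errors) != 0) {
            // Assumption: a non-empty error array signals a failed parse.
            print("parse failed with {length(errors)} error(s)\n")
        } else {
            for (r in rows) {
                print("row with {length(r)} cell(s)\n")
            }
        }
    }
}
```

Handing the result to a block keeps the row array owned by the parser call, so the caller only reads it for the duration of the block.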
7.10.3.5. Built-in Terminals
| Terminal | Description |
|---|---|
| `string_` | Matches a string literal |
| `double_` | Matches a floating-point number |
| | Matches a decimal integer |
| `any` | Matches any single character |
| `EOF` | End of input |
| `EOL` | End of line |
| | Zero or more whitespace (including newlines) |
| `TS` | Zero or more tabs/spaces (no newlines) |
7.10.3.6. Negative Lookahead
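`!rule` succeeds when `rule` does not match, and it consumes no input. This is useful for "match anything except" patterns. For single characters, a `not_set` terminal expresses the same kind of exclusion:
// Match any character other than '\n', '\r', or ';'
var expr_text : void?
rule(not_set('\n', '\r', ';')) {
    return null
}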
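The `!rule` operator itself can be combined with `any` to exclude a whole rule rather than a fixed character set. The fragment below is a sketch using only constructs listed at the top of this tutorial; the rule name `not_newline` is hypothetical.

```daslang
// Hypothetical pattern rule: one character that does not begin a line break.
var not_newline : void?
rule(!EOL, any) {      // !EOL checks without consuming; any then takes the character
    return null
}
```

Because the lookahead consumes nothing, `any` reads the same character that `!EOL` just inspected.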