When handling delimited text, Evo prefers smart quoting over escape characters:
- Escape characters require the parser to mutate the input (remove escape prefixes), which prevents string sharing and hurts performance
- Escape characters also require the writer to mutate the output (insert escape prefixes), which increases the data size as well
Smart quoting avoids having to escape characters by using a quoting type adapted to the text being quoted.
Note: Manually applying smart quoting to text can be tricky in certain cases.
- Quoting Types
A "field" is the original text, which may contain whitespace, delimiters, and/or quote characters:
- Binary data is not supported here as it isn't text and may be impossible to correctly quote
- However, a quoted field may contain unprintable characters as well as UTF-8 multibyte characters
A "token" is a field as quoted or unquoted text, which is often followed by a delimiter acting as a separator for the next token:
- This text may contain delimiters and quote characters as well as any valid plain text
- If the field text contains the delimiter then it must be quoted, which effectively "escapes" the delimiter character
- If the field text begins or ends with literal quotes then it must be quoted, which effectively "escapes" those quotes
- If the field text begins or ends with either whitespace or unprintable characters then it must be quoted to correctly preserve the beginning and end
- If any combination of the above cases apply, then the field must be quoted with a quoting type that doesn't confuse the parser
- If none of the above cases apply, then quoting isn't required to parse the field
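The rules above for when a field must be quoted can be sketched as a small predicate. This is a minimal illustration in Python, not Evo's actual API: the name `needs_quoting` and the `QUOTE_CHARS` constant are hypothetical, and treating an empty field as not requiring quotes is an assumption the text above does not state.

```python
QUOTE_CHARS = "'\"`"  # the three quote characters used by the quoting types

def needs_quoting(field: str, delim: str = ",") -> bool:
    """Return True if the field must be quoted before being written as a token."""
    if not field:
        return False  # assumption: an empty field is written unquoted
    if delim in field:
        return True   # quoting "escapes" the delimiter
    for ch in (field[0], field[-1]):
        if ch in QUOTE_CHARS:
            return True  # literal quote at the beginning or end
        if ch.isspace() or not ch.isprintable():
            return True  # whitespace or unprintable char at the beginning or end
    return False

print(needs_quoting("can't"))    # False: the apostrophe is in the middle
print(needs_quoting("'bout"))    # True: begins with a quote character
print(needs_quoting("foo,bar"))  # True: contains the delimiter
```

The check only decides whether quoting is required; choosing which quoting type to use is a separate step.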
Single-char quoting:
- Single-quotes:
'foo bar'
- Double-quotes:
"foo bar"
- Backtick-quotes:
`foo bar`
Triple quoting (inspired by the Python language):
- Triple single-quotes:
'''foo bar'''
- Triple double-quotes:
"""foo bar"""
- Triple backtick-quotes:
```foo bar```
Following the above rules will correctly quote and escape text fields – see below for edge cases.
The quoting types other than single-char quoting (above) and the edge cases (mentioned below) are rare with normal text, but must be handled correctly when using smart quoting.
No quoting:
- If a token doesn't begin and end with quotes then it's treated as unquoted
- Quoting at end is determined by context: a quote followed by a delimiter, or followed by end of input
- There may be whitespace (spaces, tabs, newlines) between the quote and the delimiter (or end of input), in which case the field is still end-quoted – see the Formatting section below for an example
- The parser doesn't get confused when quote characters (or apostrophes) appear inside unquoted tokens – in this case with unquoted text, the parser just splits by delimiter
- Example comma-delimited fields that aren't quoted at all – these particular cases don't confuse the parser:
can't,won't,'bout
can't
won't
'bout
– dangerous if unquoted due to beginning apostrophe (single quote)
'not' quoted,also not 'quoted'
'not' quoted
– dangerous if unquoted due to beginning single quote
also not 'quoted'
– dangerous if unquoted due to ending single quote
- Example that will confuse the parser with words that begin or end with an apostrophe (single quote):
can't,'bout,runnin',jumpin'
can't
bout,runnin
– considered quoted, fix with actual quoting
jumpin'
– dangerous if unquoted, the ending apostrophe will confuse a reverse-parser
- Reverse-Parser:
bout,runnin',jumpin
– gets different tokens due to the ending apostrophe
can't
- Fixed:
can't,"'bout","runnin'","jumpin'"
can't
'bout
runnin'
jumpin'
- Tricky examples – quoting like this should be avoided:
'one'two','three'
– the second quote isn't an end-quote because it isn't followed by a delimiter (or end of input)
'''
– 1 quote char, not triple-quoted since there's no end quote
''''
– 2 quote chars, not triple-quoted since there's no end quote
Backtick-DEL quoting:
- This is a fallback when no other quoting will work (which is very rare) and uses the DEL char (ASCII code 7F, normally not printable but shown here as DEL or ␡)
- Here a backtick followed by a DEL char is used as a quote, and this pair is used at the beginning and the end, like with other quote types
- This assumes the quoted text is valid plain text (not binary data) – normal plain text doesn't use the DEL char at all, and is very unlikely to include a backtick-DEL pair followed by a delimiter (or end of input)
- This (and other quoting types) can still be used in combination with the DEL char as a delimiter
- Example with DEL delimiter and backtick-DEL quoting:
`␡foo bar`␡␡`␡stuff things`␡
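As a small illustration, the example above can be reproduced in Python. The helper names `quote_backtick_del` and `show` are hypothetical and exist only for this sketch; `show` replaces the unprintable DEL char with the ␡ symbol for display.

```python
DEL = "\x7f"       # ASCII code 7F
QUOTE = "`" + DEL  # the backtick-DEL pair used as the quote

def quote_backtick_del(field: str) -> str:
    """Wrap the field in backtick-DEL quotes at the beginning and the end."""
    return QUOTE + field + QUOTE

def show(text: str) -> str:
    """Make DEL visible by substituting the U+2421 symbol."""
    return text.replace(DEL, "\u2421")

# Two fields joined with DEL as the delimiter
line = DEL.join(quote_backtick_del(f) for f in ["foo bar", "stuff things"])
print(show(line))  # `␡foo bar`␡␡`␡stuff things`␡
```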
- Parsing
Parsers will look for a beginning quote, and if found try to find a matching end-quote followed by a delimiter or end of input.
- If no beginning quote, or if the end-quote isn't found, then the input is treated as unquoted (see above examples under "No quoting")
- Input that begins with a triple quote and ends with a single quote is treated as unquoted
- With triple end-quotes, any extra ending quote characters are kept as-is, i.e. included in field text
- Example with comma delim:
'''one'''',two
- The same applies to beginning quotes:
''''one''',two
- The parser must be aware of the delimiter that may follow the input
- A reverse-parser does the same thing in reverse, and either way will result in the same tokens as long as fields are quoted correctly when required
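The forward-parsing rules above can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not Evo's parser: `tokenize`, `next_token`, and `_is_end` are hypothetical names, an opening triple quote with no triple end-quote is treated as unquoted, whitespace before an opening quote is not handled, and the tricky cases listed under "No quoting" are not all reproduced. The sketch scans forward for the first matching end-quote that is followed by optional whitespace and then a delimiter or end of input, so it may differ from the real parser on edge cases.

```python
WHITESPACE = " \t\r\n"
DEL = "\x7f"
# Longest opening quotes are tried first so that ''' wins over '
QUOTES = ["`" + DEL, "'''", '"""', "```", "'", '"', "`"]

def _is_end(text, pos, delim):
    """True if only whitespace separates pos from a delimiter or the end of input."""
    while (pos < len(text) and text[pos] in WHITESPACE
           and not text.startswith(delim, pos)):
        pos += 1
    return pos == len(text) or text.startswith(delim, pos)

def next_token(text, start, delim=","):
    """Parse one token at start; return (field, next start index or None)."""
    for q in QUOTES:
        if text.startswith(q, start):
            body = start + len(q)
            end = text.find(q, body)
            # Skip candidate end-quotes not followed by a delimiter or end of input
            while end != -1 and not _is_end(text, end + len(q), delim):
                end = text.find(q, end + 1)
            if end != -1:
                nxt = text.find(delim, end + len(q))
                return text[body:end], (None if nxt == -1 else nxt + len(delim))
            break  # beginning quote without an end-quote: treat as unquoted
    nxt = text.find(delim, start)
    if nxt == -1:
        return text[start:], None
    return text[start:nxt], nxt + len(delim)

def tokenize(text, delim=","):
    """Split text into fields, honoring smart quoting."""
    fields, pos = [], 0
    while pos is not None:
        field, pos = next_token(text, pos, delim)
        fields.append(field)
    return fields

print(tokenize("can't,'bout,runnin',jumpin'"))  # second token is: bout,runnin
print(tokenize("'''one'''',two"))               # first token keeps the extra quote: one'
```

A reverse-parser would apply the same idea from the end of the input, looking for an end-quote first and then a beginning quote preceded by a delimiter or start of input.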
- Formatting
Formatters will check the field text and write it with the appropriate quoting type.
- Quoting type preference order:
single, double, backtick, triple-single, triple-double, triple-backtick, backtick-DEL
- If the field text begins or ends with a quote character, that character should not be used as the quoting type if possible, even though the result would still parse correctly
- Example input:
'foo,bar"
- Quoted (preferred):
`'foo,bar"`
- Quoted (avoid):
''foo,bar"'
– parsers should still handle this correctly
- Field text that contains a quoting type followed by the delimiter (or end of input) cannot be quoted with that quoting type
- Example with comma delim:
foo''',bar
– cannot be quoted with ' or '''
- Any whitespace (spaces, tabs, newlines) between a quote character and a delimiter is ignored when determining the quoting type to use
- Examples with comma delim:
foo', bar
– this cannot be quoted with a single-quote char since it contains this quote char followed by a delim
foo' , bar
– this cannot be quoted with a single-quote char since it contains this quote char followed by some whitespace then a delim
- Example using a comma delimiter that requires backtick-DEL quoting:
foo''',""",```,bar
- Quoted:
`␡foo''',""",```,bar`␡
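The selection of a quoting type described in this section can be sketched as follows. This is a minimal Python illustration, not Evo's formatter: `format_field`, `usable`, `touches_edge`, and `needs_quoting` are hypothetical names, an empty field is assumed to be written unquoted, and only forward parsing is considered. Raising `ValueError` when no quoting type fits is also an assumption.

```python
import re

DEL = "\x7f"
# Quoting types in the preference order given above
QUOTE_TYPES = ["'", '"', "`", "'''", '"""', "```", "`" + DEL]

def needs_quoting(field, delim):
    """True if the field contains the delimiter or has a risky first/last char."""
    if not field:
        return False  # assumption: an empty field is written unquoted
    if delim in field:
        return True
    ends = (field[0], field[-1])
    return any(c in "'\"`" or c.isspace() or not c.isprintable() for c in ends)

def usable(field, quote, delim):
    """False if the field contains the quote, optional whitespace, then the delimiter."""
    pattern = re.escape(quote) + r"\s*" + re.escape(delim)
    return re.search(pattern, field) is None

def touches_edge(field, quote):
    """True if the field begins or ends with the quote's character."""
    return field[0] == quote[0] or field[-1] == quote[0]

def format_field(field, delim=","):
    """Return the field as a token, quoted only when required."""
    if not needs_quoting(field, delim):
        return field
    candidates = [q for q in QUOTE_TYPES if usable(field, q, delim)]
    if not candidates:
        raise ValueError("field cannot be quoted with any quoting type")
    # Prefer a quoting type whose character is not at either edge of the field
    preferred = [q for q in candidates if not touches_edge(field, q)]
    quote = (preferred or candidates)[0]
    return quote + field + quote

print(format_field("'foo,bar\""))  # backtick-quoted, as in the preferred example
print(format_field("foo', bar"))   # double-quoted, since ' is followed by a delim
```

For the last example in this section, `format_field("foo''',\"\"\",```,bar")` falls through every other quoting type and ends up with backtick-DEL.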