DSDL formal grammar specification

While working on the DSDL chapter, I identified that it is difficult to come up with a sufficiently rigorous definition of the language without resorting to a formal context-free grammar specification. One of our objectives that we discussed in New DSDL directive for simple compile-time checks was to ensure that the language is to be defined in terms of a regular grammar (as opposed to context-free grammar) for reasons of simplicity and ease of parsing. An obvious conflict was discovered that the complexity of constant expressions cannot be described in terms of a regular grammar, so the plan was to describe the general high-level structure of the language using a regular grammar, and then define the expression syntax using a more sophisticated context-free grammar.

However, upon a closer look one will see that such duality only increases the complexity of the specification, making it difficult to define, maintain, and (most of all) consume. Hence, I decided to define the entirety of the language using one notation. Implementers will still be able to use regular grammars for high-level parsing if desired, we just won’t explicitly encourage that in the specification.

I have constructed such a definition using a modified extended Backus-Naur notation from the Parsimonious library. The notation was chosen because it is easy to validate the correctness of the grammar definition using an automated tool (the library), and also for its perceived simplicity. Here it is:

# Top-level rule: a definition is a collection of lines, the collection may be empty.
# Lines may contain at most one statement, possibly followed by whitespaces, possibly followed by a comment.
# Lines are terminated by an ASCII CR, LF, or CRLF sequence.
# Unterminated line is allowed at the end of the definition.
definition  = line*
line        = statement? _? comment? end_of_line
comment     = ~r"#[^\r\n]*"
end_of_line = ~r"\r?\n?"
_           = ~r"[ \t]+"

# Possible kinds of statements.
statement   = directive / service_response_marker / attribute
attribute   = constant / field

# Fields define data entries that form the message body.
field           = scalar_field / array_field / padding_field
scalar_field    = field_type_spec _ attribute_name
array_field     = fixed_size_array_field / variable_size_array_field
field_type_spec = (cast_mode _)? type_name
padding_field   = ~r"void\d\d?"
cast_mode       = "truncated" / "saturated"

# Arrays are a special kind of fields, their syntax is slightly more complex.
# The capacity specification expression must yield an integer.
fixed_size_array_field    = field_type_spec _? "[" _? expression _? "]" _? attribute_name
variable_size_array_field = field_type_spec _? "[" _? variable_size_array_capacity_spec _? expression _? "]" _?
                            attribute_name
variable_size_array_capacity_spec = variable_size_array_capacity_spec_inclusive /
                                    variable_size_array_capacity_spec_exclusive
variable_size_array_capacity_spec_inclusive = "<="  # Handled first, otherwise "=" would never be consumed.
variable_size_array_capacity_spec_exclusive = "<"

# Constants do not participate in data exchange.
# The expression must yield a compatible type.
constant = type_name _ attribute_name _? "=" _? expression

# Attribute dependency tokens; some of them are simply aliased for clarity.
attribute_name = name_component
type_name      = compound_name ("." type_version)?
type_version   = decimal_integer "." decimal_integer

# Separates service request from service response,
# and also specifies that the current definition is of a service kind.
service_response_marker = "---"

# Type names, attribute names, namespace names, and directive names all follow the same syntax rules.
compound_name  = name_component ("." name_component)*
name_component = ~r"[a-zA-Z_][a-zA-Z0-9_]*"

# Directives may have an expression, depending on their semantics.
# The expression must yield a type that is expected by the directive.
directive      = "@" _? directive_name (_ expression)?
directive_name = name_component

# The rest of the rules is used to define expressions. The aliasing is added for clarity.
# Since DSDL definitions cannot be used to describe runtime logic, all expressions are constant expressions,
# evaluated at the time of definition processing.
expression      = logical_or_ex
expression_list = expression (_? "," _? expression)*

# Operators. The precedence relations are expressed in the rules; the order here is from lower to higher.
# Operators that share common prefix (e.g. < and <=, // and /, and so on) are arranged so that the longest form
# comes first; otherwise it would be unreachable.
logical_or_ex     = logical_and_ex (_? "||" _? logical_and_ex)*
logical_and_ex    = logical_not_ex (_? "&&" _? logical_not_ex)*
logical_not_ex    = (_? "!" _? logical_not_ex) / comparison_ex
comparison_ex     = bitwise_or_ex (_? ("==" / ">=" / "<=" / "!=" / "<" / ">") _? bitwise_or_ex)*
bitwise_or_ex     = bitwise_xor_ex    (_? "|" _? bitwise_xor_ex)*
bitwise_xor_ex    = bitwise_and_ex    (_? "^" _? bitwise_and_ex)*
bitwise_and_ex    = logical_shift_ex  (_? "&" _? logical_shift_ex)*
logical_shift_ex  = additive_ex       (_? ("<<" / ">>") _? additive_ex)*
additive_ex       = multiplicative_ex (_? ("+" / "-") _? multiplicative_ex)*
multiplicative_ex = unary_ex          (_? ("*" / "//" / "/" / "%") _? unary_ex)*
unary_ex          = power_ex / (("-" / "+") _? unary_ex)
power_ex          = atom (_? "**" _? unary_ex)?

# An atom is a basic part of an expression.
atom = set / parenthetical / literal / compound_name

# A set is a notation used to represent mathematical sets.
set = "{" _? expression_list _? "}"

# Parenthetical expressions are used for managing precedence.
parenthetical = "(" _? expression _? ")"

# Literals. Reals are defined first to avoid ambiguities.
literal = real / integer / string

# Integers.
integer             = binary_integer / octal_integer / hexadecimal_integer / decimal_integer
binary_integer      = ~r"0[bB](_?(0|1))+"
octal_integer       = ~r"0[oO](_?[0-7])+"
hexadecimal_integer = ~r"0[xX](_?[0-9a-fA-F])+"
decimal_integer     = ~r"(0(_?0)*)+|([1-9](_?[0-9])*)"

# Reals (floats). Exponent notation is defined first to avoid ambiguities.
real                   = exponent_notation_real / point_notation_real
exponent_notation_real = (point_notation_real / real_digits) real_exponent
point_notation_real    = (real_digits? real_fraction) / (real_digits ".")
real_fraction          = "." real_digits
real_exponent          = ~r"[eE][+-]?" real_digits
real_digits            = ~r"[0-9](_?[0-9])*"

# String literals.
string               = single_quoted_string / double_quoted_string
single_quoted_string = ~r"'[^'\\]*(\\[^\r\n][^'\\]*)*'"
double_quoted_string = ~r'"[^"\\]*(\\[^\r\n][^"\\]*)*"'

This is an early draft which may be slightly modified if there are issues with it (I am still testing it).

I would like @kjetilkjeka especially to validate the definition. It is modeled after the Python language specification, taking into account our prior discussions on this forum and on Github.

1 Like

The original definition turned out to be flawed, so I fixed it and updated the OP post. We need a complete test suite to validate the definition, though; preferably before the specification is released.