Module Sexplib0__Raw_grammar

Representation of S-expression grammars

Goals and non-goals

Functionality goals: With post-processing, sexp grammars can be pretty-printed in a human-readable format and provides enough information to implement completion and validation tools.

Performance goals: @@deriving sexp_grammar adds minimal overhead and introduces no toplevel side effect. The compiler can lift the vast majority of ASTs generated by @@deriving sexp_grammar as global constants. Common sub-grammars are usually shared, particularly when they derive from multiple applications of the same functor.

Non-goals: Stability, although we will make changes backwards-compatible or at least provide a reasonable upgrade path.

In what follows, we describe how this is achieved.

Encoding of generated grammars to maximize sharing

A group contains the grammars for all types of a mutually recursive group of OCaml type declarations.

To ensure maximum sharing, a group is split into two parts:

  • The generic_group depends only on the textual type declarations. Where the type declaration refers to an existing concrete type, the generic group takes a variable to represent the grammar of that type. This means that the compiler can lift each type declaration in the source code to a shared global constant.
  • The group binds the type variables of the generic_group, either to concrete grammars where the type declaration refers to a concrete type, or to another variable where the type declaration itself was polymorphic.

To understand this point better, imagine the following type declaration

type t = X of u

were explicitly split into its generic_group and group parts:

type 'u t_generic = X of 'u
type t = u t_generic

If u came from a functor argument, it's easy to see that t_generic would be exactly the same in all applications of the functor and only t would vary. The grammar of t_generic, which is the biggest part, would be shared between all applications of the functor.

Processing of grammars

The Raw_grammar.t type optimizes for performance over ease of use. To help users process the raw grammars into a more usable form, we keep two identifiers in the generated grammars:

  • The generic_group_id uniquely identifies a generic_group. It is a hash of the generic group itself. (It is okay that this scheme would conflate identical type declarations, because the resulting generic groups would be identical as well.)
  • The group_id uniquely identifies a group. It is a unique integer, generated lazily so that we don't create a side effect at module creation time.

The exact processing would depend on the final application. We expect that a typical consumer of sexp grammars would define less-indirected equivalents of the t and group types, possibly re-using the _ type_ and Atom.t types.

type label = string

The label of a field, constructor, or constant.

type generic_group_id = string
type group_id = Sexplib0__.Lazy_group_id.t
type var_name = string

Variable names. These are used to improve readability of the printed grammars. Internally, we use numerical indices to represent variables; see Implicit_var below.

type type_name = string
module Atom : sig ... end

A grammatical type which classifies atoms.

type 't type_ =
| Any

Any list or atom.

| Apply of 't type_ * 't type_ list

Assign types to (explicit) type variables.

| Atom of Atom.t

An atom, in particular one of the given Atom.t.

| Explicit_bind of var_name list * 't type_

In Bind ([ "a"; "b" ], Explicit_var 0), Explicit_var 0 is "a". One must bind all available type variables: free variables are not permitted.

| Explicit_var of int

Indices for type variables, e.g. 'a, introduced by polymorphic definitions.

Unlike de Bruijn indices, these are always bound by the nearest ancestral Explicit_bind.

| Grammar of 't

Embeds other types in a grammar.

| Implicit_var of int

Indices for type constructors, e.g. int, in scope. Unlike de Bruijn indices, these are always bound by the implicit_vars of the nearest enclosing generic_groups.

| List of 't sequence_type

A list of a certain form. Depending on the sequence_type, this might correspond to an OCaml tuple, list, or embedded record.

| Option of 't type_

An optional value. Either syntax recognized by option_of_sexp is supported: (Some 42) or (42) for a value and None or () for no value.

| Record of 't record_type

A list of lists, representing a record of the given record_type. For validation, Record recty is equivalent to List [Fields recty].

| Recursive of type_name

A type in the same mutually recursive group, possibly the current one.

| Union of 't type_ list

Any sexp matching any of the given types. Variant should be preferred when possible, especially for complex types, since validation and other algorithms may behave exponentially.

One useful special case is Union [], the empty type. This is occasionally generated for things such as abstract types.

| Variant of 't variant_type

A sexp which matches the given variant_type.

A grammatical type which classifies sexps. Corresponds to a non-terminal in a context-free grammar.

and 't sequence_type = 't component list

A grammatical type which classifies sequences of sexps. Here, a "sequence" may mean either a list on its own or, say, the sexps following a constructor in a list matching a variant_type.

Certain operations may greatly favor simple sequence types. For example, matching List [ Many type_ ] is easy for any type type_ (assuming type_ itself is easy), but List [ Many type1; Many type2 ] may require backtracking. Grammars derived from OCaml types will only have "nice" sequence types.

and 't component =
| One of 't type_

Exactly one sexp of the given type.

| Optional of 't type_

One sexp of the given type, or nothing at all.

| Many of 't type_

Any number of sexps, each of the given type.

| Fields of 't record_type

A succession of lists, collectively defining a record of the given record_type. The fields may appear in any order. The number of lists is not necessarily fixed, as some fields may be optional. In particular, if all fields are optional, there may be zero lists.

Part of a sequence of sexps.

and 't variant_type = {
ignore_capitalization : bool;

If true, the grammar is insensitive to the case of the first letter of the label. This matches the behavior of derived sexp_of_t functions.

alts : (label * 't sequence_type) list;

An association list of labels (constructors) to sequence types. A matching sexp is a list whose head is the label as an atom and whose tail matches the given sequence type. As a special case, an alternative whose sequence is empty matches an atom rather than a list (i.e., label rather than (label)). This is in keeping with generated t_of_sexp functions.

As a workaround, to match (label) one could use ("label", [ Optional (Union []) ]).

}

A tagged union of grammatical types. Grammars derived from OCaml variants will have variant types.

and 't record_type = {
allow_extra_fields : bool;
fields : (label * 't field) list;
}

A collection of field definitions specifying a record type. Consists only of an association list from labels to fields.

and 't field = {
optional : bool;

If true, the field is optional.

args : 't sequence_type;

A sequence type which the arguments to the field must match. An empty sequence is permissible but would not be generated for any OCaml type.

}

A field in a record.

type t =
| Ref of type_name * group
| Inline of t type_
and group = {
gid : group_id;
generic_group : generic_group;
origin : string;

origin provides a human-readable hint as to where the type was defined.

For a globally unique identifier, use gid instead.

See ppx/ppx_sexp_conv/test/expect/test_origin.ml for examples.

apply_implicit : t list;
}
and generic_group = {
implicit_vars : var_name list;
ggid : generic_group_id;
types : (type_name * t type_) list;
}