
Adjustable type definitions for data exchange

Toward harmony between data types, programming languages and data formats

The purpose of this document is to introduce a series of upcoming tools for data exchange and the reasons for their existence. First published 2010-08-01.

1. Problem division
1.1 Ideal types and actual types
1.2 Accommodating multiple programming languages
1.3 Accommodating multiple data formats
2. Tools
2.1 ATD
2.2 Yojson
2.3 Biniou
2.4 Atdgen

1. Problem division

Experience shows that the problem of exchanging, storing and evolving data would benefit from being split into independent subproblems.

Although data usually have one canonical way of being thought of, a variety of technical constraints call for a variety of data formats and implementations.

The first problem is therefore to come up with a common language for describing data types without relying on features that are specific to a particular programming language or data format.

The second problem is to allow any programming language to join the party without having to redefine and reimplement the tools for describing the data types.

The third problem is to allow different serialization formats to be used to represent the same data.

Instead of pushing for a supposedly best combination of programming language and data format, we acknowledge that real-world data will be represented in different ways by different tools no matter what.

1.1 Ideal types and actual types

After a little bit of thinking and not so much tweaking, all data can be described using:

These are the core features allowed by the ATD syntax (ATD = "Adjustable Type Definitions"). In practice, each programming language may have a choice of different representations of a particular ATD type. The idea here is to allow annotations in the ATD file that can be used to specify language-specific options.

1.2 Accommodating multiple programming languages

Although the syntax of ATD is strongly based on the syntax of OCaml type definitions, and although the current tools using ATD are implemented in OCaml, the target language can be anything. OCaml shines when it comes to processing code and producing more code. The input language here is ATD, but the output can be anything: OCaml code, Java code, JSON, HTML documentation, etc.

It is of course possible to reimplement the tools of the atd package in another programming language at a reasonable cost (perhaps a few weeks to a few months of work for a clean job), but this is not expected to happen anytime soon. Using the atd library to build code generators for languages other than OCaml makes perfect sense and is something OCaml is well suited for. To date at MyLife we have been using the following tools, all based on ATD type specifications:

1.3 Accommodating multiple data formats

A variety of data formats (JSON, XML, etc.) can be used to represent data whose types are specified in ATD. The ATD syntax allows annotations in various places, which make it possible to adjust the basic ATD type definition to the idioms of the target language.

2. Tools

Several tools that make the ATD language relevant will be released around the same time. The following sections describe the features these tools offer.

2.1 ATD

atd is the OCaml library providing a parser for the ATD language and various utilities. ATD stands for Adjustable Type Definitions in reference to its main property of supporting annotations that allow a good fit with a variety of data formats.

(* This is an ATD file *)

type 'a tree = [ Node of ('a tree * 'a tree) | Leaf of 'a ]

type record = {
  name : string;
    (* Required field *)

  ~friends : string list <ocaml repr="array">;
    (* Optional field with a default value, by default the empty list.

       <ocaml repr="array"> is an annotation for the OCaml code generators
       that only they need to interpret. *)

  ?descr : string option;
    (* Optional field without a default value. *)

  tree : int tree;
}
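
To make the effect of the annotations more concrete, here is a hand-written sketch of OCaml types that could correspond to the ATD definitions above. It is an illustration, not the output of any tool: it assumes that sum types map to polymorphic variants, that <ocaml repr="array"> turns the list into an array, and that the ?descr field becomes an option. The helper names make_record and count_leaves are hypothetical.

```ocaml
(* Hand-written illustration; an actual code generator may emit
   something different. *)

(* Assumed mapping: ATD sum type -> OCaml polymorphic variant. *)
type 'a tree = [ `Node of ('a tree * 'a tree) | `Leaf of 'a ]

type record = {
  name : string;           (* required field *)
  friends : string array;  (* list stored as an array: <ocaml repr="array"> *)
  descr : string option;   (* optional field without a default value *)
  tree : int tree;
}

(* Hypothetical constructor: optional arguments play the role of the
   ~ and ? field prefixes in the ATD file. *)
let make_record ?(friends = [||]) ?descr ~name ~tree () =
  { name; friends; descr; tree }

(* Hypothetical helper: count the leaves of a tree. *)
let rec count_leaves (t : 'a tree) : int =
  match t with
  | `Node (left, right) -> count_leaves left + count_leaves right
  | `Leaf _ -> 1

let example = make_record ~name:"Alice" ~tree:(`Node (`Leaf 1, `Leaf 2)) ()
```

Here example.friends is the empty array and example.descr is None, mirroring the defaults described in the comments of the ATD file.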

The atd package provides:

An interesting use of ATD annotations is that <doc text=...> nodes can be used to specify comments applicable to the generated code. Such comments can be interpreted by the code generators and converted into ocamldoc-compliant or javadoc-compliant comments, allowing the production of quality HTML documentation.

2.2 Yojson

Yojson is an optimized parser, printer and pretty-printer for JSON. It addresses a few limitations of json-wheel and provides a number of low-level runtime functions that the code generated by atdgen relies on.

The main differences with json-wheel are:

2.3 Biniou

Biniou (pronounced "be new") is a new binary format largely equivalent to JSON since it has the following properties:

Field names and variant names are represented using 31-bit hashes like method names and polymorphic variants in OCaml.

Strings have no encoding requirement and are stored without any escaping.

Arrays of records can be represented using a specific representation called tables. A table does not repeat field information shared by all its records, resulting in space gains.
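
The table idea can be sketched in a few lines of OCaml. This is only an illustration of the principle (field names stored once, followed by rows of values); the types and function names below are hypothetical and are not the biniou library's API or its on-disk layout.

```ocaml
(* Hypothetical, simplified value type for the illustration. *)
type value = Int of int | Str of string

(* A self-describing record: each value carries its field name. *)
type assoc_record = (string * value) list

(* A table: the shared field names appear once, then one row per record. *)
type table = { fields : string list; rows : value list list }

(* Build a table when all records share the same fields in the same
   order; return None otherwise. *)
let table_of_records (records : assoc_record list) : table option =
  match records with
  | [] -> Some { fields = []; rows = [] }
  | first :: _ ->
      let fields = List.map fst first in
      if List.for_all (fun r -> List.map fst r = fields) records then
        Some { fields; rows = List.map (List.map snd) records }
      else None

(* Expand a table back into self-describing records. *)
let records_of_table (t : table) : assoc_record list =
  List.map (fun row -> List.combine t.fields row) t.rows
```

The space gain comes from the fact that the field names are written once per table instead of once per record.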

Biniou data typically take 25-30% less space than their JSON equivalent. biniou is the OCaml package that provides optimized readers, writers and pretty-printers for the biniou format. The library also provides the runtime functions used by the code generated by atdgen, as well as the buffer types used by yojson.
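
For readers unfamiliar with the 31-bit hashes mentioned earlier, the following sketch reproduces the hash that the OCaml compiler applies to polymorphic variant names. Whether biniou uses exactly this function, and how it packs the result into its binary encoding, is an assumption not confirmed by this document. The sketch also assumes a 64-bit OCaml, where native integers are wider than 31 bits.

```ocaml
(* Hash of a name, following the scheme used by the OCaml compiler for
   polymorphic variants. Assumes 64-bit OCaml. *)
let hash_variant (s : string) : int =
  let accu = ref 0 in
  for i = 0 to String.length s - 1 do
    accu := (223 * !accu) + Char.code s.[i]
  done;
  (* Keep only the low 31 bits. *)
  accu := !accu land ((1 lsl 31) - 1);
  (* Reinterpret the 31-bit value as a signed integer. *)
  if !accu > 0x3FFFFFFF then !accu - (1 lsl 31) else !accu
```

Because names are replaced by fixed-size hashes, two distinct names can in principle collide, so tools using this scheme need to detect collisions among the names of a given type.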

2.4 Atdgen

atdgen is a program that generates optimized OCaml code for reading and writing either biniou or JSON data. The generated code converts directly between byte buffers and the desired OCaml representation, without going through a generic tree as json-static does.

Benchmarks performed on an amd64-Linux machine for combined reading and writing show that: