(Outdated wiki content from 2007.)

Solid Specification

This page is essentially a 'specification' for Solid. It is a work in progress; we would like your input in order to better understand what is needed before actually building the Solid tool.

Solid is: a set of assumptions

So, what is Solid? First, it is a set of assumptions about what constitutes a well-formed SFM (Standard-Format Markers) file. The Solid tool will only work on a file that meets these conditions:
  • It is a text file containing one or more records, each of which begins with the same field (the record marker). Conceptually, an SFM file represents a tree structure of fields and data, similarly to XML.
  • Each 'object' in the file begins with a newline (or, for the first object, the start of the file) immediately followed by a backslash, a field marker, and a single space. If that 'object' contains any simple text content, that content immediately follows the space and ends at the next newline+backslash.
  • The end of the 'object' itself (as opposed to its simple text content) is determined by a schema. Unlike XML, the tree structure of an SFM file is not self-evident without a schema.
  • The first few (two?) lines of the file can provide metadata about the 'database' file as a whole, rather than providing actual record data.

This is an example of the contents of a well-formed SFM file. It contains two records:

\lx aman
\ps adj
\ge safe
\de Free from danger. Safe.
\lx keamanan
\ps 
\ge safety
\de The condition of not being in danger.
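
The assumptions above are enough to split a file into records without a schema. The following is a minimal Python sketch, not the Solid implementation: the function name parse_sfm and the default record marker "lx" are assumptions made for illustration, and any fields that precede the first record marker are simply set aside as file-level metadata.

```python
import re

def parse_sfm(text, record_marker="lx"):
    """Split SFM text into (header_fields, records).

    Each field is a (marker, content) pair; each record is a list of fields
    that starts with the record marker. Fields found before the first record
    marker are returned separately as file-level metadata.
    """
    if not text.startswith("\\"):
        raise ValueError("not well-formed: the file must begin with a backslash")
    header, records = [], []
    # A field starts at a backslash at the start of the text or right after a newline.
    for chunk in re.split(r"(?:^|\n)\\", text)[1:]:
        parts = chunk.split(None, 1)
        if not parts:
            raise ValueError("backslash without a field marker")
        marker = parts[0]
        content = parts[1].rstrip() if len(parts) > 1 else ""
        if marker == record_marker:
            records.append([])
        (records[-1] if records else header).append((marker, content))
    return header, records

SAMPLE = r"""\lx aman
\ps adj
\ge safe
\de Free from danger. Safe.
\lx keamanan
\ps
\ge safety
\de The condition of not being in danger.
"""

header, records = parse_sfm(SAMPLE)
for record in records:
    print(record)
```

Note that this only recovers the flat sequence of fields in each record; the tree structure still requires a schema.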

The following file is NOT well-formed, because the first line begins with a space rather than a backslash. If we deleted the opening space, it would technically be valid, but would contain only one record and one field (the first lx field), since no later line begins with a backslash. Toolbox helps prevent this kind of problem, but a plain text editor does not.
 \lx aman
 \ps adj
   \ge safe
   \de Free from danger. Safe.
 \lx keamanan
  \ps 
   \ge safety
   \de The condition of not being in danger.
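
The effect of the leading spaces can be checked mechanically. This is a small Python sketch under the assumptions stated earlier (a field starts only at a newline immediately followed by a backslash); the helper name split_fields is hypothetical.

```python
def split_fields(text):
    """Return the raw fields of an SFM text, without their leading backslashes."""
    if not text.startswith("\\"):
        raise ValueError("not well-formed: the first line does not begin with a backslash")
    # Only a newline immediately followed by a backslash starts a new field.
    return text[1:].split("\n\\")

INDENTED = "\n".join([
    r" \lx aman",
    r" \ps adj",
    r"   \ge safe",
    r"   \de Free from danger. Safe.",
])

try:
    split_fields(INDENTED)
except ValueError as error:
    print(error)

# Deleting only the opening space leaves a single field that swallows the rest.
fields = split_fields(INDENTED[1:])
print(len(fields))
```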

Solid is: a way to define schemas

Second, Solid is a specific way of defining a schema that describes the tree structure of one or more SFM files, as well as constraints on the contents of individual fields, so that those SFM files can be validated.

  • Solid assumes that a 'Solid schema' is defined using a ??? schema (the Flex import-mapping XML file? a Relax NG schema? a Schematron schema? an XML Schema schema?). This schema describes a tree structure that (i.e. it applies to an SFM file), but .

Solid is: a tool to edit schemas and fix up data

Third, Solid provides software and a user interface that can validate a specific SFM file against a Solid schema. It will either report that the SFM file is 100% compliant, or it will provide an exception report. The exceptions will be grouped and totalled by type, so that the user can page through sets of similar problems and diagnose each one. A problem will typically be fixed in one of these ways:

  • Editing the schema to match the SFM file.
  • Manually editing the SFM file to match the schema.
  • Transforming the SFM file (by a rule) to match the schema.
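
As an illustration of an exception report that is grouped and totalled by type, here is a rough Python sketch. Since the schema format is still undecided, the 'schema' here is only a stand-in consisting of two sets (known_markers and nonempty_markers), and the two exception types are assumptions chosen for the example.

```python
from collections import defaultdict

def exception_report(records, known_markers, nonempty_markers):
    """Group problems by type; an empty result means the file is compliant."""
    report = defaultdict(list)
    for index, record in enumerate(records):
        for marker, content in record:
            if marker not in known_markers:
                report[("unknown marker", marker)].append(index)
            elif marker in nonempty_markers and not content:
                report[("empty field", marker)].append(index)
    return dict(report)

RECORDS = [
    [("lx", "aman"), ("ps", "adj"), ("ge", "safe"),
     ("de", "Free from danger. Safe.")],
    [("lx", "keamanan"), ("ps", ""), ("ge", "safety"),
     ("de", "The condition of not being in danger.")],
]

report = exception_report(
    RECORDS,
    known_markers={"lx", "ps", "ge", "de"},   # stand-in schema (assumption)
    nonempty_markers={"lx", "ps"},            # stand-in schema (assumption)
)
# One total per exception type, so similar problems can be paged through together.
totals = {kind: len(hits) for kind, hits in report.items()}
print(totals)
```

Keeping the record indexes under each type is what would let a user interface page through all instances of one kind of problem.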

Rough development plan: Initially, the Solid tool will only provide functionality for task 1 above (editing the schema). Next, functionality for task 2 (manually editing the SFM file) will be added, and finally (beginning with version 1.0?), functionality for task 3 (rule-based transforms) will be provided little by little in manageable chunks. One of the most essential transforms will be the Node Inference transform, which makes the implicit structure of an underspecified record explicit, as in the following example.

FROM:

\lx ngingkandi
\ge eat
\gn makan

TO:
\lx ngingkandi
\sn
\ps
\ge eat
\gn makan

SINCE IT MEANS:
lx: ngingkandi
    sn:
        ps:
            ge: eat
            gn: makan
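
One possible way to express this inference, sketched in Python: the schema is reduced to a child-to-parent lookup, and an empty parent field is inserted whenever a field's ancestors are not already open. The names infer_parents and MDF_PARENT are hypothetical, and the lookup covers only the markers in this example.

```python
def infer_parents(fields, parent_of):
    """Insert empty parent fields wherever a field's ancestors are not already open."""
    out = []
    open_path = []  # markers from the record root down to the most recent field
    for marker, content in fields:
        # Follow child -> parent links to list the ancestors, root first.
        ancestors = []
        node = marker
        while node in parent_of:
            node = parent_of[node]
            ancestors.insert(0, node)
        # Count how many of those ancestors are already open.
        keep = 0
        while (keep < len(ancestors) and keep < len(open_path)
               and open_path[keep] == ancestors[keep]):
            keep += 1
        out.extend((missing, "") for missing in ancestors[keep:])
        out.append((marker, content))
        open_path = ancestors + [marker]
    return out

# Assumed fragment of the MDF hierarchy, taken from the tree shown above.
MDF_PARENT = {"sn": "lx", "ps": "sn", "ge": "ps", "gn": "ps"}

FROM = [("lx", "ngingkandi"), ("ge", "eat"), ("gn", "makan")]
TO = infer_parents(FROM, MDF_PARENT)
for marker, content in TO:
    print(f"\\{marker} {content}".rstrip())
```

Running the function on its own output changes nothing, which matches the idea that TO is already structurally complete.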

The Node Inference transform(s?) needed for transforming underspecified MDF into (structurally) complete MDF will be provided, along with two schemas: Underspecified MDF (under which the FROM and TO above are both valid) and Specified MDF (under which TO is valid, but FROM is not). Note that 'complete MDF' would still allow for optional items ('objects'), but no implicit parent objects would be allowed. For example, even though the existence of an example sentence object as a whole is optional, the \rf field is required for any example sentence that does exist. Thus, in addition to inferring sn and ps, the Node Inference transform from Underspecified MDF to Specified MDF would also need to infer rf:

FROM:

\lx ngingkandi
\sn
\ps
\ge eat
\xv Sina ngingkandi.
\xe He is eating. Or: He has eaten.
\xv Sira mingkandi.
\xe They are going to eat.

TO:
\lx ngingkandi
\sn
\ps
\ge eat
\rf
\xv Sina ngingkandi.
\xe He is eating. Or: He has eaten.
\rf
\xv Sira mingkandi.
\xe They are going to eat.

SINCE IT MEANS:
lx: ngingkandi
    sn:
        ps:
            ge: eat
            rf:
                xv: Sina ngingkandi.
                xe: He is eating. Or: He has eaten.
            rf:
                xv: Sira mingkandi.
                xe: They are going to eat.
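
Inferring rf needs one rule beyond a plain child-to-parent lookup: the second \xv has to open a new rf rather than join the one that is already open. The Python sketch below handles this with an assumed once_per_parent set (markers that may occur only once inside their parent). That rule, the function name, and the lookup table are illustrative assumptions, not part of MDF or of any existing tool.

```python
def infer_parents(fields, parent_of, once_per_parent=()):
    """Insert empty parent fields, opening a new parent when a once-only child repeats."""
    out = []
    stack = []  # open objects, root first, as [marker, child markers seen so far]
    for marker, content in fields:
        # Follow child -> parent links to list the ancestors, root first.
        ancestors = []
        node = marker
        while node in parent_of:
            node = parent_of[node]
            ancestors.insert(0, node)
        keep = 0
        while (keep < len(ancestors) and keep < len(stack)
               and stack[keep][0] == ancestors[keep]):
            keep += 1
        # If the open parent already holds this once-only child, treat it as
        # complete and close it, so that a fresh parent is inferred below.
        if (ancestors and keep == len(ancestors)
                and marker in once_per_parent and marker in stack[keep - 1][1]):
            keep -= 1
        del stack[keep:]
        for missing in ancestors[keep:]:
            out.append((missing, ""))
            if stack:
                stack[-1][1].add(missing)
            stack.append([missing, set()])
        if stack:
            stack[-1][1].add(marker)
        stack.append([marker, set()])
        out.append((marker, content))
    return out

# Assumed fragment of the MDF hierarchy, taken from the tree shown above.
MDF_PARENT = {"sn": "lx", "ps": "sn", "ge": "ps", "rf": "ps", "xv": "rf", "xe": "rf"}

FROM = [
    ("lx", "ngingkandi"), ("sn", ""), ("ps", ""), ("ge", "eat"),
    ("xv", "Sina ngingkandi."), ("xe", "He is eating. Or: He has eaten."),
    ("xv", "Sira mingkandi."), ("xe", "They are going to eat."),
]
TO = infer_parents(FROM, MDF_PARENT, once_per_parent={"xv"})
print([marker for marker, _ in TO])
```

Whether 'one \xv per \rf' belongs in the schema or in the transform's parameters is exactly the kind of design question raised in the note that follows.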

NOTE: Although we have just described this inference as involving two 'pure' schemas and an inference transform, it may be desirable to instead make the schema more powerful. The Flex import process does this by allowing infer-parent rules to be specified directly in the schema. Regardless of the underlying representation, we should certainly consider presenting this transform-plus-schema combination to the user as simply a schema.

It is not yet clear how best to represent the transforms on disk. XSLT is one option, although we suspect that XSLT cannot handle the inference transform described above. One approach would be to hard-code the inference transformation (configurable by two parameters: child, parent), soft-code the other transforms as XSLT, and use a common interface for presenting their parameters to the user (Phil). If there were an "Advanced..." button for directly editing the XSLT, it would need to be disabled for the inference transform.
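
To make the hybrid idea concrete, here is a speculative Python sketch of a common parameter interface shared by hard-coded and XSLT-backed transforms. The class, its field names, and the 'Rename marker' transform are all hypothetical; the XSLT body is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Transform:
    """A transform as the user interface might see it (hypothetical shape)."""
    name: str
    parameters: dict = field(default_factory=dict)
    xslt_source: Optional[str] = None  # None means the transform is hard-coded

    @property
    def advanced_editable(self) -> bool:
        # The "Advanced..." button only makes sense when there is XSLT to edit.
        return self.xslt_source is not None

# Hard-coded inference transform, configured by its two parameters.
infer_rf = Transform("Node Inference", {"child": "xv", "parent": "rf"})

# A soft-coded transform; the name and parameters are made up for illustration.
rename = Transform("Rename marker", {"old": "gn", "new": "gr"},
                   xslt_source="<!-- XSLT omitted -->")

for transform in (infer_rf, rename):
    print(transform.name, transform.parameters, transform.advanced_editable)
```

With this shape, the interface can list parameters for every transform in the same way and enable the "Advanced..." button only where advanced_editable is true.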

  • Question: Do we package each schema (and any transforms targeting that schema) as an individual file, or could one file contain an ordered sequence of schemas and transforms?