Linguistic Markup Language 0.01

Open specification for translational filtration description to be incorporated into the Scribeling user-generated translation engine

Note - this specification is very young. The central linguistic description system itself is the part most subject to change at this point, but the way in which LML functions may also change as necessary.

This markup specification describes how the syntax expressed in a "master description document" (essentially a document that attempts to index all possible forms of grammatical expression, at their most fundamental grammatical functions) relates to the actual objects of speech and relevant translations present in the Scribeling database.

The translation engine first searches for translations of larger phrases present in the input. It then parses the rest of the input looking for all feasible individual syntactical objects, observes their grammatical functions, and attempts to determine an appropriate grammatical conversion based on user submissions, or a direct rearrangement and translation, if possible.

The "master description" document is essential, as it provides the context for a mediating "interlingua." The rules specified in accordance with the syntax described in the "master description" document are created entirely by the users of the database. The functions of individual words or phrases submitted to the database are specified in similar syntax, although only as a description of their grammatical functions (in case a given expression functions as more than one individual grammatical object).

As a convention in this document, all LML code is enclosed within <tt></tt> tags.

Structural description metatags:

All of the following metatags serve to specify the structure of Linguistic Markup Language itself.

<part class="...">

As a rule, the <part> tag serves only to specify the fundamental units of speech - the idea being that all grammatical functionality can be narrowed down to a manageable number of discrete objects and reconstructed according to user-implemented rules.

The values of the class attributes of <part> tags are the only source of tags in LML besides those specified in this document. If <part class="noun"> is declared in the master description document, for instance, the tag <noun> must subsequently be interpreted by any LML parser running with that master description document.

<attrib class="...">

Describes an attribute by which a part of speech may be characterized - several of these overlap with, or negate, others.

The values of the class attributes of <attrib> tags act somewhat similarly to those of <part> tags, the distinction being that the class of an <attrib> tag names an attribute of the tag that its containing <part> tag defines. For example:

<part class="noun">
	<attrib class="case">
		<option value="ablative" />
		<option value="genitive" />
		<option value="dative" />
		<option value="nominative" />
		(etc.)
	</attrib>
</part>

This markup in a master description document (MDD) indicates that the object <noun case="genitive"> is acceptable LML syntax in a sentence structure submission for any given language. This is, of course, just an example - in practice these tags are almost always considerably more complex.
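As a rough illustration of how a parser might derive the accepted tags from the example above, here is a minimal Python sketch. It is not part of the specification: the function names, the <mdd> wrapper element, and the decision to handle only <option> children are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The <part> example from above, with the "(etc.)" line left out.
MDD = """
<part class="noun">
    <attrib class="case">
        <option value="ablative" />
        <option value="genitive" />
        <option value="dative" />
        <option value="nominative" />
    </attrib>
</part>
"""

def allowed_options(mdd_xml):
    """Map each part class to {attrib class: set of accepted option values}."""
    # ASSUMPTION: a wrapper root element, so several <part> tags can sit side by side.
    root = ET.fromstring(f"<mdd>{mdd_xml}</mdd>")
    table = {}
    for part in root.findall("part"):
        attribs = {}
        for attrib in part.findall("attrib"):
            attribs[attrib.get("class")] = {
                option.get("value") for option in attrib.findall("option")
            }
        table[part.get("class")] = attribs
    return table

def is_valid(tag_xml, table):
    """Check a single LML tag, e.g. <noun case="genitive" />, against the table."""
    tag = ET.fromstring(tag_xml)
    attribs = table.get(tag.tag)
    if attribs is None:
        return False  # the tag was never declared by a <part>
    return all(
        name in attribs and value in attribs[name]
        for name, value in tag.attrib.items()
    )
```

With this table, <noun case="genitive" /> is accepted, while an undeclared value or an undeclared tag is rejected.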

If an <attrib> tag is specified within an <option> or <boolean> tag, it depends on the enclosing <attrib> tag taking that value, and therefore nested <attrib> tags should not be parsed when the option or boolean that contains them does not apply. It's important to note that in all other instances, <boolean> and <option> tags should be self-contained, as they are shown below.

<clausetype function="noun" />

<clausetype> tags describe the greater purpose of a clause, so that its greater role in the sentence may be analyzed (in effect, each one creates another branch in the parse tree). It should be noted that these are distinct from the <clause> tags, in that the <clausetype> tags serve, in the master description document itself, as the specification for the <clause> tags.

Accepted input types for <attrib> tags:

<integer limit="..." />

Indicates that an integer is expected, with its limit given by the limit attribute. This may not see too many uses. Integers are unsigned and begin at zero.

<option value="..." />

Indicates that the value given for the parent <attrib> tag must match one of the listed <option> tags to be accepted.

<boolean />

Indicates that only boolean values are accepted ("true" or "false", to avoid confusion).

<string limit="..." />

Indicates that a string is expected. Probably not the case very often.
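The four input types could be checked along the following lines. This is a sketch under stated assumptions, not the engine's actual behavior: the text above does not say what limit means exactly, so the code assumes an inclusive upper bound for <integer> and a maximum length for <string>. The attribute classes used in the usage note (person, plural) are hypothetical.

```python
import xml.etree.ElementTree as ET

def accepts(attrib_xml, value):
    """Return True if `value` is acceptable for the given <attrib> definition."""
    attrib = ET.fromstring(attrib_xml)
    for child in attrib:
        limit = child.get("limit")
        if child.tag == "option":
            if value == child.get("value"):
                return True
        elif child.tag == "boolean":
            # Only the literal strings "true" and "false" are accepted.
            if value in ("true", "false"):
                return True
        elif child.tag == "integer":
            # Unsigned integers starting at zero, so a sign is rejected.
            if value.isascii() and value.isdigit():
                # ASSUMPTION: limit is an inclusive upper bound.
                if limit is None or int(value) <= int(limit):
                    return True
        elif child.tag == "string":
            # ASSUMPTION: limit is a maximum length in characters.
            if limit is None or len(value) <= int(limit):
                return True
    return False
```

For instance, accepts('<attrib class="person"><integer limit="3" /></attrib>', "2") is true under these assumptions, while the value "4" or "-1" is rejected.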

Syntactical object specification tags

An individual expression should be described in terms of individual part of speech tags, such as <noun> or <preposition>. The expression, if it functions as an amalgam of more than one part of speech, should be described in a <clause> tag that documents its function as a whole.

Sentence structure specification tags

Header tags

Taken as a whole, these tags describe features or attributes of a language other than the allowed syntax and grammatical forms themselves. Conjugations and declensions should also be described with these tags, if such modifications of words can be described in a systematic manner - in other cases, the words should exist as independent entries in the database.

<language isocode="xxx" />
<author name="xxx" /> (optional)
<timestamp value="YYYY-MM-DD HH:MM:SS" /> (optional, though recommended)

The <language> tag, simply enough, specifies what language the supplied sentence structures are appropriate for. Be very careful what you write here, more careful than usual. The isocode attribute accepts ISO 639-3 codes for languages, which should always be three lowercase letters, although the parser itself is case insensitive (really, for most purposes).

The <author> tag is just for credit - it is, of course, optional.

The <timestamp> tag is just that - a timestamp. If Scribeling sticks around for a while, these might come in handy to document changes in syntax, though otherwise it's only for the record.

<sep domain="clause/sentence/quote" value="n" clauseclass="dependent/independent" clausefunction="noun/adverb/adjective" />

The <sep> tag is used to classify characters that separate clauses, sentences, or quotes (words that separate clauses are defined just as ordinary words are). The value attribute indicates which character the separator consists of (using a standardized syntax for individual characters, probably part of regex). The clauseclass attribute, if specified, mandates what type of clause the character must indicate the presence of, and applies only if the domain is set to clause. The clausefunction attribute works similarly, although it instead describes the grammatical function the clause must fill (if the attribute is specified). If the value attribute is defined as value="", the engine will assume that no character exists to make such a distinction, and will parse through the document trying to identify specific clauses, if such distinctions may still be made.
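A minimal sketch of how separators for one domain might be applied is shown below. It assumes each value is a single literal character (the text above leaves the exact character syntax open), represents each <sep> as a hypothetical (domain, value) pair, and ignores clauseclass and clausefunction entirely.

```python
import re

def split_by_domain(text, seps, domain):
    """Split `text` on every separator declared for `domain`.

    `seps` is a list of (domain, value) pairs, one per <sep> tag.
    """
    # An empty value means no separator exists for that distinction.
    chars = [value for sep_domain, value in seps if sep_domain == domain and value]
    if not chars:
        return [text]
    # ASSUMPTION: values are literal characters, so they are escaped for the regex.
    pattern = "|".join(re.escape(char) for char in chars)
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]

seps = [("sentence", "."), ("clause", ",")]
sentences = split_by_domain("Veni, vidi, vici. Alea iacta est.", seps, "sentence")
clauses = split_by_domain(sentences[0], seps, "clause")
```

When no separator is declared, the text is returned whole, leaving clause identification to later parsing as described above.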

<conjugation num="x" rootsubstr="(int),(int)">
  <instance reqmodal="true/false" prefix="(string)" suffix="(string)" tense="pluperfect/perfect/present..." person="y" plural="true/false" rootsubstr="(int),(int)" ...  />
</conjugation>

Each set of <conjugation> tags describes a standardized way in which a certain type of verb may be conjugated in the language (in the likely case that there's more than one way for this to happen). The num attribute simply provides a method for remotely invoking multiple systems of conjugation, while the placement attribute indicates where the modifications to the infinitive should be made. This system is very immature, and will certainly require more specification.

Again, the more attributes that are specified, the more specific the rule's applications become - a necessary consequence of such versatility is that too many attributes will create nonsensical rules, so it's important that this system is understood properly before rules can successfully be created.

The reqmodal attribute indicates that the conjugation creates a modal ("could", "might", etc.) as a necessary part of the verb form. The suffix and prefix attributes indicate what prefix or suffix the conjugation adds onto the root. The root is determined by rootsubstr, which is defined as two natural numbers indicating how many characters should be cut off of the infinitive, from the beginning and end, respectively. If rootsubstr is defined at the conjugation level, the cut applies to all conjugations of a verb, while if it's defined in an instance of the conjugation, it applies only within the context of that specific instance. It's also important to note that the rootsubstr attribute can be used to simply erase the root and define an entirely new word, if a conjugation diverges that far from its root.
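The rootsubstr, prefix, and suffix attributes can be read as a simple string operation. The Python sketch below shows one way to apply a single instance to an infinitive. The function name and the Latin examples are illustrative assumptions, and the sketch takes only one rootsubstr value, since the text does not say how a conjugation-level value and an instance-level value combine.

```python
def apply_instance(infinitive, rootsubstr="0,0", prefix="", suffix=""):
    """Cut the infinitive down to its root, then add the prefix and suffix."""
    cut_start, cut_end = (int(number) for number in rootsubstr.split(","))
    # Clamp the end index so that over-long cuts erase the root
    # instead of wrapping around via a negative index.
    end = max(len(infinitive) - cut_end, cut_start)
    root = infinitive[cut_start:end]
    return prefix + root + suffix

# "amare" minus its last two characters is "ama"; adding "t" gives "amat".
third_person = apply_instance("amare", rootsubstr="0,2", suffix="t")

# Cutting all four characters erases the root, so the prefix supplies
# an entirely new word, as described above for divergent conjugations.
irregular = apply_instance("esse", rootsubstr="0,4", prefix="sum")
```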

<declension type="noun/adjective..." num="1">
  <instance prefix="(string)" suffix="(string)" case="dative/genitive/accusative..." plural="true/false" rootsubstr="(int),(int)"></instance>
</declension>

The declension tags function similarly to the conjugation tags in general - the same attribute-narrowing rules apply as usual, although the attributes specified are obviously different from those used for verbs. The rootsubstr, prefix, case and plural attributes act as previously defined.

Sentence description tags

<sentence>
</sentence>

These tags indicate a sentence group. These are the foundation of all sentence structure specifications.

<clause class="dependent/independent" function="noun/adverbial/adjective">
</clause>

<clause> tags indicate that the language at hand accepts just that, a clause, either independent or dependent, composed of the words contained within the <clause> tags. An independent clause functions essentially as a sentence of its own, while a dependent clause functions as specified in the function attribute (e.g., as a noun or modifying adjective). In "I grant my possessions to (those who have seen me through this dilemma)," the parenthesized words function as a noun would, and may thus be treated as one in parsing until the clause is to be deconstructed further.

The remaining tags in LML are specified by the description metatags. They take the form

<a b="x|y" c="z">

where:

'a' is the class of the <part> tag in the specification,

'b' and 'c' are the classes of the <attrib> tags in the specification, contained within the <part class="a"> tag (<attrib> is a child of <part>),

'x', 'y', and 'z' are input values. A value is acceptable if the corresponding <attrib> tag contains an <integer>, <option>, <boolean> or <string> tag that permits it. If there are two values contained within an attribute, separated by the bar character, | , as in b="x|y", then that serves to indicate that in every situation, either the value x or the value y is acceptable. The bar character is chosen over more conventional markup separators like commas and semicolons simply because commas and semicolons may have to be specified within attributes to describe phenomena such as clause separation, while I'm aware of no languages that use a bar - if there are any, they'll have to use a metacharacter to describe it.
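The bar-separated alternatives can be handled with a plain split. The following sketch checks a word's attributes against the attributes of a rule tag. The function name and the dictionary representation are assumptions, and the metacharacter for a literal bar is left out because the text does not define one.

```python
def tag_matches(rule_attrs, word_attrs):
    """Return True if the word satisfies every attribute named by the rule.

    A rule value such as "dative|ablative" accepts either alternative.
    Attributes the rule does not mention are left unconstrained.
    """
    return all(
        word_attrs.get(name) in spec.split("|")
        for name, spec in rule_attrs.items()
    )

rule = {"case": "dative|ablative", "plural": "true"}
word = {"case": "ablative", "plural": "true", "gender": "feminine"}
matched = tag_matches(rule, word)
```

Leaving unmentioned attributes unconstrained mirrors the narrowing behavior described elsewhere in this document: each attribute a rule adds makes it apply to fewer words.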

So, for example:

<sentence>
	<clause class="independent">
		<subject>
			<clause class="dependent" function="noun" />
		</subject>
		<predicate>
			<verb valency="2" mood="active" />
			<clause class="dependent" function="noun" />
		</predicate>
	</clause>
</sentence>

This markup will match any sentence that complies with the form (subject) (active verb) (object), where the subject and object themselves may be further divided into dependent clauses or individual syntactical objects.

These tags:

<conversion precedence="x">
  <start>
    ...
  </start> 
  <end>
    ...
  </end>
</conversion>

serve as the translational underbelly for the site, indicating conversions that must be made to accommodate foreign syntax. The more attributes that are specified for individual parts of speech inside both the <start>...</start> and <end>...</end> tags (which are themselves both enclosed within the <conversion>...</conversion> tags), the narrower the specification becomes, so the specification of attributes here is, as you may predict, quite intricate. The most important strength of this system is that it is not only correctable by the users of the site in case an error is discovered, but the rules, through the precedence attribute of the <conversion> tags, may be favored over each other, allowing for more specific exceptions to be made to broader rules - and rules simply stop applying if a sufficient number of users have expressed that they are incorrect. A very simple example of a use of this is converting a Latin noun in the dative to the English equivalent of the noun and the preposition "to."
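To make the role of precedence concrete, here is a small sketch of rule selection using the Latin dative example. Several things are assumed rather than taken from the specification: that a larger precedence number wins, that the contents of <start> can be reduced to a dictionary of required attributes, and that the contents of <end> can be reduced to a hypothetical output template.

```python
from dataclasses import dataclass

@dataclass
class Conversion:
    precedence: int  # ASSUMPTION: a larger number is favored
    start: dict      # attributes the source word must have
    template: str    # hypothetical stand-in for the <end> contents

def pick(conversions, word_attrs):
    """Return the applicable conversion with the highest precedence, if any."""
    candidates = [
        conversion for conversion in conversions
        if all(word_attrs.get(name) == value
               for name, value in conversion.start.items())
    ]
    return max(candidates, key=lambda conversion: conversion.precedence, default=None)

rules = [
    Conversion(1, {"part": "noun"}, "{word}"),                       # broad rule
    Conversion(2, {"part": "noun", "case": "dative"}, "to {word}"),  # exception
]
chosen = pick(rules, {"part": "noun", "case": "dative"})
english = chosen.template.format(word="Rome")
```

Here the dative-specific rule overrides the broad noun rule, which is the kind of exception the precedence attribute is meant to allow.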
