Wilhelm Streitberg: "Gotisch-Griechisch-Deutsches Wörterbuch" (1910)
-------------------------------------------------------------------------------
Preliminary XML encoding of the headwords and German glosses
-------------------------------------------------------------------------------

Every element and attribute is described in the RELAX NG schema file. Below is 
a prose description of the schema, along with a few general remarks. Note that 
attributes in the http://www.wulfila.be/archive/2006/DB/dictionary namespace 
(using prefix db:*) are not part of the dictionary text and may be ignored 
(they contain data from the original database).
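
As an illustration of "may be ignored", here is a minimal Python sketch that 
drops every attribute in the db namespace before further processing. The 
sample fragment and the attribute name db:id are invented for the example; 
only the namespace URI comes from the description above.

```python
import xml.etree.ElementTree as ET

DB_NS = "http://www.wulfila.be/archive/2006/DB/dictionary"

# Invented fragment; the attribute name db:id is hypothetical.
SAMPLE = (
    '<entry xmlns:db="' + DB_NS + '" db:id="42">'
    '<form db:id="43">aba</form>'
    '</entry>'
)

def strip_db_attributes(root):
    """Remove all attributes that live in the database namespace."""
    prefix = "{" + DB_NS + "}"
    for elem in root.iter():
        # Copy the keys first, since we delete while looping.
        for name in list(elem.attrib):
            if name.startswith(prefix):
                del elem.attrib[name]
    return root

entry = strip_db_attributes(ET.fromstring(SAMPLE))
remaining = [name for elem in entry.iter() for name in elem.attrib]
print(remaining)
```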

(1) The schema borrows a lot from TEI. Some elements are equivalent or even 
have the same name. Future conversion to TEI (either as the primary file or as 
a secondary, derived format) should be fairly easy. But contrary to TEI, the 
preliminary schema is quite restrictive and specific. It describes the current 
(incomplete) transcription of Streitberg's dictionary -- nothing more, nothing 
less.

(2) The document root <dictionary> has a <meta> block with a few notes on the 
file, followed by 3500+ <entry> blocks. Each <entry> corresponds exactly to one 
entry (boldfaced lemma) in Streitberg's dictionary. As far as I know, the list 
is complete. Entries are uniquely identified by a key that is composed of the 
page number and position on the page. This has proven quite helpful when 
checking entries in the book.

(3) An <entry> has roughly the same structure as a structured entry in a TEI 
encoded dictionary. It has information on form and grammar, followed by either 
a <sense> element or one or more <hom> elements (homographs), optionally 
followed by related sub-entries or notes. The embedded homographs and related 
entries have a similar content model, but don't allow further nesting of 
entry-level elements.

(4) Whitespace between top-level 'children' of an <entry> is not important, but 
all elements are in text order, which means that the original text can normally 
be restored by putting a space after each top-level element. Within top-level 
elements, whitespace is important (xml:space="preserve"), with the exception of 
the <hom> and <related> elements.
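
The "space after each top-level element" rule can be sketched in Python as 
follows. The sample entry is invented, and the sketch assumes that <form>, 
<grammar> and <sense> are direct children of <entry>; it does not treat the 
insignificant whitespace inside <hom> and <related> specially.

```python
import xml.etree.ElementTree as ET

# Invented entry; element names follow the prose description.
SAMPLE = """<entry>
  <form>aba</form>
  <grammar>Mn</grammar>
  <sense>Mann, Ehemann</sense>
</entry>"""

def entry_text(entry):
    """Concatenate the text of each top-level child, separated by a space.

    Whitespace between the children (entry.text and the tails) is skipped,
    because it is not significant at this level.
    """
    parts = ["".join(child.itertext()) for child in entry]
    return " ".join(parts)

entry = ET.fromstring(SAMPLE)
restored = entry_text(entry)
print(restored)
```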

(5) The <sense> element contains mixed content (text intermingled with a small 
set of inline elements), optionally followed by a sequence of one or more 
embedded <sense> elements. These may be nested indefinitely. It mirrors the 
structure of complex entries having a general translation followed by numbered 
subdivisions. Streitberg has up to four nested levels (labeled A. I. 1. a).
As a side note, this content model cannot be expressed in a DTD or W3C XML 
Schema, where an element's content is either fully mixed or not mixed at all.
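
A small Python sketch of how such a nested structure might be traversed. The 
sample content is invented, and the inline element name <span> is an 
assumption; the n attribute is the label attribute mentioned in (8).

```python
import xml.etree.ElementTree as ET

# Invented content: a general translation followed by labeled subdivisions.
SAMPLE = (
    '<sense>gehen, <span>wandeln</span> usw.'
    '<sense n="A">eigentlich<sense n="I">räumlich</sense></sense>'
    '<sense n="B">übertragen</sense>'
    '</sense>'
)

def own_text(sense):
    """Text of the mixed-content part, up to the first embedded <sense>."""
    parts = [sense.text or ""]
    for child in sense:
        if child.tag == "sense":
            break
        parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts).strip()

def walk(sense, path=()):
    """Yield (label path, own text) for a sense and all nested senses."""
    label = sense.get("n")
    if label is not None:
        path = path + (label,)
    yield path, own_text(sense)
    for child in sense.findall("sense"):
        yield from walk(child, path)

senses = list(walk(ET.fromstring(SAMPLE)))
for path, text in senses:
    print(".".join(path) or "(top)", text)
```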

(6) Inline elements cover quotations, translations, 'mentioned' foreign words, 
cross-references, manuscript sigla, grammatical labels and unspecified 
highlighting. There is also a semantically neutral span-element that allows 
grouping of text and inline elements. Language is always indicated by an 
xml:lang language identifier; the default language is German.
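
The effective language of a piece of text can be resolved by inheritance, 
falling back to German. The Python sketch below assumes the default is 
written as "de"; the element name <mentioned> and the sample content are 
invented for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented content; <mentioned> is a hypothetical inline element name.
SAMPLE = '<sense>Brot, <mentioned xml:lang="grc">ἄρτος</mentioned> usw.</sense>'

def languages(elem, inherited="de"):
    """Yield (language, text) pairs; xml:lang is inherited by descendants."""
    lang = elem.get(XML_LANG, inherited)
    if elem.text and elem.text.strip():
        yield lang, elem.text.strip()
    for child in elem:
        yield from languages(child, lang)
        # The tail belongs to the parent, so it gets the parent's language.
        if child.tail and child.tail.strip():
            yield lang, child.tail.strip()

tagged = list(languages(ET.fromstring(SAMPLE)))
print(tagged)
```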

(7) Every text node is emendable: typographical errors can be marked (and 
optionally corrected) with a <sic> element. So far, I have found 8 obvious 
errors. More generally, a <note> element is available to signal encoding 
problems, include comments on the content of the entry, etc. 

(8) The goal is an editorial view of the data (as described in the TEI 
Guidelines). It should be trivial to construct the original text from the XML 
encoding, ideally by simply stripping all tags from the file. In the current 
encoding, the following items are not expressed as textual content:
   - Parentheses are implied around the content of <formInfo> and <grammarInfo>.
   - Numbered <hom> and <sense> blocks store their label in an attribute (e.g. 
<sense n="1">).
In both cases, the text can simply be restored at the stylesheet-level. More 
problematic are:
   - Separators between <sense> elements, usually one or two em dashes. I'm 
not sure if Streitberg has used them consistently. They could be entered as 
text nodes in between sense-blocks, but this complicates the editing process 
(densely tagged mixed content blocks quickly become unwieldy). Or is this a 
case where exact reproduction of the typographic form is arguably less 
important...? To be discussed.
   - Final devoicing ('Auslautverhärtung') is signaled in a boolean attribute, 
but not yet properly encoded. Streitberg generally puts the devoiced letter in 
italics and appends the stem consonant in parentheses, e.g. "hlaiFs (b)". He is 
not always consistent, so this has to be tagged manually (all 128 occurrences 
can be selected with the XPath expression //form[@auslautverhärtung]).
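
As a rough Python sketch of the first two items and of the XPath selection: 
the function below restores the implied parentheses and the labels stored in 
the n attribute, and the last lines select the forms flagged for final 
devoicing. The sample entries are invented. It is also an assumption that 
<formInfo> and <grammarInfo> are direct children of <entry>, that a label is 
printed as "1." followed by a space, and that the boolean attribute is 
written as "true".

```python
import xml.etree.ElementTree as ET

# Invented entries; only the element and attribute names come from the text.
SAMPLE = """<dictionary>
<entry><form auslautverhärtung="true">hlaifs</form><grammar>Ma</grammar><sense>Brot</sense></entry>
<entry><form>aba</form><grammar>Mn</grammar><grammarInfo>unregelmäßig</grammarInfo><sense>Mann<sense n="1">Ehemann</sense><sense n="2">Mensch</sense></sense></entry>
</dictionary>"""

PARENTHESIZED = {"formInfo", "grammarInfo"}
LABELED = {"sense", "hom"}

def render(elem):
    """Return the text of elem with implied parentheses and labels restored."""
    out = []
    label = elem.get("n")
    if label is not None and elem.tag in LABELED:
        out.append(f" {label}. ")  # assumed label format
    out.append(elem.text or "")
    for child in elem:
        out.append(render(child))
        out.append(child.tail or "")
    text = "".join(out)
    if elem.tag in PARENTHESIZED:
        text = f"({text})"
    return text

def render_entry(entry):
    """Join the rendered top-level children and normalize whitespace."""
    joined = " ".join(render(child) for child in entry)
    return " ".join(joined.split())

root = ET.fromstring(SAMPLE)
rendered = [render_entry(entry) for entry in root.findall("entry")]
print(rendered)

# ElementTree supports the attribute-presence predicate used in the text.
devoiced = [form.text for form in root.findall(".//form[@auslautverhärtung]")]
print(devoiced)
```

In the actual project this step would more likely live in a stylesheet, as 
noted above; the Python version only illustrates that no information is lost.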

(9) Here are a few known problems with the current encoding. They will be fixed 
later on.
   - In related subentries, the grammatical label usually (but not always) 
comes before the actual form (e.g. "Adv. abraba"). This is not reflected in the 
current encoding. Simply switching the elements won't do, because there are 
complications (e.g. "Komp. airis Adv.").
   - About 20-30 entries have multiple forms or alternative reconstructions. 
This does not fit very well into the current structure. An obvious solution 
would be to make the entire entry mixed content, allowing <form>, <grammar>, 
etc. as inline elements, but having a few fixed fields really simplifies 
editing and querying. Currently, problematic alternatives are stored in 
<formAlternative> or <grammarAlternative> elements. They are somewhat similar 
in function to a TEI <dictScrap> container.
   - The distinction between quotation/mentioned and translation/mentioned is 
not clear-cut. It is a matter of interpretation that could be avoided by simply 
tagging foreign words and phrases according to language. The schema 
documentation has more information on this.

(10) Finally, a few conventions are not (yet) expressed in XML:
   - Double underscore indicates ellipsis, as in "entweder __ oder".
   - Tilde replaces dash whenever it is used as an iconic reference to the 
headword, as in "~ sik" (lemma "skaman") or "sa ~da" (lemma: 
"ufarhiminakunds"). This corresponds to TEI <oRef/> and <oVar/>.
   - The language identifier is missing from the headword. Since every <form> 
is Gothic, the identifier is quite redundant -- at least at this stage of 
encoding. It could easily be supplied later on, or defined in a schema language 
that allows default attribute values (RELAX NG deliberately does not include 
constructs that modify or augment the XML infoset). 
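
To illustrate the tilde convention, here is a minimal Python sketch. It 
expands only a tilde that stands alone as a word (a full reference to the 
headword, as in "~ sik") and merely reports a tilde that is attached to other 
letters (as in "sa ~da"), since those cases need knowledge of the stem. The 
function names are invented for this example.

```python
import re

# A tilde with whitespace (or the string boundary) on both sides.
FULL_REF = re.compile(r"(?<!\S)~(?!\S)")
# A tilde glued to letters on either side.
PARTIAL_REF = re.compile(r"~\w+|\w+~")

def expand_full_refs(text, lemma):
    """Replace each standalone tilde with the headword."""
    return FULL_REF.sub(lemma, text)

def partial_refs(text):
    """List the tokens in which the tilde stands for part of the headword."""
    return PARTIAL_REF.findall(text)

full = expand_full_refs("~ sik", "skaman")
untouched = expand_full_refs("sa ~da", "ufarhiminakunds")
partial = partial_refs("sa ~da")
print(full, "|", untouched, "|", partial)
```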

TDH 2006-11-30
http://www.wulfila.be