Wilhelm Streitberg: "Gotisch-Griechisch-Deutsches Wörterbuch" (1910)
-------------------------------------------------------------------------------
Preliminary XML encoding of the headwords and German glosses
-------------------------------------------------------------------------------

Every element and attribute is described in the RELAX NG schema file. Below is
a prose description of the schema, along with a few general remarks.

Note that attributes in the http://www.wulfila.be/archive/2006/DB/dictionary
namespace (using prefix db:*) are not part of the dictionary text and may be
ignored (they contain data from the original database).

(1) The schema borrows a lot from TEI. Some elements are equivalent or even
    have the same name. Future conversion to TEI (either as the primary file
    or as a secondary, derived format) should be fairly easy. But contrary to
    TEI, the preliminary schema is quite restrictive and specific. It
    describes the current (incomplete) transcription of Streitberg's
    dictionary, nothing more and nothing less.

(2) The document root has a block with a few notes on the file, followed by
    3500+ entry blocks. Each corresponds to exactly one entry (boldfaced
    lemma) in Streitberg's dictionary. As far as I know, the list is
    complete. Entries are uniquely identified by a key that is composed of
    the page number and the position on the page. This has proven quite
    helpful when checking entries in the book.

(3) An entry block has roughly the same structure as a structured entry in a
    TEI encoded dictionary. It has information on form and grammar, followed
    by either a single sense element or one or more homograph elements,
    optionally followed by related sub-entries or notes. The embedded
    homographs and related entries have a similar content model, but don't
    allow further nesting of entry-level elements.

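The introductory remark about db:* attributes can be illustrated with a small
Python sketch that removes them before further processing. Only the namespace
URI comes from this file; the element names, the "key" attribute and the
sample values are hypothetical placeholders, since the real names are defined
in the RELAX NG schema.

```python
import xml.etree.ElementTree as ET

# Namespace URI as given in this README.
DB_NS = "http://www.wulfila.be/archive/2006/DB/dictionary"


def strip_db_attributes(root):
    """Delete every attribute in the db: namespace (in place).

    Returns the number of attributes that were removed.
    """
    prefix = "{" + DB_NS + "}"  # ElementTree stores qualified names as {uri}local
    removed = 0
    for elem in root.iter():
        for name in [a for a in elem.attrib if a.startswith(prefix)]:
            del elem.attrib[name]
            removed += 1
    return removed


# Hypothetical sample: element names, "key" and all values are placeholders.
SAMPLE = (
    '<dictionary xmlns:db="http://www.wulfila.be/archive/2006/DB/dictionary">'
    '<entry key="p001-01" db:id="42"><form db:source="x">lemma</form></entry>'
    '</dictionary>'
)

root = ET.fromstring(SAMPLE)
removed = strip_db_attributes(root)
```

Attributes outside the db: namespace are left untouched, so the sketch does
not alter the dictionary text itself.
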
(4) Whitespace between top-level 'children' of an entry is not important, but
    all elements are in text order, which means that the original text can
    normally be restored by putting a space after each top-level element.
    Within top-level elements, whitespace is important (xml:space="preserve"),
    with the exception of two elements.

(5) The sense element contains mixed content (text intermingled with a small
    set of inline elements), optionally followed by a sequence of one or more
    embedded sense elements. These may be nested indefinitely. This mirrors
    the structure of complex entries that have a general translation followed
    by numbered subdivisions. Streitberg has up to four nested levels
    (labeled A. I. 1. a). As a side note, this content model cannot be
    expressed in a DTD or W3C schema, where elements are either fully mixed
    or not mixed at all.

(6) Inline elements cover quotations, translations, 'mentioned' foreign
    words, cross-references, manuscript sigla, grammatical labels and
    unspecified highlighting. There is also a semantically neutral span
    element that allows grouping of text and inline elements. Language is
    always indicated by an xml:lang language identifier; the default language
    is German.

(7) Every text node is emendable: typographical errors can be marked (and
    optionally corrected) with a dedicated element. So far, I have found 8
    obvious errors. More generally, a separate element is available to signal
    encoding problems, include comments on the content of the entry, etc.

(8) The goal is an editorial view of the data (as described in the TEI
    Guidelines). It should be trivial to construct the original text from the
    XML encoding, ideally by simply stripping all tags from the file. In the
    current encoding, the following items are not expressed as textual
    content:

    - Parentheses are implied around the content of two elements.
    - Numbered blocks store their label in an attribute.

    In both cases, the text can simply be restored at the stylesheet level.

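The restoration rules from (4) and (8) can be sketched in Python as follows.
This is a minimal illustration under stated assumptions, not the actual
stylesheet: the element names (entry, form, gram, sense), the label attribute
"n" and the sample content are all hypothetical placeholders, and the choice
to separate a label from its text with a single space is also an assumption.

```python
import xml.etree.ElementTree as ET

# Assumptions: the real names are defined in the RELAX NG schema.
PARENTHESIZED = {"gram"}  # hypothetical element whose parentheses are implied
LABEL_ATTR = "n"          # hypothetical attribute holding a block's label


def flatten(elem):
    """Return the text of elem with all tags stripped.

    Implied parentheses and attribute-stored labels are put back in.
    """
    parts = [elem.text or ""]
    for child in elem:
        parts.append(flatten(child))
        parts.append(child.tail or "")  # whitespace inside is significant
    text = "".join(parts)
    if elem.tag in PARENTHESIZED:
        text = "(" + text + ")"
    label = elem.get(LABEL_ATTR)
    if label:
        text = label + " " + text
    return text


def restore_entry(entry):
    """Join the top-level children of an entry with single spaces.

    Whitespace between top-level children is ignored, as described in (4).
    """
    return " ".join(flatten(child) for child in entry)


# Placeholder content only; this is not text from the dictionary.
SAMPLE = (
    "<entry>"
    "<form>lemma</form>\n  "
    "<gram>label</gram>\n  "
    '<sense>gloss <sense n="1.">sub-gloss</sense></sense>'
    "</entry>"
)

restored = restore_entry(ET.fromstring(SAMPLE))
```

With the placeholder sample, the label "1." and the parentheses around the
grammatical label reappear in the flattened string even though neither is
stored as text in the XML.
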
    More problematic are:

    - Separators between sense elements, usually one or two m-dash(es). I'm
      not sure if Streitberg has used them consistently. They could be
      entered as text nodes in between sense blocks, but this complicates the
      editing process (densely tagged mixed content blocks quickly become
      unwieldy). Or is this a case where exact reproduction of the
      typographic form is arguably less important? To be discussed.
    - Final devoicing ('Auslautverhärtung') is signaled in a boolean
      attribute, but not yet properly encoded. Streitberg generally puts the
      devoiced letter in italics and appends the stem consonant in
      parentheses, e.g. "hlaiFs (b)". He is not always consistent, so this
      has to be tagged manually (all 128 occurrences can be selected with the
      XPath expression //form[@auslautverhärtung]).

(9) Here are a few known problems with the current encoding. They will be
    fixed later on.

    - In related subentries, the grammatical label usually (but not always)
      comes before the actual form (e.g. "Adv. abraba"). This is not
      reflected in the current encoding. Simply switching the elements won't
      do: there are complications (e.g. "Komp. airis Adv.").
    - About 20-30 entries have multiple forms or alternative
      reconstructions. This does not fit very well into the current
      structure. An obvious solution would be to make the entire entry mixed
      content, allowing the form and grammar elements, etc. as inline
      elements, but having a few fixed fields really simplifies editing and
      querying. Currently, problematic alternatives are stored in two
      dedicated elements. They are somewhat similar in function to a TEI
      container element.
    - The distinction between quotation/mentioned and translation/mentioned
      is not clear. It is a matter of interpretation that could be avoided
      by simply tagging foreign words and phrases according to language. The
      schema documentation has more information on this.

(10) Finally, a few conventions are not (yet) expressed in XML:

    - Double underscore indicates ellipsis, as in "entweder __ oder".
    - Tilde replaces dash whenever it is used as an iconic reference to the
      headword, as in "~ sik" (lemma: "skaman") or "sa ~da" (lemma:
      "ufarhiminakunds"). This corresponds to the TEI elements that refer
      back to the headword.
    - The language identifier is missing from the headword. Since every
      headword is Gothic, the identifier is quite redundant, at least at
      this stage of encoding. It could easily be supplied later on, or
      defined in a schema language that allows default attribute values
      (RELAX NG deliberately does not include constructs that modify or
      augment the XML infoset).

TDH 2006-11-30
http://www.wulfila.be