Information Modeling with XML

XML allows us to model information systems in a natural and intuitive way. This is because XML allows us to express information in ways that better match the way we do business. We now have an information-modeling mechanism that allows us to characterize what we want to do, rather than how we have to do it. XML simply does a much better job of reflecting the way the real world operates than the data-modeling mechanisms that preceded it. XML brings a number of powerful capabilities to information modeling:

*  Heterogeneity: Where each "record" can contain different data fields. The real world is not neatly organized into tables, rows, and columns. There is great advantage in being able to express information, as it exists, without restrictions.

*  Extensibility: Where new types of data can be added at will and don't need to be determined in advance. This allows us to embrace, rather than avoid, change.

*  Flexibility: Where data fields can vary in size and configuration from instance to instance. XML imposes no restrictions on data; each data element can be as long or as short as necessary.

How XML Expresses Information

XML expresses information using four basic components: tags, attributes, data elements, and hierarchy. Each of these components serves a unique purpose; each represents a different "dimension" of information. In order to illustrate these basic components, we will use a simple XML fragment from an application dealing with readings from colorimeters (devices that measure colors using tri-stimulus readings).

In XML, a data element equates to "data" as we have traditionally thought of it. In Listing-1, the data elements are the values that appear between the opening and closing tags. If we simply extract the data elements, we get "0, 255, 255", which is meaningless unless you know what the data definitions are. XML adds context to data, thereby giving it meaning, by adding tags (the names in angle brackets). Tags describe what data elements are. Attributes (the name/value pairs inside an opening tag) tell us something about data elements or how to interpret them. Colorimeters can represent RGB tri-stimulus values in a variety of resolutions. If the reading had been taken with a resolution of 16 bits, for example, values of "0, 255, 255" would represent a very dark cyan instead of pure cyan. So, we need the "resolution=8" attribute to correctly interpret the RGB reading values in Listing-1.

Listing-1 Simple XML Fragment

<colorimeter_reading>
<RGB resolution="8">
<red> 0 </red>
<green> 255 </green>
<blue> 255 </blue>
</RGB>
</colorimeter_reading>

Now we have data (data elements), we know what they are (tags), and we know how to interpret them (attributes). The final step is to determine how to string it all together, and that is where hierarchy comes in. So far, we have represented three dimensions of information explicitly. The last dimension, how everything relates, is implied spatially. This means that much of what we need to know is contained in how we order the components of XML information.

In order to give data meaning, a complete context must be provided, not just the most immediate tag or attribute. For example, if we simply say "red=0", it will not mean much because we have not provided an adequate context. If we include all tags in the hierarchy leading up to the reading of "0", we achieve a more complete context: "<colorimeter_reading><RGB><red> 0". Although we now understand what the data element represents and what its value is, some ambiguity as to how to interpret the value still remains. The attribute "resolution=8" belongs to the tag "<RGB>". Because "<RGB>" is a part of our context, any attribute belonging to it (or to any other tag in our context, for that matter) applies. Now we know how to interpret the value of the data element as well. Related information is represented in the hierarchy as siblings at various levels; as a result, hierarchy tells us how data elements are related to each other.
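The four components can be pulled apart with a few lines of Python. The following is a minimal sketch, not part of the original article: it uses the standard library's xml.etree.ElementTree, quotes the attribute value so that a standard parser accepts the fragment, and introduces two hypothetical helpers, describe and normalized. The normalization assumes that full scale is 2**bits - 1, which is an assumption about the device rather than something the fragment states.

```python
import xml.etree.ElementTree as ET

# Listing-1, with the attribute value quoted so a standard parser accepts it.
FRAGMENT = """
<colorimeter_reading>
  <RGB resolution="8">
    <red> 0 </red>
    <green> 255 </green>
    <blue> 255 </blue>
  </RGB>
</colorimeter_reading>
"""

def describe(element, tags=(), attributes=None):
    """Yield (tag path, attributes in scope, data element) for every leaf."""
    tags = tags + (element.tag,)
    # Attributes on any enclosing tag stay in scope for everything below it.
    attributes = {**(attributes or {}), **element.attrib}
    children = list(element)
    if not children:
        yield tags, attributes, (element.text or "").strip()
    for child in children:
        yield from describe(child, tags, attributes)

def normalized(value, resolution_bits):
    """Scale a raw reading to 0..1, assuming full scale is 2**bits - 1."""
    return int(value) / (2 ** resolution_bits - 1)

readings = list(describe(ET.fromstring(FRAGMENT)))
for tags, attributes, value in readings:
    bits = int(attributes["resolution"])
    print("/".join(tags), attributes, value, normalized(value, bits))
```

Under that assumption, a raw value of 255 is full intensity at 8 bits but less than half a percent of full intensity at 16 bits, which is why the data element cannot be interpreted without the attribute in its context.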

Patterns in XML

In order to effectively model information using XML, we must learn how to identify the natural patterns inherent to it. First, we must determine whether we have used XML elements properly. To do this we will analyze the XML fragment shown in Listing-2.

Listing-2 Example XML Fragment

<colorimeter_reading>
<device> DeveloperIQ Magazine </device>
<patch> cyan </patch>
<RGB resolution="8">
<red> 0 </red>
<green> 255 </green>
<blue> 255 </blue>
</RGB>
</colorimeter_reading>

We examine each data element and ask the following question:

*   Is this data, or is it actually metadata (information about another data element)?

We examine every attribute and ask the following questions:

*   Does the attribute tell us something about or describe how to interpret, use, or present data elements?

*    Is the attribute truly metadata, and not actually a data element?

*   Does it apply to all data elements in its scope?

We examine every tag and ask the following question:

*   Does this tag help describe what all data elements in its scope are?

We examine the groupings we have created (the sibling relationships) and ask:

*  Are all members of the group related in a way the parent nodes describe?

*  Is the relationship between siblings unambiguous?

If the answer to any of the preceding questions is "no," then we need to cast the offending components differently.

After ensuring that information has been expressed using the components of XML appropriately, we examine how everything has been stitched together. To do this we create an information context list from the XML fragment. This is done by simply taking each data element and writing down every tag and attribute leading up to it. The resulting lines give us a flattened view of the information items contained in the XML fragment. A context list for the example XML fragment in Listing-2 would look like the one shown in Listing-3.

Listing-3 Context List for Example XML Fragment

<colorimeter_reading><device> DeveloperIQ Magazine
<colorimeter_reading><patch> cyan
<colorimeter_reading><RGB resolution=8><red> 0
<colorimeter_reading><RGB resolution=8><green> 255
<colorimeter_reading><RGB resolution=8><blue> 255

If we convert these lines to what they mean in English, we can see that each information item, and its context, makes sense and is contextually complete:

1. This colorimeter reading is from a DeveloperIQ Magazine device.

2. This colorimeter reading is for a patch called cyan.

3. This colorimeter reading is RGB-red and has an 8-bit value of 0.

4. This colorimeter reading is RGB-green and has an 8-bit value of 255.

5. This colorimeter reading is RGB-blue and has an 8-bit value of 255.
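Building a context list is mechanical, so it can be automated. The sketch below is one possible way to do it in Python with the standard xml.etree.ElementTree module; the helper names open_tag and context_list are hypothetical, and the attribute is quoted in the input only so that the parser accepts it. The output uses the unquoted notation of the context list for Listing-2.

```python
import xml.etree.ElementTree as ET

# Listing-2, with the attribute value quoted for the parser.
LISTING_2 = """
<colorimeter_reading>
  <device> DeveloperIQ Magazine </device>
  <patch> cyan </patch>
  <RGB resolution="8">
    <red> 0 </red>
    <green> 255 </green>
    <blue> 255 </blue>
  </RGB>
</colorimeter_reading>
"""

def open_tag(element):
    """Render an opening tag in the notation used by the context lists."""
    attrs = "".join(f" {name}={value}" for name, value in element.attrib.items())
    return f"<{element.tag}{attrs}>"

def context_list(element, prefix=""):
    """Return one line per data element: every tag and attribute leading to it."""
    prefix += open_tag(element)
    children = list(element)
    if not children:
        return [f"{prefix} {(element.text or '').strip()}"]
    lines = []
    for child in children:
        lines.extend(context_list(child, prefix))
    return lines

lines = context_list(ET.fromstring(LISTING_2))
print("\n".join(lines))
```

Each printed line corresponds to one of the five information items, so reading the lines aloud is a quick check that every data element carries a complete context.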

Next we examine the groupings implied by the tag hierarchy:

* "<colorimeter_reading>" contains "<device>", "<patch>", and "<RGB>" (plus its children).

*  "<RGB>" contains "<red>", "<green>", and "<blue>".

"<colorimeter_reading>" represents the root tag, so everything else is obviously related to it. The only other implied grouping falls under "<RGB>". These are the actual readings, and the only entries that are, so they are logically related in an unambiguous way.

Finally, we examine the scope for each attribute:

  *  "resolution=8" has the items "<red>", "<green>", and "<blue>" in its scope.

"resolution=8" logically applies to every item in its scope and none of the items not in its scope, so it has been appropriately applied.

A self-constructing XML information system (like NeoCore XMS) uses the structure of XML, and the natural patterns contained in it, to automatically determine what to index. Simple queries are serviced by direct lookups. Complex queries are serviced by a combination of direct lookups, convergences against selected parent nodes, and targeted substring searches. With NeoCore XMS no database design or indexing instructions are necessary; the behavior of XMS is driven entirely by the structure of the XML documents posted to it. Index entries are determined by inference and are built based on the natural patterns contained in XML documents. NeoCore XMS creates index entries according to the following rules:

*  An index entry is created for each data element.

*  An index entry is created for each complete tag context for each data element (that is, the concatenation of every tag leading up to the data element).

* An index entry is created for the concatenation of the two preceding items (tag context plus data element).

For the XML fragment in Listing-2, the following items would be added to the pattern indices:

1. DeveloperIQ Magazine
2. cyan
3. 0
4. 255
5. 255
6. <colorimeter_reading><device>
7. <colorimeter_reading><patch>
8. <colorimeter_reading><RGB><red>
9. <colorimeter_reading><RGB><green>
10. <colorimeter_reading><RGB><blue>
11. <colorimeter_reading><RGB resolution=8>
12. <colorimeter_reading><device> DeveloperIQ Magazine
13. <colorimeter_reading><patch> cyan
14. <colorimeter_reading><RGB resolution=8><red> 0
15. <colorimeter_reading><RGB resolution=8><green> 255
16. <colorimeter_reading><RGB resolution=8><blue> 255

Entries 1-5 are data only, entries 6-11 are tag context only, and entries 12-16 are both.
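To make the three rules concrete, here is a minimal Python sketch that applies them literally to the Listing-2 fragment. It is an illustration only and is not NeoCore XMS code or its API: the function names leaves and index_entries are hypothetical, and attribute handling is deliberately left out, so the sketch produces the data entries, the tag contexts for the five data elements, and the combined entries without the resolution attribute.

```python
import xml.etree.ElementTree as ET

LISTING_2 = """
<colorimeter_reading>
  <device> DeveloperIQ Magazine </device>
  <patch> cyan </patch>
  <RGB resolution="8">
    <red> 0 </red>
    <green> 255 </green>
    <blue> 255 </blue>
  </RGB>
</colorimeter_reading>
"""

def leaves(element, path=""):
    """Yield (tag context, data element) for every leaf, ignoring attributes."""
    path += f"<{element.tag}>"
    children = list(element)
    if not children:
        yield path, (element.text or "").strip()
    for child in children:
        yield from leaves(child, path)

def index_entries(xml_text):
    """Apply the three rules: data, tag context, and context plus data."""
    data, contexts, combined = [], [], []
    for path, value in leaves(ET.fromstring(xml_text)):
        data.append(value)                   # rule 1: the data element
        contexts.append(path)                # rule 2: the complete tag context
        combined.append(f"{path} {value}")   # rule 3: both concatenated
    return data, contexts, combined

data, contexts, combined = index_entries(LISTING_2)
for entry in data + contexts + combined:
    print(entry)
```

The point of the sketch is that nothing outside the document is consulted: every entry is inferred from the tag hierarchy and the data elements themselves.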

At this point it is important to consider how performance will be affected by the structure of the XML document. Because the inherent patterns inferred from the XML itself can be used to automatically build a database, the degree to which those patterns match likely queries will have a big effect on performance, especially in data-centric applications where single data elements or subdocuments need to be accessed without having to process an entire XML document.

Common XML Information-Modeling Pitfalls

We could easily arrange the XML fragments from the previous section in other, perfectly acceptable ways. There are many more ways to arrange the information that are syntactically correct but unfortunate. Common mistakes made when creating XML documents include:

*   Inadequate context describing what a data element is (incomplete use of tags)
*  Inadequate instructions on how to interpret data elements (incomplete use of attributes)
*  Use of attributes as data elements (improper use of attributes)
*  Use of data elements as metadata instead of using tags (indirection through use of name/value pairings)
*  Unnecessary, unrelated, or redundant tags (poor hierarchy construction)
*  Attributes that have nothing to do with data element interpretation (poor hierarchy construction or misuse of attributes)

These mistakes sap XML of its power and usefulness. Time devoted to good information modeling will be paid back many times over as other components of applications are developed. We can put a great deal of intelligence into XML documents, which means we do not have to put that intelligence, over and over again, into every system component.

Because XML is very flexible, it is easy to abuse. Sometimes the best way to illustrate how to do something is by counterexample. Much, if not most, of the XML we have seen is not well designed. It is not difficult to design XML with good grammar and good style, and doing so will save a lot of time and effort in the long run, to say nothing of how it will affect performance. The following sections contain a few examples of poorly constructed XML fragments.

Attributes Used as Data Elements

This may be the most common misuse of XML. Attributes should be used to describe how to interpret data elements, or to describe something about them; in other words, attributes are a form of metadata. In practice they are often used to contain data elements, and that runs counter to the purpose of attributes.

Listing-4 contains no data elements from readings at all; the attributes apply to nothing. Attributes that apply to nothing, obviously, describe how to interpret nothing.

Listing-4 XML with No Data Elements

<colorimeter_reading>
<device> DeveloperIQ Magazine </device>
<patch> cyan </patch>
<RGB resolution="8" red="0" green="255" blue="255" />
</colorimeter_reading>

If we examine each attribute, especially the data portion (the part to the right of the equal sign), we can determine whether they actually represent data, or metadata:

*  resolution=8: This is a true attribute because the value "8" does not mean anything by itself; rather it is an instruction for interpreting data elements, and therefore it is metadata.

*  red=0: This is clearly data because it is a reading from the colorimeter; moreover, in order to be correctly interpreted, it requires the "resolution=8" attribute. This attribute does not tell us how to interpret data; it is data. Consequently it should be recast as a tag/data element pair.

*  green=255, blue=255: The previous analysis of "red=0" applies.
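The recasting described above can be sketched in Python. The function recast_data_attributes is a hypothetical helper, and the set of attribute names that count as metadata is supplied by the caller, because deciding what is data and what is metadata is a modeling judgment that code cannot make for us. Attribute values are quoted in the input so that the standard parser accepts it.

```python
import xml.etree.ElementTree as ET

# Listing-4, with attribute values quoted for the parser.
LISTING_4 = """
<colorimeter_reading>
  <device> DeveloperIQ Magazine </device>
  <patch> cyan </patch>
  <RGB resolution="8" red="0" green="255" blue="255" />
</colorimeter_reading>
"""

def recast_data_attributes(element, metadata_names):
    """Turn every attribute not listed in metadata_names into a child element."""
    for name in list(element.attrib):
        if name not in metadata_names:
            child = ET.SubElement(element, name)
            child.text = element.attrib.pop(name)
    for child in list(element):
        recast_data_attributes(child, metadata_names)

root = ET.fromstring(LISTING_4)
# Assumption for this example: only "resolution" is true metadata.
recast_data_attributes(root, metadata_names={"resolution"})
rgb = root.find("RGB")
print(ET.tostring(rgb, encoding="unicode"))
```

After the call, the "<RGB>" element keeps only the resolution attribute, and the three readings appear as tag/data element pairs beneath it, as they do in Listing-2.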

Data Elements Used as Metadata

This is often a result of emulating extensibility in a relational database. Instead of creating columns accounting for different fields, a database designer will create two columns: one for field type and one for field contents. This basically amounts to representing metadata in data element fields and is shown in Listing-5.

Listing-5 XML Data Elements Used as Metadata

<colorimeter_reading>
<device> DeveloperIQ Magazine </device>
<patch> cyan </patch>
<RGB>
<item>
<band> red </band>
<value> 0 </value>
</item>
<item>
<band> green </band>
<value> 255 </value>
</item>
<item>
<band> blue </band>
<value> 255 </value>
</item>
</RGB>
</colorimeter_reading>

If we decompose this document into an information context, we get Listing-6.

Listing-6 Information Context for Listing-5

<colorimeter_reading><device> DeveloperIQ Magazine
<colorimeter_reading><patch> cyan
<colorimeter_reading><RGB><item><band> red
<colorimeter_reading><RGB><item><value> 0
<colorimeter_reading><RGB><item><band> green
<colorimeter_reading><RGB><item><value> 255
<colorimeter_reading><RGB><item><band> blue
<colorimeter_reading><RGB><item><value> 255

Listing-6 translates to approximately the following in English:

1. This colorimeter reading is from a DeveloperIQ Magazine device.
2. This colorimeter reading is for a patch called cyan.
3. This colorimeter reading item is RGB band red.
4. This colorimeter reading item is RGB and has a value of 0.
5. This colorimeter reading item is RGB band green.
6. This colorimeter reading item is RGB and has a value of 255.
7. This colorimeter reading item is RGB band blue.
8. This colorimeter reading item is RGB and has a value of 255.

The last six lines are contextually weak. Lines 3, 5, and 7 don't contain any readings; they contain metadata about the lines following them. Lines 4, 6, and 8 don't adequately describe the readings they contain; they are informationally incomplete and ambiguous. In fact, lines 6 and 8 are exactly the same, even though the readings they represent have different meanings.
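This kind of ambiguity can be detected mechanically: if two lines of a context list are identical, the tag context is not telling the readings apart. The following Python sketch, with a hypothetical context_list helper that ignores attributes, flattens the name/value form of the reading and reports any repeated line.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The reading expressed with <band>/<value> pairs inside <item> groups.
LISTING_5 = """
<colorimeter_reading>
  <device> DeveloperIQ Magazine </device>
  <patch> cyan </patch>
  <RGB>
    <item><band> red </band><value> 0 </value></item>
    <item><band> green </band><value> 255 </value></item>
    <item><band> blue </band><value> 255 </value></item>
  </RGB>
</colorimeter_reading>
"""

def context_list(element, prefix=""):
    """Return one line per data element with every tag leading up to it."""
    prefix += f"<{element.tag}>"
    children = list(element)
    if not children:
        return [f"{prefix} {(element.text or '').strip()}"]
    return [line for child in children for line in context_list(child, prefix)]

lines = context_list(ET.fromstring(LISTING_5))
# A line that occurs more than once cannot be distinguished by its context.
ambiguous = [line for line, count in Counter(lines).items() if count > 1]
print(ambiguous)
```

A repeated line does not prove that the data is wrong, but it does show that the band a value belongs to can only be recovered from its position in the document, not from its tags.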

Inadequate Use of Tags

This is often a result of emulating extensibility in a relational database. Instead of building separate tables for different data structures, a database designer will create one table for many different data structures by using name/value pairs. This represents unnecessary indirection of metadata and an inappropriate grouping of data elements, to the detriment of performance (because what should be direct queries become joins) and reliability (because grouping is ambiguous). This is shown in Listing-7.

Listing-7 Use of Name/Value Pairs

<colorimeter_reading>
<device> DeveloperIQ Magazine </device>
<patch> cyan </patch>
<mode> RGB </mode>
<band> red </band>
<value> 0 </value>
<band> green </band>
<value> 255 </value>
<band> blue </band>
<value> 255 </value>
</colorimeter_reading>

If we decompose this document into an information context, we get Listing-8.

Listing-8 Information Context for Listing-7

<colorimeter_reading><device> DeveloperIQ Magazine
<colorimeter_reading><patch> cyan
<colorimeter_reading><mode> RGB
<colorimeter_reading><band> red
<colorimeter_reading><value> 0
<colorimeter_reading><band> green
<colorimeter_reading><value> 255
<colorimeter_reading><band> blue
<colorimeter_reading><value> 255

Translated to English, Listing-8 becomes:

1. This colorimeter reading is from a DeveloperIQ Magazine device.
2. This colorimeter reading is for a patch called cyan.
3. This colorimeter reading is in RGB mode.
4. This colorimeter reading is red.
5. This colorimeter reading has a value of 0.
6. This colorimeter reading is green.
7. This colorimeter reading has a value of 255.
8. This colorimeter reading is blue.
9. This colorimeter reading has a value of 255.

The last six lines are contextually weak, and line 3 represents nothing but context. Lines 3, 4, 6, and 8 do not contain any readings; they contain metadata about the lines following them. Lines 5, 7, and 9 don't describe the readings they contain at all; they are informationally incomplete and ambiguous. In fact, lines 7 and 9 are exactly the same and contained within the same group, even though the readings they represent have different meanings and should belong to different groups. We could add tags to encapsulate reading elements into groups so that the bands and reading values are unambiguously related to each other. But first, we should determine whether each data element truly represents data. If we examine the data elements, we can determine whether they really represent data or metadata, and whether they have an adequate context:

*  DeveloperIQ Magazine: This is clearly data.
*  cyan: This is also clearly data.
*   RGB: Although this could be considered data in the academic sense, it is not of much value by itself. Furthermore, it is needed to understand the meaning of data elements following it.
*  red, green, and blue: These are also data in the academic sense only. They lack adequate context as well. For example, a colorimeter reading in the red band could mean a number of different things.
*  0, 255, and 255: These are the actual colorimeter readings; they are clearly data. They are, however, nearly devoid of critical context, namely the color mode and the color band they represent.
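Following this analysis, the mode and band names belong in tags rather than in data elements. The sketch below shows one way to recast the flat name/value form in Python. The helper recast_name_value_pairs is hypothetical, and it has to assume that each "<value>" belongs to the "<band>" immediately before it and to the most recent "<mode>"; that reliance on document order is exactly the ambiguity the flat form creates.

```python
import xml.etree.ElementTree as ET

# The reading expressed as flat name/value pairs.
LISTING_7 = """
<colorimeter_reading>
  <device> DeveloperIQ Magazine </device>
  <patch> cyan </patch>
  <mode> RGB </mode>
  <band> red </band>
  <value> 0 </value>
  <band> green </band>
  <value> 255 </value>
  <band> blue </band>
  <value> 255 </value>
</colorimeter_reading>
"""

def recast_name_value_pairs(flat_root):
    """Rebuild the reading so that mode and band names become tags."""
    result = ET.Element(flat_root.tag)
    group = None  # element for the current mode, for example <RGB>
    band = None   # most recent band name, waiting for its value
    for child in flat_root:
        text = (child.text or "").strip()
        if child.tag == "mode":
            group = ET.SubElement(result, text)
        elif child.tag == "band":
            band = text
        elif child.tag == "value":
            # Assumes a <mode> and a <band> have already been seen.
            ET.SubElement(group, band).text = text
        else:
            ET.SubElement(result, child.tag).text = text
    return result

recast = recast_name_value_pairs(ET.fromstring(LISTING_7))
print(ET.tostring(recast, encoding="unicode"))
```

The result has the same shape as Listing-2, except that no resolution attribute can be recovered because the flat form never recorded one.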

A Very Simple Way to Design XML

One great advantage of XML information modeling over traditional data modeling is that it serves as a much more intuitive analog of reality. Because of this, a very simple method for designing XML documents produces surprisingly good results. In fact, it will produce better results than many, if not most, "industry standard" XML schemas. Forget that you will be using a computer to manage information; in fact, forget almost everything you know about computers. Instead, imagine that you will be managing your information manually, and design simple forms accordingly. First, make the preprinted parts of the forms into tags; second, make the parts you fill in into data elements; and third, change things like units into attributes. Obviously, doing so will not produce optimal results, but it will serve quite well, and it's a good way to start.

Let's look at a simple example: a telephone number directory. We will start with a manual entry form.

If we convert this directly into an XML document (spaces become underscores), we get Listing-9.

Listing-9 Telephone Directory Listing as XML

<Telephone_Directory_Listing>
<Name> Mohan K. Das</Name>
<Address> 1st main</Address>
<City> Bangalore </City>
<State> KA </State>
<Zip_Code> 560021 </Zip_Code>
<Telephone> (080) 22245678 </Telephone>
</Telephone_Directory_Listing>
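The direct conversion from form to XML is simple enough to sketch in Python. In the example below, form_to_xml is a hypothetical helper, and the form labels are assumptions reconstructed from the tag names in Listing-9. Preprinted labels become tags (with spaces replaced by underscores), and the filled-in parts become data elements.

```python
import xml.etree.ElementTree as ET

def form_to_xml(title, fields):
    """Turn a form title and its label/entry pairs into an XML element."""
    root = ET.Element(title.replace(" ", "_"))
    for label, entry in fields.items():
        # The preprinted label becomes the tag; the entry becomes the data element.
        ET.SubElement(root, label.replace(" ", "_")).text = entry
    return root

listing = form_to_xml(
    "Telephone Directory Listing",
    {
        "Name": "Mohan K. Das",
        "Address": "1st main",
        "City": "Bangalore",
        "State": "KA",
        "Zip Code": "560021",
        "Telephone": "(080) 22245678",
    },
)
print(ET.tostring(listing, encoding="unicode"))
```

This covers only the first two steps of the method; deciding which parts of a form (such as units) should become attributes still requires a person to look at the form.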

Now we will make some small changes. First, we will separate the name into first name, middle initial, and last name, and group them together. We will also group the address fields, and separate the telephone number into an area code and a number in a group of their own. Separating fields, such as the name, makes it possible to use the components as individual query terms that will be serviced with direct lookups instead of requiring partial content scans within fields. This significantly improves performance in cases where a query, for example, might be for "Mohan Das" instead of "Mohan K. Das". The resulting XML is shown in Listing-10.

Listing-10 Telephone Directory Listing in XML after Changes

<Telephone_Directory_Listing>
<Name>
<First> Mohan </First>
<MI> K. </MI>
<Last> Das </Last>
</Name>
<Address>
<Street> 1st main</Street>
<City> Bangalore </City>
<State> KA </State>
<Zip_Code> 560021 </Zip_Code>
</Address>
<Telephone>
<Area_Code> 080 </Area_Code>
<Number> 22245678 </Number>
</Telephone>
</Telephone_Directory_Listing>
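The benefit of separating the name can be seen in a small Python comparison. This is only a sketch using the standard library's ElementTree on in-memory fragments, not a database query; the function names and the reduced fragments (which keep just the name fields) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

FLAT = "<Telephone_Directory_Listing><Name> Mohan K. Das </Name></Telephone_Directory_Listing>"

GROUPED = """
<Telephone_Directory_Listing>
  <Name>
    <First> Mohan </First>
    <MI> K. </MI>
    <Last> Das </Last>
  </Name>
</Telephone_Directory_Listing>
"""

def matches_flat(listing, full_name):
    """Exact comparison against the single, unseparated name field."""
    return listing.findtext("Name", "").strip() == full_name

def matches_grouped(listing, first, last):
    """Exact comparison against the separated first and last name fields."""
    name = listing.find("Name")
    return (name.findtext("First", "").strip() == first
            and name.findtext("Last", "").strip() == last)

flat_hit = matches_flat(ET.fromstring(FLAT), "Mohan Das")
grouped_hit = matches_grouped(ET.fromstring(GROUPED), "Mohan", "Das")
print(flat_hit, grouped_hit)
```

With the flat form, a query for "Mohan Das" fails an exact comparison and would need a scan inside the field; with the grouped form, each component can be matched exactly on its own.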

The XML document in Listing-10 would serve as a good basis for a telephone directory. When thinking about additional information that may have to be added to some listings (additional address lines, additional telephone numbers, etc.), it is important to remember that XML is extensible: a field has to be added only where it is necessary, not globally to all listings.

Many businesses are basically forms driven. For example, clinical trials in the pharmaceutical industry start with forms that have to be approved before the computer systems managing the information can be designed. Because forms can be converted into XML so easily, it is now possible to build systems that are driven primarily by business objectives in intuitive ways, rather than by abstract computing paradigms.

For more information on XML, you can reach the author at rka1965@gmail.com







