What are the best practices for designing XML schemas?

Question

As an amateur software developer (I'm still in academia), I have written several schemas for XML documents. I regularly run into design flubs that produce ugly XML documents, because I'm not quite sure what the semantics of XML are.

My assumptions:

<property> value </property>

property = value

 <property attribute="attval"> value </property>

A property with a special descriptor - attribute.

 <parent> <child> value </child> </parent>

The parent has a characteristic "child" that has a value of "value."

 <tag />

A "tag" is a flag or it is directly converted to text. I am not sure about that.

 <parent> <child /> </parent>

"child" describes "parent". "child" is a flag or boolean. I'm not so sure about that either.

Ambiguity arises if you want to represent something like Cartesian coordinates:

 <coordinate x="0" y="1" />
 <coordinate> 0,1 </coordinate>
 <coordinate> <x> 0 </x> <y> 1 </y> </coordinate>

Which one is the most correct? I lean toward the third, based on my current understanding of XML schema design, but I really don't know.

What are some resources that concisely describe how to design XML schemas effectively?

+42

design xml xsd

evizaer Oct 23 '08 at 21:12

14 answers

One general (but important!) recommendation is to never store multiple logical pieces of data in a single node (be it a text node or an attribute node). Otherwise, you end up needing your own parsing logic on top of the XML parsing logic that you normally get for free from your framework.

So, in your coordinates example, <coordinate x="0" y="1" /> and <coordinate> <x>0</x> <y>1</y> </coordinate> both seem reasonable to me.

But <coordinate> 0,1 </coordinate> is not a good design, because it stores two logical pieces of data (the X-coordinate and the Y-coordinate) in one XML node, forcing the consumer to parse the data outside of their XML parser. Although splitting a string on a comma is quite simple, there are still ambiguities, for example what happens if there is an extra comma at the end.
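As a rough illustration of the difference, here is a minimal Python sketch using the standard-library xml.etree.ElementTree module. The three helper names are made up for this example, and the decision to reject a trailing comma is an assumption; a real format would have to document whatever rule it picks.

```python
import xml.etree.ElementTree as ET

def coord_from_attributes(xml_text):
    # One value per attribute: the XML parser does all of the splitting.
    node = ET.fromstring(xml_text)
    return float(node.get("x")), float(node.get("y"))

def coord_from_children(xml_text):
    # One value per child element: again no extra parsing is needed.
    node = ET.fromstring(xml_text)
    return float(node.findtext("x")), float(node.findtext("y"))

def coord_from_packed_text(xml_text):
    # Two values packed into one text node: the consumer needs a second,
    # hand-written parsing step and must decide how to treat odd input.
    node = ET.fromstring(xml_text)
    parts = node.text.split(",")
    if len(parts) != 2:
        # Assumed policy for this sketch: anything but exactly "x,y" is an error.
        raise ValueError(f"expected 'x,y', got {node.text!r}")
    return float(parts[0]), float(parts[1])
```

The first two functions contain no format-specific logic at all, while the third has to define and enforce its own mini-grammar.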

+22

C. Dragon 76 Oct 23 '08 at 21:52

I agree with cdragon's recommendation to avoid option #2. The choice between #1 and #3 is largely a matter of style. I like to use attributes for what I consider attributes of an entity, and elements for what I consider to be data. It is sometimes difficult to classify. Neither option is "wrong," however.

And while we are on the topic of schema design, I'll add my two cents regarding my preferred level of (maximum) reuse (of both elements and types), which can also facilitate external "logical" referencing of these entities in, say, a data dictionary stored in a database.

Note that while the Garden of Eden schema pattern offers maximum reuse, it also involves the most work. At the bottom of this post, I have provided links to the other patterns described in the blog series.

• The Garden of Eden Approach http://blogs.msdn.com/skaufman/archive/2005/05/10/416269.aspx

It uses a modular approach, defining all elements globally and, like the Venetian Blind approach, declaring all type definitions globally as well. Each element is globally defined as an immediate child of the schema node, and its type attribute can be set to one of the named complex types.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema targetNamespace="TargetNamespace" xmlns:TN="TargetNamespace"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    elementFormDefault="qualified" attributeFormDefault="unqualified">
  <!-- All elements and types are global; references into the target namespace use the TN prefix. -->
  <xs:element name="BookInformation" type="TN:BookInformationType"/>
  <xs:complexType name="BookInformationType">
    <xs:sequence>
      <xs:element ref="TN:Title"/>
      <xs:element ref="TN:ISBN"/>
      <xs:element ref="TN:Publisher"/>
      <xs:element ref="TN:PeopleInvolved" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="PeopleInvolvedType">
    <xs:sequence>
      <xs:element name="Author"/>
    </xs:sequence>
  </xs:complexType>
  <xs:element name="Title"/>
  <xs:element name="ISBN"/>
  <xs:element name="Publisher"/>
  <xs:element name="PeopleInvolved" type="TN:PeopleInvolvedType"/>
</xs:schema>

The advantage of this approach is that the schemas are reusable. Because both elements and types are defined globally, both are available for reuse. This approach offers the maximum amount of reusable content. The disadvantage is that the schema is verbose. This is an appropriate design when you are creating general libraries in which you cannot afford to make any assumptions about the scope of the schema's elements and types or about their use in other schemas, particularly with regard to extensibility and modularity.

Since every distinct type and element has a single global definition, these canonical particles/components can be related one-to-one to identifiers in a database. Although maintaining the associations between the textual XSD particles/components and the database may at first glance seem like a tedious manual chore, SQL Server 2005 can in fact generate canonical schema component identifiers through the statement

 CREATE XML SCHEMA COLLECTION

http://technet.microsoft.com/en-us/library/ms179457.aspx

Conversely, to build a schema from canonical particles, SQL Server 2005 provides

 SELECT xml_schema_namespace function

http://technet.microsoft.com/en-us/library/ms191170.aspx

ca·non·i·cal: related to mathematics (equations, coordinates, etc.), "in the simplest or standard form" http://dictionary.reference.com/browse/canonical

Other schema patterns that are easier to construct, but less reusable and more "denormalized/redundant", include:

• The Russian Doll Approach http://blogs.msdn.com/skaufman/archive/2005/04/21/410486.aspx

The schema has a single global element, the root element. All other elements and types are nested progressively deeper, which gives the pattern its name: each type fits inside the one above it. Because the elements in this design are declared locally, they cannot be reused through import or include statements.

• The Salami Slice Approach http://blogs.msdn.com/skaufman/archive/2005/04/25/411809.aspx

All elements are defined globally, but type definitions are defined locally, so other schemas can reuse the elements. With this approach, a global element with its locally defined type provides a complete description of the element's content. Each such informational "slice" is declared individually and then aggregated back together, and slices can also be pieced together to build other schemas.

• The Venetian Blind Approach http://blogs.msdn.com/skaufman/archive/2005/04/29/413491.aspx

Similar to the Russian Doll approach in that both use a single global element. The Venetian Blind approach is modular: it names and defines all type definitions globally (as opposed to the Salami Slice approach, which declares elements globally and types locally). Each globally defined type describes an individual "slat" and can be reused by other components. In addition, all locally declared elements can be namespace-qualified or namespace-unqualified (the slats can be "open" or "closed"), depending on the elementFormDefault attribute setting at the top of the schema.
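The four patterns differ only in which declarations sit directly under the schema root, so the distinction can be sketched in a few lines of Python with the standard-library xml.etree.ElementTree module. The function name classify_schema and the rule "more than one global element means globally declared elements" are assumptions made for this illustration; it is a rough heuristic, not a definitive test.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

def classify_schema(xsd_text):
    root = ET.fromstring(xsd_text)
    # Global declarations are the immediate children of the schema root.
    global_elements = [c for c in root if c.tag == XS + "element"]
    global_types = [c for c in root
                    if c.tag in (XS + "complexType", XS + "simpleType")]
    many_elements = len(global_elements) > 1   # heuristic, see lead-in
    has_types = len(global_types) > 0
    if many_elements and has_types:
        return "Garden of Eden"    # global elements, global types
    if many_elements:
        return "Salami Slice"      # global elements, local types
    if has_types:
        return "Venetian Blind"    # one global element, global types
    return "Russian Doll"          # one global element, everything nested
```

Applied to the BookInformation schema shown earlier, which declares five global elements and two global types, this heuristic would report "Garden of Eden".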

+11

6eorge Jetson Oct 24 '08 at 2:36

XML is somewhat subjective in terms of design. I don't think there are hard rules on how elements and attributes should be laid out, but I tend to use elements to represent "things" and attributes to represent a single attribute or property of a thing.

For the coordinates example, either form would be perfectly acceptable, but my tendency would be to go with <coordinate x="" y=""/> because it is somewhat shorter and makes the document more readable if you have a lot of them.

The most important thing, however, is the schema namespace. Make sure that (a) you have one, and (b) it includes a version, so that you can make changes in the future and release a new version. Versions can be either dates or numbers, for example:

 http://company.com/2008/12/something/somethingelse/
 urn:company-com:2008-12:something:somethingelse
 http://company.com/v1/something/somethingelse/
 urn:company-com:v1:something:somethingelse
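A consumer can then branch on the version before it processes a document. The following Python sketch (standard library only) shows one way this might look. The helper names and the regular expression are assumptions made for this illustration, and the pattern only recognizes the two version styles listed above.

```python
import re
import xml.etree.ElementTree as ET

def root_namespace(xml_text):
    # ElementTree renders a namespaced tag as "{namespace-uri}localname".
    root = ET.fromstring(xml_text)
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None  # the document declares no namespace on its root

def namespace_version(namespace):
    # Assumed convention: a "v1"-style or "2008/12" / "2008-12"-style segment.
    match = re.search(
        r"(?<![A-Za-z0-9])(v\d+|\d{4}[/-]\d{2})(?![A-Za-z0-9])", namespace)
    return match.group(1) if match else None

doc = '<something xmlns="http://company.com/v1/something/somethingelse/"/>'
version = namespace_version(root_namespace(doc))
```

A document written against a later schema would carry a different namespace, so old and new documents can be told apart without guessing from their content.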

+3

Greg Beech Oct 23 '08 at 21:26

I don't know of a good learning resource on how to design XML document models (schemas are just a formal way of specifying document models).

In my opinion, one important insight about XML is that it is not a language: it is a syntax. Each document model is a separate language.

Different cultures use XML in their own ways. Even within the W3C specs, you can smell Lisp in the dash-separated names of XSLT and Java in the camelCaseNames of XML Schema. Similarly, different application domains call for different XML idioms.

Narrative document models such as HTML or DocBook tend to put typeset text in text nodes and metadata in element names and attributes.

More object-oriented document models, such as SVG, make little or no use of text nodes and instead use only elements and attributes.

My personal rules of thumb for document model design are something like this:

If it is the kind of free-form tag soup that requires mixed content, use HTML and DocBook as inspiration. The other rules are only relevant otherwise.
If the value is composite or hierarchical, use elements. XML data should require no further parsing, except for established idioms such as IDREFS, which are simple space-separated sequences.
If the value may need to occur multiple times, use elements.
If the value may need to be refined further or enriched later, use elements.
If the value is clearly atomic (boolean, number, date, identifier, simple label) and can occur at most once, use an attribute.

Another way to put it:

If it is narrative, it is not object-oriented.
If it is object-oriented, model objects as elements and atomic properties as attributes.

EDIT: Some people seem to want to abandon attributes completely. There is nothing wrong with that, but I don't like it because it bloats documents and makes them unnecessarily hard to read and write by hand.

+3

ddaa Oct 23 '08 at 21:46

When developing an XML-based format, it is often useful to think about what you are representing. Try mocking up some XML data that fits your intended purpose. Once you have something you are satisfied with and that meets your requirements, develop a schema to validate it.

When specifying a format, I try to use elements to store data content and attributes to apply characteristics to data, such as identifier, name, type, or some other metadata about the data that the element contains.

In this regard, the XML representation for coordinates can be:

 <coordinate type="cartesian"> <ordinate name="x">0</ordinate> <ordinate name="y">1</ordinate> </coordinate>

This supports different coordinate systems. If you knew that the coordinates would always be Cartesian, a better implementation might be:

 <coordinate> <x>0</x> <y>1</y> </coordinate>

Of course, the latter can lead to a more verbose schema, since each element type would have to be declared (though I would hope that a complex type would be defined to do the actual hard work for those elements).
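The trade-off between the two representations also shows up on the consumer side. Here is a small Python sketch using the standard-library xml.etree.ElementTree module; the function names read_generic and read_cartesian are invented for this example.

```python
import xml.etree.ElementTree as ET

def read_generic(xml_text):
    # <coordinate type="..."><ordinate name="...">value</ordinate>...</coordinate>
    # One reader handles any coordinate system, keyed by the name attribute.
    node = ET.fromstring(xml_text)
    values = {o.get("name"): float(o.text) for o in node.findall("ordinate")}
    return node.get("type"), values

def read_cartesian(xml_text):
    # <coordinate><x>0</x><y>1</y></coordinate>
    # Simpler, but the element names x and y are fixed by the format.
    node = ET.fromstring(xml_text)
    return {"x": float(node.findtext("x")), "y": float(node.findtext("y"))}
```

The generic reader would accept a hypothetical type="polar" coordinate with r and theta ordinates unchanged, whereas the Cartesian reader would need new code for every additional system.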

As in programming, there are often many ways to achieve the same goal, and in many situations there is no right and wrong, only better and worse. It is important to stay consistent and to try to be intuitive, so that when others look at your schema they can understand what you were trying to achieve.

You should always version your schemas and ensure that XML written against your schema indicates which version it targets. If you do not version the XML properly, then making additions to the schema while still supporting XML written against the older schema will be much harder.

+1

Jeff Yates Oct 23 '08 at 21:39

In our Java projects, we often use JAXB to automatically parse XML and transform it into an object structure. I expect other languages have something similar. A suitable generator can automatically create the object structure in your chosen programming language. This greatly simplifies XML processing while still giving you a portable XML representation for communication between systems.

If you use this kind of automatic mapping, you will find that it constrains the schema considerably: <coordinate> <x> 0 </x> <y> 1 </y> </coordinate> is the way to go if you don't want to do special magic in the translation. You end up with a Coordinate class with two fields, x and y, of the appropriate types as specified in the schema.
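To show the shape of such a mapping without a Java toolchain, here is a hand-written Python analogue. It is not JAXB and not generated code: the Coordinate dataclass and its two methods are assumptions written by hand to mimic what a binding generator might produce from the schema.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

@dataclass
class Coordinate:
    x: float
    y: float

    @classmethod
    def from_xml(cls, xml_text):
        node = ET.fromstring(xml_text)
        # Each child element maps straight onto one typed field.
        return cls(x=float(node.findtext("x")), y=float(node.findtext("y")))

    def to_xml(self):
        node = ET.Element("coordinate")
        ET.SubElement(node, "x").text = repr(self.x)
        ET.SubElement(node, "y").text = repr(self.y)
        return ET.tostring(node, encoding="unicode")

point = Coordinate.from_xml("<coordinate> <x> 0 </x> <y> 1 </y> </coordinate>")
```

Because every field corresponds to exactly one element, the mapping in both directions is mechanical, which is what makes it easy to generate.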

+1

Hans-Peter Störr Nov 15 '08 at 17:46

I was tasked with writing a bunch of XML schemas to integrate my company's systems with those of our customers. I developed a dozen schemas over 10 years ago and saw that many of the extensibility features in the specification do not work very well in practice. Before developing new ones, I searched for current best practices (and arrived here!).

Some of the advice above is useful, but I disliked almost all of the links. The best source of design recommendations I found was from Microsoft.

The best reference is XML Schema Design Patterns: Avoiding Complexity. You will find this sound advice there:

it seems that many schema authors would be best served by understanding and using an effective subset of the features provided by W3C XML Schema instead of trying to comprehend all the esoterica and minutiae of the language.

It also provides detailed explanations of the following guidelines:

Why you should use global and local element declarations
Why you should use global and local attribute declarations
Why you should understand how XML namespaces affect W3C XML Schema
Why you should always set elementFormDefault to "qualified"
Why you should use attribute groups
Why you should use model groups
Why you should use the built-in simple types
Why you should use complex types
Why you should not use notation declarations
Why you should use substitution groups carefully
Why you should favor key/keyref/unique over ID/IDREF for identity constraints
Why you should use chameleon schemas carefully
Why you should not use default or fixed values, especially for types of xs:QName
Why you should use restriction and extension of simple types
Why you should use extension of complex types
Why you should use restriction of complex types carefully
Why you should use abstract types carefully
Do use wildcards to provide well-defined points of extensibility
Do not use group or type redefinition

My advice on their advice is that when they say "use carefully", you should just avoid it. My impression is that the Schema specifications were not written by software developers. They tried to use some object-orientation concepts but made a mess of them. Many of the extension mechanisms are useless or extremely verbose. I really don't understand how anyone could come up with restriction of complex types.

Two more interesting articles on this site:

One common piece of advice is to specify your schemas in something other than the official specification language. Relax NG seems to be the most preferred schema language. Unfortunately, you then lose one of the best features, which is standardization.

+1

neves Mar 15 '17 at 18:44

Looking at the relationships in the data you are trying to represent is the best approach I have found.

0

Rob Wells Oct 23 '08 at 21:19

I often run into the same problem, but I think that in practice it does not really matter: XML is just data.

That said, I generally prefer the "if it says something about the node, it is an attribute; otherwise it is a child node" approach.

In your example, I would go for:

 <coordinate> <x>0</x> <y>1</y> </coordinate>

because x and y are properties of the coordinate: they say nothing about the XML itself, only about the object it represents.

0

Kris Oct 23 '08 at 21:28

I think it depends on how simple or complex the structure is.
I would make x and y attributes, unless x and y have details of their own.

You can look at HTML or any other markup that is used to define things (XAML in the case of WPF, MXML in the case of Flash) to understand why something is chosen as an attribute rather than as a child node.

If x and y will not be repeated, they can be attributes.

If a coordinate could have multiple x and y values, note that XML does not allow multiple attributes with the same name on a node. In that case, you will have to use child nodes.
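The duplicate-attribute rule is easy to check with any conforming parser. A minimal Python sketch using the standard-library xml.etree.ElementTree module follows; the helper name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    # ET.fromstring raises ParseError for any well-formedness violation,
    # including two attributes with the same name on one element.
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

repeated_attribute = '<coordinate x="0" x="1" />'
repeated_children = '<coordinate> <x>0</x> <x>1</x> </coordinate>'

# Repeated child elements are fine and can be read back as a list.
xs = [float(e.text) for e in ET.fromstring(repeated_children).findall("x")]
```

Here the first document is rejected by the parser, while the second parses and yields both x values in document order.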

0

shahkalpesh Oct 23 '08 at 21:40

There is nothing wrong with using an element or sub-element for every value that you would like to represent.

The main consideration is that it is sometimes cleaner to use an attribute. Since an element can have only one attribute of a given name, you are stuck with a cardinality of 1:1. If you represent the data as a child element, you can use whatever cardinality you want (or leave it open to extend later).

Rob Wells's answer is right: it depends on the relationships you are trying to model.

Whenever there will clearly never be anything but a 1:1 relationship, an attribute can be cleaner.

0

Bq. Oct 23 '08 at 21:40

Here is a great list of methods for developing XML grammar.

0

Zearin 28 . '09 11:32

0

Caleb 20 . '12 20:19

Dimitre Novatchev · Accepted Answer · 2008-11-18 04:13

See the tutorial:

XML Schemas: Best Practices, by Roger Costello.

I also recommend:

What are the best practices for designing XML schemas?
