Word XML Parsing Technology

At the core of Pagemap is its Word XML parsing engine, which extracts the necessary information from a Word file and stores it in a structured database. Once in a database, the document content is much easier to analyse and search.
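To illustrate why a database makes the content easier to search, here is a minimal sketch in Python using SQLite. The table layout, column names and sample rows are assumptions made up for this example; they are not Pagemap's actual schema or output.

```python
import sqlite3

# Hypothetical schema: one row per paragraph, with its derived number and depth.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paragraphs ("
    " id INTEGER PRIMARY KEY,"
    " number TEXT,"      # derived clause number, e.g. '2.1'
    " level INTEGER,"    # depth in the paragraph hierarchy
    " text TEXT)"
)

# Made-up sample rows standing in for parser output.
rows = [
    ("1", 0, "Definitions"),
    ("1.1", 1, "In this agreement, Supplier means the party named in Schedule 1."),
    ("2", 0, "Payment"),
    ("2.1", 1, "The Customer shall pay the Supplier within 30 days."),
]
conn.executemany(
    "INSERT INTO paragraphs (number, level, text) VALUES (?, ?, ?)", rows
)


def find_clauses(term):
    """Return the numbers of paragraphs whose text contains the term."""
    cur = conn.execute(
        "SELECT number FROM paragraphs WHERE text LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return [row[0] for row in cur]
```

With the content in rows like these, a question such as "which clauses mention the Supplier?" becomes a single query, for example find_clauses("Supplier").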

It took a couple of years to develop and refine the parser to its current capabilities. It parses very large documents in time directly proportional to the document size, so processing effort does not grow exponentially as documents get longer.

A Word docx file is actually a zip archive made up of several XML files, which contain all the document content, style and formatting information. The layout and format of these files are highly complex, and they often contain convoluted links between styles.
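The zip-of-XML structure can be seen with nothing more than a standard library. The sketch below is in Python for illustration only (Pagemap's own parser is a separate Ruby implementation), and the helper names and the tiny in-memory sample package are assumptions for this example; a real docx contains more parts, such as word/numbering.xml and [Content_Types].xml.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the main WordprocessingML parts.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_parts(docx_bytes):
    """Return the names of the parts stored inside a docx package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.namelist()


def paragraph_texts(docx_bytes):
    """Return the plain text of each paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    texts = []
    for para in root.iter(f"{{{W}}}p"):
        # A paragraph's text is split across runs; join every w:t inside it.
        texts.append("".join(t.text or "" for t in para.iter(f"{{{W}}}t")))
    return texts


def make_sample_docx():
    """Build a tiny stand-in package in memory (not a complete Word file)."""
    body = (
        f'<w:document xmlns:w="{W}"><w:body>'
        "<w:p><w:r><w:t>1. Definitions</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Split </w:t></w:r><w:r><w:t>across runs</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", body)
        package.writestr("word/styles.xml", f'<w:styles xmlns:w="{W}"/>')
    return buffer.getvalue()
```

Calling list_parts(make_sample_docx()) shows the separate XML parts, and paragraph_texts shows that even a single sentence may be spread over several runs that have to be stitched back together.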


At this level, Word’s great flexibility and backwards compatibility become the biggest challenge when it comes to extracting the information, especially paragraph hierarchy, which is critical for legal documents.

The task is easier when the document has been written consistently and uses Word styles for headings and numbering. But often, especially when a document has more than one author, it is constructed using several different approaches and techniques to achieve a particular look on the page.
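When headings are applied through styles, a paragraph's depth can in principle be read from the style definitions, but a style may inherit its outline level from another style through a basedOn link. The following Python sketch resolves that chain for a small made-up styles.xml fragment. The function names and sample styles are assumptions for illustration, and it covers only this one case, not the inconsistent documents described above or Pagemap's own approach.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(name):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W}}}{name}"


def outline_levels(styles_xml):
    """Map each style id to its outline level, following basedOn links."""
    root = ET.fromstring(styles_xml)
    ids, own_level, parent = [], {}, {}
    for style in root.iter(q("style")):
        style_id = style.get(q("styleId"))
        ids.append(style_id)
        level = style.find(f"{q('pPr')}/{q('outlineLvl')}")
        if level is not None:
            own_level[style_id] = int(level.get(q("val")))
        based_on = style.find(q("basedOn"))
        if based_on is not None:
            parent[style_id] = based_on.get(q("val"))

    def resolve(style_id):
        seen = set()
        # Walk up the chain until a style sets the level; stop on cycles.
        while style_id is not None and style_id not in seen:
            if style_id in own_level:
                return own_level[style_id]
            seen.add(style_id)
            style_id = parent.get(style_id)
        return None  # no style in the chain defines an outline level

    return {style_id: resolve(style_id) for style_id in ids}


# Made-up fragment: ClauseTitle inherits its level from Heading1.
SAMPLE_STYLES = f"""<w:styles xmlns:w="{W}">
<w:style w:type="paragraph" w:styleId="Heading1">
  <w:pPr><w:outlineLvl w:val="0"/></w:pPr>
</w:style>
<w:style w:type="paragraph" w:styleId="ClauseTitle">
  <w:basedOn w:val="Heading1"/>
</w:style>
<w:style w:type="paragraph" w:styleId="BodyText"/>
</w:styles>"""
```

Here outline_levels(SAMPLE_STYLES) reports level 0 (Word's top outline level, which is counted from zero) for both Heading1 and ClauseTitle, and None for BodyText, which has no outline level anywhere in its chain.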

Word’s automatic numbering can sometimes get muddled up when a document is edited, and the only complete fix is to remove the numbering from the affected text and reapply it. However, not knowing this, users often resort to workarounds, such as applying several numbering restarts or typing the numbers manually. Pagemap’s parser identifies all these cases and derives the correct number for cross-references etc.
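As a rough illustration of what deriving a number involves, the Python sketch below keeps one counter per level, restarts deeper counters whenever a shallower paragraph appears, and accepts a manually typed number as an override. The input format (a list of level and typed-number pairs) and the function name are assumptions for this example; this is a simplified model, not Pagemap's algorithm.

```python
def derive_numbers(paragraphs):
    """Return a dotted number such as '2.1' for each (level, typed) pair.

    level is the 0-based depth; typed is a manually typed integer for that
    level, or None when the number should come from automatic counting.
    """
    counters = []
    numbers = []
    for level, typed in paragraphs:
        # Drop counters deeper than this level: they restart under a new parent.
        del counters[level + 1:]
        # Pad with zeros if the document skips a level (e.g. depth 0 to 2).
        while len(counters) <= level:
            counters.append(0)
        # A typed number overrides the count and later paragraphs continue from it.
        counters[level] = typed if typed is not None else counters[level] + 1
        numbers.append(".".join(str(n) for n in counters))
    return numbers


# Made-up sequence: the sixth paragraph has '5' typed by hand at depth 1.
SAMPLE = [(0, None), (1, None), (1, None), (0, None), (1, None), (1, 5), (1, None)]
```

Running derive_numbers(SAMPLE) gives 1, 1.1, 1.2, 2, 2.1, 2.5, 2.6: the sub-level restarts under clause 2, and the hand-typed 5 carries through to the paragraph after it. Real documents add many more cases, which is where the parser's effort goes.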

Pagemap’s parser is written in Ruby, and is available under licence to other organisations wishing to extract data from Word documents. For information on licensing this technology contact:  [email protected].

For conversion from PDF to MS Word, Pagemap uses Solid Documents, which it believes is the best PDF converter available.