JUMBO Converters

<<toc>>

JUMBO-Converters are modules that transform inputs into outputs, usually 1:1 such as Foo2CML and CML2Foo.

Primary site: https://bitbucket.org/wwmm/jumbo-converters


Running the converters

This page is concerned with the philosophy and design of the converters. If you are just interested in running them, please go to the Tutorials and problems page.

Overview of the parsing philosophy

The approach that has been adopted by the parsers is to break the monolithic text block of the logfile into a series of separate **chunks** that encapsulate a coherent piece of data.

There may be many repeated chunks within a log file. For example, if a chunk is an SCF calculation, for a single-point energy calculation there would just be a single //SCF chunk//, whereas for a geometry optimisation calculation, there would be as many SCF chunks as there were SCF calculations.

Chunks are often nested, so using the geometry optimisation example, a single geometry optimisation step would itself be a chunk, and this in turn would contain (one or more) SCF chunks. There would then be as many geometry optimisation step chunks as there were geometry optimisation steps.
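The nesting described above can be sketched in plain Java. This is only an illustration and not the JUMBO-Converters chunker: the class name ChunkSketch and the marker strings "STEP" and "SCF" are invented for the example, and real log files are split using the templates described later on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of nested chunks; not the JUMBO-Converters API.
class ChunkSketch {
    final String name;
    final List<String> lines = new ArrayList<>();
    final List<ChunkSketch> children = new ArrayList<>();

    ChunkSketch(String name) {
        this.name = name;
    }

    // Splits a log into geometry-step chunks, each holding its own SCF chunks.
    // The marker strings "STEP" and "SCF" are invented for this example.
    static ChunkSketch chunk(List<String> log) {
        ChunkSketch root = new ChunkSketch("log");
        ChunkSketch step = null;
        ChunkSketch scf = null;
        for (String line : log) {
            if (line.startsWith("STEP")) {
                step = new ChunkSketch("step");
                root.children.add(step);
                scf = null;
            } else if (line.startsWith("SCF") && step != null) {
                scf = new ChunkSketch("scf");
                step.children.add(scf);
            }
            // Every line is kept in the innermost open chunk, so no text is lost.
            ChunkSketch target = scf != null ? scf : (step != null ? step : root);
            target.lines.add(line);
        }
        return root;
    }

    public static void main(String[] args) {
        ChunkSketch root = chunk(List.of(
            "header", "STEP 1", "SCF a", "energy -1.0", "STEP 2", "SCF b", "SCF c"));
        System.out.println(root.children.size() + " step chunks");
        System.out.println(root.children.get(1).children.size() + " SCF chunks in step 2");
    }
}
```

A single-point calculation would give one SCF chunk in this picture, while a geometry optimisation gives one step chunk per optimisation step, each with its own SCF chunks.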

For a more detailed explanation, please see the pages on Chunkers and Block, and also the older How_converters_work.

Parsing is currently a multi-stage process. The parser reads the log file and converts it into "raw" XML. This process splits the file up into **modules**, each module corresponding to a **chunk** in the file. Within each module the text is preserved, so there is no loss of data from the log file; only additional structure and annotation have been added.

Each module is then parsed separately. The module may be parsed into a number of sub-modules or have data extracted into **records**.

The process of parsing a module into a record is the only process that removes text from the log file, but this is also the process of marking up data, so again nothing should be lost.

At the end of the parsing process we have a raw XML file that contains all of the information from the original log file, separated up into modules and with known quantities marked up with XML.

The terms in the raw XML are defined in the code-specific dictionary, which describes what each quantity is, its units, etc.

This raw XML is then transformed into CML in a second step, where the quantities in the code-specific dictionary are mapped onto either CML or domain-specific dictionaries, and additional annotations or properties can be added (e.g. bond lengths could be calculated).

Actual implementation

Jumbo converters are written in Java, although the template parsing technology is described entirely in XML, so that once a new parser module has been created, only XML files need to be edited in order to extend and develop the parser.

The reference parser for computational chemistry is the NWChem parser, so any examples will refer to it.

The Java class that controls the two-stage parsing for NWChem is NWChemLog2CompchemConverter.java.

The first stage (controlled by the NWChemLog2XMLConverter.java class) uses the topTemplate.xml file to include the various XML templates that parse the different chunks of the logfile.

The second stage (controlled by the NWChemLogXML2CompchemConverter.java class) uses the transforms in nwchem2compchem.xml to manipulate the raw XML into a convention-compliant form.

See Declarative_parsing_syntax for a complete list of the rules followed by the parsers and their relations to the template XML files.

Parsing a module with a template

The structure of a typical template is shown below with comments to explain the various sections.

Template Attributes

The possible ATTRIBUTES on a template are:

* **id** - this should be a unique identifier. The text that is parsed by the template will be extracted into a cml module with a templateRef (in the cmlx namespace) of the id. In other words the text parsed by the template with id="foo" will end up in a module as shown below:

* **name** - used to give a name to the template; it is currently unused.

* **repeat** - the number of times that a template will be matched within the file. If repeat="1" then the template will only be matched once, regardless of how many times the **pattern** is matched in the file. repeat="*" means the template will be matched as many times as the **pattern** is matched.

* **newline** - the character that is used to indicate a new line in the regular expression used in the pattern or endPattern. The default is the dollar character, i.e. newline="$"

* **pattern**, **pattern2**, **pattern3**… - the regular expression used to trigger this template to start parsing text. The pattern may extend over more than one line if the **newline** character (see above) features in the expression. For example, pattern="\Number One$\s*The Larch\s*" would only match the line "Number One" if it were followed by "The Larch". Multiple patterns can be specified using the attributes pattern2, pattern3...

* **endPattern**, **endPattern2**, **endPattern3**… - as for pattern (see above), but this matches where the template stops parsing. If , then if the end of the text that this template is parsing is reached, the entire text will be included in the template. If the endPattern is anything other than , then no text will be included in the template and the entire text will be available for matching by another template within the parent.

* **offset**, **offset2**, **offset3**… - the number of lines either side of the match to include within the template. With the default of "0", all the text from (and including) the first matched line is included in the template. An offset of "-2" includes the two lines before the match; an offset of "3" excludes the first line of the match and the two lines following it. If no offset is specified (or only offset is specified) the offset will apply to all matches (i.e. pattern, pattern2, pattern3…). If, for example, offset2 is specified, then this is the offset that will be applied when pattern2 is matched.

* **endOffset**, **endOffset2**, **endOffset3**… - the number of lines to include in the template after the endPattern match. With the default of "0", the line matched by endPattern is NOT included, and this line is pushed into the containing template, where it may be matched by the pattern of another template. An endOffset of "2" includes the endPattern line and the one after. An endOffset of "-1" excludes the line preceding the match as well.
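The interplay of **pattern**, **newline** and **offset** can be illustrated with a small Java sketch. It is an assumption about how such matching could work, not the JUMBO-Converters implementation: the class and method names are invented, each part of the pattern is required to match a whole line, and the regular expression is a variant of the example above written for the test data.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of multi-line template patterns; not the JUMBO implementation.
class MultiLinePatternSketch {
    // Returns the index of the first line at which every part of the pattern
    // matches consecutive lines, or -1 if there is no such position.
    static int findStart(List<String> lines, String pattern, String newline) {
        String[] parts = pattern.split(Pattern.quote(newline));
        for (int i = 0; i + parts.length <= lines.size(); i++) {
            boolean all = true;
            for (int j = 0; j < parts.length && all; j++) {
                all = Pattern.compile(parts[j]).matcher(lines.get(i + j)).matches();
            }
            if (all) {
                return i;
            }
        }
        return -1;
    }

    // Applies an offset to the matched line index, clamped to the text:
    // negative values pull in earlier lines, positive values skip lines.
    static int applyOffset(int matchIndex, int offset, int lineCount) {
        return Math.max(0, Math.min(lineCount, matchIndex + offset));
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "Intro", "Number One", "The Spam", "Number One", " The Larch ");
        int start = findStart(lines, "\\s*Number One$\\s*The Larch\\s*", "$");
        System.out.println("pattern matched at line index " + start);
        System.out.println("with offset -2 the template starts at index "
            + applyOffset(start, -2, lines.size()));
    }
}
```

In this sketch the first "Number One" line is skipped because it is not followed by "The Larch", which mirrors the behaviour described for the **pattern** attribute.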

Records

Records are the machinery used to extract text from a file and mark it up into XML.

A record is an XML element, which can have a number of attributes (see below) and which may contain a string, which is a simple regular expression-type language for determining what will be extracted and how it will be marked up.

Unlike the templates, where each template is tried in turn against each line of the file, records are processed sequentially. Each record is processed in turn until it fails, at which point the next record is processed until all records in the module have been processed.

An empty record (such as <record repeat="2"></record>) can be used to "gobble" lines (which are discarded).

If the record has content, then the text of the line is parsed into a CML list with a templateRef as specified by the **id** of the record.
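The sequential behaviour of records can be sketched as follows. This is a simplified model under stated assumptions, not the record syntax itself: RecordSketch is an invented class, records here use plain Java regular expressions with one capturing group, and a repeat of -1 stands in for repeat="*".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of sequential record processing; not the JUMBO record syntax.
class RecordSketch {
    final Pattern regex; // null means an empty record that gobbles lines
    final int repeat;    // maximum number of lines to consume; -1 stands for "*"

    RecordSketch(String regex, int repeat) {
        this.regex = regex == null ? null : Pattern.compile(regex);
        this.repeat = repeat;
    }

    // Runs each record in turn; a record keeps consuming lines until it fails
    // to match or reaches its repeat count, then the next record takes over.
    static List<String> run(List<RecordSketch> records, List<String> lines) {
        List<String> extracted = new ArrayList<>();
        int pos = 0;
        for (RecordSketch record : records) {
            int count = 0;
            while (pos < lines.size() && (record.repeat < 0 || count < record.repeat)) {
                if (record.regex != null) {
                    Matcher m = record.regex.matcher(lines.get(pos));
                    if (!m.matches()) {
                        break; // record fails: move on to the next record
                    }
                    extracted.add(m.group(1)); // keep the captured value
                }
                pos++; // lines consumed by an empty record are simply skipped
                count++;
            }
        }
        return extracted;
    }

    public static void main(String[] args) {
        List<RecordSketch> records = List.of(
            new RecordSketch(null, 2),          // gobble two header lines
            new RecordSketch("E=(\\S+)", -1));  // then read every energy line
        List<String> lines = List.of("title", "----", "E=1.0", "E=2.0", "end");
        System.out.println(run(records, lines));
    }
}
```

The first record discards the two header lines, and the second extracts values until it meets a line it cannot match.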

A simple example to read the XYZ format geometry printed in an NWChem output is shown below. The text that is to be parsed is:

The records to parse this are:

The result of this parsing is as follows:
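As a rough illustration in plain Java (not the record syntax, and not the NWChem template itself), the kind of extraction involved in reading an XYZ-style geometry looks like this. The coordinate values, the class name XyzLineSketch and the regular expression are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of pulling atoms out of XYZ-style lines.
class XyzLineSketch {
    // One atom line: an element symbol followed by three decimal coordinates.
    static final Pattern ATOM = Pattern.compile(
        "\\s*([A-Z][a-z]?)\\s+(-?\\d+\\.\\d+)\\s+(-?\\d+\\.\\d+)\\s+(-?\\d+\\.\\d+)\\s*");

    // Returns {element, x, y, z} for each line that looks like an atom;
    // lines that do not match are left out of the result.
    static List<String[]> parse(List<String> lines) {
        List<String[]> atoms = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ATOM.matcher(line);
            if (m.matches()) {
                atoms.add(new String[] {m.group(1), m.group(2), m.group(3), m.group(4)});
            }
        }
        return atoms;
    }

    public static void main(String[] args) {
        // Invented sample text; the values are for illustration only.
        List<String[]> atoms = parse(List.of(
            " 3",
            " geometry",
            " O 0.00000000 0.00000000 0.11726921",
            " H 0.75698224 0.00000000 -0.46907685"));
        for (String[] atom : atoms) {
            System.out.println(atom[0] + " z=" + Double.parseDouble(atom[3]));
        }
    }
}
```

In the real converters, each extracted value would be marked up as CML with a templateRef taken from the record's **id**, rather than returned as strings.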

Unit Testing Framework

The templates contain their own internal testing framework, in the form of one or more pairs of comment blocks within them.

A comment block with the **class** attribute "example.input" should contain a small representative chunk of text that the parsers can be tested with. The **id** attribute is used to match the example input with the representative output that should be produced when the template acts on the sample text.

An input comment is shown below:

The matching example.output comment is below:

It is possible for the templates to contain multiple examples, provided that each pair has matching **id** attributes. In this case, each matching pair will be tested in turn and all must pass for the unit test to be successful.
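The pairing of example.input and example.output comments by **id** can be sketched with the standard Java DOM API. This is not the code in TemplateUnitTests.java: the class name is invented, the template string is a made-up minimal one, and the output is treated as plain text here, whereas a real example.output holds XML.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of matching example.input and example.output by id.
class ExamplePairSketch {
    // Maps each example id to {input text, output text}; a missing half stays null.
    static Map<String, String[]> pairs(String templateXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(templateXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("template is not well-formed XML", e);
        }
        Map<String, String[]> byId = new LinkedHashMap<>();
        NodeList comments = doc.getElementsByTagName("comment");
        for (int i = 0; i < comments.getLength(); i++) {
            Element c = (Element) comments.item(i);
            String[] pair = byId.computeIfAbsent(c.getAttribute("id"), k -> new String[2]);
            String cls = c.getAttribute("class");
            if ("example.input".equals(cls)) {
                pair[0] = c.getTextContent();
            }
            if ("example.output".equals(cls)) {
                pair[1] = c.getTextContent();
            }
        }
        return byId;
    }

    public static void main(String[] args) {
        String template = "<template id=\"xyz\">"
            + "<comment class=\"example.input\" id=\"xyz\">O 0.0 0.0 0.1</comment>"
            + "<comment class=\"example.output\" id=\"xyz\">expected</comment>"
            + "</template>";
        pairs(template).forEach((id, pair) ->
            System.out.println(id + ": input=" + pair[0] + ", output=" + pair[1]));
    }
}
```

A test runner built on this idea would parse each input with the template and compare the result with the paired output, failing if either half of a pair is missing.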

For the NWChem logfile templates, the code that runs these tests lives in the file:

TemplateUnitTests.java

To test and develop an individual template (using the xyz template as an example), the following line needs to be added to the TemplateTest.java file.

The individual test can be run from within Eclipse, but from the command-line, it only appears possible to run all of the TemplateTests (see note below), using the following command, whilst sat in the **jumboconverters-compchem/jc-compchem-nwchem** directory:

If you are developing a template, the first time this is run, it will fail. However, it will print out the output of running the test, and something like the following:

The chunk of text after the **------------test---------------------** line, and excluding the **** line is the output of the test. This should be checked, and if correct, placed in the **<comment class="example.output" id="xyz"></comment>** tag in the template. Re-running the test should then lead to a successful result.

**Note:** The discussion at Stack Overflow and the Maven documentation suggest that the following syntax should work:

But this appears not to be the case. Are we using a version of JUnit older than 4.7?

Transforming the raw XML

As has been mentioned, the parsing is a two-stage process, consisting of marking up the file with XML and then converting the raw XML to valid CML. In some cases, the raw XML may already be valid CML, but in most cases transforms will need to be applied.

The transforms can either be applied within the template, after the text has been parsed and marked up, or as an entirely separate step, once the whole file has been parsed.

The transformation process relies heavily on the powerful XPath language. A short tutorial on xpath can be found here.

The philosophy of the transforms is very similar to that of templates in XSLT, using the idea of a "nodeset" to which operations are applied.

The transforms are carried out by elements like the following:

In this case, the attribute **id="job"** will be added to all cml modules that are direct children of the document, and have the **templateRef** "job".
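The effect of such a transform can be reproduced with the standard Java XPath API. The sketch below is not TransformElement.java: the class name is invented, the XML is a made-up fragment without namespaces (in the real raw XML, templateRef lives in the cmlx namespace), and the XPath is written for that simplified fragment.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of an addAttribute-style transform on a nodeset.
class AddAttributeSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    // Adds name="value" to every element in the nodeset selected by xpath
    // and returns how many elements were changed. The xpath must select elements.
    static int addAttribute(Document doc, String xpath, String name, String value) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                ((Element) nodes.item(i)).setAttribute(name, value);
            }
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad xpath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<root><module templateRef='job'/>"
            + "<module templateRef='scf'/>"
            + "<list><module templateRef='job'/></list></root>");
        int changed = addAttribute(doc, "/root/module[@templateRef='job']", "id", "job");
        System.out.println(changed + " module(s) given id=\"job\"");
    }
}
```

Only the module that is a direct child of the root and carries templateRef "job" is changed; the nested one is outside the nodeset.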

Each transform has a **process**, which defines the operation that will be carried out. Almost all have an **xpath**, an XPath expression indicating the elements the process will be applied to (the nodeset), and a variable number of arguments, depending on the process being carried out.

A brief overview of the key transformations follows below. For those with a strong constitution, more comprehensive documentation can be found by examining the code in the file TransformElement.java

The text from ~ line 160, starting with the comment **// process values** lists the processes that are available.

Various miscellaneous notes will be added in the section below, which will be merged into the documentation in due course.

Key Transforms

* **addAttribute** - add an attribute of type **name** and value **value** to all nodes in the **xpath** nodeset. If value is a string of the form "$string(XPATH)" or "$number(XPATH)", where XPATH is a valid XPath expression (relative to the node defined in the xpath attribute), then the value will be the result of evaluating the XPATH relative to the current node in the nodeset evaluated by **xpath**, in string or number form. If **name** consists of two strings separated by a colon, SOMETHING EXCITING HAPPENS...

* **addChild** - this will create a child element of the nodes specified by the **xpath**. The only required argument is **elementName**, which specifies the type of element to create. Other supported arguments are: **id**, **dictRef** and **value**. The **position** argument specifies where the child will be created in the list of children. position="0" creates it as the first child, "1" as the second, etc. With no position argument, the child is added as the last child. If value is a string of the form "$string(XPATH)" or "$number(XPATH)", where XPATH is a valid XPath expression, then the value will be the result of evaluating the XPATH relative to the current node in the nodeset evaluated by **xpath**, in string or number form.

* **addDictRef** - this will add a **dictRef** attribute with the specified value to the nodes defined by **xpath**.

* **addId** - this adds an **id** with the value specified by the **value** argument to the nodeset specified in the **xpath**.

* **addMap** - for every node in the nodeset specified by **xpath**, this creates a **cml:map** with the specified **id** that links the **values** of the nodes in the **from** nodeset to that in the **to** nodeset.

* **addNamespace** - this will add a namespace element of the form xmlns:**name**="**value**" to every element in the nodeset returned by **xpath**.

* **addSibling** - this will add a sibling element to each node in the **xpath** nodeset, with the type of element being that specified in **elementName** and the element's id attribute as specified by the **id** argument. The **position** argument indicates where the element will be created: "0" creates the element before the current node, "1" creates it after the current node. If there are multiple siblings to the current node, "-2" would create it 2 nodes down from the current node, "2", one up from it etc. If value is a string of the form "$string(XPATH)" or "$number(XPATH)", where XPATH is a valid XPath expression, then the value will be the result of evaluating the XPATH relative to the current node in the nodeset evaluated by **xpath**, in string or number form.

* **addUnits** - this will add a **units** attribute to the element with the value specified in the **value** argument. The value should be of the form **namespace:id**, where namespace refers to one of the units dictionaries and the id points to the actual unit. In the example below, the namespace refers to the non-si unit dictionary and the id links to the entry for the hartree.

* **copy** - this copies the nodes defined by **xpath** to the xpath defined by the **to** argument, which is relative to the element being copied. e.g. if **to** is ".", then the element and its children will be copied to become children of itself. If the element has an **id** attribute, this will have the string ".copy" appended to it, if not, an id of "copy.n" will be created, where n is the index of the node in the original xpath.

* **createAngle** - TODO

* **createArray** - this will create a cml:array at each of the nodes in the **xpath** query from the cml:scalar nodes generated by the **from** xpath query. If only one node is supplied, the contents of the node will be separated by whitespace and the array created from these. Arrays can only be created for integer or double data types. The scalar nodes will then be discarded. You may also specify a **delimiter** attribute to indicate a given character that separates the values of the array.

* **createAtom** - TODO

* **createDate** - TODO

* **createDouble** - TODO

* **createFormula** - TODO

* **createLength** - TODO

* **createList** - this will take a list of nodes from the **xpath** and, if they are cml modules, it will convert them to cml lists.

* **createMatrix** - this takes an array from the **xpath** and creates a matrix in return.

* **createMatrix33** - TODO

* **createMolecule** - this will create a molecule from the list of cml:arrays generated by the **xpath** query. The length of the arrays indicates the number of atoms, and the **dictRef** attribute of each array determines the property of the atom it will be used for. Supported types are: **x3**, **y3**, **z3** for the coordinates, **id**, **elementType**, **label** and **atomTypeRef**. The molecule will be created as a child of the parent of the first array, and the arrays will then be discarded. The Gaussian template l202.orient.xml is shown below as an example.

* **createNameValue** - TODO

* **createString** - if **xpath** returns a list of arrays, then each array will be converted into a cml:scalar with a dataType xsd:string, with the value of the scalar being the values in the array, concatenated as strings and separated by whitespace. If **xpath** returns a list of cml:scalars, then the first scalar will be converted to type xsd:string, the value of which will be the concatenation of all the values in the remaining scalar nodes. The remaining scalar nodes will then be deleted. If a single node is returned by **xpath** and it is a text node, then a new cml:scalar node will be created in its place with an optional id attribute as specified in the **id** argument.

* **createTable** - creates a table in the root node. Only the **xpath** attribute needs to be specified; it must point to a **list** containing **only arrays**. If anything other than an array is present in the list, the transform will fail.

* **createTorsion** - TODO

* **createVector3** - for each node specified in the **xpath** this will take the nodes listed in the **to** argument and create a cml:vector from them, and give it the specified **dictRef**. The **to** argument must return 3 cml:scalar nodes for this to work.

* **createWrapper** - for each node in the **xpath** nodeset, this will create an enveloping element of type **elementName** that will become the child of the node's parent, and hold the node and all of its children. **id** and **dictRef** arguments are supported.

* **createWrapperMetadata** - for each node in the **xpath** nodelist, if the node is one of a cml:scalar, cml:array, cml:list, cml:table or cml:matrix, and has a "dictRef" attribute, it will remove the dictRef attribute and instead wrap the element in a cml:metadata element, so that, e.g. :

becomes:

* **createWrapperParameter** - this performs the same operation as createWrapperMetadata, but wraps the element in a cml:parameter with the dictRef of the target element.

* **createWrapperProperty** - this performs the same operation as createWrapperMetadata, but wraps the element in a cml:property with the dictRef of the target element.

* **createZMatrix** - TODO

* **delete** - this will delete the list of nodes defined by the xpath, along with all of their child nodes.

* **debugNodes** - this just prints out the nodes selected by the xpath and is only useful for developing and debugging the transforms.

* **groupSiblings** - TODO

* **joinArrays** - with a single **xpath** argument, this will take the first array in the nodelist and join all the others to it, deleting the other arrays and leaving a single array with the dictRef of the original array. With an additional **key** argument, arrays with the same value of the specified attribute will be merged (and the specified attribute will be discarded thereafter). With an additional **from** argument, SOMETHING ELSE HAPPENS.

* **move** - this takes one or more nodes, and moves them into the node defined by the **to** argument. The **to** argument is an xpath that must return just a single element. The **position** argument indicates where among the children of the target the element will be placed. "1" makes the element the first, "2" makes it the second etc.

* **moveRelative** - this is similar to move, but the **to** argument is an xpath that is relative to the element being moved, so that if the **xpath** returns a list of elements scattered throughout the document, each will be moved to the **to** relative to itself.

* **pullup** - this takes one or more elements defined by an xpath and moves them up out of their current containing element, so that they become children of their current grandparent. The only argument required is the xpath of the nodes to be pulled up.

* **pullupSingleton** - this takes one or more elements defined by an xpath and, if the element only has one child, replaces the element with the child, thereby deleting the original element and "pulling up" the child.

* **reparse** - this reparses a scalar specified by an xpath and must have either a regexPath attribute whose value points to a record element containing the desired regular expression, or a regex attribute whose value is the regular expression to be used for reparsing.

* **setDataType** - sets the dataType attribute of the nodes in the **xpath** to the string given as **value**. This can be useful if data were initially extracted as , but should actually contain data of type or .

* **setValue** - with just a simple string given as the **value** argument (e.g. value="foo"), this will set the value of all nodes in the **xpath** to be equal to this string. If value is a string of the form "$string(XPATH)" or "$number(XPATH)", where XPATH is a valid XPath expression, then the value will be the result of evaluating the XPATH relative to the current node in the nodeset evaluated by **xpath**, in string or number form. With a map argument ...TODO

* **split** - this will take the nodes in the **xpath** and split them (in places) according to a specified splitter attribute, which is defined by a regular expression. With no splitter attribute specified, a scalar will be split by whitespace and turned into a list, a 1D array will be split into a cml list, and 2D arrays will be split into a list of separate arrays.
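Several of the transforms above (addAttribute, addChild, addSibling, setValue) accept a value of the form "$string(XPATH)" or "$number(XPATH)" that is evaluated relative to each node in the nodeset. The sketch below shows one way such a value could be resolved with the standard Java XPath API. It is an assumption for illustration only: the class and method names are invented, and how the real code formats numbers is not documented here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch of resolving "$string(...)" and "$number(...)" values.
class ValueExpressionSketch {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    // Evaluates the embedded XPath relative to the context node, or returns
    // the value unchanged when it is a plain string.
    static String resolve(String value, Node context) {
        XPath xp = XPathFactory.newInstance().newXPath();
        try {
            if (value.startsWith("$string(") && value.endsWith(")")) {
                String expr = value.substring("$string(".length(), value.length() - 1);
                return xp.evaluate(expr, context);
            }
            if (value.startsWith("$number(") && value.endsWith(")")) {
                String expr = value.substring("$number(".length(), value.length() - 1);
                Double d = (Double) xp.evaluate(expr, context, XPathConstants.NUMBER);
                return String.valueOf(d);
            }
            return value;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad value expression: " + value, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<list><scalar units='hartree'>-76.5</scalar></list>");
        Node scalar = doc.getDocumentElement().getFirstChild();
        System.out.println(resolve("$string(@units)", scalar));
        System.out.println(resolve("$number(.)", scalar));
        System.out.println(resolve("foo", scalar));
    }
}
```

Because the expression is evaluated against each node in turn, the same transform can give a different value to every node in the nodeset.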

Notes on Transforms

* where possible, **id**s should always be added to nodesets to facilitate later operations.

* the < and > symbols should not be used in xpath comparisons; however, the entities &lt; and &gt; can be used, as shown below:

* use "..." for quoting attribute values

* Rather than use namespace prefixes (e.g. g:charge), the more reliable namespace-uri syntax can be used:
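The reason the namespace-uri form is more robust can be seen with the standard Java XPath API: it selects elements by their namespace URI, so it does not depend on which prefix a particular file happens to use. In this sketch the class name and the namespace URI http://example.org/gaussian are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of prefix-independent selection with namespace-uri().
class NamespaceUriSketch {
    // Counts the nodes selected by xpath in a namespace-aware parse of xml.
    static int count(String xml, String xpath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // The same (invented) namespace URI is bound to two different prefixes.
        String xml = "<root xmlns:g='http://example.org/gaussian'"
            + " xmlns:x='http://example.org/gaussian'>"
            + "<g:charge>0</g:charge><x:charge>1</x:charge><charge>2</charge></root>";
        String xpath = "//*[local-name()='charge'"
            + " and namespace-uri()='http://example.org/gaussian']";
        System.out.println(count(xml, xpath) + " charge elements in the namespace");
    }
}
```

Both prefixed elements are found regardless of the prefix used, while the un-namespaced charge element is ignored.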

JUMBO-Converters filesystem structure

(I guess some of this is standard for Maven projects but my ignorance forces me to document everything. The bright side is that other newbies like me will feel happy!)

The main folder is

Under this is:

The two most important subfolders of this are

The second one is where the final compiled Java classes are located (**any more stuff?**) and we will not care about it for the moment. The subfolder, as its name indicates, contains the source code associated to the compchem part of JUMBOconverters (i.e., the one most related to the Quixote project). Inside the subfolder, we have the following chain of folders, at the bottom of which all Java source code is located:

Inside , we have two main subfolders:

The most specific compchem code is in (as you might have guessed!) ordered by the name of the compchem package (, , , etc.), and contains more general source code to support the former.

If you are Java-savvy, you might want to check these folders and read the code, but one of the great things about the declarative approach that PMR has built into JUMBOconverters, and that we describe in this page, is that you don't need to! If you know regular expressions and some very basic XPath (both of which you could even infer from existing examples), that should be sufficient.

One important thing to remember though, even if you don't plan to read the Java source code, is that the above folder structure translates into the names of the classes that do all the magic stuff, so, if you want to call these classes on the command line, like in

you need to have this structure in mind.
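The relationship between folders and class names is the standard Java one: the path below the source root, with separators replaced by dots and the .java extension removed, is the fully qualified class name. A minimal sketch, using an invented path (org/example/...) rather than the real JUMBO-Converters package:

```java
// Hypothetical sketch of how a source path maps to a fully qualified class name.
class ClassNameSketch {
    // Converts a path relative to the Java source root into a class name.
    static String toClassName(String relativeSourcePath) {
        String suffix = ".java";
        String noExtension = relativeSourcePath.endsWith(suffix)
            ? relativeSourcePath.substring(0, relativeSourcePath.length() - suffix.length())
            : relativeSourcePath;
        return noExtension.replace('/', '.').replace('\\', '.');
    }

    public static void main(String[] args) {
        // The package path here is invented; only the class name comes from this page.
        System.out.println(
            toClassName("org/example/compchem/nwchem/NWChemLog2XMLConverter.java"));
    }
}
```

The resulting dotted name is what the java command expects when a class is launched from the command line.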

The declarative bits of the parsing infrastructure (i.e., what you, as a parser developer, will have to check, understand and probably adapt for your favourite compchem code) are inside a similar folder tree under :

Inside each code folder, one can find subfolders for the different types of file, and inside each one of them a subfolder, e.g.,

In the rest of the sections and in some of the tutorials, we explain in detail how the different bits of declarative parsing are related and how everything works, but let us mention at this point that, in the filetype folders (i.e., in or ) the top level parsing template list file can be found, while each one of the smaller templates included in this list is located in .

Now, branching out at the same level as , still inside , we have a subfolder, which contains, on the one hand (under ), the Java source code for performing automatic tests, and, on the other hand (under ), a number of example files produced by the compchem codes that Quixote wants to tackle. The scheme of the folder tree is as follows:

A **general scheme** summarizing all the details commented above is the following:
