Data Extraction Library (libdext)

This document describes the Data Extraction Library (libdext). libdext uses a scripting language to extract one or more data sets from a collection of files. These extracted data sets are then merged into a single target data set that is suitable for further post-processing. This edition documents version 1.0.

1. Notices

2. Introduction

3. Installation

4. Using the Library

5. Syntax

6. Examples

7. Troubleshooting

Concept Index

--- The Detailed Node Listing ---

Introduction

2.1 Companion Software

2.2 Obtaining

2.3 Other Documentation

Installation

3.1 Dependencies

3.2 Platforms

3.3 Required Build Tools

3.4 Building

Using the Library

4.1 Format of Extracted Data

4.2 Groups

4.3 Files

4.4 Definitions

4.5 Conditions

4.6 Merging

4.7 Testing

Files

4.3.1 File Format

4.3.2 Selecting Columns

4.3.3 Selecting An Index

4.3.4 Selecting Samples

1. Notices

The Data Extraction Library (libdext)

Copyright (C) 1997 - 2003 Mike Arnold, Altjira Software

Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies.

This software and documentation is provided "as is" without express or implied warranty. All questions should be addressed to mikea@altjira.com.

2. Introduction

One problem in post-processing the output from a simulation is extracting the necessary data from the data output files that are generated during the simulation. libdext addresses this issue in two stages. Firstly, it uses a simple scripting language to define how a data file is formatted and what data is to be extracted from the file. Secondly, it allows multiple extracted data sets to be merged into a single target data set which is then available for further post-processing.

libdext does not support any post-processing such as analysis or graphics. It is concerned only with the extraction of the required data from the available data files and their merging into a single usable target data set.

2.1 Companion Software

libdext is currently used as part of, or is compatible with the data generated by, the following packages:

matdpp
matdpp is a suite of MATLAB scripts based around a MATLAB mex-file routine, readdext(), which calls libdext to read in data sets from a set of data files.
TENNS
TENNS is a neural simulation framework designed for heterogeneous models that fit a network paradigm of nodes and connections.

2.2 Obtaining

libdext may be downloaded from http://www.altjira.com/distrib/libdext. See also sections 3.1 Dependencies and 2.1 Companion Software for other software that may need to be downloaded and installed.

The latest version of this documentation may be found at http://www.altjira.com/support/libdext/libdext.html.

2.3 Other Documentation

Further information regarding the interoperability of NMI components and supporting frameworks, their development and use, and all associated support and post-processing packages, is available in the NMI Simulation Resources documentation.

This documentation is also installed to ${prefix}/doc/libdext, where `${prefix}' is the installation directory specified during the build configuration.

3. Installation

3.1 Dependencies

Libdext is built on the ANSI Standard C++ Library. It also requires both the Network Model Interface library libnmi and the xutil library libxutil.

3.2 Platforms

libdext has been successfully tested on the following systems:

i686-linux-gnu (Mandrake 8.2 and Redhat 7.2)
solaris 2.7
irix 6.5
freebsd release 4.4, 4.5, 4.6

3.3 Required Build Tools

The following software is suggested. Other versions may work but have not been tested.

Package maintenance requires:

3.4 Building

Refer to the README and INSTALL files in the distribution.

4. Using the Library

The primary class is a Group. Other classes include Files, Definitions and Conditions. Data sets are extracted from File constructs according to Definitions and Conditions. A Group object merges the data sets extracted from a set of File constructs into a single target data set.

4.1 Format of Extracted Data

The format of an extracted data set is always the same, independent of whether it is extracted from a single File construct, or is the result of merging multiple File constructs within a Group. It contains the following elements:

data
A matrix representing the merged extracted data. Each column represents a different dimension and each row represents a different sample. The number of data vectors is given by the number of rows and the number of dimensions is given by the number of columns which is the same for each row. The matrix is stored columnwise in a 1-dimensional array of type double.
index
An optional 1-dimensional array of type double representing an index column for the extracted data set. If it exists then the length of the array is equal to the number of rows in the data set, otherwise it is of length zero.
titles
A 1-dimensional array of strings representing the title of each column in the extracted data set. There is no title for the optional index column.
description
A string representing the description of the data set. It is generated from the descriptions for each of the files used to generate the data set.
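
The columnwise storage of the data element can be illustrated with a short Python sketch. This is not part of libdext; the function name element() and the sample values are assumptions made for illustration, and indices are 0-based here.

```python
def element(data, n_rows, row, col):
    """Return the value at (row, col) of a matrix stored columnwise
    in a flat 1-dimensional sequence (0-based row and col)."""
    return data[col * n_rows + row]

# 3 samples (rows) by 2 dimensions (columns), one column after another.
flat = [100.1, 100.2, 100.3,   # column 0
        0.15, 0.19, 0.23]      # column 1

print(element(flat, 3, 1, 1))  # second sample, second dimension
```

With this layout all the values of one dimension are contiguous, which is convenient for callers that expect column-major arrays.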

4.2 Groups

A Group object uses a configuration string to generate the target data set. This configuration string is a list of File configuration strings. The Group object parses this configuration string and creates the corresponding list of File objects. Each File configuration string may contain Condition and Definition configuration strings. Each File object contains a data set extracted from a text-based data source, such as a file, according to this set of Conditions and Definitions. These extracted data sets are then merged to generate the target data set.

4.3 Files

A File object extracts information from the data set contained in a file according to the specified configuration string (see section 5. Syntax). The configuration string includes information about the name of the file and its format, along with configuration strings for Definitions and Conditions. The Definitions define which columns or dimensions in the data set to extract. The Conditions define which rows or samples in the data set to extract.

4.3.1 File Format

The format of the source file must meet a number of criteria:

the file contains only ASCII text.
comments within the file are identified by a '#' at the beginning of the line.
data within the file is separated by the specified separator. The default separator is a single white space.
the data is arranged in rows and columns. Each row represents a sample. Each column represents a different dimension of the data space.
the file may contain an initial block of comments which contains special information about how the file was generated and the description of each dimension of the data space.

It is assumed that the number of columns is fixed for a file. The number of columns is given by the number of column descriptors if supplied, else by the number of columns in the first line of data. Subsequent lines with the wrong number of columns are discarded.

The following special information may be contained in the initial block of comments:

#<space>Titles<space>[<title><space> ...]

Defines the number of columns in the file and the title for each column.

#[<text>]Experiment<space>:<space><description><end of line>

Describes how the file was generated.

Here is the start of an example file:

    # tennsS (pid 4848) started on <Mon Sep 2 19:15:52 2002>
    # Experiment : sp-ca
    # current time is <Mon Sep 2 19:16:10 2002>
    # Titles time xeye yeye xeyevel yeyevel xeyeacc yeyeacc
    100.1 0.152867 -0.0426542 0.48 -0.504554 0 0
    100.2 0.189939 -0.0895467 0.32 -0.42693 0 0
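
The file format rules in this section can be sketched in Python as follows. This is an illustration only and not the libdext implementation: the function name parse_source(), the treatment of blank lines and non-numeric fields as ignorable or bad lines, and the use of white-space splitting for the default separator are all assumptions.

```python
def parse_source(lines, separator=None):
    """Parse text lines laid out as described in this section.

    separator=None splits on runs of white space (an assumption standing
    in for the default separator). Returns titles, description, rows and
    the number of discarded lines.
    """
    titles, description, rows, bad = [], "", [], 0
    n_cols = None
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue                      # assumption: blank lines are ignored
        if line.startswith("#"):
            if line.startswith("# Titles"):
                titles = line.split()[2:]
                n_cols = len(titles)      # titles fix the number of columns
            elif "Experiment : " in line:
                description = line.split("Experiment : ", 1)[1].strip()
            continue                      # every other comment is skipped
        fields = line.split(separator)
        if n_cols is None:
            n_cols = len(fields)          # else the first data line fixes it
        if len(fields) != n_cols:
            bad += 1                      # wrong number of columns: discard
            continue
        try:
            rows.append([float(f) for f in fields])
        except ValueError:
            bad += 1                      # assumption: non-numeric data is bad
    return titles, description, rows, bad


example = [
    "# tennsS (pid 4848) started on <Mon Sep 2 19:15:52 2002>",
    "# Experiment : sp-ca",
    "# current time is <Mon Sep 2 19:16:10 2002>",
    "# Titles time xeye yeye xeyevel yeyevel xeyeacc yeyeacc",
    "100.1 0.152867 -0.0426542 0.48 -0.504554 0 0",
    "100.2 0.189939 -0.0895467 0.32 -0.42693 0 0",
    "100.3 0.2",
]
titles, description, rows, bad = parse_source(example)
print(titles, description, len(rows), bad)
```

The last line of the example input is deliberately short, so it is counted as a bad line rather than added to the data.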

4.3.2 Selecting Columns

The columns in the file that are to be extracted may be specified by a list of configuration strings for Definitions (see section 4.4 Definitions). If no Definitions are specified then no data is extracted.

A single Definition may specify multiple columns in the file. A column may also be specified multiple times, either within the one Definition or between multiple Definitions, but it is included in the extracted data set only once.

The 'skip' prefix may be used with a Definition configuration to specify that the columns are not to be extracted. This can be used to provide exceptions to other Definitions. The set of columns to be extracted from the file is made up of the columns specified by all of the standard Definitions minus the columns specified by all of the skip Definitions.
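
The selection rule above amounts to a set difference, sketched here in Python. The function name select_columns() is hypothetical. The column numbers correspond to the titles of the example file in section 4.3.1, and counting the index column among the selected columns is an assumption drawn from the sample output in section 6.

```python
def select_columns(standard, skipped):
    """Union of the standard Definitions minus the union of the skip
    Definitions. Each Definition is given as a set of 1-based columns."""
    wanted = set().union(*standard)
    unwanted = set().union(*skipped)
    return sorted(wanted - unwanted)   # each column appears only once

# index 'time' -> {1}, indexdef 7 -> {7}, regexdef xeye -> {2, 4, 6}
standard = [{1}, {7}, {2, 4, 6}]
# skip regexdef acc -> {6, 7}
skipped = [{6, 7}]

print(select_columns(standard, skipped))
```

Because sets are used, a column named by several Definitions is still extracted only once, and an empty list of standard Definitions selects nothing.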

4.3.3 Selecting An Index

One column in the file may be optionally defined as the index column. The index column is used when merging the extracted data sets from multiple files into a single target data set.

The index column is specified by giving the first Definition configuration the prefix 'index' (see section 5. Syntax). There may be only one index Definition for a File and it must be the first Definition. It must resolve to a single column.

4.3.4 Selecting Samples

The samples in the file that are to be extracted may be specified by a list of configuration strings for Conditions (see section 4.5 Conditions). If no Conditions are specified then all the samples are extracted.

A single Condition will generally select multiple samples in the file. A sample may also be selected by multiple Conditions, but it is included in the extracted data set only once.

The set of samples to be extracted from the file is specified by AND-ing all the Conditions together. For a sample to be selected, it must meet all of the Conditions.

4.4 Definitions

There are two types of Definition objects, IndexDef and RegexDef objects (see section 5. Syntax).

An IndexDef object specifies columns by number, where the columns are numbered from 1 to N. Multiple indices may be specified using a comma (',') as a separator. A range of indices may be specified using a dash ('-') as a separator. A '~' as the first character in the specifier string means exclusion.

A RegexDef object specifies columns by matching a regular expression against their titles.
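
The two kinds of Definition can be sketched in Python as below. The function names are hypothetical and this is not the libdext code. Two points are assumptions: that '~' selects every column except those listed, and that a regular expression matches a title when it matches anywhere within it (Python's re module also stands in for whatever regular expression flavour libdext uses).

```python
import re

def index_columns(spec, n_cols):
    """Resolve an IndexDef-style specifier such as '1,3-5' or '~2'
    to a sorted list of 1-based column numbers."""
    exclude = spec.startswith("~")
    body = spec[1:] if exclude else spec
    chosen = set()
    for part in body.split(","):
        if "-" in part:
            low, high = part.split("-")
            chosen.update(range(int(low), int(high) + 1))
        else:
            chosen.add(int(part))
    every = set(range(1, n_cols + 1))
    chosen &= every                      # ignore columns that do not exist
    return sorted(every - chosen if exclude else chosen)

def regex_columns(pattern, titles):
    """Resolve a RegexDef-style pattern to the 1-based numbers of the
    columns whose title matches the pattern."""
    return [i for i, title in enumerate(titles, start=1)
            if re.search(pattern, title)]

titles = ["time", "xeye", "yeye", "xeyevel", "yeyevel", "xeyeacc", "yeyeacc"]
print(index_columns("1,3-5", 7))
print(index_columns("~2", 4))
print(regex_columns("xeye", titles))
```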

4.5 Conditions

The configuration for a Condition object is given in two parts. The first part specifies an argument list. The second part defines a function with respect to a set of positional arguments (see section 5. Syntax). For each sample, the value of the Condition is calculated by passing the current values for the argument list as the positional arguments to the function. The return value of the function is interpreted as a boolean and defines the value of the Condition.

The arguments to the function may be any of the following: the current number of samples read from the file, the current number of samples selected from the file, the total number of samples in the file, and the value of the current sample for any of the columns in the file. A list of column arguments is specified using a list of Definition objects. The order in which certain of the arguments can be specified is pre-defined (see section 5. Syntax).

The function is defined in terms of the positional arguments x_0 to x_n. The number of positional arguments used in the definition of the function should not be more than the number of arguments defined in the argument list. The definition of a function allows common arithmetic, logical, and mathematical primitives including grouping and trigonometric functions (see section 5. Syntax). For detailed information refer to the libxutil library documentation for the Fexpr class.
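
The evaluation of Conditions can be sketched in Python as follows. This is an illustration under stated assumptions and not the libdext implementation: the function name extract() and the representation of a Condition as a pair (argument list, Python function) are hypothetical. The count of samples read is taken to include the current sample and the count of samples selected to exclude it, which is an assumption chosen to agree with the sample output in section 6.

```python
def extract(rows, conditions):
    """Return the rows that satisfy every Condition.

    Each Condition is (args, func). An entry of args is 'read',
    'selected', 'total', or a 1-based column number; the resolved values
    are passed to func as the positional arguments x_0, x_1, ...
    """
    selected = []
    total = len(rows)
    for read, row in enumerate(rows, start=1):
        counters = {"read": read, "selected": len(selected), "total": total}
        values = lambda args: [counters[a] if isinstance(a, str) else row[a - 1]
                               for a in args]
        # The Conditions are AND-ed: all of them must hold.
        if all(func(*values(args)) for args, func in conditions):
            selected.append(row)
    return selected

# Ten samples; column 1 is a time-like value, column 2 the sample number.
rows = [[100.0 + i / 10.0, i] for i in range(1, 11)]
conditions = [
    ([1], lambda x_0: x_0 < 105),                                # "x_0 lt 105"
    (["read", "selected"], lambda x_0, x_1: x_0 > 5 and x_1 < 4),
]
picked = extract(rows, conditions)
print([row[1] for row in picked])
```

The second Condition mirrors the one in section 6: no more than four samples are selected, and only from the sixth sample onwards.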

4.6 Merging

Each File object extracts a data set and an optional index column from a file according to the specified definition. The data sets from each File object are merged into a single target data set according to the optional index columns for each File.

The data sets are merged on a sample-by-sample basis by matching the current values of all the index columns. The precision of this matching function can be specified and defaults to an exact match. Only rows where the values of the index columns match are extracted and placed in the target data set. To be effective, the merging algorithm requires that each index column be monotonically increasing.

If any of the Files to be merged does not have an index column specified, then it must be merged blindly. This means that every sample or row in the data set of a blind File will be merged, which places two constraints on the merging process. Firstly, all of the Files must have the same number of rows. Secondly, all of the index columns that are specified must match on every sample or row.
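
The indexed merge described above can be sketched in Python as a walk over all index columns at once. This is one possible way to implement such a merge and is not claimed to be libdext's algorithm; the function name merge_indexed() and the data layout are assumptions. Blind merging is not covered by the sketch.

```python
def merge_indexed(files, precision=0.0):
    """Merge data sets on their index columns.

    files is a list of (index, rows) pairs, where index is a
    monotonically increasing list and rows holds one list of values per
    index entry. A row is emitted only when the current index values of
    all files agree to within precision.
    """
    pos = [0] * len(files)
    merged_index, merged_rows = [], []
    while all(p < len(index) for p, (index, _) in zip(pos, files)):
        current = [index[p] for p, (index, _) in zip(pos, files)]
        low, high = min(current), max(current)
        if high - low <= precision:
            merged_index.append(current[0])
            merged_rows.append([v for p, (_, rows) in zip(pos, files)
                                for v in rows[p]])
            pos = [p + 1 for p in pos]
        else:
            # Advance only the files that lag behind with the smallest value.
            pos = [p + 1 if c == low else p for p, c in zip(pos, current)]
    return merged_index, merged_rows

a = ([1, 2, 3, 4], [[10], [20], [30], [40]])
b = ([2, 4, 5], [[0.2], [0.4], [0.5]])
index, data = merge_indexed([a, b])
print(index, data)
```

The walk only ever moves forwards through each file, which is why the index columns need to be monotonically increasing for the matching to find every common sample.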

4.7 Testing

The command dexttest <config_file> can be used to test a configuration script.

5. Syntax

A definition string for a Group consists of a list of one or more file definitions.

filelist: filedef | filelist filedef

A file definition starts with the token 'file' and ends with the token 'end'. The body of the definition consists of a list of file statements.

filedef: 'file' filestatlist 'end'

A file statement can be any of the following. The order in which the statements appear is not important and a statement may appear multiple times. In some cases, if a statement appears multiple times, the final value replaces the previous values ('seperator', 'precision' and 'index'). In other cases, each value is added to the set of previous values (definition, condition and <string>).

filestat: 'seperator' <string> | 'precision' <double> | 'index' definition | definition | 'skip' definition | condition | <string>

A <string> on its own defines the name of the file. If this statement appears multiple times then a File object is created for each file name. If the file cannot be located then the '.gz' suffix is appended to the name and the search is repeated. If the file is located with the '.gz' suffix then the file is first uncompressed using gunzip.
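
The fallback to a compressed file can be sketched in Python as follows. The function name open_source() is hypothetical, and Python's gzip module is used here in place of the external gunzip program that libdext calls.

```python
import gzip
import os
import tempfile

def open_source(name):
    """Open name as text, falling back to name + '.gz' if name is missing."""
    if os.path.exists(name):
        return open(name, "rt")
    if os.path.exists(name + ".gz"):
        return gzip.open(name + ".gz", "rt")   # decompressed while reading
    raise FileNotFoundError(name)

# Demonstration: only the compressed file exists on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.asc")
    with gzip.open(path + ".gz", "wt") as f:
        f.write("100.1 0.152867\n")
    with open_source(path) as f:
        first_line = f.readline()

print(first_line.strip())
```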

A definition statement can be either of the following.

definition: 'regexdef' <string> 'end' | 'indexdef' <string> 'end'

A condition statement is defined as a list of arguments followed by a function definition. Note that the order in which the arguments may be specified is restricted.

condition: 'condition' ['read'] ['selected'] ['total'] [definitionlist] function 'end'

A function is defined by a single string. To allow white space, enclose the string in double quotes. The only precedence is right to left, so curly braces, which are used in place of parentheses, should be used to specify the desired grouping. Note that the order in which the arguments can be specified is pre-defined. The following primitives can be used when defining a function:

`Arguments': x_0 x_1 x_2 ...
`Arithmetic Operators': + - / * %
`Logical Operators': gt ge lt le eq ne || &&
`Ternary Logical Operators': expr ? expr : expr
`Unary Functions': sin{} cos{} tan{} sinh{} cosh{} tanh{} abs{} ceil{} floor{} exp{} log{} log10{} sqr{} sgn{}
`Binary Functions': pow{,}
`Grouping': {}

6. Examples

The following example extracts a single data set from a single file. A single Group object will be created which contains a single File object.

    # libdext configuration file
    file
    # use the column labeled 'time' as the index
    index regexdef time
    # include column number 7
    indexdef 7
    # include all columns whose title matches 'xeye'
    regexdef xeye
    # do not include any columns whose title matches 'acc'
    skip regexdef acc
    # only include samples where the time is less than 105
    condition regexdef time "x_0 lt 105" end
    # select no more than 4 samples and only after the 5th sample
    condition read selected "{x_0 gt 5} && {x_1 lt 4}" end
    # apply to the following file
    test.asc
    end

If the start of the file test.asc is:

    # tennsS (pid 4848) started on <Mon Sep 2 19:15:52 2002>
    # Experiment : sp-ca
    # current time is <Mon Sep 2 19:16:10 2002>
    # Titles time xeye yeye xeyevel yeyevel xeyeacc yeyeacc
    100.1 0.152867 -0.0426542 0.48 -0.504554 0 0
    100.2 0.189939 -0.0895467 0.32 -0.42693 0 0

then running this script with dexttest produces the following output:

    ---- File <test.asc> from experiment <sp-ca> with total number of columns <20> ----
    number of samples read <6000> selected <4> bad lines <0>
    length of data set <4>
    number of selected columns <3>
    indices for selected columns <1 2 4>
    titles for selected columns <time xeye xeyevel>
    index column selected:
    matching <time> as regular expression on titles
    index in original data set <1>
    index in extracted data set <1>
    selecting columns based on definitions:
    matching <7> to list of indices <7>
    matching <xeye> as regular expression on titles
    skipping columns based on definitions:
    matching <acc> as regular expression on titles
    selecting data based on conditions:
    Condition <(x_0 < 105.000)> using as arguments:
    indices <1>
    selected definitions:
    matching <time> as regular expression on titles
    skipped definitions:
    Condition <((x_0 > 5.000) && (x_1 < 4.000))> using as arguments:
    <samples-read> <samples-selected>
    indices <>
    selected definitions:
    skipped definitions:
    ------------- Merging files <test.asc> -------------
    lengths of files <4>
    total number of columns <2>
    total number of indexed files <1>
    number of merged data points <4>
    rows <4> columns <2>
    description <sp-ca>
    titles: xeye xeyevel
    data:
    100.6 0.267235 0.48
    100.7 2.65211 0.48
    100.8 2.68907 0.24
    100.9 2.70061 2.66443e-17

7. Troubleshooting

Problems with column definition statements
Check that the correct number of columns is being found. It may be necessary to specify a separator, using the 'seperator' statement, within each file construct (see section 5. Syntax).
indexdef statements work but not regexdef statements
Make sure that the titles line is properly formatted and includes the word Titles (see section 4.3.1 File Format).
Problems with index statements
Index column definition statements must resolve to exactly one column.

Concept Index

About this document

This document was generated by Mike Arnold on June, 19 2003 using texi2html
