EXPRESS/Binary Report David Price ISO TC184 SC4 Toulouse June 2006

Agenda

1. Status since last ISO STEP meeting in Italy (added)

2. Walkthrough of current EXPRESS/HDF5 mapping

3. Presentation of prototypes and testing results

4. Issue discussion for next draft of mapping

5. Next actions and plans for testing

March 2006 Italy STEP Meeting Report Items

• Workshop hosted by HDF Group
  – Workshop Dec 6-8, 2005
  – Champaign, Illinois, USA
• STEP, ESA, commercial, EXPRESS/Binary and HDF 5 developer attendees
• Agenda was
  – Introduced HDF Group to EXPRESS language and STEP information models
  – HDF developers provided overview of HDF 5 Concepts and Structures
  – Walkthrough of EXPRESS/HDF Mapping Draft 0.2
  – Presentation by domain experts: AP209 Analysis, STEP TAS, SINDA/G, Ship AP Analysis Needs
  – Issues/requirements around APIs, programming languages, etc.

Summary Reported at March 2006 Italy STEP Meeting

• Many core issues on V0.2 spec addressed at the Dec 2005 workshop at HDF Group US facilities
  – The basic approach was flawed, V0.2 did not use enough of the HDF capability
• V0.3 will be an improvement and should allow better control of efficiency by the application
  – http://www.exff.org/express_binary

• Prototyping will follow V0.3

March 2006 Italy STEP Meeting Action Items

• David Price – Publish EXPRESS/HDF Mapping V0.3 due March 24

• Mats Lindeblad – Create New Work Item for June SC4 meeting

• David Price – contact Hans-Peter about linking a one-day workshop with the NASA/ESA PDE at the end of April (a day before Monday?)

• Keith Hunten – plan session at Eng Analysis sessions at PDES, Inc. Offsite end of March

• David/Mats – plan for technical work at June SC4 meeting

Progress Since March

• V0.3 published
• Short requirements session at PDES, Inc. Offsite where the EA team prioritized
  – Add SELECT
  – Add redefined attributes (does HDF support this?)
  – Add schema version attribute (may use URN)
  – What kind of metadata does NARA require?
    • National archives project
  – Also, need EXPRESS-to-C software to lower the barrier to participating in prototyping

Progress Since March (2)

• One-day workshop held with pyEXPRESS prototype team led by Alain Fagot and Hans-Peter
  – David Price Slides/Notes are available
• Post-workshop plan to produce V0.4
  – EA requirements
  – Better examples
  – Incorporate feedback/issues from pyEXPRESS

• Editor (i.e. David Price) could not provide sufficient time to the project to produce V0.4 or the EXPRESS-to-C software before June vacation

• V0.31 was published June 9 adding proposal for subset of SELECT types (one of the EA team priorities)

Current Mapping Walkthrough

Prototypes and Testing results

• pyEXPRESS testing (slides from PDE workshop)
  – Subset of EXPRESS (e.g. no complex instances)
  – Based on pyTables 1.3, HDF 1.6.5, Python 2.4
  – Using same EXPRESS-based API for P21 and HDF access
    • HDF is just another backend to the pyEXPRESS API
    • This is a different approach from what is assumed by the EXPRESS/Binary team where direct HDF API access was assumed (is “programmer ease of use” a very high priority?)
  – Compression (using ZLIB) and chunking make file smaller and more efficient for read/write
    • Even PC processors are powerful enough that decompression is faster than file access as HDF lets you only read into memory what you need at any given time
  – Benchmarks show good results (e.g. 10-50% file size and 75% access times), but also identify areas in the mapping that need improvement (e.g. small HDF files are bigger than P21 and sometimes slower)
  – STEP TAS will be a NWI in SC4 starting soon
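The point about compression and chunking can be illustrated without HDF at all. The following is a minimal sketch in plain Python, not the pyEXPRESS or pyTables implementation: rows are grouped into chunks, each chunk is compressed on its own with zlib, and a read decompresses only the chunk that holds the requested row. The chunk size, the function names and the sample P21-style rows are assumptions made for illustration.

```python
import zlib

CHUNK_ROWS = 1000  # hypothetical chunk size, in rows


def write_chunks(rows, chunk_rows=CHUNK_ROWS):
    """Compress fixed-size groups of rows independently, like chunked storage."""
    chunks = []
    for start in range(0, len(rows), chunk_rows):
        payload = "\n".join(rows[start:start + chunk_rows]).encode("utf-8")
        chunks.append(zlib.compress(payload))
    return chunks


def read_row(chunks, index, chunk_rows=CHUNK_ROWS):
    """Decompress only the chunk that holds the requested row."""
    chunk = zlib.decompress(chunks[index // chunk_rows]).decode("utf-8")
    return chunk.split("\n")[index % chunk_rows]


# Made-up sample data in the style of a P21 file.
rows = ["#%d=CARTESIAN_POINT('',(0.0,0.0,%d.0));" % (i, i) for i in range(10000)]
chunks = write_chunks(rows)
raw_size = sum(len(r) + 1 for r in rows)
packed_size = sum(len(c) for c in chunks)
```

Reading row 4321 touches one of the ten chunks and leaves the other nine compressed, which is the access pattern the bullet describes.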

Issue discussion for next draft of mapping

• <Technical work goes here>
  – David can edit source XML for V0.4 draft to include issue resolution we develop today
  – EA needs
    • Check V0.31 SELECT support (DONE)
    • Add redefined attributes (does HDF support this?) (DONE)
    • Add schema version attribute (may use URN)
  – pyEXPRESS Cannes issues
    • Object ID (i.e. pointers) handling code ID = Integer + string (string is pyTable name, generated from EXPRESS name) (DONE)
    • Unset values for each datatype within the file (DONE)

Issue discussion for next draft of mapping (2)

– Issues
  • Complex/partial entity instances (ANDOR) (DONE)
  • David Issue = (Multiple) Inheritance? Had something to do with select types. (DONE)
  • Defined type of array “TYPE x = aggregate of whatever” (TODO)
  • Complicated types for array values e.g. SELECT (REAL, INTEGER, ENTITY INSTANCE) (DONE)
    – We will use the same generic object identifier approach to handle these as to handle complicated SELECT types.
  • Variable length string
    – HPdK thinks that these cannot be put in an HDF Compound Datatype. Georg found where the UG seems to say this is allowed (7.1 Complex combinations of datatypes). Maybe it’s a limitation of pyTables?
    – The current mapping says to use Variable length datatypes, but it’s not clear if that’s allowed in a Compound Datatype.
    – We may have to use the general purpose object id capability and have a dataset somewhere containing varying length strings (or find another solution). It does look like you may have to specify the maximum length of the varying length strings.
    – (DEFER TO EMAIL WITH HDF)
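The two fallbacks discussed for variable length strings can be compared with a small sketch. It uses Python's struct module to stand in for a compound record and is not HDF code: option 1 pads the string to a declared maximum length inside the record, option 2 keeps a row index in the record and stores the string in a separate table. MAX_NAME, the (age, name) record layout and all function names are hypothetical.

```python
import struct

MAX_NAME = 16  # hypothetical maximum length, in bytes

# Option 1: fixed-width field inside the record (string padded to MAX_NAME bytes).
FIXED = struct.Struct("<q%ds" % MAX_NAME)


def pack_fixed(age, name):
    data = name.encode("utf-8")
    if len(data) > MAX_NAME:
        raise ValueError("name longer than MAX_NAME bytes")
    return FIXED.pack(age, data)  # struct pads short values with NUL bytes


def unpack_fixed(record):
    age, data = FIXED.unpack(record)
    # Stripping the padding also strips any trailing NULs in the original string.
    return age, data.rstrip(b"\x00").decode("utf-8")


# Option 2: the record holds a row index into a separate table of strings.
INDEXED = struct.Struct("<qq")


def pack_indexed(age, name, strings):
    strings.append(name)
    return INDEXED.pack(age, len(strings) - 1)


def unpack_indexed(record, strings):
    age, row = INDEXED.unpack(record)
    return age, strings[row]
```

Option 1 keeps every record the same size but wastes space on short strings and rejects long ones; option 2 has no length limit but costs a second lookup per string.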

Instance identifiers

• Every hdf5 link and hdf5 dataset has an hdf5 object id that is an unsigned 32/64 bit integer
  – Issue: Is there a problem with using a 64 bit integer as part of entity instance ids on a 32 bit platform (i.e. does this place a limit on file size or interoperability?)
• H-P thinks the object ids are managed inside a hash table in HDF
  – Also thinks the object id is not exposed in the hdf API everywhere that we need it
• Proposal is to use a tuple of integers that can be used for both an entity instance id and a pointer into the aggregates
  – (hdf object id, row index)
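The (hdf object id, row index) proposal can be sketched as a two-integer reference that is resolved in two steps. This is an illustration in plain Python, not HDF API usage: the dictionary keys stand in for object ids, the lists stand in for dataset rows, and the choice of two unsigned 64 bit little-endian integers for the stored form is an assumption.

```python
import struct
from typing import NamedTuple


class InstanceId(NamedTuple):
    object_id: int  # stands in for the HDF object id of the dataset
    row_index: int  # row of the instance (or aggregate value) in that dataset


# Two little-endian unsigned 64-bit integers; the width is an assumption.
_LAYOUT = struct.Struct("<QQ")


def encode(ref):
    """Serialize a reference to its fixed 16-byte form."""
    return _LAYOUT.pack(ref.object_id, ref.row_index)


def decode(data):
    """Rebuild a reference from its 16-byte form."""
    return InstanceId(*_LAYOUT.unpack(data))


def resolve(datasets, ref):
    """Follow a reference: pick the dataset by object id, then the row."""
    return datasets[ref.object_id][ref.row_index]


datasets = {
    7: [("origin",), ("corner",)],          # hypothetical dataset of entity instances
    9: [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],  # hypothetical dataset of aggregate values
}
point_ref = InstanceId(object_id=7, row_index=1)
coords_ref = InstanceId(object_id=9, row_index=0)
```

The same reference shape serves as an entity instance id (point_ref) and as a pointer into an aggregate dataset (coords_ref), which is the dual use the proposal calls for.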

Complicated Select types

• TYPE x = SELECT OF (REAL, INTEGER, LIST OF BOOLEAN, e2);

• Proposal is to have each base type in a separate HDF dataset in a separate group
  – Group for REAL, Group for INTEGER, Group for LIST OF BOOL, etc.
• It could be configurable
  – May have a single dataset for ALL integers in the file used in this way
  – May have a dataset for each attribute used in this way (similar to how the mapping for aggregate attribute values works now)
  – For cases where every entity instance that has TYPE x as its domain, you might use the simple type instead of the complicated mapping
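A rough sketch of the one-dataset-per-base-type proposal follows. It is plain Python rather than HDF: each list stands in for one dataset, and a SELECT value is stored as a (dataset name, row) pair. The class name, the dataset names and the omission of entity instances such as e2 are simplifications assumed here.

```python
from collections import defaultdict


class SelectStore:
    """Keeps one list per base type; each list stands in for one HDF dataset."""

    def __init__(self):
        self.datasets = defaultdict(list)

    @staticmethod
    def base_type(value):
        if isinstance(value, list) and all(isinstance(v, bool) for v in value):
            return "LIST_OF_BOOLEAN"
        # bool is a subclass of int in Python, so reject it before the int test.
        if isinstance(value, bool):
            raise TypeError("bare BOOLEAN is not in the select list of x")
        if isinstance(value, int):
            return "INTEGER"
        if isinstance(value, float):
            return "REAL"
        raise TypeError("unsupported select value: %r" % (value,))

    def put(self, value):
        """Append the value to the dataset for its base type; return (dataset, row)."""
        name = self.base_type(value)
        self.datasets[name].append(value)
        return (name, len(self.datasets[name]) - 1)

    def get(self, ref):
        """Look a value up again from its (dataset, row) reference."""
        name, row = ref
        return self.datasets[name][row]


store = SelectStore()
refs = [store.put(v) for v in (1.5, 42, [True, False], 7)]
```

Here all integers share one dataset, which corresponds to the "single dataset for ALL integers" option; the per-attribute option would key the datasets by attribute name as well.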

Redeclared attributes

• Redeclaration things we can address
  – Specialize the attribute domain
    • Write the encoding of the specialized value in the HDF compound type representing the subtype
  – Type is subtype of original
    • We only use the object identifier everywhere so this is no problem
  – Rename of attribute
    • Use new name in HDF compound data type for the subtype
  – Explicit to derived
    • Do not put the attribute in the HDF5 compound data type and do not store a value
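The redeclaration rules above amount to a small transformation on the list of fields inherited from the supertype. The sketch below shows one way to express it in Python; the function, its parameters and the shape/circle example are hypothetical and are not taken from the mapping document.

```python
def subtype_fields(supertype_fields, renames=None, specialized=None, derived=()):
    """Return the (name, type) fields of the compound type for a subtype.

    supertype_fields: ordered list of (name, type) inherited from the supertype
    renames:     {old_name: new_name} for renamed attributes
    specialized: {name: new_type} for attributes whose domain is specialized
    derived:     names redeclared from explicit to derived (no stored value)
    """
    renames = renames or {}
    specialized = specialized or {}
    fields = []
    for name, type_name in supertype_fields:
        if name in derived:
            continue  # explicit to derived: leave the field out entirely
        type_name = specialized.get(name, type_name)
        fields.append((renames.get(name, name), type_name))
    return fields


# Hypothetical supertype with three explicit attributes.
shape = [("label", "STRING"), ("size", "NUMBER"), ("area", "REAL")]
circle = subtype_fields(
    shape,
    renames={"size": "radius"},
    specialized={"size": "REAL"},
    derived={"area"},
)
```

In this example the subtype's compound type ends up with label unchanged, size renamed to radius and narrowed to REAL, and area dropped because it is now derived.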

ANDOR

• SCHEMA test;
  ENTITY a;
    name : STRING;
  END_ENTITY;
  ENTITY b SUBTYPE OF a;
    age : INTEGER;
    x : REAL;
  END_ENTITY;
  ENTITY c SUBTYPE OF a;
    height : REAL;
    x : BOOLEAN;
  END_ENTITY;
  END_SCHEMA;

• Results in
  test/a
  test/a/name
  test/b
  test/b/name
  test/b/age
  test/c
  test/c/height
  test/b__c
  test/b__c/name
  test/b__c/age
  test/b__c/height
  test/b__c/b__x
  test/b__c/c__x
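The naming of the b__c group can be reproduced with a short sketch. The rule used here is inferred from the example only and is an assumption: the group name joins the leaf entity names with a double underscore, and an attribute name is prefixed with its entity name only when more than one of the combined entities declares it. The SCHEMA dictionary and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical in-memory form of the example schema: entity -> (supertype, own attributes)
SCHEMA = {
    "a": (None, ["name"]),
    "b": ("a", ["age", "x"]),
    "c": ("a", ["height", "x"]),
}


def complex_paths(schema_name, leaves, schema=SCHEMA):
    """List the group path and attribute paths for a complex instance of `leaves`."""
    group = "%s/%s" % (schema_name, "__".join(leaves))
    # Collect each entity once, supertypes first, so shared supertypes are not repeated.
    entities = []
    for leaf in leaves:
        chain = []
        entity = leaf
        while entity is not None:
            chain.append(entity)
            entity = schema[entity][0]
        for entity in reversed(chain):
            if entity not in entities:
                entities.append(entity)
    counts = Counter(attr for e in entities for attr in schema[e][1])
    paths = [group]
    for entity in entities:
        for attr in schema[entity][1]:
            # Prefix only the names that more than one entity declares.
            name = attr if counts[attr] == 1 else "%s__%s" % (entity, attr)
            paths.append("%s/%s" % (group, name))
    return paths


b_c_paths = complex_paths("test", ["b", "c"])
```

For the b and c combination this yields the same six test/b__c paths as the example, although not necessarily in the same order.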

Next actions and plans for testing

• pyEXPRESS testing is based on pyTables; there is a C Tables API … Should our other testing be based on that?
• Can/should we set up another workshop with HDF Group to complete mapping?
  – DP Action to talk to Mike Folk about doing something prior to the ISO meeting in October (we remember him saying there was a workshop in DC)
• What do testers need to help get them started?
  – EXPRESS-to-C has been mentioned (if we use C Tables API that’s not useful)
  – Training?
  – Test data?
  – Schemas?
• Closing plenary slides for Friday
• NWI – Will be created and circulated via telecon before the next ISO STEP meeting.

Notes from Meeting

• Are there other sources of MetaData?
  – Are there other archiving (e.g. NARA) or LTDR standards (e.g. LOTAR)?
  – If you treat HDF as a “database” what is needed?
  – What about internal company meta-data?
  – What about Web-based standards (e.g. Dublin Core)?
  – Should we just include a generic meta-data “name-value pair” capability?
  – What about non-STEP data in the same file that the STEP data references (e.g. jpegs)?
• Where multiple mappings are still being tested, it is OK to include more than one in the specification.
  – The specification is currently a guide for prototype testers, not a draft standard.
• What are the highest priority requirements? “Performance”, but performance and efficiency of exactly what?

Notes from Meeting (2)

• We may need to add some HDF attributes to the Groups and Datasets when they are written to help readers (e.g. number of instances of an entity type that were written)
  – C Tables API uses this approach so we should look at that to see if we can learn anything for our use.
• We need to have more discussion about whether to allow or require writing inverse attribute values into the file; nothing is done there now.
  – For “read-only files” inverses could be a nice optimization.
  – Would we need to allow this to be configured? If so, how?
  – What about the “unnamed inverse” that EXPRESS says exists?
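The inverse attribute optimization for read-only files can be pictured as a one-time pass over the forward references at write time. The sketch below is plain Python with made-up data; the instance layout, the attribute name "bounds" and the function name are assumptions, and nothing here reflects a decision about the mapping.

```python
from collections import defaultdict


def build_inverses(instances, attribute):
    """Map each referenced instance id to the ids of the instances that refer to it.

    instances: {instance_id: {attribute_name: referenced id or list of referenced ids}}
    """
    inverses = defaultdict(list)
    for instance_id, attributes in instances.items():
        targets = attributes.get(attribute, [])
        if not isinstance(targets, list):
            targets = [targets]
        for target in targets:
            inverses[target].append(instance_id)
    return dict(inverses)


# Hypothetical data: two faces that reference shared edges through "bounds".
instances = {
    10: {"bounds": [1, 2]},
    11: {"bounds": [2, 3]},
}
used_by = build_inverses(instances, "bounds")
```

If the resulting table were written into the file, a reader could answer "which instances use edge 2" with one lookup instead of scanning every instance, at the cost of extra space and a consistency obligation on any writer.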

Action Items

• HPdK – Find out how to implement the object id using the HDF 5 API
• DP – Find email thread on entity instance identifiers from a year ago, it might be useful for the new proposal
• AF – Write text to describe the multi-dataset approach to Aggregate Instances, email to DP who will add to spec V0.4
• DP – Read “fixme” from meeting and fix them.
• HPdK – Put example HDF5 files on the Web somewhere for others to view. Mapping document too.
• ML – Look at what Vivace stuff can be published publicly.
• ML – Look at what can be published to the Vivace Forum 2 (unfortunately, these are same dates as Hershey).
