large schema with text index (for an empty collection)

hello all,

(working with Tamino 4.2.1)

Until now I have had a large collection containing 272201 XML docs and a large schema (220 elements, more than 18000 lines when TSD-formatted, about 900 KB in size) but no index of any kind. Since I want to use text indexes on some of the elements I search, I would like to update my schema by adding at least one text index on the element “title”. But I am facing several problems:

  • when I try to open my schema with the Schema Editor via X-plorer, it freezes (workaround: opening it directly in the Schema Editor works most of the time)
  • via the Schema Editor I add a text index on the title element
  • then I ask for “define the schema” and wait for hours (10 at most), but nothing seems to happen and I have to stop all the processes.

Seeing that, I tried to create a new, empty collection, define the schema with the text index there (to avoid validation against many documents) and then add my documents. But it appears I can’t even define such a schema for an empty collection: the Schema Editor freezes and I don’t know where to look to see whether anything is still going on.

Hi Thomas,

then i ask for "define the schema" and wait for hours (waited 10 at max)

This indicates that your schema might be highly linked or recursive, thus
allowing for many paths to your newly indexed element/attribute.
Each possible path to that node will go as a separate index into the
physical schema.
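
To illustrate why a highly linked schema can end up with so many indexes, here is a toy sketch that counts the distinct paths from a root element to one target element. The content model below is invented for the example (only the element names are borrowed from TEI), and this is not how Tamino itself processes a schema. It also assumes an acyclic model; a truly recursive schema would need a depth limit on top of this.

```python
from functools import lru_cache

# Hypothetical, heavily simplified content model: parent -> allowed children.
# The nesting is made up for illustration and is NOT the real TEI DTD.
CONTENT_MODEL = {
    "TEI.2": ["teiHeader", "text"],
    "teiHeader": ["title", "bibl"],
    "text": ["div", "bibl"],
    "div": ["p", "bibl", "title"],
    "p": ["bibl", "title"],
    "bibl": ["title"],
}


def count_paths(root, target):
    """Count distinct element paths from root down to target (acyclic model only)."""

    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == target:
            return 1
        # Sum the paths that run through each allowed child element.
        return sum(paths_from(child) for child in CONTENT_MODEL.get(node, ()))

    return paths_from(root)


title_paths = count_paths("TEI.2", "title")
print(title_paths)  # prints 7
```

Even this six-element toy yields seven separate paths to title; with 220 elements that reference each other freely, the number grows very quickly.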

The appropriate solution is to use a multipath index. Schema processing time will then shrink dramatically. If that does not help,
please post your schema for further investigation.

In general, the time needed for schema processing depends on the presence of loaded data only if you have chosen the “instance validation” option and Tamino detects that the new schema cannot be accepted without validating each document against it.

Regards
Uli

thank you Ulrich for the quick answer,

my schema is of course highly linked and recursive, as it is based on a TEI dtd conversion.

I would like to try building a multipath index, but I have not found in the documentation what value to give the multipath setting when creating a text index on the title element:

  • is it a name (like “titleMultipathIndex” for example)?
  • is it an XPath expression selecting all the title elements I want to index (like /TEI.2/teiHeader//title to get all the titles from the headers of my documents)?

By the way, does anybody have experience dealing with a TEI schema?

Regards,
thomas

Hi Thomas,
I played around with the TEI2 for some linguistic corpus data two years ago, and found that the original DTD gave me more problems than it solved.
So I cheated :wink:
I took all the instances I could find, and then made XMLspy generate a new schema that suited these instances - and I tell you - this was a whole lot simpler than the original !!!

Anticipating your next question…
This schema was not part of my backup routine, so the work was lost when my hard disk crashed :frowning:

regards Finn

Hi Thomas,

the content of the tsd:multiPath element is just a name. Multiple multiPath indexes may even share the same name (if compatible: datatype, …)

Regards
Uli

Ulrich :
my schema with multipath text index is now accepted in less than a minute. Thank you for the useful piece of advice.
I have tried to view the generated index using the following X-Machine command (with and without the document type /TEI.2):
http://localhost/tamino/testindex2/kerncorpus-all?_admin=ino:DisplayIndex("kerncorpus-all","/TEI.2/teiHeader/fileDesc/titleStmt/title")

and the response is not what I was expecting:
starting admin command ino:DisplayIndex("kerncorpus-all", "/TEI.2/teiHeader/fileDesc/titleStmt/title")
Invalid parameter detected
no index of requested type defined for that node

Finn the Dane :
I have tried something like what you describe, but for the work I need to do I have too many docs (270.000, expandable to more than 1 million) to be able to generate something usable.

again thanks for what you’ve already told me
thomas

Hi Thomas
Note that DisplayIndex takes an IndexType parameter; you should specify "text":

_admin=ino:DisplayIndex("CollectionName", "ElementPath", "StartValue", "Size", "IndexType")
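
As a rough sketch, the full request URL for the five-parameter form above could be assembled like this. The helper name display_index_url is made up for the example, and the StartValue and Size defaults are placeholder assumptions, not values taken from the Tamino documentation; only the parameter order and the "text" index type come from this thread.

```python
from urllib.parse import quote


def display_index_url(base_url, collection, element_path,
                      start_value="", size="10", index_type="text"):
    """Build an X-Machine ino:DisplayIndex request URL (hypothetical helper).

    start_value and size are placeholder assumptions; check the Tamino
    documentation for the values your server version expects.
    """
    args = [collection, element_path, start_value, size, index_type]
    # Each argument is wrapped in straight double quotes, as in the command above.
    command = "ino:DisplayIndex(" + ",".join(f'"{a}"' for a in args) + ")"
    # Percent-encode the quotes but keep the characters of the command readable.
    return f"{base_url}?_admin={quote(command, safe=':/(),')}"


url = display_index_url(
    "http://localhost/tamino/testindex2/kerncorpus-all",
    "kerncorpus-all",
    "/TEI.2/teiHeader/fileDesc/titleStmt/title",
)
print(url)
```

The database URL, collection name and element path are the ones from Thomas's post; swap in your own.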

270.000 documents - OK a bit of a job for XMLspy !

- but is there really a significant difference in their structure ?

But anyways; if “guru” Post’s solution works for you I won’t interfere :wink:

Finn

finn the dane

thanks for the parameter hint. It works and i can see that i really have an index.

- but is there really a significant difference in their structure ?
I don’t think so, but as it’s a huge corpus encoded over the last years by a lot of people, I can’t be sure that, for example, 1000 random documents contain all the elements and attributes I will need to describe all of the others. The only thing I’m sure of is conformance to the TEI guidelines, so I preferred to take the large TEI DTD and try to do something with it rather than reinvent a new one that I would have had to change every now and then.

But anyways; if “guru” Post’s solution works for you I won’t interfere :smiley: