Huge schema size and huge amount of documents. Recommendations?

Hi there

Our team is starting a new system that is expected to work with a library standard (MARC). The expected data volume is as follows:

- about 2,000,000 documents
- the schema (MARC) is composed of around 3,000 nodes
- each document is between 5 and 10 KB, filling just 50 of those 3,000 nodes


The first tests, done on a test database with 4 indexes (we expect around 50) and 1,000,000 documents, seem to work properly for updating and retrieving, but we start to have problems when we try to define new indexes, which makes us think about the potential problems waiting for us in the near future.

We have reviewed the Tamino documentation and taken into account the general recommendations regarding size, performance, etc., but we would be grateful for any experience, recommendations, links, etc. from the management of similar projects.

Thanks in advance and best regards
Ignacio

What kind of problems do you have?



Hi Harald

We are still in the test phase, but we are seeing the following behavior:

- Trying to retrieve the schema from the database using the Tamino Schema Editor gets no response, even after waiting for hours. This is happening in three different installations.

- Although it is probably possible to do this by other means, we are trying to define two new indexes in the database (already full of records). This produces a very strange situation: the server starts to respond very slowly to all other operations, even though the system has plenty of free resources, i.e. RAM (more than 50% free) and CPU (99% idle). We can check this with the process administrator.

- We get similar behavior when we try to include a new node. The only solution is to stop the server. This behavior has been verified on two different servers.

We are now trying to avoid the problem by using less structured schemas (with Any content) and by checking the behavior above with smaller databases.

Feedback is welcome.

Best regards
Ignacio

Hi Ignacio,

We have also run into similar issues where Tamino just doesn’t seem to work well when dealing with schemas that have a large number of indexed nodes. The problem seems to get worse as the number and size of the documents increase. We have spent the last few weeks trying to get the ead to work with Tamino. We’ve even modified the ead locally to reduce the number of schema nodes, even though it is a standard, so that it might be more Tamino-friendly. However, our response times are still unacceptable. Our problems include the following:

1. Building indexes. Like you, we have found that defining new indexes in a schema with a large number of nodes is very difficult once documents are loaded. Even if you have the patience to wait for the schema editor to add an index, there is a limit on the number of indexes that can be added. For some of the more recursive elements in the ead such as title (very common search index), we have been unable to add indexes at all.

2. Some of our ead documents are over 2MB in size (max size = 5MB). Once the documents are loaded and indexes added (where possible), response times for an Anywhere search can be well over a minute, especially if your search term occurs within many (a few hundred) documents.

3. We are also dealing with another schema that contains more than 100 nodes and 250,000 documents averaging 10 KB in size. Again, this schema is very difficult to index once documents are loaded. After loading 100,000 records, it took about 25 minutes to add a text index to one element.

Modifying database configuration parameters hasn’t helped us.

We are having a lot of difficulty getting past these critical issues right now. We too would be very grateful for any recommendations that would address these issues and reduce our search response times to something more acceptable (like 2-3 seconds).

Clare.
ead.tsd (216 KB)

Hi Ignacio & Clare,

Your contributions raise a whole bunch of issues.
Let me try to answer most of them:

1. (ad Clare, 1)
Unfortunately, there is no simple solution to the maximum number of indexes
permitted for a single doctype when using Tamino 4.1. You need to check
your application in order to find out which indexes you really need.
Then you may restrict the number of indexes by using tsd:which elements
inside your schema (see the sketch below).
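
As an illustration only (the element name and the exact nesting are my
assumptions and should be checked against the Tamino schema documentation for
your version; the tsd prefix is bound to the Tamino schema definition namespace
elsewhere in the schema), an explicit index declaration on a single needed node
looks roughly like this in a TSD schema. The point is that only the nodes your
queries actually filter on carry a tsd:standard or tsd:text index, and tsd:which
can then restrict which of the paths leading to that node are indexed:

<!-- hypothetical TSD fragment: only "author" carries indexes -->
<xs:element name="author" type="xs:string">
  <xs:annotation>
    <xs:appinfo>
      <tsd:elementInfo>
        <tsd:physical>
          <tsd:native>
            <tsd:index>
              <tsd:standard/> <!-- value index for exact-match queries -->
              <tsd:text/>     <!-- text index only if full-text search is needed here -->
            </tsd:index>
          </tsd:native>
        </tsd:physical>
      </tsd:elementInfo>
    </xs:appinfo>
  </xs:annotation>
</xs:element>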

The good news is:
we have addressed that issue in several ways in Tamino 4.2:
(a) Instead of approx. 900 standard/text indexes each per doctype, it will
allow for more than 10,000 each.
(b) Indexes on a single logical node can be shared by many paths leading
to that logical node, thus reducing the number of indexes being used.
This helps with both recursive and highly linked schemas that have many paths
to the same logical node.
In addition, there will be additional types of indexes that allow better
query performance.

2. (ad Ignacio, 2) If you are experiencing a Tamino server that neither responds
nor consumes CPU resources during a schema update, you should create a support
request. You should probably provide things like
(a) a backup (if not too big)
(b) an XML request log (shut the database down with the abort option while it
is hanging) and the dump files (SAGSMP* + Tamino*) that are created then.

3. (ad Ignacio, 3) What do you mean by “include a new node”?
Adding a new optional node (see the sketch below) should be a matter of seconds,
unless the schema update is waiting for the exclusive collection lock due to
open cursors held by other applications.
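
Just to make sure we mean the same thing: by an optional node I mean an element
added to the doctype with minOccurs="0", so that existing documents remain valid
without change. A minimal sketch (the element names are placeholders, not your
schema):

<xs:element name="record">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
      <!-- newly added optional node -->
      <xs:element name="subtitle" type="xs:string" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Redefining the schema with such an addition should normally complete quickly.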

4. (ad Clare, 2) You should check the output of ino:explain() for your X-Query
request, or of {?explain?} when using XQuery, in order to check whether post-
processing is involved when executing your queries while updating the
schema (see the example below).
One possible explanation for the performance degradation could be
conflicts due to document locks while the schema update is in progress.
However, when the schema update is finished (and the indexes successfully added),
the queries should perform at least as well as before adding the indexes.
If they do not, again check the “explain” results. If that does not help,
issue a support request.
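
For example (the doctype and element names here are only placeholders for your
own), an X-Query request is wrapped in ino:explain(), and an XQuery request is
prefixed with the {?explain?} pragma:

  X-Query:  ino:explain(/bib/book[title = "XML"])

  XQuery:   {?explain?}
            for $b in input()/bib/book
            where $b/title = "XML"
            return $b/title

The explain output should show whether the query can be answered from the
indexes or whether post-processing over the documents is involved.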

Please note that even reading a few hundred documents of 2 MB each means
reading on the order of a gigabyte from disk - which may already take a while!
More selective queries and the use of cursors may help.

5. (ad Clare, 3) It does not seem exceedingly long to me when adding a text index
for 100,000 documents of 10 kB each (i.e. about 1 GB of data, roughly 0.7 MB/s)
takes 25 minutes.
This depends on the following aspects:
(a) On which node has the text index been added? - If the root node is
being newly indexed, the entire text content of each document will be
indexed.
(b) If the word fragment index is turned on (default: off), this will lead
to a longer time for adding the text index.
In any case, all documents need to be read, decompressed, and deserialized.

Best regards
Uli

Hi Uli and all

Thanks for the reply. We will take your recommendations into account.

For the moment, we are avoiding the use of the big schema with so many nodes (elements plus attributes in the XML Schema). We extract the values of the elements that will be used as indexes (this makes sense in the context we are working in) and store those values in an alternative schema.

The payload, I mean the real record that is in fact described by a very big schema, is now being stored in an element of type “Any” and is not used for searches, just for presentation (see the sketch below).
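
As a rough illustration (the field names are only examples, not our real schema),
the alternative doctype looks more or less like this: a handful of extracted
search fields that carry the indexes, plus the full MARC record kept as open
content:

<xs:element name="record">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>   <!-- extracted, indexed -->
      <xs:element name="author" type="xs:string"/>  <!-- extracted, indexed -->
      <xs:element name="year" type="xs:string"/>    <!-- extracted, indexed -->
      <xs:element name="payload">                   <!-- full MARC record, presentation only -->
        <xs:complexType>
          <xs:sequence>
            <xs:any processContents="skip" maxOccurs="unbounded"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>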

Best regards
Ignacio