Problem with big documents

Hello All



My problem is that I am getting very poor performance due to the amount of data.
I have lots of nodes which are connected to each other simply by pointing to an id. That looks roughly like this:

<enode name="F" id="1">
  <forward pointer="2"/>
</enode>
<enode name="G" id="2"/>

(There is some more information in it, but in principle that's all that's inside.)

Now I have been testing with 100,000 nodes in one single document (a little more than 8 MB) with simple queries (such as getting nodes by name or reference), and the performance is way below anything acceptable. I have defined indexes on the name, id and pointer attributes, but without any significant change. I am running on a P2-266 with 256 MB RAM, on Windows 2000 over the LAN, StarterKit 4.1.4 (for production there will be a Solaris server, which has yet to come). I know it's below the minimum spec, but the computer isn't doing anything else. Is it possible that performance will change dramatically on a better machine?

I have been thinking about splitting the document into subdocuments with e.g. 1000 or 100 nodes each to improve performance (apparently Tamino doesn't like big documents), but then I can't do XQuery queries "over the edge" of two documents.
This for example won’t give me any result if the nodes are in different documents:

for $a in input()
let $b := $a/enode
let $c := $b[@name="F"]
let $d := $b[@name="G" and @id=$c/forward/@pointer]
return
  <result>
    {$c}
    {$d}
  </result>

(BTW, why do I have to put a constructor expression in the query when I want to return more than one variable? See the last lines of the code above.)

Does that make any sense to split the documents? As I am expecting to have data sets of about a million nodes, what is the way to go? I would like to stay with Tamino because of its native XML support, but the performance would have to increase.
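A minimal sketch of that splitting idea, in plain Python outside Tamino: cut one big document of enode elements into one small document per node. The element and attribute names follow the queries in this thread; the in-memory string handling is just an illustrative assumption.

```python
# Split one big document of <enode> elements into one standalone
# document per node (names follow the thread; data values are made up).
import xml.etree.ElementTree as ET

big = ET.fromstring(
    '<nodes>'
    '<enode name="F" id="1"><forward pointer="2"/></enode>'
    '<enode name="G" id="2"/>'
    '</nodes>'
)

# Each enode becomes its own small document; with Tamino these would
# be stored as separate instances of one doctype.
docs = [ET.tostring(e, encoding="unicode") for e in big.findall("enode")]

print(len(docs))   # 2
print(docs[0])
```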
The queries which have to be run on the data will concern getting references and search by names (also with wildcards).
For example searching for a A-B-C pattern.
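The A-B-C pattern search amounts to repeated pointer-following. A sketch in plain Python, nothing Tamino-specific: `find_chains` and the node layout are hypothetical names mirroring the enode/@name/@id/forward/@pointer structure from this thread.

```python
# Follow forward pointers through a sequence of names (A -> B -> C).
nodes = {
    "1": {"name": "A", "forward": ["2"]},
    "2": {"name": "B", "forward": ["3"]},
    "3": {"name": "C", "forward": []},
}

def find_chains(nodes, pattern):
    """Return all id chains whose node names match the pattern in order."""
    chains = [[nid] for nid, n in nodes.items() if n["name"] == pattern[0]]
    for want in pattern[1:]:
        chains = [
            chain + [nxt]
            for chain in chains
            for nxt in nodes[chain[-1]]["forward"]
            if nodes[nxt]["name"] == want
        ]
    return chains

print(find_chains(nodes, ["A", "B", "C"]))  # [['1', '2', '3']]
```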

Thank you in advance for your help!

Regards

Mic

Mic,

first to your XQuery:
If I am right, you are looking for enodes with name F and G, where F is pointing to G.

If you store enode-documents instead of sets of enodes in one document, the XQuery should look something like:

for $a in input()
let $c := $a[@name="F"]
let $d := $a[@name="G" and @id=$c/forward/@pointer]
return
  <result>
    {$c}
    {$d}
  </result>


So you are just starting one level lower; $a substitutes $b.

Regarding the performance:

- You should expect improvements from storing smaller documents. I would suggest storing enode-documents.
Why?
Tamino works like a DBMS: it is very good at picking a certain instance out of a huge number of instances, and this is where the indices help. If you have to do a lot of work on one specific instance, that is expensive, as you encountered, and the indices do not help much there. The indices speed up finding the instance, not working on the instance itself.

So, as a rule of thumb for performance with Tamino:
- many documents in one doctype
- documents size not too big
- try to avoid "grouping" or sets of similar subdocuments in one bigger document - better to break it into smaller ones. Let XQuery do the work to put it back together.
- use indices on search fields (again: fast to find a specific instance among similar ones - will not speed up work within a single, large instance)
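To make that last point concrete, here is a toy model in plain Python (not Tamino internals) of what an index on a search field buys you: a direct map from value to matching documents instead of a scan over all of them.

```python
# Toy "index" on a name attribute: value -> list of matching documents.
docs = [{"id": str(i), "name": "F" if i % 2 == 0 else "G"} for i in range(6)]

name_index = {}
for d in docs:
    name_index.setdefault(d["name"], []).append(d)

# Indexed lookup: one dictionary access instead of scanning every doc.
hits = name_index.get("F", [])
print([d["id"] for d in hits])  # ['0', '2', '4']

# Note: the index only *finds* documents; it does not speed up the
# processing that happens inside one large document.
```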

If you have further questions, you could also post your schema, example docs and queries to the list (zipped); I am sure you will get "tailored" help!


regards,

Timm

Hello Timm


Thank you very much for your answer!
The problems with splitting my file into files with only one node each are:

1.) How many files can Tamino manage? The number of nodes might go up to one million.
2.) Isn't there a lot of overhead when every node is stored as its own document?
3.) Can queries still run across all these documents?

And here is the attached file I mentioned before :-)

Mic

My Mac browser wouldn't attach the file earlier - here it is:
xml-files.zip (1.41 KB)

Mic,

one comment: I was not suggesting using documents with only one node. I was suggesting storing the large set of nodes as many single documents instead of keeping them all in one document.

1) The limit is 2 billion docs per doctype (to avoid confusion: the US billion is the German Milliarde). That should be OK for you.

2) should be no overhead

3) Queries do work across the instances. In fact, a query always runs against all documents of a certain doctype.

I will have a look at your docs, but I do not know when.

regards,

Timm

Mic,
I looked at your data; see the attached docs.

What I did:
- changed your schema: now knoten is the doctype. I also added an index on @pointer and changed the data accordingly.

So the documents look like

<knoten id="1" name="A">
  <forward pointer="2"/>
</knoten>

<knoten id="2" name="B"/>
With the query

for $j in input()/knoten,
    $k in input()/knoten
where $j/@name = "A" and $j/forward/@pointer = $k/@id
return
  <result>
    {$j}
    {$k}
  </result>
I get the "knoten" with name "A" and the "knoten" it points to - in our case the "B" knoten:

<result>
  <knoten id="1" name="A">
    <forward pointer="2"/>
  </knoten>
  <knoten id="2" name="B"/>
</result>
Another example:
The Query:

for $j in input()/knoten,
    $k in input()/knoten
where $j/forward/@pointer = $k/@id
return
  <result>
    {$j}
    {$k}
  </result>
gives you a list of all "knoten" and their children:

(result list omitted here: one result element per knoten together with the knoten it points to - see the attached file)
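The join both queries perform can be sketched in plain Python: build a lookup from @id to knoten, then pair each knoten with the one its forward pointer references. All data values here are illustrative.

```python
# Simulate the pointer join: build an @id lookup table, then pair
# each knoten with the knoten its forward pointer references.
knoten = [
    {"id": "1", "name": "A", "pointer": "2"},
    {"id": "2", "name": "B", "pointer": None},
]

by_id = {k["id"]: k for k in knoten}

pairs = [
    (k["name"], by_id[k["pointer"]]["name"])
    for k in knoten
    if k["pointer"] in by_id   # skip knoten without a valid pointer
]
print(pairs)  # [('A', 'B')]
```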

hope that helps,

Timm
ChangedXml-files.zip (1.23 KB)