Sequentially accessing all XML documents in a huge database

Hello,

I need to update some specific fields of XML documents (records) in a huge database. I figured that the most straightforward way is to use XQuery programmatically, as follows:

for each "somekey" in list of "keys"
{
    update for $a in input()/root where $a/key = "somekey"
    do insert "value corresponding to 'somekey'" into $a/target
} /* of course, this is only the algorithm; I will implement it via the Java API */
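In Java, the skeleton I have in mind looks roughly like the following. The XmlDb interface is only a stand-in for whatever the actual Java driver API offers (not a real product API), and the query string is built by naive concatenation, without any escaping:

import java.util.Map;

// Hypothetical driver interface; a placeholder, not a real product API.
interface XmlDb {
    void executeUpdate(String xqueryUpdate);   // run one update statement
}

public class Algorithm1 {
    // One indexed update per key: 10K client/server round trips in total.
    static void run(XmlDb db, Map<String, String> newElementByKey) {
        for (Map.Entry<String, String> e : newElementByKey.entrySet()) {
            String key = e.getKey();
            String newElement = e.getValue();  // e.g. "<newelement>...</newelement>"
            db.executeUpdate(
                "update for $a in input()/root where $a/key = '" + key + "' "
                + "do insert " + newElement + " into $a/target");
        }
    }
}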

Here the "/root/key" element is properly indexed. But I have some 10000 distinct "keys" to match, and for each key I insert a "newelement" into "/root/target". This particular doctype "root" contains some 4.5 million documents/records.

My question is: given this scenario, instead of doing the above, would it be advantageous to access the database records sequentially (i.e. going through the whole database only once), modify each document in memory, and write it back to the database? The algorithm would be something like:

for each document "doc" in the "root" doctype
{
    currentkey ← "doc"/root/key
    insert the element value corresponding to 'currentkey' into "doc"/root/target
    write "doc" back to the database
}
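And a rough Java sketch of this second algorithm. Again the XmlDb interface is a placeholder for the real driver; the DOM handling uses only the standard javax.xml and org.w3c.dom classes:

import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical driver interface; a placeholder, not a real product API.
interface XmlDb {
    List<String[]> scanAll(String xquery);     // {documentId, xml} per document
    void store(String documentId, String xml); // write one document back
}

public class Algorithm2 {
    // One full scan: fetch every document, patch it in memory, write it back.
    static void run(XmlDb db, Map<String, String> valueByKey) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        TransformerFactory tf = TransformerFactory.newInstance();

        for (String[] rec : db.scanAll("input()/root")) {
            Document doc = builder.parse(new InputSource(new StringReader(rec[1])));
            // Assumes each document has exactly one key and one target element.
            String key = doc.getElementsByTagName("key").item(0).getTextContent();
            String value = valueByKey.get(key);
            if (value != null) {
                Element target = (Element) doc.getElementsByTagName("target").item(0);
                Element child = doc.createElement("newelement");
                child.setTextContent(value);
                target.appendChild(child);
            }
            StringWriter out = new StringWriter();
            tf.newTransformer().transform(new DOMSource(doc), new StreamResult(out));
            db.store(rec[0], out.toString());
        }
    }
}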

In other words, I seem to fail to appreciate how indexing helps in general. Does my first algorithm need 10000 × 4500000 operations, i.e. about 4.5 × 10^10 (based on the numbers I gave above)? Or does it only need 10000 operations? Or is it perhaps 10000 × log(4500000), roughly 2 × 10^5 operations with a base-2 logarithm, but still bad? How many operations does it take to fetch the right document based on an indexed value?

Could you please pour in your valuable ideas? I can give more details if you need … for now I wanted to keep it simple, hence this silly way of writing.

Best regards,
Gopal

Comparing the “number of operations” does not help much, because you also need to consider the relative cost of the participating operations. Since your “for each” iteration is performed on the client side, I’d prefer algorithm 1, simply because it needs only 10K client/server interactions instead of 4.5M. I would expect the additional request overhead of algorithm 2 to exceed, by far, the effort of accessing the index on a simple predicate. You could also run an experiment on a small scale to convince yourself.
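A bare-bones harness for such an experiment could look like the following; the two lambda bodies are placeholders for scaled-down versions of your algorithms (say, 100 keys against 45K documents):

public class SmallScaleTest {
    interface Approach { void run() throws Exception; }

    // Wall-clock time of one approach, in milliseconds.
    static long timeMillis(Approach a) throws Exception {
        long start = System.nanoTime();
        a.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("algorithm 1: "
            + timeMillis(() -> { /* per-key indexed updates */ }) + " ms");
        System.out.println("algorithm 2: "
            + timeMillis(() -> { /* full scan, patch, write back */ }) + " ms");
    }
}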

Hi,

Thanks for the reply. I would be grateful for further clarifications.

Forgive my ignorance, but I think there is more to it than just the number of operations. Algorithm 1 requires 10K scans through the DB. OK … maybe 10K lookups in the index data (plus, presumably, fetching the document corresponding to a matching index entry is a constant-time operation?).

There is also another issue: flexibility. How do I check whether a “newelement” already exists in method 1? That is, if I want to insert the element only if an identical element doesn’t already exist, how do I incorporate that into method 1? In method 2, I see how I can do that. In fact, in method 2 I can modify the document in any number of ways in Java, because I have the complete DOM tree in hand, and once I am done with everything, I can write it back.
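To make that concrete, in method 2 the existence check is just a walk over the target element’s children, something like the following (plain org.w3c.dom; it compares only tag name and text content, which may be too naive for my real data):

import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ExistsCheck {
    // True if target already has a child element with this name and text.
    static boolean alreadyContains(Element target, String tagName, String text) {
        NodeList children = target.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && tagName.equals(child.getNodeName())
                    && text.equals(child.getTextContent())) {
                return true;
            }
        }
        return false;
    }
}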

OK … now I will tell you about a small test that I did. I implemented a sub-problem of the one I described above and checked the performance. In this test I avoided update operations, and I avoided accessing the big DB with 4.5M documents. The “keys” in my question come from a database which contains 10K documents. I simply obtained these keys and the corresponding “candidate-new-elements-to-be-inserted-into-4.5M-db” (I call this the new-element-set from now on) from the 10K database in two ways: (0.1) by accessing the DB sequentially, and (0.2) by accessing it via an indexed field. (I want to reserve the names method 1 and method 2 for the two algorithms in my first message; the ‘0’ in 0.x stands for the fact that this is a preliminary test.)

In each method, I saved the results in a hash map, where a “key” and its corresponding “new-element-set” become the key and value of the hash map. That is all I did. I describe the two methods, sequential access (0.1) and index-based access (0.2), below:

(0.1) sequential access:

In this method, I executed the single query

for $a in input()/root1 return <pair>{$a/key}{$a/value}</pair>

and then iterated through the results and saved the “key”/“value-set” pairs into the hash map.

(NOTE: each document in input()/root1 contains exactly one “key” element and zero or more “value” elements)
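In Java, the gist of (0.1) was along these lines; pairResults stands for the XML strings that the driver returns for the query above, one per document:

import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SequentialLoad {
    // One sequential query, then build the key -> value-set map on the client.
    static Map<String, List<String>> load(List<String> pairResults) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Map<String, List<String>> map = new HashMap<>();
        for (String xml : pairResults) {
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // Exactly one key per document, zero or more value elements.
            String key = doc.getElementsByTagName("key").item(0).getTextContent();
            NodeList values = doc.getElementsByTagName("value");
            List<String> valueSet = new ArrayList<>();
            for (int i = 0; i < values.getLength(); i++) {
                valueSet.add(values.item(i).getTextContent());
            }
            map.put(key, valueSet);
        }
        return map;
    }
}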

(0.2) index-based access:

In this method, I first executed the query

input()/root1/key

and then, for each “key”, I executed the query

for $a in input()/root1 where $a/key = 'currentkey' return $a/value

(with 'currentkey' standing for the key of the current iteration), and finally iterated through the returned values and saved the “key”/“value-set” pairs into the hash map.
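The gist of (0.2), again with a placeholder XmlDb interface standing in for the real driver (I also assume here that query() hands back the text value of each result item):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical driver interface; a placeholder, not a real product API.
interface XmlDb {
    List<String> query(String xquery);  // text value of each result item
}

public class IndexedLoad {
    // First fetch all keys, then one indexed query per key:
    // 1 + 10K server round trips, versus a single one in (0.1).
    static Map<String, List<String>> load(XmlDb db) {
        Map<String, List<String>> map = new HashMap<>();
        for (String key : db.query("input()/root1/key")) {
            // Naive string building; real code would escape the key value.
            List<String> valueSet = db.query(
                "for $a in input()/root1 where $a/key = '" + key + "' return $a/value");
            map.put(key, valueSet);
        }
        return map;
    }
}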

And the result is: method (0.1) takes much less time than method (0.2). In an example run (which I just repeated to double-check), method (0.1) took 47 ms and method (0.2) took 829 ms.

But of course, this test didn’t include any update operations and didn’t access the real big DB, so I don’t deny that things may turn out differently when I simulate my original question. I will try that soon, but your feedback and suggestions would already be a great help now.

Best regards,
Gopal.