Badly performing query

Hi.

I have a query where I - among other things - try to detect duplicates.

The dublicates have identical “AnmeldelseIdentifikator” but different tracsaction timestamps in “TransaktionsTid”.

In each group of documents with same “AnmeldelseIdentifikator” i want to process the first and move other duplicates to another collection.

The query works with a few documents, but with 1000 documents in the input collection fails because it takes too long to complete.

It uses a function “tafe:getAnmeldelseIdentifikator()” for which I have created a function-index.
It doesn’t seem to use that index, though.

The “Kategoriser” query is somewhat complex, so here’s a small part of it, which performs just as bad:


declare namespace wsa      = "http://www.w3.org/2005/08/addressing";
declare namespace xsi      = "http://www.w3.org/2001/XMLSchema-instance";

declare namespace cvr      = "http://rep.oio.dk/cvr.dk/xml/schemas/2005/03/22/";
declare namespace kms       = "http://rep.oio.dk/kms.dk/xml/schemas/2005/03/11/";
declare namespace cpr      = "http://rep.oio.dk/cpr.dk/xml/schemas/core/2005/03/18/";
declare namespace xkom     = "http://rep.oio.dk/xkom.dk/xml/schemas/2005/03/15/";

declare namespace etl      = "http://rep.oio.dk/tinglysning.dk/svarservice/message/integration/1/";
declare namespace model    = "http://rep.oio.dk/tinglysning.dk/schema/model/1/";
declare namespace snapshot = "http://rep.oio.dk/tinglysning.dk/schema/snapshot/1/";
declare namespace msgext   = "http://rep.oio.dk/tinglysning.dk/svarservice/message/anmeldelse/1/";
declare namespace akt      = "http://rep.oio.dk/tinglysning.dk/schema/elektroniskakt/1/";
declare namespace msg      = "http://rep.oio.dk/tinglysning.dk/service/message/tingbog/1/";
declare namespace msgadm   = "http://rep.oio.dk/tinglysning.dk/service/message/administration/1/";
declare namespace anm      = "http://rep.oio.dk/tinglysning.dk/schema/anmeldelse/1/";

declare namespace fase2_1  = "http://skat.dk/begrebsmodel/2009/01/15/";
declare namespace ho       = "http://skat.dk/begrebsmodel/xml/schemas/kontekst/2007/05/31/";

declare namespace aid="http://skat.dk/tafe/anmeldelseid/xml/schemas/2009/06/01/"
declare namespace cfg = "http://skat.dk/tafe/config/xml/schemas/2009/10/13/";
declare namespace log = "http://skat.dk/tafe/log/xml/schemas/2009/06/01/";
declare namespace tf = "http://namespaces.softwareag.com/tamino/TaminoFunction";

import module namespace tafe = "http://skat.dk/tafe";

declare function local:isDublet($doc as node()) as xs:boolean {
  exists(collection('Staging2')/fase2_1:TinglysningAnmeldelseKopiSvarModtag_I
    [tafe:getAnmeldelseIdentifikator(.) = tafe:getAnmeldelseIdentifikator($doc)
    and .//fase2_1:Kontekst/ho:HovedOplysninger/ho:TransaktionsTid
    < $doc//fase2_1:Kontekst/ho:HovedOplysninger/ho:TransaktionsTid]
  )
}

for $doc in input()/*
	where local:isDublet($doc)
	return $doc

I’d very much appreciate ideas on how too get this to use the index or otherwise improve performance.

Best regards
Claus Nielsen
[/code]
queries.zip (6.7 KB)
xsd.zip (19.9 KB)

Hi Claus,
It looks as if something in your XQUERY triggers not only one but TWO full “tablescans”.

  • and I guess only one is expected :wink:

I tried to move the “isDublet” functionality into the main loop and it appears that the “exists()” function is the culprit!

So perhaps somekind of “if” statement could be used instead ?

  • it is pretty hard to test without example-data …
    Finn

Hi Claus,
as Finn already said it is difficult to assess without test data. It might be the "tafe:getAnmeldelseIdentifikator(.) = tafe:getAnmeldelseIdentifikator($doc) " predicate cannot be evaluated index-based and this leads to many more doctype scans at runtime.

May be you can try the following in the getDublet() function:

let $x := tafe:getAnmeldelseIdentifikator($doc)
let $y := $doc//fase2_1:Kontekst/ho:HovedOplysninger/ho:TransaktionsTid
return (
exists(
collection(‘Staging2’)/fase2_1:TinglysningAnmeldelseKopiSvarModtag_I
[tafe:getAnmeldelseIdentifikator(.) = $x
and .//fase2_1:Kontekst/ho:HovedOplysninger/ho:TransaktionsTid < $y]
) )

and avoid as many // operators as possible.

Best Regards,

Michael