FIND ... SORTED BY

Very often I read in programming guides: FIND … SORTED BY is forbidden!

Also in the thread READing a Range by Multi Valued Descriptor Field Steve Robinson(!) wrote:

What I am astonished of is: “(but only for a small number of records)”. This is not true. The correct use of SORTED BY does not depend on the number of records to be sorted but on the selectivity of the descriptor the records should be sorted by.

Here is an example (extreme to show the result):

Suppose you have a unique descriptor Order-Num and you have 60 Million records in your file. The highest Order-Num is 60000000. If you do a

FIND file WITH ORDER-NUM = 60000000 SORTED BY ORDER-NUM

you will get the answer in minutes (not milliseconds!). What’s the reason for this?

The resulting set is created in milliseconds.
Then Adabas has to sort the one element set. It does this by climbing the inverted list and comparing each ISN to the result set. If it finds a ISN in the result set, it will move it into the sorted result set until the unsorted result set is empty. In the case above Adabas reads all descriptor values to compare them. And this will consume the huge time.

The higher the selectivity of a descriptor, i.e. the lower the number of ISNs per descriptor is, the longer the sort may take.

On the other hand it is very simple for Adabas to sort a huge set by a descriptor with a low selectivity (many ISNs per value). Remember that the ISNs in a descriptor are sorted by ISN and the ISNs in an unsorted resulting set are sorted by ISN, too!

So keep in mind that the efficiency of SORTED BY first depends on the selectivity of the descriptor to be sorted by and second on the number of records in the resulting set.

Steve, I am expecting your reply 8)

The example I always use for this is actually a slight modification of a real application.

I have a file, 2 million records. Of these, 20,000 records have COUNTRY (which is a descriptor) = CANADA

I need to process the single record for the oldest person from CANADA. There is a field, called DATE-OF-BIRTH, which is also a descriptor, that determines age.

Without using SORTED BY, e.g. FIND or READ, there is no way to do this without reading all 20,000 CANADA records. Without MULTIFETCH, this will also require 20,000 ADABAS calls.

Using SORTED BY, I call ADABAS once, and read ONE record. Yes, there is the expense of the SORTED BY; this will be small compared to reading all 20,000 records.

SORTED BY is most valuable when it avoids reading records.

steve

[quote="Wilfried B

As I understood this, it meant the small number of records in the resulting set, not the number of records in the database.

If you check old notes from an Adabas internals class (I did even better, I contacted a colleague who knows the internals inside/out) you will discover there are two algorithms that Adabas uses to sort an ISN list from an S2 (FIND SORTED BY) command. I do not remember the precise criteria Adabas uses, but basically, if memory serves (something it does not always do) Adabas compares the number of ISNs with the number of blocks in the inverted table for the SORTED BY field. If the former is lower, Adabas reads the records rather than read all the inverted tables.

Thus, in your example Wilfried, Adabas should simply read the one record, not all the inverted tables.

steve

I’ve made the described experience with SORTED BY some years ago (on UNIX) and told SAG the inacceptable behaviour. I made the suggestion to change the algorithm as you described. I got the answer: it works as designed and will not be changed.
Fine, that it seems to be changed now!:oops:

We experienced the two different ways how ADABAS is sorting the resulting list mentioned by steve. We had a non-descriptor field in ORDER BY-clause and it worked fine. Then, the field was changed to be a descriptor and we got poor performance.

It seemed that ADABAS read all the records of the resulting ISN set before. After turning the field to a descriptor sorting was done using the complete inverted table.
And reading about 50 records was much faster than reading an inverted list containing millions of records…

Bye
/olker