Performance READ + FIND NUMBER vs. READ + IF ARRAY(*)

OK, Copycode is ready.

* XXFIND1L
*
* Local for XXFIND1C (See Docu there)
*
DEFINE DATA LOCAL
1 #XXFIND1-FOUND (L)  /* TRUE when the search value was found
1 #XXFIND1-IDX   (I4) /* index of the matching occurrence (when found)
1 #XXFIND1-SC    (I4) /* current midpoint of the binary search
1 #XXFIND1-S1    (I4) /* lower bound of the search interval
1 #XXFIND1-S2    (I4) /* upper bound of the search interval
END-DEFINE
* XXFIND1C
*
* Searching in sorted arrays
* (like XXFIND1N, but with better performance)
*
* usage:
* LOCAL USING XXFIND1L
* ...
* INCLUDE XXFIND1C '#searchval' '#array' '#idx'
* IF #IDX > 0 /* found
* ...
RESET &3&
*
RESET #XXFIND1-FOUND #XXFIND1-IDX
*
IF *OCC(&2&) = 0
  #XXFIND1-FOUND := FALSE
  #XXFIND1-IDX   := 0
ELSE
  #XXFIND1-S1 := 1
  #XXFIND1-S2 := *OCC(&2&)
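* binary search: narrow the interval [#XXFIND1-S1 ; #XXFIND1-S2]
* until the value is found or the interval cannot shrink any further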
  REPEAT
    IF #XXFIND1-S1 + 1 >= #XXFIND1-S2
      #XXFIND1-SC := #XXFIND1-S1
    ELSE
      #XXFIND1-SC := #XXFIND1-S1 + (#XXFIND1-S2 - #XXFIND1-S1 + 1) / 2
    END-IF
    DECIDE FOR FIRST CONDITION
      WHEN &2&(#XXFIND1-SC) = &1&
        #XXFIND1-IDX   := #XXFIND1-SC
        #XXFIND1-FOUND := TRUE
        ESCAPE BOTTOM
      WHEN &2&(#XXFIND1-SC) < &1&
        IF #XXFIND1-SC = #XXFIND1-S2 OR = *OCC(&2&)
          #XXFIND1-IDX := #XXFIND1-SC
          #XXFIND1-FOUND := FALSE
          ESCAPE BOTTOM
        END-IF
        IF #XXFIND1-SC + 1 < #XXFIND1-S2
          #XXFIND1-S1 := #XXFIND1-SC + 1
        ELSE
          #XXFIND1-S1 := #XXFIND1-S2
        END-IF
      WHEN &2&(#XXFIND1-SC) > &1&
        IF #XXFIND1-SC = #XXFIND1-S1 OR = 1
          #XXFIND1-IDX := #XXFIND1-SC - 1
          #XXFIND1-FOUND := FALSE
          ESCAPE BOTTOM
        END-IF
        IF #XXFIND1-SC - 1 > #XXFIND1-S1
          #XXFIND1-S2 := #XXFIND1-SC - 1
        ELSE
          #XXFIND1-S2 := #XXFIND1-S1
        END-IF
      WHEN NONE 
        IGNORE
    END-DECIDE
  END-REPEAT
END-IF
*
IF #XXFIND1-FOUND
  &3& := #XXFIND1-IDX
END-IF

First test seems to be good…

Copycode 1811 #FOUND: 20
search array 1420 #FOUND: 20
find number 2440 #FOUND: 20

My prediction is that at size 500 the copycode should be the fastest approach, but maybe at 10,000 and more FIND NUMBER should be faster.

Meanwhile I did some tests with other sizes. A rule of thumb (for my system) is below (rough sketches of the three variants follow the list):
→ from 0 to 200 occurrences: use IF #ARRAY(*) =
→ from 200 to 2,000: use the copycode with the binary search
→ from 2,000 upward: use the Adabas descriptor (FIND NUMBER)
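
To make clear what I'm comparing, here is a rough sketch of the three variants (untested as posted; the FILE-B view, the A10 format and the field names are just assumptions for the example):

DEFINE DATA
LOCAL USING XXFIND1L
LOCAL
1 FILE-B VIEW OF FILE-B
  2 ITEM-NO (A10)       /* assumed descriptor
1 #ARRAY (A10/1:500)    /* sorted lookup values
1 #SEARCHVAL (A10)
1 #IDX (I4)
1 #FOUND (I4)
END-DEFINE
*
* variant 1: linear scan - fine for small arrays
IF #ARRAY(*) = #SEARCHVAL
  ADD 1 TO #FOUND
END-IF
*
* variant 2: binary search via the copycode - medium-sized arrays
INCLUDE XXFIND1C '#SEARCHVAL' '#ARRAY' '#IDX'
IF #IDX > 0
  ADD 1 TO #FOUND
END-IF
*
* variant 3: Adabas descriptor - large files
FIND NUMBER FILE-B WITH ITEM-NO = #SEARCHVAL
IF *NUMBER > 0
  ADD 1 TO #FOUND
END-IF
END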

Of course you can drop some commands from the copycode (e.g. filling #idx directly instead of going via #found), but this is not a game changer.

Copycode 1811 #FOUND: 20
search array 1420 #FOUND: 20
find number 2440 #FOUND: 20

Copycode 1954 #FOUND: 100
search array 1616 #FOUND: 100
find number 2474 #FOUND: 100

Copycode 2085 #FOUND: 500
search array 2532 #FOUND: 500
find number 2509 #FOUND: 500

Copycode 2228 #FOUND: 1000
search array 3532 #FOUND: 1000
find number 2381 #FOUND: 1000

Copycode 2276 #FOUND: 2000
search array 5848 #FOUND: 2000
find number 2464 #FOUND: 2000

Copycode 2460 #FOUND: 5000
search array 12426 #FOUND: 5000
find number 2403 #FOUND: 5000

Now try to load everything into 2 arrays, and search.
I would choose to use the FIND command. It is the best choice.

Sorry, but it doesn't make sense to read file A into an array. I pass every record of file A only once, so reading it into an array won't save any database I/Os; it will only cost additional time to build the array.

Just for the test, don't use it.

Matthias,

Just for a data point, I’m wondering what kind of machine you are testing on? Mainframe, LUW server or PC? How many other users/jobs are active? If Mainframe, is Natural optimizer involved? What Adabas caching is in use, indexes, table files? Do you only care about optimizing wall clock time, or does CPU time matter too?

I guess Steve Robinson is enjoying his retirement; I haven't seen him post in over a year. But he always recommended doing large table searches using EXAMINE #TABLE(*) FOR #STRING … GIVING POSITION. This requires redefining an (A1/10000), or whatever size, over the original array. And, even uglier, you need to add an extra character between each array element and populate it with some unused character, to avoid matching a value that is split between 2 occurrences. But I think he was only focusing on saving CPU time. I would only use this method if I were charged a lot for CPU minutes.

Here is a post from him w/example: https://tech.forums.softwareag.com/t/searching-and-updating-array/243555/4

  1. Hezi, accessing a 20-record file (FIND) a million times (parent loop) in a single program is bad design. Those million FINDs impact system throughput and every Adabas user. Even for a 20,000-record file you would be executing, on average, 50 Adabas commands for each of its records. Still not acceptable.

  2. Calling an external routine (subprogram, function, external subroutine) incurs lots of overhead. This increases with more parameters, including array entries.

  3. A colleague, George Vidacovich, demonstrated the speed of binary searches (your copycode) a decade ago and I have used them several times since. They are great.

  4. Compute #OCC = *OCC(#ARRAY) only once, rather than within the REPEAT.

  5. Because both parent and child files are sorted by the same value, replace the binary search with sequential match logic. At the beginning of the program, write the child file to a WORK file. That way you will access each record (parent and child) only once. And no array processing is needed. You will find a sample program attached as a zip file in the link, above.

George Cooper,

I'm on a (quite outdated) Solaris test machine. And I'm almost alone at the moment - that means Adabas is handling fewer than 100 commands per second and the CPU is 95% idle.

The Adabas buffer pool is 1 GB while the database is 80 GB, and the pool hit rate is 99.9%.

Ralph Zbrog,

Thanks for your reply. #4 and #5 are good points. I'll try them out next week.

I don't get the point of your work file suggestion, because it should be less time consuming to do a sequential match of the Adabas file against an array.

Here is one more option to try. Read FILE-B and compress the FOREIGN-KEY values into a text field with delimiters. Then, when reading FILE-A, use EXAMINE #TEXT FOR FOREIGN-KEY GIVING #NMB1. My understanding is that using EXAMINE on a text string is a significant improvement over array logic.

ZZMATCH, my sample program, extracts the child file from Adabas to a WORK file. You could adapt it to load the data into an array, but the WORK file technique handles unlimited file sizes and eliminates any internal CPU for indexing. WORK file processing is not the fastest thing that Natural does, but I would be surprised to find a substantial reduction in elapsed time from using an array. Let us know your results.
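
For anyone following along, the general shape of the match logic looks roughly like this (only a rough, untested sketch, not ZZMATCH itself; the view, field formats and work file number are placeholders):

DEFINE DATA LOCAL
1 FILE-A VIEW OF FILE-A
  2 ITEM-NO (A10)            /* assumed descriptor format
1 #CHILD-KEY (A10)
1 #EOF-CHILD (L)
1 #FOUND (I4)
END-DEFINE
*
* the child keys were extracted beforehand, sorted, e.g.:
*   READ FILE-B BY ITEM-NO
*     WRITE WORK FILE 1 ITEM-NO
*   END-READ
*
PERFORM GET-NEXT-CHILD
READ FILE-A BY ITEM-NO
  REPEAT UNTIL #EOF-CHILD OR #CHILD-KEY >= ITEM-NO
    PERFORM GET-NEXT-CHILD   /* skip child keys below the current parent key
  END-REPEAT
  IF NOT #EOF-CHILD AND #CHILD-KEY = ITEM-NO
    ADD 1 TO #FOUND          /* match found - do the additional stuff here
  END-IF
END-READ
WRITE 'matches:' #FOUND
*
DEFINE SUBROUTINE GET-NEXT-CHILD
  READ WORK FILE 1 ONCE #CHILD-KEY
    AT END OF FILE
      #EOF-CHILD := TRUE
    END-ENDFILE
END-SUBROUTINE
END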

As I see the results, performing a FIND is the best choice; you don't have to consider how many records are in the array.
The result is always the same.

Hi George;

Five years ago, I hit 75. I retired. My wife and I went on quite a number of vacations (mostly hiking). Then came Covid. We did local hiking trips, thus avoiding restaurants and hotels.
That left time to follow Natural on the Tech Forum and on SAG-L. Somehow, I did not follow this thread completely. So, to clarify something: the last time I did comparisons, using a large alpha field for an EXAMINE always beat any technique that involved working with arrays. The reason is quite simple: arrays require work; a single alpha field does not have all that overhead. I will look later; somewhere on the Forums are copies of my Inside Natural. I think there is an index which will simplify finding the articles that involve searching arrays. The extra character to delineate array members is not at all ugly. You simply do a REDEFINE (as you said); no additional coding is required. Yes, I was concerned with saving CPU time.

Now to the posting involving the two files, A and B. Jerome Blanc basically posted what you would find in the Inside Natural articles.

I do not seem to have the entire postings. I will have to look again for them.

Dion Kejlberg pointed out that if the File-A records can be read in physical sequence, that would be faster than reading them in logical sequence. This could be very important for performance.
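
Just to spell out the two READ forms (the view definition is omitted and FILE-A / ITEM-NO are placeholder names); note that the sequential-match approach discussed above needs the logical sequence, because it relies on the records arriving in key order:

READ FILE-A IN PHYSICAL SEQUENCE   /* fastest way to pass every record once
  IGNORE
END-READ
*
READ FILE-A BY ITEM-NO             /* logical sequence - sorted by the descriptor
  IGNORE
END-READ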

Actually, the File-A activity is probably the main “cost” of the entire process. As long as you avoid things like FIND NUMBERs on FILE-B, you should be fine. Even though there would be differences between the various search techniques, the savings will probably be minuscule compared with reading the million records.

steve robinson (retired)

When I started reading this thread I fondly remembered going to MIT for Jim Wisdom's annual Natural Conference with you, Dennis Hamilton, Kelly Jones, Jim, and a host of others, all discussing nerdy stuff like READ vs FIND. Nice to see that this sliver of the corner of life hasn't changed. Hope you are well in retirement.

Steve

@Matthias1: Can you please share the file_a view fields used for “do some stuff here” and the (additional?) view fields used for “do some additional stuff here”?

…and does the “stuff” also update the file_a records?

Hi Steve,

Glad to hear your retirement is going well, and that you were able to continue hiking during the Covid shutdowns. The hiking will keep you young, much more so than keeping up with SAG-L and this forum!

Sorry, maybe calling the redefine “ugly” was a bit over the top. I just find it a bit messy-looking in my code when I use it. But whenever I have needed to save CPU time on a large table scan, I have used it, ever since I read your tests and statistics examples. Thank you for all the tips and tricks you have shared with us over the years.

Maybe this thread could be used to justify a request to the Natural developers to add a binary table search or a hash table search option to the EXAMINE command. COBOL has always had a binary table search built in, and it was always useful, especially when CPUs were so much slower. I wonder if the developers just didn't want to make Natural look too much like COBOL? Or maybe they didn't see why anybody would bother reading an Adabas table into a Natural array and doing lookups there, since Adabas was so fast and efficient?

Anyway, I'll be interested in seeing how Ralph's sequential file match compares time-wise, if Matthias has time to test it. I'm sure it would use the least CPU, and probably the least wall time too. It is good to use if you have a canned routine ready to plug in, but I always hated writing them from scratch and testing them on old COBOL files back in the day. I'm sure a lot of mainframe CPU upgrades could have been avoided over the years if more sequential matches had been implemented for nightly, weekly and monthly batch runs. I've seen so many batch jobs that reread the same data for reports that the previous job had just read. But we got lazy, as it was so much easier to just read the database again, and less error prone. It also seemed like a step back in time to managing sequential master files, which was why we went to DBMS systems in the first place.

Cheers to all, and thanks to Matthias for the interesting thread,
George Cooper (semi-retired and fighting it :wink:)

the additional file_a fields are (with no updates):
N2
A1
A2
A4
A15 (Desc)
P10
A15 (Desc)

Today I tried Ralph Zbrog's points #4 and #5 and the array redefine recommended by Jerome and others.

→ #occ instead of *OCC doesn't really help.
→ since file_a is read in sorted order, Ralph Zbrog's #5 is a good idea. See the code snippet below.
→ searching in a string instead of in the array doesn't really help either. But to be honest: I didn't do a redefinition; I compressed the array into a string beforehand. I think this doesn't matter. :wink:

* use search array with start pos
*
reset #found
#timx := *TIMX
#idx-start := 1
read (#reads) file_a by item_no
  for #idx = #idx-start to *OCC(#array(*)) /* resume where the previous record's scan stopped
    decide for first condition
      when #array(#idx) > item_no escape bottom
      when #array(#idx) < item_no #idx-start := #idx + 1
      when #array(#idx) = item_no #idx-start := #idx + 1
        add 1 to #found escape bottom
      when none terminate 1
    end-decide
  end-for
end-read
#timx := *TIMX - #TIMX
write 'array + pos ' #TIMX_P13 '=' #found
*
* emulate array with redefine
*
reset #found
compress '|' #array(*) '|' into #adyn with all delimiters '|'
#timx := *TIMX
read (#reads) file_a by item_no
  compress '|' item_no '|' into #item_no_delimited leaving no /* delimiters prevent matching part of a neighbouring value
  examine #adyn for #item_no_delimited giving position #idx
  if #idx ne 0
    add 1 to #found
  end-if
end-read
#timx := *TIMX - #TIMX
write 'array redef ' #TIMX_P13 '=' #found

BTW: I feel my machine is a bit slower today…

copycode with #occ 2202 #FOUND: 100
array + pos 1608 #FOUND: 100
array redef 1957 #FOUND: 100

copycode with #occ 2451 #FOUND: 500
array + pos 1674 #FOUND: 500
array redef 3058 #FOUND: 500

I am pleased to see that the sequential match logic fared well, but the results may be a bit skewed. You have three loops to read file A, one for each scenario. The first loop may have a longer run time because Adabas needs to retrieve data blocks from disk. The subsequent loops may be faster because some (or all) blocks are already loaded into the buffer pool. Rerun with the “array + pos” or “array redef” test first in the program to see if it changes the results.

Shouldn’t
when #array(#idx) = item_no #idx-start := #idx + 1
be
when #array(#idx) = item_no #idx-start := #idx
Aren’t you expecting multiple records to match each array entry?