Performance READ + FIND NUMBER vs. READ + IF ARRAY(*)

Hi *,

I’ve got an interesting performance question.
There is file A with over 1 million records to process, and there is a file B with ~20 records. I have to process all records of file A, but I have to check whether there is a matching record in file B, in which case some additional work has to be done. In Natural you would code:

READ FILE_A BY PRIMARY-KEY
  /* do some stuff here
  FN-B. FIND NUMBER FILE_B WITH FOREIGN-KEY = FILE_A.PRIMARY-KEY
  IF *NUMBER(FN-B.) NE 0
    /* do some additional stuff here
  END-IF
END-READ

So far so good. In my case I feel that this code performs badly, because it issues 1 million FINDs. So I tried the following:

READ FILE_B BY FOREIGN-KEY
  ADD 1 TO #I4
  EXPAND ARRAY #ARRAY TO (1:#I4)
  #ARRAY(#I4) := FOREIGN-KEY
END-READ
*
READ FILE_A BY PRIMARY-KEY
  /* do some stuff here
  IF PRIMARY-KEY = #ARRAY(*)
    /* do some additional stuff here
  END-IF
END-READ

Here are the timings (1486 means 148.6 seconds):

search array 1486 #FOUND: 20
find number 2527 #FOUND: 20

→ Searching an array (of size 20) is faster than FIND NUMBER.

But now I want to go the next step.

Question #1 is: At which array size does the array technique become slower than FIND NUMBER?
→ I wrote a test program to try out some sizes. The break-even point seems to be at ~500. Of course this could depend on the hardware etc.

search array 2503 #FOUND: 500
find number 2513 #FOUND: 500

search array 3547 #FOUND: 1000
find number 2443 #FOUND: 1000

Question #2 is: For huge arrays there must be a better way to search than IF PRIMARY-KEY = #ARRAY(*). Which way is the best? I already have an idea, but what do you think?

Regards

Matthias

BTW here’s my test code.

define data local
1 #options
2 #reads (I4) const <1000000>
2 #finds (I4) const <500>
*
1 file_a view file_a
2 ITEM_NO  /* (A15) unique descriptor
*
1 file_b view file_b
2 key1       /* (A20) foreign key 
2 key2       /* (A20) key for delete
*
1 #TIMX (T)
1 REDEFINE #TIMX
2 #TIMX_P13 (P13)
1 #array (A/1:*) DYNAMIC
1 #I4 (I4)
1 #I3 (I4)
1 #idx (I4)
1 #found (I4)
end-define
*
* build file_b and corresponding search_array
*
perform delete_file_b
expand array #array to (1:#finds)
*
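* store every (#reads / #finds)-th item_no, so file_b ends up with #finds records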
divide #finds into #reads giving #i4
read (#reads) file_a by item_no
  add 1 to #i3
  if #i3 = #i4
    add 1 to #idx
    #array(#idx) := ITEM_NO
    reset file_b
    file_b.key1 := ITEM_NO
    file_b.key2 := 'file_b' /* to delete records afterwards
    store file_b
    reset #i3
  end-if
end-read
end transaction
*
* use search array
*
reset #found
#timx := *TIMX
read (#reads) file_a by item_no
  if item_no = #array(*)
    add 1 to #FOUND
  end-if
end-read
#timx := *TIMX - #TIMX
write 'search array' #TIMX_P13 '=' #found
*
* use find number instead
*
reset #found
#timx := *TIMX
read (#reads) file_a by item_no
  fn-item. find number file_b with key1 = item_no
  IF *NUMBER(fn-item.) NE 0
    add 1 to #found
  end-if
end-read
#timx := *TIMX - #TIMX
write 'find number ' #TIMX_P13 '=' #found
*
perform delete_file_b
*
define subroutine delete_file_b
find file_b with key2 = 'file_b'
  delete
end-find
end transaction
end-subroutine
*
end

Hi,
You can do it with ‘examine giving index’.
I am not sure it will run faster.
But how did you decide that only 500 records will be in file B?

If you are reading the entire file and don’t mind the order, then READ PHYSICAL must be the fastest.
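Something like this (just a sketch):

READ FILE_A IN PHYSICAL SEQUENCE
  /* process the record; the physical sequence avoids descriptor access
END-READ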


Hi,

the reason for this runtime behavior is that the array is scanned sequentially: every occurrence is checked by the IF statement until one matches. You could write a function that uses the technique of folding (German: Falten), i.e. repeatedly halving the search interval. This decreases the number of executed IF statements considerably.

Using techniques like (A) DYNAMIC and BY VALUE in the parameter definition even enables you to write an abstract solution to this problem, which could be reused for similar problems.

OK, SAG could modify Natural the same way.

examine full #array(*) for full … giving index … is not faster.
Here are the numbers:
search array 1463
examine 1436
find number 2477
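
For completeness, the EXAMINE loop I measured looks roughly like this (same harness as in the test program above):

reset #found
#timx := *TIMX
read (#reads) file_a by item_no
  /* GIVING INDEX returns the occurrence in which the value was found (0 = not found)
  examine full #array(*) for full value item_no giving index #idx
  if #idx ne 0
    add 1 to #found
  end-if
end-read
#timx := *TIMX - #TIMX
write 'examine     ' #TIMX_P13 '=' #found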

In my case, there are only 20 records in file B, and I know that number doesn’t change much. But I thought: what if there were 100 or even more records? When does IF #array(*) = key become slower than FIND? In my case it was at 500, and that’s interesting because I didn’t expect such a high number. And I think with a smarter search algorithm we can push that limit beyond 1000. That would change the way I write programs a lot.

From the results you published, I understand EXAMINE is faster, am I right?
I am not sure calling a function will improve the overall time, but it will surely be useful.

OK, but to be honest it’s not really faster :wink:

This is just what I thought. Just give me a few minutes more. I’ll post a subprogram…

… Hash tables (like in other programming languages) would be great.


Here it is. Performance tests are running at the moment …

* XXFIND1N
*
* find value in sorted array
*
DEFINE DATA PARAMETER
1 #INPUT-PARAMETER
  2 #SEARCH-ADYN (A) DYNAMIC BY VALUE
  2 #ADYN        (A/1:*)  DYNAMIC   /* has to be sorted!
*
1 #OUTPUT-PARAMETER
  2 #O-IDX      (I4)
*
LOCAL
*
1 #S1    (I4)
1 #S2    (I4)
1 #SC    (I4)
1 #IDX   (I4)
1 #FOUND (L)
*
END-DEFINE
*
RESET #OUTPUT-PARAMETER
*
RESET #FOUND #IDX
IF *OCC(#INPUT-PARAMETER.#ADYN) = 0
  #FOUND := FALSE
  #IDX   := 0
ELSE
  #S1 := 1
  #S2 := *OCC(#INPUT-PARAMETER.#ADYN)
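  /* folding: #S1..#S2 is the interval that may still contain the search value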
  REPEAT
    IF #S1 + 1 >= #S2
      #SC := #S1
    ELSE
      #SC := #S1 + (#S2 - #S1 + 1) / 2
    END-IF
    DECIDE FOR FIRST CONDITION
      WHEN #INPUT-PARAMETER.#ADYN(#SC) = #INPUT-PARAMETER.#SEARCH-ADYN
        #IDX   := #SC
        #FOUND := TRUE
        ESCAPE BOTTOM
      WHEN #INPUT-PARAMETER.#ADYN(#SC) < #INPUT-PARAMETER.#SEARCH-ADYN
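        /* search value lies in the upper half: move the lower bound up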
        IF #SC = #S2 OR = *OCC(#INPUT-PARAMETER.#ADYN)
          #IDX := #SC
          #FOUND := FALSE
          ESCAPE BOTTOM
        END-IF
        IF #SC + 1 < #S2
          #S1 := #SC + 1
        ELSE
          #S1 := #S2
        END-IF
      WHEN #INPUT-PARAMETER.#ADYN(#SC) > #INPUT-PARAMETER.#SEARCH-ADYN
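        /* search value lies in the lower half: move the upper bound down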
        IF #SC = #S1 OR = 1
          #IDX := #SC - 1
          #FOUND := FALSE
          ESCAPE BOTTOM
        END-IF
        IF #SC - 1 > #S1
          #S2 := #SC - 1
        ELSE
          #S2 := #S1
        END-IF
      WHEN NONE 
        IGNORE
    END-DECIDE
  END-REPEAT
END-IF
*
IF #FOUND
  #OUTPUT-PARAMETER.#O-IDX := #IDX
END-IF
*
END

You’re a quick coder; I could not find an error.

But it could be that the performance is bad because of the BY VALUE parameter definition.
The reason is that with every invocation of the subprogram the whole array is physically copied.
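
Just to illustrate with a sketch (#ADYN-V is only an example name), the difference is a single clause in the parameter definition:

DEFINE DATA PARAMETER
1 #ADYN   (A/1:*) DYNAMIC            /* by reference: the caller's array is used directly
1 #ADYN-V (A/1:*) DYNAMIC BY VALUE   /* by value: the whole array is copied on every CALLNAT
END-DEFINE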

I hate copycode, but sometimes …


Why not use FOR instead of REPEAT? Then the subprogram would be nicer.

To be honest, I have a similar algorithm in an existing program, so the only thing to do was to adapt it.

And you’re right: ARRAY … BY VALUE would be really bad. But I didn’t use it for the array, just for the search value… and I think that’s OK.

But the performance is disappointing for me. Maybe 1 million CALLNATs add even more overhead (searching the steplibs etc.).

XXFIND 2432 #FOUND: 20
search array 1375 #FOUND: 20
find number 2497 #FOUND: 20

XXFIND 2927 #FOUND: 500
search array 2489 #FOUND: 500
find number 2471 #FOUND: 500

code snippet for the test program above:

*
* XARRAY
*
reset #found
#timx := *TIMX
read (#reads) file_a by item_no
  callnat 'XXFIND1N' item_no #array(*) #idx
  if #idx ne 0
    add 1 to #FOUND
  end-if
end-read
#timx := *TIMX - #TIMX
write 'XXFIND      ' #TIMX_P13 '=' #found
*

Another way is to keep the array in memory.

Anyway, folding should be faster than IF #VAR = #ARR(*) …

But to take advantage of it and to produce a reusable, abstract solution without invocation overhead, an object of type copycode should be used.

Have a nice weekend; I’ll continue typing some sort of operations manual.

Maybe you’re right again. Let me try out a Natural function first …

Thank you. :wave:

OK, I read it again and understood.

Using a Natural function is a bit better than CALLNAT, but not really good. So let’s write a copycode… as suggested by Wolfgang Rohlfs.

func XXFIND 2549 #found: 500
search array 2453 #found: 500
find number 2406 #found: 500

func XXFIND 2243 #found: 20
search array 1432 #found: 20
find number 2491 #found: 20
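
Roughly, the copycode (let’s call it XXFINDC) could look like this. This is an untested sketch: &1&, &2& and &3& are the INCLUDE substitution parameters for the search value, the sorted array and the result index, and the work fields #S1, #S2, #SC must be defined in the including object:

* XXFINDC (copycode)
* binary search in a sorted array
* &1& = search value, &2& = sorted array, &3& = result index (0 = not found)
RESET &3&
#S1 := 1
#S2 := *OCC(&2&)
REPEAT WHILE #S1 <= #S2
  #SC := (#S1 + #S2) / 2      /* midpoint; integer division truncates
  DECIDE FOR FIRST CONDITION
    WHEN &2&(#SC) = &1&       /* hit: return the occurrence
      &3& := #SC
      ESCAPE BOTTOM
    WHEN &2&(#SC) < &1&       /* value lies in the upper half
      #S1 := #SC + 1
    WHEN NONE                 /* value lies in the lower half
      #S2 := #SC - 1
  END-DECIDE
END-REPEAT

The caller would use INCLUDE XXFINDC 'item_no' '#array' '#idx'. The parameters are inserted as source text at compile time, so there is no CALLNAT overhead.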

It seems to me that FIND is always around 2400 regardless of the array size, while the search variants (function, external subprogram, copycode etc.) vary depending on the array size.

Right.
FIND doesn’t use the array at all. It just uses the descriptor maintained by ADABAS.
Any other logic depends on the array size, because:

IF #ARRAY(*) = ‘text’ just means:

IF #ARRAY(1) = ‘text’
OR #ARRAY(2) = ‘text’
OR #ARRAY(3) = ‘text’

So the IF takes more time the longer #ARRAY gets.