Looking for an efficient way to find duplicate names in an Adabas file

Taking your suggestions, this is my latest version of the program. (Boss doesn’t want two programs, so I’m assuming an external sort wouldn’t be an option for me.)


01 VENDOR VIEW OF VENDOR
  02 VENDOR-NUM(A13)
  02 REDEFINE VENDOR-NUM
    03 FILLER 1X
    03 #TIN(A9)
  02 NAME(A31)
  02 DTUSED-2K(D)
  02 VNDR-ENTR-IND(A1)
*                       
01 #HOLD (999)          
  02 #HOLD-NAME(A31)    
  02 #HOLD-TIN(A9)      
  02 VENDOR-NUM(A13)     
  02 NAME(A31)      
  02 DTUSED-2K(D)    
  02 VNDR-ENTR-IND(A1)  
*                       
01 #NAME(A31)
*            
01 #I(N3)
01 #N(N3)
END-DEFINE
*
READ MULTI-FETCH 10 VENDOR PHYSICAL
  MOVE VENDOR.NAME TO #NAME
  EXAMINE #NAME FOR ' ' DELETE                           
  EXAMINE #NAME FOR ',' DELETE                           
  EXAMINE #NAME FOR '.' DELETE                           
END-ALL
AND
SORT BY #NAME VENDOR.VENDOR-NUM USING VENDOR.NAME #TIN
  VENDOR.VNDR-ENTR-IND VENDOR.DTUSED-2K                          
*                                                                   
  AT TOP OF PAGE (1)                                                
    WRITE (1)                                                       
      'DUPLICATE NAMES ON VENDOR FILE'                     
      121T 'PAGE' *PAGE-NUMBER (1) //                               
      'VENDOR NUMBER' 27T 'VENDOR NAME' 65T 'VENDOR ENTER INDICATOR'
      104T 'LAST DATE USED' /                                       
      '-'(16) 26T '-'(31) 63T '-'(26) 104T '-'(15) //               
  END-TOPPAGE                                                       
*
  IF #NAME NE #HOLD-NAME(1)                                         
    IF #I > 1                                                       
      FOR #N 1 #I                                                   
        WRITE (1) #HOLD.VENDOR-NUM(#N) 26T #HOLD.NAME(#N) 74T    
          #HOLD.VNDR-ENTR-IND(#N) 106T                              
          #HOLD.DTUSED-2K(#N)(EM=MM/DD/YYYY) /
      END-FOR                                                 
      SKIP (1) 1
    END-IF
    RESET #HOLD(*)
    MOVE #NAME TO #HOLD-NAME(1)                
    MOVE #TIN TO #HOLD-TIN(1)                 
    MOVE VENDOR.VENDOR-NUM TO #HOLD.VENDOR-NUM(1)           
    MOVE VENDOR.NAME TO #HOLD.NAME(1)            
    MOVE VENDOR.VNDR-ENTR-IND TO #HOLD.VNDR-ENTR-IND(1)       
    MOVE VENDOR.DTUSED-2K TO #HOLD.DTUSED-2K(1)        
    MOVE 1 TO #I
    ESCAPE TOP                                                
  END-IF                                                      
  IF #TIN = #HOLD-TIN(1) OR (#I >= 1 AND #TIN = #HOLD-TIN(#I))
    ESCAPE TOP                                                
  END-IF                                                      
  ADD 1 TO #I                                                 
  MOVE #NAME TO #HOLD-NAME(#I)                 
  MOVE #TIN TO #HOLD-TIN(#I)                  
  MOVE VENDOR.VENDOR-NUM TO #HOLD.VENDOR-NUM(#I)
  MOVE VENDOR.NAME TO #HOLD.NAME(#I)                  
  MOVE VENDOR.VNDR-ENTR-IND TO #HOLD.VNDR-ENTR-IND(#I)             
  MOVE VENDOR.DTUSED-2K TO #HOLD.DTUSED-2K(#I)              
*                                                                  
END-SORT                                                           
END

Just a minor point, and I have not yet done a timing comparison:

Instead of a series of EXAMINEs with DELETES, you can do the following

1 #VALS (A1/1:3) INIT <'#','$','*'>
1 #NUM (I2)
:::::

EXAMINE #SOURCE FOR #VALS (*) DELETE GIVING NUMBER #NUM

My guess, and that is all it is at the moment, is that the single EXAMINE, even with the multiple “targets”, will be more efficient than the multiple EXAMINEs.

steve

This morning I was informed that I need to weed out anything that isn’t alphanumeric from the NAME field when doing my comparisons, instead of just looking for spaces, periods, and commas. Am I correct in assuming that this means I won’t be able to use the EXAMINE statement, and will need to search each byte of the NAME field for anything NE MASK(A) and MASK(N)?

Well, not necessarily, and my thanks go to Steve :-)

Dan, if you take a look at the syntax of EXAMINE, you will realize that #VALS(*) is an array; as such, just put all of your “illegal chars” into that array and they will be DELETED.

Regards,
NK

If you’re at Natural 8, then you can specify “operand4” as a vector of values, as Steve suggested.

If you’re on Natural 4, you can use an EXAMINE TRANSLATE to change all your “bad” characters to the same special character, then use a single EXAMINE DELETE to remove all occurrences of it.

My boss is adamant that I not look for “bad” characters but instead look for alphanumeric characters and keep them. I guess the rationale behind this is that one never knows which “bad” characters a user may input and the list would be too vast to try to compile.

I intended “bad” to mean “characters I don’t want.”

Create a table of 256 characters. Each alphanumeric character translates to itself. All others translate to something else, “#” for example. Then delete all ‘#’.
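For what it’s worth, a sketch of that table idea in Natural (untested; it assumes the pair-table form of TRANSLATE USING, where each A2 entry holds a “from” character followed by its “to” character, and it assumes ‘#’ never legitimately occurs in a name; all field names are illustrative):

```
DEFINE DATA LOCAL
1 #TAB  (A2/1:256)   /* one from/to pair per possible byte value
1 #C    (A1)
1 REDEFINE #C
  2 #CB (B1)
1 #I    (I2)
1 #NAME (A31)
END-DEFINE
* Build the table: alphanumerics map to themselves, everything
* else maps to the marker character '#'.
FOR #I 0 255
  MOVE #I TO #CB
  MOVE #C TO SUBSTRING(#TAB(#I + 1),1,1)
  IF #C = MASK(A) OR #C = MASK(N)
    MOVE #C TO SUBSTRING(#TAB(#I + 1),2,1)
  ELSE
    MOVE '#' TO SUBSTRING(#TAB(#I + 1),2,1)
  END-IF
END-FOR
* Translate, then delete every occurrence of the marker.
EXAMINE #NAME AND TRANSLATE USING #TAB(*)
EXAMINE #NAME FOR '#' DELETE
```

The table only has to be built once, before the READ loop, so the per-record cost is the two EXAMINEs.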

I’m curious, Ralph. In your experience, will this EXAMINE TRANSLATE method be faster than moving VENDOR.NAME to an array and using a FOR loop to search each byte, moving each byte to a new field if it matches MASK(A) or MASK(N)?
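For concreteness, the FOR-loop variant under discussion might look something like this (an untested sketch; every name except #NAME is made up for illustration):

```
DEFINE DATA LOCAL
1 #NAME (A31)
1 REDEFINE #NAME
  2 #IN-BYTE  (A1/1:31)
1 #CLEAN (A31)
1 REDEFINE #CLEAN
  2 #OUT-BYTE (A1/1:31)
1 #J (I2)
1 #K (I2)
END-DEFINE
* Copy only the bytes that are alphabetic or numeric.
RESET #CLEAN #K
FOR #J 1 31
  IF #IN-BYTE(#J) = MASK(A) OR #IN-BYTE(#J) = MASK(N)
    ADD 1 TO #K
    MOVE #IN-BYTE(#J) TO #OUT-BYTE(#K)
  END-IF
END-FOR
```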

Dan (sorry, Ralph :-),
There’s a machine instruction, TR, that goes back to the early IBM System/360 (i.e. it has been around for many years) and executes EXTREMELY fast, since it is a machine instruction; it performs something VERY SIMILAR to Natural’s EXAMINE TRANSLATE statement. I assume EXAMINE TRANSLATE is translated into some kind of code using TR, so it should be (well, this is my personal opinion) at least as fast as your FOR loop.

And it will be even faster when the Natural Optimizer Compiler is involved …

Ralph’s approach is probably the fastest.

However, if someone has the time to play (I do not at the moment), you might want to take a look at something else.

One could do a SEPARATE with all the “bad” characters as delimiters. Then just COMPRESS the resultant pieces.

Assuming a “real world” scenario, there would probably be few, if any, bad characters in a given source. I think that would augur well for the SEPARATE approach.
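A minimal sketch of that (untested; the array size, field names, and delimiter list are all assumptions, and the delimiter string would have to contain every character you want stripped):

```
DEFINE DATA LOCAL
1 #NAME  (A31)
1 #PIECE (A31/1:16)   /* assumed upper bound on pieces
1 #CLEAN (A31)
1 #NUM   (I2)
END-DEFINE
* Every character in the delimiter string acts as a delimiter;
* IGNORE discards anything beyond the 16 occurrences.
SEPARATE #NAME INTO #PIECE(*) IGNORE
  WITH DELIMITERS ' .,-&()' GIVING NUMBER #NUM
* Glue the pieces back together with no intervening blanks.
COMPRESS #PIECE(1:#NUM) INTO #CLEAN LEAVING NO SPACE
```

Note that this is the opposite philosophy to the translate-table approach: here you enumerate the characters to remove, rather than the characters to keep.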

steve

To be honest, I would do two things:

  • run a once-off program to clean the names
  • make sure (i.e. let the application check) no “bad characters” ever make it to the file

Interesting. I had to look up the SEPARATE statement because I didn’t know you could specify more than one delimiter. I also learned this…

Jerome’s suggestion certainly seems like it should be best.
SEPARATE, then COMPRESS the resultant strings.

But, sometimes logic is not borne out by performance.

Will get to it over the weekend.

steve

If I were a betting man, I would put my money on EXAMINE, regardless of NOC.

I’m keen to see the benchmarks. Had I mainframe access, I would perform the tests myself.

I expect to see consistently low CPU with EXAMINE, but varying CPU with SEPARATE depending on the resulting NUMBER.