Looking for an efficient way to find duplicate names in an Adabas file

Taking your suggestions, this is my latest version of the program. (Boss doesn’t want two programs, so I’m assuming an external sort wouldn’t be an option for me.)


01 VENDOR VIEW OF VENDOR
  02 VENDOR-NUM(A13)
  02 REDEFINE VENDOR-NUM
    03 FILLER 1X
    03 #TIN(A9)
  02 NAME(A31)
  02 DTUSED-2K(D)
  02 VNDR-ENTR-IND(A1)
*                       
01 #HOLD (999)          
  02 #HOLD-NAME(A31)    
  02 #HOLD-TIN(A9)      
  02 VENDOR-NUM(A13)     
  02 NAME(A31)      
  02 DTUSED-2K(D)    
  02 VNDR-ENTR-IND(A1)  
*                       
01 #NAME(A31)
*            
01 #I(N3)
01 #N(N3)
END-DEFINE
*
READ MULTI-FETCH 10 VENDOR PHYSICAL
  MOVE VENDOR.NAME TO #NAME
  EXAMINE #NAME FOR ' ' DELETE                           
  EXAMINE #NAME FOR ',' DELETE                           
  EXAMINE #NAME FOR '.' DELETE                           
END-ALL
AND
SORT BY #NAME VENDOR.VENDOR-NUM USING VENDOR.NAME #TIN
  VENDOR.VNDR-ENTR-IND VENDOR.DTUSED-2K                          
*                                                                   
  AT TOP OF PAGE (1)                                                
    WRITE (1)                                                       
      'DUPLICATE NAMES ON VENDOR FILE'                     
      121T 'PAGE' *PAGE-NUMBER (1) //                               
      'VENDOR NUMBER' 27T 'VENDOR NAME' 65T 'VENDOR ENTER INDICATOR'
      104T 'LAST DATE USED' /                                       
      '-'(16) 26T '-'(31) 63T '-'(26) 104T '-'(15) //               
  END-TOPPAGE                                                       
*
  IF #NAME NE #HOLD-NAME(1)                                         
    IF #I > 1                                                       
      FOR #N 1 #I                                                   
        WRITE (1) #HOLD.VENDOR-NUM(#N) 26T #HOLD.NAME(#N) 74T    
          #HOLD.VNDR-ENTR-IND(#N) 106T                              
          #HOLD.DTUSED-2K(#N)(EM=MM/DD/YYYY) /
      END-FOR                                                 
      SKIP (1) 1
    END-IF
    RESET #HOLD(*)
    MOVE #NAME TO #HOLD-NAME(1)                
    MOVE #TIN TO #HOLD-TIN(1)                 
    MOVE VENDOR.VENDOR-NUM TO #HOLD.VENDOR-NUM(1)           
    MOVE VENDOR.NAME TO #HOLD.NAME(1)            
    MOVE VENDOR.VNDR-ENTR-IND TO #HOLD.VNDR-ENTR-IND(1)       
    MOVE VENDOR.DTUSED-2K TO #HOLD.DTUSED-2K(1)        
    MOVE 1 TO #I
    ESCAPE TOP                                                
  END-IF                                                      
  IF #TIN = #HOLD-TIN(1) OR (#I >= 1 AND #TIN = #HOLD-TIN(#I))
    ESCAPE TOP                                                
  END-IF                                                      
  ADD 1 TO #I                                                 
  MOVE #NAME TO #HOLD-NAME(#I)                 
  MOVE #TIN TO #HOLD-TIN(#I)                  
  MOVE VENDOR.VENDOR-NUM TO #HOLD.VENDOR-NUM(#I)
  MOVE VENDOR.NAME TO #HOLD.NAME(#I)                  
  MOVE VENDOR.VNDR-ENTR-IND TO #HOLD.VNDR-ENTR-IND(#I)             
  MOVE VENDOR.DTUSED-2K TO #HOLD.DTUSED-2K(#I)              
*                                                                  
END-SORT                                                           
END

Just a minor point, and I have not yet done a timing comparison:

Instead of a series of EXAMINEs with DELETES, you can do the following

1 #VALS (A1/1:3) INIT <'#','$','*'>
1 #NUM (I2)
:::::

EXAMINE #SOURCE FOR #VALS (*) DELETE GIVING NUMBER #NUM

My guess, and that is all it is at the moment, is that the single EXAMINE, even with the multiple “targets”, will be more efficient than the multiple EXAMINEs.

steve

This morning I was informed that I need to weed out anything that isn’t alphanumeric from the NAME field when doing my comparisons, instead of just looking for spaces, periods, and commas. Am I correct in assuming that this means I won’t be able to use the EXAMINE statement, and will need to search each byte of the NAME field for anything NE MASK(A) and MASK(N)?

Well, not necessarily, and my thanks go to Steve :-)

Dan, if you take a look at the syntax of EXAMINE, you will realize that #VALS(*) is an array; as such, just put all of your “illegal chars” into that array and they will be DELETED.

Regards,
NK

If you’re at Natural 8, then you can specify “operand4” as a vector of values, as Steve suggested.

If you’re on Natural 4, you can use an EXAMINE TRANSLATE to change all your “bad” characters to the same special character, then use a single EXAMINE DELETE to remove all occurrences of it.

My boss is adamant that I not look for “bad” characters but instead look for alphanumeric characters and keep them. I guess the rationale behind this is that one never knows which “bad” characters a user may input and the list would be too vast to try to compile.

I intended “bad” to mean “characters I don’t want.”

Create a table of 256 characters. Each alphanumeric character translates to itself. All others translate to something else, “#” for example. Then delete all ‘#’.
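For what it’s worth, a sketch of that table idea in Natural (untested; it assumes the pair-table form of TRANSLATE USING, where each A2 entry holds a “from” character followed by its “to” character, and it assumes ‘#’ never legitimately occurs in a name; all field names are illustrative):

```
DEFINE DATA LOCAL
1 #TAB  (A2/1:256)   /* one from/to pair per possible byte value
1 #C    (A1)
1 REDEFINE #C
  2 #CB (B1)
1 #I    (I2)
1 #NAME (A31)
END-DEFINE
* Build the table: alphanumerics map to themselves, everything
* else maps to the marker character '#'.
FOR #I 0 255
  MOVE #I TO #CB
  MOVE #C TO SUBSTRING(#TAB(#I + 1),1,1)
  IF #C = MASK(A) OR #C = MASK(N)
    MOVE #C TO SUBSTRING(#TAB(#I + 1),2,1)
  ELSE
    MOVE '#' TO SUBSTRING(#TAB(#I + 1),2,1)
  END-IF
END-FOR
* Translate, then delete every occurrence of the marker.
EXAMINE #NAME AND TRANSLATE USING #TAB(*)
EXAMINE #NAME FOR '#' DELETE
```

The table only has to be built once, before the READ loop, so the per-record cost is the two EXAMINEs.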

I’m curious, Ralph. In your experience, will this EXAMINE TRANSLATE method be faster than moving VENDOR.NAME to an array and using a FOR loop to search each byte, moving each byte to a new field if it matches MASK(A) or MASK(N)?
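For concreteness, the FOR-loop variant under discussion might look something like this (an untested sketch; every name except #NAME is made up for illustration):

```
DEFINE DATA LOCAL
1 #NAME (A31)
1 REDEFINE #NAME
  2 #IN-BYTE  (A1/1:31)
1 #CLEAN (A31)
1 REDEFINE #CLEAN
  2 #OUT-BYTE (A1/1:31)
1 #J (I2)
1 #K (I2)
END-DEFINE
* Copy only the bytes that are alphabetic or numeric.
RESET #CLEAN #K
FOR #J 1 31
  IF #IN-BYTE(#J) = MASK(A) OR #IN-BYTE(#J) = MASK(N)
    ADD 1 TO #K
    MOVE #IN-BYTE(#J) TO #OUT-BYTE(#K)
  END-IF
END-FOR
```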

Dan (sorry, Ralph :-),
There’s a machine instruction, TR, that goes back to the early IBM System/360 (i.e. it has been around for many years) and executes EXTREMELY fast, since it is a machine instruction; it performs something VERY SIMILAR to Natural’s EXAMINE TRANSLATE statement. I assume EXAMINE TRANSLATE is translated into some kind of code using TR, so it should be (well, this is my personal opinion) at least as fast as your FOR loop.

And it will be even faster when the Natural Optimizer Compiler is involved …

Ralph’s approach is probably the fastest.

However, if someone has the time to play (I do not at the moment), you might want to take a look at something else.

One could do a SEPARATE with all the “bad” characters as delimiters. Then just COMPRESS the resultant pieces.

Assuming a “real world” scenario, there would probably be few, if any, bad characters in a given source. I think that would augur well for the SEPARATE approach.
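A minimal sketch of that (untested; the array size, field names, and delimiter list are all assumptions, and the delimiter string would have to contain every character you want stripped):

```
DEFINE DATA LOCAL
1 #NAME  (A31)
1 #PIECE (A31/1:16)   /* assumed upper bound on pieces
1 #CLEAN (A31)
1 #NUM   (I2)
END-DEFINE
* Every character in the delimiter string acts as a delimiter;
* IGNORE discards anything beyond the 16 occurrences.
SEPARATE #NAME INTO #PIECE(*) IGNORE
  WITH DELIMITERS ' .,-&()' GIVING NUMBER #NUM
* Glue the pieces back together with no intervening blanks.
COMPRESS #PIECE(1:#NUM) INTO #CLEAN LEAVING NO SPACE
```

Note that this is the opposite philosophy to the translate-table approach: here you enumerate the characters to remove, rather than the characters to keep.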

steve

To be honest, I would do two things:

  • run a once-off program to clean the names
  • make sure (i.e. let the application check) no “bad characters” ever make it to the file

Interesting. I had to look up the SEPARATE statement because I didn’t know you could specify more than one delimiter. I also learned this…

Jerome’s suggestion certainly seems like it should be best.
SEPARATE, then COMPRESS the resultant strings.

But, sometimes logic is not borne out by performance.

Will get to it over the weekend.

steve

If I were a betting man, I would put my money on EXAMINE, regardless of NOC.

I’m keen to see the benchmarks. Had I mainframe access, I would perform the tests myself.

I expect to see consistently low CPU with EXAMINE, but varying CPU with SEPARATE depending on the resulting NUMBER.