Looking for efficient way to find duplicate names in adabas file

I’m tasked with the following:

Read vendor file
Sort records found by vendor name (excluding spaces and punctuation)
Select records on vendor file with same name but different tax identification number

Here’s my superbly inefficient first attempt. After my test job ran for an hour with no end in sight, I canceled it.


DEFINE DATA LOCAL
01 VENDOR VIEW OF VENDOR
  02 VENDOR-NUM(A13)
  02 REDEFINE VENDOR-NUM
    03 FILLER 1X
    03 #TIN(A9)
  02 NAME(A31)


01 VENDOR-DUP VIEW OF VENDOR
  02 VENDOR-NUM(A13)
  02 REDEFINE VENDOR-NUM
    03 FILLER 1X
    03 #TIN(A9)
  02 NAME(A31)

01 #NAME(A31)
01 #NAME2(A31)
END-DEFINE

READ VENDOR
  RESET #NAME #NAME2                   
  MOVE VENDOR.NAME TO #NAME
  EXAMINE #NAME FOR ' ' DELETE         
  EXAMINE #NAME FOR ',' DELETE         
  EXAMINE #NAME FOR '.' DELETE         
  READ VENDOR-DUP   
    MOVE VENDOR-DUP.NAME TO #NAME2
    EXAMINE #NAME2 FOR ' ' DELETE
    EXAMINE #NAME2 FOR ',' DELETE
    EXAMINE #NAME2 FOR '.' DELETE
    ACCEPT IF #NAME2 = #NAME AND VENDOR-DUP.#TIN NE VENDOR.#TIN
END-ALL

SORT #NAME USING NAME VENDOR
  AT BREAK OF #NAME
    SKIP (1) 1
  END-BREAK
*
  WRITE (1) VENDOR 26T NAME
END-SORT
END

Why check for differences before you send it for sorting ?
As far as I can see, If you check when you get them back you can avoid reading through the whole DB n*n as you do now.
Finn

Two things for a first iteration

READ VENDOR BY NAME

will get rid of SORT

Unless name can start with one of the removed characters

  READ VENDOR-DUP STARTING FROM #NAME

plus check for a reasonable strategy to drop out as early as
possible from the inner READ loop, otherwise you will read VENDOR-DUP
from start to end for every single VENDOR record …

I’m thinking something like this:


DEFINE DATA LOCAL  
01 VENDOR VIEW OF VENDOR  
  02 VENDOR-NUM(A13)  
  02 REDEFINE VENDOR-NUM  
    03 FILLER 1X  
    03 #TIN(A9)  
  02 NAME(A31)  
01 #HOLD   (5)
  02 #NAME (A31)
  02 NAME   (A31)
  02 VEND0R=NUM
01 #I  (I4)
END-DEFINE

READ VENDOR
  MOVE VENDOR.NAME TO #NAME  
  EXAMINE #NAME FOR ' ' DELETE           
  EXAMINE #NAME FOR ',' DELETE           
  EXAMINE #NAME FOR '.' DELETE           
END-ALL
SORT BY #NAME USING NAME VENDOR-NUM
  IF #NAME NE #HOLD.#NAME (1)
    IF #I GT 1
      DISPLAY (ES=ON) #HOLD(*)
    END-IF
    RESET #HOLD
    MOVE #NAME       TO #HOLD.#NAME (1)
    MOVE NAME        TO #HOLD.NAME(1)
    MOVE VENDOR-NUM  TO #HOLD.VENDOR-NUM(1)
    MOVE 1 TO #I
    ESCAPE TOP
  END-IF

  ADD 1 TO #I
  MOVE #NAME       TO #HOLD.#NAME (#I)
  MOVE NAME        TO #HOLD.NAME(#I)
  MOVE VENDOR-NUM  TO #HOLD.VENDOR-NUM(#I)
END-SORT

True, you are sorting the entire file, but that is better than reading it n*n times.

If this is something that will need to be done frequently, I would add a CLEAN-NAME to the file and make it a descriptor. Then you could histogram on CLEAN-NAME and use *NUMBER to find duplicates.

I would flatten the database file (adding multi-fetch):

DEFINE DATA LOCAL  
1 EMP    VIEW EMPLOYEES
  2 PERSONNEL-ID
  2 NAME
1 #NAME (A31)
END-DEFINE
READ MULTI-FETCH 10 EMP PHYSICAL
  #NAME := EMP.NAME
  EXAMINE #NAME FOR ' ' DELETE
  EXAMINE #NAME FOR ',' DELETE             
  EXAMINE #NAME FOR '.' DELETE
  WRITE WORK 1 #NAME
               EMP.NAME
               EMP.PERSONNEL-ID
END-READ
END

Then use a separate sort step (letting it determine the duplicates):

//STEP01   EXEC PGM=ICETOOL
//TOOLMSG   DD SYSOUT=*
//DFSMSG    DD SYSOUT=*
//IN        DD DSN=
//OUT       DD DSN=
//TOOLIN    DD *
  SELECT FROM(IN) TO(OUT) ON(1,31,CH) ALLDUPS
/*

Then print the resulting file:

DEFINE DATA LOCAL
1 #W
  2 #NAME (A31)
  2 #VENDOR (A31)
  2 #ACCOUNT (A13)
END-DEFINE
READ WORK 1 RECORD #W
  DISPLAY #NAME (IS=T)
          #VENDOR
          #ACCOUNT
END-WORK
END

Since recently I also fall in love with DFSORT/Icetool and I’m 100% agreed with Ralph :slight_smile: Well, I don’t think it can be done more efficiently on mainframe

Thanks for all the great suggestions, guys!

But this completely ignores the tax identification number requirement,
or what am I missing here ?!?

To verify my syntax with CHECK, I used the Employees file and replaced VENDOR-NUM with PERSONNEL-ID, but the logic is essentially the same.

But you will never see any hits with

ACCEPT IF #NAME2 = #NAME AND EMPLOYEES-DUP.PERSONNEL-ID NE EMPLOYEES.PERSONNEL-ID
IF #NAME2 = #NAME

is handled by the ALLDUPS parameter.

EMPLOYEES-DUP.PERSONNEL-ID NE EMPLOYEES.PERSONNEL-ID

is used to eliminate false hits/duplicates resulting from reading the same record as parent and child (eg outer loop PID=1 and inner loop PID=1). With SORT, there will be no false duplicates, so this test is unnecessary. Jerome also removed that test in his one-file code example.

Thanks Ralph, and sorry, I knew I must be missing something,
and I know I should stay clear of the forum with a headache like
that …

Nice solution then if external tools are allowed, yet there are, and must be,
“better” solutions using just Natural & Adabas, with the proper descriptors.

Sorry, guys, yesterday I missed the end of our such an interesting discussion.
Today I have 2 more points to add.
Wolfgand, I think nowadays most mainframe shops consider SORT as “integral part of OS”; perhaps, your remark about “external tools” is not that appropriate, if I may call it that way :slight_smile:
However, I agree with you regarding requirement as for "different tax identification number " and I think Ralph is too optimistic: SELECT would certainly sort out ALL dups (with the same vendor`s name); however, the task to verify that those records have different tax numbers do require some MORE CODE to add…

Lets pretend this is the "flattened vendor file" (the vendors name is the first 8 positions):
00000001 THIS IS REC1
00000002 THIS IS TAX#2
00000005 THIS IS TAX#51
00000002 THIS IS TAX#2
00000003 THIS IS REC3
00000004 THIS IS REC6
00000002 THIS IS TAX#2
00000006 THIS IS REC8
00000007 THIS IS REC9
00000005 THIS IS TAX#52
00000005 THIS IS TAX#53

After SELECT ALL dups I got the following (unless I missed something?)

00000002 THIS IS TAX#2
00000002 THIS IS TAX#2
00000002 THIS IS TAX#2
00000005 THIS IS TAX#51
00000005 THIS IS TAX#52
00000005 THIS IS TAX#53

Since the first 3 records have the same tax-ID#, they should not be included into the final report; however, this is what I received:


00000002 THIS IS TAX#2
THIS IS TAX#2
THIS IS TAX#2
00000005 THIS IS TAX#51
THIS IS TAX#52
THIS IS TAX#53
NEXT FIN
NAT9995 Natural session terminated normally.

Only Dan, the original poster, can say for sure, but I presumed that a file named VENDOR would have unique TIN values. Now for a file named VENDOR-DETAILS …

Well, the same tax id for different individuals would definitely be a strange thing :wink:

In our case there can be multiple TINs on the VENDOR file.


PREFIX       TIN              SEQUENCE NUMBER
   F         123456789           001
   F         123456789           002
   S         123456789           001

Prefix indicates if vendor uses a FEIN as their TIN or a SSN as their TIN or if they’re a foreign vendor
Sequence number is for a company with multiple locations.

But I assume this will be for the same name, or ?

Yes, I’m creating a report listing records on our VENDOR file with the same name but a different TIN (VENDOR NUMBER minus prefix and sequence number).

This brings us to an additional problem, suppose you have

Name    Tin Seq
Smith    1   1
Smith    2   1
Smith    1   2
Smith    3   1
Smith    2   2