Looking for efficient way to find duplicate names in adabas file

Dan4 · April 10, 2012, 4:22pm

I’m tasked with the following:

Read vendor file
Sort records found by vendor name (excluding spaces and punctuation)
Select records on vendor file with same name but different tax identification number

Here’s my superbly inefficient first attempt. After my test job ran for an hour with no end in sight, I canceled it.


DEFINE DATA LOCAL
01 VENDOR VIEW OF VENDOR
  02 VENDOR-NUM(A13)
  02 REDEFINE VENDOR-NUM
    03 FILLER 1X
    03 #TIN(A9)
  02 NAME(A31)


01 VENDOR-DUP VIEW OF VENDOR
  02 VENDOR-NUM(A13)
  02 REDEFINE VENDOR-NUM
    03 FILLER 1X
    03 #TIN(A9)
  02 NAME(A31)

01 #NAME(A31)
01 #NAME2(A31)
END-DEFINE

READ VENDOR
  RESET #NAME #NAME2                   
  MOVE VENDOR.NAME TO #NAME
  EXAMINE #NAME FOR ' ' DELETE         
  EXAMINE #NAME FOR ',' DELETE         
  EXAMINE #NAME FOR '.' DELETE         
  READ VENDOR-DUP   
    MOVE VENDOR-DUP.NAME TO #NAME2
    EXAMINE #NAME2 FOR ' ' DELETE
    EXAMINE #NAME2 FOR ',' DELETE
    EXAMINE #NAME2 FOR '.' DELETE
    ACCEPT IF #NAME2 = #NAME AND VENDOR-DUP.#TIN NE VENDOR.#TIN
END-ALL

SORT #NAME USING NAME VENDOR
  AT BREAK OF #NAME
    SKIP (1) 1
  END-BREAK
*
  WRITE (1) VENDOR 26T NAME
END-SORT
END

Finn_the_Dane1 · April 10, 2012, 4:35pm

Why check for differences before you send it for sorting ?
As far as I can see, If you check when you get them back you can avoid reading through the whole DB n*n as you do now.
Finn

system · April 10, 2012, 4:42pm

Two things for a first iteration

READ VENDOR BY NAME

will get rid of SORT

Unless name can start with one of the removed characters

  READ VENDOR-DUP STARTING FROM #NAME

plus check for a reasonable strategy to drop out as early as
possible from the inner READ loop, otherwise you will read VENDOR-DUP
from start to end for every single VENDOR record …

Jerome_LeBlanc · April 10, 2012, 4:55pm

I’m thinking something like this:


DEFINE DATA LOCAL  
01 VENDOR VIEW OF VENDOR  
  02 VENDOR-NUM(A13)  
  02 REDEFINE VENDOR-NUM  
    03 FILLER 1X  
    03 #TIN(A9)  
  02 NAME(A31)  
01 #HOLD   (5)
  02 #NAME (A31)
  02 NAME   (A31)
  02 VEND0R=NUM
01 #I  (I4)
END-DEFINE

READ VENDOR
  MOVE VENDOR.NAME TO #NAME  
  EXAMINE #NAME FOR ' ' DELETE           
  EXAMINE #NAME FOR ',' DELETE           
  EXAMINE #NAME FOR '.' DELETE           
END-ALL
SORT BY #NAME USING NAME VENDOR-NUM
  IF #NAME NE #HOLD.#NAME (1)
    IF #I GT 1
      DISPLAY (ES=ON) #HOLD(*)
    END-IF
    RESET #HOLD
    MOVE #NAME       TO #HOLD.#NAME (1)
    MOVE NAME        TO #HOLD.NAME(1)
    MOVE VENDOR-NUM  TO #HOLD.VENDOR-NUM(1)
    MOVE 1 TO #I
    ESCAPE TOP
  END-IF

  ADD 1 TO #I
  MOVE #NAME       TO #HOLD.#NAME (#I)
  MOVE NAME        TO #HOLD.NAME(#I)
  MOVE VENDOR-NUM  TO #HOLD.VENDOR-NUM(#I)
END-SORT

True, you are sorting the entire file, but that is better than reading it n*n times.

If this is something that will need to be done frequently, I would add a CLEAN-NAME to the file and make it a descriptor. Then you could histogram on CLEAN-NAME and use *NUMBER to find duplicates.

Ralph_Zbrog · April 10, 2012, 10:36pm

I would flatten the database file (adding multi-fetch):

DEFINE DATA LOCAL  
1 EMP    VIEW EMPLOYEES
  2 PERSONNEL-ID
  2 NAME
1 #NAME (A31)
END-DEFINE
READ MULTI-FETCH 10 EMP PHYSICAL
  #NAME := EMP.NAME
  EXAMINE #NAME FOR ' ' DELETE
  EXAMINE #NAME FOR ',' DELETE             
  EXAMINE #NAME FOR '.' DELETE
  WRITE WORK 1 #NAME
               EMP.NAME
               EMP.PERSONNEL-ID
END-READ
END

Then use a separate sort step (letting it determine the duplicates):

//STEP01   EXEC PGM=ICETOOL
//TOOLMSG   DD SYSOUT=*
//DFSMSG    DD SYSOUT=*
//IN        DD DSN=
//OUT       DD DSN=
//TOOLIN    DD *
  SELECT FROM(IN) TO(OUT) ON(1,31,CH) ALLDUPS
/*

Then print the resulting file:

DEFINE DATA LOCAL
1 #W
  2 #NAME (A31)
  2 #VENDOR (A31)
  2 #ACCOUNT (A13)
END-DEFINE
READ WORK 1 RECORD #W
  DISPLAY #NAME (IS=T)
          #VENDOR
          #ACCOUNT
END-WORK
END

Nikolay_Kuhta · April 10, 2012, 10:59pm

Since recently I also fall in love with DFSORT/Icetool and I’m 100% agreed with Ralph Well, I don’t think it can be done more efficiently on mainframe

Dan4 · April 10, 2012, 11:22pm

Thanks for all the great suggestions, guys!

system · April 11, 2012, 12:27am

But this completely ignores the tax identification number requirement,
or what am I missing here ?!?

Ralph_Zbrog · April 11, 2012, 12:47am

To verify my syntax with CHECK, I used the Employees file and replaced VENDOR-NUM with PERSONNEL-ID, but the logic is essentially the same.

system · April 11, 2012, 1:04am

But you will never see any hits with

ACCEPT IF #NAME2 = #NAME AND EMPLOYEES-DUP.PERSONNEL-ID NE EMPLOYEES.PERSONNEL-ID

Ralph_Zbrog · April 11, 2012, 1:30am

IF #NAME2 = #NAME

is handled by the ALLDUPS parameter.

EMPLOYEES-DUP.PERSONNEL-ID NE EMPLOYEES.PERSONNEL-ID

is used to eliminate false hits/duplicates resulting from reading the same record as parent and child (eg outer loop PID=1 and inner loop PID=1). With SORT, there will be no false duplicates, so this test is unnecessary. Jerome also removed that test in his one-file code example.

system · April 11, 2012, 1:43am

Thanks Ralph, and sorry, I knew I must be missing something,
and I know I should stay clear of the forum with a headache like
that …

Nice solution then if external tools are allowed, yet there are, and must be,
“better” solutions using just Natural & Adabas, with the proper descriptors.

Nikolay_Kuhta · April 11, 2012, 3:34pm

Sorry, guys, yesterday I missed the end of our such an interesting discussion.
Today I have 2 more points to add.
Wolfgand, I think nowadays most mainframe shops consider SORT as “integral part of OS”; perhaps, your remark about “external tools” is not that appropriate, if I may call it that way
However, I agree with you regarding requirement as for "different tax identification number " and I think Ralph is too optimistic: SELECT would certainly sort out ALL dups (with the same vendor`s name); however, the task to verify that those records have different tax numbers do require some MORE CODE to add…

Nikolay_Kuhta · April 11, 2012, 3:39pm

Lets pretend this is the "flattened vendor file" (the vendors name is the first 8 positions):
00000001 THIS IS REC1
00000002 THIS IS TAX#2
00000005 THIS IS TAX#51
00000002 THIS IS TAX#2
00000003 THIS IS REC3
00000004 THIS IS REC6
00000002 THIS IS TAX#2
00000006 THIS IS REC8
00000007 THIS IS REC9
00000005 THIS IS TAX#52
00000005 THIS IS TAX#53

After SELECT ALL dups I got the following (unless I missed something?)

00000002 THIS IS TAX#2
00000002 THIS IS TAX#2
00000002 THIS IS TAX#2
00000005 THIS IS TAX#51
00000005 THIS IS TAX#52
00000005 THIS IS TAX#53

Since the first 3 records have the same tax-ID#, they should not be included into the final report; however, this is what I received:

00000002 THIS IS TAX#2
THIS IS TAX#2
THIS IS TAX#2
00000005 THIS IS TAX#51
THIS IS TAX#52
THIS IS TAX#53
NEXT FIN
NAT9995 Natural session terminated normally.

Ralph_Zbrog · April 11, 2012, 8:15pm

Only Dan, the original poster, can say for sure, but I presumed that a file named VENDOR would have unique TIN values. Now for a file named VENDOR-DETAILS …

system · April 11, 2012, 8:28pm

Well, the same tax id for different individuals would definitely be a strange thing

Dan4 · April 11, 2012, 8:41pm

In our case there can be multiple TINs on the VENDOR file.


PREFIX       TIN              SEQUENCE NUMBER
   F         123456789           001
   F         123456789           002
   S         123456789           001

Prefix indicates if vendor uses a FEIN as their TIN or a SSN as their TIN or if they’re a foreign vendor
Sequence number is for a company with multiple locations.

system · April 11, 2012, 8:51pm

But I assume this will be for the same name, or ?

Dan4 · April 11, 2012, 8:57pm

Yes, I’m creating a report listing records on our VENDOR file with the same name but a different TIN (VENDOR NUMBER minus prefix and sequence number).

system · April 11, 2012, 9:02pm

This brings us to an additional problem, suppose you have

Name    Tin Seq
Smith    1   1
Smith    2   1
Smith    1   2
Smith    3   1
Smith    2   2

Topic		Replies	Views
[Solved] Not a bug in examine. Information about examine/separate and performance. Adabas-Natural , Natural , Natural-on-Mainframes	19	1932	April 2, 2021
Performance READ + FIND NUMBER vs. READ + IF ARRAY(*) Adabas-Natural , Adabas , Natural	44	2742	December 2, 2022
Finding Out the Numeric Position Adabas-Natural , Natural , Natural-Code-Samples	17	42320	April 2, 2021
Checking for a single value within an array Adabas-Natural , Natural , Natural-on-Mainframes	15	9390	April 2, 2021
Create an Excel-File from Adabas-Report (AdaRep) Adabas-Natural , Natural , Natural-Code-Samples , Natural-on-Windows-Unix	13	28440	April 2, 2021

Looking for efficient way to find duplicate names in adabas file

Related topics