Compare two large files and find the differences

This question is about reading large files.
We need to compare two large text files (over 120 MB) that have around 13 lakh (1.3 million) lines per file, find the difference between them, and then process only the unique records.

The pattern is as follows: suppose today’s file contains 13 lakh records. On the next day the client sends another file that contains the previous day’s records plus new records, and we need to identify only the new records. The position of a record may change in the next day’s file.

Currently, we use the flat file pattern to read one file line by line and call a Java service that checks each line against every line of the other file and reports whether it matched.

This takes too much time, because it amounts to matching each of the 13 lakh records of one file against the 13 lakh records of the other.

Is there an existing function or approach available in SAG that we can use to compare the files and identify the unique lines?

Below is the product configuration.

Product: webMethods Integration Server
Product Version: 10.5
Operating System: Windows Server 2019 x86-64

As with so many problems in the IT world, the best answer is likely: it depends. :slight_smile:
For example, it depends on the data to be compared: are the lines in a fixed order so that a line-by-line compare is enough, or can the data be in random order while you still need to identify matching lines and orphaned ones? Do you need to compare whole lines, or the data inside your columns? And so on.
This might require some pre-processing (e.g. sorting) or even structuring the data (parsing into records) in order to compare it efficiently.
Depending on the use case, using a free or commercial library rather than implementing the required logic yourself would also be an option.
As an example: LINK
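
To illustrate the sorting idea: once both sets of lines are sorted, a single merge pass can separate matching lines from orphaned ones. Below is a minimal Java sketch, assuming whole-line comparison and that both files fit in memory. The class name `SortedLineCompare` and its fields are made up for illustration and are not part of any webMethods API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SortedLineCompare {
    // Lines that occur only in the first / only in the second input.
    final List<String> onlyInFirst = new ArrayList<>();
    final List<String> onlyInSecond = new ArrayList<>();

    static SortedLineCompare compare(List<String> first, List<String> second) {
        // Pre-processing: sort copies so the callers' lists stay untouched.
        List<String> a = new ArrayList<>(first);
        List<String> b = new ArrayList<>(second);
        Collections.sort(a);
        Collections.sort(b);

        SortedLineCompare result = new SortedLineCompare();
        int i = 0;
        int j = 0;
        // Merge pass: advance through both sorted lists in step.
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {            // same line in both inputs
                i++;
                j++;
            } else if (cmp < 0) {      // line is missing from the second input
                result.onlyInFirst.add(a.get(i++));
            } else {                   // line is missing from the first input
                result.onlyInSecond.add(b.get(j++));
            }
        }
        // Whatever is left over on either side has no partner.
        while (i < a.size()) {
            result.onlyInFirst.add(a.get(i++));
        }
        while (j < b.size()) {
            result.onlyInSecond.add(b.get(j++));
        }
        return result;
    }
}
```

Calling `SortedLineCompare.compare(oldLines, newLines)` and reading `onlyInSecond` would give the lines that are new in the second file. For files too large to sort in memory, the same merge pass could run over files sorted externally beforehand.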


Hi Mahida,

To my knowledge, a custom Java service is the best option for comparing the files and their data.

In case of records:

  1. Generate a schema for the flat file and create its document type.
  2. Create a flow service that invokes the ‘getFile’ service from the FlatFile package, passing the required information.
  3. Create a Java service that takes the document type you created for the files as input and returns an output of your choice.
  4. Write custom logic for matching the data.

Can your files contain duplicates within the same file? If so, do you also compare records within a file against each other?
Are you comparing on specific fields, or are you treating the entire record as the comparison key?

Hi Mahida,

I think you need to think out-of-the-box here. You’ll be best served with a database table with a uniqueness constraint. If you only need exact line matches, then one field is enough. Alternatively, you could break a line down into different fields and define a partial uniqueness constraint.
Once you have the table with your constraint, you can start reading the files and inserting the lines into the table one by one. Once in a while you will get a uniqueness constraint violation. That’s fine, ignore it. When you’re done, you’ll have the unique lines in the database, which you can then process one by one.

One tip if your dataset is very large: consider sequential processing. Rather than selecting all rows in one select statement, only select the first, say, 100, process them, and mark the rows you’ve processed as done. Continue doing this, e.g. in a REPEAT, until all rows are done.

Cheers!


No, a single file does not contain duplicate records.
Also, there is no specific unique field that would help with the comparison.
I need to compare each whole line with the whole lines of the other file.

This was done by creating a Java service that finds the difference between the two files and generates a new file containing only the difference. Processing that difference file afterwards is the solution I have chosen.
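
The service code itself is not shown in this thread. One common way to build such a diff, sketched here under assumptions, is to load the lines of the previous day’s file into a `HashSet` and then stream the new file once, writing out every line that is not in the set. That replaces roughly 1.3 million × 1.3 million line comparisons with one pass over each file. The class and method names are hypothetical, UTF-8 encoding is assumed, and the JVM needs enough heap to hold all lines of the older file.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

final class NewLineFinder {

    /**
     * Writes every line of newFile that does not occur anywhere in oldFile
     * to outFile and returns the number of lines written.
     * Assumption: files are UTF-8 and oldFile's lines fit into memory.
     */
    static long writeNewLines(Path oldFile, Path newFile, Path outFile) throws IOException {
        // Pass 1: remember every line of the previous day's file.
        Set<String> known = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(oldFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                known.add(line);
            }
        }

        // Pass 2: stream the new file; the set lookup is independent of line position.
        long written = 0;
        try (BufferedReader in = Files.newBufferedReader(newFile, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(outFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!known.contains(line)) {
                    out.write(line);
                    out.newLine();
                    written++;
                }
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java NewLineFinder <oldFile> <newFile> <outFile>
        long n = writeNewLines(Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));
        System.out.println(n + " new line(s) written");
    }
}
```

Inside Integration Server the same logic would sit in the body of a Java service, with the three file paths taken from the pipeline instead of `args`.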

