Sort of Workfiles

Hi all,

How do you sort your workfiles? Do you use Natural's internal SORT, or an external sort program? And if the latter, how do you handle negative numeric fields, binary, floating point and so on?

Greetings Sascha

A few years ago, I had to sort 1.7 million records by an ASCII string. I solved this with the Unix sort, because it was significantly faster.

Generally, negative numbers and decimal points are no problem for the Unix command sort -n. But fields in Natural's own numeric formats could be a problem. If necessary, it would be best to write your own sorting algorithm (e.g. in Perl) for that.

http://www.perlfect.com/articles/sorting.shtml

We are using SyncSort on HP-UX. We use it both as an external sort and as the internal sort with Natural.

No problems with “negative numeric fields, binary, floating point and so on”.

I asked because I don’t know much about workfiles in Natural for Unix. But back to the sort question: AFAIK, sort, like grep and many other Unix command-line tools, works line by line. What if I use the following program:


define data
local
01 #workfilestructure
  02 #binary-first (b4)
  02 #text         (a80)
  02 #nummeric     (n14.7)
  02 #floating     (f8)
  02 #anothertext  (a8)
end-define
#binary-first = H'0A'
#text         = 'Hi Community'
#nummeric     = -12345.789
#floating     = 0.000001
#anothertext  = '01234567'
write work 01 #workfilestructure
end

If I open the workfile in hexer, it looks like this:


 00000000:  0a 20 20 20 48 69 20 43  6f 6d 6d 75 6e 69 74 79  .   Hi Community
 00000010:  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20
 00000020:  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20
 00000030:  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20
 00000040:  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20
 00000050:  20 20 20 20 30 30 30 30  30 30 30 30 30 31 32 33      000000000123
 00000060:  34 35 37 38 39 30 30 30  70 8d ed b5 a0 f7 c6 b0  45789000p.......
 00000070:  3e 30 31 32 33 34 35 36  37 0a -- -- -- -- -- --  >01234567.------

The output of hexdump -vC is:

00000000  0a 20 20 20 48 69 20 43  6f 6d 6d 75 6e 69 74 79  |.   Hi Community|
00000010  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00000020  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00000030  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00000040  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00000050  20 20 20 20 30 30 30 30  30 30 30 30 30 31 32 33  |    000000000123|
00000060  34 35 37 38 39 30 30 30  70 8d ed b5 a0 f7 c6 b0  |45789000p.......|
00000070  3e 30 31 32 33 34 35 36  37 0a                    |>01234567.|
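To make the line-based problem concrete, here is a minimal sketch in Python (used here only for illustration, not Natural or Perl). It rebuilds the record from the bytes in the dump above, writes it twice, and compares splitting on newlines with cutting fixed-length records. The names RECORD_LEN and fixed_length_records are made up for this example.

```python
# Bytes copied from the hex dump: 121 data bytes plus the trailing 0x0a.
RECORD = (
    bytes.fromhex("0a202020")             # #binary-first (b4)
    + b"Hi Community".ljust(80)           # #text (a80), space padded
    + b"00000000012345" + b"789000p"      # #nummeric (n14.7), 14 + 7 bytes
    + bytes.fromhex("8dedb5a0f7c6b03e")   # #floating (f8)
    + b"01234567"                         # #anothertext (a8)
    + b"\n"                               # record terminator
)
RECORD_LEN = len(RECORD)


def fixed_length_records(data, length):
    """Cut data into consecutive chunks of exactly `length` bytes."""
    return [data[i:i + length] for i in range(0, len(data), length)]


data = RECORD * 2                          # a workfile with two records
by_line = data.split(b"\n")                # what a line-based tool sees
by_length = fixed_length_records(data, RECORD_LEN)

print(RECORD_LEN)      # 122
print(len(by_line))    # 5 pieces, none of them a complete record
print(len(by_length))  # 2 intact records
```

Because the binary field itself contains the byte 0x0a, a line-based tool sees five fragments instead of two records, so any sort of such a file has to work on record length, not on newlines.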

I almost forgot about your sorting problem. Here is an example of a sorting algorithm in Perl. To keep it simple, I only used 15 bytes per record, containing three fields of 5 bytes each.

#!/usr/bin/perl
use strict;

$/=\15;                # treat 15 byte as one record

my @lines=(<STDIN>);   # read whole standard input into an array

for (sort mysort (@lines)) {  # run thru sorted array
  print;                      # write it on standard output
}

sub mysort {           # sorting algorithm (compares $a with $b)
  my @fields_a;
  my @fields_b;

  @fields_a = unpack "A5A5A5", $a;   # split line into single fields
  @fields_b = unpack "A5A5A5", $b;   # split line into single fields

  $fields_a[0] cmp $fields_b[0]      # compare field 0 as ascii
               ||                    # if field 0 is equal
  $fields_a[1] <=> $fields_b[1]      # compare field 1 numerical
               ||                    # if field 1 is also equal
  $fields_b[2] cmp $fields_a[2];     # compare field 2 descending
}

advantages:

  • no problem with newline-characters
  • you can change the sorting algorithm exactly to your special needs
  • the Perl function “unpack” can interpret formats like integer and float
  • Perl is available on almost every platform. On Linux it comes with the standard installation.

disadvantages:

  • the function “unpack” doesn’t know Natural's NUMERIC and PACKED NUMERIC formats. But of course you can write your own converter for these.
  • In my example, the whole input file is read into memory. This could be a problem with huge files.
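As a rough illustration of such a converter, the following Python sketch (Python only for illustration) decodes an unpacked NUMERIC field as it appears in the hex dump earlier in this thread, where -12345.789 in an (n14.7) field is stored as the 21 bytes 00000000012345789000p. It assumes, based only on that one dump, that the sign sits in the last byte (0x30-0x39 for a positive last digit, 0x70-0x79 for a negative one) and that the (f8) field is a little-endian IEEE double. Other platforms or settings may differ, and PACKED fields are not covered. The names decode_numeric and sort_key are hypothetical.

```python
import struct
from decimal import Decimal


def decode_numeric(raw, decimals):
    """Decode an unpacked NUMERIC field (ASCII digits, sign in last byte).

    Assumption taken from the dump: last byte 0x30-0x39 means positive,
    0x70-0x79 means negative; the low nibble is the digit in both cases.
    """
    last = raw[-1]
    if 0x30 <= last <= 0x39:
        sign = 1
    elif 0x70 <= last <= 0x79:
        sign = -1
    else:
        raise ValueError("unexpected sign byte: %#x" % last)
    digits = raw[:-1].decode("ascii") + str(last & 0x0F)
    return sign * Decimal(int(digits)).scaleb(-decimals)


def sort_key(record):
    """Sort by #nummeric, then #floating (offsets from the example layout)."""
    numeric = decode_numeric(record[84:105], 7)           # (n14.7), 21 bytes
    (floating,) = struct.unpack("<d", record[105:113])    # (f8), assumed LE
    return (numeric, floating)


value = decode_numeric(b"00000000012345789000p", 7)
print(value == Decimal("-12345.789"))   # True
```

With fixed-length records already split into a list, sorted(records, key=sort_key) would then order them numerically, which is the same idea as the Perl comparison function above with the decoding step added.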

Sorry to resurrect an old topic, but has anyone else had experience of the third-party SYNCSORT product on Unix? If so, how does it compare to the native Unix sort, in terms of performance and features?

This is for SORTs within Natural itself AND for sorting workfiles outside Natural, and we need the equivalent of typical mainframe SYNCSORT statements, e.g. INCLUDE, OMIT, SUM etc.

We are using SyncSort on HP-UX (PA-RISC and Itanium) without problems. It is more or less like the one on the IBM mainframe, and we did not want to migrate all our sort JCLs. That is one of the reasons we decided to use SyncSort instead of the native Unix sort. The other reason, AFAIR, was that the Unix sort was not able to handle binary and floating-point fields in a file.

KlaBue