[Solved] Not a bug in examine. Information about examine/separate and performance.

Hi there.

While working on a friend’s challenge to write a program that counts the words in a text, I think I found a bug.

If you can, please run this program and check whether you get the same WRITE output as I did.

Sorry for the bad/unaligned code. It was written fast.

If this is not a bug and I’m being stupid, please tell me. I promise I won’t be mad :slight_smile:


DEFINE DATA                                                       
LOCAL                                                             
1 PL  (A15/10000)                                                 
1 CPL (N5/10000)                                                  
*                                                                 
1 TRY (A15)                                                       
1 ARQ                                                             
  2 LINHA (A250)                                                  
*                                                                 
1 X    (N5)                                                       
1 QUEBRA (A1/10) INIT<' ',',','.',':',';','!','?','/','(',')'>    
1 QBR (A1)                                                        
1 I   (N5)                                                        
1 IX  (N5)                                                        
1 #I   (N5)                                                       
1 #POSI (N5)                                                      
1 #POSF (N5)                                                      
1 #POSQ (N5)                                                      
1 #POS  (N5)                                                      
1 #LRES (A250)                                                    
1 #CLIN (N5)                                                            
END-DEFINE                                                              
X := 1                                                                  
**READ WORK FILE 1 ARQ                                                  
LINHA :=                                                                
 'O CONTROLE POPULACIONAL é FREQUENTEMENTE UM MODO DE MUDAR DE ASSUNTO,'
 -'SE VOCê OLHA OS NúMEROS, O MAIOR CRESCIMENTO POPULACIONAL E AS TAXAS'
  -' SE VOCê OLHA OS NúMEROS, O MAIOR CRESCIMENTO POPULACIONAL.'        
  -' POPULACIONAL TENTANDO CONTAR O VALOR DOS NúMEROS.         '        
REPEAT                                                                  
  RESET #POSI #POSQ #POSF                                               
  FOR I 1 10                                                            
    MOVE QUEBRA(I) TO QBR                                               
    EXAMINE LINHA FOR QBR GIVING POSITION #POSQ                         
    IF (#POSQ LT #POSI AND #POSQ NE 0) OR #POSI = 0                     
      MOVE #POSQ TO #POSI                                               
    END-IF                                                              
  END-FOR                                                               
  IF #POSI GT 1                                                         
    #POSF := #POSI - 1                                                  
  ELSE
    #POSF := 1                                                          
  END-IF                                                                
  RESET TRY                                                             
  MOVE SUBSTRING (LINHA,1,#POSF) TO TRY                                 
  IF TRY EQ QUEBRA(*)                                                   
    IGNORE                                                              
  ELSE                                                                  
    RESET IX                                                            
    EXAMINE FULL PL(1:X) FOR TRY GIVING INDEX IX                        
    IF IX NE 0 AND TRY NE PL(IX)    /* BUG HERE. Examine is locating different words and assuming they're the same.     
      WRITE '=' TRY '=' PL(IX)   /* I've tried EXAMINE and EXAMINE FULL... same result.                 
    END-IF                                        
    IF IX NE 0 AND TRY EQ PL(IX)                                        
      ADD 1 TO CPL(IX)                                                  
    ELSE                                                                
      MOVE SUBSTRING (LINHA,1,#POSF) TO PL(X)                           
      ADD 1 TO CPL(X)                                                   
    END-IF                                                              
  END-IF                                                                
  #POS := #POSI + 1                                                     
  #CLIN := 250 - #POSI                                                  
  MOVE SUBSTRING (LINHA,#POS,#CLIN) TO LINHA                            
  IF LINHA EQ ' '                                                       
    ESCAPE BOTTOM                                                       
  END-IF                                                                
  ADD 1 TO X                                                            
END-REPEAT                                                              
FOR X 1 500                                                             
  IF PL(X) NE ' '                                                       
    WRITE '=' X '=' PL(X) '=' CPL(X)                                    
  END-IF                                                                
END-FOR                                                                 
END                                                                     

TRY: E PL: CONTROLE
TRY: AS PL: ASSUNTO
X: 1 PL: O CPL: 4
X: 2 PL: CONTROLE CPL: 1
X: 3 PL: POPULACIONAL CPL: 4
X: 4 PL: é CPL: 1
X: 5 PL: FREQUENTEMENTE CPL: 1
X: 6 PL: UM CPL: 1
X: 7 PL: MODO CPL: 1
X: 8 PL: DE CPL: 2
X: 9 PL: MUDAR CPL: 1
X: 11 PL: ASSUNTO CPL: 1
X: 12 PL: SE CPL: 2
X: 13 PL: VOCê CPL: 2
X: 14 PL: OLHA CPL: 2
X: 15 PL: OS CPL: 2
X: 16 PL: NúMEROS CPL: 3
X: 19 PL: MAIOR CPL: 2
X: 20 PL: CRESCIMENTO CPL: 2
X: 22 PL: E CPL: 1
X: 23 PL: AS CPL: 1
X: 24 PL: TAXAS CPL: 1
X: 37 PL: TENTANDO CPL: 1
X: 38 PL: CONTAR CPL: 1
X: 40 PL: VALOR CPL: 1
X: 41 PL: DOS CPL: 1

kind regards,
Marcelo Oliveira.

You’re missing an IF between lines 63 & 64, so the program can’t be tested.

Try the following, much simpler code:

DEFINE DATA LOCAL
1 LINHA (A250)
1 #ARRAY (A30/1:200)
1 #NUMBER (N5)
END-DEFINE
*
INCLUDE AASETC
LINHA :=
'O CONTROLE POPULACIONAL é FREQUENTEMENTE UM MODO DE MUDAR DE ASSUNTO,'
-'SE VOCê OLHA OS NúMEROS, O MAIOR CRESCIMENTO POPULACIONAL E AS TAXAS'
-' SE VOCê OLHA OS NúMEROS, O MAIOR CRESCIMENTO POPULACIONAL.'
-' POPULACIONAL TENTANDO CONTAR O VALOR DOS NúMEROS. '
*
EXAMINE LINHA FOR ', O' REPLACE WITH ',O'
EXAMINE LINHA FOR FULL '. ' REPLACE WITH '.'
SEPARATE LINHA INTO #ARRAY (*) WITH DELIMITER ', .' GIVING NUMBER #NUMBER
WRITE '=' #NUMBER
WRITE #ARRAY (1:45)
END

Page 1 16-02-01 17:05:25

#NUMBER: 40
O CONTROLE
POPULACIONAL é
FREQUENTEMENTE UM
MODO DE
MUDAR DE
ASSUNTO SE
VOCê OLHA
OS NúMEROS
O MAIOR
CRESCIMENTO POPULACIONAL
E AS
TAXAS SE
VOCê OLHA
OS NúMEROS
O MAIOR
CRESCIMENTO POPULACIONAL
POPULACIONAL TENTANDO
CONTAR O
VALOR DOS
NúMEROS

I would like a runnable version of the original program so I can try to determine what is wrong with the EXAMINE. In the meantime, here is a program that counts the number of unique words.

If your challenge involves cash, I want a cut. :smiley:

DEFINE DATA LOCAL
1 #TXT (A) DYNAMIC
1 #WORDS (A15/10000)
1 #WORD (A15)
1 #W (I4)
1 #C (I4)
1 #I (I4)
END-DEFINE
FORMAT PS=50
ASSIGN #TXT =
'O CONTROLE POPULACIONAL é FREQUENTEMENTE UM MODO DE MUDAR DE ASSUNTO,'  
-'SE VOCê OLHA OS NúMEROS, O MAIOR CRESCIMENTO POPULACIONAL E AS TAXAS'  
  -' SE VOCê OLHA OS NúMEROS, O MAIOR CRESCIMENTO POPULACIONAL.'          
  -' POPULACIONAL TENTANDO CONTAR O VALOR DOS NúMEROS.         '
*
SEPARATE #TXT LEFT JUSTIFIED INTO #WORDS (*)
    WITH DELIMITERS ' ,.:;!?/()'
    NUMBER #W
FOR #I = 1 #W
  ASSIGN #WORD = #WORDS (#I)
  IF  #WORD = ' '
    THEN
      ESCAPE TOP
  END-IF
END-ALL
SORT #WORD
     USING KEY
  AT START OF DATA
    RESET #W
  END-START
  AT BREAK OF #WORD
    ASSIGN #C = COUNT (#WORD)
    DISPLAY OLD (#WORD)
            #C
    ADD 1 TO #W
  END-BREAK
  AT END OF DATA
    WRITE
        / '   Total words:' T*#C COUNT (#WORD) (NL=10)
        / '  Unique words:' T*#C #W
  END-ENDDATA
END-SORT
END
Page     1                                                   02/01/16  17:51:13
 
     #WORD          #C
--------------- -----------
 
AS                        1
ASSUNTO                   1
CONTAR                    1
CONTROLE                  1
CRESCIMENTO               2
DE                        2
DOS                       1
E                         1
FREQUENTEMENTE            1
MAIOR                     2
MODO                      1
MUDAR                     1
NúMEROS                   3
O                         4
OLHA                      2
OS                        2
POPULACIONAL              4
SE                        2
TAXAS                     1
TENTANDO                  1
UM                        1
VALOR                     1
VOCê                      2
é                         1
 
   Total words:          39
  Unique words:          24
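
For readers without a Natural environment at hand, here is a rough Python sketch of the same flow: split on single delimiter characters, drop the blank pieces, then count per word. It is only a model of the program above, not a translation of how SEPARATE or SORT work internally. The names `count_words` and `DELIMITERS` are made up for this illustration.

```python
import re
from collections import Counter

TEXT = (
    "O CONTROLE POPULACIONAL é FREQUENTEMENTE UM MODO DE MUDAR DE ASSUNTO,"
    "SE VOCê OLHA OS NúMEROS, O MAIOR CRESCIMENTO POPULACIONAL E AS TAXAS"
    " SE VOCê OLHA OS NúMEROS, O MAIOR CRESCIMENTO POPULACIONAL."
    " POPULACIONAL TENTANDO CONTAR O VALOR DOS NúMEROS.         "
)

# Same delimiter characters as in the SEPARATE statement above.
DELIMITERS = " ,.:;!?/()"


def count_words(text, delimiters=DELIMITERS):
    """Split on every single delimiter character and count the non-empty pieces."""
    # A character class makes each delimiter character a split point,
    # so two delimiters in a row leave an empty piece between them.
    pieces = re.split("[" + re.escape(delimiters) + "]", text)
    # Dropping the empty pieces plays the role of the IF #WORD = ' ' test.
    return Counter(piece for piece in pieces if piece)


counts = count_words(TEXT)
for word in sorted(counts):
    print(f"{word:<15} {counts[word]:>3}")
print("Total words:", sum(counts.values()))
print("Unique words:", len(counts))
```

The totals printed by this sketch agree with the report above (39 words, 24 unique).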

Sorry, my mistake.

While copying the code, I overwrote a few lines I had written earlier.

It is correct now.

Hi Steve. Thanks for this, but the challenge was to count how many times each word appears in a text, taking all kinds of punctuation into account.
The main challenge was to import a file, as you can see on line 24, and check the words.

In the end the program works perfectly, but I’m kinda bothered by this “bug” I think I found.

Hi Ralph. Thanks for this. I have never used SEPARATE before. To be honest, I had no idea this statement even existed.

I’m familiar with the help utility, but based on daily usage, I never even paid attention to this one.
This code is much much better than mine. Thanks very much.

About the challenge, my program works. I worked around the bug with the “AND TRY EQ PL(IX)” on line 54.
The challenge was to write the program in every language we know and see which one gives the smallest code.
Three friends wrote it in Java and C# and two in C++.
I wrote in C# and Natural.

I started the program using an EXAMINE with STARTING FROM, but I can’t use it here because of NAT0599 reason 10.
I could mess with COMPOPT, but the last time I did it the application admins weren’t very friendly :oops:

Edit:

One question about the SEPARATE statement.
You set it to look for delimiters. If the text contains an ellipsis, would it still work?
My first tests used a different text, with all kinds of punctuation. Later I switched to a shorter text because everything was working except for the EXAMINE I mentioned.

Hi Marcelo;

There is a reason both Ralph and I used SEPARATE rather than EXAMINE. The SEPARATE code, in addition to being more compact than the EXAMINE code, is FAR more efficient.

If you use the help facility and refer to the SEPARATE statement, you will see that you can indeed use an ellipsis in the SEPARATE as a delimiter.

Since you are competing for “smallest code”, I combined my code and Ralph’s to minimize code; you might want to “play” with the following:

DEFINE DATA LOCAL
1 LINHA (A250)
1 #ARRAY (A30/1:200)
1 #NUMBER (I2)
1 #LOOP (I2)
1 #UNIQUE (I2)
1 #VALUE (A30)
END-DEFINE
*
LINHA :=
'O CONTROLE POPULACIONAL é FREQUENTEMENTE UM MODO DE MUDAR DE ASSUNTO,'
-'SE VOCê OLHA OS NúMEROS, O MAIOR CRESCIMENTO POPULACIONAL E AS TAXAS'
-' SE VOCê OLHA OS NúMEROS, O MAIOR CRESCIMENTO POPULACIONAL.'
-' POPULACIONAL TENTANDO CONTAR O VALOR DOS NúMEROS. '
*
SEPARATE LINHA LEFT JUSTIFIED INTO #ARRAY (*) WITH DELIMITER ', .' GIVING NUMBER #NUMBER
*
IF #ARRAY (#NUMBER) = ' '
SUBTRACT 1 FROM #NUMBER /* TAKES CARE OF FINAL DELIMITER
END-IF
*
FOR #LOOP = 1 TO #NUMBER
MOVE #ARRAY (#LOOP) TO #VALUE
END-ALL
SORT BY #VALUE USING KEY
AT BREAK OF #VALUE
DISPLAY 10T 'WORD' OLD (#VALUE) 'OCCURENCES' COUNT (#VALUE)
ADD 1 TO #UNIQUE
END-BREAK
AT END OF DATA
WRITE / 10T 'TOTAL WORDS:' COUNT (#VALUE) / 10T 'UNIQUE WORDS: ' #UNIQUE
END-ENDDATA
END-SORT
END

Page 1 16-02-02 07:57:42

                  WORD              OCCURENCES
     ------------------------------ ----------

     AS                                    1
     ASSUNTO                               1
     CONTAR                                1
     CONTROLE                              1
     CRESCIMENTO                           2
     DE                                    2
     DOS                                   1
     E                                     1
     FREQUENTEMENTE                        1
     MAIOR                                 2
     MODO                                  1
     MUDAR                                 1
     NúMEROS                               3
     O                                     4
     OLHA                                  2
     OS                                    2
     POPULACIONAL                          4
     SE                                    2
     TAXAS                                 1

Page 2 16-02-02 07:57:42

                  WORD              OCCURENCES
     ------------------------------ ----------

     TENTANDO                              1
     UM                                    1
     VALOR                                 1
     VOCê                                  2
     é                                     1

     TOTAL WORDS:       39
     UNIQUE WORDS:      24

Could you post a list of all the punctuation characters you are concerned with?

steve

Hi again, Steve.
Thanks for this explanation.

I defined the punctuation in my QUEBRA variable: <' ',',','.',':',';','!','?','/','(',')'>

I solved the issue of the ellipsis using this huge block hahahahaha

IF #POSI GT 1 /* here I check the position of the punctuation.
#POSF := #POSI - 1 /* if the position is > 1, I set #POSF to #POSI - 1 to mark the end of the word.
ELSE
#POSF := 1 /* if the punctuation is on the first byte, it could be an ellipsis, !? or
** whatever else
END-IF
RESET TRY
MOVE SUBSTRING (LINHA,1,#POSF) TO TRY /* here I move it to TRY to check whether it is punctuation
IF TRY EQ QUEBRA(*) /* here I ignore it if it is.
IGNORE


SEPARATE treats consecutive delimiters, such as an ellipsis or a quoted question (?"), as having null words between them. That is, "..." is treated as period, blank word, period, blank word, period. That is why my program tests for a blank word within the FOR loop. You could remove the AT START by replacing #W in the ADD and WRITE statements with a new variable.
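
To make the null words concrete, here is a small Python analogy. It assumes that splitting on single delimiter characters is a fair stand-in for what SEPARATE does with consecutive delimiters. The helper name `split_on_each_delimiter` and the sample string are invented for this sketch.

```python
import re


def split_on_each_delimiter(text, delimiters):
    """Split at every single delimiter character, keeping empty pieces."""
    return re.split("[" + re.escape(delimiters) + "]", text)


# Three periods in a row: two empty pieces appear between the real words.
pieces = split_on_each_delimiter("WAIT...WHAT", ".")
print(pieces)  # ['WAIT', '', '', 'WHAT']

# Filtering blank pieces corresponds to the blank-word test in the FOR loop.
words = [piece for piece in pieces if piece.strip()]
print(words)  # ['WAIT', 'WHAT']
```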

As for the EXAMINE bug, it isn’t. When TRY contains E or AS, those values are found within CONTROLE and ASSUNTO, respectively. You need another FULL, as in

EXAMINE FULL PL(1:X) FOR FULL TRY GIVING INDEX IX

As long as PL and TRY are the same length, you need only one FULL.

EXAMINE PL(1:X) FOR FULL TRY GIVING INDEX IX
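
A rough way to picture the difference, sketched in Python rather than Natural: without FULL on the search value, trailing blanks are ignored, so a short value like E is looked for anywhere inside each entry. With FULL, the blank-padded A15 value has to match, which only a whole entry can do. This is a model of the behaviour described above, not Natural's implementation. `examine_index` and `pad` are hypothetical helpers, and the table is a shortened version of PL.

```python
WIDTH = 15  # PL and TRY are both defined as A15 in the original program


def pad(value, width=WIDTH):
    """Blank-pad a value to the fixed field width, like an A15 field."""
    return value.ljust(width)


def examine_index(table, search, full_search):
    """Return the 1-based index of the first entry containing the search value, or 0."""
    # With FULL the trailing blanks are part of the search value;
    # without it they are stripped, so a substring match is enough.
    needle = pad(search) if full_search else search.rstrip()
    for index, entry in enumerate(table, start=1):
        if needle in pad(entry):
            return index
    return 0


PL = ["O", "CONTROLE", "POPULACIONAL", "é", "ASSUNTO"]

print(examine_index(PL, "E", full_search=False))   # 2: E is found inside CONTROLE
print(examine_index(PL, "AS", full_search=False))  # 5: AS is found inside ASSUNTO
print(examine_index(PL, "E", full_search=True))    # 0: no entry is exactly E
```

The first two calls reproduce the false hits from the original post (E against CONTROLE, AS against ASSUNTO).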

Ahhhh… thanks very much.

That’s another thing I had never seen before.
I thought the first FULL told the statement to match complete words in PL(*) against the word in TRY, not parts of the words in PL.

This thread is a real lesson to me. Thank you and Steve very much.

Out of curiosity, where do you guys use Natural?
I work for an oil and gas company and have worked for a telecom and a bank, and I have NEVER seen these statements used the way you showed me.
I even asked the specialists here for help with the “bug”, and they told me it was strange and shouldn’t behave that way, so it was probably a bug. That’s why I came here.

Hi Marcelo;

One of the great things about Natural is how easy it is to learn.

One of the worst things about Natural is how easy it is to learn.

Both are true. Natural is so easy to learn, that many programmers learn the basics, then stop the learning process, thus missing out on the fantastic development power that exists within the language.

Both Ralph and I are long time Natural educators. That means we strive to learn all about Natural capabilities and how to employ them most effectively. Ralph has also worked with the State of California for quite a few years. I have consulted with a financial organization for several years.

You’ll find more good stuff in the Code Samples pages:
http://techcommunity.softwareag.com/ecosystem/communities/public/adanat/products/natural/codesamples/

If you follow the link that Ralph provided, you will see (on page 4) an article entitled “Deleting null array occurrences”. It provides two approaches to “compressing” an array by removing null occurrences, and a timing comparison which shows the EXAMINE to be quite a bit faster than the COMPRESS / SEPARATE approach.

Somewhat counterintuitively, both approaches are a lot faster than individual tests for blank array members. Since the two approaches operate on strings, not individual array occurrences, the performance differences get larger as the number of array occurrences increases.
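
The string trick itself can be sketched in Python, purely to show the idea and not as a performance claim. Each array slot is fixed width and followed by a filler character, so one string-level delete of "blank slot plus filler" squeezes out every empty occurrence at once. The names `squeeze_by_string` and `squeeze_by_element` are made up here, and the sketch assumes entries never contain the filler character and never exceed the slot width.

```python
WIDTH = 3      # each occurrence is A3, as in #ARRAYB
FILLER = "&"   # one filler byte after each occurrence, as in #FILLER


def squeeze_by_string(entries):
    """Remove blank occurrences with a single string-level delete."""
    joined = "".join(entry.ljust(WIDTH) + FILLER for entry in entries)
    blank_slot = " " * WIDTH + FILLER
    # One replace removes every blank slot; the rest shifts left.
    squeezed = joined.replace(blank_slot, "").ljust(len(joined))
    step = WIDTH + 1
    return [squeezed[i:i + WIDTH].rstrip() for i in range(0, len(squeezed), step)]


def squeeze_by_element(entries):
    """Same result, obtained by testing each occurrence individually."""
    kept = [entry for entry in entries if entry.strip()]
    return kept + [""] * (len(entries) - len(kept))


entries = ["AAA", "", "BB", "CCC", "DDD", "", "", "E", "FFF", "GGG"]
print(squeeze_by_string(entries))
print(squeeze_by_element(entries))
```

Both functions return the non-blank entries first, followed by the emptied slots.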

I used the timing comparison from the article and added a test for a blank. I also increased the number of array occurrences from 10 to 100. Please realize that the functionality you were looking for could probably involve several hundred, if not thousands of words (array occurrences).

Here is the timing comparison. The EXAMINE really outperforms the other two approaches.

* NOW FOR THE “FUN”; A TIMING COMPARISON
* BETWEEN THE TWO APPROACHES SHOWN ABOVE.

DEFINE DATA LOCAL
1 #ARRAYA (A3/1:100)
1 REDEFINE #ARRAYA
2 #STRINGA (A300)
1 #GROUP (100)
2 #ARRAYB (A3)
2 #FILLER (A1) INIT ALL <'&'>
1 REDEFINE #GROUP
2 #STRINGB (A400)
**
1 #CPU-START (I4)
1 #CPU-ELAPSED (I4)
1 #LOOP (I4)
1 #LOOP2 (I4)
END-DEFINE
*
INCLUDE AATITLER
INCLUDE AASETC
*
MOVE *CPU-TIME TO #CPU-START
SETA. SETTIME
FOR #LOOP = 1 TO 1000000
MOVE 'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'
TO #STRINGB
END-FOR
COMPUTE #CPU-ELAPSED = *CPU-TIME - #CPU-START
WRITE 5T 'DUMMY FOR LOOP' 20T *TIMD (SETA.) #CPU-ELAPSED
*

* NOTE. THE EXAMINE LOOP DESTROYS THE “SOURCE” ARRAY.
* HENCE, IN ORDER TO HAVE AN EQUAL COMPARISON, I ADDED
* A SEEMINGLY UNNECESSARY MOVE TO THE COMPRESS LOOP.
* (THE TARGET OF THE COMPRESS COULD HAVE BEEN ANOTHER VARIABLE).
* THEREFORE, I ADDED THE MOVE TO THE “DUMMY” LOOP
* SHOWN ABOVE.

MOVE *CPU-TIME TO #CPU-START
SETB. SETTIME
FOR #LOOP = 1 TO 1000000
MOVE 'AAA   BB CCCDDD      E  FFFGGG' TO #STRINGA
COMPRESS #ARRAYA (*) INTO #STRINGA WITH DELIMITER ';'
SEPARATE #STRINGA INTO #ARRAYA (*) WITH DELIMITER ';'
END-FOR
COMPUTE #CPU-ELAPSED = *CPU-TIME - #CPU-START
WRITE 5T 'COMPRESS ' 20T *TIMD (SETB.) #CPU-ELAPSED
*
MOVE *CPU-TIME TO #CPU-START
SETC. SETTIME
FOR #LOOP = 1 TO 1000000
MOVE 'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'-
'AAA&   &BB &CCC&DDD&   &   &E  &FFF&GGG&'
TO #STRINGB
EXAMINE #STRINGB FOR FULL '   &' DELETE
END-FOR
COMPUTE #CPU-ELAPSED = *CPU-TIME - #CPU-START
WRITE 5T 'EXAMINE ' 20T *TIMD (SETC.) #CPU-ELAPSED
*
MOVE *CPU-TIME TO #CPU-START
SETD. SETTIME
FOR #LOOP = 1 TO 1000000
FOR #LOOP2 = 1 TO 100
IF #ARRAYA (#LOOP2) = ' '
IGNORE
END-IF
END-FOR
END-FOR
COMPUTE #CPU-ELAPSED = *CPU-TIME - #CPU-START
WRITE 5T 'IF…IGNORE ' 20T *TIMD (SETD.) #CPU-ELAPSED
END

PAGE #   1                    DATE:    16-02-05
PROGRAM: ARRAY03              LIBRARY: NEWPROGS

DUMMY FOR LOOP        4          33
COMPRESS            270        2700
EXAMINE              42         423
IF..IGNORE         1038        6378

Hi again. I was on holidays + mini vacation :slight_smile:

Would you mind posting the two copycodes you used, AATITLER and AASETC?

I would like to run this program here to check the numbers on our machine :smiley:

Sorry.

AATITLER is just a WRITE TITLE statement. If you look at the output, it prints the page number, date, program name and library.

AASETC is a SET CONTROL 'C' statement that places the output in the editor area. It is useful for cutting and pasting on the PC.

You can just comment out the INCLUDEs, or simply remove them.

haha no problem.

I tried commenting them before you said, but…

NAT0953 Time limit exceeded. :cry:

Simply change the FOR #LOOP = 1 TO 1000000 statements to use a smaller limit than a million, such as 200000.

I had already tried that.
With 500, 200, 100… and now with 50k.

Same error.

I guess the sysadmin set a really low time limit because our machine is kinda old (at least that’s what the old fellas here always say hahahaha).

I’ll rewrite it with each task in its own subroutine and run them separately, but first I have to deliver three maintenance tasks hahahahah

When I run them, I’ll post the results here. Thanks.