examine for pattern

Hello,
I was wondering whether when using an EXAMINE FOR PATTERN statement, it is possible:

  1. To specify wildcards that represent numbers
  2. When a pattern was found in my string, to get the substring that matched the pattern, or to get the length of the pattern-matching substring.

Thanks a lot,
Joseph

  1. NO as far as I know → documentation says:
    A period (.), question mark (?) or underscore (_) indicates a single position that is not to be examined.
    An asterisk (*) or a percent sign (%) indicates any number of positions not to be examined.

  2. NO as far as I know - you have to program it yourself…

DEFINE DATA LOCAL
1 #LENGTH        (I4)
1 #AFTERREPLACE  (I4)
1 #PATTERNLENGTH (I4)
1 #PATTERNSTART  (I4)
*
1 #ORIGINAL      (A250)
1 #COPY          (A250)
1 #FOUNDPATTERN  (A250)
1 #FINDPATTERN   (A250)
END-DEFINE
*
#ORIGINAL := 'When a pattern was found in my string, to get the '-
  ' substring that matched the pattern, or to get the length of the'-
  ' pattern-matching substring'
*
#FINDPATTERN := 'a*the'
*
EXAMINE #ORIGINAL FOR ' ' GIVING LENGTH #LENGTH
#COPY := #ORIGINAL
EXAMINE #COPY FOR PATTERN #FINDPATTERN REPLACE FIRST WITH '.'
GIVING LENGTH #AFTERREPLACE
*
IF #LENGTH NE #AFTERREPLACE
  #PATTERNLENGTH := #LENGTH - #AFTERREPLACE + 1
  EXAMINE #ORIGINAL FOR PATTERN #FINDPATTERN
  GIVING POSITION #PATTERNSTART
*
  #FOUNDPATTERN := SUBSTR(#ORIGINAL, #PATTERNSTART, #PATTERNLENGTH)
ELSE
  WRITE 'Error: pattern not found'
END-IF
*
PRINT "=" #FOUNDPATTERN
*
END

Starting with the next issue of Inside Natural (see website; address below in signature)
I have added a new column. It is called “CPU Police”. The premise of the column
is simple. Many programmers, usually under time constraints, write code for simple
functions without considering alternative code. The code ends up in production.
Unfortunately, sometimes this code ends up within loops that are executed literally
millions of times per day.

Take the code that Weihnachtsbar contributed to address Joseph's posting regarding
“extracting” a “pattern” text string. My guess is that virtually everyone reading this
would implement this solution without giving it a second thought. It is a fine solution,
except, one can write code that performs the same function yet executes in one third
the time (and consumes about half the CPU time).

This is NOT to discourage participation in this forum (or, SAG-L for that matter).
To the contrary, the point of this posting is to encourage individuals within
the Natural community to find ways to better utilize Natural by writing more efficient code,
AND, sharing that code with the rest of the user community.

Part of the problem is that when the original posting was made it sort of specified
“PATTERN” as part of the solution. As will be shown below, “extracting” a “pattern”
of characters is not something that the PATTERN option does very well. Knowing that
PATTERN is fairly expensive, and noting that the posted solution had two EXAMINE for
PATTERNs, I rewrote a solution as follows:

EXAMINE #ORIGINAL FOR 'a' GIVING POSITION #PATTERNSTART
EXAMINE SUBSTRING (#ORIGINAL,#PATTERNSTART) FOR 'the'
  GIVING POSITION #LENGTH
ADD 2 TO #LENGTH /* AFTER EXAMINE, ONLY HAVE UP TO THE t

Note, I have, for demonstration simplicity, eliminated various tests, and the
final MOVE SUBSTRING, which would be in common with the original solution.

The point of the above is that two EXAMINEs without PATTERN were likely to be
faster than two EXAMINEs with PATTERN, plus an EXAMINE without PATTERN, plus
a MOVE.

I wrote a timing comparison on my PC (Natural 6.1.1). Here is the code and the
output.

DEFINE DATA LOCAL
1 #LENGTH (I4)
1 #AFTERREPLACE (I4)
1 #PATTERNLENGTH (I4)
1 #PATTERNSTART (I4)
*
1 #ORIGINAL (A250)
1 #COPY (A250)
1 #FOUNDPATTERN (A250)
1 #FINDPATTERN (A250)
1 #LOOP (P9)
1 #START-CPU (P9)
1 #CPU-ELAPSED (P9)
END-DEFINE
*
INCLUDE AATITLER
INCLUDE AASETC
*
#ORIGINAL := 'When a pattern was found in my string, to get the '-
  ' substring that matched the pattern, or to get the length of the'-
  ' pattern-matching substring'
*
#FINDPATTERN := 'a*the'
*
SETB. SETTIME
MOVE *CPU-TIME TO #START-CPU
FOR #LOOP = 1 TO 100000
  EXAMINE #ORIGINAL FOR 'a' GIVING POSITION #PATTERNSTART
  EXAMINE SUBSTRING (#ORIGINAL,#PATTERNSTART) FOR 'the'
    GIVING POSITION #LENGTH
  ADD 2 TO #LENGTH /* AFTER EXAMINE, ONLY HAVE UP TO THE t
END-FOR
COMPUTE #CPU-ELAPSED = *CPU-TIME - #START-CPU
WRITE 5T 'FAST WAY' *TIMD (SETB.) #CPU-ELAPSED /
*
SETA. SETTIME
MOVE *CPU-TIME TO #START-CPU
FOR #LOOP = 1 TO 100000
  EXAMINE #ORIGINAL FOR ' ' GIVING LENGTH #LENGTH
  #COPY := #ORIGINAL
  EXAMINE #COPY FOR PATTERN #FINDPATTERN REPLACE FIRST WITH '.'
    GIVING LENGTH #AFTERREPLACE
*
  #PATTERNLENGTH := #LENGTH - #AFTERREPLACE + 1
  EXAMINE #ORIGINAL FOR PATTERN #FINDPATTERN
    GIVING POSITION #PATTERNSTART
END-FOR
COMPUTE #CPU-ELAPSED = *CPU-TIME - #START-CPU
WRITE 5T 'SLOW WAY' *TIMD (SETA.) #CPU-ELAPSED /
*
SETC. SETTIME
MOVE *CPU-TIME TO #START-CPU
FOR #LOOP = 1 TO 100000
  IGNORE
END-FOR
COMPUTE #CPU-ELAPSED = *CPU-TIME - #START-CPU
WRITE 5T 'FOR LOOP' *TIMD (SETC.) #CPU-ELAPSED
END

PAGE #   1                    DATE:    Sep 08, 2006
PROGRAM: PATTRN04             LIBRARY: SYSTEM

FAST WAY       33        324

SLOW WAY       48        466

FOR LOOP        5         56

For a meaningful comparison, one has to subtract the overhead of the FOR loop from the
two approaches. Thus, the elapsed times are 28 versus 43 ( a saving of a third in the
elapsed time ) and the CPU times are 268 versus 410 (a similar saving of about a third).

Then I realized that I could improve Weihnachtsbar’s code. The second EXAMINE for PATTERN
was actually not necessary. Instead, the functionality of the second EXAMINE for PATTERN
could be incorporated within the first EXAMINE for PATTERN, as in:

EXAMINE #COPY FOR PATTERN #FINDPATTERN REPLACE FIRST WITH '.'
            GIVING POSITION #PATTERNSTART GIVING LENGTH #AFTERREPLACE

The new timings were:

PAGE #   1                    DATE:    Sep 08, 2006
PROGRAM: PATTRN05             LIBRARY: SYSTEM

FAST WAY       33        324

SLOW WAY       36        349

FOR LOOP        5         56

The difference between the two was certainly reduced. After FOR loop subtractions, the
performance difference was down to about 10% (as opposed to 30% earlier).
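
For readers who want the whole extraction in one place, here is a minimal sketch of the single-EXAMINE variant. It reuses the variable names from the programs above. The shortened sample text and the test on #PATTERNSTART (instead of comparing the two lengths) are assumptions made for illustration; the test relies on GIVING POSITION returning 0 when nothing is found.

```natural
DEFINE DATA LOCAL
1 #LENGTH        (I4)
1 #AFTERREPLACE  (I4)
1 #PATTERNLENGTH (I4)
1 #PATTERNSTART  (I4)
1 #ORIGINAL      (A250)
1 #COPY          (A250)
1 #FOUNDPATTERN  (A250)
1 #FINDPATTERN   (A250)
END-DEFINE
*
#ORIGINAL := 'When a pattern was found in my string, to get the'
#FINDPATTERN := 'a*the'
*
* Length of the text without trailing blanks
EXAMINE #ORIGINAL FOR ' ' GIVING LENGTH #LENGTH
#COPY := #ORIGINAL
* One EXAMINE returns the start position and the length after replacing
EXAMINE #COPY FOR PATTERN #FINDPATTERN REPLACE FIRST WITH '.'
  GIVING POSITION #PATTERNSTART GIVING LENGTH #AFTERREPLACE
*
* Assumption: position 0 is taken to mean that nothing matched
IF #PATTERNSTART > 0
  #PATTERNLENGTH := #LENGTH - #AFTERREPLACE + 1
  MOVE SUBSTRING (#ORIGINAL, #PATTERNSTART, #PATTERNLENGTH)
    TO #FOUNDPATTERN
  WRITE 'Found:' #FOUNDPATTERN (AL=60)
ELSE
  WRITE 'Pattern not found'
END-IF
END
```

Testing the position rather than the length difference also covers the case where the matched text is exactly one character long, in which case the two lengths would be equal even though a match exists.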

Then I remembered an interesting fact. SUBSTRING on the mainframe was rewritten
for Version 4, resulting in substantial performance improvement. I ran PATTRN05
on the mainframe. Here are the numbers:

PAGE #   1                    DATE:    Sep 08, 2006
PROGRAM: PATTRN05             LIBRARY: SYSTEM

FAST WAY        4         37

SLOW WAY       11         64

FOR LOOP        1          5

Since most of you, I am sure, run Natural on the mainframe, the numbers above
are the most important in this exercise. Subtracting out the FOR loop shows an
almost 50% reduction in CPU time, and a 2/3 reduction in elapsed time.

steve

I agree wholeheartedly that reduction of CPU is an important goal. My Natural and Adabas training classes focus on this. My corporate motto is “Faster code. Faster.” That is, the least amount of code to do the job in the least amount of time.

That said, I disagree that the “fast way” posted above can be considered a reasonable alternative to the “slow way.” It’s faster because it has less functionality.

Weihnachtsbar’s code is (almost) a real-life example, and satisfies the requirement of a pattern-searching algorithm. Try it with a pattern of “tts fy”. The “improved” code works only for a single example. The introduction of multiple wildcards affects the code and the CPU comparison.

Perhaps the “CPU reduction” posting belongs in a separate thread, with practical examples.

Ralph Zbrog wrote:

That said, I disagree that the “fast way” posted above can be considered
a reasonable alternative to the “slow way.” It’s faster because
it has less functionality.

The “slow” way was written by Weihnachtsbar in response to a specific
requirement as stated in the original posting by Joseph. The “fast” way code
was also offered as a response to the original request, not as a
general pattern-searching algorithm. If one restricts one’s perspective
to the requested functionality, then the two pieces of code have equivalent
functionality.

Ralph Zbrog also wrote:

Try it with a pattern of “ttsfy”. The “improved” code works only
for a single example. The introduction of multiple wildcards affects
the code and the CPU comparison.

How about a pattern of 'a*rn*st' (which is germane to the posted query’s data)?

Indeed, the code has to be re-written, as shown below:

EXAMINE #ORIGINAL FOR 'a' GIVING POSITION #PATTERNSTART
EXAMINE SUBSTRING (#ORIGINAL,#PATTERNSTART) FOR 'rn'
  GIVING POSITION #LENGTH1
ADD #PATTERNSTART TO #LENGTH1
*
EXAMINE SUBSTRING (#ORIGINAL,#LENGTH1) FOR 'st'
  GIVING POSITION #LENGTH2
COMPUTE #LENGTH = #LENGTH1 + #LENGTH2 - #PATTERNSTART + 1

Certainly not a complicated extension of the original code (basically one more
EXAMINE SUBSTRING). By contrast, it should be noted, Weihnachtsbar’s code
does not have to be altered at all.

In terms of CPU usage, here again is the original comparison:

PAGE #   1                    DATE:    Sep 08, 2006
PROGRAM: PATTRN05             LIBRARY: SYSTEM

FAST WAY        4         37

SLOW WAY       11         64

FOR LOOP        1          5

And here is the comparison for the pattern 'a*rn*st':

PAGE #   1                    DATE:    Sep 09, 2006
PROGRAM: PATTRN06             LIBRARY: SYSTEM

FAST WAY        7         51

SLOW WAY       12         65

FOR LOOP        1          6

As before, we subtract out the FOR loop time and have a comparison between
6 and 11 for elapsed times and 45 versus 59 for CPU times. Note that, as expected,
the “slow way” times are virtually unchanged (basically, all that changed was
the pattern). By contrast, the “fast way” times have increased substantially
since we now have an additional EXAMINE SUBSTRING statement.

My guess (and until I write and test the code later, that is all it is, a guess)
is that if we changed the required functionality yet again to provide for
three asterisks and four “targets”, we might reach parity.

steve

Mea Culpa.

The original posting by Joseph does not indicate a required functionality with but one wildcard specification. I got that from the subsequent posting by Weihnachtsbar.

steve

nice thread…
some hints for the fast one …

  • the separation of the search pattern into two or more search parts is missing - it would also take extra time.
  • calculating the length of the replaced pattern is a little more complicated.
  • the fast algorithm fails for the*the - there is some extra work to do.
  • using EXAMINE … STARTING FROM … would be more elegant and probably faster than the substring way - yes, not on MF at the moment.
  • and don’t forget patterns such as ' t??' or 't??*t'.

Eric Schindler (one of the Site’s Moderators; thanks to all
the Moderators for participating) wrote:

  • using EXAMINE … STARTING FROM … would be more elegant and probably faster
    than the substring way - yes, not on MF at the moment.

Agreed about the elegance, not the performance. When STARTING FROM was added
to the PC Version of Natural, I did some timing tests, they turned out to be
basically identical. My guess was that “behind the scenes” the same code is
being generated. Perhaps it will be different when STARTING FROM is added to
the mainframe.

  • calculating the length of the replaced pattern is a little more complicated.

HOWEVER, using STARTING FROM (rather than SUBSTRING) does make the length
calculation simple:

EXAMINE #ORIGINAL FOR 'a' GIVING POSITION #PATTERNSTART
EXAMINE #ORIGINAL STARTING FROM #PATTERNSTART FOR 'the'
  GIVING POSITION #PATTERNEND
COMPUTE #LENGTH = #PATTERNEND - #PATTERNSTART + 3

  • the fast algorithm fails for the*the - there is some extra work to do.

Whoops, another red face. In the first code I wrote for this I had

EXAMINE #ORIGINAL FOR 'a' GIVING POSITION #PATTERNSTART
ADD 1 TO #PATTERNSTART
EXAMINE SUBSTRING (#ORIGINAL,#PATTERNSTART) FOR 'the'
  GIVING POSITION #LENGTH
ADD 3 TO #LENGTH
SUBTRACT 1 FROM #PATTERNSTART
#FOUNDPATTERN := SUBSTR(#ORIGINAL, #PATTERNSTART, #LENGTH)

When I was trying to “optimize” the code I removed the ADD/SUBTRACT to #PATTERNSTART
forgetting why I had put them there in the first place.

I do feel that the performance differential justifies the functionality difference
(how often will anyone require more than two asterisks anyway?). The important point,
I feel, is looking beyond the obvious (the PATTERN option) for other solutions.

steve

Hint:

As long as #PATTERNSTART is inside your string size it works, but if the only 'a' is the last character of the string … :frowning:
… and if your string does not contain any 'a', your optimized algorithm will still find a match. :twisted:

I very often want to check, quick and dirty, whether a specific tag with attributes exists inside an XML document:

'<mytag*value=?this?>*</mytag'

and I don’t want to rely on quotes and white space:

<mytag value="this">
or
<mytag value='this'>
or
<mytag 
value='this'>

:wink: Reliable but slow code may sometimes be better than fast specialized code.
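
The two failure cases named in the hint above (no 'a' at all, or the only 'a' in the last position) can be guarded against with a couple of tests. The following is only a sketch under assumptions: #SEARCHFROM is a helper variable introduced here for illustration, the sample text is shortened, and the hard-coded 250 is simply the declared length of #ORIGINAL. It still handles only the single pattern 'a*the'.

```natural
DEFINE DATA LOCAL
1 #ORIGINAL     (A250)
1 #FOUNDPATTERN (A250)
1 #PATTERNSTART (I4)
1 #SEARCHFROM   (I4)   /* hypothetical helper, not in the original code
1 #LENGTH       (I4)
END-DEFINE
*
#ORIGINAL := 'When a pattern was found in my string, to get the'
*
RESET #LENGTH
EXAMINE #ORIGINAL FOR 'a' GIVING POSITION #PATTERNSTART
* 0 means there is no 'a'; 250 leaves nothing behind it to search
IF #PATTERNSTART > 0 AND #PATTERNSTART < 250
  #SEARCHFROM := #PATTERNSTART + 1
  EXAMINE SUBSTRING (#ORIGINAL, #SEARCHFROM) FOR 'the'
    GIVING POSITION #LENGTH
END-IF
* #LENGTH is still 0 if either search failed
IF #LENGTH > 0
  ADD 3 TO #LENGTH
  MOVE SUBSTRING (#ORIGINAL, #PATTERNSTART, #LENGTH) TO #FOUNDPATTERN
  WRITE 'Found:' #FOUNDPATTERN (AL=60)
ELSE
  WRITE 'Pattern not found'
END-IF
END
```

The extra IF statements add a little CPU per call, so any timing comparison with the PATTERN version would need to be rerun with them in place.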

Hi again,
Thanks for the performance info, but it’s not so important for me at the moment.

Weihnachtsb

Hi Joseph;

I’d like to ask the first question again, since only Weihnachtsb

Natural’s MASK can find numerics and dates, but it won’t tell you the location or length of the embedded string, and the mask must be hard-coded.

Here are some examples.

DEFINE DATA LOCAL
1 #TXT (A50/10)  INIT <'Search for a date, such as 08/11/06, here.'
                      ,'08/11/06 starts with a date.'
                      >
END-DEFINE
*
IF  #TXT (1) = MASK (*'as 'MM'/'DD'/'YY)
  THEN 
    WRITE 'embedded: found'
  ELSE
    WRITE 'embedded: not found'
END-IF
IF  #TXT (2) = MASK (*MM'/'DD'/'YY)
  THEN 
    WRITE 'starting: found'
  ELSE
    WRITE 'starting: not found'
END-IF
IF  #TXT (2) = MASK (*99'/'99'/'99)
  THEN 
    WRITE 'numeric: found'
  ELSE
    WRITE 'numeric: not found'
END-IF
END

:?: Am I right in guessing that you want to find, e.g., a date pattern inside a string and then use this date for further processing?

:idea: What about a combination of the program I posted and the MASK solution from “Ralph Zbrog”?

:arrow: First find a matching pattern, and then cross-check whether it is in a date format you accept.

  1. I agree with “Steve Robinson” that a built-in solution in the Natural language would be the easiest solution (even if it’s not the fastest).

“Ralph Zbrog” wrote:

From the Natural documentation about MASK Option:

Weihnachtsb

see:

http://techcommunity.softwareag.com/ecosystem/documentation/natural/nat421mf/pg/pg_furth_lcc_0640.htm#Variable_Mask

Hi Joseph;

Although I am a great fan of online lists/forums for interacting, they do have some drawbacks. One of them was in evidence from your postings. The original problem involved “finding” and “extracting” a “PATTERN”, presumably using the EXAMINE for PATTERN statement.

HOWEVER, the “refined” question, regarding “finding”, “extracting”, and “testing” for a date, does not fit nicely into the solutions for the original problem. The reason is quite simple: unless we know for a fact that all dates will be eight characters in the format dd/mm/yy, a simple PATTERN search will not suffice to find and extract a date.

We would have to consider possibilities like d/m/yy, dd/m/yy, d/mm/yy, and dd/mm/yy (I am assuming, hopefully a good assumption, that this year would never be represented as just 6 rather than 06).

To find the spectrum of possibilities, you would have to find a PATTERN of /*/ (I won’t bother to muddy the waters with other solutions that do not use PATTERN). Then you would have to “extract” at least two characters to the left and two characters to the right of the PATTERN. The characters to the right should be simple to deal with; presumably they will only be the YY.

You would now have to “look” at the character two positions to the left of the first slash. You could use an IF #LEFT-TWO = MASK (N) to see if this character is a digit.

Now you should be able to “isolate” what purports to be a date (the original pattern, two characters to the right of the pattern, and either one or two characters to the left of the pattern). Now you can finally use an IF #MAYBE-DATE = MASK (DD.MM.YY) or something similar, depending on how you have isolated the “might be a date” string.

A whole lot more complicated than a simple isolate of a pattern.

steve