Parsing Input Streams

I just wanted to share a rather lengthy lesson I learned yesterday afternoon. We were trying to parse the contents of an input stream coming in via file polling: we were polling our filesystem for new files dropped there by an internal partner. We wanted to pass each file (stream) in its entirety to TN, but first we needed to read its contents to decide who the sender and receivers were. We passed ffdata (the input stream variable from the file polling) to pub.io:streamToBytes and then the output of that to pub.string:bytesToString. We then parsed that string, set the correct sender and receiver, and tried to pass the original ffdata (which we had not edited) to wm.EDIINT:send to submit it to TN.

We couldn’t understand why the stream seemed to be corrupted by “touching it” in this manner. We never overwrote ffdata, but by the time it had passed through the whole process it was missing quite a bit of the original content. After struggling with this for several hours we decided to rebuild the stream from the string we had parsed. So we used the services above in reverse order, rebuilt the stream, and passed that rebuilt stream to the send service to submit it to TN. This finally worked.

So if there is anyone out there who doesn’t understand why a stream is missing information, just remember you can’t touch it or you need to rebuild it!

Thanks,
Jessica

Streams internally maintain a pointer to the current location in the data. When you read from a stream, you also move that internal pointer.

Hello Jessica,

The stream object in webMethods is inherited from the stream object of Java. A stream is just a sort of pointer to where the actual data is stored. Once data is read through this pointer, it is consumed: the read position moves past it. So even though you still have the stream, the data cannot be read from it AGAIN. A stream is an efficient way to handle data, but it is not reusable. So you need to convert the stream to a string or bytes if you want to reference it repeatedly.
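To make the read-pointer behaviour concrete, here is a minimal plain-Java sketch (outside Integration Server). The ByteArrayInputStream is only a stand-in for the ffdata stream, and the drain helper is a hypothetical name that plays roughly the role of a stream-to-bytes conversion; neither is taken from the original flow.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamReadOnce {

    // Hypothetical helper: read whatever is left in the stream into a byte array.
    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the stream handed over by the file polling port (made-up content).
        InputStream ffdata = new ByteArrayInputStream(
                "SENDER=ACME;RECEIVER=PARTNER;payload".getBytes(StandardCharsets.UTF_8));

        byte[] first = drain(ffdata);  // reads everything and leaves the pointer at the end
        byte[] second = drain(ffdata); // same variable, but nothing is left to read

        System.out.println("first read:  " + first.length + " bytes");
        System.out.println("second read: " + second.length + " bytes");
    }
}
```

The variable still refers to the same stream object after the first read; only its read position has changed, which is why the second read comes back empty.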

-Rajesh Rao

A good lesson learned Jessica!

From my own lessons learned - working with strings can be a big gotcha.
If the file contents you are talking about are large, you will read the entire file into memory! Also, keep in mind that every time you create a string in webMethods and start passing it around to other flows, you are actually copying the entire string, not passing a reference to the string. This is hugely costly if you are passing around large file contents as strings.

Another option to consider would be to create a Java service that takes the stream object and wraps it in a PushbackInputStream. You should be able to unread the data once you have completed your sender/receiver logic, and then pass the stream on to the delivery flow. I haven’t tested this idea out. Just a thought.
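A rough plain-Java sketch of the PushbackInputStream idea follows. It is not an Integration Server Java service (the pipeline plumbing is left out), and the class name, the peekHead and drain helpers, the sample content and the 16-byte head size are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class PushbackPeek {

    // Read up to headSize bytes, then push them back so the next reader
    // still sees the stream from its first byte.
    static String peekHead(PushbackInputStream in, int headSize) throws IOException {
        byte[] head = new byte[headSize];
        int total = 0;
        while (total < headSize) {
            int n = in.read(head, total, headSize - total);
            if (n == -1) {
                break; // stream is shorter than headSize
            }
            total += n;
        }
        // Needs a pushback buffer of at least 'total' bytes (set in the constructor).
        in.unread(head, 0, total);
        return new String(head, 0, total, StandardCharsets.US_ASCII);
    }

    // Hypothetical helper: read the rest of the stream into a byte array.
    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] file = "SENDER=ACME;RECEIVER=PARTNER;rest of the document"
                .getBytes(StandardCharsets.US_ASCII);
        int headSize = 16; // assumed number of bytes needed for sender/receiver parsing

        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(file), headSize);
        String head = peekHead(in, headSize);
        byte[] delivered = drain(in); // what a downstream consumer would receive

        System.out.println("head: " + head);
        System.out.println("complete: " + (delivered.length == file.length));
    }
}
```

The pushback buffer size passed to the constructor has to cover everything you intend to unread, so only the head of the file is held in memory rather than the whole document.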

Hope this helps,

Haithem

Note that Flow Map Copy always creates references when possible, so there aren’t multiple instances of a String or IData in memory at one time unless an application explicitly does a clone.

There are a couple operations that will cause IS to do a true clone of the data - mostly to support asynchronous operations, such as logging.

You can also create a SequenceInputStream from a ByteArrayInputStream of what you have already read plus the original stream.

HTH,
Fred

Fred,

Flow maps do reference copies for all objects except Strings. Strings are actual copies. You can test that by mapping a string to another then changing the destination string. This will not change the source, confirming that strings are not reference copies.

Yes. We are both right.

Since java.lang.String is immutable, the act of changing its value creates a new String instance, which one then puts into the pipeline.
There is only ever one occurrence of an interned String value (a literal, for example) in the JVM. See
http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#intern

The act of Flow Map Copy doesn’t use significant extra RAM. This is not very intuitive, particularly to an old C programmer like me.
http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#19369

IData (the pipeline or Documents in the pipeline) holds name/value pairs. You can do the following without a significant RAM hit for the ‘duplicate’ long String values in the pipeline:

// Requires com.wm.data.IData and com.wm.data.IDataFactory.
String key0 = "Pretend I Am a Very Very Long Value";
String key1 = new String("Pretend I Am a Very Very Long Value");
String key2 = new String(key0);
// IDataFactory.create takes a two-dimensional array of name/value pairs.
Object[][] data = {
    { "key0", key0 },
    { "key1", key1 },
    { "key2", key2 },
    { "key3", "Pretend I Am a Very Very Long Value" },
};
IData pipeline = IDataFactory.create(data);

If I did:

String shorter = key0.substring(1);

I would get a new instance of a String object that takes up key0.length() - 1 characters of RAM.
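For anyone who wants to check the reference-versus-copy behaviour in plain Java, here is a small sketch. It uses only java.lang.String, so it says nothing about what Flow Map Copy does internally; it only illustrates immutability and interning. The class name and the variable named copy are made up for the example.

```java
public class StringIdentity {

    public static void main(String[] args) {
        String key0 = "Pretend I Am a Very Very Long Value";
        String key1 = new String("Pretend I Am a Very Very Long Value");
        String key3 = "Pretend I Am a Very Very Long Value";

        // Identical literals resolve to one interned instance.
        System.out.println(key0 == key3);          // true
        // new String(...) always creates a distinct String object...
        System.out.println(key0 == key1);          // false
        // ...with the same contents...
        System.out.println(key0.equals(key1));     // true
        // ...and intern() maps it back to the shared instance.
        System.out.println(key0 == key1.intern()); // true

        // Assigning a String copies only the reference.
        String copy = key0;
        System.out.println(copy == key0);          // true
        // "Changing" the destination creates a new instance; the source is untouched.
        copy = copy.substring(1);
        System.out.println(copy == key0);          // false
        System.out.println(key0);                  // Pretend I Am a Very Very Long Value
    }
}
```

This is consistent with both observations in the thread: changing the destination never affects the source, yet a plain assignment does not duplicate the character data.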

You should be able to use pub.io:mark and pub.io:reset to maintain your original InputStream in 6.1.
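At the Java level, the mark/reset approach looks roughly like the sketch below. It uses the standard InputStream.mark and InputStream.reset methods rather than the pub.io services, wraps the source in a BufferedInputStream because that class supports marking, and uses a made-up peekHead helper, sample content and head size.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class MarkResetPeek {

    // Read up to headSize bytes for inspection, then rewind to the marked position.
    static String peekHead(InputStream in, int headSize) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        // reset() is only guaranteed to work while no more than headSize bytes are read.
        in.mark(headSize);
        byte[] head = new byte[headSize];
        int total = 0;
        while (total < headSize) {
            int n = in.read(head, total, headSize - total);
            if (n == -1) {
                break; // stream is shorter than headSize
            }
            total += n;
        }
        in.reset();
        return new String(head, 0, total, StandardCharsets.US_ASCII);
    }

    // Hypothetical helper: read the rest of the stream into a byte array.
    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] file = "SENDER=ACME;RECEIVER=PARTNER;rest of the document"
                .getBytes(StandardCharsets.US_ASCII);

        // Wrapping in BufferedInputStream adds mark/reset support to the source stream.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(file));
        String head = peekHead(in, 16);
        byte[] delivered = drain(in);

        System.out.println("head: " + head);
        System.out.println("complete: " + (delivered.length == file.length));
    }
}
```

The markSupported() check matters because the base InputStream implementation reports false, and the read limit passed to mark() bounds how far you can read before reset() is no longer guaranteed.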

Fred

From the webMethods view, I have encountered more than one occurrence of the same string variable. For example, if an output string called “val” is produced by more than one flow, I have noticed multiple “val” variables in the pipeline at runtime (which points to the need for proper pipeline cleanup, of course).

Thought I’d mention this.

Thahir

Thahir,

One of the unique properties of an IData versus a Java hashtable is that it can have duplicate keys each with a unique value. Because the pipeline is just an IData you can certainly have multiple string variables with the same name (key) in the pipeline at the same time.

Mark

very good lesson
rock and roll

Not all InputStreams support mark and reset. Plus, there is a limit on how much you can read from the stream and still have reset work.

The following technique supports reading the head of a stream for doc recognition purposes while still passing a complete stream to TN for processing, without using mark/reset:

  1. Get the InputStream from some source, e.g. getFile, stream from the file polling port, network socket, etc.

  2. Read the first 100 bytes (or whatever number is needed) to be used for doc recognition (do NOT use streamToBytes, use pub.io:read). Convert the bytes to a string (bytesToString) and do whatever parsing is needed to get the doc type, sender, receiver, etc. Keep the 100 byte array in the pipeline.

  3. Call the bytesToStream service and pass the 100 byte array. It will return a ByteArrayInputStream object, using the byte array as the backing data.

  4. Call a Java service (you will need to create one) that accepts the new ByteArrayInputStream and the original InputStream from step 1. In that service, create and return a SequenceInputStream object.

  5. Pass that SequenceInputStream to TN.

The SequenceInputStream will use the first stream and, when that reaches EOF, it will use the second stream from the location where its read pointer is; in this case, starting at byte 101. To TN, it will appear as one continuous and complete stream.
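The core of steps 2 through 4 can be sketched in plain Java as below. This is not the Integration Server Java service itself (reading from and writing to the pipeline is omitted), and the class name, the readHead, rejoin and drain helpers, and the generated test data are assumptions for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Arrays;

public class HeadThenRest {

    // Step 2: read up to headSize bytes from the front of the original stream.
    static byte[] readHead(InputStream in, int headSize) throws IOException {
        byte[] head = new byte[headSize];
        int total = 0;
        while (total < headSize) {
            int n = in.read(head, total, headSize - total);
            if (n == -1) {
                break; // stream is shorter than headSize
            }
            total += n;
        }
        return Arrays.copyOf(head, total); // trim if the stream ended early
    }

    // Steps 3 and 4: replay the head bytes first, then continue with the
    // original stream from wherever its read pointer currently is.
    static InputStream rejoin(byte[] head, InputStream rest) {
        return new SequenceInputStream(new ByteArrayInputStream(head), rest);
    }

    // Hypothetical helper standing in for the downstream consumer (TN in the thread).
    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Made-up 1000-byte document standing in for the polled file.
        byte[] file = new byte[1000];
        for (int i = 0; i < file.length; i++) {
            file[i] = (byte) ('A' + (i % 26));
        }
        InputStream original = new ByteArrayInputStream(file);

        byte[] head = readHead(original, 100);     // used for doc recognition
        InputStream full = rejoin(head, original); // what gets passed downstream
        byte[] delivered = drain(full);

        System.out.println("head bytes: " + head.length);
        System.out.println("complete: " + Arrays.equals(file, delivered));
    }
}
```

Only the head bytes are ever held in memory; the rest of the document is still streamed from the original source, which is the main advantage over converting the whole file to a string and back.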

Hope this helps someone.

Strangely enough, I was looking at a service that does exactly what jbraunstein suggests, only yesterday. It works.

Yes, it indeed works. Just be aware of the limitations as mark/reset is not supported by all subclasses of InputStream.

Sorry, I misattributed.

I meant to say that the technique using SequenceInputStream, suggested by you, reamon, is the one that works.