Hi all,
We have a requirement to handle large files, in excess of 200 MB and up to 1+ GB, via HTTP/S (yes, HTTPS only, no FTP). Besides enabling the large-file handling flags, what else should be considered when processing files of this size?
1- Is it worth using the TN and having all this data stored in the TN DB?
2- Should we use a broker to send this data to other backend IS/Adapters or go with FTP or some other connectivity from this first IS to backend IS?
3- Ideas about keeping the state of this whole data/file in memory for translation/mapping?
Your mapping/transformation processes will need to consider large file handling too. You’ll be processing the file in chunks.
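To make that concrete, here's a minimal sketch (plain Java, not IS flow code) of chunk-at-a-time processing; the 64 KB buffer size and the pass-through transform are just placeholders:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ChunkedTransform {
    // Process a large stream one fixed-size chunk at a time so the
    // whole file is never held in memory. 64 KB is an arbitrary choice.
    static void transform(InputStream in, OutputStream out) throws IOException {
        byte[] chunk = new byte[64 * 1024];
        int read;
        while ((read = in.read(chunk)) != -1) {
            // Placeholder: apply your mapping/translation to this chunk only.
            out.write(chunk, 0, read);
        }
        out.flush();
    }
}
```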
It depends. If a failure occurs, is it okay to have the entire file sent via HTTPS again? Do you need to retain the data on your servers for business reasons? Do you have enough database space?
Broker is not designed to handle large files. If the large file consists of lots of little stand-alone documents, then sending those individually through Broker might be appropriate. If the entire file needs to be processed as a whole, then you should not use Broker (I think it will actually choke at some point because the max size of its queue files might be exceeded with the sizes you're talking about).
Read through the large file handling documentation that wM provides. It will give you some good information.
1- Is it worth using the TN and having all this data stored in the TN DB?
TC: If you need to store it somewhere, TN is a decent place. It becomes quite nice if you need to use the partner management facilities.
2- Should we use a broker to send this data to other backend IS/Adapters or go with FTP or some other connectivity from this first IS to backend IS?
TC: I think it's in your best interest to minimize the number of hops the giant file has to go through, and I'd try to avoid the Broker altogether. Unless it's a single indivisible file, like a movie or CAD file, I'd try to break it up as soon as you can.
3- Ideas about keeping the state of this whole data/file in memory for translation/mapping?
TC: Avoid it. Use streams. Break it into chunks. Use multiple IS servers. Be smart about processing it, because brute force won't work in IS with files of this size. Think about this: in many cases, the amount of memory available to each process is limited (often about 2 GB). If you've got a 1 GB file, then after IS overhead you could probably have only one instance in memory at a time. And IS does a LOT of pipeline duplication, so be careful.
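As a rough illustration of the streams-and-chunks point, and assuming the file is record-oriented (one record per line), something like this keeps only one record in memory at a time; the handle method is a placeholder for your mapping logic:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

public class RecordStreamer {
    // Assuming the big file is record-oriented (one record per line),
    // only one record needs to live in memory at a time.
    static void process(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            String record;
            while ((record = reader.readLine()) != null) {
                handle(record); // placeholder for per-record mapping
            }
        }
    }

    private static void handle(String record) {
        // map/translate a single record here
    }
}
```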
RMG/Rob,
Thanks for your suggestions. This is a good start.
Unfortunately, chopping up the file into small chunks may not be possible before we hit the backend processing IS. So what would be an appropriate transfer/communication mechanism from the front-end IS to the backend IS for a file this size?
I am having an extremely hard time trying to convince people to avoid using Broker for this setup, especially when we also don't have too many integration points on the backend. But some folks just want to use Broker because it is there… and has guaranteed delivery, supports async, decoupled integrations, and so on…
Here's the easiest way to convince them not to use the Broker: it won't work.
Broker holds messages in their entirety in memory. For your situation, a 200 MB file might work, but a 1 GB file will almost certainly bring the Broker down. Broker is designed to handle lots of relatively small documents, not huge batch files. One of the primary reasons message brokers exist is to allow the shift from batch processing to near real-time processing. That means that business events (new customer record created, order received, invoice issued, etc.) get published right away individually, not held for batch processing in big groups later.
The appropriate mechanism to transfer between IS instances would be to stream the file and process it as a stream on both sides.
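As a sketch of what streaming between instances can look like at the Java level (the URL and chunk size here are made-up examples, not your actual endpoints), chunked transfer encoding keeps the sender from buffering the whole request body:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class StreamingPost {
    // Push a large file to the backend over HTTP(S) without buffering it.
    // Chunked streaming mode stops HttpURLConnection from caching the
    // whole request body in memory. The URL is a made-up example.
    static void send(InputStream file) throws Exception {
        URL url = new URL("https://backend-is.example.com:5443/receive");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setChunkedStreamingMode(64 * 1024); // 64 KB chunks on the wire
        try (OutputStream out = conn.getOutputStream()) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = file.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        if (conn.getResponseCode() != 200) {
            throw new RuntimeException("Transfer failed: " + conn.getResponseCode());
        }
    }
}
```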
"…and has guarenteed delivery, supports asynch, decoupled integrations and so on… "
Guaranteed delivery – IS has GD facilities too. Review the services in the remote folder in WmPublic. Remember too that Broker GD is between the adapters and Broker only. Once “your code” has it, all bets are off (you have to do the GD work).
Async – Without Broker, there is a bit of work needed to do IS-side queuing. The IS GD facilities can provide this to a degree (it retries for some time before giving up; you can configure this). You can use TN as well, though it uses the same sort of retry-a-configured-number-of-times approach (a rough sketch of that pattern follows after this list).
Decoupled integrations – Ooo boy, you hit a hot button of mine with this one! IS provides decoupled integrations too. The use of any intermediary between two apps provides decoupling. The document-producing app has no idea who/what/where the receiving app(s) are. There are a couple of threads on this topic that you may find helpful.
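Here's the rough retry sketch promised above. This is just a generic illustration of the retry-a-configured-number-of-times pattern that IS GD and TN follow, not the actual IS API:

```java
public class RetryDelivery {
    // Generic retry-a-configured-number-of-times loop. Illustrative only;
    // IS/TN guaranteed delivery implements this idea internally.
    static void deliverWithRetries(Runnable delivery, int maxAttempts,
                                   long waitMillis) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                delivery.run();
                return; // delivered successfully
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) {
                    throw e; // give up after the configured number of tries
                }
                Thread.sleep(waitMillis); // wait before the next attempt
            }
        }
    }
}
```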
Rob,
Thanks for providing these links. I have tried using your/these arguments, but I'll try again and again… Everybody thinks that doing something like async with IS is totally different from having Broker, which was designed specifically for this type of work. Maybe I am biased, but I still think that for our given situation a simpler architecture should suffice for our needs.
Also, is it true that we can have IS 6.x talk to only one Broker? Can we start multiple IS 6.x instances on the same physical box (Solaris)?
Btw, I did not know that there were two different Broker and IS “nations” out there…
Do prototypes of both solutions and see what happens. Then it’s no longer a hypothetical argument.
If they insist on Broker, suggest a hybrid approach: Broker can be used for publishing metadata docs as notifications (file location, status, name, etc) and IS can do the heavy lifting, pulling the file over and processing it.
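If you go the hybrid route, the notification doc can be tiny. Here's a hedged sketch of building one; the field names are only examples, and the actual publish mechanism depends on your Broker setup:

```java
import java.util.HashMap;
import java.util.Map;

public class FileNotification {
    // Build the small "the file you want is over there" notification that
    // gets published through Broker. Field names here are illustrative.
    static Map<String, Object> build(String fileName, String location,
                                     long sizeBytes, String status) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("fileName", fileName);
        doc.put("fileLocation", location); // where IS should pull the file from
        doc.put("sizeBytes", sizeBytes);
        doc.put("status", status);
        return doc;
    }
}
```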
Each IS is usually associated with one Broker. If the box is big enough, you can run multiple IS instances on it. However, for the sizes you're talking about, expect to allocate 2 GB of RAM or more to each instance of IS (remember to account for other overhead).
Great advice Tate. The approach of publishing messages that say “the file you want is over there” can be very effective.
IS is limited to connecting to one Broker, which is another source of extreme frustration for me. It’s an arbitrary and constraining design decision. IS can connect to any number of WebSphere MQ brokers/managers (a primary competitor that does the same thing as Broker) but only one Broker. Seems really short-sighted to me.
Haven’t tried this, but you might be able to get around this limitation by using JMS. Since Broker is a JMS provider, one should be able to specify any arbitrary number of connections to an arbitrary set of Brokers.
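A sketch of what that might look like with the plain javax.jms API. I haven't verified this against Broker; the context factory class, provider URL, and factory name below are assumptions you'd replace with whatever your Broker's JNDI provider actually exposes:

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.naming.Context;
import javax.naming.InitialContext;
import java.util.Hashtable;

public class MultiBrokerJms {
    // Open a JMS connection to one Broker; call once per Broker with
    // different JNDI settings to get around the one-Broker-per-IS limit.
    static Connection connect(String providerUrl, String factoryName)
            throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        // Assumption: Broker's JNDI context factory class name.
        env.put(Context.INITIAL_CONTEXT_FACTORY,
                "com.webmethods.jms.naming.WmJmsNamingCtxFactory");
        env.put(Context.PROVIDER_URL, providerUrl);
        Context ctx = new InitialContext(env);
        ConnectionFactory factory = (ConnectionFactory) ctx.lookup(factoryName);
        Connection conn = factory.createConnection();
        conn.start();
        return conn;
    }
}
```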
Is the requirement to use HTTPS instead of FTP a matter of security? If this is driving your architecture, then we can explore alternatives. Using HTTPS means the sending IS needs to read data into memory before being able to push it out. With FTP, you don’t have to if you don’t want (and hence avoid the memory issue).
But that still leaves us with the security issue, doesn't it? Here's where it gets interesting. webMethods has a package to do SSH protocols (SSH, SFTP, SCP), which are encrypted and can also use certificates for authentication. This entails a lot more setup for the OS folks, but it pushes most of the processing out of webMethods…
If you think this conforms to your requirements, just search for OpenSSH in the wMUsers forums.
Oops, one more thing: the OpenSSH package merely takes care of transporting the data. You'll still need to worry about handling the data. It's much more of a problem on the receiving end if IS actually needs to process it (i.e. transforming, starting models, etc.)…
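For what it's worth, if the OpenSSH route appeals, here's a speculative sketch of handing the transport off to the OS-level scp client from Java, so the file never passes through the JVM heap; the host, paths, and pre-existing key-based auth are all assumptions:

```java
import java.io.IOException;

public class ScpPush {
    // Hand the transfer to the OS's OpenSSH client (scp). Key-based auth
    // must already be configured at the OS level; -B forces batch mode
    // so scp never prompts for a password.
    static void push(String localPath, String remoteSpec)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder("scp", "-B", localPath, remoteSpec)
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new IOException("scp exited with " + p.exitValue());
        }
    }
    // Example (hypothetical host and paths):
    // push("/data/out/bigfile.dat", "partner@backend-is:/data/in/");
}
```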