I tried documents of 600KB and 120KB, and they are always compressed.
When I tried a 20KB document, it was not compressed.
It is said that the compress/uncompress option only applies to documents smaller than 8000 bytes.
Has that changed?
Thanks!
Frank
Hello Frank,
The 4.1.4 documentation for “tsd:compress”:
.../Tamino 4.1.4.1/Documentation/tslref/compress.htm
says the following…
You didn’t mention which of the settings you are using in your schema - could it be that the default is being used?
This might explain the behaviour you are seeing…
Greetings,
Trevor.
I use “smart”, but the document that is not compressed is 20K, well above 8000 characters.
My data is 330MB of documents. If “smart” is selected, the storage is about 500MB; if “always” is used, it takes 80MB.
I have a bunch of performance comparisons as well.
The algorithm has not been changed. However, it is a little complicated, and the documentation is obviously somewhat misleading.
First: There is no hard limit in terms of ‘original document size’ which can be used to decide whether the document will be compressed internally or not. Apart from the sheer size of the document, this depends on the document encoding, its structure, and the platform Tamino runs on.
Please note that the criterion below is stated in characters, not bytes. This can make a big difference: if the document consists mostly of plain English (ASCII) characters and has 8000 of them, its size is about 8000 bytes when encoded in UTF-8, but about 32000 bytes when encoded in UCS-4. Thus, to estimate whether the document will be compressed or not, you first have to determine how many characters it has, not how many bytes. (Of course, with ASCII encoding the number of characters equals the number of bytes.)
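To make the character/byte distinction concrete, here is a small Python sketch. The sample text is made up, and "utf-32-be" is used as a stand-in for UCS-4 (four bytes per character, no byte order mark); it says nothing about how Tamino itself counts.

```python
# Compare the character count of a document with its byte size
# under two encodings. The sample text is invented for illustration.
text = "a" * 8000  # 8000 ASCII characters

char_count = len(text)                      # number of characters
utf8_bytes = len(text.encode("utf-8"))      # 1 byte per ASCII character
ucs4_bytes = len(text.encode("utf-32-be"))  # 4 bytes per character, no BOM

print(char_count, utf8_bytes, ucs4_bytes)   # prints: 8000 8000 32000
```

So the same 8000-character document can be 8000 or 32000 bytes on disk, and it is the character count that matters for the rule of thumb.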
Documents are internally classified as small or large:
On Unix platforms, most documents that contain fewer than 8000 characters will be classified as small.
On Windows, most documents that contain fewer than 16000 characters will be classified as small.
However, depending on the structure of the document, it is possible that a document with significantly more characters than the limit mentioned above is classified as small, and it is also possible that a document with significantly fewer characters is classified as large.
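As a rough first guess only, the rule of thumb above could be written like this in Python. The function name and the platform keys are hypothetical, and because the real classification also depends on document structure, the result can be wrong in either direction.

```python
# Rule-of-thumb limits from this thread, in characters (not bytes).
# Hypothetical helper: Tamino's real classification also looks at structure.
SMALL_LIMIT_CHARS = {"unix": 8000, "windows": 16000}

def probably_small(text: str, platform: str) -> bool:
    """Guess whether a document would be classified as small."""
    return len(text) < SMALL_LIMIT_CHARS[platform]

doc = "a" * 8000
print(probably_small(doc, "unix"), probably_small(doc, "windows"))
```

Treat the answer as an estimate, not as a prediction of what the server will do with a particular document.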
Tamino uses two very different algorithms for compression. The first one, in the following called ‘strong compression’, has roughly the effect of compressing the original document with gzip. The second one, called ‘light compression’, has approximately the effect of converting the original document to UTF-8 encoding.
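To get a feel for the difference between the two effects, the following Python sketch measures both on an invented sample document stored as UTF-16. It only imitates the described effects (gzip and re-encoding to UTF-8); it is not Tamino's actual algorithm, and the function names are made up.

```python
import gzip

def light_size(doc_bytes: bytes, source_encoding: str) -> int:
    """Size after re-encoding to UTF-8 (stand-in for 'light compression')."""
    return len(doc_bytes.decode(source_encoding).encode("utf-8"))

def strong_size(doc_bytes: bytes) -> int:
    """Size after gzip (stand-in for 'strong compression')."""
    return len(gzip.compress(doc_bytes))

# Invented, highly repetitive sample document, stored as UTF-16.
sample = "<item><name>example</name><value>42</value></item>" * 200
stored = sample.encode("utf-16-le")

print("original bytes:", len(stored))
print("light (UTF-8): ", light_size(stored, "utf-16-le"))
print("strong (gzip): ", strong_size(stored))
```

For ASCII-only content already stored as UTF-8, the light variant saves nothing, which fits with storage not shrinking when documents are classified as small.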
compression=smart:
If a document is classified as small, it is not compressed with the strong compression algorithm. Instead, the light-compression algorithm is applied.
compression=always:
This setting disables the differentiation between large and small documents completely. All documents will be compressed with the strong compression algorithm.
compression=none:
This setting only affects documents that have been classified as ‘small’. As described above, with compression=smart these documents would undergo ‘light compression’. This light compression is suppressed by compression=none.
Thus, documents classified as ‘small’ will not be compressed in any way. Documents classified as ‘large’ will still undergo strong compression.
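The three settings can be summarised as a small lookup, sketched here in Python. The function name and the string labels are made up for illustration. The smart/large case is not spelled out explicitly above; it is inferred from the statement that large documents still undergo strong compression.

```python
def applied_compression(setting: str, doc_class: str) -> str:
    """Return 'strong', 'light' or 'none' for a setting and a classification."""
    if setting not in ("smart", "always", "none"):
        raise ValueError(f"unknown setting: {setting}")
    if doc_class not in ("small", "large"):
        raise ValueError(f"unknown classification: {doc_class}")
    if setting == "always":
        return "strong"   # small/large distinction is disabled
    if doc_class == "large":
        return "strong"   # large documents are compressed under smart and none
    # Small documents: light compression with smart, nothing with none.
    return "light" if setting == "smart" else "none"

for setting in ("smart", "always", "none"):
    for doc_class in ("small", "large"):
        print(setting, doc_class, "->", applied_compression(setting, doc_class))
```

Note that the classification itself (small or large) is an input here, since it cannot be derived reliably from the document size alone.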
Finally, I would like to stress that all this will probably change a lot in future versions of Tamino. However, I hope it helps a little.
regards, Martin
This is so clear. I really appreciate it!
Frank