What's the document size threshold Tamino will not automatic

Frankfish · November 13, 2003, 1:12am

I tried document sizes of 600KB and 120KB; they are always compressed.

When I tried 20KB, the document was not compressed.

It is said that only documents smaller than 8000 bytes can have the compress/uncompress option.

Has that changed?

Thanks!

Frank

Trevor_Ford · November 14, 2003, 1:50am

Hello Frank,

the 4.1.4 documentation for “tsd:compress”:

   .../Tamino 4.1.4.1/Documentation/tslref/compress.htm

Says the following…

You didn’t mention which of the settings you are using in your schema - could it be that the default is being used?
This might explain the behaviour you are seeing…

Greetings,
Trevor.

Frankfish · November 14, 2003, 12:38pm

I use “smart”, but the document size is 20K instead of less than 8000 characters.

I have 330MB of documents. If “smart” is selected, the storage is about 500MB; if “always” is used, it takes 80MB.

I have a bunch of performance comparisons as well.

Guest · November 14, 2003, 6:34pm

The algorithm has not been changed; however, it is a little complicated, and the documentation is obviously somewhat misleading.

First: There is no hard limit in terms of ‘original document size’ which can be used to decide if the document will be compressed internally or not. Apart from the sheer size of the document, this depends on the document encoding, its structure, and on the platform Tamino runs on.

Please note that the criterion below uses the term characters, not bytes. This can make a big difference: if the document contains mostly Anglo-American characters, and has 8000 of them, it will have a byte size of about 8000 if encoded in UTF-8, but a byte size of about 32000 if encoded in UCS-4. Thus, to estimate whether the document will be compressed or not, you first have to determine how many characters it has, not how many bytes! (Of course, with ASCII encoding the number of characters equals the number of bytes.)
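
As a quick illustration of the character/byte difference, here is a small Python sketch. Python is used only for illustration and has nothing to do with Tamino itself; the function name `sizes` is made up for this example, and the `utf-32-be` codec (fixed 4 bytes per character, no byte order mark) stands in for UCS-4.

```python
def sizes(text: str) -> tuple[int, int, int]:
    """Return (character count, UTF-8 byte size, UCS-4 byte size)."""
    utf8_bytes = len(text.encode("utf-8"))
    # utf-32-be uses exactly 4 bytes per character and writes no BOM,
    # so it behaves like UCS-4 for the purpose of this size comparison.
    ucs4_bytes = len(text.encode("utf-32-be"))
    return len(text), utf8_bytes, ucs4_bytes


# 8000 plain ASCII-range characters, as in the example above.
print(sizes("a" * 8000))  # (8000, 8000, 32000)
```

The character count is the same in both cases; only the byte size depends on the encoding.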

Documents are internally classified as small or large:
On Unix platforms, most documents that contain fewer than 8000 characters will be classified as small.
On Windows, most documents that contain fewer than 16000 characters will be classified as small.
However, depending on the structure of the document, it is possible that a document with significantly more characters than the limit mentioned above is classified as small, and it is also possible that a document with significantly fewer characters is classified as large.
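
A rough way to estimate the classification from the character count alone might look like the Python sketch below. This is only a guess based on the two limits given above: the names `SMALL_LIMIT_CHARS` and `likely_size_class` are invented for this example, and since the real classification also depends on the document structure, the result can be wrong in either direction.

```python
# Approximate limits from the description above (characters, not bytes).
SMALL_LIMIT_CHARS = {"unix": 8000, "windows": 16000}


def likely_size_class(char_count: int, platform: str) -> str:
    """Estimate 'small' or 'large' from the character count only.

    Document structure is ignored here, so this is a first guess,
    not the classification Tamino actually performs.
    """
    limit = SMALL_LIMIT_CHARS[platform.lower()]
    return "small" if char_count < limit else "large"


print(likely_size_class(12000, "unix"))     # large
print(likely_size_class(12000, "windows"))  # small
```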

Tamino uses two very different algorithms for compression. The first one, in the following called ‘strong compression’, has roughly the effect of compressing the original document with gzip. The second one, called ‘light compression’, has approximately the effect of converting the original document to UTF-8 encoding.

compression=smart:
If a document is classified as small, it is not compressed with the strong compression algorithm. Instead, the light-compression algorithm is applied.

compression=always:
This setting disables the differentiation between large and small documents completely. All documents will be compressed with the strong compression algorithm.

compression=none:
This setting only affects documents that have been classified as ‘small’. As described above, with compression=smart these documents would undergo ‘light compression’. This light compression is suppressed by compression=none.
Thus, documents classified as ‘small’ will not be compressed in any way. Documents classified as ‘large’ will still undergo strong compression.
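
The three settings can be summarised as a small decision function. The Python sketch below only restates the rules described above; the function name `applied_compression` and the return strings are made up for this example and are not Tamino identifiers.

```python
def applied_compression(setting: str, size_class: str) -> str:
    """Return 'strong', 'light' or 'none' for a setting and size class.

    setting:    'smart', 'always' or 'none'
    size_class: 'small' or 'large' (the internal classification)
    """
    if setting not in ("smart", "always", "none"):
        raise ValueError(f"unknown setting: {setting!r}")
    if size_class not in ("small", "large"):
        raise ValueError(f"unknown size class: {size_class!r}")

    # 'always' ignores the small/large distinction completely.
    if setting == "always":
        return "strong"
    # Large documents get strong compression with 'smart' and with 'none'.
    if size_class == "large":
        return "strong"
    # Small documents: light compression with 'smart', nothing with 'none'.
    return "light" if setting == "smart" else "none"


for setting in ("smart", "always", "none"):
    for size_class in ("small", "large"):
        print(setting, size_class, "->", applied_compression(setting, size_class))
```

Note that only the combination of compression=none and a ‘small’ document leaves the document completely uncompressed.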

Finally, I would like to stress that all this will probably change a lot in future versions of Tamino. However, I hope it helps a little.

regards, Martin

Frankfish · November 17, 2003, 1:37am

This is so clear. I really appreciate it!

Frank
