DataHub offload time configuration

Product/components used and version/fix level:

10.17

Detailed explanation of the problem:

We’d like to adjust our DataHub data offloading process to run on a different schedule. Currently, it offloads data every hour. Is it possible to offload only data that is a month old, excluding the current real-time data in IoT Core? Our plan is to retain only the last month of data in IoT Core, while any data older than a month is offloaded to the data lake.

Error messages / full error message screenshot / log file:

Question related to a free trial, or to a production (customer) instance?

Production (customer)

Hi!

DataHub offloads an hour of data every hour. That way, with a retention of one month, if something goes wrong with the offloading for whatever reason, we/you have a month to fix it.
If you offloaded only data that is one month old, you would offload nothing in the first month; after that you would still offload one hour of data per hour, but with the immediate risk of data loss. What would be the benefit of that? What harm does the data do to your data lake if it is offloaded earlier?

Regards,
Michael

Thank you @Michael_Cammert! The reason for offloading only data that is a month old is that once the data is transferred to the data lake, there is no option to edit it.

That’s why we were wondering whether we could skip offloading the latest data and only offload data that is older than a month. Then we would always have one month of data in IoT Core where we can modify it. After a month, the data would be moved to the data lake, giving us the ability to modify data up to one month old.

Thanks,
Saif

Looking for help here…!

Hi Saif,

to my knowledge it is not possible to configure DataHub to offload only “old” data. Actually, I am wondering how you’d define that “old” data. Is it data that was created a month ago? Data that wasn’t touched/updated for a month?

All in all, it is usually not a problem that data can’t be updated in the lake, because if you update it in Mongo that change would be reflected in the next offloading.

Maybe you can elaborate a bit on your business problem?

Cheers, Christoph

I don’t think your use case is what is intended by typical data lake use cases. Normally you just store all data (changes) so you can also keep a record of when values have been changed. If you just offload once per month, you lose this kind of data record, because the data could have been changed multiple times in the meantime. If you are just interested in data older than one month, you can define a suitable query to do so (and retrieve the correct state at that time).

Still, you can implement that with data offloading filtering:
You can filter out “newish” data by time or creationTime so it is ignored by the hourly offloading, or vice versa: only data that is older than one month is offloaded every hour (see the sketch below the documentation link).

https://cumulocity.com/docs/datahub/working-with-datahub/#set-filter-predicate
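
For illustration, a filter predicate with a fixed cutoff could look like the following sketch. Filter predicates are Dremio SQL; the column names time and creationTime are assumptions based on the standard measurements schema, so verify them against your own offloading configuration:

```sql
-- Hypothetical filter predicate: exclude recent data from the hourly
-- offloading by comparing against a fixed cutoff.
-- "time" (measurement timestamp) is an assumed column name.
"time" < TIMESTAMP '2023-01-01 00:00:00'

-- Or, filtering by when the document was created in the platform
-- (again, "creationTime" is an assumed column name):
-- "creationTime" < TIMESTAMP '2023-01-01 00:00:00'
```

Note that such a cutoff is static, so it would have to be updated manually as time moves on.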

Hi!

It is correct that you can use a filter predicate to exclude certain time frames. The problem is that DataHub supports such filters only with static timestamps to compare against. Dynamic ones (computing the timestamp at the moment the offloading takes place) would require calling corresponding functions, which we currently do not support (Working with Cumulocity IoT DataHub - Cumulocity IoT documentation). This is because in certain cases the functions would be executed multiple times with different results, leading to inconsistencies. So, currently there is no way known to me to achieve an automated offloading delay.
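
To make the limitation concrete, here is a minimal sketch of the two variants (the column name "time" is again an assumption):

```sql
-- Supported: comparison against a static timestamp literal.
"time" < TIMESTAMP '2023-01-01 00:00:00'

-- Not supported: a dynamic cutoff computed at offloading time,
-- e.g. "everything older than one month". Function calls such as
-- CURRENT_TIMESTAMP cannot be used in the filter predicate.
"time" < CURRENT_TIMESTAMP - INTERVAL '1' MONTH
```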

However, it looks like a reasonable idea. So, feel free to raise a feature request for this.

Regards,
Michael

Thanks for the clarification about the dynamic timestamps @Michael_Cammert !