Looking for best practices for requesting data from Cumulocity and DataHub

What product/components do you use and which version/fix level are you on?

Cumulocity IoT (version 1013.0.260) and DataHub

Is your question related to the free trial, or to a production (customer) instance?

It is related to a customer instance which we are using with partners in the Software AG Research department.

What are you trying to achieve? Please describe it in detail.

We have a lot of data that we would like to access with a service from outside of Cumulocity and DataHub. We want to get the data in a customized JSON format and would prefer a high-performance solution for this. We are currently using the REST API for our requests, but this solution is very slow.

I was hoping that some members of this community might have some more experience with use cases like this and will be able to give me some tips.

Do you get any error messages? Please provide a full error message screenshot and log file.

We don’t have an error. We are mainly looking for tips and best practices since we haven’t got that much experience with the technology.

Have you installed all the latest fixes for the products and systems you are using?

Hi Fabian!

You describe your use case only briefly and make a very general statement that the solution uses REST and is slow.

In order to help you, we would need to understand more details, including but not limited to information about your data sizes, structures, queries, and architecture (where is the JSON transformation done?). It would also help a lot to know which variant of the REST API you are using (DataHub or Dremio / Standard or High-Performance).
Moreover, it would be helpful to know which aspect of your solution is slow, e.g. query execution, result transfer, JSON transformation, …
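
One way to find out which aspect is slow is to time each stage separately. The following is only a minimal sketch: `fetch_page` and `transform` are hypothetical placeholders for the real request and the real JSON conversion, and the stage names are made up for illustration.

```python
import json
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named stage.
timings = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the elapsed time of the enclosed block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start


def run_pipeline(fetch_page, transform, pages):
    """Fetch and transform `pages` pages, timing each stage separately.

    fetch_page(page_number) and transform(raw_page) are placeholders
    (assumptions) for the real request and the real JSON conversion.
    """
    results = []
    for page in range(1, pages + 1):
        with timed("fetch"):
            raw = fetch_page(page)
        with timed("transform"):
            results.extend(transform(raw))
    with timed("serialize"):
        payload = json.dumps(results)
    return payload
```

After a run, `dict(timings)` shows how the total time splits between fetching, transforming and serializing, which is the kind of breakdown that makes the problem easier to discuss.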

Maybe it is easier if you contact me for a call internally.

Regards,
Michael

Hi Fabian,

Some more details would help. What do you have in mind when you write about a “high performance solution”, and what does “slow” mean for you when using the REST API?
I think it is also related to the queries you are using and, of course, the amount of data in MongoDB.

Maybe your REST requests can be optimized if you provide an example with the current response times and the response time you expect.

There are multiple other options available for data extraction. One could be to push all messages to a high-performance layer (cache) using the Notification API and to provide an API that allows queries on that cache. The effort to implement that (and to keep it in sync) is pretty high, though.

Hi Stefan,
thanks for your feedback so far. I’ll try to provide some more information:

  1. High performance means for me that it shouldn’t take several hours to access and transform e.g. a million datapoints. The current solution takes that long, and I’m sure there are several reasons for it, which is why I made such a general request to get feedback on all aspects of data gathering.

  2. The request I’m using searches for values of a specific series in a given time range. To gather all the data for this time range I’m also using the current page and page size parameters. For DataHub the standard REST API is used, so I first post my request as SQL and, once the query is done, request the results with a limit and offset parameter. The query is executed on a space in which one series gets collected.
    I don’t have expected response times; I just have very slow current execution times (about an hour for 200,000 values).

  3. I’d prefer a simple solution that can be deployed to a Docker container as well. The service will be hosted at a partner, and as a dual student I’m not working on the project for the whole year, so I can’t guarantee maintenance and might not have the skills needed to implement a good complex solution.

  4. Is there a way to check the current amount of data in the Mongo DB? I think that currently there are several million values in it but I can’t tell you the correct number of values.
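
The DataHub flow described in point 2 (post the SQL, wait until the query is done, then read the results with limit and offset) can be sketched roughly as follows. This is only an illustration under assumptions: `submit`, `get_state` and `get_rows` are hypothetical wrappers around the actual HTTP calls, and the state names and the default `limit` are placeholders that have to be checked against the API in use.

```python
import time


def run_sql_paged(submit, get_state, get_rows, sql, limit=500, poll_seconds=1.0):
    """Submit a query, wait for completion, then yield rows page by page.

    The three callables are hypothetical wrappers around the HTTP calls:
      submit(sql)                      -> job id
      get_state(job_id)                -> e.g. "RUNNING", "COMPLETED", "FAILED"
      get_rows(job_id, offset, limit)  -> list of row dicts
    The state names are assumptions, not taken from the API documentation.
    """
    job_id = submit(sql)

    # Poll until the query has finished or failed.
    while True:
        state = get_state(job_id)
        if state == "COMPLETED":
            break
        if state in ("FAILED", "CANCELED"):
            raise RuntimeError(f"query {job_id} ended in state {state}")
        time.sleep(poll_seconds)

    # Read the result set in chunks; a short chunk marks the end.
    offset = 0
    while True:
        rows = get_rows(job_id, offset, limit)
        yield from rows
        if len(rows) < limit:
            return
        offset += len(rows)
```

Keeping the HTTP calls behind small callables makes it possible to time the query execution and the result transfer independently, and to test the paging logic without a tenant.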

Thanks for the details! A few more questions: What is the use case for putting such a big amount of data into Cumulocity IoT and is this data really needed there for analytics or visualization or might it be an option to bypass Cumulocity? How are your measurements structured? Do you have multiple series with the same timestamp or one value per measurement?

In Cumulocity IoT you’re bound to the API but you can optimize how you STORE data there and of course how you RETRIEVE that data again.

To 2: Sounds like it could be optimized if you provide the structure and an example of the query.
To 3: You can implement any kind of microservice, but you are still bound to the Cumulocity IoT or DataHub API, so it wouldn’t make much difference. An additional high-performance storage with an access layer (API) must run outside of Cumulocity, but, as you mentioned, it is a 100% custom implementation and must be maintained.
To 4: In “Administration” you can see the amount of data used in your tenant.

To answer your questions:

  1. Our current use case is the connection of small industrial machines via OPC UA in order to gather their data, visualize measurements and provide the data to partners outside of Cumulocity and Software AG for AI development. We decided to use Cumulocity and DataHub for that use case because they seemed to fulfil all the requirements we had in mind. I’m not sure if there is a way to bypass Cumulocity for the whole use case, but depending on the solution I could imagine bypassing it for the data gathering.

  2. A query looks, for example, like this:

    https://{{url}}/measurement/measurements/?valueFragmentType={{type}}&valueFragmentSeries={{series}}&dateFrom=YYYY-MM-DDTHH:MM:SSZ&dateTo=YYYY-MM-DDTHH:MM:SSZ&pageSize=2000&currentPage={{currentPage}}

    I think I can leave out the “type” field but all other fields seem necessary to me. The current page parameter is needed because I have way more than 2000 measurements per series and no other solution for getting all results of a request.

  3. Measurements are structured with the standard information that a device protocol for OPC UA uses. As JSON it looks like this:
    {
      "self": "measurement URI",
      "time": "time",
      "id": "id",
      "source": {
        "self": "machineURI",
        "id": "machineId"
      },
      "type": "type",
      "type": {
        "series": {
          "unit": "unit",
          "value": value (mostly float or double)
        }
      },
      "OPC UA node Id": {}
    }

    Measurements like that one are gathered in an array in the output of the REST-request.

  4. We have multiple series, but due to OPC UA processes their timestamps differ most of the time. The timestamps of the measurements are from the original OPC UA server and not from Cumulocity itself.
    All the measurements contain a single value with a related timestamp.
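
For reference, 200,000 values at 2,000 per page are 100 requests, so if the one-hour figure refers to this endpoint, each request takes about 36 seconds on average. The paging and a possible reduction to a compact JSON structure can be sketched as below. This is a rough sketch, not a definitive implementation: `fetch` is a hypothetical callable that performs the GET request and returns the decoded body, the `measurements` key of the response is an assumption, and the output fields are only one possible custom format.

```python
def query_params(fragment, series, date_from, date_to, page, page_size=2000):
    """Query parameters matching the request shown in point 2."""
    return {
        "valueFragmentType": fragment,
        "valueFragmentSeries": series,
        "dateFrom": date_from,
        "dateTo": date_to,
        "pageSize": page_size,
        "currentPage": page,
    }


def iter_measurements(fetch, fragment, series, date_from, date_to, page_size=2000):
    """Yield measurements page by page until a short page is returned.

    fetch(params) is a hypothetical callable that performs the GET request
    and returns the decoded JSON body as a dict. The "measurements" key is
    an assumption about the response layout.
    """
    page = 1
    while True:
        body = fetch(query_params(fragment, series, date_from, date_to, page, page_size))
        batch = body.get("measurements", [])
        yield from batch
        if len(batch) < page_size:
            return
        page += 1


def flatten(measurement, fragment, series):
    """Reduce one measurement to a compact dict (illustrative output format)."""
    point = measurement[fragment][series]
    return {
        "time": measurement["time"],
        "source": measurement["source"]["id"],
        "value": point["value"],
        "unit": point.get("unit"),
    }
```

Because the generator yields measurements as they arrive, the transformation can run while paging and nothing has to be held in memory beyond one page. It may also be worth testing whether several smaller time windows respond faster than one window with many pages.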

Thank you for your feedback on my questions. To me, a 100% custom implementation with data storage outside of Cumulocity and DataHub doesn’t look like a solution for my use case, since it’s a research prototype and shouldn’t need that much maintenance to keep it running.

I would really appreciate a tip for the STORE and RETRIEVE optimizations in Cumulocity and hope that my given answers can help you to understand the data structure we are currently using.