The advent of the Internet of Things (IoT) enables companies to connect all their devices and access the device data in digital environments, complemented by various means to gain knowledge from that data. Cumulocity IoT from Software AG is at the forefront of helping companies on this IoT journey.
Cumulocity IoT DataHub is a new pillar of the Cumulocity IoT platform, which introduces the notion of large-scale data analytics. Not only does DataHub allow you to run sophisticated analytics tasks over long-term IoT data, it also helps you to move your IoT data into a data lake.
1.1 Analytics in Cumulocity IoT today
The first step in any IoT project with Cumulocity IoT is to connect your devices to the platform. Once that is done, the next step is to build applications on top of the data your devices emit. Cumulocity IoT offers different approaches for working with device data:
- REST API: run ad-hoc queries against the latest state of your devices
- Apama: analyze device data on-the-fly so that critical situations can be detected with near-zero latency
- TrendMiner: uncover temporal relationships within series of device data
- Zementis Machine Learning: generate predictions on device data using machine learning models
Cumulocity IoT DataHub augments that portfolio with large-scale data analytics. DataHub allows you to run complex and resource-intensive analytical queries against your device data without adversely affecting the operational database of the Cumulocity IoT platform. DataHub achieves that by replicating device data into a data lake, so that analytical queries run against the data in the data lake instead. Those who have been around the IT world for a while may see parallels to the rise of data warehouses. Like data warehouses, DataHub is designed for analytical processing, which encompasses aggregate queries over time series of device data. Unlike data warehouses, DataHub’s architecture allows compute and storage to scale independently, and storage costs are much lower than in typical data warehouses. DataHub’s storage layout takes the characteristics of analytical queries into account, providing both optimal performance and cost efficiency. In the machine room of DataHub, Dremio oversees moving data to the data lake and running queries against it. Dremio, a commercialized successor of Apache Drill (which in turn is an open-source implementation of Google’s Dremel/BigQuery), is the leading data lake engine, tailored to operationalize data lake storage and speed up analytics processing.
1.2 Moving data from the operational database to the data lake
The operational database of Cumulocity IoT handles the document-based storage of events, alarms, inventory, and measurements data. To keep the database lean and performant, data is only stored for a limited time. DataHub circumvents those retention policies by replicating all data into a data lake of your choice, e.g. Amazon® S3 (and API-compatible products such as MinIO®), or Microsoft® Azure® Data Lake Storage Gen1 and Gen2 (Azure Storage). Those data lakes are purpose-built for storing huge amounts of data in a scalable and inexpensive manner.
Getting device data into the data lake requires you to define an offloading pipeline. Such a pipeline defines which collection from the operational database you want to offload and where the results are to be stored in the data lake. DataHub does not bother you with cumbersome definitions of how the original document-based data is to be transformed. Using detailed knowledge of Cumulocity’s data model, DataHub automatically transforms the data into a tabular format, which is optimal for machine learning, business intelligence, and most custom application use cases. If you want to preprocess or clean the data before it is offloaded, you can define additional result columns or add filter predicates for a more fine-grained selection of the data to be offloaded.
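To make this concrete, the sketch below captures the ingredients of such a pipeline definition as a plain Python dictionary. This is purely conceptual; in DataHub you configure pipelines interactively, and all field names here are hypothetical illustrations, not the actual configuration format.

```python
# Conceptual sketch of what an offloading pipeline specifies. The field
# names are hypothetical; in DataHub you define pipelines via the UI.
offloading_pipeline = {
    # collection in the operational database to offload
    "source_collection": "events",
    # target table name under which the data appears in the data lake
    "target_table": "events_offloaded",
    # optional extra result columns derived from the original documents
    "additional_columns": {
        "temperature_celsius": "c8y_TemperatureMeasurement.T.value",
    },
    # optional filter predicate; only matching documents are offloaded
    "filter_predicate": "type = 'c8y_LocationUpdate'",
}
```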
Once you have defined the pipeline, all that remains is to activate it. The initial run of a pipeline copies all data from the operational database into the data lake. From then on, DataHub periodically checks for new data that has not yet been offloaded and triggers the pipeline so that these increments are copied into the data lake. Of course, DataHub makes sure that no data is duplicated or lost. The screenshot below shows an active offloading pipeline for the events collection.
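Conceptually, this incremental behavior resembles the watermark pattern sketched below. This is not DataHub’s actual implementation, just a minimal illustration of how periodic, duplicate-free increments can work; `fetch_new_records` and `write_to_data_lake` are hypothetical callables standing in for the real machinery.

```python
from datetime import datetime, timezone

# Conceptual illustration of incremental offloading (not DataHub's actual
# implementation): remember the point in time up to which data has already
# been offloaded, and copy only newer records on each periodic run.
watermark = datetime(1970, 1, 1, tzinfo=timezone.utc)  # initial run copies everything

def run_offload(fetch_new_records, write_to_data_lake):
    """One scheduled pipeline execution."""
    global watermark
    now = datetime.now(timezone.utc)
    # fetch only records that arrived after the last successful run
    records = fetch_new_records(since=watermark, until=now)
    write_to_data_lake(records)
    # advance the watermark only after a successful write, so a failed
    # run is simply retried and no increment is lost or duplicated
    watermark = now
```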
What does the data in the data lake look like? The transformed, tabular data is stored in the Apache Parquet format, which gives you an analysis-friendly as well as storage-efficient columnar representation of your data. Apache Parquet is the de facto standard data format among “big data” tools, giving you the freedom to analyze the data not only with DataHub, but also with tools such as Apache Spark™. Taking common analysis patterns into account, DataHub arranges the Parquet files in a temporal folder hierarchy. Additional housekeeping mechanisms in the background regularly compact smaller Parquet files, which boosts overall query performance (think of defragmenting a hard disk back in the day).
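Because the offloaded data is plain Parquet, any Parquet-capable tool can read it directly. As a minimal sketch, assuming your data lake is an S3 bucket and the s3fs package is installed alongside pandas/pyarrow, the hypothetical path below points at one month of such a temporal folder hierarchy:

```python
import pandas as pd

# Read one month of offloaded events straight from the data lake. The
# bucket name and year/month folder layout are hypothetical examples of
# the temporal hierarchy; requires pyarrow and s3fs to be installed.
df = pd.read_parquet("s3://my-datahub-lake/events/2021/03/")
print(df.head())
```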
1.3 Gain knowledge from your long-term IoT data
With offloading pipelines in place to move data into the data lake, it is now time to discuss how to run analytics over your offloaded data. DataHub offers SQL as its query interface; SQL has been the lingua franca of data processing and analytics for many years. Dremio is the internal engine executing the SQL queries. Due to its highly scalable nature, Dremio can easily cope with many concurrent analytical queries.
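To give a flavor of such queries, here is a hypothetical daily aggregate over offloaded measurement data. The table and column names are illustrative only; they depend on your tenant and on how you configured the offloading pipeline. You would submit a query string like this through any of the connection options listed next.

```python
# A hypothetical analytical query over offloaded measurement data; the
# table and column names are illustrative and depend on your own
# offloading configuration.
QUERY = """
    SELECT source,
           DATE_TRUNC('DAY', "time") AS day,
           AVG(temperature_celsius)  AS avg_temperature
    FROM   datalake.measurements_offloaded
    WHERE  "time" >= TIMESTAMP '2021-01-01 00:00:00'
    GROUP  BY source, DATE_TRUNC('DAY', "time")
    ORDER  BY day
"""
```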
With DataHub, you can quickly connect the tool or application of your choice, including:
- Business Intelligence tools using JDBC or ODBC
- Data science applications using Python scripts, which connect via ODBC (see the sketch after this list)
- Custom applications using JDBC for the Java ecosystem, ODBC for .NET, Python, etc., and REST for (Cumulocity IoT) web applications
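For instance, a Python script could connect via ODBC roughly as follows. This is a sketch assuming the Dremio ODBC driver is installed and configured; the DSN, credentials, and table name are placeholders for your own setup.

```python
import pyodbc

# Connect to DataHub's Dremio engine via ODBC. DSN, user, and password
# are placeholders for your own ODBC configuration.
conn = pyodbc.connect("DSN=Dremio;UID=myuser;PWD=mypassword", autocommit=True)
cursor = conn.cursor()

# Run an aggregate query over a hypothetical offloaded table.
cursor.execute(
    "SELECT source, COUNT(*) AS events "
    "FROM datalake.events_offloaded GROUP BY source"
)
for source, events in cursor.fetchall():
    print(source, events)

conn.close()
```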
The figure below summarizes the high-level concept of DataHub.
Using DataHub, you are no longer limited to the IoT world, but can bridge the gap to the large analytics ecosystem and the world of business applications.
1.4 From the cloud to the edge
DataHub is designed as a cloud-native application, with all of its components running as microservices/containers in Kubernetes clusters deployed in private or public clouds.
A growing number of use cases, however, also demands local processing. For example, on a shop floor, IoT devices are often connected to local computers instead of remote Cloud platforms, and the data needs to be processed locally instead of being moved entirely to the Cloud. DataHub serves those use cases by providing an Edge edition, which is a paid add-on for Cumulocity IoT Edge. As its storage layer, DataHub Edge uses the local storage of the Edge device. Apart from horizontal scalability, DataHub Edge is the same as the Cloud edition. So you have the same DataHub experience, whether it is deployed in the Cloud or on the Edge.
1.5 Learn more about Cumulocity IoT DataHub
Eager to learn more about how you can leverage DataHub for low-cost storage of all your IoT data and for gaining knowledge from that data? Visit us at www.SoftwareAG.com/iot!