How to offload arrays in DataHub

Product/components used and version/fix level:

DataHub, version 10.16

Detailed explanation of the problem:

So, I do have a fragment within our digital twin looking a bit like this:

"additional": [
  {
    "type": "A23",
    "number": "82746t1771"
  },
  {
    "type": "B12",
    "number": "1239656691"
  },
  {
    "type": "A08",
    "number": "123966919"
  }
]

With a varying number of array entries.

I want to offload this data to our data lake “somehow” but I don’t know how to deal with the array. Is it possible to collect array entries into two columns like:
additional_type and additional_number (each then holding all array entries as comma separated string)? Or maybe something like additional_type_1 and additional_number_1 etc. for individual values, accepting a maximum number of array entries?

Thank you and kind regards, Christoph

Hi!

The problem here is that with respect to additional columns, you have to configure them - this currently implies you know their number at design time of your offloading.

So, you can either just offload the whole arrays as one column and then take care of accessing the details when querying the data lake (the function flatten can be very handy for that). Or you can offload a fixed number of positions by accessing the array additional[0].“type”

Regards,
Michael

Thanks Michael,

yes, unpacking the nested fields into individual arrays within the inventory collection would have been my preferred choice but this cannot be achieved directly. Together with Alexander Friedrich from Software AG we worked out these 3 options:

Option 1: Known number of array entries

Just as you stated - if the (max) number of array elements is known we can simply export specific indexes into named columns e.g. src.additional[0].typeadditional_type_0 and so on.

This is definitely the easiest way to do this.

Option 2: Flatten entire array

It is possible to “flatten” the array during offloading which will produce multiple rows in the offloaded table. The target column type will be of type STRUCT. For example: If FLATTEN(src.additional) is offloaded to an additional column, I can later access individual values within this column like src.additional.type.

This has two disadvantages: A) multiple rows are being created which creates a bit of redundancy and B) all fields from the array are being offloaded. It is not possible to just select specific nested fields and they cannot be offloaded to individual columns either.

Option 3: Export the array as ARRAY/LIST and deal with it later

If the array fragment is offloaded as-is it will be a single column of type LIST. This list and the elements within (of type STRUCT) can be access it queries, though. We can use this to create a Virtual Data Set (VDS) which holds just specific nested fields of the array in a nice, normalized way.

In this (phony) example I would create an inventory_additional VDS which is based on the offloaded inventory collection like so:

SELECT o.id, o.lastUpdated, o.add.type, o.add.number
FROM ( 
   SELECT src.id, src.lastUpdated, FLATTEN(src.additional) as add 
   FROM inventory as src
) as o

Because the array/list is _flatten_ed, this will result in multiple rows in the inventory_additional view. This view can then be used in all kinds of queries.

I chose option 3 in the end, as in my actual data the array has many more nested fields (and I only needed a couple) and I could not assume a max number of array elements. The flattened VDS is pretty nice, getting the whole thing closer to a normalized database design.

Cheers! Christoph

1 Like