pub.string:tokenize to ignore commas between data fields

Hi All,

I’m using tokenize to separate the values in a delimited string, but one of the values contains a comma that must not be treated as a delimiter. For example, suppose we have the list a,b,c,d,e,f and I want ‘c,d’ treated as a single field.
So the output should be:
a
b
c,d
e
f

Instead of
a
b
c
d
e
f

Kindly help!!!

This is not a direct answer.
The pub.string:tokenize service accepts three inputs: an input string, a delimiter, and a boolean option that says whether to use regular expressions. The delimiter in your case is a comma, and the requirement is a little tricky because “c,d” itself contains the delimiter.
There are several ways to achieve this by combining different services, e.g.:

  • You can first replace every comma except the one inside the value with another delimiter, changing the input to a;b;c,d;e;f, and then tokenize on the new delimiter.
  • You can use the substring function to split the string and then concatenate the pieces with the rest of the input string (e,f etc.).

The implementation would largely depend on your scenario and input data, and the data may need some preprocessing as well.

Can you shed more light on the input data itself? Are there specific scenarios in which you want the delimiter to be ignored?

-NP

I’m not aware of any parser on the planet that would be able to directly do what you describe. How would a parser know that “c,d” is to be treated differently from the others?

If you’re trying to perform flat-file processing of a single record, the flat-file services will do what you want, BUT the data would need to look like this:

a,b,"c,d",e,f

The quotes tell the parser not to treat the delimiters inside them as delimiters. Refer to the following for additional information about delimited data parsing and how to support delimiters in the data:

https://datatracker.ietf.org/doc/html/rfc4180

The flat file services documentation describes how to properly escape delimiters. The services support doing so both when creating and when parsing delimited data.
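To illustrate why the quotes make the difference, here is a minimal Java sketch of a quote-aware field splitter. It is not the flat file services implementation and not a complete RFC 4180 parser. The class name QuotedFieldSplitter and its split method are made up for this example, and it assumes a single record with no line breaks inside fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of webMethods.
class QuotedFieldSplitter {

    // Splits one record on the delimiter. Delimiters inside double quotes are
    // kept as data, and a doubled quote ("") inside a quoted field becomes one
    // literal quote, following the convention described in RFC 4180.
    static List<String> split(String record, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < record.length(); i++) {
            char ch = record.charAt(i);
            if (inQuotes) {
                if (ch == '"') {
                    if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote
                        i++;                 // skip the second quote
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(ch);      // includes delimiters
                }
            } else if (ch == '"') {
                inQuotes = true;             // opening quote
            } else if (ch == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        fields.add(current.toString());      // last field
        return fields;
    }

    public static void main(String[] args) {
        // prints [a, b, c,d, e, f]
        System.out.println(split("a,b,\"c,d\",e,f", ','));
    }
}
```

Without the quotes in the input, the same code returns six fields, which is the point made above: the parser needs something in the data to tell it which commas are not delimiters.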

Keep in mind that most people/vendors tend to forget there are 2 delimiters in a file: the field delimiter (comma, tab, etc.) and the record delimiter (carriage return or line feed or both or something else). The techniques described there support any of these in the data without resorting to search and replace (which is error prone) or other manipulation.

On another note, be aware of the specific behaviors of tokenize, which uses java.util.StringTokenizer. The tokenize service has behaviors that, more often than not, are neither expected nor desired. E.g. it collapses consecutive delimiters into one. For this reason, we created a service that uses java.lang.String.split instead.
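The difference is easy to see in plain Java. The sketch below is not the custom service mentioned above (its code is not shown here). It only compares the two JDK mechanisms on a record with an empty field and a trailing delimiter. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

// Hypothetical comparison class for illustration only.
class TokenizeVsSplit {

    // StringTokenizer never returns empty tokens, so consecutive
    // delimiters collapse and empty fields disappear.
    static List<String> withTokenizer(String input, String delimiters) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(input, delimiters);
        while (tokenizer.hasMoreTokens()) {
            out.add(tokenizer.nextToken());
        }
        return out;
    }

    // String.split takes a regular expression, so the delimiter is quoted.
    // The limit of -1 keeps trailing empty fields as well.
    static List<String> withSplit(String input, String delimiter) {
        return Arrays.asList(input.split(Pattern.quote(delimiter), -1));
    }

    public static void main(String[] args) {
        String record = "a,,c,";
        System.out.println(withTokenizer(record, ",").size()); // prints 2
        System.out.println(withSplit(record, ",").size());     // prints 4
    }
}
```

With positional data such as a,,c, the tokenizer result shifts c into the second position, which is usually not what the caller wants. Neither approach understands quoted fields, so data like a,b,"c,d",e,f still needs a quote-aware parser.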
