Written by Ramond Jaggessar

Using Cribl to transform XML data to JSON

Data | 8-minute read

It’s no secret: we at CINQ love Cribl products. We’re particularly fond of Cribl Stream, so much so that we run workshops to show people what you can do with it. Cribl describes Cribl Stream as an Observability Pipeline, but before we get lost in what exactly that means, here is what Cribl Stream does: it collects data from any data source, filters out the parts you don’t need, transforms the data into a format of your choosing, and sends it to a data destination. You can get started with Cribl for free in their sandbox environment.

One of the assignments in our recent Cribl workshop is based on a real-life use case from one of our customers, and the few times we’ve given this workshop, participants struggled a bit with this assignment.

Since there’s not enough time during the workshop to explain in detail why each step is needed, here’s a little blog post to cover that need.

The dataset

The National Data Portal for Road Traffic (Nationaal Data Portaal Wegverkeer, or NDW for short) is the Dutch national portal for road traffic data, covering national roads, provincial roads and municipal main roads.

In NDW, Dutch government bodies work together to collect, combine, store and distribute mobility data. This data is used for traffic management.

The NDW publishes its traffic data on a platform called the Open Data Portal (https://opendata.ndw.nu/).

For the workshop assignment we’re interested in Ongevalideerde_snelheden_en_Intensiteiten.xml.gz, a file that contains the average speed and intensity for all measuring points on all national roads in the country. That’s over 24,000 measuring points. In a single XML file. Compressed as a gzip file. Updated every single minute.

Let that sink in for a moment and let’s take a closer look at the structure of the XML file itself.

<?xml version="1.0" encoding="UTF-8"?>
<SOAP:Envelope xmlns:SOAP="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP:Body>
    <ndw:NdwMrm xmlns:ndw="http://schemas.ndw.nu/wsdl/NdwTmisMrm10/">
      <ndw:exchange>
        <subscription>
          <subscriptionState>active</subscriptionState>
          <updateMethod>snapshot</updateMethod>
        </subscription>
      </ndw:exchange>
      <minute_speed_and_flow_events xmlns="http://speed_and_flow.trafficmanagementinfo.publicatie.hwn.rws.nl/1.1">
        <meta>
          <msg_id>
            <uuid>97b3eddf-cf1c-43cc-ae0f-e2e55ab0d22b</uuid>
          </msg_id>
          <selection />
        </meta>
        <event>
          <ts_event>2019-09-19T12:21:27.741Z</ts_event>
          <ts_state>2019-09-19T11:40:53.853Z</ts_state>
          <measuring_point_id>
            <uuid>e9af9383-2d7f-4963-88c9-38aa1d9c33cc</uuid>
          </measuring_point_id>
          <lanelocation>
            <road>A2</road>
            <carriageway>g</carriageway>
            <lane>3</lane>
            <km>64.242</km>
          </lanelocation>
        </event>
        <event>
          <ts_event>2023-08-20T19:05:06.285Z</ts_event>
          <ts_state>2023-08-20T19:05:00Z</ts_state>
          <measuring_point_id>
            <uuid>e9af9383-2d7f-4963-88c9-38aa1d9c33cc</uuid>
          </measuring_point_id>
          <avgspeed>
            <kmph>92</kmph>
          </avgspeed>
        </event>
        <event>
          <ts_event>2023-08-20T19:05:06.286Z</ts_event>
          <ts_state>2023-08-20T19:05:00Z</ts_state>
          <measuring_point_id>
            <uuid>e9af9383-2d7f-4963-88c9-38aa1d9c33cc</uuid>
          </measuring_point_id>
          <flow>
            <count>6</count>
          </flow>
        </event>
        …
        …
        …
      </minute_speed_and_flow_events>
    </ndw:NdwMrm>
  </SOAP:Body>
</SOAP:Envelope>

We can see that the interesting parts are under the event element and that the same measuring_point_id appears 3 times, under 3 different event elements.

Event 1: maps the measuring_point_id to the physical location of the measuring point on the specified national road.

Event 2: you can see the average speed for a specific measuring_point_id

Event 3: you can see the number of vehicles for a specific measuring_point_id
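To make the three event types concrete, here is a minimal Python sketch that looks at a single event fragment and reports which kind it is. It is only an illustration outside Cribl: the function name classify_event and the KINDS tuple are made up for this example, and it assumes the fragment has no namespace declarations, like the event elements shown in the sample above.

```python
import xml.etree.ElementTree as ET

# The three payload elements seen in the sample file (hypothetical constant name).
KINDS = ("lanelocation", "avgspeed", "flow")


def classify_event(event_xml: str):
    """Return (measuring point uuid, payload kind) for one <event> fragment."""
    event = ET.fromstring(event_xml)
    point_id = event.findtext("measuring_point_id/uuid")
    # The first payload element that is present decides the kind of event.
    kind = next((k for k in KINDS if event.find(k) is not None), None)
    return point_id, kind


speed_event = (
    "<event>"
    "<measuring_point_id><uuid>e9af9383-2d7f-4963-88c9-38aa1d9c33cc</uuid></measuring_point_id>"
    "<avgspeed><kmph>92</kmph></avgspeed>"
    "</event>"
)
point_id, kind = classify_event(speed_event)
```

With the fragment above, kind is "avgspeed" and point_id is the uuid of the measuring point.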

So, everything we need is located on a website in a compressed XML file of around 1.6 MB. Decompressed, the file is around 18 MB.

Using Cribl to transform the XML to JSON

What we need to do in Cribl is fetch the gzip file, decompress it to get the XML file, and create a separate event for every, well, event.

Step 1

This is the easiest part. Set up a REST API source pointing to the URL of the file. As if it’s nothing, Cribl will get the file and automatically decompress it when the source is run.
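For readers who want to see what this step amounts to outside Cribl, here is a rough Python equivalent using only the standard library. It is a sketch, not a description of how the REST API source works internally; the function names are made up for this example, and the URL is left as a parameter because the exact download link is not spelled out in this post.

```python
import gzip
import urllib.request


def decompress_payload(compressed: bytes) -> str:
    """Gunzip the downloaded bytes and decode them as UTF-8 XML text."""
    return gzip.decompress(compressed).decode("utf-8")


def fetch_and_decompress(url: str, timeout: float = 30.0) -> str:
    """Download a .xml.gz file and return the decompressed XML as a string.

    The caller supplies the URL of the gzip file (an assumption of this sketch).
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        compressed = response.read()
    return decompress_payload(compressed)
```

The whole file comes back as one string, which is the situation Cribl is in before any event breaking has happened.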

Here comes the catch though. Since the decompressed file is around 18 MB, Cribl won’t be able to break the file into separate events later on in the pipeline you’ll be using, because by default, the maximum number of bytes in an event before it is flushed to the pipelines is set at 51200 bytes. Luckily, you can make custom event breaker rules where you can adjust the maximum size.

Step 2

Find the Event Breaker Rules (under Processing > Knowledge). Either create a new Ruleset, or make a copy of the existing Ruleset “Cribl - Do Not Break Ruleset”. The setting we want to adjust is Max event bytes, so that the XML file in its entirety is sent to the pipeline for processing.

So choose a value that’s at least the size of the decompressed file in bytes, or go all the way up to the maximum event size Cribl can handle (134217728 bytes).
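The numbers involved are easy to check with a few lines of Python. This is only back-of-the-envelope arithmetic: the constant names are made up for this example, and "around 18 MB" is treated as 18 × 1024 × 1024 bytes, which is an assumption.

```python
DEFAULT_MAX_EVENT_BYTES = 51_200       # default mentioned in this post
MAX_ALLOWED_EVENT_BYTES = 134_217_728  # upper limit mentioned in this post
APPROX_FILE_BYTES = 18 * 1024 * 1024   # "around 18 MB", read as MiB (assumption)


def fits_in_one_event(payload_bytes: int, max_event_bytes: int) -> bool:
    """True if a payload of this size stays within the configured limit."""
    return payload_bytes <= max_event_bytes


fits_with_default = fits_in_one_event(APPROX_FILE_BYTES, DEFAULT_MAX_EVENT_BYTES)
fits_with_maximum = fits_in_one_event(APPROX_FILE_BYTES, MAX_ALLOWED_EVENT_BYTES)
limit_in_mib = MAX_ALLOWED_EVENT_BYTES // (1024 * 1024)
```

The file is several hundred times larger than the default limit, but fits comfortably within the maximum, which works out to 128 MiB.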

Step 3

Use the custom Event Breaker Ruleset in your Data source.

Now we’re able to get the compressed XML file from the website and send it to the pipeline where we’ll do some processing.

In the pipeline we’ll be separating the events we can identify by the event tags in the XML file. We’ll be doing this with the help of the XML Unroll Function. This function accepts an XML event with a set of elements, and converts the elements into individual events.

If we look at the XML file again, we’ll notice that the information we’re interested in, the event element, is a child of the minute_speed_and_flow_events element, which itself is a child of the ndw:NdwMrm element (next to ndw:exchange, which we don’t need). And then you have your SOAP body and SOAP envelope.

With this information at hand, we can say that the information we want is located at

SOAP:Envelope\SOAP:Body\ndw:NdwMrm\minute_speed_and_flow_events\event

This is exactly what we’ll be using in the XML Unroll Function. Well, we need to change it a little bit, because the function expects a regular expression that describes the path to the array to unroll, with the elements separated by escaped dots:

^SOAP:Envelope\.SOAP:Body\.ndw:NdwMrm\.minute_speed_and_flow_events\.event$

The Copy Elements Regex part is empty, because we don’t need anything to be copied into the separate events.
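To see why the dotted path and the regular expression fit together, here is a conceptual Python sketch of the unrolling idea: walk the XML tree, build a dotted path for every element, and emit each element whose path matches the regular expression as a separate piece of XML. It is not Cribl’s implementation; the function name unroll and the shortened sample document are made up for this example.

```python
import re
from xml.dom import minidom

# Shortened version of the NDW document, for illustration only.
SAMPLE = """<SOAP:Envelope xmlns:SOAP="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP:Body>
    <ndw:NdwMrm xmlns:ndw="http://schemas.ndw.nu/wsdl/NdwTmisMrm10/">
      <ndw:exchange><subscription><updateMethod>snapshot</updateMethod></subscription></ndw:exchange>
      <minute_speed_and_flow_events xmlns="http://speed_and_flow.trafficmanagementinfo.publicatie.hwn.rws.nl/1.1">
        <event><lanelocation><road>A2</road></lanelocation></event>
        <event><avgspeed><kmph>92</kmph></avgspeed></event>
        <event><flow><count>6</count></flow></event>
      </minute_speed_and_flow_events>
    </ndw:NdwMrm>
  </SOAP:Body>
</SOAP:Envelope>"""

UNROLL_RE = re.compile(
    r"^SOAP:Envelope\.SOAP:Body\.ndw:NdwMrm\.minute_speed_and_flow_events\.event$"
)


def unroll(xml_text: str, path_re) -> list:
    """Return the XML of every element whose dotted path matches path_re."""
    doc = minidom.parseString(xml_text)
    matches = []

    def walk(node, path):
        for child in node.childNodes:
            if child.nodeType != child.ELEMENT_NODE:
                continue  # skip whitespace and other non-element nodes
            # tagName keeps the prefix, e.g. "SOAP:Body" or "ndw:NdwMrm".
            child_path = f"{path}.{child.tagName}" if path else child.tagName
            if path_re.match(child_path):
                matches.append(child.toxml())
            else:
                walk(child, child_path)

    walk(doc, "")
    return matches


events = unroll(SAMPLE, UNROLL_RE)
```

Here events holds three strings, one per event element, while everything under ndw:exchange is ignored because its path never matches.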

If this is our XML event we put into our pipeline (remember, it’s one big event, so the image is just a sample image)

the XML Unroll Function will make separate XML events and you’ll have something like this (with the Pretty Print option selected)

Each event will contain a measuring_point_id element together with one of the elements <lanelocation>, <avgspeed> or <flow>.

But we’re not done yet. In our workshop we didn’t want to send XML events to our external system, but JSON events, because the dashboard that consumed the participants’ data was built around JSON.

Luckily, this is rather simple to do in Cribl, since there’s a Cribl expression that does just that: C.Text.parseXml() parses an XML string and returns a JSON object.

We need to use this expression with the Eval Function to parse XML fields contained in our events.
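As a rough illustration of what an XML-to-JSON conversion does to one of our events, here is a small Python sketch. It is not C.Text.parseXml() and makes no claim about the exact shape of the object Cribl returns; the function names and the choice to represent leaf elements as plain strings are assumptions of this example.

```python
import json
import xml.etree.ElementTree as ET


def element_to_obj(elem):
    """Convert an element to nested dicts; leaf elements become strings.

    Simplification: repeated sibling tags would overwrite each other,
    which is fine for the single events in this post.
    """
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    return {child.tag: element_to_obj(child) for child in children}


def xml_to_json(xml_text: str) -> str:
    """Parse an XML string and return it as a JSON string."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_obj(root)})


sample_event = (
    "<event>"
    "<ts_state>2023-08-20T19:05:00Z</ts_state>"
    "<measuring_point_id><uuid>e9af9383-2d7f-4963-88c9-38aa1d9c33cc</uuid></measuring_point_id>"
    "<avgspeed><kmph>92</kmph></avgspeed>"
    "</event>"
)
as_json = xml_to_json(sample_event)
```

The nested XML elements become nested JSON keys, so the average speed ends up under event, avgspeed, kmph.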


With this in place the events will be available as JSON objects:

Now, our events are ready to be sent to an external system where the data can be used for whatever purpose.

In our case we used the data to show traffic near our office building:
