Guide: Collecting validator usage statistics

Track: Test bed operation

This guide walks you through the steps to enable the collection of usage statistics from a validator running on your own infrastructure. It applies to validators built using the test bed’s base XML, RDF, JSON and CSV validator components.

Note

Validators managed by the test bed: To enable statistics collection for a validator you need to control its hosting and operation. This guide does not apply if your validator is hosted and managed by the Test Bed team.

What you will achieve

At the end of the guide you will know how to enable usage statistics for your hosted validators. To illustrate the process an XML validator will be configured, but the process applies identically to the other kinds of validators. As part of this guide you will:

  1. Create an XML validator.

  2. Create a simple service to collect the validator’s statistics.

  3. Adapt your validator’s configuration to share its usage statistics.

What you will need

  • About 30 minutes.

  • Your preferred text editor.

  • A Docker installation and the docker-compose tool.

  • Access to the internet.

Enabling usage statistics is an advanced validator configuration. It is assumed that before proceeding you know how to set up a basic validator (see the respective guides for XML, RDF, JSON and CSV). In addition, as this applies only to validators hosted on your own infrastructure, it is useful to be familiar with the validator installation guide for on-premise instances.

How to complete this guide

This is a step-by-step guide to lead you through the creation of your validator and the configuration of statistics collection. Having said this, the focus of the guide is not the validator’s basic configuration but rather the setup related to enabling statistics. For the validator we will be considering XML and specifically the fictional EU Purchase Order scenario as introduced in the XML validation guide.

Steps

Carry out the following steps to complete this guide.

Step 1: Set up the validator

To begin, let’s create a simple validator (using XML for our example). We will use the configuration from the XML validation guide, with the specific resources explained and provided for download here. To summarise the setup:

  • We are validating XML-based purchase orders.

  • We use XML Schema for syntax validation and Schematron to check business rules.

  • We define validation for two types of purchase orders, basic and large, of which large orders expect a minimum number of included items.

Let’s start by putting in place the validator’s resources. Create a folder /statistics as the root of all files we will be working with in this tutorial. Within this create a validator folder to hold the validator’s configuration files and validation artefacts (as described in the XML validation guide). In the end your /statistics folder should contain the following:

statistics
└── validator
    └── domains
        └── order
            ├── sch
            │   └── LargePurchaseOrder.sch
            ├── xsd
            │   └── PurchaseOrder.xsd
            └── config.properties

The domains folder listed above represents the validator’s resource root (configured as validator.resourceRoot). This is the folder that contains one or more domain configurations, which in our case is only one, matching the purchase order setup (folder order). The config.properties file is the domain configuration file that defines customisations, the available validation types, and the involved validation artefacts (XSD and Schematron).
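As an illustrative sketch (the exact keys and values are covered in the XML validation guide; treat everything below as an assumption rather than a definitive reference), the domain configuration could look along these lines:

```properties
# Illustrative config.properties for the order domain - property names follow
# the test bed validator conventions, but consult the XML validation guide
# for the authoritative configuration reference.
validator.type = basic,large
validator.typeLabel.basic = Basic purchase order
validator.typeLabel.large = Large purchase order
validator.schemaFile.basic = xsd/PurchaseOrder.xsd
validator.schemaFile.large = xsd/PurchaseOrder.xsd
validator.schematronFile.large = sch/LargePurchaseOrder.sch
```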

This brings us to the point of creating our validator. For this guide we will use docker-compose as it simplifies the validator’s configuration. To reduce the number of steps involved we will also not create a new Docker image for our validator but rather use the base XML image directly.

To define the validator create file docker-compose.yml in the root folder /statistics:

statistics
├── validator
│   └── ...
└── docker-compose.yml

The contents of docker-compose.yml should be as follows (you can download this here):

version: "2.1"

services:
  po-validator:
    image: isaitb/xml-validator
    container_name: po-validator
    ports:
      - "8080:8080"
    volumes:
      - ./validator/:/config/
    environment:
      - validator.resourceRoot=/config/domains/

Here we use the base isaitb/xml-validator image to create our validator, which will be available on port 8080. Note that we use a Docker volume to map the validator folder as the container’s /config folder, which represents the root of its configuration. With property validator.resourceRoot we point to the /config/domains/ folder within the container, under which all subfolders are considered validator domains to set up.

We can now launch the validator from the /statistics folder by issuing:

> docker-compose up -d

Creating network "statistics_default" with the default driver
Creating po-validator ... done

The validator’s UI is now available for use at http://localhost:8080/order/upload.

Step 2: Create the service to collect statistics

Before we enable usage statistics for our new validator we need to create a service to receive them. Once a validation completes, the validator collects its relevant information and submits it to an external service. This happens asynchronously to the actual validation and is guaranteed to never interfere with the validator’s usage in case of errors. The overall approach is in fact very similar to how webhooks work, a popular technique for services to trigger notifications to interested external parties.

In the case of validators, this notification is an HTTP POST containing a JSON payload with the validation’s information. Any response to this POST with a status code of 300 or more (i.e. something unexpected) is logged by the validator as a warning. The POST payload includes the following properties:

Property          Description                                                          Type    Always present
----------------  -------------------------------------------------------------------  ------  --------------
validator         An identifier to distinguish the specific validator instance.       String  Yes
domain            The relevant validation domain.                                      String  Yes
api               The validator API that was used. May be web, soap or rest.           String  Yes
validationType    The domain’s relevant validation type that was selected.             String  Yes
result            The overall result of the validation. May be SUCCESS, WARNING        String  Yes
                  or FAILURE.
validationTime    The time when the validation completed (formatted as                 String  Yes
                  yyyy-MM-dd'T'HH:mm:ss).
country           The two-letter country code (defined by ISO 3166) of the user or     String  No
                  service that triggered the validation. This is included if country
                  reporting is configured.
secret            The validator’s secret, if one has been configured.                  String  No

A sample statistics payload generated by a validator would thus be as follows:

{
    "validator": "xml",
    "domain": "order",
    "api": "web",
    "validationType": "large",
    "result": "FAILURE",
    "validationTime": "2022-01-19T11:43:44",
    "country": "BE",
    "secret": "SECRET"
}

Considering the above, collecting validator statistics requires you to create a service that listens for such POST requests and processes their payloads accordingly. You could for example record each set of information in full (e.g. for detailed reporting), record certain information of interest (e.g. to maintain totals), or even take more elaborate actions (e.g. trigger processing on errors).
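As an illustration of such a service, the following is a minimal sketch using only Python’s standard library (this is not part of the test bed; all names here are our own). It accepts POSTs on /statistics, stores the JSON payload, and returns a 200 status:

```python
# Minimal statistics collector sketch (illustrative, not part of the test bed).
# Accepts POSTs on /statistics, records the JSON payload and returns 200.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class StatisticsHandler(BaseHTTPRequestHandler):
    received = []  # payloads collected so far

    def do_POST(self):
        if self.path != "/statistics":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        StatisticsHandler.received.append(payload)
        self.send_response(200)  # a status >= 300 would be logged by the validator as a warning
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet

# Demonstration: run the collector on an ephemeral port and post a sample payload.
server = HTTPServer(("127.0.0.1", 0), StatisticsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
sample = {"validator": "xml", "domain": "order", "api": "web",
          "validationType": "large", "result": "FAILURE",
          "validationTime": "2022-01-19T11:43:44"}
request = urllib.request.Request(
    f"http://127.0.0.1:{server.server_address[1]}/statistics",
    data=json.dumps(sample).encode("utf-8"),
    headers={"Content-Type": "application/json"}, method="POST")
with urllib.request.urlopen(request) as response:
    print(response.status, StatisticsHandler.received[-1]["result"])  # prints: 200 FAILURE
server.shutdown()
```

A real deployment would of course persist or aggregate the payloads rather than keep them in memory.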

For our tutorial we will go for the simple approach of logging the received data and returning a 200 status code (all ok). An easy way to achieve this is to use the popular MockServer, a development tool typically used to simulate a service API with mock responses. MockServer also provides a Docker image which makes it simple to include in our setup, defining it within our docker-compose.yml script. Note that defining the statistics collection service alongside the validator is of course not required, as long as the validator can reach it. In fact, in most real-world cases the statistics collection service would be completely separate.

Create under the /statistics folder a new folder named mock to hold the MockServer’s configuration:

statistics
├── mock
│   └── config.json
├── validator
│   └── ...
└── docker-compose.yml

The config.json file defines for our server the requests to be accepted and the results to return. We will define in this a path /statistics that expects a POST and returns a 200 status code (you can download this here):

[
    {
        "httpRequest": {
            "method": "POST",
            "path": "/statistics"
        },
        "httpResponse": {
            "statusCode": 200
        }
    }
]

With the configuration in place let’s now create the server instance. Edit docker-compose.yml to add a new stats-collector service (the updated script can be downloaded from here):

version: "2.1"

services:
  po-validator:
    image: isaitb/xml-validator
    container_name: po-validator
    ports:
      - "8080:8080"
    volumes:
      - ./validator/:/config/
    environment:
      - validator.resourceRoot=/config/domains/
  stats-collector:
    image: mockserver/mockserver:latest
    container_name: stats-collector
    ports:
      - "1080:1080"
    volumes:
      - ./mock/:/config/
    environment:
      - MOCKSERVER_INITIALIZATION_JSON_PATH=/config/config.json

Notice how, similarly to our validator, we use a volume to map the mock folder as /config within the container, and configure the path to the MockServer’s configuration. We can now launch the new server by issuing:

> docker-compose up -d

po-validator is up-to-date
Creating stats-collector ... done

The mock server instance is now listening on port 1080.

We are now ready to update our validator to start sending statistics.

Step 3: Enable statistics for the validator

To enable statistics collection we need to update the validator’s configuration. Note that this is not configuration at domain-level but rather at the overall application level which will apply for all configured domains. To proceed update the docker-compose.yml by adding the new statistics collection endpoint (download the updated script here):

version: "2.1"

services:
  po-validator:
    image: isaitb/xml-validator
    container_name: po-validator
    ports:
      - "8080:8080"
    volumes:
      - ./validator/:/config/
    environment:
      - validator.resourceRoot=/config/domains/
      - validator.webhook.statistics=http://stats-collector:1080/statistics
  stats-collector:
    image: mockserver/mockserver:latest
    container_name: stats-collector
    ports:
      - "1080:1080"
    volumes:
      - ./mock/:/config/
    environment:
      - MOCKSERVER_INITIALIZATION_JSON_PATH=/config/config.json

As you see, all that is needed is to set the validator’s validator.webhook.statistics property with the endpoint address. In our example the hostname is set to stats-collector (the container’s name) rather than localhost, as both the validator and the collection endpoint are part of the same Docker Compose network.

We can now update the validator and start collecting statistics by issuing:

> docker-compose up -d

stats-collector is up-to-date
Recreating po-validator ... done

If we proceed to validate now using the validator we will see through the mock server’s monitoring dashboard a new entry with the validation’s metadata:

POST  /statistics
{
    "method": "POST",
    "path": "/statistics",
    "headers": {...},
    "keepAlive": true,
    "secure": false,
    "body": {
        "validator": "xml",
        "domain": "order",
        "api": "web",
        "validationType": "large",
        "result": "FAILURE",
        "validationTime": "2022-01-19T11:43:44"
    }
}

From the JSON payload displayed above you may notice the xml value for the “validator” property. This serves to identify the specific validator instance that triggered the statistics, in this case xml, the default value for an XML validator. Similar defaults are defined for the other validator types (rdf, json and csv for the RDF, JSON and CSV validators respectively), but you can change this by specifying the validator.identifier property. This can be useful when you want to distinguish between multiple instances of the same validator, for example identifying “acceptance” and “production” instances.

Finally, you also have the option of defining a secret key to be shared as part of the validator statistics. Setting property validator.webhook.statisticsSecret to a text value will include it in the JSON payload as property “secret”. This allows you to include an arbitrary string that you can subsequently use for server-side verification.
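Such a verification could be sketched as follows (EXPECTED_SECRET and is_trusted are illustrative names of our own; the only assumption from the guide is that the configured secret arrives in the payload’s “secret” property):

```python
# Sketch of server-side verification of the optional "secret" property.
# EXPECTED_SECRET stands for whatever value was configured via
# validator.webhook.statisticsSecret.
import hmac

EXPECTED_SECRET = "SECRET"

def is_trusted(payload: dict) -> bool:
    # Constant-time comparison avoids leaking the secret through timing.
    return hmac.compare_digest(payload.get("secret", ""), EXPECTED_SECRET)

print(is_trusted({"secret": "SECRET"}), is_trusted({}))  # prints: True False
```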

Step 4: (Optional) Set up country resolution

As part of the provided JSON statistics payload you may also choose to include the country of origin of the user or service that triggered the validation. This can be particularly interesting when validators serve specifications that transcend a single country, providing you with information on where the validator is actually being used.

The country is an optional addition to the statistics as it requires additional configuration on your side. Country detection works by taking the IP address of the connected client (user or service) and checking it against a geo-location database. The test bed’s validators use for this purpose the popular MaxMind provider, expecting a local database file in the MaxMind DB format (extension .mmdb). The only information ever extracted from this for a given IP address is the two-letter country code (as per ISO 3166); the database file therefore only needs to provide country-level detail. It is worth noting that country-level geo-location databases contain the least detail, have the smallest size, and can go without updates for the longest time. In addition, such databases are typically available from MaxMind and similar providers for free (or at least with a free subscription tier).

Note

Country statistics and data privacy: The test bed validators are designed to always be used in a fully anonymous manner. This is why IP addresses are never included in generated statistics: although the validators themselves don’t use them to precisely locate users, this could not be enforced for downstream statistics collection services.

To include countries in reported statistics we first need to obtain a geo-location database file. A good option for this is MaxMind’s GeoLite2 database, which can be obtained and subsequently refreshed free of charge using a GeoLite2 account.

After having obtained such a file we need to make it available to our validator, the configuration of which needs to be updated with the database’s location. Assuming we have obtained a file GeoLite2-Country.mmdb we will place it in the validator’s configuration folder:

statistics
├── validator
│   ├── domains
│   │   └── ...
│   └── GeoLite2-Country.mmdb
└── docker-compose.yml

Recall that the validator folder is passed to the validator as a volume, making it available within the validator’s container. To configure its use we adapt docker-compose.yml as follows (updated version available here):

version: "2.1"

services:
  po-validator:
    image: isaitb/xml-validator
    container_name: po-validator
    ports:
      - "8080:8080"
    volumes:
      - ./validator/:/config/
    environment:
      - validator.resourceRoot=/config/domains/
      - validator.webhook.statistics=http://stats-collector:1080/statistics
      - validator.webhook.statisticsEnableCountryDetection=true
      - validator.webhook.statisticsCountryDetectionDbFile=/config/GeoLite2-Country.mmdb
  stats-collector:
    image: mockserver/mockserver:latest
    container_name: stats-collector
    ports:
      - "1080:1080"
    volumes:
      - ./mock/:/config/
    environment:
      - MOCKSERVER_INITIALIZATION_JSON_PATH=/config/config.json

Note here the use of the validator.webhook.statisticsCountryDetectionDbFile property to point to the database file within the container. In addition, we need to set validator.webhook.statisticsEnableCountryDetection to true to explicitly state that we want countries included. Note that any issues in the configuration of this database will lead to an error logged during the validator’s startup and country detection being deactivated.

With the database file in place and the configuration updated we can apply the changes by issuing:

> docker-compose up -d

stats-collector is up-to-date
Recreating po-validator ... done

From this point on, using the validator will also include country codes in the shared statistics:

POST  /statistics
{
    ...
    "body": {
        "validator": "xml",
        "domain": "order",
        "api": "web",
        "validationType": "large",
        "result": "FAILURE",
        "validationTime": "2022-01-19T11:43:44",
        "country": "BE"
    }
}

Running behind a proxy

As discussed earlier, countries are determined based on the user’s IP address. In most cases when you operate a production validator instance, your validator will be accessed through a reverse proxy that acts as the public-facing gateway to your services. In such cases the IP address the validator sees will typically be that of the reverse proxy, which is of course not what we want.

To overcome this, a typical configuration made at the level of the reverse proxy is to include the user’s real IP address as an HTTP header for backend services. The test bed validators can detect such a header and use it to get the IP address to consider. By default the name of the HTTP header that is looked for is X-Real-IP. In case this is different for your setup (some servers may use for example the X-Forwarded-For header) you can set the header name to look for in your validator’s configuration. To do this edit docker-compose.yml, adding property validator.webhook.ipheader with the header name to look for:

version: "2.1"

services:
  po-validator:
    ...
    environment:
      ...
      - validator.webhook.ipheader=X-Forwarded-For
  stats-collector:
    ...

Testing statistics collection with countries enabled

If you enable country detection for your validator’s statistics you may find that testing locally does not include the country in the communicated statistics. The reason for this is that running the validator and testing from your own workstation or a machine on your internal network will yield an internal IP address with values such as 10.0.0.1 or 172.0.0.5. These IP addresses don’t identify your country and cannot be used with a geo-location database. If you proceed to test like this you will see that no country is included and relevant warnings appear in the validator’s logs:

...
20/01/2022 14:00:23.530 [order] WARN  e.e.e.i.v.c.w.w.StatisticReporting - Unable to resolve country for ip:: 172.27.0.1
...
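As a quick sanity check for addresses you see in such logs, Python’s standard ipaddress module can tell you whether an address is even eligible for geo-location (this is plain stdlib functionality, not part of the validator):

```python
# Private-range addresses like 172.27.0.1 or 10.0.0.1 never appear in a
# geo-location database; only globally routable addresses can be resolved.
import ipaddress

def can_geolocate(ip: str) -> bool:
    return ipaddress.ip_address(ip).is_global

for ip in ("172.27.0.1", "10.0.0.1", "8.8.8.8"):
    print(ip, can_geolocate(ip))  # only 8.8.8.8 is a global address
```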

Regardless of this issue, however, you would probably want to test your configuration before making the validator available over the internet. A simple way to do this is to use the X-Real-IP header, as discussed for the proxied setup, to set the IP address to your actual public one. Given that header manipulation is not easily achieved through the validator’s user interface, you can instead test using its SOAP API.

Any development tool that allows you to make HTTP requests also allows you to set specific headers. Considering this, all you would need to do is set the X-Real-IP header with your public IP address and pass even a dummy payload for validation. For the sample purchase order validator you can test this by making an HTTP POST to http://localhost:8080/api/order/validation with a payload such as the following:

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:v1="http://www.gitb.com/vs/v1/" xmlns:v11="http://www.gitb.com/core/v1/">
  <soapenv:Header/>
  <soapenv:Body>
    <v1:ValidateRequest>
      <input name="xml" embeddingMethod="STRING">
        <v11:value><![CDATA[<test/>]]></v11:value>
      </input>
      <input name="type" embeddingMethod="STRING">
        <v11:value>large</v11:value>
      </input>
    </v1:ValidateRequest>
  </soapenv:Body>
</soapenv:Envelope>

Notice that we pass here dummy XML content, given that a basic XSD-triggered syntax error is sufficient for our needs. Testing with this we would then see the statistics collection correctly including the country (based on the supplied X-Real-IP):

POST  /statistics
{
    ...
    "body": {
        "validator": "xml",
        "domain": "order",
        "api": "soap",
        "validationType": "large",
        "result": "FAILURE",
        "validationTime": "2022-01-19T11:43:44",
        "country": "BE"
    }
}
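A hedged sketch of how such a test request could be built with Python’s standard library follows (the endpoint and envelope are the ones from this guide; 203.0.113.10 is a documentation placeholder to be replaced with your actual public address):

```python
# Builds the SOAP validation request with a custom X-Real-IP header so the
# validator geo-locates a chosen public address. 203.0.113.10 is a
# placeholder - substitute your real public IP.
import urllib.request

ENVELOPE = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:v1="http://www.gitb.com/vs/v1/" xmlns:v11="http://www.gitb.com/core/v1/">
  <soapenv:Header/>
  <soapenv:Body>
    <v1:ValidateRequest>
      <input name="xml" embeddingMethod="STRING">
        <v11:value><![CDATA[<test/>]]></v11:value>
      </input>
      <input name="type" embeddingMethod="STRING">
        <v11:value>large</v11:value>
      </input>
    </v1:ValidateRequest>
  </soapenv:Body>
</soapenv:Envelope>"""

def build_request(public_ip: str) -> urllib.request.Request:
    return urllib.request.Request(
        "http://localhost:8080/api/order/validation",
        data=ENVELOPE.encode("utf-8"),
        headers={
            "Content-Type": "text/xml; charset=utf-8",
            "X-Real-IP": public_ip,  # header the validator checks by default
        },
        method="POST",
    )

request = build_request("203.0.113.10")
# urllib.request.urlopen(request) would submit it once the validator is running
```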

Summary

Congratulations! You just finished configuring a simple XML validator and enabling its usage statistics. In doing so you also set up a basic collection service for these statistics and extended the configuration to also include country information.

See also

This guide briefly went through the initial setup of a validator before focusing on enabling statistics. Configuring a validator offers many options, both for the validator’s base setup as well as its deployment. For step-by-step setup information as well as a reference of all configuration options, you can refer to the following documentation:

Keep in mind that statistics collection can only be configured for validators that you host (not ones managed by the test bed team). Details on how to deploy a validator on your infrastructure are provided in the validators’ production installation guide.