Guide: Collecting validator usage statistics
This guide walks you through the steps to enable the collection of usage statistics from a validator running on your own infrastructure. It applies to validators built using the test bed’s base XML, RDF, JSON and CSV validator components.
Note
Validators managed by the test bed: To enable statistics collection for a validator you need to control its hosting and operation. This guide does not apply if your validator is hosted and managed by the Test Bed team.
What you will achieve
At the end of the guide you will know how to enable usage statistics for your hosted validators. To illustrate the process an XML validator will be configured, but the process applies identically to the other kinds of validators. As part of this guide you will:
Create an XML validator.
Create a simple service to collect the validator’s statistics.
Adapt your validator’s configuration to share its usage statistics.
What you will need
About 30 minutes.
Your preferred text editor.
A Docker installation and the Docker Compose tool.
Access to the internet.
Enabling usage statistics is an advanced configuration for a validator. It is assumed that before going into this you are aware of how to set up a basic validator (see the respective guides for XML, RDF, JSON and CSV). In addition, as this applies only to validators hosted on your own infrastructure, it is useful to be aware of the validator’s installation guide for on-premise instances.
How to complete this guide
This is a step-by-step guide to lead you through the creation of your validator and the configuration of statistics collection. Having said this, the focus of the guide is not the validator’s basic configuration but rather the setup related to enabling statistics. For the validator we will be considering XML and specifically the fictional EU Purchase Order scenario as introduced in the XML validation guide.
Steps
Carry out the following steps to complete this guide.
Step 1: Set up the validator
To begin let’s start with creating a simple validator (using XML for our example). We will use the configuration from the XML validation guide, with the specific resources being explained and provided for download here. To summarise the setup:
We are validating XML-based purchase orders.
We use XML Schema for syntax validation and Schematron to check business rules.
We define validation for two types of purchase orders, basic and large, of which large orders expect a minimum number of included items.
Let’s start by putting in place the validator’s resources. Create a folder /statistics as the root of all files we will be working with in this tutorial. Within this create a validator folder to hold the validator’s configuration files and validation artefacts (as described in the XML validation guide). In the end your /statistics folder should contain the following:
statistics
└── validator
└── domains
└── order
├── sch
│ └── LargePurchaseOrder.sch
├── xsd
│ └── PurchaseOrder.xsd
└── config.properties
The domains folder listed above represents the validator’s resource root (configured as validator.resourceRoot). This is the folder that contains one or more domain configurations, which in our case is only one, matching the purchase order setup (folder order). The config.properties file is the domain configuration file that defines customisations, the available validation types, and the involved validation artefacts (XSD and Schematron).
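For reference, such a config.properties defines the validation types and links them to their artefacts. The following is an indicative sketch of what this could look like for the purchase order scenario (property names as per the XML validator’s configuration reference, reproduced here from the XML validation guide’s setup; refer to that guide for the exact file):
validator.type = basic,large
validator.typeLabel.basic = Basic purchase order
validator.typeLabel.large = Large purchase order
validator.schemaFile.basic = xsd/PurchaseOrder.xsd
validator.schemaFile.large = xsd/PurchaseOrder.xsd
validator.schematronFile.large = sch/LargePurchaseOrder.sch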
This brings us to the point of creating our validator. For this guide we will use Docker Compose as it simplifies the validator’s configuration. To reduce the number of steps involved we will also not create a new Docker image for our validator but rather use the base XML image directly.
To define the validator create file docker-compose.yml in the root folder /statistics:
statistics
├── validator
│ └── ...
└── docker-compose.yml
The contents of docker-compose.yml should be as follows (you can download this here):
version: "2.1"
services:
po-validator:
image: isaitb/xml-validator
container_name: po-validator
ports:
- "8080:8080"
volumes:
- ./validator/:/config/
environment:
- validator.resourceRoot=/config/domains/
What we are doing here is to use the base isaitb/xml-validator image to create our validator, which will be available on port 8080. Note that we are using a Docker volume to map the validator folder as the container’s /config folder, which represents the root of its configuration. With property validator.resourceRoot we point to the /config/domains/ folder within the container, under which all subfolders are considered as validator domains to set up.
We can now launch the validator from the /statistics folder by issuing:
> docker compose up -d
Creating network "statistics_default" with the default driver
Creating po-validator ... done
The validator’s UI is now available for use at http://localhost:8080/order/upload.
Step 2: Create the service to collect statistics
Before we enable usage statistics for our new validator we need to create a service to receive them. Once a validation completes, the validator collects its relevant information and submits it to an external service. This happens asynchronously to the actual validation and is guaranteed to never interfere with the validator’s usage in case of errors. The overall approach is in fact very similar to how webhooks work, a popular technique for services to trigger notifications to interested external parties.
In the case of validators, this notification is an HTTP POST containing a JSON payload with the validation’s information. Any response to this POST that has a status code of 300 or more (i.e. something unexpected) will be logged by the validator as a warning. The content of the POST payload includes the following properties:
Property | Description | Type | Always present
---|---|---|---
validator | An identifier to distinguish the specific validator instance. | String | Yes
domain | The relevant validation domain. | String | Yes
api | The validator API that was used (for example web for the user interface or soap for the SOAP API, as seen in this guide’s examples). | String | Yes
validationType | The domain’s relevant validation type that was selected. | String | Yes
result | The overall result of the validation (for example FAILURE, as in the samples shown in this guide). | String | Yes
validationTime | The time when the validation completed (formatted as an ISO 8601 date-time, e.g. 2022-01-19T11:43:44). | String | Yes
country | The two-letter country code (defined by ISO 3166) of the user or service that triggered the validation. This is included if country reporting is configured. | String | No
secret | The validator’s secret, if one has been configured. | String | No
A sample statistics payload generated by a validator would thus be as follows:
{
"validator": "xml",
"domain": "order",
"api": "web",
"validationType": "large",
"result": "FAILURE",
"validationTime": "2022-01-19T11:43:44",
"country": "BE",
"secret": "SECRET"
}
Considering the above, collecting validator statistics requires you to create a service that listens for such POST requests and processes their payloads accordingly. You could for example record each set of information in full (e.g. for detailed reporting), record certain information of interest (e.g. to maintain totals), or even take more elaborate actions (e.g. trigger processing on errors).
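As an illustration, a minimal collector along these lines could look as follows (a hypothetical Python sketch using only the standard library; it is not needed for this guide):
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class StatisticsHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only accept the statistics path the validator is expected to call.
        if self.path != "/statistics":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        # Process the statistics as needed; here we simply log them.
        print("Validation %s/%s: type=%s result=%s" % (
            payload.get("validator"), payload.get("domain"),
            payload.get("validationType"), payload.get("result")))
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 1080), StatisticsHandler).serve_forever()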
For our tutorial we will go for the simple approach of logging the received data and returning a 200 status code (all ok).
An easy way to achieve this is to use the popular MockServer, a development tool typically used to simulate a service API with mock responses. MockServer also provides a Docker image, which makes it simple to include in our setup by defining it within our docker-compose.yml script. Note that defining the statistics collection service alongside the validator is of course not needed, as long as the validator can contact it. In fact, in most real-world cases the statistics collection service would be completely separate.
Create under the /statistics folder a new folder named mock to hold the MockServer’s configuration:
statistics
├── mock
│ └── config.json
├── validator
│ └── ...
└── docker-compose.yml
The config.json file defines for our server the requests to be accepted and the results to return. We will define in this a path /statistics that expects a POST and returns a 200 status code (you can download this here):
[
{
"httpRequest": {
"method": "POST",
"path": "/statistics"
},
"httpResponse": {
"statusCode": 200
}
}
]
With the configuration in place let’s now create the server instance. Edit docker-compose.yml to add a new stats-collector service (the updated script can be downloaded from here):
version: "2.1"
services:
po-validator:
image: isaitb/xml-validator
container_name: po-validator
ports:
- "8080:8080"
volumes:
- ./validator/:/config/
environment:
- validator.resourceRoot=/config/domains/
stats-collector:
image: mockserver/mockserver:latest
container_name: stats-collector
ports:
- "1080:1080"
volumes:
- ./mock/:/config/
environment:
- MOCKSERVER_INITIALIZATION_JSON_PATH=/config/config.json
See here how, in a similar way to our validator, we use a volume to map the mock folder as /config within the container, and configure the path for the MockServer’s configuration. We can now launch the new server by issuing:
> docker compose up -d
po-validator is up-to-date
Creating stats-collector ... done
The mock server instance is now listening on port 1080. Specifically:
It is waiting for validator statistics to be sent via POST to http://localhost:1080/statistics
It provides a monitoring dashboard at http://localhost:1080/mockserver/dashboard
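If you want, you can verify the endpoint manually before involving the validator, for example with curl (any HTTP client will do); per the mock configuration this should return a 200 status code:
> curl -i -X POST http://localhost:1080/statistics -H "Content-Type: application/json" -d "{}"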
We are now ready to update our validator to start sending statistics.
Step 3: Enable statistics for the validator
To enable statistics collection we need to update the validator’s configuration. Note that this is not configuration at domain level but rather at the overall application level, applying to all configured domains. To proceed update the docker-compose.yml by adding the new statistics collection endpoint (download the updated script here):
version: "2.1"
services:
po-validator:
image: isaitb/xml-validator
container_name: po-validator
ports:
- "8080:8080"
volumes:
- ./validator/:/config/
environment:
- validator.resourceRoot=/config/domains/
- validator.webhook.statistics=http://stats-collector:1080/statistics
stats-collector:
image: mockserver/mockserver:latest
container_name: stats-collector
ports:
- "1080:1080"
volumes:
- ./mock/:/config/
environment:
- MOCKSERVER_INITIALIZATION_JSON_PATH=/config/config.json
As you see, all that is needed is to set the validator’s validator.webhook.statistics property with the endpoint address. In our example the hostname is set to stats-collector (the container’s name) rather than localhost, as both the validator and the collection endpoint are part of the same Docker Compose network.
We can now update the validator and start collecting statistics by issuing:
> docker compose up -d
stats-collector is up-to-date
Recreating po-validator ... done
If we now perform a validation, the mock server’s monitoring dashboard will show a new entry with the validation’s metadata:
POST /statistics
{
"method":"POST"
"path":"/statistics"
"headers":{...}
"keepAlive":true
"secure":false
"body":{
"validator":"xml"
"domain":"order"
"api":"web"
"validationType":"large"
"result":"FAILURE"
"validationTime":"2022-01-19T11:43:44"
}
}
From the JSON payload displayed above you may notice the xml value for the “validator” property. This serves to identify the specific validator instance that triggered the statistics, which in this case is xml, the default value for an XML validator. Similar defaults are defined for the other validator types (rdf, json and csv for the RDF, JSON and CSV validators respectively), but you can change this by specifying the validator.identifier property. This could be interesting in case you want to distinguish between multiple instances of the same validator, for example identifying “acceptance” and “production” instances.
Finally, you also have the option of defining a secret key to be shared as part of the validator statistics. Defining property validator.webhook.statisticsSecret with a text value will include it in the JSON payload as property “secret”. This allows you to include an arbitrary string that you can subsequently use as part of server-side verifications.
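On the receiving side you could then reject any payload whose secret does not match the configured value. A minimal sketch of such a check (hypothetical Python; EXPECTED_SECRET is an assumed constant mirroring validator.webhook.statisticsSecret):
import hmac

EXPECTED_SECRET = "SECRET"  # assumed to mirror validator.webhook.statisticsSecret

def is_authentic(payload: dict) -> bool:
    # Use a constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(payload.get("secret", ""), EXPECTED_SECRET)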
Step 4: (Optional) Set up country resolution
As part of the provided JSON statistics payload you may also choose to include the country of origin of the user or service that triggered the validation. This can be particularly interesting when validators serve specifications that transcend a single country, providing you with information on where the validator is actually being used.
The country is an optional addition to the statistics as it requires additional configuration on your side. The way country detection works is to use the IP address of the connected client (user or service) and check it against a geo-location database. The test bed’s validators use for this purpose the popular MaxMind provider, expecting a local database file in the MaxMind DB format (extension .mmdb). The only information ever extracted from this for a given IP address is the two-letter country code (as per ISO 3166); as such, the database file only needs to offer country-level detail. It is interesting to note that country-level geo-location databases contain the least detail, have the smallest size, and can remain without updates for the longest time. In addition, such databases are typically available from MaxMind and similar providers for free (or at least at a free subscription tier).
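To illustrate what such a lookup involves, the following sketch queries a country database using the maxminddb Python package (purely illustrative; the validator performs this lookup internally, and the IP address shown is a documentation-range placeholder):
import maxminddb

# Open the country-level database and look up an address.
reader = maxminddb.open_database("GeoLite2-Country.mmdb")
record = reader.get("203.0.113.10")  # placeholder; use a real public IP
if record and "country" in record:
    print(record["country"]["iso_code"])  # e.g. "BE"
reader.close()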
Note
Country statistics and data privacy: The test bed validators are designed to always be used in a fully anonymous manner. This is why IP addresses are never included in generated statistics: although the validators themselves don’t use them to precisely locate users, such restraint could not be enforced on downstream statistics collection services.
To include countries in reported statistics we first need to obtain a geo-location database file. A good option for this is MaxMind’s free GeoLite2 database, which can be obtained and subsequently refreshed at no cost using a free GeoLite2 account.
After having obtained such a file we need to make it available to our validator, whose configuration needs to be updated with the database’s location. Assuming we have obtained a file GeoLite2-Country.mmdb, we will place it in the validator’s configuration folder:
statistics
├── validator
│ ├── domains
│ │ └── ...
│ └── GeoLite2-Country.mmdb
└── docker-compose.yml
Recall that the validator folder is passed to the validator as a volume, making it available within the validator’s container. To configure its use we adapt docker-compose.yml as follows (updated version available here):
version: "2.1"
services:
po-validator:
image: isaitb/xml-validator
container_name: po-validator
ports:
- "8080:8080"
volumes:
- ./validator/:/config/
environment:
- validator.resourceRoot=/config/domains/
- validator.webhook.statistics=http://stats-collector:1080/statistics
- validator.webhook.statisticsEnableCountryDetection=true
- validator.webhook.statisticsCountryDetectionDbFile=/config/GeoLite2-Country.mmdb
stats-collector:
image: mockserver/mockserver:latest
container_name: stats-collector
ports:
- "1080:1080"
volumes:
- ./mock/:/config/
environment:
- MOCKSERVER_INITIALIZATION_JSON_PATH=/config/config.json
Note here the use of the validator.webhook.statisticsCountryDetectionDbFile property to point to the database file within the container. In addition, we need to set validator.webhook.statisticsEnableCountryDetection to true to explicitly state that we want countries included. Note that any issue in the configuration of this database will lead to an error logged during the validator’s startup and country detection being deactivated.
With the database file in place and the configuration updated we can apply the changes by issuing:
> docker compose up -d
stats-collector is up-to-date
Recreating po-validator ... done
From this point on, using the validator will also include country codes in the shared statistics:
POST /statistics
{
...
"body":{
"validator":"xml"
"domain":"order"
"api":"web"
"validationType":"large"
"result":"FAILURE"
"validationTime":"2022-01-19T11:43:44"
"country":"BE"
}
}
Running behind a proxy
As discussed earlier, countries are determined based on the user’s IP address. In most cases when you are operating a production validator instance your validator will be accessed through a reverse proxy that acts as the public-facing gateway to your services. In such cases the IP address the validator sees will typically be the address of the reverse proxy which is of course not interesting.
To overcome this, a typical configuration made at the level of the reverse proxy is to include the user’s real IP address as an HTTP header for backend services. The test bed validators can detect such a header and use it to determine the IP address to consider. By default the name of the HTTP header that is looked for is X-Real-IP. In case this is different for your setup (some servers may use for example the X-Forwarded-For header) you can set the header name to look for in your validator’s configuration. To do this edit docker-compose.yml, adding property validator.webhook.ipheader with the header name to look for:
services:
po-validator:
...
environment:
...
- validator.webhook.ipheader=X-Forwarded-For
stats-collector:
...
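For completeness, the header itself is set by the proxy. With nginx as the reverse proxy, for example (an assumption for illustration; other proxies offer equivalent directives), the relevant configuration would typically look as follows:
location / {
    # Forward the real client address to the validator.
    proxy_set_header X-Real-IP $remote_addr;
    proxy_pass http://po-validator:8080;
}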
Testing statistics collection with countries enabled
If you enable country detection for your validator’s statistics you may find that testing locally does not include the country in the communicated statistics. The reason for this is that running the validator and testing from your own workstation or a machine on your internal network will yield an internal IP address with values such as 10.0.0.1 or 172.27.0.1. These IP addresses don’t identify your country and cannot be used with a geo-location database. If you proceed to test like this you will see that no country is included and relevant warnings appear in the validator’s logs:
...
20/01/2022 14:00:23.530 [order] WARN e.e.e.i.v.c.w.w.StatisticReporting - Unable to resolve country for ip:: 172.27.0.1
...
Regardless of this issue, however, you would probably want to test your configuration before making the validator available over the internet. A simple way to do this is to use the X-Real-IP header, as discussed in the case of a proxied setup, to set the IP address to your actual public one. Given that header manipulation is not easily achieved when using the validator’s user interface, you can instead test using its SOAP API.
Any development tool that allows you to make HTTP requests also allows you to set specific headers. Considering this, all you would need to do is set the X-Real-IP header with your public IP address and pass even a dummy payload for validation. For the sample purchase order validator you can test this by making an HTTP POST to http://localhost:8080/api/order/validation with a payload such as the following:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:v1="http://www.gitb.com/vs/v1/" xmlns:v11="http://www.gitb.com/core/v1/">
<soapenv:Header/>
<soapenv:Body>
<v1:ValidateRequest>
<input name="xml" embeddingMethod="STRING">
<v11:value><![CDATA[<test/>]]></v11:value>
</input>
<input name="type" embeddingMethod="STRING">
<v11:value>large</v11:value>
</input>
</v1:ValidateRequest>
</soapenv:Body>
</soapenv:Envelope>
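Assuming you save this envelope as request.xml, the request could be issued for example with curl (203.0.113.10 being a documentation-range placeholder to replace with your actual public IP):
> curl -X POST http://localhost:8080/api/order/validation -H "Content-Type: text/xml" -H "X-Real-IP: 203.0.113.10" --data @request.xml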
Notice that we pass here dummy XML content, given that a basic XSD-triggered syntax error is sufficient for our needs. Testing with this we would then see the statistics collection correctly including the country (based on the supplied X-Real-IP):
POST /statistics
{
...
"body":{
"validator":"xml"
"domain":"order"
"api":"soap"
"validationType":"large"
"result":"FAILURE"
"validationTime":"2022-01-19T11:43:44"
"country":"BE"
}
}
Summary
Congratulations! You just finished configuring a simple XML validator and enabling its usage statistics. In doing so you also set up a basic collection service for these statistics and extended the configuration to also include country information.
See also
This guide briefly went through the initial setup of a validator before focusing on enabling statistics. Configuring a validator offers many options, both for the validator’s base setup as well as its deployment. For step-by-step setup information as well as a reference of all configuration options, you can refer to the following documentation:
The XML validation guide, for XML validators using XML Schema and Schematron.
The RDF validation guide, for RDF validators using SHACL shapes.
The JSON validation guide, for JSON validators using JSON Schema.
The CSV validation guide, for CSV validators using Table Schema.
Keep in mind that statistics collection can only be configured for validators that you host (not ones managed by the test bed team). Details on how to deploy a validator on your infrastructure are provided in the validators’ production installation guide.