Data Provenance

The majority of the data used to generate the plots and charts on this site originates from the experiments run by the ProbeLab team as part of its continuous measurement infrastructure (CMI).

The principal measurement tools (Nebula, Parsec and Tiros) write data to a single PostgreSQL database (“the CMI database”), which is used to populate the plots on this site, perform general analysis and generate the IPFS weekly reports.

Some data used on this site is sourced from external systems and aggregated into the CMI database by a system called Caracol. Caracol is configured to run specific queries against these external systems on a periodic basis, usually daily or weekly, to build a time series of data points. The external systems currently aggregated for use in plots on this site are:

  • Elasticsearch
    • ipfs.io / dweb.link access log summaries - used for counts of unique client IP addresses
    • IPFS bootstrap node access log summaries - used for counts of unique peer IDs and unique software agents
  • Prometheus
    • ipfs.io / dweb.link HTTP metrics - used for request counts
    • cid.contact HTTP metrics - used for IPNI request counts

Access to Data

Read access to the ProbeLab CMI database may be granted to collaborators on request. Access to other data sources such as Protocol Labs Prometheus and Elasticsearch instances may also be possible in some circumstances. Please contact the team through any of the usual channels to discuss your requirements.

Determining Plot Data Sources

The key to finding the source of any plot on the site is to determine its plot definition. A plot definition is a YAML file that contains the source of each dataset shown on the plot, additional transformation rules and presentation information. The source of the data is generally a query, although it may be a static dataset. A static dataset is generally only used to supplement the early history of a plot where the original measurements are no longer available from an online database.

Plot definitions are interpreted by the Ashby templating system to periodically generate a corresponding JSON file containing the raw data obtained from the source and layout instructions. These JSON files are used by the embedded Plotly component on each page to render the plot and can be downloaded in the upper right corner of each plot.
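As a rough illustration of working with a downloaded file, the Python sketch below summarises the datasets in one of these JSON files. It assumes the file follows the usual Plotly figure layout, with a top-level `data` list of traces and a `layout` object; that structure is an assumption based on the files being consumed by Plotly, not something confirmed here. The `sample` document and its values are purely illustrative and are not real measurements.

```python
import json


def summarize_figure(fig: dict) -> list[dict]:
    """Return one summary row per dataset (trace) in a Plotly-style figure.

    Assumes a top-level "data" list of traces, as in a standard Plotly figure.
    """
    rows = []
    for trace in fig.get("data", []):
        x = trace.get("x")
        rows.append(
            {
                "name": trace.get("name", "(unnamed)"),
                "type": trace.get("type", "(unspecified)"),
                # Only count points when x is a plain JSON array.
                "points": len(x) if isinstance(x, list) else None,
            }
        )
    return rows


# Illustrative stand-in for a downloaded plot JSON file (made-up values).
sample = json.loads(
    """
    {
      "data": [
        {"type": "scatter", "name": "example-series",
         "x": ["2024-01-01", "2024-01-02", "2024-01-03"],
         "y": [1, 2, 3]}
      ],
      "layout": {"title": {"text": "Example plot"}}
    }
    """
)

for row in summarize_figure(sample):
    print(row)
```

To use this with a real download, replace `sample` with the result of `json.load()` on the saved file.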

Usually the name of the plot definition is used to generate the HTML anchor of the plot. For example, the Recent Kubo Versions Over Time plot has the anchor #recent-kubo-versions-over-time, which corresponds to the plot definition recent-kubo-versions-over-time.yaml. Sometimes the anchor has been adjusted to be more readable. In either case, you can inspect the source of the page: each plot includes some JavaScript that imports the JSON file generated by Ashby. For example, the page displaying the Recent Kubo Versions Over Time plot contains the following script:

d3.json("../../plots/latest/recent-kubo-versions-over-time.json").then(function (fig) {

The raw data JSON file has the same name as the plot definition: in this case, recent-kubo-versions-over-time.json corresponds to recent-kubo-versions-over-time.yaml.
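The lookup described above can be sketched in Python. This is a minimal example, assuming every plot on a page is loaded through a `d3.json("….json")` call like the one quoted above; the function name `plot_definitions_in_page` and the regular expression are illustrative choices rather than part of the site's tooling.

```python
import re
from pathlib import PurePosixPath

# Matches d3.json("<path>.json") calls and captures the path (assumed call shape).
D3_JSON_CALL = re.compile(r"""d3\.json\(\s*["']([^"']+\.json)["']\s*\)""")


def plot_definitions_in_page(html: str) -> list[str]:
    """Return the plot definition file name implied by each d3.json call."""
    names = []
    for url in D3_JSON_CALL.findall(html):
        # The JSON file and the plot definition share the same base name.
        stem = PurePosixPath(url).stem
        names.append(f"{stem}.yaml")
    return names


page_source = (
    'd3.json("../../plots/latest/recent-kubo-versions-over-time.json")'
    ".then(function (fig) {"
)
print(plot_definitions_in_page(page_source))
# → ['recent-kubo-versions-over-time.yaml']
```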

All plot definitions are publicly accessible in the website repository:

  • The plotdefs-website directory contains definitions for plots in the website section of the site. These are templated such that they are executed against each website featured on the site.

  • The plotdefs-ipni directory contains definitions for plots in the IPNI section of this site. These are also templated such that they are executed against each featured IPNI provider.

  • All other plot definitions are contained in the plotdefs directory.

You can inspect the plot definition to see the exact query used to generate the plot on the page. Queries are typically written in SQL to be run against the CMI database. They may be quite complex, as in recent-kubo-versions-over-time.yaml, or straightforward, as in dht-server-peers-current.yaml.
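If you have a local checkout of the website repository, a small helper can locate a plot definition by name. The sketch below assumes the three directories named above sit at the root of the checkout, which is not confirmed here; `find_plot_definition` and `PLOTDEF_DIRS` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Directory names taken from the list above; their location at the
# repository root is an assumption.
PLOTDEF_DIRS = ("plotdefs", "plotdefs-website", "plotdefs-ipni")


def find_plot_definition(repo_root: str, name: str) -> list[Path]:
    """Return every '<name>.yaml' found under the plot definition directories."""
    matches = []
    for directory in PLOTDEF_DIRS:
        base = Path(repo_root) / directory
        if base.is_dir():
            # Search recursively in case definitions are grouped in subfolders.
            matches.extend(sorted(base.rglob(f"{name}.yaml")))
    return matches


# Prints an empty list when run outside a checkout of the repository.
print(find_plot_definition(".", "dht-server-peers-current"))
```

Once the file is located, open it in a text editor to read the query for each dataset.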