Dealbot Methodology
This document describes how dealbot's Data Retention check monitors storage provider (SP) performance in retaining data through Filecoin's Proof of Data Possession (PDP) protocol.
Source code links throughout this document point to the current implementation.
For event and metric definitions used by the dashboard, see Dealbot Events & Metrics.
The Data Retention check monitors storage providers' ability to retain data over time by tracking their PDP challenge performance. Unlike the Data Storage check, which tests the upload and initial verification of new data, the Data Retention check evaluates how well providers maintain previously stored data. (See also Why is this called "data retention" vs. "data availability"?)
Every data retention check cycle, dealbot:
Provider selection: Only providers returned by WalletSdkService.getTestingProviders() are polled, minus any matching the spBlocklists configuration (via isSpBlocked).
Dealbot polls The Graph API endpoint for PDP (Proof of Data Possession) data at a configurable interval. The subgraph indexes on-chain PDP events and provides aggregated statistics about provider challenge performance.
Subgraph repository: FilOzone/pdp-explorer
Subgraph endpoint: Configured via the PDP_SUBGRAPH_ENDPOINT environment variable (see environment-variables.md)
Note: The production subgraph URL is currently being finalized here.
Data retrieved:
From GET_SUBGRAPH_META query:
_meta.block.number - Current indexed block number (recorded in baseline persistence for debugging)
From GET_PROVIDERS_WITH_DATASETS query for each provider:
address - Provider address
totalFaultedPeriods - Cumulative count of faulted proving periods across all data sets (maintained by the subgraph's NextProvingPeriod event handler)
totalProvingPeriods - Cumulative count of all proving periods (successful + faulted) across all data sets
proofSets - Array of proof sets where nextDeadline < currentBlock (overdue deadlines), each containing:
nextDeadline - Next deadline block number
maxProvingPeriod - Maximum proving period duration
Note: The subgraph query uses the field name proofSets, but this refers to "dataSets" in the current codebase. The terminology was updated from "proof set" to "data set", but the subgraph schema retains the old naming.
Source: pdp-subgraph.service.ts (fetchSubgraphMeta, fetchProvidersWithDatasets)
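The per-provider fields listed above can be pictured as a TypeScript shape. This is an illustrative sketch rather than the types used in pdp-subgraph.service.ts: the names RawProviderRecord, ParsedProviderRecord, and parseProviderRecord are hypothetical, and the sketch assumes the numeric fields arrive as decimal strings, which is how The Graph typically serializes large integers.

```typescript
// Hypothetical shapes for illustration; field names follow the list above.
interface RawProofSetRecord {
  nextDeadline: string; // assumed to arrive as a decimal string
  maxProvingPeriod: string; // assumed to arrive as a decimal string
}

interface RawProviderRecord {
  address: string;
  totalFaultedPeriods: string;
  totalProvingPeriods: string;
  proofSets: RawProofSetRecord[];
}

interface ParsedProviderRecord {
  address: string;
  totalFaultedPeriods: bigint;
  totalProvingPeriods: bigint;
  proofSets: { nextDeadline: bigint; maxProvingPeriod: bigint }[];
}

// Convert string-encoded integers to bigint so later arithmetic stays exact.
function parseProviderRecord(raw: RawProviderRecord): ParsedProviderRecord {
  return {
    address: raw.address,
    totalFaultedPeriods: BigInt(raw.totalFaultedPeriods),
    totalProvingPeriods: BigInt(raw.totalProvingPeriods),
    proofSets: raw.proofSets.map((ps) => ({
      nextDeadline: BigInt(ps.nextDeadline),
      maxProvingPeriod: BigInt(ps.maxProvingPeriod),
    })),
  };
}
```

Parsing up front keeps the block-number and period arithmetic in one numeric type.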
Dealbot uses the subgraph-confirmed totals directly for cumulative counters:
confirmedTotalSuccess = totalProvingPeriods - totalFaultedPeriods
Additionally, dealbot calculates estimated overdue periods for real-time monitoring via a separate gauge metric. The value is the sum across all of the provider's overdue proof sets (those where nextDeadline < currentBlock); proof sets with maxProvingPeriod === 0 are skipped:
estimatedOverduePeriods = sum over overdue proofSets of:
(currentBlock - (nextDeadline + 1)) / maxProvingPeriod
This gauge provides immediate visibility into providers that are behind on submitting proofs, even before the subgraph confirms the faults. The gauge naturally resets to 0 when providers submit their proofs and the subgraph catches up.
Key distinction: The overdue gauge is independent of the cumulative counter baselines. It reflects the current state on every poll, while counters track confirmed changes over time.
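The overdue estimate above can be sketched in TypeScript as follows. This is a minimal illustration, not the code in data-retention.service.ts: the names OverdueProofSet and estimateOverduePeriods are hypothetical, and the sketch assumes bigint values with truncating integer division, which may differ from how the service rounds.

```typescript
// Hypothetical shape; field names follow the subgraph fields described earlier.
interface OverdueProofSet {
  nextDeadline: bigint;
  maxProvingPeriod: bigint;
}

// Sum the estimated overdue periods across a provider's proof sets.
// Assumption: bigint division truncates toward zero.
function estimateOverduePeriods(currentBlock: bigint, proofSets: OverdueProofSet[]): bigint {
  let total = 0n;
  for (const ps of proofSets) {
    if (ps.nextDeadline >= currentBlock) continue; // deadline not yet passed
    if (ps.maxProvingPeriod === 0n) continue; // skipped, as described above
    total += (currentBlock - (ps.nextDeadline + 1n)) / ps.maxProvingPeriod;
  }
  return total;
}
```

For example, at block 1000 a proof set with nextDeadline 700 and maxProvingPeriod 100 contributes (1000 - 701) / 100, which truncates to 2 overdue periods under the integer-division assumption.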
To avoid double-counting, dealbot maintains a baseline of cumulative proving-period totals for each provider. On each poll, it computes the period delta since the last poll and converts it to a challenge count using a fixed multiplier (CHALLENGES_PER_PROVING_PERIOD = 5, sourced from the FilecoinWarmStorageService contract):
faultedChallengesDelta = (totalFaultedPeriods - previousTotalFaulted) * 5
successChallengesDelta = (confirmedTotalSuccess - previousTotalSuccess) * 5
Baselines are stored and compared in periods; the dataSetChallengeStatus counter is incremented in challenges.
First-seen provider handling: When a provider has no prior baseline (fresh deploy or newly added provider), dealbot initializes the baseline to the current cumulative totals without emitting any counters. This prevents dumping the provider's full cumulative history as a single metric spike. Metrics for that provider will begin accumulating from the next poll onward.
Negative delta handling: If either challenge delta is negative (due to chain reorgs, subgraph corrections, or data inconsistencies), the baseline is reset to current values without incrementing counters. This prevents stalled metrics.
Baseline persistence: Baselines are persisted to the data_retention_baselines database table after each successful provider update. Each poll reloads persisted baselines before computing deltas, so any worker pod that runs the poll uses the latest shared baseline.
Source: data-retention.service.ts (processProvider)
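The baseline rules above (period deltas, first-seen handling, negative-delta reset) can be sketched as a pure function. This is an illustration under stated assumptions, not the body of processProvider: the names PeriodBaseline, ChallengeDeltaResult, and computeChallengeDeltas are hypothetical, values are assumed to be bigint, and persistence is left out.

```typescript
const CHALLENGES_PER_PROVING_PERIOD = 5n;

// Hypothetical shapes for illustration; baselines are kept in periods.
interface PeriodBaseline {
  faultedPeriods: bigint;
  successPeriods: bigint;
}

interface ChallengeDeltaResult {
  nextBaseline: PeriodBaseline;
  faultedChallenges: bigint; // amount to add to the failure counter
  successChallenges: bigint; // amount to add to the success counter
}

function computeChallengeDeltas(
  previous: PeriodBaseline | undefined,
  totalProvingPeriods: bigint,
  totalFaultedPeriods: bigint,
): ChallengeDeltaResult {
  const current: PeriodBaseline = {
    faultedPeriods: totalFaultedPeriods,
    successPeriods: totalProvingPeriods - totalFaultedPeriods,
  };
  const noEmission: ChallengeDeltaResult = {
    nextBaseline: current,
    faultedChallenges: 0n,
    successChallenges: 0n,
  };

  // First-seen provider: record the baseline and emit nothing.
  if (previous === undefined) return noEmission;

  const faulted = (current.faultedPeriods - previous.faultedPeriods) * CHALLENGES_PER_PROVING_PERIOD;
  const success = (current.successPeriods - previous.successPeriods) * CHALLENGES_PER_PROVING_PERIOD;

  // Negative delta (reorg, subgraph correction): reset the baseline and emit nothing.
  if (faulted < 0n || success < 0n) return noEmission;

  return { nextBaseline: current, faultedChallenges: faulted, successChallenges: success };
}
```

With a baseline of 1000 faulted and 9000 successful periods, subgraph totals of 10010 proving periods and 1005 faulted periods yield +25 faulted and +25 success challenges.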
Only positive deltas increment Prometheus counters. This ensures metrics accurately reflect new challenges without duplication.
For very large deltas (exceeding Number.MAX_SAFE_INTEGER), increments are chunked to prevent precision loss.
Source: data-retention.service.ts (safeIncrementCounter)
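Chunking a large increment might look like the sketch below. It is not the actual safeIncrementCounter: CounterLike is a hypothetical stand-in for a Prometheus counter, incrementInChunks is an illustrative name, and the delta is assumed to be a bigint.

```typescript
// Hypothetical minimal stand-in for a Prometheus counter.
interface CounterLike {
  inc(value: number): void;
}

const MAX_SAFE_CHUNK = BigInt(Number.MAX_SAFE_INTEGER);

// Increment by a bigint delta, splitting it into chunks that each convert
// to a JavaScript number without loss. Zero and negative deltas are ignored.
function incrementInChunks(counter: CounterLike, delta: bigint): void {
  let remaining = delta;
  while (remaining > 0n) {
    const chunk = remaining > MAX_SAFE_CHUNK ? MAX_SAFE_CHUNK : remaining;
    counter.inc(Number(chunk));
    remaining -= chunk;
  }
}
```

Each call to inc receives a value that is exactly representable as a number, so no single increment is rounded during the bigint-to-number conversion.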
To prevent metric inflation across service restarts and worker-pod handoffs, dealbot persists provider baselines to the database.
Storage: Baselines are stored in the data_retention_baselines table with columns for provider_address, faulted_periods, success_periods, last_block_number, and updated_at.
Lifecycle:
dataSetChallengeStatus.
Error handling:
Source: data-retention.service.ts (loadBaselinesFromDb, persistBaseline), CreateDataRetentionBaselines migration
To prevent unbounded memory growth, dealbot periodically removes baseline data for providers no longer in the active testing list.
Cleanup strategy:
Critical safeguard: Baselines are retained if:
providerId
This prevents metric inflation (double-counting) if a provider temporarily goes offline and returns later.
Source: data-retention.service.ts (cleanupStaleProviders)
Providers are processed in batches of 50 to avoid overwhelming the subgraph API and to enable parallel processing within reasonable limits.
Why batching instead of per-provider scheduling?
The data retention check processes all providers in a single scheduled poll rather than creating individual job schedules per provider. This design choice is driven by several technical considerations:
The batched approach stays well within rate limits and reduces infrastructure load.
Source: data-retention.service.ts (MAX_PROVIDER_BATCH_LENGTH)
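Batching of this kind can be sketched as below. The constant name and value come from the text above; toBatches and processInBatches are hypothetical helpers, and the use of Promise.allSettled for within-batch parallelism is an assumption rather than a description of the service.

```typescript
const MAX_PROVIDER_BATCH_LENGTH = 50;

// Split a list into consecutive batches of at most `size` items.
function toBatches<T>(items: readonly T[], size: number = MAX_PROVIDER_BATCH_LENGTH): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Batches run one after another; items inside a batch run concurrently.
// allSettled lets one failing item finish without rejecting the whole batch.
async function processInBatches<T>(
  items: readonly T[],
  handle: (item: T) => Promise<void>,
): Promise<void> {
  for (const batch of toBatches(items)) {
    await Promise.allSettled(batch.map((item) => handle(item)));
  }
}
```

With 120 providers this produces three batches of 50, 50, and 20, so at most 50 providers are in flight at any time.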
The PDP subgraph service enforces Goldsky's public endpoint rate limits:
Rate limiting is enforced client-side to prevent 429 errors.
Source: pdp-subgraph.service.ts (enforceRateLimit)
dataSetChallengeStatus
See dataSetChallengeStatus for more info.
Unit: challenges (period delta × CHALLENGES_PER_PROVING_PERIOD = 5).
value label:
success - challenges in successfully-proven periods (totalProvingPeriods - totalFaultedPeriods)
failure - challenges in faulted periods (totalFaultedPeriods)
Increment behavior:
For deltas exceeding Number.MAX_SAFE_INTEGER, safeIncrementCounter splits the increment into MAX_SAFE_INTEGER-sized chunks to preserve precision.
pdp_provider_estimated_overdue_periods
See pdp_provider_estimated_overdue_periods for more info.
Unit: proving periods (sum across the provider's overdue proof sets).
Emission behavior:
For values exceeding Number.MAX_SAFE_INTEGER, safeSetGauge clamps the gauge to Number.MAX_SAFE_INTEGER and logs an overdue_periods_overflow warning (it does not chunk).
Key environment variables that control data retention check behavior:
| Variable | Required | Default | Description |
|---|---|---|---|
| PDP_SUBGRAPH_ENDPOINT | No | Empty string | The Graph API endpoint for PDP subgraph queries. When empty, data retention checks are disabled. |
Source: app.config.ts
See also: environment-variables.md for the full configuration reference.
The service handles transient failures gracefully:
Validation errors (schema mismatches, type errors) are not retried as they indicate structural issues requiring investigation.
If stale provider cleanup encounters errors (database failures, missing provider info), the cleanup is skipped entirely to preserve metric baselines and prevent double-counting.
Source: data-retention.service.ts (pollDataRetention)
flowchart TD
Start[Scheduled Poll] --> CheckEndpoint{PDP Endpoint<br/>Configured?}
CheckEndpoint -->|No| Skip[Skip Check]
CheckEndpoint -->|Yes| LoadBaselines[Load Baselines from DB]
LoadBaselines --> CheckLoad{Load<br/>Success?}
CheckLoad -->|No| Skip
CheckLoad -->|Yes| FetchMeta[Fetch Subgraph Metadata]
FetchMeta --> GetProviders[Get Active Testing Providers]
GetProviders --> CheckProviders{Providers<br/>Configured?}
CheckProviders -->|No| Skip
CheckProviders -->|Yes| BatchLoop[Process Providers in Batches of 50]
BatchLoop --> FetchData[Fetch Provider Totals from Subgraph]
FetchData --> ProcessParallel[Process Providers in Parallel]
ProcessParallel --> CalcTotals[Compute Success from Confirmed Totals]
CalcTotals --> EmitGauge[Emit Overdue Periods Gauge]
EmitGauge --> CheckBaseline{Has Prior<br/>Baseline?}
CheckBaseline -->|No| InitBaseline[Initialize Baseline. No Metric Emission]
InitBaseline --> PersistBaseline
CheckBaseline -->|Yes| CalcDeltas[Calculate Deltas from Baseline]
CalcDeltas --> CheckDeltas{Any Negative<br/>Delta?}
CheckDeltas -->|Yes| ResetBaseline[Reset Baseline. No Metric Update]
CheckDeltas -->|No| PersistBaseline
ResetBaseline --> PersistBaseline[Persist Baseline to DB]
PersistBaseline --> CheckPersist{Persist<br/>Success?}
CheckPersist -->|Yes| UpdateBaseline[Update Poll-Local Baseline]
UpdateBaseline --> CheckEmit{Positive<br/>Deltas?}
CheckEmit -->|Yes| IncrementMetrics[Increment Prometheus Counters]
CheckEmit -->|No| MoreBatches
CheckPersist -->|No| MarkError[Mark Processing Error]
IncrementMetrics --> MoreBatches{More<br/>Batches?}
MarkError --> MoreBatches
MoreBatches -->|Yes| BatchLoop
MoreBatches -->|No| CheckErrors{Processing<br/>Errors?}
CheckErrors -->|Yes| SkipCleanup[Skip Cleanup]
CheckErrors -->|No| Cleanup[Cleanup Stale Providers]
Cleanup --> FetchStale[Fetch Stale Provider Info from DB]
FetchStale --> RemoveMetrics[Remove Prometheus Metrics]
RemoveMetrics --> DeleteMemory[Delete Baseline from Poll-Local Map]
DeleteMemory --> DeleteDB[Delete Baseline from DB]
SkipCleanup --> End[Complete]
DeleteDB --> End
Skip --> End
Prometheus counters are designed to track cumulative totals that only increase. By tracking deltas, we ensure:
If a chain reorg causes challenge totals to decrease, dealbot detects negative deltas and resets the baseline without incrementing counters. This prevents metric corruption while allowing the system to recover automatically.
Providers may temporarily drop from the active list due to configuration changes, approval status changes, or transient issues. Retaining baselines prevents massive metric inflation (double-counting) when providers return. Cleanup only occurs when we can successfully remove the associated Prometheus metrics.
Baselines are persisted to the database after each successful provider update. At the start of every poll, the service loads all baselines from the database, and delta computation resumes from the last persisted state. This prevents metric inflation when a process restarts or a different worker pod runs the next poll.
Example scenario:
Poll 1 (fresh start, no DB baseline):
Subgraph: faulted=1000, success=9000
No prior baseline → Initialize baseline to 1000, 9000
Emit: nothing (first-seen provider, baseline only)
Poll 2:
Subgraph: faulted=1005, success=9005
Loaded baseline: 1000, 9000 → Period delta: 5, 5 (× 5 challenges/period)
Emit: +25 faulted challenges, +25 success challenges
--- SERVICE RESTARTS ---
Poll 3 (after restart):
Subgraph: faulted=1005, success=9005
DB baseline: 1005, 9005 (loaded) → Period delta: 0, 0
Emit: nothing (no new challenges)
Poll 4:
Subgraph: faulted=1008, success=9012
Loaded baseline: 1005, 9005 → Period delta: 3, 7
Emit: +15 faulted challenges, +35 success challenges
If the database is unavailable on startup, the poll is aborted to prevent emitting inflated values. The service will retry on the next scheduled poll.
Both checks work together to provide comprehensive storage provider quality metrics.
This check relies on the Proof of Data Possession (PDP) protocol, which monitors data retention over time. We use "data retention" to be precise about the nature of the check.
Data retention = “How long do we keep it?”
Retention is about preservation over time:
Example: “Keep audit logs for 7 years” is a retention requirement even if nobody reads them most days.
Data availability = “Can I access it when I need it?”
Availability is about accessibility and uptime:
Example: “Users must be able to fetch their profile data 99.9% of the time” is an availability requirement even if you only retain profiles while the account exists.
Why the distinction matters