Metrics Registry
The Metrics Registry (src/registries/metrics/metric_registry.py) is the plugin system for custom telemetry. FedPilot treats observability as a first-class citizen; all built-in metrics (round, memory, communication, convergence, throughput, system, availability, performance) are registered using the exact same pattern exposed to researchers.
@register_metric — Custom Telemetry
Registers a new metric class that the MetricsCollector will automatically discover and invoke during each federated round.

    from src.registries.metrics.metric_registry import register_metric
    from src.registries.metrics.base_metric import BaseMetric


    @register_metric("gradient_variance")
    class GradientVarianceMetric(BaseMetric):
        def collect(self, **kwargs) -> dict:
            """
            Compute your custom metric values.
            Returns a flat dictionary that will be appended to the CSV row.
            """
            # kwargs contains round state: 'updates', 'round_num', 'node_id', etc.
            updates = kwargs.get("updates", [])
            if not updates:
                return {"gradient_variance": 0.0}
            # compute_gradient_variance is your own helper; it is not defined here.
            variance = compute_gradient_variance(updates)
            return {"gradient_variance": variance}

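The example above calls `compute_gradient_variance` without defining it. One possible definition is sketched below, assuming each entry in `updates` is a flat sequence of floats of equal length (the real structure of `updates` is not specified here). It computes the population variance of every coordinate across clients and averages the results.

```python
from statistics import fmean, pvariance
from typing import Sequence


def compute_gradient_variance(updates: Sequence[Sequence[float]]) -> float:
    """Mean over coordinates of the population variance across clients."""
    if len(updates) < 2:
        return 0.0  # variance across fewer than two clients is taken as zero
    # zip(*updates) regroups the values so that each tuple holds one
    # coordinate as seen by every client. Unequal lengths are truncated
    # to the shortest update.
    per_coordinate = [pvariance(column) for column in zip(*updates)]
    if not per_coordinate:
        return 0.0  # all updates were empty
    return fmean(per_coordinate)
```

If your updates are tensors or nested state dictionaries, flatten them first or replace this helper with a framework-specific equivalent.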
Enabling Your Custom Metric
Once registered, enable it in config.yaml under the metrics block:

    metrics:
      gradient_variance: true  # Your custom metric
      round: true              # Built-in
      memory: true             # Built-in

When the metric is enabled, the dictionary returned by your collect() method is fed into the telemetry pipeline automatically: it is written to the local CSV logs, exported via OpenTelemetry, and made available to the Streamlit dashboard.
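As a rough illustration of the CSV step, the sketch below merges the flat dictionaries of all enabled metrics into a single row and writes it with the standard `csv` module. The `build_row` function, the sample values, and the column names (`round_num`, `test_loss`, `peak_ram_mb`) are assumptions made for this example, not FedPilot's actual schema.

```python
import csv
import io

# Hypothetical stand-in for the parsed `metrics` block of config.yaml.
metrics_config = {"gradient_variance": True, "round": True, "memory": False}

# Hypothetical per-metric outputs for one round, keyed by config key.
collected = {
    "gradient_variance": {"gradient_variance": 0.5},
    "round": {"round_num": 3, "test_loss": 0.42},
    "memory": {"peak_ram_mb": 812.0},
}


def build_row(config: dict, outputs: dict) -> dict:
    """Merge the flat dicts of every metric whose config flag is true."""
    row: dict = {}
    for key, enabled in config.items():
        if enabled and key in outputs:
            row.update(outputs[key])
    return row


row = build_row(metrics_config, collected)

# Write the header and the single row to an in-memory CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
```

Returning a flat dictionary with unique keys matters in a scheme like this: nested values do not map onto CSV columns, and duplicate keys from different metrics would overwrite one another.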
Built-In Metric Categories
The framework ships with the following metric categories, each living in its own sub-directory under src/registries/metrics/:
| Metric Group | Config Key | Description | Key Telemetry Points |
|---|---|---|---|
| Round Summary | round | Model convergence | Global accuracy, test loss, train loss |
| Convergence | convergence | Stability | Loss curve delta, weight divergence |
| Communication | communication | Network cost | Bytes sent/received, payload sizes |
| Memory | memory | RAM/GPU pressure | Peak RAM, Peak VRAM per node |
| System | system | Hardware usage | CPU % utilization, GPU % utilization |
| Throughput | throughput | Training speed | Samples processed per second |
| Performance | performance | Timings | Setup time, aggregation time, train time |
| Availability | availability | Node health | Actor liveness, drop-out rates |
All collected metrics are routed through the MetricsActor, so file and network I/O does not block the main training loop.
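The idea of keeping I/O off the training loop can be sketched with a queue and a background worker. This is only an analogy built on the Python standard library, assuming nothing about how the MetricsActor is implemented; `ToyMetricsSink` and its methods are hypothetical.

```python
import queue
import threading


class ToyMetricsSink:
    """Hypothetical stand-in: rows are queued and a worker does the I/O."""

    def __init__(self) -> None:
        self._queue: queue.Queue = queue.Queue()
        self.written: list = []  # stands in for a CSV file or exporter
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, row: dict) -> None:
        """Called from the training loop; returns without waiting for I/O."""
        self._queue.put(row)

    def _drain(self) -> None:
        while True:
            row = self._queue.get()
            if row is None:  # sentinel: stop the worker
                break
            self.written.append(row)  # slow I/O would happen here

    def close(self) -> None:
        """Flush pending rows and stop the worker."""
        self._queue.put(None)
        self._worker.join()


sink = ToyMetricsSink()
for round_num in range(3):
    sink.submit({"round_num": round_num})
sink.close()
```

The training loop only pays the cost of `submit`, which enqueues the row; the writes themselves happen on the worker.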
See also: Metrics Exporting · Streamlit Dashboard