Global Object Store
Transmitting deep neural network weights over a network is expensive. A ResNet-50 state dict is ~100MB. Passing 100MB dictionaries through Ray’s message queue on every round is both slow (serialization overhead) and dangerous (memory bloat that crashes actors).
FedPilot solves this with the GlobalObjectStore (src/core/communication/global_object_store.py) — a Ray Actor that acts as a distributed in-memory key-value store, similar to Redis but built natively on Ray shared memory.
Design Philosophy: Decouple Signalling from Data
The core insight is that the message bus should carry only lightweight keys, while the actual model weights travel through shared memory:
Message body = "node_3_round_5" ← 20 bytes, routed through TopologyManager
GlobalObjectStore["node_3_round_5"] ← 100MB, accessed directly via Ray shared memory
This means the TopologyManager (and HybridTopologyManager) never touch the weights — they move only the references. The bandwidth consumed by the messaging layer stays constant regardless of model size.
How It Works
sequenceDiagram
participant Node as FederatedNode
participant GOS as GlobalObjectStore (Ray Actor)
participant TM as TopologyManager
participant Peer as Neighbor Node
Node->>Node: pickle(model.state_dict()) → bytes
Node->>GOS: put("node_3_round_5", pickled_bytes)
GOS-->>Node: True (stored)
Node->>TM: publish(Message{body: "node_3_round_5"})
TM->>Peer: deliver message
Peer->>GOS: get("node_3_round_5")
GOS-->>Peer: pickled_bytes
Peer->>Peer: pickle.loads(raw) → state_dict → aggregate
Peer->>GOS: delete("node_3_round_5")
Critical API Rule: Pre-Serialize Before Storing
The GlobalObjectStore is strict about types. Values must be pre-serialized as bytes before storage. The store does not call pickle.dumps() for you.
import pickle
# ✅ CORRECT — pre-pickle before storing
pickled = pickle.dumps(self.model.state_dict())
ray.get(self.global_object_store.put.remote("node_3_round_5", pickled))
# Retrieve and unpickle
raw = ray.get(self.global_object_store.get.remote("node_3_round_5"))
state_dict = pickle.loads(raw)
# ❌ WRONG — passing a raw dict raises TypeError inside the actor
ray.get(self.global_object_store.put.remote("key", self.model.state_dict()))
All methods must be called with
.remote()since the store is a Ray Actor running concurrently in its own process.
Full API Reference
| Method | Signature | Description |
|---|---|---|
put | put(key: str, value: bytes) → bool | Store a pickled object; returns True on success |
get | get(key: str) → bytes \| None | Retrieve pickled bytes, or None if the key is missing |
delete | delete(key: str) → bool | Remove an entry; returns True if it existed |
keys | keys() → List[str] | List all active keys currently in the store |
size | size() → int | Number of entries currently stored |
Key Naming Convention
FedPilot uses a consistent naming scheme for object store keys:
{fed_id}/{node_id}/round_{round_number}
For example: "my_experiment/client_7/round_42". This namespacing prevents key collisions when multiple federations run concurrently on the same Ray cluster.
ICRF Note
In multi-cluster (ICRF) mode, each Ray cluster runs its own GlobalObjectStore instance. Weights that cross cluster boundaries travel via the HybridClusterGateway HTTP endpoint (pickled in the message payload), not through the shared object store. The object store is an intra-cluster optimization only.
See also: Topology Manager · Inter-Cluster Ray Fabric (ICRF)