虽然网上已经有了很多关于Dirichlet分布进行数据划分的原理和方法介绍,但是整个完整的联邦学习过程还是少有人分享。今天就从零开始实现
加载FashionMNIST数据集
import torch
from torchvision import datasets, transforms
# 定义数据转换
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5,), (0.5,))
])
# 加载训练和测试数据集
train_dataset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)
定义Dirichlet分布的划分函数
这里的写法是其中一种,也可以参考其它大神的写法。
具体Dirichlet
划分的原理也可以参考下面的博客:
联邦学习:按Dirichlet分布划分Non-IID样本 - orion-orion - 博客园 (cnblogs.com)
import numpy as np
def dirichlet_distribution_noniid(dataset, num_clients, alpha):
# 获取每个类的索引
class_indices = [[] for _ in range(10)]
for idx, (image, label) in enumerate(dataset):
class_indices[label].append(idx)
# 使用Dirichlet分布进行数据划分
client_indices = [[] for _ in range(num_clients)]
for class_idx in class_indices:
np.random.shuffle(class_idx)
proportions = np.random.dirichlet([alpha] * num_clients)
proportions = (np.cumsum(proportions) * len(class_idx)).astype(int)[:-1]
client_split = np.split(class_idx, proportions)
for client_idx, client_split_indices in enumerate(client_split):
client_indices[client_idx].extend(client_split_indices)
return client_indices
将数据集划分给各客户端
这里的代码操作核心在于,对数据加载器DataLoader
中的Subset
的理解,这个函数是根据索引将数据集划分为子数据集,以前我知道它是在做什么,但是一直不太明白用法,最终在ChatGPT的帮助下完成了:
num_clients = 10
alpha = 0.5 #non-iid程度的超参数,我喜欢用0.5和0.3
client_indices = dirichlet_distribution_noniid(train_dataset, num_clients, alpha)
# 创建客户端数据加载器
from torch.utils.data import DataLoader, Subset
client_loaders = [DataLoader(Subset(train_dataset, indices), batch_size=32, shuffle=True) for indices in client_indices]
定义模型、训练函数和测试函数
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
class SimpleNN(nn.Module):
def __init__(self):
super(SimpleNN, self).__init__()
self.flatten = nn.Flatten()
self.fc1 = nn.Linear(28*28, 128)
self.fc2 = nn.Linear(128, 10)
def forward(self, x):
x = self.flatten(x)
x = torch.relu(self.fc1(x))
x = self.fc2(x)
return x
def train(model, train_loader, criterion, optimizer, device, epochs=5):
model.train()
model.to(device)
for epoch in range(epochs):
running_loss = 0.0
for images, labels in train_loader:
images, labels = images.to(device), labels.to(device)
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
running_loss += loss.item()
print(f"Epoch [{epoch+1}/{epochs}], Loss: {running_loss/len(train_loader):.4f}")
def test(model, test_loader, device):
model.eval()
model.to(device)
correct = 0
total = 0
with torch.no_grad():
for images, labels in test_loader:
images, labels = images.to(device), labels.to(device)
outputs = model(images)
_, predicted = torch.max(outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
accuracy = correct / total
return accuracy
进行训练并记录测试准确度
# 选择设备
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# 创建模型和损失函数
model = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
# 训练和测试数据加载器
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
# 记录每轮测试准确度
test_accuracies = []
# 在每个客户端上进行训练并测试
for i, client_loader in enumerate(client_loaders):
print(f"Training on client {i+1}")
train(model, client_loader, criterion, optimizer, device)
accuracy = test(model, test_loader, device)
test_accuracies.append(accuracy)
print(f"Test Accuracy after client {i+1}: {accuracy:.4f}")
# 绘制测试准确度变化图
plt.figure(figsize=(10, 5))
plt.plot(range(1, num_clients + 1), test_accuracies, marker='o')
plt.title('Test Accuracy after Training on Each Client')
plt.xlabel('Client')
plt.ylabel('Test Accuracy')
plt.ylim(0, 1)
plt.grid(True)
plt.show()
一些踩过的坑
Expected more than 1 value per channel when training, got input size torch.Size
解决方案
这里可能是当UE数量让数据集没法整除的时候,出现了多余的batch。
设置 batch_size>1, 且 drop_last=True
DataLoader(train_set, batch_size=args.train_batch_size,
num_workers=args.num_workers, shuffle=(train_sampler is None),
drop_last=True, sampler = train_sampler)
RuntimeError: output with shape [1, 28, 28] doesn’t match the broadcast shape [3, 28, 28]
错误是因为图片格式是灰度图只有一个channel,需要变成RGB图才可以,所以需要在对图片的处理transforms
里面修改:
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Lambda(lambda x: x.repeat(3,1,1)),# 增加这一行
transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))
])
运行结果
将以上的代码拼接起来,就能够正常跑起来,我也已经在自己的电脑上验证过了。
当然了,上面画的是一次epoch的各个client的准确度,进行多次epoch的训练可以自己再修改。