Hands-On Kaggle House Price Prediction: A Complete Code Walkthrough in PyTorch

Introduction

In hands-on data science and machine learning, Kaggle competitions are an excellent way to test and sharpen your skills. This article walks through the classic house price prediction competition, building a regression model with PyTorch. We cover the full pipeline: data loading, preprocessing, model construction, training and validation, and the final submission.

This tutorial is a good fit if you want to:

  • Master a real-world data-processing pipeline
  • Learn how PyTorch applies to regression tasks
  • Understand the complete Kaggle competition workflow as a beginner

1. Dataset Overview

The dataset comes from Kaggle's Housing Prices Competition and contains:

  • Training set: 1460 samples, 79 features + 1 target variable (SalePrice)
  • Test set: 1459 samples, 79 features (house prices to be predicted)

The features are varied:

  • Numeric: year built, living area, and so on
  • Categorical: neighborhood, roof style, and so on
  • Missing values are present and need special handling

2. Environment Setup and Data Loading

First install the required packages:

pip install pandas torch numpy matplotlib

2.1 Import Libraries and Load the Data

import os
import pandas as pd
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, TensorDataset

# Fix random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Pick the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Load the training and test data:

# Assumes the data files have been downloaded locally
train_data = pd.read_csv('kaggle_house_pred_train.csv')
test_data = pd.read_csv('kaggle_house_pred_test.csv')

print(f"Training set shape: {train_data.shape}")
print(f"Test set shape: {test_data.shape}")

# Peek at the first few rows
print("\nFirst 5 rows of the training set:")
print(train_data.head())

print("\nFirst 5 rows of the test set:")
print(test_data.head())

3. Data Preprocessing

This is the most critical step and has a direct impact on model performance.

3.1 Feature Analysis and Cleaning

# Drop the Id column (it carries no predictive information)
train_features = train_data.iloc[:, 1:-1]  # drop Id and SalePrice
test_features = test_data.iloc[:, 1:]      # drop Id

# Concatenate train and test features so both are processed identically
all_features = pd.concat([train_features, test_features], axis=0, ignore_index=True)

print(f"Combined feature shape: {all_features.shape}")
print(f"Missing-value counts:\n{all_features.isnull().sum().sort_values(ascending=False).head(10)}")

3.2 Numeric Feature Processing

Standardize the numeric features (zero mean, unit variance):

# Identify numeric and categorical features
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
categorical_features = all_features.dtypes[all_features.dtypes == 'object'].index

print(f"Number of numeric features: {len(numeric_features)}")
print(f"Number of categorical features: {len(categorical_features)}")

# Standardize numeric features
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std() + 1e-8)  # small epsilon avoids division by zero
)

# Fill missing numeric values with 0 (the mean after standardization)
all_features[numeric_features] = all_features[numeric_features].fillna(0)
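A quick sanity check of the same standardize-then-fill recipe on a toy column (the column name comes from the dataset; the values are made up):

```python
import numpy as np
import pandas as pd

# Toy column with one missing value
toy = pd.DataFrame({'GrLivArea': [1000.0, 1500.0, 2000.0, np.nan]})

# Standardize (NaNs are skipped by mean/std), then fill missing with 0
toy['GrLivArea'] = (toy['GrLivArea'] - toy['GrLivArea'].mean()) / (toy['GrLivArea'].std() + 1e-8)
toy['GrLivArea'] = toy['GrLivArea'].fillna(0)

print(toy['GrLivArea'].tolist())
```

After this, the non-missing rows average to zero and the missing row sits exactly at the (standardized) mean, so it no longer pulls the model in any direction.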

3.3 Categorical Feature Processing

Apply one-hot encoding to the categorical features:

# One-hot encode categorical features (dummy_na=True adds a missing-value indicator)
all_features = pd.get_dummies(all_features, dummy_na=True)

# Newer pandas emits boolean dummy columns; cast everything to float
# so the mixed frame can be converted to a single float tensor later
all_features = all_features.astype(np.float32)

print(f"Feature count after one-hot encoding: {all_features.shape[1]}")
print(f"Original feature count: {len(numeric_features) + len(categorical_features)}")
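On a toy column you can see exactly what `dummy_na=True` produces: one indicator per category plus one extra column for missing values (the rows here are made up):

```python
import pandas as pd

# Toy categorical column with a missing entry
toy = pd.DataFrame({'RoofStyle': ['Gable', 'Hip', None]})
encoded = pd.get_dummies(toy, dummy_na=True)

# One column per observed category, plus one for NaN
print(list(encoded.columns))
```

The missing row gets a 1 only in the NaN indicator column, so "missingness" itself becomes a feature the model can use.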

3.4 Convert to PyTorch Tensors

# Split back into training and test sets
n_train = train_data.shape[0]
train_features_tensor = torch.tensor(
    all_features[:n_train].values, dtype=torch.float32, device=device
)
test_features_tensor = torch.tensor(
    all_features[n_train:].values, dtype=torch.float32, device=device
)

# Log-transform the labels (house prices) to tame the skewed distribution
train_labels = torch.tensor(
    np.log1p(train_data['SalePrice'].values), dtype=torch.float32, device=device
).reshape(-1, 1)

print(f"Training feature tensor shape: {train_features_tensor.shape}")
print(f"Training label tensor shape: {train_labels.shape}")
print(f"Test feature tensor shape: {test_features_tensor.shape}")
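Why log1p: it compresses the right-skewed price distribution, and expm1 (used later to turn predictions back into prices) inverts it exactly. A minimal round-trip check, with made-up prices:

```python
import numpy as np

prices = np.array([50_000.0, 180_000.0, 755_000.0])  # made-up prices
logged = np.log1p(prices)      # what the model trains against
recovered = np.expm1(logged)   # the inverse, used for the submission

print(np.allclose(recovered, prices))
```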

4. Model Construction

4.1 Define the Neural Network

We build a simple but effective feed-forward network:

class HousePriceNet(nn.Module):
    """
    House price prediction network:
    several fully connected layers with BatchNorm, ReLU, and Dropout.
    """
    def __init__(self, input_dim):
        super(HousePriceNet, self).__init__()

        # Layer 1: input -> hidden 1
        self.fc1 = nn.Linear(input_dim, 256)
        self.bn1 = nn.BatchNorm1d(256)   # batch norm speeds up training
        self.dropout1 = nn.Dropout(0.3)  # dropout curbs overfitting

        # Layer 2: hidden 1 -> hidden 2
        self.fc2 = nn.Linear(256, 128)
        self.bn2 = nn.BatchNorm1d(128)
        self.dropout2 = nn.Dropout(0.3)

        # Layer 3: hidden 2 -> hidden 3
        self.fc3 = nn.Linear(128, 64)
        self.bn3 = nn.BatchNorm1d(64)
        self.dropout3 = nn.Dropout(0.2)

        # Output layer: hidden 3 -> a single value (the log price)
        self.output = nn.Linear(64, 1)

        # Activation
        self.relu = nn.ReLU()

    def forward(self, x):
        """Forward pass."""
        # Layer 1
        x = self.fc1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.dropout1(x)

        # Layer 2
        x = self.fc2(x)
        x = self.bn2(x)
        x = self.relu(x)
        x = self.dropout2(x)

        # Layer 3
        x = self.fc3(x)
        x = self.bn3(x)
        x = self.relu(x)
        x = self.dropout3(x)

        # Output layer (no activation, since this is regression)
        x = self.output(x)

        return x

# Instantiate the model
input_dim = train_features_tensor.shape[1]
model = HousePriceNet(input_dim).to(device)

print("Model architecture:")
print(model)
print(f"\nParameter count: {sum(p.numel() for p in model.parameters()):,}")

4.2 Define the Loss Function and Optimizer

Because the competition is scored with root-mean-squared error between log prices (log-RMSE), we can simply apply MSE loss to the log-transformed labels:

# Loss: mean squared error (the labels are already log-transformed)
criterion = nn.MSELoss()

# Optimizer: Adam with weight decay (L2 regularization)
learning_rate = 0.001
weight_decay = 0.01
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=learning_rate,
    weight_decay=weight_decay
)

# LR scheduler: halve the learning rate when validation loss plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=10
)
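To make the connection to the leaderboard metric explicit: since the labels are already log-transformed, the log-RMSE the competition reports is just the square root of the MSE our loss computes. A minimal check with made-up prices (the metric strictly uses log rather than log1p, but at house-price magnitudes the difference is negligible):

```python
import numpy as np

y_true = np.array([200_000.0, 350_000.0])  # made-up ground truth
y_pred = np.array([210_000.0, 330_000.0])  # made-up predictions

# What the training loop minimizes: MSE between log1p prices
mse_on_logs = np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)

# Leaderboard-style log-RMSE, computed directly
log_rmse = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(np.isclose(log_rmse, np.sqrt(mse_on_logs)))
```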

5. Training and Validation

5.1 K-Fold Cross-Validation

Use K-fold cross-validation to estimate model performance and select hyperparameters:

def k_fold_cross_validation(k, X, y, model_class, input_dim, epochs, batch_size):
    """
    K-fold cross-validation.

    Args:
        k: number of folds
        X: feature tensor
        y: label tensor
        model_class: model class
        input_dim: input dimension
        epochs: number of training epochs
        batch_size: batch size

    Returns:
        train_losses: final training loss of each fold
        val_losses: final validation loss of each fold
    """
    fold_size = X.shape[0] // k  # note: any remainder samples are never used for validation
    train_losses = []
    val_losses = []

    print(f"\nStarting {k}-fold cross-validation...")

    for i in range(k):
        print(f"\n{'='*50}")
        print(f"Fold {i+1}/{k}")
        print(f"{'='*50}")

        # Split into training and validation sets
        val_start = i * fold_size
        val_end = (i + 1) * fold_size

        # Validation set
        X_val = X[val_start:val_end]
        y_val = y[val_start:val_end]

        # Training set (everything outside the current fold)
        X_train = torch.cat([X[:val_start], X[val_end:]], dim=0)
        y_train = torch.cat([y[:val_start], y[val_end:]], dim=0)

        # Data loaders
        train_dataset = TensorDataset(X_train, y_train)
        val_dataset = TensorDataset(X_val, y_val)

        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
        val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

        # Fresh model, loss, optimizer, and scheduler per fold
        model = model_class(input_dim).to(device)
        criterion = nn.MSELoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=10)

        # Training loop
        best_val_loss = float('inf')
        train_loss_history = []
        val_loss_history = []

        for epoch in range(epochs):
            # Training phase
            model.train()
            train_loss = 0.0

            for batch_X, batch_y in train_loader:
                batch_X, batch_y = batch_X.to(device), batch_y.to(device)

                # Forward pass
                optimizer.zero_grad()
                predictions = model(batch_X)
                loss = criterion(predictions, batch_y)

                # Backward pass
                loss.backward()
                optimizer.step()

                train_loss += loss.item()

            train_loss /= len(train_loader)
            train_loss_history.append(train_loss)

            # Validation phase
            model.eval()
            val_loss = 0.0

            with torch.no_grad():
                for batch_X, batch_y in val_loader:
                    batch_X, batch_y = batch_X.to(device), batch_y.to(device)
                    predictions = model(batch_X)
                    loss = criterion(predictions, batch_y)
                    val_loss += loss.item()

            val_loss /= len(val_loader)
            val_loss_history.append(val_loss)

            # Learning-rate adjustment
            scheduler.step(val_loss)

            # Print progress every 10 epochs
            if (epoch + 1) % 10 == 0 or epoch == 0:
                print(f"Epoch [{epoch+1}/{epochs}], "
                      f"Train Loss: {train_loss:.6f}, "
                      f"Val Loss: {val_loss:.6f}")

            # Track the best validation loss
            if val_loss < best_val_loss:
                best_val_loss = val_loss

        train_losses.append(train_loss_history[-1])
        val_losses.append(val_loss_history[-1])
        print(f"Fold {i+1} done - final train loss: {train_loss_history[-1]:.6f}, "
              f"val loss: {val_loss_history[-1]:.6f}")

    # Average losses across folds
    avg_train_loss = np.mean(train_losses)
    avg_val_loss = np.mean(val_losses)

    print(f"\n{'='*50}")
    print(f"{k}-fold cross-validation complete!")
    print(f"Average train loss: {avg_train_loss:.6f}")
    print(f"Average val loss: {avg_val_loss:.6f}")
    print(f"{'='*50}")

    return train_losses, val_losses

Run the K-fold cross-validation:

# Hyperparameters
k_folds = 5
num_epochs = 100
batch_size = 64

# Run K-fold cross-validation
train_losses, val_losses = k_fold_cross_validation(
    k=k_folds,
    X=train_features_tensor,
    y=train_labels,
    model_class=HousePriceNet,
    input_dim=input_dim,
    epochs=num_epochs,
    batch_size=batch_size
)

5.2 Visualizing Training

def plot_training_history(train_losses, val_losses, k_folds):
    """Plot the per-fold loss curves."""
    plt.figure(figsize=(12, 6))

    # One subplot per fold (works for up to 6 folds)
    for i in range(k_folds):
        plt.subplot(2, 3, i + 1)
        epochs_range = range(1, num_epochs + 1)

        # Simplified: only each fold's final loss was kept, so plot it as a
        # flat line; a real run should record the loss of every epoch
        plt.plot(epochs_range, [train_losses[i]] * num_epochs, 'b-', label='Train Loss', alpha=0.7)
        plt.plot(epochs_range, [val_losses[i]] * num_epochs, 'r-', label='Val Loss', alpha=0.7)

        plt.title(f'Fold {i+1}')
        plt.xlabel('Epoch')
        plt.ylabel('Loss (MSE on log scale)')
        plt.legend()
        plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('training_history.png', dpi=300)
    plt.show()

# Plot the (simplified) training history
plot_training_history(train_losses, val_losses, k_folds)

6. Final Model Training and Prediction

Train the final model on all of the training data, then predict on the test set:

def train_final_model(X, y, model_class, input_dim, epochs, batch_size):
    """
    Train the final model on the full training set.

    Args:
        X: all training features
        y: all training labels
        model_class: model class
        input_dim: input dimension
        epochs: number of training epochs
        batch_size: batch size

    Returns:
        model: the trained model
        train_loss_history: training loss per epoch
    """
    print("\nTraining the final model (on all training data)...")

    # Data loader
    dataset = TensorDataset(X, y)
    train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    # Model, loss, optimizer, scheduler
    model = model_class(input_dim).to(device)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=10)

    # Training loop
    train_loss_history = []

    for epoch in range(epochs):
        model.train()
        epoch_loss = 0.0

        for batch_X, batch_y in train_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)

            # Forward pass
            optimizer.zero_grad()
            predictions = model(batch_X)
            loss = criterion(predictions, batch_y)

            # Backward pass
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()

        avg_loss = epoch_loss / len(train_loader)
        train_loss_history.append(avg_loss)

        # No validation split here, so step the scheduler on training loss
        scheduler.step(avg_loss)

        # Print progress
        if (epoch + 1) % 10 == 0 or epoch == 0:
            print(f"Epoch [{epoch+1}/{epochs}], Loss: {avg_loss:.6f}")

    print("\nFinal model training complete!")
    print(f"Final training loss: {train_loss_history[-1]:.6f}")

    return model, train_loss_history

# Train the final model
final_model, final_train_history = train_final_model(
    X=train_features_tensor,
    y=train_labels,
    model_class=HousePriceNet,
    input_dim=input_dim,
    epochs=num_epochs,
    batch_size=batch_size
)

6.2 Generate Predictions and Submit

def generate_predictions(model, test_features, original_test_data):
    """
    Predict on the test set and save submission.csv.

    Args:
        model: the trained model
        test_features: test feature tensor
        original_test_data: the original test DataFrame (for the Id column)
    """
    print("\nGenerating test-set predictions...")

    model.eval()

    with torch.no_grad():
        predictions = model(test_features)

    # Convert log predictions back to the original price scale
    predictions_np = predictions.cpu().numpy()
    predicted_prices = np.expm1(predictions_np)  # expm1(x) = exp(x) - 1, the inverse of log1p

    # Build the submission file
    submission = pd.DataFrame({
        'Id': original_test_data['Id'],
        'SalePrice': predicted_prices.reshape(-1)
    })

    # Save to CSV
    submission.to_csv('submission.csv', index=False)

    print(f"Prediction complete! {len(submission)} rows")
    print("Predicted price statistics:")
    print(submission['SalePrice'].describe())
    print("\nSubmission saved as 'submission.csv'")

    # Show the first few predictions
    print("\nFirst 5 predictions:")
    print(submission.head())

    return submission

# Generate predictions
submission_df = generate_predictions(
    model=final_model,
    test_features=test_features_tensor,
    original_test_data=test_data
)

7. Full Code Listing

The complete, runnable script:

"""
Kaggle house price prediction - complete PyTorch implementation
Author: AI Assistant
Date: 2024
"""

import os
import pandas as pd
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, TensorDataset
import warnings
warnings.filterwarnings('ignore')

# ==================== Configuration ====================
# Random seeds
torch.manual_seed(42)
np.random.seed(42)

# Hyperparameters
CONFIG = {
    'k_folds': 5,
    'num_epochs': 100,
    'batch_size': 64,
    'learning_rate': 0.001,
    'weight_decay': 0.01,
    'hidden_dims': [256, 128, 64],
    'dropout_rates': [0.3, 0.3, 0.2]
}

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# ==================== Data loading ====================
def load_data(train_path='kaggle_house_pred_train.csv', test_path='kaggle_house_pred_test.csv'):
    """Load the training and test data."""
    train_data = pd.read_csv(train_path)
    test_data = pd.read_csv(test_path)

    print(f"Training set: {train_data.shape}")
    print(f"Test set: {test_data.shape}")

    return train_data, test_data

# ==================== Preprocessing ====================
def preprocess_data(train_data, test_data):
    """Preprocess the data: standardization and one-hot encoding."""

    # Separate features and labels
    train_features = train_data.iloc[:, 1:-1]  # drop Id and SalePrice
    test_features = test_data.iloc[:, 1:]      # drop Id
    train_labels = train_data['SalePrice']

    # Concatenate so train and test are processed identically
    all_features = pd.concat([train_features, test_features], axis=0, ignore_index=True)

    # Identify numeric and categorical features
    numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
    categorical_features = all_features.dtypes[all_features.dtypes == 'object'].index

    print(f"Numeric features: {len(numeric_features)}, categorical features: {len(categorical_features)}")

    # Standardize numeric features
    all_features[numeric_features] = all_features[numeric_features].apply(
        lambda x: (x - x.mean()) / (x.std() + 1e-8)
    )
    all_features[numeric_features] = all_features[numeric_features].fillna(0)

    # One-hot encode categorical features, then cast bool dummies to float
    all_features = pd.get_dummies(all_features, dummy_na=True)
    all_features = all_features.astype(np.float32)

    print(f"Total features after processing: {all_features.shape[1]}")

    # Split back into train and test
    n_train = train_data.shape[0]
    train_features_processed = all_features[:n_train]
    test_features_processed = all_features[n_train:]

    # Convert to tensors
    train_features_tensor = torch.tensor(
        train_features_processed.values, dtype=torch.float32, device=device
    )
    test_features_tensor = torch.tensor(
        test_features_processed.values, dtype=torch.float32, device=device
    )
    train_labels_tensor = torch.tensor(
        np.log1p(train_labels.values), dtype=torch.float32, device=device
    ).reshape(-1, 1)

    return train_features_tensor, train_labels_tensor, test_features_tensor, test_data

# ==================== Model ====================
class HousePriceNet(nn.Module):
    """House price prediction network."""

    def __init__(self, input_dim, hidden_dims=None, dropout_rates=None):
        super(HousePriceNet, self).__init__()

        if hidden_dims is None:
            hidden_dims = CONFIG['hidden_dims']
        if dropout_rates is None:
            dropout_rates = CONFIG['dropout_rates']

        layers = []
        prev_dim = input_dim

        # Hidden layers
        for hidden_dim, dropout in zip(hidden_dims, dropout_rates):
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout)
            ])
            prev_dim = hidden_dim

        # Output layer
        layers.append(nn.Linear(prev_dim, 1))

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

# ==================== Training ====================
def train_model(model, train_loader, val_loader, epochs, lr, weight_decay):
    """Train a model; val_loader may be None when training on the full data."""

    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=10)

    train_losses = []
    val_losses = []

    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0.0

        for batch_X, batch_y in train_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)

            optimizer.zero_grad()
            predictions = model(batch_X)
            loss = criterion(predictions, batch_y)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()

        train_loss /= len(train_loader)
        train_losses.append(train_loss)

        # Validation phase (skipped when no validation loader is given)
        if val_loader is not None:
            model.eval()
            val_loss = 0.0

            with torch.no_grad():
                for batch_X, batch_y in val_loader:
                    batch_X, batch_y = batch_X.to(device), batch_y.to(device)
                    predictions = model(batch_X)
                    loss = criterion(predictions, batch_y)
                    val_loss += loss.item()

            val_loss /= len(val_loader)
            val_losses.append(val_loss)
            scheduler.step(val_loss)
        else:
            scheduler.step(train_loss)

        if (epoch + 1) % 10 == 0 or epoch == 0:
            val_str = f", Val Loss: {val_losses[-1]:.6f}" if val_loader is not None else ""
            print(f"Epoch [{epoch+1}/{epochs}], Train Loss: {train_loss:.6f}{val_str}")

    return train_losses, val_losses

# ==================== K-fold cross-validation ====================
def k_fold_validation(X, y, k, epochs, batch_size):
    """K-fold cross-validation."""

    fold_size = X.shape[0] // k
    val_losses = []

    print(f"\nStarting {k}-fold cross-validation...")

    for i in range(k):
        print(f"\n--- Fold {i+1}/{k} ---")

        # Split the data
        val_start = i * fold_size
        val_end = (i + 1) * fold_size

        X_val = X[val_start:val_end]
        y_val = y[val_start:val_end]
        X_train = torch.cat([X[:val_start], X[val_end:]], dim=0)
        y_train = torch.cat([y[:val_start], y[val_end:]], dim=0)

        # Data loaders
        train_dataset = TensorDataset(X_train, y_train)
        val_dataset = TensorDataset(X_val, y_val)
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
        val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

        # Fresh model per fold
        model = HousePriceNet(input_dim=X_train.shape[1]).to(device)

        # Train
        _, val_loss_hist = train_model(
            model, train_loader, val_loader,
            epochs, CONFIG['learning_rate'], CONFIG['weight_decay']
        )

        val_losses.append(val_loss_hist[-1])
        print(f"Fold {i+1} validation loss: {val_loss_hist[-1]:.6f}")

    avg_val_loss = np.mean(val_losses)
    print(f"\nAverage validation loss: {avg_val_loss:.6f}")

    return avg_val_loss

# ==================== Main ====================
def main():
    """Entry point."""

    print("="*60)
    print("Kaggle house price prediction - PyTorch implementation")
    print("="*60)

    # 1. Load the data
    train_data, test_data = load_data()

    # 2. Preprocess
    X_train, y_train, X_test, test_original = preprocess_data(train_data, test_data)
    input_dim = X_train.shape[1]

    # 3. K-fold cross-validation
    avg_val_loss = k_fold_validation(
        X_train, y_train,
        k=CONFIG['k_folds'],
        epochs=CONFIG['num_epochs'],
        batch_size=CONFIG['batch_size']
    )

    # 4. Train the final model on all training data (no validation loader)
    print("\nTraining the final model...")
    full_dataset = TensorDataset(X_train, y_train)
    train_loader = DataLoader(full_dataset, batch_size=CONFIG['batch_size'], shuffle=True)

    final_model = HousePriceNet(input_dim).to(device)
    train_losses, _ = train_model(
        final_model, train_loader, None,
        CONFIG['num_epochs'], CONFIG['learning_rate'], CONFIG['weight_decay']
    )

    # 5. Generate predictions
    print("\nGenerating test-set predictions...")
    final_model.eval()

    with torch.no_grad():
        predictions = final_model(X_test)

    predicted_prices = np.expm1(predictions.cpu().numpy())

    submission = pd.DataFrame({
        'Id': test_original['Id'],
        'SalePrice': predicted_prices.reshape(-1)
    })

    submission.to_csv('submission.csv', index=False)
    print(f"Submission saved! {len(submission)} predictions")
    print("\nFirst 5 predictions:")
    print(submission.head())

    print("\n" + "="*60)
    print("Done!")
    print("="*60)

if __name__ == "__main__":
    main()

8. Summary and Ideas for Improvement

Key Takeaways

  1. Preprocessing is critical: standardization, missing-value handling, one-hot encoding
  2. Log transform: brings the price distribution closer to normal and matches the competition metric
  3. K-fold cross-validation: a reliable estimate of generalization ability
  4. Regularization: Dropout, weight decay, and batch norm curb overfitting
  5. Adam optimizer: adaptive learning rates make training more stable

Possible Improvements

  1. Feature engineering

    • Create new features (house age, total area, and so on)
    • Feature selection to drop redundant features
    • Polynomial feature combinations
  2. Model tuning

    • Try different architectures (deeper or shallower)
    • Ensemble multiple models (bagging, stacking)
    • Blend gradient-boosted trees (XGBoost, LightGBM) with the neural network
  3. Hyperparameter search

    • Bayesian optimization or grid search
    • Automated tuning tools (Optuna, Ray Tune)
  4. Data augmentation

    • Lightly perturb the training data
    • Synthesize samples for under-represented cases
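As a small sketch of the first suggestion, here are two derived features that are easy to try. The column names come from the competition's data dictionary; verify them against your CSV before relying on this:

```python
import pandas as pd

def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add house age and total floor area; assumes Ames-style column names."""
    out = df.copy()
    out['HouseAge'] = out['YrSold'] - out['YearBuilt']
    out['TotalSF'] = out['TotalBsmtSF'] + out['1stFlrSF'] + out['2ndFlrSF']
    return out

# Toy rows with made-up values
toy = pd.DataFrame({
    'YrSold': [2008, 2010], 'YearBuilt': [1990, 2005],
    'TotalBsmtSF': [800, 1000], '1stFlrSF': [900, 1100], '2ndFlrSF': [0, 600],
})
print(add_derived_features(toy)[['HouseAge', 'TotalSF']])
```

Derived features like these would be added before the standardization step, so they get scaled along with everything else.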

Next Steps

  • Analyze the worst predictions to find the model's weak spots
  • Iterate on features and the model

Remember: a Kaggle competition is not just a contest of technique but a chance to learn and grow. Good luck!