In this blog post, we'll walk through training a BERT-based model to predict an anime's genres from its synopsis and metadata. This is a multi-label classification task, meaning a single anime can belong to several genres at once.
We'll use the HuggingFace Transformers library, with an RTX 4090 GPU from nicegpu.com to cut training time substantially.
We use the MyAnimeList dataset from Kaggle, which includes each anime's name, synopsis, producers, genres, categories, and more.
Download the CSV from Kaggle and load it:
```python
import pandas as pd

pre_merged_anime = pd.read_csv('anime-filtered.csv')
print(pre_merged_anime.shape)
```
Next, we clean the text in the synopsis field and generate a formatted description that packs in extra context.
```python
import re
import string

def clean_txt(text):
    # Drop non-printable characters, then collapse runs of whitespace
    text = ''.join(filter(lambda x: x in string.printable, text))
    return re.sub(r'\s{2,}', ' ', text).strip()

def get_anime_description(row):
    type_str = "TV Show" if row["Type"] == "TV" else row["Type"]
    # Note: 'sypnopsis' is the (misspelled) column name shipped with the dataset
    description = (
        f"{row['Name']} is a {type_str}. "
        f"Synopsis: {row['sypnopsis']} "
        f"Produced by: {row['Producers']} from {row['Studios']} Studio. "
        f"Source: {row['Source']}. "
        f"Premiered in: {row['Premiered']}."
    )
    return clean_txt(description)

pre_merged_anime['generated_description'] = pre_merged_anime.apply(get_anime_description, axis=1)
```
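As a quick sanity check, here's how `clean_txt` behaves on messy input. The helper is repeated so the snippet runs standalone, and the sample string is made up:

```python
import re
import string

# Repeated here so the demo runs on its own
def clean_txt(text):
    text = ''.join(filter(lambda x: x in string.printable, text))
    return re.sub(r'\s{2,}', ' ', text).strip()

# Zero-width space and doubled spaces are both removed
raw = "Naruto\u200b  is  a  ninja   anime.  "
print(clean_txt(raw))  # Naruto is a ninja anime.
```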
```python
from functools import reduce

# Split each comma-separated genre string and flatten into one list
all_genres = reduce(lambda y, z: y + z, pre_merged_anime['Genres'].map(lambda x: x.split(', ')))
unique_labels = sorted(set(all_genres))
id2label = {idx: label for idx, label in enumerate(unique_labels)}
label2id = {label: idx for idx, label in enumerate(unique_labels)}
```
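To make the label setup concrete, here is a minimal sketch of the multi-hot encoding the model will be trained on, using a hypothetical three-genre label space:

```python
# Hypothetical mini label space, for illustration only
unique_labels = ["Action", "Comedy", "Drama"]

def to_multi_hot(genre_str):
    # One float per known genre: 1.0 if it appears in the comma-separated string
    genres = genre_str.split(', ')
    return [1.0 if label in genres else 0.0 for label in unique_labels]

print(to_multi_hot("Action, Drama"))  # [1.0, 0.0, 1.0]
```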
```python
from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def process_data(examples, text_col):
    # Build one multi-hot float vector per example. Labels stay as plain
    # Python lists: datasets.map serializes results to Arrow, so moving
    # tensors to the GPU here would be wasted work.
    texts = examples[text_col]
    labels = []
    for genre in examples['Genres']:
        g = genre.split(', ')
        labels.append([1.0 if label in g else 0.0 for label in unique_labels])
    encoding = tokenizer(texts, truncation=True, max_length=256, padding='max_length')
    encoding["labels"] = labels
    return encoding
```
```python
from datasets import Dataset

dataset = Dataset.from_pandas(pre_merged_anime[['sypnopsis', 'Genres', 'generated_description']])
dataset = dataset.train_test_split(test_size=0.2, seed=42)
encoded_dataset = dataset.map(
    lambda x: process_data(x, 'generated_description'),
    batched=True,
    batch_size=128,
    remove_columns=['sypnopsis', 'Genres', 'generated_description']
)
```
```python
from transformers import AutoModelForSequenceClassification

# problem_type='multi_label_classification' makes the Trainer use
# BCEWithLogitsLoss (an independent sigmoid per label) instead of softmax
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    problem_type='multi_label_classification',
    num_labels=len(unique_labels),
    id2label=id2label,
    label2id=label2id
).to(device)
```
```python
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score, jaccard_score
from transformers import EvalPrediction

def multi_label_metrics(predictions, labels, threshold=0.5):
    probs = torch.sigmoid(torch.Tensor(predictions))
    y_pred = (probs >= threshold).int().numpy()
    y_true = labels
    return {
        'f1': f1_score(y_true, y_pred, average='micro'),
        # ROC AUC should score the probabilities, not the thresholded predictions
        'roc_auc': roc_auc_score(y_true, probs.numpy(), average='micro'),
        'accuracy': accuracy_score(y_true, y_pred),
        'jaccard': jaccard_score(y_true, y_pred, average='micro')
    }

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    return multi_label_metrics(preds, p.label_ids)
```
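A toy example (made-up labels) shows how the micro-averaged metrics differ from subset accuracy, which demands an exact match across all labels:

```python
import numpy as np
from sklearn.metrics import f1_score, accuracy_score, jaccard_score

# Two examples, three labels; one true label is missed in row 0
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

micro_f1 = f1_score(y_true, y_pred, average='micro')  # TP=2, FP=0, FN=1 -> 0.8
subset_acc = accuracy_score(y_true, y_pred)           # only row 1 matches exactly -> 0.5
jac = jaccard_score(y_true, y_pred, average='micro')  # 2 / (2 + 0 + 1) ~ 0.667
print(micro_f1, subset_acc, jac)
```

This is why the "accuracy" row in the results table below looks so much lower than the other scores: one wrong genre out of forty-odd labels zeroes out that whole example.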
```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='genre-prediction-bert',
    evaluation_strategy='epoch',
    save_strategy='epoch',
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,  # effective batch size of 4 * 16 = 64
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    logging_steps=50,
    load_best_model_at_end=True,
    remove_unused_columns=False
)
```
```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()
trainer.save_model()
```
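At inference time, predictions are read off per label: apply a sigmoid to each logit independently and keep every genre above the threshold. A standalone sketch with hypothetical logits and genres (real code would pass tokenized text through the trained model to get the logits):

```python
import math

# Hypothetical 3-genre head and logits, for illustration only
id2label = {0: "Action", 1: "Comedy", 2: "Drama"}
logits = [2.0, -1.5, 0.3]

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

probs = [sigmoid(x) for x in logits]
# Unlike softmax, each genre is an independent yes/no decision,
# so any number of genres (including zero) can be predicted
predicted = [id2label[i] for i, p in enumerate(probs) if p >= 0.5]
print(predicted)  # ['Action', 'Drama']
```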
| Metric | Score |
|---|---|
| F1 score | 0.65 |
| ROC AUC | 0.79 |
| Accuracy | 0.25 |
| Jaccard | 0.49 |
Training this model on a CPU is painfully slow, but the RTX 4090 on nicegpu.com cuts training time from hours down to tens of minutes.
(Benchmark chart: training time, RTX 4090 on NiceGPU vs. CPU)
💡 Ready to speed up your NLP workloads? Try nicegpu.com and run BERT and other large models with ease!