
HuggingFace Pretraining and Fine-tuning

This article shows how to take a pretrained model from Hugging Face and fine-tune it on a downstream dataset to perform sentiment classification on text data.

About Hugging Face

Fine-tuning

For many different task types, Hugging Face provides a large number of pretrained models. These models are generally trained on very large datasets at considerable cost in compute and effort; we only need to download the pretrained model we want and fine-tune it on our own task.

Downloading the pretrained model and tokenizer

Go to the Hugging Face homepage, select Models, and find the BERT pretrained model bert-base-uncased. Click into it and open Files and versions; among the many files there, the ones we need to download are config.json, pytorch_model.bin, and vocab.txt. If you use TensorFlow instead, simply download the corresponding TF version of the model; the walkthrough below uses PyTorch, but the idea is the same either way. (Note that bert-base-uncased is an English checkpoint; for the Chinese Weibo data used below, a Chinese checkpoint such as bert-base-chinese is usually the better starting point, though the workflow is identical.)
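
If the machine has internet access, you can also skip the manual download and let Transformers fetch and cache the files automatically. A minimal sketch (num_labels=4 is an assumption matching the four-class task below):

from transformers import BertTokenizer, BertForSequenceClassification

# Downloads and caches config.json, vocab.txt and the weights on first use.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)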

Once the files above are available, we can load the tokenizer and the pretrained model.

from transformers import BertTokenizer, BertConfig, BertForSequenceClassification

# 1. Load the tokenizer; vocab_path is the folder that contains vocab.txt
tokenizer = BertTokenizer.from_pretrained(vocab_path)

# 2. Load the model configuration, then load the model.
# The head must match the downstream task; here we use a classification task as the example.
config = BertConfig.from_json_file(config_path)  # load the BERT model configuration
config.num_labels = n_class                      # number of outputs of the classifier
model = BertForSequenceClassification.from_pretrained(pretrain_Model_path, config=config)  # load the BERT classification model

With that, the pretrained model is loaded.

Downloading the Weibo dataset

Our downstream task is sentiment classification of Weibo posts. The dataset contains over 360,000 samples covering 4 emotions: roughly 200,000 labeled joy, and about 50,000 each labeled anger, disgust, and depression.

The dataset is available at:

ChineseNlpCorpus/intro.ipynb at master · SophonPlus/ChineseNlpCorpus · GitHub
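
The CSV (simplifyweibo_4_moods.csv) has two columns, label and review, which is exactly what the loading code below relies on. Schematically (the label-to-emotion mapping 0/1/2/3 = joy/anger/disgust/depression follows the corpus documentation; the rows here are placeholders, not actual entries):

label,review
0,<a Weibo post labelled joy>
1,<a Weibo post labelled anger>
2,<a Weibo post labelled disgust>
3,<a Weibo post labelled depression>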

Load the data with pandas:

import pandas as pd

def load_data(path):
    df = pd.read_csv(path)
    text_list = df['review'].to_list()
    labels = df['label'].to_list()
    return text_list, labels

text_list, labels = load_data(train_path)  # load the training data; train_path is the path to the CSV file

Processing the dataset

Processing the dataset involves three steps: splitting it into train/validation sets, tokenizing, and building a Dataset.

import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset

max_len = 100  # maximum sequence length

def split_train_val(data, labels):
    train_x, val_x, train_y, val_y = train_test_split(data,
                                                       labels,
                                                       test_size=0.2,
                                                       random_state=0)
    return train_x, val_x, train_y, val_y

def encode_fn(text_list):
    # Turn text_list into the input form the BERT model expects
    # text_list: e.g. ['我爱你', '猫不是狗']
    encoded = tokenizer(
        text_list,
        padding=True,
        truncation=True,
        max_length=max_len,
        return_tensors='pt'  # return PyTorch tensors
    )
    input_ids = encoded['input_ids']
    token_type_ids = encoded['token_type_ids']
    attention_mask = encoded['attention_mask']
    return input_ids, token_type_ids, attention_mask

def process_data(text_list, labels):
    input_ids, token_type_ids, attention_mask = encode_fn(text_list)
    labels = torch.tensor(labels)
    data = TensorDataset(input_ids, token_type_ids, attention_mask, labels)
    return data

train_x, val_x, train_y, val_y = split_train_val(text_list, labels)
train = process_data(train_x, train_y)
validation = process_data(val_x, val_y)

We now have the pretrained model and a dataset prepared for fine-tuning, so we can start training.

Training

Two aspects of the training procedure deserve attention. The first is the optimizer: fine-tuning uses AdamW together with a warmup mechanism. The second is the flow inside a single epoch, which is the same as in an ordinary deep-learning training loop. Here we focus on the optimizer.

  • AdamW

    The difference between AdamW and Adam + L2 regularization is the following: with Adam + L2 regularization, the penalty term is written directly into the loss function, which amounts to rewriting the gradient, and this rewritten gradient then feeds into the momentum and adaptive learning-rate statistics. AdamW instead treats L2 regularization as weight decay: the original objective is left unchanged, and at the final update step the parameters are first decayed and then updated (see the sketch after this list).

  • warmup

    Warmup is a way of scheduling the learning rate: it first rises from a small value to the target learning rate and then decays (the shape is also visible in the sketch below). Why set it up this way?

    • It helps reduce the model's tendency to overfit the early mini-batches and keeps the distribution it sees more stable
    • It helps keep the deeper layers of the model stable

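The figures from the original post are not reproduced here, so below is a minimal sketch of both ideas; the learning rate, weight decay and step counts are made-up values for illustration:

import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(10, 4)  # stand-in for the BERT classifier

# AdamW: weight decay is applied to the weights directly at update time,
# decoupled from the gradient that feeds the momentum / adaptive statistics
# (Adam + L2 would instead add weight_decay * w into the gradient itself).
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

# Warmup: the learning rate rises linearly over the first num_warmup_steps,
# then decays linearly to zero over the remaining training steps.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=100,
                                            num_training_steps=1000)

for step in range(1000):
    optimizer.step()      # in real training this follows loss.backward()
    scheduler.step()
    if step in (0, 99, 500, 999):
        print(step, scheduler.get_last_lr())  # watch the lr rise and then decay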

The complete code is as follows:

import os
import pprint
import numpy as np
import pandas as pd
import random

from transformers import BertTokenizer, BertConfig, BertForSequenceClassification, AdamW, AutoTokenizer, AutoModel
from transformers import get_linear_schedule_with_warmup
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import TensorDataset, DataLoader


class MyBertModel:
    def __init__(self,
                 train,
                 vocab_path,
                 config_path,
                 pretrain_Model_path,
                 saveModel_path,
                 learning_rate,
                 n_class, epochs,
                 batch_size,
                 val_batch_size,
                 max_len,
                 gpu=True):
        self.n_class = n_class      # number of classes
        self.max_len = max_len      # maximum sentence length
        self.lr = learning_rate     # learning rate
        self.epochs = epochs

        self.tokenizer = BertTokenizer.from_pretrained(vocab_path)  # load the tokenizer
        text_list, labels = self.load_data(train)                   # load the training data
        train_x, val_x, train_y, val_y = self.split_train_val(text_list, labels)
        self.train = self.process_data(train_x, train_y)
        self.validation = self.process_data(val_x, val_y)
        self.batch_size = batch_size          # training batch size
        self.val_batch_size = val_batch_size

        self.saveModel_path = saveModel_path  # where to save the model
        self.gpu = gpu                        # whether to use the GPU

        config = BertConfig.from_json_file(config_path)  # load the BERT model configuration
        config.num_labels = n_class                      # number of outputs of the classifier
        self.model = BertForSequenceClassification.from_pretrained(pretrain_Model_path, config=config)  # load the BERT classification model
        print("Ready!")
        if self.gpu:
            seed = 42
            random.seed(seed)
            np.random.seed(seed)
            torch.manual_seed(seed)
            torch.cuda.manual_seed_all(seed)
            torch.backends.cudnn.deterministic = True
            self.device = torch.device('cuda')
        else:
            self.device = 'cpu'

    def encode_fn(self, text_list):
        # Turn text_list into the input form the BERT model expects
        # text_list: e.g. ['我爱你', '猫不是狗']
        encoded = self.tokenizer(
            text_list,
            padding=True,
            truncation=True,
            max_length=self.max_len,
            return_tensors='pt'  # return PyTorch tensors
        )
        input_ids = encoded['input_ids']
        token_type_ids = encoded['token_type_ids']
        attention_mask = encoded['attention_mask']
        return input_ids, token_type_ids, attention_mask

    def load_data(self, path):
        df = pd.read_csv(path)
        text_list = df['review'].to_list()
        labels = df['label'].to_list()
        return text_list, labels

    def process_data(self, text_list, labels):
        input_ids, token_type_ids, attention_mask = self.encode_fn(text_list)
        labels = torch.tensor(labels)
        data = TensorDataset(input_ids, token_type_ids, attention_mask, labels)
        return data

    def split_train_val(self, data, labels):
        train_x, val_x, train_y, val_y = train_test_split(data,
                                                           labels,
                                                           test_size=0.2,
                                                           random_state=0)
        return train_x, val_x, train_y, val_y

    def flat_accuracy(self, preds, labels):
        """A function for calculating accuracy scores"""
        pred_flat = np.argmax(preds, axis=1).flatten()
        labels_flat = labels.flatten()
        return accuracy_score(labels_flat, pred_flat)

    def train_model(self):
        # Train the model
        if self.gpu:
            self.model.cuda()
        optimizer = torch.optim.AdamW(self.model.parameters(), lr=self.lr)
        trainData = DataLoader(self.train, batch_size=self.batch_size, shuffle=True)  # split into batches
        valData = DataLoader(self.validation, batch_size=self.val_batch_size, shuffle=True)

        total_steps = len(trainData) * self.epochs
        scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

        for epoch in range(self.epochs):
            self.model.train()
            total_loss, total_val_loss = 0, 0
            total_eval_accuracy = 0
            print('epoch:', epoch, ', step_number:', len(trainData))
            # training
            for step, batch in enumerate(trainData):
                optimizer.zero_grad()
                outputs = self.model(input_ids=batch[0].to(self.device),
                                     token_type_ids=batch[1].to(self.device),
                                     attention_mask=batch[2].to(self.device),
                                     labels=batch[3].to(self.device)
                                     )  # returns the loss and the logits for each class; softmax on the logits gives the class probabilities
                loss, logits = outputs.loss, outputs.logits
                total_loss += loss.item()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
                optimizer.step()
                scheduler.step()
                if step % 10 == 0 and step > 0:  # report training accuracy every 10 steps; flat_accuracy() takes the argmax of the logits
                    logits = logits.detach().cpu().numpy()
                    label_ids = batch[3].cpu().numpy()
                    avg_val_accuracy = self.flat_accuracy(logits, label_ids)
                    print('step:', step)
                    print(f'Accuracy: {avg_val_accuracy:.4f}')
                    print('\n')
            # at the end of each epoch, evaluate the model on the validation set
            self.model.eval()
            print('testing ....')
            for i, batch in enumerate(valData):
                with torch.no_grad():
                    outputs = self.model(input_ids=batch[0].to(self.device),
                                         token_type_ids=batch[1].to(self.device),
                                         attention_mask=batch[2].to(self.device),
                                         labels=batch[3].to(self.device)
                                         )
                    loss, logits = outputs.loss, outputs.logits
                    total_val_loss += loss.item()

                    logits = logits.detach().cpu().numpy()
                    label_ids = batch[3].cpu().numpy()
                    total_eval_accuracy += self.flat_accuracy(logits, label_ids)

            avg_train_loss = total_loss / len(trainData)
            avg_val_loss = total_val_loss / len(valData)
            avg_val_accuracy = total_eval_accuracy / len(valData)

            print(f'Train loss : {avg_train_loss}')
            print(f'Validation loss: {avg_val_loss}')
            print(f'Accuracy: {avg_val_accuracy:.4f}')
            print('\n')
            self.save_model(self.saveModel_path + '-' + str(epoch))

    def save_model(self, path):
        # Save the tokenizer and the classification model
        self.model.save_pretrained(path)
        self.tokenizer.save_pretrained(path)

    def load_model(self, path):
        # Load the tokenizer and the classification model
        tokenizer = AutoTokenizer.from_pretrained(path)
        model = BertForSequenceClassification.from_pretrained(path)
        return tokenizer, model

    def eval_model(self, Tokenizer, model, text_list, y_true):
        # Print the model's precision, recall and f1-score
        preds = self.predict_batch(Tokenizer, model, text_list)
        print(classification_report(y_true, preds))

    def predict_batch(self, Tokenizer, model, text_list):
        encoded = Tokenizer(
            text_list,
            padding=True,
            truncation=True,
            max_length=self.max_len,
            return_tensors='pt'  # return PyTorch tensors
        )
        input_ids = encoded['input_ids']
        token_type_ids = encoded['token_type_ids']
        attention_mask = encoded['attention_mask']
        pred_data = TensorDataset(input_ids, token_type_ids, attention_mask)
        pred_dataloader = DataLoader(pred_data, batch_size=self.batch_size, shuffle=False)
        model = model.to(self.device)
        model.eval()
        preds = []
        for i, batch in enumerate(pred_dataloader):
            with torch.no_grad():
                outputs = model(input_ids=batch[0].to(self.device),
                                token_type_ids=batch[1].to(self.device),
                                attention_mask=batch[2].to(self.device)
                                )
                logits = outputs[0]
                logits = logits.detach().cpu().numpy()
                preds += list(np.argmax(logits, axis=1))
        return preds


if __name__ == '__main__':
    epoch = 3
    pretrained_path = "./pretrained/bert-base-uncased"
    dataset_path = "./datasets"
    save_path = "./results"
    train_path = os.path.join(dataset_path, "simplifyweibo_4_moods/simplifyweibo_4_moods.csv")
    save_model_path = os.path.join(save_path)
    bert_tokenizer_path = pretrained_path
    bert_config_path = os.path.join(pretrained_path, "config.json")
    bert_model_path = os.path.join(pretrained_path, "model")
    model_name = "bert_weibo"
    myBertModel = MyBertModel(
        train=train_path,
        vocab_path=bert_tokenizer_path,
        config_path=bert_config_path,
        pretrain_Model_path=bert_model_path,
        saveModel_path=os.path.join(save_model_path, model_name),
        learning_rate=2e-5,
        n_class=4,
        epochs=epoch,
        batch_size=4,
        val_batch_size=4,
        max_len=100,
        gpu=True
    )
    myBertModel.train_model()
    Tokenizer, model = myBertModel.load_model(myBertModel.saveModel_path + '-' + str(epoch - 1))
    # text_list, y_true = myBertModel.load_data_predict('xxx.csv')
    # myBertModel.eval_model(Tokenizer, model, text_list, y_true)

Pretraining

Although the whole point of Hugging Face is to share pretrained models and thereby reduce training time and energy consumption, sometimes we need to pretrain a model ourselves so that it fits our downstream task better.

Training your own BERT model requires three things: a corpus (the data), a tokenizer, and a model.

Preparing the dataset

As for the corpus used to train a BERT model, most common large-scale datasets can be downloaded and loaded directly through the datasets library. If you want to use your own corpus, the usual format is one sample per line, with a separator between samples; a loading sketch follows below.
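
A small sketch of both options (the wikitext corpus name is just an example of a public dataset on the Hub, and the local path reuses the corpus file that appears later in this section):

from datasets import load_dataset

# Option 1: download and load a public corpus directly from the Hub.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
print(dataset["train"][0])

# Option 2: load a local corpus that has one sample per line.
my_corpus = load_dataset("text", data_files={"train": "../excel2txt.txt"})
print(my_corpus["train"][0])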

Building your own tokenizer

Besides splitting the text into tokens, the tokenizer also maps each token to its input_id and adds sentence separators, mask tokens, and so on. Normally, when loading a model, we simply load the matching tokenizer:

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
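
As a quick illustration of the mapping described above (the sentence is arbitrary, and the exact ids depend on the checkpoint):

enc = tokenizer("a cat is not a dog", return_tensors='pt')
print(enc['input_ids'])        # token ids, wrapped in the checkpoint's special tokens
print(enc['attention_mask'])   # 1 for real tokens, 0 for padding
print(tokenizer.convert_ids_to_tokens(enc['input_ids'][0]))  # ids mapped back to tokens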

Sometimes, however, we want to build a tokenizer of our own. This is also very simple: the input is a corpus file and the output is a vocabulary.

import tokenizers
from transformers import BertTokenizer

# create the tokenizer
bwpt = tokenizers.BertWordPieceTokenizer()
filepath = "../excel2txt.txt"  # the corpus file

# train the tokenizer
bwpt.train(
    files=[filepath],
    vocab_size=50000,   # the preset vocabulary size is not critical here
    min_frequency=1,
    limit_alphabet=1000
)
# save the trained vocabulary
bwpt.save_model('./pretrained_models/')
# output: ['./pretrained_models/vocab.txt']

# load the tokenizer we just trained
tokenizer = BertTokenizer(vocab_file='./pretrained_models/vocab.txt')

Loading the model

Loading the model here works just like in the fine-tuning part earlier: you need the corresponding model configuration and weight files. If we use a tokenizer of our own, we additionally have to resize the model's embedding matrix. This is done as follows:

from transformers import (
    CONFIG_MAPPING, MODEL_FOR_MASKED_LM_MAPPING, AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer, DataCollatorForLanguageModeling, HfArgumentParser, Trainer, TrainingArguments, set_seed,
)

# override part of the configuration ourselves
config_kwargs = {
    "cache_dir": None,
    "revision": 'main',
    "use_auth_token": None,
    # "hidden_size": 512,
    # "num_attention_heads": 4,
    "hidden_dropout_prob": 0.2,
    # "vocab_size": 863  # set the vocabulary size yourself
}
# load the model configuration
config = AutoConfig.from_pretrained('./tmp/bert-base-case/', **config_kwargs)
# load the pretrained model
model = AutoModelForMaskedLM.from_pretrained(
    './tmp/bert-base-case/',
    from_tf=False,  # set to True to load TensorFlow (.ckpt) weights
    config=config,
    cache_dir=None,
    revision='main',
    use_auth_token=None,
)
model.resize_token_embeddings(len(tokenizer))
# output: Embedding(863, 768, padding_idx=1)

Training

With all the preparation above, we can finally start pretraining:

import os
import csv
from transformers import BertTokenizer, WEIGHTS_NAME, TrainingArguments
from model.modeling_nezha import NeZhaForSequenceClassification, NeZhaForMaskedLM
from model.configuration_nezha import NeZhaConfig
import tokenizers
import torch
from datasets import load_dataset, Dataset
from transformers import (
    CONFIG_MAPPING,
    MODEL_FOR_MASKED_LM_MAPPING,
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    HfArgumentParser,
    Trainer,
    TrainingArguments,
    set_seed,
    LineByLineTextDataset
)

## build our own tokenizer
bwpt = tokenizers.BertWordPieceTokenizer()
filepath = "../excel2txt.txt"  # same corpus format as in the first part of this article
bwpt.train(
    files=[filepath],
    vocab_size=50000,
    min_frequency=1,
    limit_alphabet=1000
)
bwpt.save_model('./pretrained_models/')  # produces vocab.txt

## load the tokenizer and the model
model_path = '../tmp/nezha/'
token_path = './pretrained_models/vocab.txt'
tokenizer = BertTokenizer.from_pretrained(token_path, do_lower_case=True)
config = NeZhaConfig.from_pretrained(model_path)
model = NeZhaForMaskedLM.from_pretrained(model_path, config=config)
model.resize_token_embeddings(len(tokenizer))

# load the data through the LineByLineTextDataset interface; block size is 128;
# file_path has the same corpus format as in the first part of this article
train_dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path='../tmp/all_data_txt.txt', block_size=128)
# data collator for the MLM objective
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
# training arguments
pretrain_batch_size = 64
num_train_epochs = 300
training_args = TrainingArguments(
    output_dir='./outputs/', overwrite_output_dir=True, num_train_epochs=num_train_epochs, learning_rate=6e-5,
    per_device_train_batch_size=pretrain_batch_size, save_total_limit=10)  # save_steps=10000
# train the model through the Trainer interface
trainer = Trainer(
    model=model, args=training_args, data_collator=data_collator, train_dataset=train_dataset)

# start training
trainer.train()
trainer.save_model('./outputs/')
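
Once pretraining has finished, the checkpoint saved in './outputs/' can be dropped into the fine-tuning workflow from the first part of this article. A minimal sketch, assuming the same NeZha code and tokenizer as above and the four-class Weibo task (the fresh classification head will be randomly initialized, which is expected when fine-tuning):

# load the pretrained weights for a downstream classification task
config = NeZhaConfig.from_pretrained('./outputs/', num_labels=4)
cls_model = NeZhaForSequenceClassification.from_pretrained('./outputs/', config=config)
# reuse the tokenizer trained above
tokenizer = BertTokenizer.from_pretrained('./pretrained_models/vocab.txt', do_lower_case=True)
# from here on, fine-tuning proceeds exactly as in MyBertModel.train_model()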

Summary

This article had two parts: the first showed how to take a pretrained model provided on Hugging Face and fine-tune it for a downstream task; the second showed how to train a pretrained model of your own.
