In the real world, data is often messy and biased. If these biases are not handled properly and in time, they can seriously undermine the effectiveness of a predictive model. The consequences become even more pronounced when large models are trained on biased datasets, and retraining such models from scratch is usually impractical. Moreover, if these models go straight into production, we must be prepared to deal with their effects.

In this article we will probe the genre bias of the GPT and GPT-2 models. I came across this interesting exercise while reading the book Natural Language Processing with Transformers (highly recommended), so I wanted to document my experience and share it. Let's get started!

We will use the pretrained GPT (openai-gpt) and GPT-2 models from the Hugging Face Hub, together with Hugging Face's text-generation pipeline, to detect bias (due to over- or under-representation) in the text that GPT and GPT-2 generate.

The two models were trained on very different datasets: GPT was trained on BooksCorpus, a collection of roughly 7,000 unpublished books, while GPT-2 was trained on WebText, a dataset of web pages linked from Reddit posts. But before comparing them, let's make sure the two models are of similar size, so that the comparison is fair.
To do so, we first install transformers and import the necessary libraries:

!pip install transformers

from transformers import pipeline, set_seed

Next, we define the names of the two models we will compare:

model_name1 = "openai-gpt"
model_name2 = "gpt2"

Then we set up a text-generation pipeline for each model:

text_generation_gpt = pipeline("text-generation", model=model_name1)
text_generation_gpt2 = pipeline("text-generation", model=model_name2)
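Text generation samples tokens stochastically, so every run produces different completions. Since set_seed is already imported above, we can fix the random seed to make the runs reproducible (the value 42 is arbitrary):

# Fix the seed so repeated runs sample the same completions
set_seed(42)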
Now we define a small helper function to count the number of parameters in each model:

def model_size(model):
    # Sum the element counts of all parameter tensors in the model
    return sum(params.numel() for params in model.parameters())
And print the parameter counts of GPT and GPT-2:

print(f"Number of Parameters in GPT: {model_size(text_generation_gpt.model)/1000**2:.1f}M parameters")
print(f"Number of Parameters in GPT-2: {model_size(text_generation_gpt2.model)/1000**2:.1f}M parameters")

Output:

Number of Parameters in GPT: 116.5M parameters
Number of Parameters in GPT-2: 124.4M parameters

So the two models are similarly sized variants.
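As a cross-check, Hugging Face models also expose a built-in num_parameters() method on PreTrainedModel, which should agree with our helper:

# Built-in parameter count from transformers
print(f"GPT: {text_generation_gpt.model.num_parameters()/1000**2:.1f}M")
print(f"GPT-2: {text_generation_gpt2.model.num_parameters()/1000**2:.1f}M")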
Now we define a function that generates completions with each model and returns them as a numbered list:

def enum_pipeline_outputs(pipe, prompt, num_return_sequences):
    # Generate several completions and join them into a numbered list
    out = pipe(prompt, num_return_sequences=num_return_sequences, clean_up_tokenization_spaces=True)
    return "\n".join(f"{i+1}. " + s["generated_text"] for i, s in enumerate(out))
We will use a single prompt and generate four completions from each model, so that we can compare the generated text:

prompt = "Before they left for the supermarket"

I) Generate four output completions with GPT:

print("Text Generated by GPT for the given prompt:\n" + enum_pipeline_outputs(text_generation_gpt, prompt, 4))
Output text from the GPT model:

1. Before they left for the supermarket. as she was preparing a pot of coffee the telephone rang. she put it to her ear. " hi, it's me. " " you've got a visitor. we got the new computer i'm
2. Before they left for the supermarket. " but since he was still holding her captive, and he hadn't released her yet, she didn't understand why he felt the need to keep all her plans a secret from her. he let go of the
3. Before they left for the supermarket. " i was shocked. " he's... he's not in love with you. " " he never was. he never will be again. it's over and over. this is the end for both
4. Before they left for the supermarket. i've already eaten breakfast now and i think i 'll put in a few hours in the gym this morning just to give myself time to go to the bathroom and clean up and get the better of it, but i
II) Generate four output completions with GPT-2:

print("Text Generated by GPT-2 for the given prompt:\n" + enum_pipeline_outputs(text_generation_gpt2, prompt, 4))
Output text from the GPT-2 model:

1. Before they left for the supermarket, the family returned to the warehouse to check on them. According to the police, there were three suspicious items on the shelves and an object that looked like a toy or a piece of glass.
2. Before they left for the supermarket, Gai said that when he first came up in this world, it was like, “I don’t know, the world is coming to me, but it’s not coming from the home.” That made me feel more alive
3. Before they left for the supermarket, he opened the door and opened the door a little deeper. When they stopped, he said, they made a couple of attempts to get away – and I said my name just so I could hear them – then one
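The samples already hint at the genre skew: GPT, trained on unpublished books, drifts into lowercase novel-style dialogue, while GPT-2, trained on Reddit-linked web pages, reads more like news or forum prose. To go beyond eyeballing four samples, one rough way to quantify this is to generate many completions per model and count a simple fiction marker such as quoted dialogue. The sketch below is purely illustrative; fiction_marker_rate, the sample count, and the marker choice are my own assumptions, not part of the original exercise:

def fiction_marker_rate(pipe, prompt, n=20):
    # Generate n sampled completions and measure the fraction that contain
    # a quotation mark, a crude proxy for novel-style dialogue
    outs = pipe(prompt, num_return_sequences=n, max_length=50, do_sample=True)
    return sum('"' in o["generated_text"] for o in outs) / n

print(f"GPT   dialogue rate: {fiction_marker_rate(text_generation_gpt, prompt):.2f}")
print(f"GPT-2 dialogue rate: {fiction_marker_rate(text_generation_gpt2, prompt):.2f}")

A single marker over one prompt is a very coarse signal, of course; a more serious probe would use many prompts and several stylistic markers.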