I am using a pretrained LLM to generate a representative embedding for input text, but the output embedding is identical no matter what the input text is.
Code:
from transformers import AutoTokenizer, AutoModel
import numpy as np
import torch

PRETRAIN_MODEL = 'mistralai/Mistral-7B-Instruct-v0.2'
tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)

def generate_embedding(document):
    inputs = tokenizer(document, return_tensors='pt')
    print("Tokenized inputs:", inputs)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the hidden state of the first token as the document embedding
    embedding = outputs.last_hidden_state[0, 0, :].numpy()
    print("Generated embedding:", embedding)
    return embedding

text1 = "this is a test"
text2 = "this is another test"
text3 = "there are other tests"

embedding1 = generate_embedding(text1)
embedding2 = generate_embedding(text2)
embedding3 = generate_embedding(text3)

are_equal = np.array_equal(embedding1, embedding2) and np.array_equal(embedding2, embedding3)
if are_equal:
    print("The embeddings are the same.")
else:
    print("The embeddings are not the same.")
The printed tokens differ, but the printed embedding is the same every time. Output:
Tokenized inputs: {'input_ids': tensor([[ 1, 456, 349, 264, 1369]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
Generated embedding: [-1.7762679 1.9293272 -2.2413437 ... 2.6379988 -3.104867 4.806004 ]
Tokenized inputs: {'input_ids': tensor([[ 1, 456, 349, 1698, 1369]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
Generated embedding: [-1.7762679 1.9293272 -2.2413437 ... 2.6379988 -3.104867 4.806004 ]
Tokenized inputs: {'input_ids': tensor([[ 1, 736, 460, 799, 8079]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
Generated embedding: [-1.7762679 1.9293272 -2.2413437 ... 2.6379988 -3.104867 4.806004 ]
The embeddings are the same.
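For reference, here is a mean-pooling variant I have seen suggested for getting sentence embeddings out of decoder-only models (the function name is mine, and I am not sure this is the right approach for Mistral); it reuses the tokenizer and model loaded above:

import torch

def generate_mean_pooled_embedding(document):
    # Mean-pool the last hidden state over real (non-padding) tokens,
    # instead of taking only the hidden state at position 0.
    inputs = tokenizer(document, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state             # (1, seq_len, hidden_dim)
    mask = inputs['attention_mask'].unsqueeze(-1)  # (1, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)            # sum over token positions
    counts = mask.sum(dim=1)                       # number of real tokens
    return (summed / counts)[0].numpy()

Would something like this be more appropriate, or is the first-token indexing itself the problem?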
Does anyone know where the problem is? Thanks!
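P.S. To rule out the comparison itself, the exact float equality check could be replaced with cosine similarity (a quick sketch using the embedding1/embedding2/embedding3 arrays from the script above), though here the printed vectors already look bit-for-bit identical:

from numpy.linalg import norm

def cosine_similarity(a, b):
    # Cosine similarity between two 1-D vectors
    return float(np.dot(a, b) / (norm(a) * norm(b)))

print(cosine_similarity(embedding1, embedding2))
print(cosine_similarity(embedding2, embedding3))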