This question is a follow-up to tensorflow 2 TextVectorization process tensor and dataset error.
I want to build word embeddings for processed text with TensorFlow 2.8 in Jupyter.
import re
import string
import tensorflow as tf

def standardize(input_data):
    input_data = tf.strings.lower(input_data)
    input_data = tf.strings.regex_replace(input_data, f"[{re.escape(string.punctuation)}]", " ")
    return input_data
# the input data is loaded from text files by TFRecordDataset(file_paths, "GZIP")
# each file can be 200+ MB, about 300 files in total
# each file holds data with multiple columns
# some columns are text
# after loading, the dataset is accessed by column name
# e.g. one column is "sports", so input_dataset["sports"]
# returns a tensor like the following example
input_data = tf.constant([["SWIM 2008-07 Baseball"], ["Football"]], shape=(2, 1), dtype=tf.string)
text_layer = tf.keras.layers.TextVectorization(
    standardize=standardize, max_tokens=10,
    output_mode='int', output_sequence_length=10)
dataset = tf.data.Dataset.from_tensors( input_data )
dataset = dataset.batch(2)
text_layer.adapt(dataset)
process_text = dataset.map(text_layer)
emb_layer = tf.keras.layers.Embedding(10, 10)
emb_layer(process_text)  # error
Error:
AttributeError: Exception encountered when calling layer "embedding_7" (type Embedding).
'MapDataset' object has no attribute 'dtype'
Call arguments received:
• inputs=<MapDataset element_spec=TensorSpec(shape=(None, 2, 10), dtype=tf.int64, name=None)>
How can I convert a tf.data.Dataset to a tf.Tensor?
TensorFlow: convert tf.Dataset to tf.Tensor does not help me.
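To be clear about what I mean by "convert": a minimal sketch, assuming the data is small enough to fit in memory (the dataset and shapes here are made up for illustration):

```python
import tensorflow as tf

# A tiny dataset holding one (2, 2) int tensor, for illustration only
dataset = tf.data.Dataset.from_tensors(tf.constant([[1, 2], [3, 4]]))

# Iterating the dataset yields concrete tensors; taking the first
# element turns the Dataset back into a tf.Tensor.
tensor = next(iter(dataset))
print(tensor.shape)  # (2, 2)
```

For my real data (300 files of 200+ MB each) materializing everything this way is probably not feasible, which is why I am asking.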
The layers above will be used in a neural-network model:
loading data --> processing features (multiple text columns) --> tokens --> embedding --> average pooling --> some dense layers --> output layer
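A minimal sketch of the pipeline I have in mind, with made-up layer sizes, keeping every layer inside one Keras model so each batch of strings flows tensor-to-tensor (the Dataset itself is only mapped or fed, never passed to a layer directly):

```python
import re
import string
import tensorflow as tf
from tensorflow.keras import layers

def standardize(input_data):
    input_data = tf.strings.lower(input_data)
    return tf.strings.regex_replace(
        input_data, f"[{re.escape(string.punctuation)}]", " ")

# Vocabulary/sequence sizes are placeholders
text_layer = layers.TextVectorization(
    standardize=standardize, max_tokens=10,
    output_mode='int', output_sequence_length=10)

texts = tf.constant(["SWIM 2008-07 Baseball", "Football"])
text_layer.adapt(texts)

# text -> tokens -> embedding -> average pooling -> dense -> output
model = tf.keras.Sequential([
    text_layer,
    layers.Embedding(10, 10),
    layers.GlobalAveragePooling1D(),
    layers.Dense(4, activation='relu'),
    layers.Dense(1),
])

out = model(texts)
print(out.shape)  # (2, 1)
```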
Thanks.