Python: How to change the dataset when fine-tuning a Whisper model

Posted on March 4

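I am trying to fine-tune a Whisper model by following an article. If you want to see the code, please look at the Colab link.

All I want to do is replace the Common Voice dataset used in the article with my own dataset.

With the prepared Common Voice dataset, everything works well. The Common Voice dataset appears to use pre-cached .arrow files, so it consumes very little memory.

Because of that, it is fast and the whole process runs smoothly. With my own dataset, however, it does not work. (It consumes a large amount of memory.)

Specifically, the following line takes a very long time and never finishes.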

common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=2)

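I suspect the difference comes from the pre-cached raw data. I load my dataset with the simple code below.

My code does not create cache files for the decoded audio arrays.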

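import os
import re

import numpy as np
import soundfile as sf
from datasets import Dataset

# getDirList, getFileList and getJson are helper functions defined elsewhere (not shown).
class DataLoader_AIHub: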
    def __init__(self, rootPath):
        self.rootPath = rootPath

    def getData(self, max_files_to_load, startPoint=0):
        rootPath_audio = os.path.join(self.rootPath, 'audio')
        audioDirPaths = getDirList(rootPath_audio)

        total_files_loaded = 0

        data_list = []

        for audioDir in audioDirPaths:
            audioFileNames = getFileList(audioDir)
            audioFilePaths = [audioDir + '/' + str(item) for item in audioFileNames]
            labelFilePaths = [item.replace('/audio/','/label/').replace('.wav','.json') for item in audioFilePaths]
        
            for audioPath, labelPath in zip(audioFilePaths, labelFilePaths):
                jsonInfo = getJson(labelPath)

                if '(' in jsonInfo['발화정보']['stt']:
                    continue

                if startPoint > total_files_loaded:
                    total_files_loaded += 1
                    continue

                audio, sr = sf.read(audioPath)
                audioArray = audio.astype(np.float32)

                # Named `record` to avoid shadowing the built-in `dict`.
                record = {
                    'audio': {
                        'path': audioPath,
                        'array': audioArray,
                        'sampling_rate': sr
                    },
                    'sentence': re.sub('\r\n', '', jsonInfo['발화정보']['stt']),
                    'age': jsonInfo['녹음자정보']['age'],
                    'gender': jsonInfo['녹음자정보']['gender']
                }

                data_list.append(record)

                total_files_loaded += 1

                if total_files_loaded >= max_files_to_load + startPoint: 
                    return Dataset.from_list(data_list)
                 
        return Dataset.from_list(data_list)

(This is a Korean dataset.)

The audio files (.wav) have a sampling rate of 16 kHz, and audioArray is the decoded sample array. I assume the .arrow files store these decoded arrays.
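To get a feel for why holding every decoded array in a Python list is expensive, here is a rough back-of-the-envelope estimate. It assumes mono audio at 16 kHz stored as float32 (4 bytes per sample), as in the loader above. The corpus size (10,000 clips of 5 seconds) is purely hypothetical, and the helper name `decoded_size_bytes` is made up for this illustration.

```python
def decoded_size_bytes(duration_seconds, sampling_rate=16_000, bytes_per_sample=4):
    """Size of one decoded mono clip held as a raw sample array."""
    return int(duration_seconds * sampling_rate) * bytes_per_sample


# Hypothetical corpus: 10,000 clips of 5 seconds each.
total = 10_000 * decoded_size_bytes(5)
print(f"{total / 1e9:.1f} GB")  # 3.2 GB
```

This counts only the raw samples. As far as I know, `sf.read` returns float64 by default, so each clip briefly takes twice that much before the `astype(np.float32)` copy, and converting the list into a dataset adds further copies on top.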

Am I doing something wrong?
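One direction that might help, sketched under assumptions rather than tested on this dataset: store only the file paths and labels, and let the `datasets` library decode the audio lazily. The sketch assumes the same `audio/` and `label/` folder layout and JSON keys as the loader above. The names `iter_records` and `build_dataset` are hypothetical. To my understanding, `Dataset.from_generator` writes its rows to Arrow cache files on disk, and casting a column of path strings to `Audio` defers decoding until a row is accessed.

```python
import json
import os


def iter_records(root_path, max_files=None, start=0):
    """Yield one lightweight record per usable .wav file, without decoding audio."""
    audio_root = os.path.join(root_path, "audio")
    seen = 0  # number of usable files encountered so far
    for dir_path, dir_names, file_names in os.walk(audio_root):
        dir_names.sort()  # deterministic traversal order
        rel_dir = os.path.relpath(dir_path, audio_root)
        for name in sorted(file_names):
            if not name.endswith(".wav"):
                continue
            label_path = os.path.join(root_path, "label", rel_dir, name[:-4] + ".json")
            with open(label_path, encoding="utf-8") as f:
                info = json.load(f)
            sentence = info["발화정보"]["stt"]
            if "(" in sentence:
                continue  # same filter as the original loader
            seen += 1
            if seen <= start:
                continue  # skip the first `start` usable files
            yield {
                "audio": os.path.join(dir_path, name),  # path only, no samples
                "sentence": sentence.replace("\r\n", ""),
                "age": info["녹음자정보"]["age"],
                "gender": info["녹음자정보"]["gender"],
            }
            if max_files is not None and seen >= start + max_files:
                return


def build_dataset(root_path, max_files=None, start=0):
    # Imported here so the generator above stays usable without the library.
    from datasets import Audio, Dataset

    ds = Dataset.from_generator(
        iter_records,
        gen_kwargs={"root_path": root_path, "max_files": max_files, "start": start},
    )
    # Path strings become an Audio column; decoding happens on access.
    return ds.cast_column("audio", Audio(sampling_rate=16000))
```

With a dataset built this way, `prepare_dataset` would still receive `batch["audio"]["array"]` and `batch["audio"]["sampling_rate"]` as in the article, but only the rows currently being mapped would need to be decoded in memory.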
