Voice Datasets
AISHELL-4 is a sizable, real-recorded Mandarin speech dataset collected with an 8-channel circular microphone array for speech processing in conference scenarios. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. It aims to bridge advanced research on multi-speaker processing and practical application scenarios in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural conversational speech characteristics such as short pauses, speech overlap, quick speaker turns, and noise. Accurate transcriptions and speaker voice activity annotations are provided for each meeting. This allows researchers to explore different aspects of meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition, and speaker diarization, to multi-modality modeling and joint optimization of related tasks.
AISHELL-3 is a large-scale, high-fidelity multi-speaker Mandarin speech corpus published by Beijing Shell Shell Technology Co., Ltd. It can be used to train multi-speaker text-to-speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin speakers, totaling 88,035 utterances. Auxiliary attributes such as gender, age group, and native accent are explicitly marked and provided in the corpus, and transcripts at both the Chinese-character level and the pinyin level accompany the recordings. Through professional speech annotation and strict quality inspection of tone and prosody, the word and tone transcription accuracy is above 98%.
Aishell is an open-source Mandarin speech corpus published by Beijing Shell Shell Technology Co., Ltd.
400 people from different accent areas in China were invited to participate in the recording, which was conducted in a quiet indoor environment using a high-fidelity microphone and downsampled to 16 kHz. Through professional speech annotation and strict quality inspection, the manual transcription accuracy is above 95%. The data is free for academic use. We hope to provide a moderate amount of data for new researchers in the field of speech recognition.
This corpus was recorded in a quiet indoor environment using cellphones. It has 855 speakers, each with 120 utterances. All utterances were carefully transcribed and checked by humans, so transcription accuracy is guaranteed; if any problems are found, we will correct them. The corpus contains:
audio files
transcriptions
metadata
This free Mandarin speech corpus set is released by Shanghai Primewords Information Technology Co., Ltd.
The corpus was recorded on smartphones by 296 native Chinese speakers. The transcription accuracy is higher than 98% at a 95% confidence level. It is free for academic use.
MAGICDATA Mandarin Chinese Read Speech Corpus was developed by MAGIC DATA Technology Co., Ltd. and freely published for non-commercial use. The contents and the corresponding descriptions of the corpus include:
The corpus contains 755 hours of speech data, mostly mobile-recorded.
1,080 speakers from different accent areas in China were invited to participate in the recording.
The sentence transcription accuracy is higher than 98%.
Recordings are conducted in a quiet indoor environment.
The database is divided into training, validation, and testing sets in a ratio of 51:1:2.
Detailed information such as speech data encoding and speaker information is preserved in the metadata file.
The domain of recording texts is diversified, including interactive Q&A, music search, SNS messages, home command and control, etc.
Segmented transcripts are also provided.
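The 51:1:2 partition described above can be sketched as a simple proportional split. Note that the real corpus ships pre-split; this hypothetical helper only illustrates how the stated proportions map to item counts.

```python
# Split a list of items into train/validation/test sets in a given ratio
# (51:1:2 per the MAGICDATA description). A sketch, not the corpus's
# actual partitioning tool; real splits are usually done per speaker.
def ratio_split(items, ratio=(51, 1, 2)):
    total = sum(ratio)
    n = len(items)
    n_train = n * ratio[0] // total
    n_valid = n * ratio[1] // total
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test

train, valid, test = ratio_split(list(range(5400)))
```

With 5,400 items, the split yields 5,100 / 100 / 200 items, matching the 51:1:2 ratio exactly.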
Magic Data Technology Co., Ltd., "http://www.imagicdatatech.com/index.php/home/dataopensource/data_info/id/101", 05/2019
The AliMeeting Mandarin corpus, originally designed for the ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (M2MeT), is recorded from real meetings, including far-field speech collected by an 8-channel microphone array as well as near-field speech collected by each participant's headset microphone. The dataset contains 118.75 hours of speech data in total, divided into 104.75 hours for training (Train), 4 hours for evaluation (Eval), and 10 hours as a test set (Test), following the M2MeT challenge arrangement. Specifically, the Train, Eval, and Test sets contain 212, 8, and 20 meeting sessions respectively, and each session consists of a 15- to 30-minute discussion among 2 to 4 participants. AliMeeting covers a variety of aspects of real-world meetings, including diverse meeting rooms, varying numbers of participants, and different speaker overlap ratios. High-quality transcriptions are provided as well. The dataset can be used for rich meeting transcription tasks, including speaker diarization and multi-speaker automatic speech recognition.
HI-MIA-CW is a supplemental database to the HI-MIA wake-up database; it uses the same recording setup as HI-MIA to provide 16,434 additional audio recordings.
The recorded texts are the HI-MIA confusion words in Chinese, which serve as negative samples for the wake-up words "hi, Mia" (ni hao mi ya). The text details can be found in the paper and in the transcription file in the resources. Each audio sample was recorded in a real home environment using a high-fidelity microphone (48 kHz, 16-bit), then re-sampled to 16 kHz to build the database. It contains 35 speakers, none of whom overlap with the speakers in the earlier HI-MIA database. This dataset aims to promote advanced research on wake-up word detection: it serves as negative samples for a wake-up word detection system and helps researchers test performance on confusable words.
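The 48 kHz to 16 kHz conversion mentioned above is an integer downsampling by a factor of 3. The toy sketch below averages each group of three samples as a crude anti-aliasing step; this is an illustration only, since production pipelines use a proper polyphase low-pass filter (e.g., `scipy.signal.resample_poly`).

```python
# Downsample a 48 kHz signal to 16 kHz by a factor of 3.
# Averaging each 3-sample group is a crude low-pass + decimate;
# real resampling should use a designed anti-aliasing filter.
def downsample_3x(samples):
    out = []
    # Drop any trailing samples that don't form a full group of 3.
    for i in range(0, len(samples) - len(samples) % 3, 3):
        group = samples[i:i + 3]
        out.append(sum(group) / 3.0)
    return out

sig_48k = [float(i % 10) for i in range(48000)]  # 1 second at 48 kHz
sig_16k = downsample_3x(sig_48k)
```

One second of 48 kHz audio (48,000 samples) becomes 16,000 samples, i.e., one second at 16 kHz.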
The WenetSpeech corpus is a 10,000+ hour multi-domain transcribed Mandarin speech corpus collected from YouTube and podcasts. Optical character recognition (OCR) and automatic speech recognition (ASR) techniques are adopted to label the YouTube and podcast recordings, respectively. To improve the quality of the corpus, we use a novel end-to-end label error detection method to further validate and filter the data.
10,000+ hours of high-label data: confidence >= 0.95, for supervised training, etc.
2,400+ hours of weak-label data: 0.6 <= confidence < 0.95, for semi-supervised or noisy training, etc.
~10,000 hours of unlabeled data: confidence < 0.6, for unsupervised training, etc.
22,400+ hours of audio in total, consisting of both labeled and unlabeled data, for unsupervised training or pretraining, etc.
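The confidence tiers above amount to a simple threshold filter over per-utterance label confidences. The sketch below illustrates the partitioning rule; the `(utt_id, confidence)` pair layout is an assumption for illustration, not the corpus's actual manifest format.

```python
# Partition utterances into WenetSpeech-style tiers by label confidence.
# Thresholds follow the corpus description: >= 0.95 high-label,
# 0.6 <= c < 0.95 weak-label, < 0.6 unlabeled.
def partition_by_confidence(utterances):
    """utterances: iterable of (utt_id, confidence) pairs."""
    tiers = {"high": [], "weak": [], "unlabeled": []}
    for utt_id, conf in utterances:
        if conf >= 0.95:
            tiers["high"].append(utt_id)
        elif conf >= 0.6:
            tiers["weak"].append(utt_id)
        else:
            tiers["unlabeled"].append(utt_id)
    return tiers

sample = [("utt1", 0.99), ("utt2", 0.80), ("utt3", 0.40)]
tiers = partition_by_confidence(sample)
```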
Diversity
The high-label data of WenetSpeech can be mainly classified into 10 categories according to speaking style and spoken scenario:
drama (43.36%)
reading (11.1%)
interview (9.38%)
news (8.68%)
variety (8.27%)
documentary (4.77%)
talk (2.94%)
audiobook (2.51%)
commentary (2.48%)
The contents and the corresponding descriptions of the corpus include:
The corpus contains 180 hours of speech data, all of which is mobile-recorded.
663 speakers from different accent areas in China were invited to participate in the recording.
All speech data are manually labeled and the transcriptions are proofed by professional inspectors to ensure the labeling quality.
Recordings are conducted in a quiet indoor environment.
The database is divided into training, validation, and testing sets in a ratio of 15:1:2.
Detailed information such as speaker and topic information is preserved in the metadata file.
The topics of the dialogues are diversified, ranging from science and technology to ordinary life.
SHALCAS22A is a single-channel Mandarin speech corpus produced by the Shanghai Acoustics Laboratory, CAS, and Wuxi Sandu Intelligent Technology Co., Ltd. It was collected with a hi-fi microphone in a quiet environment. The corpus contains 14,580 utterances from 60 speakers, 243 per speaker.
SHALCAS22A, a free Chinese Mandarin corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd., 2022
There are 4 participants in the dataset, two male and two female, with ages ranging from 19 to 30 (average about 24). The dataset contains 4 hours of speech across 2,579 audio samples, with an average length of about 9 to 10 seconds.
THCHS30 is an open Chinese speech database published by Center for Speech and Language Technology (CSLT) at Tsinghua University. The original recording was conducted in 2002 by Dong Wang, supervised by Prof. Xiaoyan Zhu, at the Key State Lab of Intelligence and System, Department of Computer Science, Tsinghua University, and the original name was 'TCMSD', standing for 'Tsinghua Continuous Mandarin Speech Database'.
This is a large-scale speaker recognition dataset collected 'in the wild'. The dataset consists of two subsets, CN-Celeb1 and CN-Celeb2. All audio files are single-channel, sampled at 16 kHz with 16-bit precision. CN-Celeb1 contains more than 130,000 utterances from 1,000 Chinese celebrities, covering 11 different real-world genres; CN-Celeb2 contains more than 520,000 utterances from 2,000 Chinese celebrities, covering the same 11 genres. The data collection was organized by the Center for Speech and Language Technologies, Tsinghua University, and funded by the National Natural Science Foundation of China (No. 61633013) and the Postdoctoral Science Foundation of China (No. 2018M640133).
MobvoiHotwords is a corpus of wake-up words collected from a commercial Mobvoi smart speaker. It consists of keyword and non-keyword utterances.
For the keyword data, utterances containing either 'Hi xiaowen' or 'Nihao Wenwen' were collected, with about 36k utterances per keyword. All keyword data was collected from 788 subjects aged 3 to 65, at different distances from the smart speaker (1, 3, and 5 meters). Different noises (typical home-environment noises such as music and TV) at varying sound pressure levels were played in the background during collection.