Corpus

Information technology term: a collection made up of a large quantity of natural-language samples in electronic form

In linguistics, the term corpus refers to a large body of text, usually curated and stored in a defined format with annotations.

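By their characteristics, corpora can be divided into monolingual corpora, bilingual corpora, parallel corpora, and so on. By the source of the material, they can be divided into written-language corpora, spoken-language corpora, composition corpora, learner corpora, corpora of ancient documents, and so on.^[1]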
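As an illustration of what "a defined format with annotations" and the monolingual versus parallel distinction can look like in practice, the following is a minimal Python sketch. The class names (`Token`, `Sentence`, `AlignedPair`), the helper `pos_frequencies`, the part-of-speech labels, and the toy sentences are all assumptions made up for this example; they do not describe the format of any corpus listed in this article.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Token:
    form: str  # surface form as it appears in the text
    pos: str   # part-of-speech label (illustrative tagset, an assumption)


@dataclass
class Sentence:
    lang: str            # language code, e.g. "en" or "zh"
    tokens: list[Token]  # annotated tokens in reading order

    def text(self) -> str:
        # Chinese is written without spaces between words; English uses them.
        sep = "" if self.lang == "zh" else " "
        return sep.join(t.form for t in self.tokens)


@dataclass
class AlignedPair:
    source: Sentence  # sentence in one language
    target: Sentence  # its translation, aligned at sentence level


def pos_frequencies(sentences: list[Sentence]) -> Counter:
    """Count how often each part-of-speech label occurs."""
    return Counter(t.pos for s in sentences for t in s.tokens)


# Toy data, invented for this sketch.
en = Sentence("en", [Token("I", "PRON"), Token("read", "VERB"), Token("books", "NOUN")])
zh = Sentence("zh", [Token("我", "PRON"), Token("讀", "VERB"), Token("書", "NOUN")])

monolingual_corpus = [en]              # sentences in a single language
parallel_corpus = [AlignedPair(en, zh)]  # aligned translations

print(en.text())
print(zh.text())
print(pos_frequencies([en, zh]))
```

Real corpora are far larger and usually stored in standardized file formats rather than in-memory objects, but the same two ideas apply: each unit of text carries annotations, and a parallel corpus additionally records which units are translations of each other.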

List of corpora

Multilingual

點通 multilingual speech corpus
University of Pennsylvania corpora
Wikipedia XML corpus
Shaoxing University (紹興文理學院): 中國漢英平行語料大世界, a Chinese-English parallel-text bilingual corpus

English

https://www.english-corpora.org
The Collins Corpus
Collins COBUILD Project. Output: the Collins dictionary of contemporary English and a grammar of contemporary English.
Corpus of Political Speeches (provided by the Hong Kong Baptist University Library)

Chinese

LIVAC Synchronous Corpus of Chinese
Lancaster Corpus of Mandarin Chinese
Lancaster Los Angeles Spoken Chinese Corpus
Corpus of political figures' speeches (provided by the Hong Kong Baptist University Library)

Traditional Chinese

Taiwan Mandarin Chinese corpus (臺灣華語文語料庫)
Academia Sinica Balanced Corpus of Modern Chinese

Simplified Chinese

Japanese

Research institutions

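Institute of Corpus Studies, Shanghai International Studies University (上海外國語大學語料庫研究院)
National Institute for Japanese Language and Linguistics (日本國立國語研究所)

and others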

External links

Free, web-based corpora (45-425 million words each): American (COCA, COHA, TIME), British (BNC), Spanish, Portuguese
Computational Linguistics at the Open Directory Project
ACL SIGLEX Resource Links: Text Corpora
The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses
Developing Linguistic Corpora: a Guide to Good Practice
An interface for querying automatically-constructed virtual corpora.
TEP: Tehran English-Persian Parallel Corpus.
Building synchronous parallel corpora of the languages taught at the Faculty of Arts of Charles University.
TS Corpus - A Turkish Corpus freely available for academic research.
Turkish National Corpus - A general-purpose corpus for contemporary Turkish
Free web-based English corpus to download (3 billion words)

References

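^ 狐狸等間隔. 日语语料库超入门 [A beginner's introduction to Japanese corpora]. WeChat Official Accounts Platform. Retrieved 2022-12-20. (Archived from the original on 2022-12-20.)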
