Chronological Database of Chinese Literature (CDCL)
CDCL is a fully machine-accessible digital database of more than 2,000 titles of Chinese literature (161,460,042 characters in total), spanning the Three Kingdoms period to the Republican era. I initially built it for personal use and research. To my knowledge, it is the largest publicly and freely available database of Chinese literature that is 1) entirely in plain text and 2) categorized by chronological era. All texts are in txt format and intended primarily for machine reading.
The pre-modern part of the database follows the catalogue outlined by guoxue.com (http://www.guoxue.com/cp/gxbd_ml01.htm); it draws on Paul Vierthaler’s Digital Siku Quanshu collection and on texts web-scraped from Wikisource. The Republican part of the database consists of the complete collections of Lu Xun 魯迅, Zhou Zuoren 周作人, and Yu Dafu 郁達夫, and the prose collections of Chen Duxiu 陳獨秀 and Guo Moruo 郭沫若.
All texts are sorted by era of origin; the chronological categorization is based on the guoxue.com catalogue (http://www.guoxue.com/cp/gxbd_ml01.htm) and my own research and verification.
CDCL is designed for studying changes in the linguistic and textual properties of Chinese literature across historical periods, from the Three Kingdoms era to the Republican period.
The current breakdown:
By Era
Three Kingdoms: 7 titles, 81,126 characters
Western Jin: 9 titles, 733,526 characters
Eastern Jin: 13 titles, 594,411 characters
Southern and Northern Dynasties: 23 titles, 2,720,079 characters
Sui: 2 titles, 36,723 characters
Tang: 165 titles, 9,472,657 characters
Five Dynasties: 18 titles, 2,300,491 characters
Northern Song: 201 titles, 16,035,871 characters
Southern Song: 266 titles, 17,743,109 characters
Yuan: 251 titles, 12,914,513 characters
Ming: 378 titles, 24,949,003 characters
Qing: 481 titles, 64,408,992 characters
Republican: 9,469,553 characters
Republican Era Authors:
Chen Duxiu: 555 titles, 870,631 characters
Zhou Zuoren: complete collection, 2,721,071 characters
Lu Xun: complete collection, 3,827,206 characters
Guo Moruo: 48 titles, 307,463 characters
Yu Dafu: complete collection, 1,743,182 characters
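For readers who want to recompute figures like those above from a local copy, here is a minimal Python sketch. It assumes a hypothetical layout, not confirmed by this page, in which each era is a sub-directory holding one UTF-8 txt file per title. The function name count_by_era and the rule of counting every non-whitespace character are likewise illustrative assumptions, so the results may not match the published counts exactly.

```python
from pathlib import Path


def count_by_era(root):
    """Return {era_name: (title_count, character_count)}.

    Assumed (hypothetical) layout: root/<era>/<title>.txt, UTF-8 encoded.
    """
    totals = {}
    era_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for era_dir in era_dirs:
        titles = 0
        chars = 0
        # rglob also picks up txt files in nested folders, if any exist
        for txt_path in sorted(era_dir.rglob("*.txt")):
            text = txt_path.read_text(encoding="utf-8")
            # Count every character that is not whitespace (spaces, newlines)
            chars += sum(1 for ch in text if not ch.isspace())
            titles += 1
        totals[era_dir.name] = (titles, chars)
    return totals
```

Calling count_by_era on the folder where the download was unpacked returns one (titles, characters) pair per era folder, which can then be compared against the list above.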
Proportion of Each Bibliographical Category in Title Count:
Jing 經: 5.2% of all titles
Shi 史: 7.8% of all titles
Zi 子 (Confucian and Daoist): 2.5% of all titles
Ji 集: 13.7% of all titles
Biji 筆記 (“brush notes”): 45.5% of all titles
Wenlun 文論 (“literary discourse”): 8.7% of all titles
Xiqu 戲曲 (“theater and tunes”): 8.9% of all titles
Xiaoshuo 小說 (vernacular and classical language): 7.7% of all titles
The bibliographical sorting and the naming of the bibliographical categories follow the Treasured Index catalogue (http://www.guoxue.com/cp/gxbd_ml01.htm) and do not represent my personal views. The titles categorized as xiaoshuo are not necessarily fictional. This percentage breakdown covers only the pre-modern portion of the CDCL and excludes the Republican section.
Download the database here:
CDCL (version with all punctuation removed)
All credits to Ashley Liu and Paul Vierthaler. Unauthorized redistribution and modification of the data are prohibited.
Please email Ashley at liuyx@sas.upenn.edu for inquiries.