Ashley Liu

Scholar of Sinophone Studies, Literature of the Japanese Empire, and Digital Humanities

Digital Humanities

Chronological Database of Chinese Literature (CDCL)

CDCL is a fully machine-accessible digital database that consists of more than 2,000 titles of Chinese literature (161,460,042 characters in total) from the Three Kingdoms period to the Republican era. I initially made this for personal use and research. To my knowledge, this is the largest publicly and freely available database of Chinese literature that is 1) fully in plain text and 2) categorized into chronological eras. All texts are in txt format and are intended primarily for machine reading.

The pre-modern part of the database follows the catalogue outlined by guoxue.com (http://www.guoxue.com/cp/gxbd_ml01.htm); its texts are drawn from Paul Vierthaler’s Digital Siku Quanshu collection and from web-scraping Wikisource. The Republican part of the database consists of the complete collections of Lu Xun 魯迅, Zhou Zuoren 周作人, and Yu Dafu 郁達夫 and prose collections of Chen Duxiu 陳獨秀 and Guo Moruo 郭沫若.

All texts are sorted by era of origin; the chronological categorization is based on the guoxue.com catalogue (http://www.guoxue.com/cp/gxbd_ml01.htm) and my own research and verification.

CDCL is designed to study changes in linguistic and textual properties of Chinese literature in historical periods from the Three Kingdoms era to the Republican period.
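As an illustration of the kind of machine reading the database is meant for, here is a minimal Python sketch that tallies Han characters per era. It rests on assumptions not stated on this page: that the download unpacks into one folder per era containing .txt files (a hypothetical layout `root/<era>/<title>.txt`), that the files are UTF-8 encoded, and that a "character" means a code point in the main CJK ideograph blocks. The function names are illustrative only, and the counts it produces may differ from the figures listed here if the original counting method was different.

```python
from collections import Counter
from pathlib import Path


def is_han(ch: str) -> bool:
    """Return True for code points in the main CJK ideograph blocks."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF          # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF       # CJK Extension A
        or 0x20000 <= cp <= 0x2A6DF     # CJK Extension B
    )


def count_han(text: str) -> int:
    """Count Han characters, ignoring punctuation, Latin letters, and spaces."""
    return sum(1 for ch in text if is_han(ch))


def characters_per_era(root: Path, encoding: str = "utf-8") -> Counter:
    """Sum Han character counts for the .txt files under each era folder.

    Assumes the hypothetical layout root/<era>/<title>.txt.
    """
    totals = Counter()
    for era_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for txt_file in era_dir.rglob("*.txt"):
            text = txt_file.read_text(encoding=encoding, errors="replace")
            totals[era_dir.name] += count_han(text)
    return totals
```

Calling `characters_per_era(Path("CDCL"))`, where `CDCL` is a placeholder for wherever the files were unpacked, returns a mapping from era folder name to character count that can then be compared across periods.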

The current breakdown:

By Era

Three Kingdoms: 7 titles, 81,126 characters

Western Jin: 9 titles, 733,526 characters

Eastern Jin: 13 titles, 594,411 characters

Southern and Northern Dynasties: 23 titles, 2,720,079 characters

Sui: 2 titles, 36,723 characters

Tang: 165 titles, 9,472,657 characters

Five Dynasties: 18 titles, 2,300,491 characters

Northern Song: 201 titles, 16,035,871 characters

Southern Song: 266 titles, 17,743,109 characters

Yuan: 251 titles, 12,914,513 characters

Ming: 378 titles, 24,949,003 characters

Qing: 481 titles, 64,408,992 characters

Republican: 9,469,553 characters

Republican Era Authors:

Chen Duxiu: 555 titles, 870,631 characters

Zhou Zuoren: complete collection, 2,721,071 characters

Lu Xun: complete collection, 3,827,206 characters

Guo Moruo: 48 titles, 307,463 characters

Yu Dafu: complete collection, 1,743,182 characters
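The five per-author figures add up to the Republican total of 9,469,553 characters given in the era list. A short Python check, using only the numbers listed on this page:

```python
# Character counts per Republican-era author, copied from the list above.
republican_counts = {
    "Chen Duxiu": 870_631,
    "Zhou Zuoren": 2_721_071,
    "Lu Xun": 3_827_206,
    "Guo Moruo": 307_463,
    "Yu Dafu": 1_743_182,
}

total = sum(republican_counts.values())
print(f"Republican total: {total:,} characters")  # Republican total: 9,469,553 characters
```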

Proportion of Each Bibliographical Category in Title Count:

Jing 經: 5.2% of all titles

Shi 史: 7.8% of all titles

Zi 子 (Confucian and Daoist): 2.5% of all titles

Ji 集: 13.7% of all titles

Biji 筆記 (“brush notes”): 45.5% of all titles

Wenlun 文論 (“literary discourse”): 8.7% of all titles

Xiqu 戲曲 (“theater and tunes”): 8.9% of all titles

Xiaoshuo 小說 (vernacular and classical language): 7.7% of all titles


The bibliographical sorting and naming of the bibliographical categories are in accordance with the Treasured Index catalogue (http://www.guoxue.com/cp/gxbd_ml01.htm) and do not represent my personal views. The titles categorized as xiaoshuo are not necessarily fictional. This percentage breakdown accounts for only the premodern portion of the CDCL and excludes the Republican section.


Download the database here:

CDCL

CDCL (version with all punctuation removed)


All credit goes to Ashley Liu and Paul Vierthaler. Unauthorized redistribution and modification of the data are prohibited.

Please email Ashley at liuyx@sas.upenn.edu for inquiries.