公开数据集统一放在 root/datasets 路径下,大家可以通过复制/软链接进行调用哦!

数据集详情页涵盖所属标签以及数据集的基本说明。数据集格式等具体信息可点击数据集来源链接查看哦~

Enwik9

数据集大小958.0 MB

更新时间2021-08-10 10:40:23

数据集标签
文本

数据集路径

数据集无需解压和下载,可直接在代码中更改数据集路径使用

/datasets/Enwik9/

数据集简介

The data is UTF-8 encoded XML consisting primarily of English text. enwik9 contains 243,426 article titles, of which 85,560 are #REDIRECT to fix broken links, and the rest are regular articles.

数据集说明

The data is UTF-8 encoded XML consisting primarily of English text. enwik9 contains 243,426 article titles, of which 85,560 are #REDIRECT to fix broken links, and the rest are regular articles. The example fragment below shows a redirection of "AdA" to "Ada programming language" and the start of a regular article with title "Anarchism". The data is UTF-8 clean. All characters are in the range U'0000 to U'10FFFF with valid encodings of 1 to 4 bytes. The byte values 0xC0, 0xC1, and 0xF5-0xFF never occur. Also, in the Wikipedia dumps, there are no control characters in the range 0x00-0x1F except for 0x09 (tab) and 0x0A (linefeed). Linebreaks occur only on paragraph boundaries, so they always have a semantic purpose. In the example below, lines were broken at 80 characters, but in reality each paragraph is one long line. The data contains some URL encoded XHTML tags such as &lt;ref&gt ... &lt;/ref&gt; and &lt;br /&gt; which decode to <ref> ... </ref> (citation) and <br /> (line break). However, hypertext links have their own encoding. External links are enclosed in square brackets in the form [URL anchor text]. Internal links are encoded as [[Wikipedia title | anchor text]], omitting the title and vertical bar if the title and anchor text are identical. Non-English characters are sometimes URL encoded as &amp;#945;, meaning &#945; (Greek alpha, α), but are more often coded directly as a UTF-8 byte sequence.