WP2TXT: Wikipedia to Text Converter



Latest version: 0.3.0 (May 18, 2009)

1. About

WP2TXT extracts plain text from a Wikipedia dump file (XML, optionally compressed with Bzip2), stripping all MediaWiki markup and other metadata. It was originally intended for researchers looking for an easy way to obtain open-source multilingual corpora, but it may be handy for other purposes as well.

WP2TXT is written in the Ruby programming language and comes with a GUI built with wxRuby.

[Screenshot: Mac OS X]

2. Features

[Screenshot: English]

[Screenshot: Japanese]

3. Output Format

4. Download and Installation

Windows installer and Mac OS X DMG packages are available at: http://rubyforge.org/projects/wp2txt/

Notice: WP2TXT has only been tested on Windows Vista (Ultimate Edition) and Mac OS X Leopard (10.5.5). It may not work on other versions of these operating systems.

5. How to Use

Quick Start

  1. Read Wikipedia: Database Download carefully and download an appropriate dump file. In many cases, the file name ends in pages-articles.xml.bz2.

  2. Set the dump file as the Input File. The input file can be either BZ2-compressed or uncompressed.

  3. Set the directory where the output files will be saved as the Output Dir.

  4. Set the options (see below).

  5. Click the START button and wait until the conversion finishes. It may take several hours, depending on the hardware and software environment.
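The conversion takes so long mainly because a full dump is far too large to load into memory and has to be read as a stream. The following is a minimal Python sketch of that idea, not WP2TXT's own Ruby code: the function name iter_pages is hypothetical, and it assumes the usual MediaWiki export layout in which each page element contains a title element and a text element.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the XML namespace prefix, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(path):
    """Yield (title, wikitext) pairs from a dump, one page at a time.

    Hypothetical helper: accepts a .bz2 file or an uncompressed XML file.
    """
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rb") as stream:
        # iterparse reads incrementally, so the whole dump is never in memory.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            title = text = ""
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text or ""
                elif name == "text":
                    text = child.text or ""
            yield title, text
            elem.clear()  # discard the finished page's subtree
```

A caller would loop over iter_pages("pages-articles.xml.bz2") and process each article as it arrives.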

Configurable Options

Data Conversion

If To Text Format is selected, the data are converted to plain text, stripping all XML tags, MediaWiki markup (where possible), and other metadata.
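To give a rough idea of what stripping MediaWiki markup involves, here is a small Python sketch based on regular expressions. It is an illustration only and does not reproduce WP2TXT's actual rules; the function name strip_markup is hypothetical, and real wikitext has many constructs that this sketch ignores.

```python
import re


def strip_markup(wikitext):
    """Very rough wikitext-to-plain-text conversion (illustrative only)."""
    text = re.sub(r"<!--.*?-->", "", wikitext, flags=re.DOTALL)   # HTML comments
    text = re.sub(r"<ref\b[^>]*?/>", "", text)                    # <ref name="x" />
    text = re.sub(r"<ref\b[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Remove innermost {{...}} templates repeatedly so nested ones disappear too.
    while True:
        reduced = re.sub(r"\{\{[^{}]*\}\}", "", text)
        if reduced == text:
            break
        text = reduced
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # [http://example.org label] -> label
    text = re.sub(r"\[https?://\S+\s+([^\]]+)\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)       # ''italic'' and '''bold''' quotes
    text = re.sub(r"<[^>]+>", "", text)     # any remaining tags
    return text
```

The "where possible" caveat applies here as well: templates are simply dropped rather than expanded, so any text they would have produced is lost.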

If Keep XML Format is selected, no conversion is done. This option is useful when you only need to extract XML data from the compressed file.
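Extracting the XML without any conversion amounts to stream-decompressing the Bzip2 file. A minimal Python sketch using only the standard library follows; the function name decompress_dump and the chunk size are assumptions for illustration, not part of WP2TXT.

```python
import bz2
import shutil


def decompress_dump(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 dump to an XML file, one chunk at a time."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # copyfileobj reads and writes in chunks, so memory use stays small.
        shutil.copyfileobj(src, dst, chunk_size)
```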

Output File Spec

Choose the encoding of the output text from among UTF-8, Shift JIS, and EUC. UTF-8 is the default and is strongly recommended, because characters may be corrupted when converted to the other encodings.

Also specify the preferred size (in MB) of each output file (10 MB is the default).
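The two settings above can be pictured with the following Python sketch, which encodes text and starts a new numbered file once a size limit would be exceeded. It is not WP2TXT's implementation: the class name ChunkedWriter and the file naming pattern are hypothetical, and "EUC" is assumed to mean EUC-JP (Python codec name euc_jp).

```python
import os


class ChunkedWriter:
    """Write encoded text to numbered files with an approximate size limit."""

    def __init__(self, out_dir, basename, encoding="utf-8",
                 max_bytes=10 * 1024 * 1024):
        self.out_dir = out_dir
        self.basename = basename
        self.encoding = encoding      # e.g. "utf-8", "shift_jis", "euc_jp"
        self.max_bytes = max_bytes
        self.paths = []               # files created so far
        self._file = None
        self._size = 0

    def write(self, text):
        # errors="replace" turns unencodable characters into "?" instead of
        # raising; this is the kind of loss a non-UTF-8 encoding can cause.
        data = text.encode(self.encoding, errors="replace")
        too_big = self._size > 0 and self._size + len(data) > self.max_bytes
        if self._file is None or too_big:
            self._open_next()
        self._file.write(data)
        self._size += len(data)

    def _open_next(self):
        self.close()
        # Hypothetical naming scheme: basename-001.txt, basename-002.txt, ...
        name = "%s-%03d.txt" % (self.basename, len(self.paths) + 1)
        path = os.path.join(self.out_dir, name)
        self._file = open(path, "wb")
        self._size = 0
        self.paths.append(path)

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None
```

Writing a Hangul character through a writer created with encoding="shift_jis" would store a question mark in its place, which illustrates why UTF-8 is the safer choice for multilingual text.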

Elements Extracted

Specify the elements of each article that need to be extracted. The default elements are Title, Heading, and Paragraph. For many purposes, this will produce a good collection of text data that is easy to deal with. Elements such as Quote and List often contain text fragments and therefore may not be suitable for, say, linguistic studies. (All table data are skipped because they are likely to become unreadable when converted to plain text.)
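Element selection can be thought of as labeling each line of wikitext and keeping only the wanted labels. The Python sketch below shows one possible heuristic; it is not how WP2TXT itself classifies elements. The function names are hypothetical, and treating colon-indented lines as quotes is an assumption made purely for illustration. The title is not covered here because it comes from the page's title element rather than from the article text.

```python
import re

HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")


def classify_line(line):
    """Guess the element type of one wikitext line (rough heuristic)."""
    if not line.strip():
        return "blank"
    if HEADING.match(line):
        return "heading"                      # == History ==
    if line.startswith(("{|", "|", "!")):
        return "table"                        # table rows are always skipped
    if line.startswith(("*", "#", ";")):
        return "list"
    if line.startswith(":"):
        return "quote"                        # assumption: indented line = quote
    return "paragraph"


def select_lines(wikitext, wanted=("heading", "paragraph")):
    """Keep only the lines whose guessed type is in `wanted`."""
    return [line for line in wikitext.splitlines()
            if classify_line(line) in wanted]
```

With the default selection, list items, indented quotes, and table rows are dropped, leaving headings and running prose.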

6. Limitations

7. Contact

Author: Yoichiro Hasebe (Doshisha University, Japan)

Web: http://www.yohasebe.com
Email: yohasebe@gmail.com

8. License

This software is distributed under the MIT License. Please see the LICENSE file.

9. References

長谷部陽一郎. 2006. Wikipedia日本語版を利用した言語研究の手法. 『言語文化』 第9巻 第2号. 373-403.

Hasebe, Yoichiro. 2006. Method for Using Wikipedia as Japanese Corpus. Doshisha Studies in Language and Culture 9(2). 373-403.