
Latest Version is 0.3.0–May 18, 2009
WP2TXT extracts plain text data from a Wikipedia dump file (encoded in XML and compressed with Bzip2), stripping all MediaWiki markup and other metadata. It was originally intended for researchers looking for an easy way to obtain open-source multilingual corpora, but it may be handy for other purposes.
WP2TXT is written in the Ruby programming language and equipped with a GUI made with wxRuby.



[[TITLE]]
==H2==, and H3 ===H3===, etc.
*, with the hierarchical structure flattened.
#, with the hierarchical structure flattened.
-
[ref] ... [/ref]

Windows installer and Mac OS X DMG packages are available at: http://rubyforge.org/projects/wp2txt/
Notice: Only tested on Windows Vista (Ultimate Edition) and Mac OS X Leopard (10.5.5); it may not work on other versions of these operating systems.
Read Wikipedia: Database Download carefully and download an appropriate dump file. In many cases, the file name should look like pages-articles.xml.bz2.
Set the dump file to Input File. The input file can be either BZ2-compressed or plain text.
Set the directory where output files are saved to Output Dir.
Set the options (see below).
Click the START button and wait until the conversion has finished. It may take several hours (depending on the hardware/software environment).
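The two accepted input types can be told apart by the first bytes of the file, since bzip2 streams begin with the ASCII bytes "BZh". The following is a minimal sketch of that idea, not WP2TXT's actual code: `input_kind` and `each_dump_line` are hypothetical helper names, and decompression is delegated to an external `bzcat` command that is assumed to be installed.

```ruby
# bzip2 streams begin with the three ASCII bytes "BZh".
BZIP2_MAGIC = "BZh".b

# Hypothetical helper: returns :bz2 or :plain for the file at +path+.
def input_kind(path)
  head = File.open(path, "rb") { |f| f.read(3) }
  head == BZIP2_MAGIC ? :bz2 : :plain
end

# Hypothetical helper: yields each line of the dump. Compressed input is
# piped through the external `bzcat` command (assumed to be on the PATH),
# so the whole file never has to be held in memory.
def each_dump_line(path, &block)
  if input_kind(path) == :bz2
    IO.popen(["bzcat", path], "r") { |io| io.each_line(&block) }
  else
    File.foreach(path, &block)
  end
end
```

Reading line by line keeps memory use roughly constant regardless of the dump size, which matters for multi-gigabyte dumps.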
Data Conversion
If To Text Format is selected, the data is converted to plain text, stripping all XML tags, MediaWiki markup (where possible), and other metadata.
If Keep XML Format is selected, no conversion is done. This option is useful when you only need to extract XML data from the compressed file.
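To give a rough idea of what stripping markup involves, here is a small sketch that removes a handful of common MediaWiki constructs with regular expressions. It is an illustration only, not the conversion logic WP2TXT uses: `strip_markup` is a hypothetical name, and nested templates, tables, and many other constructs are deliberately not handled.

```ruby
# Hypothetical helper: removes a few common MediaWiki constructs.
# Nested templates and tables are NOT handled in this sketch.
def strip_markup(text)
  text
    .gsub(/<ref[^>]*\/>/, "")                          # self-closing <ref ... />
    .gsub(/<ref[^>]*>.*?<\/ref>/m, "")                 # <ref>...</ref> footnotes
    .gsub(/\{\{[^{}]*\}\}/, "")                        # non-nested {{templates}}
    .gsub(/\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]/, '\1') # [[target|label]] -> label
    .gsub(/'{2,5}/, "")                                # ''italic'' and '''bold'''
    .gsub(/<[^>]+>/, "")                               # remaining XML/HTML tags
    .gsub(/[ \t]{2,}/, " ")                            # collapse leftover spaces
    .strip
end

sample = "'''Ruby''' is a [[programming language|language]] created by " \
         "[[Yukihiro Matsumoto]].<ref>Source</ref> {{citation needed}}"
puts strip_markup(sample)
# prints "Ruby is a language created by Yukihiro Matsumoto."
```

Regular expressions cannot match arbitrarily nested markup, which is one reason conversion is only possible "where possible".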
Output File Spec
Choose the encoding of the output text from among UTF-8, Shift JIS, and EUC. (UTF-8 is the default and is strongly recommended, because characters may be corrupted when converted to other encodings.)
Also specify the preferred size (in MB) of each output file (10 MB is the default).
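The two output settings can be pictured with a small writer that converts text to the chosen encoding and starts a new file once the current one reaches the size limit. This is a sketch under assumptions rather than WP2TXT's implementation: the `SplitWriter` class and the file naming scheme are hypothetical, and "EUC" is assumed to correspond to Ruby's `EUC-JP` encoding name.

```ruby
require "fileutils"

# Hypothetical writer: rolls over to a new file once the current one
# would exceed roughly +max_bytes+ (10 MB by default, as in the GUI).
class SplitWriter
  attr_reader :paths

  def initialize(dir, basename, max_bytes: 10 * 1024 * 1024, encoding: "UTF-8")
    @dir = dir
    @basename = basename
    @max_bytes = max_bytes
    @encoding = encoding # e.g. "UTF-8", "Shift_JIS", "EUC-JP" (assumed names)
    @paths = []
    @file = nil
    @written = 0
    FileUtils.mkdir_p(dir)
  end

  def write(text)
    # Characters the target encoding cannot represent become "?" instead
    # of raising; this is the kind of loss the UTF-8 recommendation avoids.
    bytes = text.encode(@encoding, invalid: :replace, undef: :replace, replace: "?")
    roll if @file.nil? || (@written > 0 && @written + bytes.bytesize > @max_bytes)
    @file.write(bytes)
    @written += bytes.bytesize
  end

  def close
    @file&.close
    @file = nil
  end

  private

  # Closes the current file (if any) and opens the next numbered one.
  def roll
    @file&.close
    path = File.join(@dir, format("%s-%03d.txt", @basename, @paths.size + 1))
    @file = File.open(path, "wb")
    @paths << path
    @written = 0
  end
end
```

A typical use would be `w = SplitWriter.new("out", "enwiki"); w.write(text); w.close`. Because the size check happens per write, a single chunk larger than the limit still ends up in one file, so the limit is approximate.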
Elements Extracted
Specify the elements of each article that need to be extracted. The default choice of elements is Title, Heading, and Paragraph. For many purposes, this will produce a good collection of text data that is easy to deal with. Elements such as Quote and List often contain text fragments and therefore may not be suitable for some uses, such as linguistic studies. (All table data are skipped because they are likely to become unreadable when converted to plain text.)
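Element selection can be thought of as classifying each line of article text by its leading markup and keeping only the wanted kinds. The sketch below illustrates this with standard MediaWiki line prefixes. The names `element_type` and `select_elements` are hypothetical, quotes are not detected, and titles are left out because they come from the XML rather than from the article text.

```ruby
# Hypothetical default selection: headings and paragraphs only.
DEFAULT_ELEMENTS = %i[heading paragraph].freeze

# Hypothetical classifier based on standard MediaWiki line prefixes.
def element_type(line)
  case line
  when /\A\s*\z/            then :blank
  when /\A=+[^=].*=+\s*\z/  then :heading   # == Heading ==
  when /\A[*#]/             then :list      # * bullet or # numbered item
  when /\A(?:\{\||\||!)/    then :table     # {| ... | cell ... |}
  else                           :paragraph
  end
end

# Keeps only the lines whose type is listed in +wanted+.
def select_elements(wikitext, wanted = DEFAULT_ELEMENTS)
  wikitext.each_line.map(&:chomp).select { |l| wanted.include?(element_type(l)) }
end

text = <<~WIKI
  == History ==
  Ruby was released in 1995.
  * item one
  {| class="wikitable"
  | cell
  |}
WIKI
p select_elements(text)
# prints ["== History ==", "Ruby was released in 1995."]
```

Table lines never appear in the output unless `:table` is requested explicitly, which mirrors the decision to skip table data.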
Tables and similar elements are all skipped. (Please remember that this software was originally intended for collecting "sentences" for linguistic studies.)
Certain types of data, such as mathematical equations and computer source code, are not properly converted.
Extraction of normal text data can sometimes fail for various reasons (e.g. illegal matching of begin/end tags, language-specific formatting conventions, etc.).
The conversion process can take far more time than you would expect. It could take more than 10 hours when dealing with a huge data set such as the English Wikipedia in a low-spec environment.
Because of the nature of the task, WP2TXT needs a lot of machine power and consumes a lot of memory and storage. The process could therefore halt unexpectedly. In the worst case, it may even get stuck without terminating gracefully. Please understand this and use the software at your own risk.
Author: Yoichiro Hasebe (Doshisha University, Japan)
Web: http://www.yohasebe.com
Email: yohasebe@gmail.com
This software is distributed under the MIT License. Please see the LICENSE file.
長谷部陽一郎. 2006. Wikipedia日本語版を利用した言語研究の手法. 『言語文化』 第9巻 第2号. 373-403.
Hasebe, Yoichiro. 2006. Method for Using Wikipedia as Japanese Corpus. Doshisha Studies in Language and Culture 9(2). 373-403