Querying Wikipedia data

Recently I wrote a blog about the stub article length of Wikipedia articles. I mentioned the difference in actual number of characters and number of bytes used to define stub articles between English and Indian language Wikipedias. One can open any language Wikipedia, type Special:ShortPages in the search box to get the list of articles which have less than 2048 bytes. But the as already mentioned in tht blog, the number of bytes for Indian languages to be considered as stub should be actually 2048*3 = 6144 bytes employing the same criteria. How to find the list of articles fulfilling this condition?

This brings us to the topic of querying Wikipedia data. Wikimedia Foundation Labs has put up a website wherein one can run SQL queries on Wikimedia data. The URL of the website is quarry.wmflabs.org. When we open the website, we get a textbox wherein one can type the SQL query which will run on Wikimedia data. In this example I will consider Wikipedia only. But the queries can be run on the data of other Wikimedia projects like Wikisource, Wikidata, Wiktionary, etc.

To begin one has to login with his/her Wikimedia login. After loging in, the SQL query can be typed in the textbox and the Submit Query button has to be clicked. The result of execution of the query on Wikimedia data will be displayed. In this blog I will be discussion Kannada Wikipedia. The database for Kannada Wikipedia is called knwiki_p. Complete list of databases can be obtained by running the SQL query “show databases;”.

To get the list of tables in Kannada Wikipedia, the following SQL queries have to be executed-

use knwiki_p;

show tables;

To know the schema of any table, run the query desc <tablename>;. For example to know the details of the table by name page, issue the query desc page;. The fields which are of importance in the current case is page_title and page_len. The following query will list of all articles in Kannada Wikipedia which are having less than 6144 bytes.

use knwiki_p;

select page_title, page_len

from page where page_len < ‘6144’ and page_namespace = 0 and page_is_redirect = 0 order by page_len ;

The resultant data can be downloaded as JSON or CSV file also.

Some other useful queries are listed below-


Query	What it does
Select Count(*) from page where page_namespace = 0 and page_is_redirect =0;	Number of articles without redirect
Select Count(*) from page where page_namespace = 0 and page_is_redirect =0 and page_len < 6144;	Number of articles which are having bytes less than 6144
select * from user where user_name Like “P%”;	List all users whose username starts with letter “P”
select user_id, user_name, user_editcount from user where user_editcount >3000 order by user_editcount desc;	List all users with editcount more than 3000
SELECT page_namespace, page_title, page_len FROM page WHERE page_len > 175000 AND page_title NOT LIKE “%/%” ORDER BY page_namespace ASC;	List of long articles (articles having bytes more than 175000)
SELECT rc_title as title, rc_comment as comments, count(*) as Edits FROM recentchanges WHERE rc_namespace = 0 GROUP BY 1 ORDER BY 3 DESC LIMIT 100;	Most edited 100 pages during past one month
SELECT log_title, COUNT() FROM logging WHERE log_type=”thanks” GROUP BY log_title ORDER BY COUNT() DESC LIMIT 100;	Who have been thanked most

Useful links –

2 thoughts on “Querying Wikipedia data”

Ananth says:

October 21, 2016 at 7:56 PM

How to get a list of Index pages on Wikisource by a particular User?
pavanaja says:

October 22, 2016 at 9:57 AM

@Ananth – this blog is about querying Wikipedia data. Querying Wikisource data can be a separate blog. Let me find some time for that

2 thoughts on “Querying Wikipedia data”

Leave a Reply

Latest

Categories

Pages

Recent Posts

Tags