Recently I wrote a blog about the stub article length of Wikipedia articles. I mentioned the difference in actual number of characters and number of bytes used to define stub articles between English and Indian language Wikipedias. One can open any language Wikipedia, type Special:ShortPages in the search box to get the list of articles which have less than 2048 bytes. But the as already mentioned in tht blog, the number of bytes for Indian languages to be considered as stub should be actually 2048*3 = 6144 bytes employing the same criteria. How to find the list of articles fulfilling this condition?
This brings us to the topic of querying Wikipedia data. Wikimedia Foundation Labs has put up a website wherein one can run SQL queries on Wikimedia data. The URL of the website is quarry.wmflabs.org. When we open the website, we get a textbox wherein one can type the SQL query which will run on Wikimedia data. In this example I will consider Wikipedia only. But the queries can be run on the data of other Wikimedia projects like Wikisource, Wikidata, Wiktionary, etc.
To begin one has to login with his/her Wikimedia login. After loging in, the SQL query can be typed in the textbox and the Submit Query button has to be clicked. The result of execution of the query on Wikimedia data will be displayed. In this blog I will be discussion Kannada Wikipedia. The database for Kannada Wikipedia is called knwiki_p. Complete list of databases can be obtained by running the SQL query “show databases;”.
To get the list of tables in Kannada Wikipedia, the following SQL queries have to be executed-
To know the schema of any table, run the query desc <tablename>;. For example to know the details of the table by name page, issue the query desc page;. The fields which are of importance in the current case is page_title and page_len. The following query will list of all articles in Kannada Wikipedia which are having less than 6144 bytes.
select page_title, page_len
from page where page_len < ‘6144’ and page_namespace = 0 and page_is_redirect = 0 order by page_len ;
The resultant data can be downloaded as JSON or CSV file also.
Some other useful queries are listed below-
|Query||What it does|
|Select Count(*) from page where page_namespace = 0 and page_is_redirect =0;||Number of articles without redirect|
|Select Count(*) from page where page_namespace = 0 and page_is_redirect =0 and page_len < 6144;||Number of articles which are having bytes less than 6144|
|select * from user where user_name Like “P%”;||List all users whose username starts with letter “P”|
|select user_id, user_name, user_editcount from user where user_editcount >3000 order by user_editcount desc;||List all users with editcount more than 3000|
WHERE page_len > 175000
AND page_title NOT LIKE “%/%”
ORDER BY page_namespace ASC;
|List of long articles (articles having bytes more than 175000)|
|SELECT rc_title as title, rc_comment as comments, count(*) as Edits
WHERE rc_namespace = 0
GROUP BY 1 ORDER BY 3 DESC
|Most edited 100 pages during past one month|
|SELECT log_title, COUNT(*) FROM logging WHERE log_type=”thanks” GROUP BY log_title ORDER BY COUNT(*) DESC LIMIT 100;||Who have been thanked most|
Useful links –