<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Quarry website &#8211; Pavanaja&#039;s Blog</title>
	<atom:link href="http://pavanaja.com/tag/quarry-website/feed/" rel="self" type="application/rss+xml" />
	<link>http://pavanaja.com</link>
	<description></description>
	<lastBuildDate>Sat, 29 Oct 2016 15:13:06 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.3.8</generator>
	<item>
		<title>Querying Wikipedia data</title>
		<link>http://pavanaja.com/english/querying-wikipedia-data/</link>
					<comments>http://pavanaja.com/english/querying-wikipedia-data/#comments</comments>
		
		<dc:creator><![CDATA[U B Pavanaja]]></dc:creator>
		<pubDate>Fri, 21 Oct 2016 14:21:01 +0000</pubDate>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Tech related]]></category>
		<category><![CDATA[MySQL Queries]]></category>
		<category><![CDATA[Quarry website]]></category>
		<category><![CDATA[Wikipedia]]></category>
		<guid isPermaLink="false">http://pavanaja.com/?p=2073</guid>

					<description><![CDATA[Recently I wrote a blog about the stub article length of Wikipedia articles. I mentioned the difference in actual number of characters and number of bytes used to define stub articles between English and Indian language Wikipedias. One can open any language Wikipedia, type Special:ShortPages in the search box to get the list of articles [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Recently I wrote a <a href="http://pavanaja.com/english/utf-8-indic-stub-length-wikipedia/" target="_blank">blog post</a> about the stub article length of Wikipedia articles. There I described how the number of characters and the number of bytes used to define a stub differ between the English and Indian-language Wikipedias. One can open any language Wikipedia and type Special:ShortPages in the search box to get the list of articles that have fewer than 2048 bytes. But, as already mentioned in that post, the byte threshold for an Indian-language article to be considered a stub should actually be 2048*3 = 6144 bytes by the same criterion, because each character in these scripts takes three bytes in UTF-8. How can we find the list of articles that fulfil this condition?</p>
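<p>The factor of three can be checked with a few lines of Python. This is only a minimal sketch: the sample words and the helper name <span style="font-family: courier new,courier,monospace;">utf8_bytes_per_char</span> are illustrative assumptions, and it counts Unicode code points, not visible letters.</p>

```python
# Compare UTF-8 byte counts with code point counts for English and Kannada text.
english = "Wikipedia"
kannada = "ಕನ್ನಡ"  # the word "Kannada" in Kannada script (5 code points)


def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)


print(utf8_bytes_per_char(english))  # 1.0
print(utf8_bytes_per_char(kannada))  # 3.0

# Scale the 2048-byte stub limit by the same factor.
STUB_LIMIT_BYTES = 2048
indic_stub_limit = STUB_LIMIT_BYTES * 3
print(indic_stub_limit)  # 6144
```

<p>Real articles also contain spaces, digits and wiki markup that take one byte each, so the factor of three is an upper estimate.</p>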
<p>&nbsp;</p>
<p>This brings us to the topic of querying Wikipedia data. Wikimedia Foundation Labs runs a website where one can run SQL queries on Wikimedia data: <a href="https://quarry.wmflabs.org" target="_blank">quarry.wmflabs.org</a>. Opening the website shows a textbox in which to type the SQL query to be run on Wikimedia data. In this example I will consider Wikipedia only, but queries can also be run on the data of other Wikimedia projects such as Wikisource, Wikidata, Wiktionary, etc.</p>
<p>&nbsp;</p>
<p>To begin, log in with your Wikimedia account. After logging in, type the SQL query in the textbox and click the Submit Query button. The result of running the query on Wikimedia data is then displayed. In this post I will be discussing Kannada Wikipedia. The database for Kannada Wikipedia is called knwiki_p. The complete list of databases can be obtained by running the SQL query <span style="font-family: courier new,courier,monospace;">show databases;</span></p>
<p>&nbsp;</p>
<p>To get the list of tables in Kannada Wikipedia, run the following SQL queries:</p>
<blockquote><p><span style="font-family: courier new,courier,monospace;">use knwiki_p;</span></p>
<p><span style="font-family: courier new,courier,monospace;">show tables;</span></p></blockquote>
<p>&nbsp;</p>
<p>To see the schema of any table, run the query <span style="font-family: courier new,courier,monospace;">desc &lt;tablename&gt;;</span>. For example, to see the details of the table named page, issue the query <span style="font-family: courier new,courier,monospace;">desc page;</span>. The fields of importance in the current case are <span style="font-family: courier new,courier,monospace;">page_title</span> and <span style="font-family: courier new,courier,monospace;">page_len</span>. The following query lists all articles in Kannada Wikipedia that have fewer than 6144 bytes.</p>
<blockquote><p><span style="font-family: courier new,courier,monospace;">use knwiki_p;</span></p>
<p><span style="font-family: courier new,courier,monospace;">select page_title, page_len</span></p>
<p><span style="font-family: courier new,courier,monospace;">from page where page_len &lt; 6144 and page_namespace = 0 and page_is_redirect = 0 order by page_len;</span></p></blockquote>
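<p>To see what each condition in the query above filters out, the same query can be tried on a tiny stand-in table. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The rows are invented for illustration, and the table has only the four columns used here, not the full schema of the real page table.</p>

```python
import sqlite3

# In-memory stand-in for the page table; the data is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page ("
    "page_title TEXT, page_len INTEGER, "
    "page_namespace INTEGER, page_is_redirect INTEGER)"
)
rows = [
    ("Short_article", 1500, 0, 0),
    ("Medium_article", 5000, 0, 0),
    ("Long_article", 9000, 0, 0),
    ("A_redirect", 40, 0, 1),   # excluded: it is a redirect
    ("A_talk_page", 300, 1, 0),  # excluded: not in the article namespace
]
conn.executemany("INSERT INTO page VALUES (?, ?, ?, ?)", rows)

STUB_LIMIT = 6144  # 2048 * 3 bytes

# Same conditions as the query above, with the limit passed as a parameter.
stubs = conn.execute(
    "SELECT page_title, page_len FROM page "
    "WHERE page_len < ? AND page_namespace = 0 AND page_is_redirect = 0 "
    "ORDER BY page_len",
    (STUB_LIMIT,),
).fetchall()

print(stubs)  # [('Short_article', 1500), ('Medium_article', 5000)]
```

<p>Only the two short pages in namespace 0 that are not redirects are returned, in increasing order of length.</p>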
<p>&nbsp;</p>
<p>The result can also be downloaded as a JSON or CSV file.</p>
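<p>A downloaded CSV file can then be processed offline. The following is a minimal sketch, assuming the file starts with a header row that carries the column names of the query (page_title and page_len). The sample text stands in for a real download, and its rows are invented.</p>

```python
import csv
import io

# Stand-in for a downloaded CSV file; a header row is assumed.
sample_csv = (
    "page_title,page_len\n"
    "Example_A,120\n"
    "Example_B,4300\n"
    "Example_C,6100\n"
)

reader = csv.DictReader(io.StringIO(sample_csv))
lengths = [int(row["page_len"]) for row in reader]

# How many of the listed pages would also be stubs under the 2048-byte limit?
under_2048 = sum(1 for n in lengths if n < 2048)

print(len(lengths), under_2048)  # 3 1
```

<p>For a real file, replace the <span style="font-family: courier new,courier,monospace;">io.StringIO</span> object with a file opened using <span style="font-family: courier new,courier,monospace;">open(path, newline="", encoding="utf-8")</span>.</p>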
<p><a href="http://pavanaja.com/wp-content/uploads/2016/10/2016-10-21-at-19-22-21.jpg"><img decoding="async" fetchpriority="high" class="aligncenter wp-image-2090" src="http://pavanaja.com/wp-content/uploads/2016/10/2016-10-21-at-19-22-21-300x167.jpg" alt="2016-10-21-at-19-22-21" width="440" height="245" srcset="http://pavanaja.com/wp-content/uploads/2016/10/2016-10-21-at-19-22-21-300x167.jpg 300w, http://pavanaja.com/wp-content/uploads/2016/10/2016-10-21-at-19-22-21-768x427.jpg 768w, http://pavanaja.com/wp-content/uploads/2016/10/2016-10-21-at-19-22-21.jpg 1708w" sizes="(max-width: 440px) 100vw, 440px" /></a></p>
<p>&nbsp;</p>
<p>Some other useful queries are listed below:</p>
<table style="height: 215px; width: 494px; border-color: #1a1717;" border="3" cellspacing="2" cellpadding="2">
<caption> </caption>
<tbody>
<tr style="height: 24px;">
<td style="width: 368.683px; text-align: left; height: 24px;"><strong>Query</strong></td>
<td style="width: 236.317px; text-align: left; height: 24px;"><strong>What it does</strong></td>
</tr>
<tr style="height: 75px;">
<td style="width: 368.683px; height: 75px;"><span style="font-family: courier new,courier,monospace; font-size: 10pt;">Select Count(*) from page where page_namespace = 0 and page_is_redirect =0;</span></td>
<td style="width: 236.317px; height: 75px;">Number of articles, excluding redirects</td>
</tr>
<tr style="height: 96.3334px;">
<td style="width: 368.683px; height: 96.3334px;"><span style="font-family: courier new,courier,monospace; font-size: 10pt;">Select Count(*) from page where page_namespace = 0 and page_is_redirect =0 and page_len &lt; 6144;</span></td>
<td style="width: 236.317px; height: 96.3334px;">Number of articles that have fewer than 6144 bytes</td>
</tr>
<tr style="height: 72px;">
<td style="width: 368.683px; height: 72px;"><span style="font-family: courier new,courier,monospace; font-size: 10pt;">select * from user where user_name Like 'P%';</span></td>
<td style="width: 236.317px; height: 72px;">List all users whose username starts with the letter &#8220;P&#8221;</td>
</tr>
<tr style="height: 96px;">
<td style="width: 368.683px; height: 96px;"><span style="font-family: courier new,courier,monospace; font-size: 10pt;">select user_id, user_name, user_editcount from user where user_editcount &gt; 3000 order by user_editcount desc;</span></td>
<td style="width: 236.317px; height: 96px;">List all users with an edit count above 3000, highest first</td>
</tr>
<tr style="height: 336px;">
<td style="width: 368.683px; height: 336px;"><span style="font-family: courier new,courier,monospace; font-size: 10pt;">SELECT<br />
page_namespace,<br />
page_title,<br />
page_len<br />
FROM page<br />
WHERE page_len &gt; 175000<br />
AND page_title NOT LIKE '%/%'<br />
ORDER BY page_namespace ASC;</span></td>
<td style="width: 236.317px; height: 336px;">List of long pages in any namespace (more than 175000 bytes), excluding titles that contain a slash</td>
</tr>
<tr style="height: 264px;">
<td style="width: 368.683px; height: 264px;"><span style="font-family: courier new,courier,monospace; font-size: 10pt;">SELECT rc_title as title, rc_comment as comments, count(*) as Edits<br />
FROM recentchanges<br />
WHERE rc_namespace = 0<br />
GROUP BY 1 ORDER BY 3 DESC<br />
LIMIT 100;</span></td>
<td style="width: 236.317px; height: 264px;">The 100 most edited pages during the past month</td>
</tr>
<tr style="height: 96px;">
<td style="width: 368.683px; height: 96px;"><span style="font-size: 10pt; font-family: courier new,courier,monospace;">SELECT log_title, COUNT(*) FROM logging WHERE log_type = 'thanks' GROUP BY log_title ORDER BY COUNT(*) DESC LIMIT 100;</span></td>
<td style="width: 236.317px; height: 96px;">Users who have been thanked the most</td>
</tr>
</tbody>
</table>
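<p>The counting queries in the table all follow the same pattern: filter the rows, group them, count each group, and sort by the count. The sketch below runs the last query on a made-up logging table using Python's sqlite3 module. The user names and rows are invented, and only the two columns used by the query are modelled.</p>

```python
import sqlite3

# In-memory stand-in for the logging table; rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logging (log_type TEXT, log_title TEXT)")
conn.executemany(
    "INSERT INTO logging VALUES (?, ?)",
    [
        ("thanks", "UserA"),
        ("thanks", "UserB"),
        ("thanks", "UserA"),
        ("delete", "UserC"),  # ignored: not a thanks entry
        ("thanks", "UserA"),
        ("thanks", "UserB"),
    ],
)

# Count thanks entries per title and list the most thanked first.
top = conn.execute(
    "SELECT log_title, COUNT(*) AS thanks_received FROM logging "
    "WHERE log_type = 'thanks' "
    "GROUP BY log_title ORDER BY thanks_received DESC LIMIT 100"
).fetchall()

print(top)  # [('UserA', 3), ('UserB', 2)]
```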
<p>&nbsp;</p>
<p>Useful links:</p>
<ol>
<li><a href="https://meta.wikimedia.org/wiki/Research:Quarry" target="_blank">Details about Quarry</a></li>
<li><a href="https://wikitech.wikimedia.org/wiki/Help:MySQL_queries" target="_blank">MySQL queries help</a></li>
</ol>
]]></content:encoded>
					
					<wfw:commentRss>http://pavanaja.com/english/querying-wikipedia-data/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
	</channel>
</rss>
