How to analyze a database of Wikipedia articles using R?

This is a “big” question and I don’t know where to start, so I hope some of you can point me in the right direction. And if this is not a good question, I will close the thread with an apology.

I want to go through the Wikipedia database (say, the English one) and compute statistics. For example, I am interested in how many active editors (to be defined) Wikipedia has at any given point in time (say, over the past 2 years).

I don’t know how to obtain such a database, how to access it, what types of data it contains, and so on. So my questions are:

  • What tools do I need for this (besides base R)? MySQL on my machine? A database connection via RODBC?
  • How do you start planning such a project?
3 answers

You want to start here: http://en.wikipedia.org/wiki/Wikipedia:Database_download

Which will bring you here: http://download.wikimedia.org/enwiki/20100312/

And the file you probably want:

# 2010-03-17 04:33:50 done Log events to all pages.
    * This contains the log of actions performed on pages.
    * pages-logging.xml.gz 1.0 GB

http://download.wikimedia.org/enwiki/20100312/enwiki-20100312-pages-logging.xml.gz
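
If you want to do the download from R itself, here is a minimal sketch (it uses the URL above; if that dated snapshot is no longer available, pick a current one from https://dumps.wikimedia.org/enwiki/):

# Rough sketch: fetch the dump and peek at it without unpacking it on disk.
url  <- "http://download.wikimedia.org/enwiki/20100312/enwiki-20100312-pages-logging.xml.gz"
dest <- "pages-logging.xml.gz"
download.file(url, destfile = dest, mode = "wb")   # ~1 GB download

con <- gzfile(dest, open = "r")
head(readLines(con, n = 20))   # first lines of the compressed XML
close(con)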

Then you import the XML into MySQL. Generating histograms of users per day, week, year, etc. does not require R; you can do it with a single MySQL query. Something like:

select DAYOFYEAR(wiki_edit_timestamp), count(*)
from page_logs
group by DAYOFYEAR(wiki_edit_timestamp)
order by DAYOFYEAR(wiki_edit_timestamp);

and so on.

(I'm not sure what their actual layout is, but it will be something like this.)
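
And if you do want the aggregated counts back in R for plotting, the DBI/RMySQL route works (RODBC would do as well). A sketch, using the same guessed table and column names as above plus a guessed wiki_user column:

# Sketch only: "page_logs", "wiki_edit_timestamp" and "wiki_user" are guesses;
# adjust them to whatever schema your import actually produces.
library(DBI)
library(RMySQL)

con <- dbConnect(RMySQL::MySQL(), dbname = "wikidump",
                 user = "youruser", password = "yourpass")

edits <- dbGetQuery(con, "
  select date(wiki_edit_timestamp)  as day,
         count(distinct wiki_user)  as active_editors
  from page_logs
  group by date(wiki_edit_timestamp)
  order by day")

dbDisconnect(con)

# quick look at the number of distinct editors per day
plot(as.Date(edits$day), edits$active_editors, type = "l",
     xlab = "date", ylab = "distinct editors per day")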

You will run into problems, no doubt, but you will learn a lot too. Good luck!


Take a look at WikiXRay, a Python/R toolkit for parsing and analyzing Wikipedia dumps.

