Skip to content

WikiCommunityHealth/wikimedia-revert

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

92 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

wikimedia-revert

On wikipedia everyone can edit a page.

The hystory of a page contains a snapshot of it for each edit

On wikipedia everyone can delete an edit restoring the previous edit, this is a revert.

Repository structure

all the important code is in src/main/

the create folder contains all the files used to compute a dataset

the analyze folder contains all the files used to analyze a dataset

It's easier to understand the file names using examples:

c_admin_user_reverts_month_tsv.py

type_class_aggregation_name_month_format.py

type = create (c) or analyze(a)
class = i decided to split into different logical class the files , now the two classes are admin and chains and generic
aggregation = if the file is by user or by page
name = the name of the metrics it's computing
month (optional)= if the file is by month
format = in type c the format of the output file ( tsv or json)\

Dataset structure

it's a tsv, each line is an event, there are different type of events:

  • revision: edit
  • page: create, move, restore, etc of a page
  • user: create, delete, change group (become an admin)

this is the official information page.

this is my information page about fields i use.

this is the google doc which contains info about metrics

Datasets created

from the wikimedia dataset i computed different other datasets

sorted_by_pages.tsv : same as wikimedia but only with revision events and sorted by page name edits reverts/reverted 2001 6 0 2002 1472 2 2003 23907 207 2004 431129 13592 2005 1657387 88938 2006 4894301 314799 2007 8109283 636507 2008 8818419 867780 2009 8803086 884355 2010 9242952 988517 2011 9394522 1175935 2012 9764942 1173113 2013 8937768 1064283 2014 7376637 825818 2015 8543088 1088518 2016 8242733 1028889 2017 9307301 1019460 2018 9752861 1045824 2019 9186886 1067841 2020 9130259 1041494

chains

a chain happens when the targetted edit of a revert is a revert(which could belong to a chain)

for each page is saved each chain and some statistics about it

wars_json/pages

{
    "title": "Loligo_vulgaris", 
    "chains": 
    [{
        "revisions": ["113715375", "113715381", "113715393"], 
        "users": {"62.18.117.244": "", "Leo0428": "17181"}, 
        "len": 3, 
        "start": "2020-06-15 22:16:23.0", 
        "end": "2020-06-15 22:17:38.0"
    }], 
    "n_chains": 1, 
    "n_reverts_in_chains": 3, 
    "n_reverts": 38
    "mean": 3.0, 
    "longest": 3, 
    "G": 0,
    "M": 0, 
    "lunghezze": {"3": 1}
}

similarly, it's possibile to see every chain a user got involved wars_json/users

{
        
    "user": "80.181.45.118",
    "chains": [
        {
            "page": "Puppy_Dog_Pals",
            "revisions": [ "109421725", "109422928", "109422931","109465730"],
            "users": { "80.181.45.118": "",  "Moxmarco": "10204", "Sakretsu": "75109" },
            "len": 4,
            "start": "2019-12-14 13: 34: 12.0",
            "end": "2019-12-16 23: 08: 09.0"
        }
    ],
    "n_chains": 1,
    "n_reverts": 4,
    "mean": 4,
    "longest": 4,
    "G": [ 0, "{'87.19.234.101', 'ValeJappo', '80.181.45.118', 'Moxmarco', 'Sakretsu'}"],
    "lunghezze": { "3": 1 }
    
}

from this json i computed the metric by month adding more_than and involved monthly pages

title    year_month    nchain   nrev    mean    longest     more_than5      more_than7      more_than9      G   involved

monthly users

user    year_month    nchain   nrev    mean    longest     more_than5      more_than7      more_than9      G    involved

admin

i also computed data about the group of the reverter and of the reverted. an user could be

  • adm : sysop, administrator
  • reg : registered but not admin
  • not : anonymous user

adm_adm refert to the number of reverts an admin made to another admin reg refer to the number of revert made by registered user (admin included)

NB: the last 2 fields are not_reg and reg , in this case reg are registered users including admins

the data contains info about the reverts and the mutual reverts, a mutual reverts happens when in the same page if A reverts B then B reverts A

M is the controversiality metric computed by Yasseri
G is a metric that's similar to M which evalue the chains in a page(or user), when in a chain are involved users with a big edit count G will be bigger

pages

reverts

page_id     page_name   year_month   adm_adm    adm_reg     reg_adm     reg_reg     not_reg     reg

mutual

page_id     page_name   year_month    adm_adm    adm_reg     reg_reg     not_reg     reg

user

revert

user     group    year_month    tot_received     t_reg     t_not     t_adm     tot_done     d_reg     d_not     d_adm    

mutual

user    group   page_name   year_month  mutual_with_admin   mutual_with_reg  mutual_with_not

time

sort_dataset.py 1h for filtering 15min for sorting 1h15m total

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published