Project Wisteria's core Wikipedia link-graph generation, serialisation, and analysis tools. This code was developed for the paper "Large-Scale Analysis of Wikipedia’s Link Structure and its Applications in Learning Path Construction" by Y. Song and C. H. Leung, published at the 2023 IEEE 24th International Conference on Information Reuse and Integration for Data Science (IRI), Bellevue, WA, USA [doi: 10.1109/IRI58017.2023.00051].
- Make sure you have Julia installed and accessible via the command line. If not, install Julia. This code has been tested with Julia 1.7.3.
- If on Windows, make sure you have the `curl` and `7z` commands available on your command line. To add the `7z` command provided by 7-Zip, download `7zr.exe`, rename it to `7z.exe`, and add it to PATH. (A quick way to verify that both commands are visible is sketched after this list.)
- Open your Julia terminal and run the setup script to get started:

      julia setup.jl

  An Internet connection is required to automatically download and extract the necessary Wikipedia dump files for Wisteria to work correctly.
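As a quick sanity check before running the setup script, you can ask Julia whether the external commands mentioned above are reachable from your PATH. This is just a minimal sketch, not part of `setup.jl`:

    # Check that the external commands required by the setup instructions are on PATH.
    # Sys.which returns the full path of an executable, or `nothing` if it cannot be found.
    for cmd in ("curl", "7z")
        path = Sys.which(cmd)
        if path === nothing
            println("Missing command: ", cmd, ". Please install it and add it to PATH.")
        else
            println("Found ", cmd, " at ", path)
        end
    end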
- To extract link relationships between Wikipedia articles, run the command:

      julia run.jl

  An Internet connection is required, as Wikipedia dumps will be downloaded, unzipped, and deleted on the fly to save storage space.
- If, for any reason, your link extraction is incomplete, you can look inside `./logs` to find the last completely parsed data dump (there are 63 dumps to be parsed in total). Each log file is saved under the index of its dump (i.e. logs for dump index 1 are stored under `./logs/1`). A completely parsed dump will produce something like the following at the end of `title_errors.txt`:

      Average number of links: 34.715
      Pages with links: 16226
      Number of pages (not counting redirects): 15704852

  If this is not present, parsing of that dump is incomplete, and you can instruct Wisteria to resume from there. For example, if dump 21 is incomplete, pick up progress by running:

      julia run.jl 21

  A sketch for checking the logs of all 63 dumps at once is given after this list.
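The following is a minimal sketch (not part of Wisteria itself) for scanning every dump's log. It assumes the layout described above, with each dump's `title_errors.txt` stored under `./logs/<index>/`, and treats the presence of the summary line as a sign that parsing finished:

    # Hypothetical helper: list dump indices whose parse looks incomplete.
    # Assumes logs for dump i are stored as ./logs/i/title_errors.txt and that a
    # finished dump ends its log with the "Average number of links" summary.
    incomplete = Int[]
    for i in 1:63
        logfile = joinpath("logs", string(i), "title_errors.txt")
        if !isfile(logfile) || !occursin("Average number of links", read(logfile, String))
            push!(incomplete, i)
        end
    end
    println("Dumps to re-run: ", incomplete)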
- We use `PyPlot.jl` to generate graphs for our experiments. This relies on an existing Python interpreter with `matplotlib` installed. To run our experiments, please first install Python and add the `matplotlib` package using `pip install matplotlib`.
- To run `experiments/indexStats.jl`, you will need the GitHub version of `Pingouin.jl`. Enter package manager mode in Julia by pressing `]`, and run:

      add https://github.com/clementpoiret/Pingouin.jl.git
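Equivalently, the Julia-side dependencies can be added non-interactively from a script. This is only a sketch of the same installs (`PyPlot.jl` may already have been installed for you by `setup.jl`):

    # Install the plotting dependency and the GitHub version of Pingouin.jl from a script
    using Pkg
    Pkg.add("PyPlot")                                                # wraps the matplotlib you installed via pip
    Pkg.add(url="https://github.com/clementpoiret/Pingouin.jl.git")  # GitHub version of Pingouin.jl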
Since the `Pageman` object of our system uses a relative path to reference the list of Wikipedia titles, please make sure that:

- The file `enwiki-20230101-all-titles-in-ns0` is present in `./data` (this should be done automatically by `setup.jl`)
- You are running any Julia scripts from the root of this repository (i.e. where you can see `explore.jl`, `parser.jl`, `./data`, `./graph`, etc.)

Otherwise, things might not work!
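A minimal sketch for checking both conditions before running anything, assuming the default `./data` location above and using `explore.jl` as a marker for the repository root:

    # Sanity check: run this from the Julia session you intend to use
    titles = joinpath("data", "enwiki-20230101-all-titles-in-ns0")
    if !isfile("explore.jl")
        @warn "Not running from the repository root" pwd()
    end
    if !isfile(titles)
        @warn "Title list missing; run `julia setup.jl` first" titles
    end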
You can easily browse and reuse graphs generated by someone else. Just place `links.ria` and `pm.jld2` into the `./graph` directory, and you should be able to load, serialise, and explore the graph without any problems.
- `parser.jl`: Parses XML files for links and connection strengths (under development).
- `run.jl`: Downloads all required Wikipedia dump files, extracts links, and stores the graph in `./graph`, with a list of unidentifiable titles stored in `./logs`.
- `serialise.jl`: Serialises the graph into `./ser` in all supported file formats.
- `setup.jl`: Installs Julia packages, creates directories, checks commands, and downloads data. If this runs without failure, you should be able to run the rest of Wisteria.
- `utils.jl`: Utilities for saving links in the `.RIA` file format; defines the `Pageman` (page management) object for handling page IDs, titles, and redirects.
- `wikigraph.jl`: Defines the `Wikigraph` object for capturing links and relationships between Wikipedia pages, and functions to serialise a `Wikigraph` into various file formats.
To load a `Wikigraph`:

    # Include Wisteria's graph-loading functions
    include("utils.jl")
    include("wikigraph.jl")

    # Load a Wikigraph object
    wg = loadwg("path/to/graph-directory", "path/to/all-titles-file")
    # E.g. wg = loadwg("graph/", "data/enwiki-20230101-all-titles-in-ns0")
Attributes of a `Wikigraph` object:

- `wg.pm` :: `Pageman`

  Attributes of a `Pageman` object:

  - `id2title` :: `Vector{String}`

    A vector mapping `Int32` IDs to `String` titles

  - `title2id` :: `Dict{String, Int32}`

    A dictionary mapping `String` titles to `Int32` IDs

  - `redirs` :: `Vector{Int32}`

    A vector mapping each `Int32` ID to the `Int32` ID it redirects to (an ID maps back to itself if the page is not a redirect)

  - `numpages` :: `Int32`

    Number of non-redirect pages

  - `totalpages` :: `Int32`

    Total number of pages (including redirects)

- `wg.links` :: `Vector{Vector{Pair{Int32, Int32}}}`

  A vector mapping each `Int32` ID to a vector of `Pair{Int32, Int32}`s, each pairing the ID of a linked page with the weight of that link
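As a quick illustration of these attributes, assuming a graph has been loaded as above (the title string below is hypothetical; any title present in `wg.pm.title2id` will do):

    # Look up a page by title, then map its ID back to a title
    id = wg.pm.title2id["Graph_theory"]     # String title -> Int32 ID (hypothetical title)
    title = wg.pm.id2title[id]              # Int32 ID -> String title
    target = wg.pm.redirs[id]               # ID this page redirects to (== id if not a redirect)
    println(title, " has ", length(wg.links[id]), " outgoing link entries")
    println("Non-redirect pages: ", wg.pm.numpages, " of ", wg.pm.totalpages, " total")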
Tying the above together, the following sample code extracts all links and weights of node ID 1:

    # Keep track of linked IDs
    linked = Int32[]

    # Loop through all IDs and weights connected to node 1
    for (id, weight) in wg.links[1]
        # Resolve redirected pages to their target ID
        redirected_id = traceRedir!(wg.pm, id)

        # Check whether the ID is already linked
        if !(redirected_id in linked)
            # If not, add it to the vector of linked IDs
            push!(linked, redirected_id)

            # Print out the ID and weight
            println("Connected to ", redirected_id, " with weight ", weight)
        end
    end
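The same pattern can be wrapped into a small convenience function. This is a hypothetical helper, not part of Wisteria's API, built only from the attributes and the `traceRedir!` call shown above:

    # Hypothetical helper: collect de-duplicated (target ID => weight) pairs for one page.
    # Keeps the first weight seen for each redirect-resolved target, mirroring the loop above.
    function neighbours(wg, source_id)
        out = Dict{Int32, Int32}()
        for (id, weight) in wg.links[source_id]
            target = traceRedir!(wg.pm, id)   # resolve redirects as in the sample above
            haskey(out, target) || (out[target] = weight)
        end
        return out
    end

    # Usage: print the titles of everything node 1 links to
    for (target, weight) in neighbours(wg, 1)
        println(wg.pm.id2title[target], " (weight ", weight, ")")
    end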
If you found our work helpful, please cite it as follows:

    @INPROCEEDINGS{song2023large,
      author={Song, Yiding and Leung, Chun Hei},
      booktitle={2023 IEEE 24th International Conference on Information Reuse and Integration for Data Science (IRI)},
      title={Large-Scale Analysis of Wikipedia’s Link Structure and its Applications in Learning Path Construction},
      year={2023},
      pages={254-260},
      doi={10.1109/IRI58017.2023.00051}
    }
All code in this repository is licensed under the GNU General Public License version 3.