Wikipedia, the world's online encyclopedia, is a useful, volunteer-curated source of information. It's often said that the most valuable parts of a given Wikipedia article are not the user-contributed topic explanations, but the references on which those explanations are based.
Wikiref aims to make the process of extracting these references for later review or analysis dead simple. It operates as a Firefox (and soon to be Chrome!) browser extension that is only active when you're on a Wikipedia page.
Currently, Wikiref is not in the Firefox Add-ons store, so you can only add it as a temporary add-on. I plan to apply to get it listed in the store, but in the meantime, here are instructions for installing it as a temporary add-on in Firefox:
- Clone the repo: `git clone https://github.com/zaataylor/wikiref.git`
- Navigate to `about:debugging`
- Select "This Firefox"
- Click "Load Temporary Add-on"
- Find the location of the cloned Wikiref repository from the dropdown and click on any file in the extension directory.
- Navigate to a Wikipedia page and have fun! :)
We'll illustrate how to use Wikiref by extracting some references from the Wikipedia page about dynamic arrays.
We can capture specific references on the page by first clicking the Wikiref popup in the browser toolbar, then entering Select Mode by clicking "Select References".
Next, we can scroll down to the "References" or "External Links" sections and click on the UI element of interest. The item will be highlighted and encircled by a solid black border to make it clear what is currently selected.
Since the item we'll click on is likely a list item of some sort, we can optionally expand our selection to encompass all of the list items contained in the same list as the item that was originally clicked on. This makes extracting an entire section of references from a specific part of the page very easy.
After selecting a reference or section of references, we can extract the text and external links associated with these references by clicking the green check box that appears under the selected items. If we want to change our current selection instead of capturing the currently selected element(s), we can simply click on the new element we want to select. Alternatively, if we want to deselect the currently selected references, we can simply press the red ✕ that appears under the selected element(s). The screen capture below illustrates all of these features in the order: extract references, change selection, and cancel selection.
If we've captured a set of references and want to see what they look like in tabular form, along with any external links they contain, we can enter Display Mode. This will insert a `<div>` element containing a table near the middle of the window. The first column of the table holds the reference text, and the second column holds a numbered list of the external links associated with that reference.
Since the reference titles are pulled directly from the HTML of the page, we may notice that some titles include Wikipedia page navigation indicators such as "^", or increasing character sequences like "a b c" that indicate multiple citations of a particular reference. This can be annoying if we're just trying to capture the actual reference's text, which is why Display Mode also enables us to edit the text of references.
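To make the problem concrete, here is a minimal sketch of the kind of cleanup such an edit typically performs. `cleanReferenceText` is a hypothetical helper, not part of Wikiref. It assumes the marker is a leading `^` optionally followed by single lowercase backlink letters, so a reference whose real text begins with a lone lowercase letter would be trimmed incorrectly.

```javascript
// Hypothetical helper (not part of Wikiref): strips a leading "^" and any
// single-letter backlink labels such as "a b c" from captured reference text.
function cleanReferenceText(text) {
  return text.replace(/^\s*\^\s*(?:[a-z]\s+)*/, "").trim();
}

console.log(cleanReferenceText("^ a b c Lambert, Kenneth (2009)."));
// prints "Lambert, Kenneth (2009)."
```

Text without a leading `^` passes through unchanged apart from whitespace trimming.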
To edit references in Display Mode, click the pencil icon in the top right of the `<div>`. This icon will become highlighted, indicating that we're in Edit Mode. From here, we can click on one reference at a time, edit the text as needed, then click away from it or press the `Tab` key to finalize the edit. I am not a skilled web developer, and the Edit Mode view may not be the most visually appealing or well-designed user experience. I consider it a work in progress, and I welcome user feedback to help make it better!
After we've made all of our edits, we can exit Edit Mode by clicking the pencil icon again, which should revert the icon's appearance to its original form. From here, we can either download the edited references as JSON by clicking the export icon to the left of the pencil icon, or exit Display Mode entirely by clicking the ✕ icon to the right of the pencil icon.
If we decide we want to start over and delete the references we've previously captured, we can select "Delete References" in the popup UI, which will remove the references from `localStorage`.
If we're satisfied with the references we've currently captured/edited and want to download these references (text and any external links) as JSON, we can do so simply by clicking "Download References" in the popup UI.
This will create a JSON file named after the lowercased last segment of the current page's `document.baseURI`, where segments are split on `/` and any document section indicated by `#` is ignored. For instance, if the current page (and section) we've navigated to and captured references on is `https://en.wikipedia.org/wiki/Hard_disk_drive#References`, downloading the references will generate `hard_disk_drive.json`.
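The naming rule above can be sketched in a few lines. `referenceFileName` is a hypothetical name chosen for illustration, and skipping empty segments (so a trailing `/` doesn't produce an empty name) is an assumption rather than something Wikiref is known to do.

```javascript
// Sketch of the naming rule; `referenceFileName` is a hypothetical name,
// not necessarily what Wikiref calls it.
function referenceFileName(baseURI) {
  // Drop the "#section" part, if any.
  const withoutSection = baseURI.split("#")[0];
  // Assumption: ignore empty segments so a trailing "/" is harmless.
  const segments = withoutSection.split("/").filter((s) => s.length > 0);
  const lastSegment = segments[segments.length - 1];
  return `${lastSegment.toLowerCase()}.json`;
}

console.log(
  referenceFileName("https://en.wikipedia.org/wiki/Hard_disk_drive#References")
);
// prints "hard_disk_drive.json"
```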
Wikiref consists of three components, following the usual pattern for browser extensions: background, popup, and content.
- `background.js`: Contains the logic for actually executing a download of references after receiving a message from the content script. It primarily consists of a `handleMessage` event listener that currently only listens for messages related to downloads, but could easily be extended later to handle other kinds of events.
- `popup.js`: Listens for clicks on the extension popup and sends specific messages to the content script running in the active tab of the current window, triggering different extension modes such as Select Mode and Display Mode.
- `popup.css`: Styling for the popup.
- `wikiref.html`: Skeleton of the popup.
- `wikiref.js`: The "brain" behind Wikiref. Contains all of the logic for selecting, extracting, displaying, initiating download of, and editing references.
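As a rough illustration of the background component, the sketch below shows one way a download-only `handleMessage` listener could look. The message shape (`type`, `filename`, `references`) and the `createHandleMessage` factory are assumptions made for this example, not Wikiref's actual protocol. The download API and the URL builder are passed in as parameters so the logic can be exercised outside a browser.

```javascript
// Sketch only: the message shape ({ type, filename, references }) is an
// assumption, not Wikiref's actual protocol.
function createHandleMessage(downloads, toObjectUrl) {
  return function handleMessage(message) {
    if (!message || message.type !== "download") {
      // Not a download message; leave it for other listeners.
      return undefined;
    }
    const json = JSON.stringify(message.references, null, 2);
    return downloads.download({
      url: toObjectUrl(json),
      filename: message.filename,
    });
  };
}

// In a background script this could be wired up roughly as:
//   browser.runtime.onMessage.addListener(
//     createHandleMessage(browser.downloads, (json) =>
//       URL.createObjectURL(new Blob([json], { type: "application/json" }))
//     )
//   );
```

Dispatching on a `type` field keeps the listener easy to extend: supporting another kind of event means adding another branch rather than another listener.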
A reference is represented by a relatively simple data structure: a JavaScript object with `text`, `links`, `hash`, and `id` properties. `text` is a string containing the text of the reference as it appears on the topic page. `links` is an array containing the `href` value of each external link included in the reference text; non-external links aren't currently captured. `hash` is a SHA-1 hash of the normalized `document.baseURI` concatenated with `text`, using the `|` character as a separator. `id` is an incrementing integer that indicates the order in which a reference was extracted.
The algorithm for capturing references is relatively straightforward. Here's the sequence of steps it follows:
1. User enters Select Mode by clicking "Select References" in the extension popup UI.
   - This adds a `click` event listener to `document.body` and changes the style of the cursor to `pointer` so it's more intuitive to the user that they can now click on references.
2. User clicks on a particular reference item.
   - Internal logic determines what item was clicked using `event.target`, then traverses up the `element.parentElement` lineage for the clicked element until it finds an `<li>`. This parent element is marked as the actual element from which reference information will be extracted. This makes it easy to consistently apply styles to a selected reference regardless of what part of the reference is clicked, since clicks are really "bubbled up" to the parent `<li>` containing the clicked element.
   - A check is made to see if there was a previously selected element, and if so, any applied styles are removed from the previously selected element.
   - Styles are applied to the newly selected element to visually indicate it is highlighted. This includes a `<div>` containing action buttons (✓ for confirm selection, "Expand Selection" for highlighting the entire list the `<li>` is contained in, and ✕ for cancelling selection) that's inserted directly under the highlighted `<li>`.
3. When a user clicks on the green ✓ icon, possibly after having expanded the selection to all items in a list by clicking "Expand Selection" in the `<div>` option panel inserted at the end of step #2, it triggers a function that ultimately calls `extractReference(child, index)`, which has a relatively straightforward implementation:

   ```javascript
   /**
    * Extracts a reference from a child element.
    * This should be an <li>
    */
   async function extractReference(child, index) {
     var ref = {
       id: index,
       hash: "",
       text: "",
       links: [],
     };
     ref.text = child.innerText;
     // Hash based on current document and text of reference.
     ref.hash = await digestMessage(`${getBaseURI()}|${ref.text}`);
     var a_children = child.getElementsByTagName("a");
     for (var k = 0; k < a_children.length; k++) {
       // Extract the unique links that are external references only (for now)
       if (
         a_children[k].classList.contains("external") &&
         a_children[k].rel === "nofollow" &&
         !ref.links.includes(a_children[k].href)
       ) {
         ref.links.push(a_children[k].href);
       }
     }
     return ref;
   }
   ```
4. The captured reference(s) are stored in `localStorage`.
5. Highlighting is removed from the captured references.
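The first two steps above (entering Select Mode and resolving a click to its enclosing `<li>`) can be sketched as follows. `findListItem` and `enterSelectMode` are hypothetical names used for illustration; Wikiref's own functions may be organized differently. The document object is passed in as a parameter so the logic can be tried without a browser.

```javascript
// Walks up the parentElement chain until an <li> is found (or returns null).
// In a browser, element.closest("li") would give the same result.
function findListItem(element) {
  let current = element;
  while (current && current.tagName !== "LI") {
    current = current.parentElement;
  }
  return current ?? null;
}

// Hypothetical wiring for Select Mode; `onSelect` receives the resolved <li>.
// Returns a function that leaves Select Mode again.
function enterSelectMode(doc, onSelect) {
  const handleClick = (event) => {
    const item = findListItem(event.target);
    if (item) onSelect(item);
  };
  doc.body.addEventListener("click", handleClick);
  doc.body.style.cursor = "pointer";
  return () => {
    doc.body.removeEventListener("click", handleClick);
    doc.body.style.cursor = "";
  };
}
```

In a content script, this might be called as `const exitSelectMode = enterSelectMode(document, highlight)`, where `highlight` stands in for whatever applies the selection styles.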
Currently, references captured for a given page are stored using `localStorage`, with the key being a normalized version of `document.baseURI` and the value being a `JSON.stringify`'d version of a list of reference objects.
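A minimal sketch of this storage scheme is shown below. The helper names and the key normalization (dropping the `#section` fragment) are assumptions made for illustration. The storage object is passed in explicitly; in a content script it would be `window.localStorage`, and any object with `getItem`, `setItem`, and `removeItem` works for experimentation.

```javascript
// Assumed key normalization: the page URI without its "#section" fragment.
function storageKey(baseURI) {
  return baseURI.split("#")[0];
}

// Stores the full list of reference objects for a page as a JSON string.
function saveReferences(storage, baseURI, references) {
  storage.setItem(storageKey(baseURI), JSON.stringify(references));
}

// Returns the stored list, or an empty list if nothing has been captured.
function loadReferences(storage, baseURI) {
  const raw = storage.getItem(storageKey(baseURI));
  return raw === null ? [] : JSON.parse(raw);
}

// Removes everything captured for a page.
function deleteReferences(storage, baseURI) {
  storage.removeItem(storageKey(baseURI));
}
```

Keying by page means that references captured from different sections of the same article accumulate under one entry, while each article keeps its own list.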