suntong edited this page Jan 19, 2023 · 11 revisions

Examples

Non-block selection mode

All three options, -i, -o and -c, are required. When no file name is given, cascadia reads from stdin and writes to stdout:

$ echo '<input type="radio" name="Sex" value="F" />' | tee /tmp/cascadia.xml | cascadia -i -o -c 'input[name=Sex][value=F]'
1 elements for 'input[name=Sex][value=F]':
<input type="radio" name="Sex" value="F"/>

Either the input or the output can be followed by a file name:

$ cascadia -i /tmp/cascadia.xml -o -c 'input[name=Sex][value=F]'
1 elements for 'input[name=Sex][value=F]':
<input type="radio" name="Sex" value="F"/>
$ cascadia -i /tmp/cascadia.xml -c 'input[name=Sex][value=F]' -o /tmp/out.html
1 elements for 'input[name=Sex][value=F]':

$ cat /tmp/out.html
<input type="radio" name="Sex" value="F"/>

Other options can be applied as well:

# using --wrap-html
$ cascadia -i /tmp/cascadia.xml -c 'input[name=Sex][value=F]' -o /tmp/out.html -w
1 elements for 'input[name=Sex][value=F]':

$ cat /tmp/out.html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<base href="">

</head>
<body>
<input type="radio" name="Sex" value="F"/>
</body>

# using --wrap-html with --style
$ cascadia -i /tmp/cascadia.xml -c 'input[name=Sex][value=F]' -o /tmp/out.html -w -y '<link rel="stylesheet" href="styles.css">'
1 elements for 'input[name=Sex][value=F]':

$ cat /tmp/out.html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<base href="">
<link rel="stylesheet" href="styles.css">
</head>
<body>
<input type="radio" name="Sex" value="F"/>
</body>

For more on using the --style option, check out "adding styles".

Manual selection helper

There might also be cases when we don't want to figure out the CSS selector at all, be it for a one-off extraction, or because the section to extract keeps changing on the same site, or because the tags keep changing on each request, like the following:

image

Whatever the reason, manual selection with the help of the developer tools' visual aid is the fastest way to grab what we want. However, more often than not, the links will be wrong and the images will be missing if we store the extracted HTML somewhere else as-is (extract from https://site-a/ and put it in https://site-b/), because all the href and image links should still point to https://site-a/, not https://site-b/.

cascadia can help in this situation too, and you still don't need to figure out the exact CSS selectors, thanks to the :root CSS selector. But it needs a twist:

$ echo '<div class="container"><p align="justify"><img src="pic_trulli.jpg" alt="Trulli" width="500" height="333"></p></div>' | tee /tmp/cascadia.xml 
<div class="container"><p align="justify"><img src="pic_trulli.jpg" alt="Trulli" width="500" height="333"></p></div>

$ cat /tmp/cascadia.xml | cascadia -q -i -o -c 'div.container' | tee /tmp/w3schools-img.html
<div class="container"><p align="justify"><img src="pic_trulli.jpg" alt="Trulli" width="500" height="333"/></p></div>
# Using a proper css selector, but the /tmp/w3schools-img.html file would *not* be able to show the image properly

$ cat /tmp/cascadia.xml | cascadia -q -i -o -c ':root'
<html><head></head><body><div class="container"><p align="justify"><img src="pic_trulli.jpg" alt="Trulli" width="500" height="333"/></p></div>
</body></html>

I.e., the :root CSS selector does not quite give us what we want cleanly: it gives us the "extra" html and head tags as well, because "In HTML, the root element is always the html element". Not a problem; we can easily get around the hiccup like this:

$ baseHref=https://www.w3schools.com/html/

$ cat /tmp/cascadia.xml | cascadia -q -i -o -c ':root' | cascadia -q -i -o -c 'body > div' --wrap-html --base $baseHref | tee /tmp/w3schools-img.html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<base href="https://www.w3schools.com/html/">

</head>
<body>
<div class="container"><p align="justify"><img src="pic_trulli.jpg" alt="Trulli" width="500" height="333"/></p></div>
</body>

The /tmp/w3schools-img.html file would now be able to show the image properly.
(Of course, if the extracted root tag is not div, you need to change the above command accordingly)
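
To see why the base href matters, here is a deliberately simplified model of how a relative link gets resolved against a base URL. The resolve_url function is hypothetical, written only for this illustration; it handles plain relative paths and assumes the base ends in "/", whereas real browsers also handle "../", root-relative paths and more.

```shell
# Simplified model of applying <base href> to a relative URL.
# Assumption: the base ends in "/" and the reference is a plain relative path.
resolve_url() {
  base=$1
  ref=$2
  case $ref in
    *://*) printf '%s\n' "$ref" ;;            # already absolute: leave unchanged
    *)     printf '%s%s\n' "$base" "$ref" ;;  # relative: append to the base
  esac
}

# With the original site as base, the image link points where it should.
resolve_url 'https://www.w3schools.com/html/' 'pic_trulli.jpg'
# Without it, the link is resolved against wherever the copy is stored.
resolve_url 'https://site-b/' 'pic_trulli.jpg'
```

The first call prints https://www.w3schools.com/html/pic_trulli.jpg, the second https://site-b/pic_trulli.jpg, which is the broken link the --base option avoids.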

Multi-selection

Of course, any number of selections is allowed (this comes out of the box with the CSS selector "," syntax):

$ echo '<table border="0" cellpadding="0" cellspacing="0" style="table-layout: fixed; width: 100%; border: 0 dashed; border-color: #FFFFFF"><tr style="height:64px">aaa</tr></table>' | cascadia -i -o -c 'table[border="0"][cellpadding="0"][cellspacing="0"], tr[style=height\:64px]'
2 elements for 'table[border="0"][cellpadding="0"][cellspacing="0"], tr[style=height\:64px]':
<table border="0" cellpadding="0" cellspacing="0" style="table-layout: fixed; width: 100%; border: 0 dashed; border-color: #FFFFFF"><tbody><tr style="height:64px"></tr></tbody></table>
<tr style="height:64px"></tr>

Or, to make the multi-selection explicit on the command line, emphasizing that different parts are selected with different selectors, one can provide multiple --css parameters. E.g.,

cascadia -o -i http://www.iciba.com/conformity -c 'div.js-base-info > div > div > div.in-base-top.clearfix' -c 'div.js-base-info > div > div > ul' -c 'div.js-base-info > div > div > li' -c 'div.info-article.article-tab'

The output is built from the matches of all four -c CSS selectors.

It has the same effect as using the "," syntax, except that:

  • The CSS selectors are given explicitly, one per --css parameter.
  • The "," syntax returns matches in the order they occur in the source, while
  • multiple --css parameters return matches in the order the --css parameters are given.
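
A small way to observe the ordering difference yourself, as a sketch only: the input HTML and the class names first and second are made up for this illustration, and the options are the ones already used on this page. Going by the description above, the first command should list the paragraphs in source order and the second in parameter order.

```shell
# Hypothetical input: "first" comes before "second" in the source.
html='<div><p class="first">one</p><p class="second">two</p></div>'

if command -v cascadia >/dev/null 2>&1; then
  # "," syntax: a single selector list
  printf '%s\n' "$html" | cascadia -q -i -o -c 'p.second, p.first'
  # multiple --css parameters, given in the reverse of source order
  printf '%s\n' "$html" | cascadia -q -i -o -c 'p.second' -c 'p.first'
else
  echo 'cascadia not found in PATH; skipping' >&2
fi
```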

Block selection mode

The non-block selection mode outputs the selection as HTML source. If you want the text of the HTML instead, use the block selection mode:

$ echo '<div class="container"><p align="justify"><b>Name: </b>John Doe</p></div>' | tee /tmp/cascadia.xml | cascadia -i -o -c 'div > p'
1 elements for 'div > p':
<p align="justify"><b>Name: </b>John Doe</p>

$ cat /tmp/cascadia.xml | cascadia -i -o -c 'div' --piece SelText='p'
SelText
Name: John Doe

Block selection mode HTML output

Note that the block selection mode can output in HTML as well -- it just outputs (HTML) text by default:

$ cat /tmp/cascadia.xml | cascadia -i -o -c 'div' --piece SelText='RAW:p'
SelText 
<p align="justify"><b>Name: </b>John Doe</p>

Block selection mode table output

The real power of block selection mode lies in its ability to produce tsv/csv tables without any Go programming:

$ curl --silent https://news.ycombinator.com | cascadia -i -o -c 'tr.athing' -p No=span.rank -p Title='td.title > a' -p Site=span.sitestr
No      Title   Site
1.      Onedrive is slow on Linux but fast with a ?Windows? user-agent (2016)   microsoft.com
2.      Starting today, users of Firefox can also enjoy Netflix on Linux        netflix.com
3.      Research Debt   distill.pub
...
27.     USPS Informed Delivery ? Digital Images of Front of Mailpieces  usps.com
28.     Performance bugs ? the dark matter of programming bugs  forwardscattering.org
29.     Most items of clothing have complicated international journeys  bbc.co.uk
30.     High-performance employees need quieter work spaces     qz.com

It is a poor man's scraper if text is all that is needed. For scraping beyond text, go one step further and use andrew-d/goscrape (or my goscrape instead, which has some enhancements to it).

Again, if text is all that is needed, cascadia might already be enough. Here is how to scrape Hacker News across several pages:

$ curl --silent 'https://news.ycombinator.com/news?p=[1-3]' | cascadia -i -o -c 'tr.athing' -p No=span.rank -p Title='td.title > a' -p Site=span.sitestr
No      Title   Site
1.      Starting today, users of Firefox can also enjoy Netflix on Linux        netflix.com
2.      Onedrive is slow on Linux but fast with a ?Windows? user-agent (2016)   microsoft.com 
3.      Research Debt   distill.pub
...
27.     Yes I Still Want to Be Doing This at 56 (2012)  thecodist.com
28.     Performance bugs ? the dark matter of programming bugs  forwardscattering.org
29.     USPS Informed Delivery ? Digital Images of Front of Mailpieces  usps.com
30.     High-performance employees need quieter work spaces     qz.com
31.     Most items of clothing have complicated international journeys  bbc.co.uk
32.     Telstra?s Gigabit Class LTE Network     cellularinsights.com
...
58.     The New Laptop Ban Adds to Travelers' Lack of Privacy and Security      eff.org 
59.     QEMU: user-to-root privesc inside VM via bad translation caching        chromium.org
60.     Startups that debuted at Y Combinator W17 Demo Day 2    techcrunch.com
61.     The Cracking Monolith: Forces That Call for Microservices       semaphoreci.com 
62.     Amsterdam Airport Launches API Platform schiphol.nl
...
88.     Founder Stories: Leah Culver of Breaker (YC W17)        ycombinator.com 
89.     Find out what you, or someone on your team, did on the last working day github.com
90.     PSD2 ? a directive that will change banking in Europe   evry.com

By default it uses tab (\t) as the field delimiter, so the output is in .tsv format. To switch to .csv, add -d , to the command line.
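
Titles can themselves contain commas, and this page does not say whether fields are quoted when -d , is used. If strictly quoted CSV matters, one option is to keep the default tab output and convert it downstream. The following is a minimal sketch using awk; tsv_to_csv is a hypothetical helper written for this illustration, not part of cascadia, and the sample row is made up.

```shell
# Convert tab-separated lines on stdin to CSV with every field quoted.
# Embedded double quotes are doubled, as is conventional for CSV.
tsv_to_csv() {
  awk -F'\t' '{
    for (i = 1; i <= NF; i++) {
      f = $i
      gsub(/"/, "\"\"", f)                         # escape embedded quotes
      printf "%s\"%s\"", (i > 1 ? "," : ""), f     # comma between fields
    }
    printf "\n"
  }'
}

# Sample input standing in for cascadia table output.
printf 'No\tTitle\tSite\n1.\tHello, "World"\texample.com\n' | tsv_to_csv
```

In real use, the cascadia command from above would simply be piped into tsv_to_csv.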

Attribute selection

Thanks to a PR by @himcc, cascadia can now select element attributes as well, which is not possible with CSS selector syntax alone. Attribute selection

  1. requires block selection mode, and
  2. outputs in table form.

The usage syntax is explained in cascadia help itself:

$ cascadia
. . .
  -p, --piece       sub CSS selectors within -css to split that block up into pieces
                        format: PieceName=[OutputStyle:]selector_string
                         OutputStyle:
                          RAW : will return the selected as-is
                          attr[xx] : will return the value of xx attribute
                        else the text will be returned 
. . .

Here are some examples:

$ cat /tmp/ul.html
<ul>
                <li><a id="a1" href="http://www.google.com/finance"/>
                <li><a id="a2" href="http://finance.yahoo.com/"/>
                <li><a id="a3" href="https://www.google.com/news"></a>
                <li><a id="a4" href="http://news.yahoo.com"/>
</ul>

$ cat /tmp/ul.html | cascadia -i -o -c "li" -p 'url=attr[href]:a'
url
http://www.google.com/finance
http://finance.yahoo.com/
https://www.google.com/news
http://news.yahoo.com

$ cat /tmp/ul.html | cascadia -i -o -c "li" -p 'LinkID=attr[id]:a'
LinkID
a1
a2
a3
a4

Twitter Search

Block selection mode is a poor man's web scraping tool, and it is very simple to use. Here is another practical example -- Twitter searching. We all know that you have to pay for the Twitter Search API and that it only serves Tweets from the past week. With cascadia, you can search the tweets for free, and get the latest content as well.

Here is how I watch for Toronto/GTA's Gas Price Alert, without getting all the other tweets from the same account:

$ cascadia -i 'https://twitter.com/search?q=%22Gas%20Price%20Alert%22%20%23GTA%20from%3AGasBuddyDan&src=typd' -o -c 'div.stream div.original-tweet div.content' --piece Time='small.time' --piece Tweet='div.js-tweet-text-container > p'
Time    Tweet

  Jul 31
        Gas Price Alert #Toronto #GTA #Hamilton #Ottawa #LdnOnt #Barrie #Kitchener #Niagara #Windsor N/C Tues and to a 2ct/l HIKE gor Wednesday

  Jul 6
        Gas Price Alert #Toronto #GTA #LdnOnt #Hamilton #Ottawa #Barrie #KW to see a 1 ct/l drop @ for Friday July 7

  May 30
        Gas Price Alert #Toronto #GTA #Ottawa #LdnOnt #Hamilton #KW #Barrie #Windsor prices won't change Wednesday but will DROP 1 ct/l Thursday

  May 15
        Gas Price Alert #Toronto #GTA #Barrie #Hamilton #LdnOnt #Ottawa #KW #Windsor NO CHANGE @  except gas bar shenanigans for Tues & Wednesday

  Mar 7
        Gas Price Alert #Toronto #GTHA #LdnOnt #Ottawa #Barrie #KW #Windsor to see a 1 cent a litre HIKE Wed March 8 (to 107.9 in the #GTA)

Reconstruct the separated pages

Many web sites annoyingly split one document into several small pieces so that they can show it to you on different web pages, with different ads. However, I'd like to view it all on one page with no ads. At least that is what I had been hoping for all along, but I didn't have an easy way of doing it until now, with cascadia.

With cascadia, no more programming is necessary. All we need to do is pass some command line parameters, and the magic happens. There are many sites that break things into several small pieces; the following two are ones I did just the other day.

The first one is spread across 23 pages! Twenty-three! I would just give up if I didn't have cascadia, but with it, it is so simple:

curl --silent 'http://www.chinadmd.com/file/prrxtuivvxsxxwwaexuuwovp_[1-23].html' | cascadia -i -o -c div.panel-body -p 'Book=div.tofu-txt' > /tmp/book.txt

The first page is here, and all 23 pages are collected here. I collected them as plain text because the HTML was just a wrapper around plain text, so there is no need for HTML; plain text is good enough.

Collecting as HTML is no trouble either. Here is another example:

curl --silent 'http://www.shangxueedu.com/shuxue/ksdg/20170113_162_[1-6].html' | cascadia -i -o -c div.m-post -p 'Book=RAW:div.post-con' --wrap-html | tee /tmp/book.html

The fifth page is here, and all pages are collected here. Please check them out.

More On CSS Selector

I'm not an expert on CSS selectors at all, but the following resources are the ones I found most helpful.

  • CSS Selectors Cheat Sheet. I think it's very good, because it's usage oriented and very practical, i.e., it arranges the selectors according to their purposes. If that's too dry for you, check out
  • The 30 CSS Selectors You Must Memorize. It only lists the selectors that are important, but it gives concrete examples and explanations.
  • CSS Selector Reference from w3schools. This is the one I refer to most often, because the list is comprehensive, and there is also an online CSS Selector Tester that really helped me learn and understand.