Home
All three options, -i, -o and -c, are required. By default, cascadia reads from stdin and writes to stdout:
$ echo '<input type="radio" name="Sex" value="F" />' | tee /tmp/cascadia.xml | cascadia -i -o -c 'input[name=Sex][value=F]'
1 elements for 'input[name=Sex][value=F]':
<input type="radio" name="Sex" value="F"/>
Either the input or the output can be followed by a file name:
$ cascadia -i /tmp/cascadia.xml -o -c 'input[name=Sex][value=F]'
1 elements for 'input[name=Sex][value=F]':
<input type="radio" name="Sex" value="F"/>
$ cascadia -i /tmp/cascadia.xml -c 'input[name=Sex][value=F]' -o /tmp/out.html
1 elements for 'input[name=Sex][value=F]':
$ cat /tmp/out.html
<input type="radio" name="Sex" value="F"/>
Other options can be applied too:
# using --wrap-html
$ cascadia -i /tmp/cascadia.xml -c 'input[name=Sex][value=F]' -o /tmp/out.html -w
1 elements for 'input[name=Sex][value=F]':
$ cat /tmp/out.html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<base href="">
</head>
<body>
<input type="radio" name="Sex" value="F"/>
</body>
# using --wrap-html with --style
$ cascadia -i /tmp/cascadia.xml -c 'input[name=Sex][value=F]' -o /tmp/out.html -w -y '<link rel="stylesheet" href="styles.css">'
1 elements for 'input[name=Sex][value=F]':
$ cat /tmp/out.html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<base href="">
<link rel="stylesheet" href="styles.css">
</head>
<body>
<input type="radio" name="Sex" value="F"/>
</body>
For more on using the --style option, check out "adding styles".
There may also be cases where we don't want to work out the CSS selector at all: a one-off extraction, a site whose section of interest keeps changing, or tags that change on every request.
For whatever reason, manual selection with the visual aid of the browser's developer tools is then the fastest way to grab what we want. However, more often than not, the links will be wrong and the images will be missing if we store the extracted HTML somewhere else as-is (extract from https://site-a/ and put it in https://site-b/), because all the href and image links should still point to https://site-a/, not https://site-b/.
cascadia can help in this situation too, and you still don't need to figure out the exact CSS selectors, thanks to the :root CSS selector. But it needs a twist:
$ echo '<div class="container"><p align="justify"><img src="pic_trulli.jpg" alt="Trulli" width="500" height="333"></p></div>' | tee /tmp/cascadia.xml
<div class="container"><p align="justify"><img src="pic_trulli.jpg" alt="Trulli" width="500" height="333"></p></div>
$ cat /tmp/cascadia.xml | cascadia -q -i -o -c 'div.container' | tee /tmp/w3schools-img.html
<div class="container"><p align="justify"><img src="pic_trulli.jpg" alt="Trulli" width="500" height="333"/></p></div>
# Using a proper css selector, but the /tmp/w3schools-img.html file would *not* be able to show the image properly
$ cat /tmp/cascadia.xml | cascadia -q -i -o -c ':root'
<html><head></head><body><div class="container"><p align="justify"><img src="pic_trulli.jpg" alt="Trulli" width="500" height="333"/></p></div>
</body></html>
I.e., the :root CSS selector does not quite give us what we want cleanly. It gives us the "extra" html and head tags as well, because "In HTML, the root element is always the html element". Not a problem, we can easily overcome the hiccup like this:
baseHref=https://www.w3schools.com/html/
$ cat /tmp/cascadia.xml | cascadia -q -i -o -c ':root' | cascadia -q -i -o -c 'body > div' --wrap-html --base $baseHref | tee /tmp/w3schools-img.html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<base href="https://www.w3schools.com/html/">
</head>
<body>
<div class="container"><p align="justify"><img src="pic_trulli.jpg" alt="Trulli" width="500" height="333"/></p></div>
</body>
The /tmp/w3schools-img.html file can now show the image properly.
(Of course, if the extracted root tag is not div, you need to change the above command accordingly.)
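To see why the base tag fixes the image, here is a minimal Python sketch of how a relative src is resolved against a base URL. It uses urllib.parse.urljoin from the standard library as a stand-in for the browser's resolution rules. The address https://site-b/page.html is a made-up location for the saved copy, not something taken from the example above.

```python
from urllib.parse import urljoin

# The relative link exactly as it appears in the extracted HTML.
src = "pic_trulli.jpg"

# Hypothetical place where the extracted file ends up (an assumption).
saved_copy = "https://site-b/page.html"
# The base passed to --base in the command above.
base_href = "https://www.w3schools.com/html/"

# Without a <base> tag, a relative link resolves against the page's own URL.
without_base = urljoin(saved_copy, src)
# With <base href=...>, it resolves against that base instead.
with_base = urljoin(base_href, src)

print(without_base)  # https://site-b/pic_trulli.jpg
print(with_base)     # https://www.w3schools.com/html/pic_trulli.jpg
```

Only the second URL points back to the site the picture actually lives on, which is what --base achieves for every relative link in the extracted fragment.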
Of course, any number of selections are allowed (this comes out of the box with the CSS selector "," syntax):
$ echo '<table border="0" cellpadding="0" cellspacing="0" style="table-layout: fixed; width: 100%; border: 0 dashed; border-color: #FFFFFF"><tr style="height:64px">aaa</tr></table>' | cascadia -i -o -c 'table[border="0"][cellpadding="0"][cellspacing="0"], tr[style=height\:64px]'
2 elements for 'table[border="0"][cellpadding="0"][cellspacing="0"], tr[style=height\:64px]':
<table border="0" cellpadding="0" cellspacing="0" style="table-layout: fixed; width: 100%; border: 0 dashed; border-color: #FFFFFF"><tbody><tr style="height:64px"></tr></tbody></table>
<tr style="height:64px"></tr>
Or, to make the multi-selection explicit on the command line, emphasizing that different parts are selected with different selectors, one can provide multiple --css options. E.g.,
cascadia -o -i http://www.iciba.com/conformity -c 'div.js-base-info > div > div > div.in-base-top.clearfix' -c 'div.js-base-info > div > div > ul' -c 'div.js-base-info > div > div > li' -c 'div.info-article.article-tab'
It builds the result from all four -c CSS selectors.
It has the same effect as using the "," syntax, but
- The CSS selectors are provided explicitly with multiple --css parameters.
- The "," syntax returns results in the order the selections occur in the source, while
- multiple --css parameters return results in the order the --css parameters are given.
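The ordering difference can be pictured with a toy Python model. This is only an illustration of the two orderings described above, not cascadia's implementation: it matches on bare tag names instead of real CSS selectors, and the names TagCollector, document_order and selector_order are made up for the sketch.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record start tags in the order they appear in the source."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def source_tags(html):
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags

def document_order(html, selectors):
    """Models a single 'a, b' selector: matches come out in source order."""
    return [tag for tag in source_tags(html) if tag in selectors]

def selector_order(html, selectors):
    """Models repeated --css: all matches of the first selector, then the next."""
    tags = source_tags(html)
    return [tag for sel in selectors for tag in tags if tag == sel]

html = "<div><h1>Title</h1><ul><li>item</li></ul><p>text</p></div>"
print(document_order(html, ["p", "h1"]))  # ['h1', 'p']
print(selector_order(html, ["p", "h1"]))  # ['p', 'h1']
```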
First, the non-block selection mode outputs the selection as HTML source, so if you want the text content instead, you can use the block selection mode.
$ echo '<div class="container"><p align="justify"><b>Name: </b>John Doe</p></div>' | tee /tmp/cascadia.xml | cascadia -i -o -c 'div > p'
1 elements for 'div > p':
<p align="justify"><b>Name: </b>John Doe</p>
$ cat /tmp/cascadia.xml | cascadia -i -o -c 'div' --piece SelText='p'
SelText
Name: John Doe
Note that the block selection mode can output in HTML as well -- it just outputs (HTML) text by default:
$ cat /tmp/cascadia.xml | cascadia -i -o -c 'div' --piece SelText='RAW:p'
SelText
<p align="justify"><b>Name: </b>John Doe</p>
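The difference between the two output styles is roughly the difference between an element's markup and its text content. As a rough illustration (not how cascadia does it internally), the text of the same fragment can be obtained in Python with the standard library's html.parser; TextOnly and text_of are names invented for this sketch.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Keep only character data and drop all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def text_of(fragment):
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)

raw = '<p align="justify"><b>Name: </b>John Doe</p>'
print(raw)           # what the RAW: style keeps
print(text_of(raw))  # Name: John Doe
```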
The real power of block selection mode lies in its ability to produce tsv/csv tables without any Go programming:
$ curl --silent https://news.ycombinator.com | cascadia -i -o -c 'tr.athing' -p No=span.rank -p Title='td.title > a' -p Site=span.sitestr
No Title Site
1. Onedrive is slow on Linux but fast with a “Windows” user-agent (2016) microsoft.com
2. Starting today, users of Firefox can also enjoy Netflix on Linux netflix.com
3. Research Debt distill.pub
...
27. USPS Informed Delivery – Digital Images of Front of Mailpieces usps.com
28. Performance bugs – the dark matter of programming bugs forwardscattering.org
29. Most items of clothing have complicated international journeys bbc.co.uk
30. High-performance employees need quieter work spaces qz.com
It's a poor man's scraper tool if text is the only thing needed. For scraping beyond text, go one step further and use andrew-d/goscrape (or my goscrape instead, which has some enhancements to it).
Again, if text is the only thing needed, then cascadia might already be enough. Here is how to scrape Hacker News across several pages:
$ curl --silent 'https://news.ycombinator.com/news?p=[1-3]' | cascadia -i -o -c 'tr.athing' -p No=span.rank -p Title='td.title > a' -p Site=span.sitestr
No Title Site
1. Starting today, users of Firefox can also enjoy Netflix on Linux netflix.com
2. Onedrive is slow on Linux but fast with a “Windows” user-agent (2016) microsoft.com
3. Research Debt distill.pub
...
27. Yes I Still Want to Be Doing This at 56 (2012) thecodist.com
28. Performance bugs – the dark matter of programming bugs forwardscattering.org
29. USPS Informed Delivery – Digital Images of Front of Mailpieces usps.com
30. High-performance employees need quieter work spaces qz.com
31. Most items of clothing have complicated international journeys bbc.co.uk
32. Telstra’s Gigabit Class LTE Network cellularinsights.com
...
58. The New Laptop Ban Adds to Travelers' Lack of Privacy and Security eff.org
59. QEMU: user-to-root privesc inside VM via bad translation caching chromium.org
60. Startups that debuted at Y Combinator W17 Demo Day 2 techcrunch.com
61. The Cracking Monolith: Forces That Call for Microservices semaphoreci.com
62. Amsterdam Airport Launches API Platform schiphol.nl
...
88. Founder Stories: Leah Culver of Breaker (YC W17) ycombinator.com
89. Find out what you, or someone on your team, did on the last working day github.com
90. PSD2 – a directive that will change banking in Europe evry.com
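The [1-3] in the URL above is curl's URL globbing: curl expands the numeric range and fetches each resulting URL in turn, so cascadia receives the three pages as one input stream. Here is a small Python sketch of that expansion, for illustration only. expand_range is a hypothetical helper, and it covers just a single numeric range, not curl's full globbing syntax.

```python
import re

def expand_range(url):
    """Expand one numeric [start-end] range in url into a list of URLs."""
    match = re.search(r"\[(\d+)-(\d+)\]", url)
    if match is None:
        return [url]  # nothing to expand
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = url[:match.start()], url[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1)]

for page in expand_range("https://news.ycombinator.com/news?p=[1-3]"):
    print(page)
```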
By default it uses tab (\t) as the field delimiter, so the output is in .tsv format. To change to .csv, add -d , to the command line.
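One thing to keep in mind is that titles can themselves contain commas (for example "Starting today, users of Firefox ..." above), and this page does not say whether such fields get quoted when -d , is used. If that matters for your data, one option is to keep the default tab output and convert it with a CSV library that quotes fields as needed. Here is a minimal sketch using Python's csv module; tsv_to_csv is a name made up for this example.

```python
import csv
import io

def tsv_to_csv(tsv_text):
    """Convert tab-separated text to CSV, quoting fields that contain commas."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()

sample = (
    "No\tTitle\tSite\n"
    "2.\tStarting today, users of Firefox can also enjoy Netflix on Linux\tnetflix.com\n"
)
print(tsv_to_csv(sample), end="")
# No,Title,Site
# 2.,"Starting today, users of Firefox can also enjoy Netflix on Linux",netflix.com
```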
Thanks to a PR by @himcc, cascadia can now select from element attributes as well, which is impossible with CSS selector syntax alone. Attribute selection
- needs block selection mode, and
- outputs in table form.
The usage syntax is explained in cascadia's own help:
$ cascadia
. . .
-p, --piece sub CSS selectors within -css to split that block up into pieces
format: PieceName=[OutputStyle:]selector_string
OutputStyle:
RAW : will return the selected as-is
attr[xx] : will return the value of xx attribute
else the text will be returned
. . .
Here are some examples:
$ cat /tmp/ul.html
<ul>
<li><a id="a1" href="http://www.google.com/finance"/>
<li><a id="a2" href="http://finance.yahoo.com/"/>
<li><a id="a3" href="https://www.google.com/news"></a>
<li><a id="a4" href="http://news.yahoo.com"/>
</ul>
$ cat /tmp/ul.html | cascadia -i -o -c "li" -p 'url=attr[href]:a'
url
http://www.google.com/finance
http://finance.yahoo.com/
https://www.google.com/news
http://news.yahoo.com
$ cat /tmp/ul.html | cascadia -i -o -c "li" -p 'LinkID=attr[id]:a'
LinkID
a1
a2
a3
a4
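To make the attr[xx] output style concrete, here is a rough Python equivalent that uses only the standard library's html.parser. It illustrates the idea and is not how cascadia implements it. AttrCollector is a name made up for this sketch, and it matches on tag name only rather than on a real CSS selector.

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Collect the value of one attribute from every occurrence of one tag."""
    def __init__(self, tag, attr):
        super().__init__()
        self.tag = tag
        self.attr = attr
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <a .../> also end up here by default.
        if tag == self.tag:
            value = dict(attrs).get(self.attr)
            if value is not None:
                self.values.append(value)

def collect(html, tag, attr):
    parser = AttrCollector(tag, attr)
    parser.feed(html)
    parser.close()
    return parser.values

ul_html = """<ul>
<li><a id="a1" href="http://www.google.com/finance"/>
<li><a id="a2" href="http://finance.yahoo.com/"/>
<li><a id="a3" href="https://www.google.com/news"></a>
<li><a id="a4" href="http://news.yahoo.com"/>
</ul>"""

print(collect(ul_html, "a", "href"))
print(collect(ul_html, "a", "id"))  # ['a1', 'a2', 'a3', 'a4']
```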
Block selection mode is a poor man's web scraping tool, and it is very simple to use. Here is another practical example: Twitter searching. We all know that you have to pay for the Twitter Search API, and that it only serves Tweets from the past week. With cascadia, you can search the tweets for free, and get the latest content as well.
Here is how I watch for GasBuddyDan's Toronto/GTA Gas Price Alerts, without getting all his other tweets:
$ cascadia -i 'https://twitter.com/search?q=%22Gas%20Price%20Alert%22%20%23GTA%20from%3AGasBuddyDan&src=typd' -o -c 'div.stream div.original-tweet div.content' --piece Time='small.time' --piece Tweet='div.js-tweet-text-container > p'
Time Tweet
Jul 31
Gas Price Alert #Toronto #GTA #Hamilton #Ottawa #LdnOnt #Barrie #Kitchener #Niagara #Windsor N/C Tues and to a 2ct/l HIKE gor Wednesday
Jul 6
Gas Price Alert #Toronto #GTA #LdnOnt #Hamilton #Ottawa #Barrie #KW to see a 1 ct/l drop @ for Friday July 7
May 30
Gas Price Alert #Toronto #GTA #Ottawa #LdnOnt #Hamilton #KW #Barrie #Windsor prices won't change Wednesday but will DROP 1 ct/l Thursday
May 15
Gas Price Alert #Toronto #GTA #Barrie #Hamilton #LdnOnt #Ottawa #KW #Windsor NO CHANGE @ except gas bar shenanigans for Tues & Wednesday
Mar 7
Gas Price Alert #Toronto #GTHA #LdnOnt #Ottawa #Barrie #KW #Windsor to see a 1 cent a litre HIKE Wed March 8 (to 107.9 in the #GTA)
Many web sites annoyingly split one document into several small pieces so that they can show it across different web pages, with different ads. I would rather view it on one page with no ads. That is what I had been hoping for all along, but I had no easy way of doing it until now, with cascadia.
With cascadia, no programming is necessary. All we need to do is pass some command line parameters, and the magic happens. There are many sites that break things into small pieces; the following two are ones I handled the other day.
The first one is spread over 23 pages! Twenty-three! I would have just given up without cascadia, but with it, it is so simple:
curl --silent 'http://www.chinadmd.com/file/prrxtuivvxsxxwwaexuuwovp_[1-23].html' | cascadia -i -o -c div.panel-body -p 'Book=div.tofu-txt' > /tmp/book.txt
The first page is here, and all 23 pages are collected here. I collected them as plain text because the HTML was just wrapping plain text, so there is no need for HTML; plain text is good enough.
Collecting as HTML is no trouble either. Here is another example:
curl --silent 'http://www.shangxueedu.com/shuxue/ksdg/20170113_162_[1-6].html' | cascadia -i -o -c div.m-post -p 'Book=RAW:div.post-con' --wrap-html | tee /tmp/book.html
The fifth page is here, and all pages are collected here. Please check them out.
I'm not an expert on CSS selectors at all, but the following resources are the ones I found most helpful.
- CSS Selectors Cheat Sheet. I think it's very good, because it's usage oriented and very practical, i.e., it arranges the selectors according to their purposes. If that's too dry for you, check out
- The 30 CSS Selectors You Must Memorize. It only lists the selectors that are important, but it gives concrete examples and explanations.
- CSS Selector Reference from w3schools. This is the one I refer to most often, because the list is comprehensive, and there is also an online CSS Selector Tester that really helped me learn and understand.