---
date: 2024-11-08
title: converting HTML entities to 'normal' UTF-8 in bash
tags:
- bash
- html
---
I do a fair amount of web scraping, and I come across a lot of HTML entities. They're the things that look like `&gt;`, `&nbsp;`, `&amp;`.

See <https://developer.mozilla.org/en-US/docs/Glossary/Character_reference>

Naturally, you usually want to turn them back into the characters they stand for, typically as UTF-8. I came across a particularly gnarly site that had some normal HTML entities, some rare (Unicode) ones, and also some special characters that weren't encoded at all. From it, I made a file to test HTML entity decoders on. Here it is, as `file.txt`:

```text
Children&#39;s event,
Wildlife &amp; Nature,
peddler-market-n&#186;-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante
```

I wanted to find a way to convert the entities with a single command I could put in a bash pipe, i.e., decode the entities that stand for `'`, `&` and `º`, but leave alone the characters that are already literal: `’`, `–`, the non-breaking space (nbsp) and the bare `&`. I tried several contenders:

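Comparing contenders by eye gets tedious, so here is a minimal sketch of a checker. `check_decoder` is a hypothetical helper name, and the two sample lines are a cut-down stand-in for `file.txt`, not the full test file. It pipes the sample through whatever command it is given and reports whether the real entity was decoded while the bare `&` survived.

```bash
# check_decoder is a hypothetical helper, not part of any tool mentioned here.
# Usage: check_decoder COMMAND [ARGS...]
# It pipes two sample lines through COMMAND and compares the result with the
# output a correct decoder should give.
check_decoder() {
  local input expected actual
  # One real entity (&amp;) and one bare ampersand that must survive as-is.
  input=$(printf '%s\n' 'Wildlife &amp; Nature,' 'Justin Adams & Mauro Durante')
  expected=$(printf '%s\n' 'Wildlife & Nature,' 'Justin Adams & Mauro Durante')
  actual=$(printf '%s\n' "$input" | "$@")
  if [ "$actual" = "$expected" ]; then
    echo "PASS: $*"
  else
    echo "FAIL: $*"
    return 1
  fi
}

# A naive sed substitution is enough for these two lines (not for real pages).
check_decoder sed 's/&amp;/\&/g'
```

Any of the commands tried below can be passed in the same way, as long as the command reads stdin and writes stdout.
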
## perl

I'd used this one before in a script scraping an RSS feed: <https://github.com/alifeee/openbenches-train-sign/blob/a29cc24df919c67809f84586f9e0a90aed6ea3cf/transformer/full.cgi#L49>, but on this input it fails: the symbol `º` (`U+00BA : MASCULINE ORDINAL INDICATOR`, identified with <https://babelstone.co.uk/Unicode/whatisit.html>) comes out as `�` instead of being decoded.

```bash
$ cat file.txt | perl -MHTML::Entities -pe 'decode_entities($_);'
Children's event,
Wildlife & Nature,
peddler-market-n�-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante
```
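
A possible workaround, sketched under the assumption that the stray `�` is an output-encoding problem (perl writing the decoded character as a single Latin-1 byte) rather than a decoding failure: perl's `-CS` switch treats STDIN, STDOUT and STDERR as UTF-8. The function name `decode_html_perl` is made up for illustration, and the `HTML::Entities` module still needs to be installed.

```bash
# decode_html_perl is a hypothetical name.
# -CS tells perl that STDIN/STDOUT/STDERR carry UTF-8, so characters produced
# by decode_entities are written out as UTF-8 rather than as raw single bytes.
# Usage: cat file.txt | decode_html_perl
decode_html_perl() {
  perl -CS -MHTML::Entities -pe 'decode_entities($_);'
}
```

With this variant, literal characters such as `’` and `–` are read as UTF-8 on the way in and written back unchanged.
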

## recode

I found `recode` after some searching, but it failed because it tried to decode characters that were already decoded: the literal `’` and `–` came out mangled, and the bare `&` was dropped.

```bash
$ sudo apt install recode
$ cat file.txt | recode html
Children's event,
Wildlife & Nature,
peddler-market-nº-88,
Artistsâ Circle,
surface â Breaking
woodland walk. (nbsp)
Justin Adams  Mauro Durante
```

## php

At first I used `htmlspecialchars_decode`, which didn't work because it only decodes a handful of special characters (such as `&amp;`, `&lt;` and `&gt;`). Then I found `html_entity_decode`, which does the job perfectly. Thanks PHP.

```bash
$ cat file.txt | php -r 'while ($f = fgets(STDIN)){ echo html_entity_decode($f); }'
Children's event,
Wildlife & Nature,
peddler-market-nº-88,
Artists’ Circle,
surface – Breaking
woodland walk. (nbsp)
Justin Adams & Mauro Durante
```

The only thing I don't know how to do now is to make a bash function or alias so that I could write

```bash
cat file.txt | decodeHTML
```

instead of the massive `php -r 'while ($f = fgets(STDIN)){ echo html_entity_decode($f); }'`.
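
A minimal sketch of such a function, assuming `php` is on the `PATH` and that the definition is placed in `~/.bashrc` (or any file the shell sources). A function avoids the nested quoting an alias would need. The explicit `ENT_QUOTES | ENT_HTML5` flags and the `"UTF-8"` argument are additions not in the one-liner above: before PHP 8.1 the default flags did not include `ENT_QUOTES`, so `&#39;` would have been left alone. The `!== false` check is also an addition, so that a line consisting only of `0` does not end the loop early.

```bash
# decodeHTML: read text on stdin, write it to stdout with HTML entities decoded.
# Flags and charset are spelled out rather than relying on PHP's defaults,
# which differ between PHP versions.
# Usage: cat file.txt | decodeHTML
decodeHTML() {
  php -r 'while (($line = fgets(STDIN)) !== false) { echo html_entity_decode($line, ENT_QUOTES | ENT_HTML5, "UTF-8"); }'
}
```

After reloading the shell (`source ~/.bashrc`), it can sit anywhere in a pipe, e.g. `curl -s "$url" | decodeHTML`, where `$url` is a placeholder.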