An IDNA2008, UTS46, WHATWG URL Standard IDNA, and Punycode implementation in pure Ruby.
This gem provides a number of functions for converting internationalized domain names (IDNs) between the Unicode and ASCII Compatible Encoding (ACE) forms.
Add to your Gemfile:
gem "uri-idna"
And then run bundle install.
There are plenty of ways to convert IDNs between Unicode and ACE forms.
RFC 5891 defines two protocols for IDN conversion: Registration and Domain Name Lookup.
URI::IDNA.register(alabel:, ulabel:, **options)
- check_hyphens: true – whether to check hyphens according to Section 5.4.
- leading_combining: true – whether to check leading combining marks according to Section 5.4.
- check_joiners: true – whether to check CONTEXTJ code points according to Section 5.4.
- check_others: true – whether to check CONTEXTO code points according to Section 5.4.
- check_bidi: true – whether to check bidirectional characters according to Section 5.4.
require "uri/idna"
URI::IDNA.register(alabel: "xn--gdkl8fhk5egc.jp", ulabel: "ハロー・ワールド.jp")
#=> "xn--gdkl8fhk5egc.jp"
URI::IDNA.register(ulabel: "ハロー・ワールド.jp")
#=> "xn--gdkl8fhk5egc.jp"
URI::IDNA.register(alabel: "xn--gdkl8fhk5egc.jp")
#=> "xn--gdkl8fhk5egc.jp"
URI::IDNA.register(ulabel: "☕.us")
#<URI::IDNA::InvalidCodepointError: Codepoint U+2615 at position 1 of "☕" not allowed>
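To make the check_hyphens and leading_combining options more concrete, here is a minimal standard-library sketch of the two RFC 5891 rules as I read them: a label must not start or end with a hyphen, must not have "--" in its third and fourth positions, and must not begin with a combining mark. The helper names hyphens_ok? and leading_combining_ok? are hypothetical and are not part of the gem's API; the gem's own checks may differ in detail.

```ruby
# Hypothetical helpers for illustration only; not the gem's implementation.

# Hyphen restrictions: no leading or trailing "-", and no "--" in the
# third and fourth character positions of the label.
def hyphens_ok?(label)
  return false if label.start_with?("-") || label.end_with?("-")

  label[2, 2] != "--"
end

# Leading combining marks: the label must not begin with a character
# whose Unicode General Category is Mark (\p{M}).
def leading_combining_ok?(label)
  !label.match?(/\A\p{M}/)
end

puts hyphens_ok?("example")           # passes the hyphen rule
puts hyphens_ok?("ab--cd")            # violates the third/fourth position rule
puts leading_combining_ok?("\u0301a") # starts with COMBINING ACUTE ACCENT
```

These checks apply to a single label, so a full domain name would need to be split on "." first.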
URI::IDNA.lookup(domain_name, **options)
- check_hyphens: true – whether to check hyphens according to Section 4.2.3.1.
- leading_combining: true – whether to check leading combining marks according to Section 4.2.3.2.
- check_joiners: true – whether to check CONTEXTJ code points according to Section 4.2.3.3.
- check_others: true – whether to check CONTEXTO code points according to Section 4.2.3.3.
- check_bidi: true – whether to check bidirectional characters according to Section 4.2.3.4.
- verify_dns_length: true – whether to check DNS length according to Section 4.4.
require "uri/idna"
URI::IDNA.lookup("ハロー・ワールド.jp")
#=> "xn--pck0a1b0a6a2e.jp"
URI::IDNA.lookup("xn--pck0a1b0a6a2e.jp")
#=> "xn--pck0a1b0a6a2e.jp"
URI::IDNA.lookup("Ῠ.me")
#<URI::IDNA::InvalidCodepointError: Codepoint U+1FE8 at position 1 of "Ῠ" not allowed>
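The verify_dns_length option refers to the usual DNS size limits on the ASCII form of a name. The sketch below assumes the common reading of those limits (each label is 1 to 63 octets, and the whole name, ignoring a trailing root dot, is 1 to 253 octets). The helper name dns_length_ok? is hypothetical; it is not the gem's API and does not reproduce the gem's error reporting.

```ruby
# Hypothetical helper for illustration only; not the gem's implementation.
# Expects the ASCII (A-label) form of a domain name.
def dns_length_ok?(ascii_domain)
  # A single trailing dot denotes the root label and is not counted.
  name = ascii_domain.delete_suffix(".")
  return false unless (1..253).cover?(name.bytesize)

  # Keep empty labels (limit -1) so that "a..com" is rejected.
  name.split(".", -1).all? { |label| (1..63).cover?(label.bytesize) }
end

puts dns_length_ok?("example.com")
puts dns_length_ok?("#{"a" * 64}.com")
```

The length is measured after conversion to ASCII, because a short Unicode label can expand to a much longer A-label.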
Current UTS46 revision: 31
UTS46 defines two IDN conversion functions: ToASCII and ToUnicode.
URI::IDNA.to_ascii(domain_name, **options)
- use_std3_ascii_rules: true – whether to apply STD3 rules for both mapping and validation.
- check_hyphens: true – whether to check hyphens according to Section 4.2.3.1 of RFC 5891.
- check_bidi: true – whether to check bidirectional characters according to Section 4.2.3.4 of RFC 5891.
- check_joiners: true – whether to check CONTEXTJ code points according to Section 4.2.3.3 of RFC 5891.
- transitional_processing: false – (deprecated) whether to apply transitional processing for mapping.
- ignore_invalid_punycode: false – whether to fast-path invalid Punycode labels according to the 4th step of Processing.
- verify_dns_length: true – whether to check DNS length according to Section 4.4 of RFC 5891.
require "uri/idna"
URI::IDNA.to_ascii("Bloß.de")
#=> "xn--blo-7ka.de"
# UTS46 transitional processing is disabled by default,
# but can be enabled via option:
URI::IDNA.to_ascii("Bloß.de", transitional_processing: true)
#=> "bloss.de"
# Note that UTS46 processing is not fully IDNA2008 compliant:
URI::IDNA.to_ascii("☕.us")
#=> "xn--53h.us"
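The "Bloß.de" example above differs between the two modes because "ß" is one of the four UTS46 deviation characters, which transitional processing maps and nontransitional processing keeps. As a rough standard-library sketch of just that step, assuming the deviation set is ß, ς, ZWNJ and ZWJ (the helper name map_deviations is hypothetical and this is not how the gem is implemented):

```ruby
# Hypothetical helper for illustration only; not the gem's implementation.
# Applies the transitional mappings for the UTS46 deviation characters:
#   U+00DF (ß) -> "ss", U+03C2 (ς) -> "σ", U+200C/U+200D -> removed.
def map_deviations(label)
  deviations = { "ß" => "ss", "ς" => "σ", "\u200C" => "", "\u200D" => "" }
  label.gsub(/[ßς\u200C\u200D]/, deviations)
end

puts map_deviations("bloß") # prints "bloss"
```

With transitional processing disabled, these characters are left untouched and end up encoded in the A-label instead.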
URI::IDNA.to_unicode(domain_name, **options)
- use_std3_ascii_rules: true – whether to apply STD3 rules for both mapping and validation.
- check_hyphens: true – whether to check hyphens according to Section 4.2.3.1 of RFC 5891.
- check_bidi: true – whether to check bidirectional characters according to Section 4.2.3.4 of RFC 5891.
- check_joiners: true – whether to check CONTEXTJ code points according to Section 4.2.3.3 of RFC 5891.
- transitional_processing: false – (deprecated) whether to apply transitional processing for mapping.
- ignore_invalid_punycode: false – whether to fast-path invalid Punycode labels according to the 4th step of Processing.
require "uri/idna"
URI::IDNA.to_unicode("xn--blo-7ka.de")
#=> "bloß.de"
It's possible to apply UTS46 mapping first and then run IDNA2008 processing, so that the result fully conforms to IDNA2008:
require "uri/idna"
# For example, we can use UTS46 mapping to fold some compatibility characters
char = "⼤"
char.ord # "\u2F24"
#=> 12068
# plain downcase doesn't change this character
char.downcase.ord
#=> 12068
# but UTS46 mapping does its thing:
URI::IDNA::UTS46::Mapping.call(char).ord
#=> 22823
# so here is a full example:
domain = "⼤.cn" # "\u2F24.cn"
URI::IDNA.lookup(domain)
#<URI::IDNA::InvalidCodepointError: Codepoint U+2F24 at position 1 of "⼤" not allowed>
mapped_domain = URI::IDNA::UTS46::Mapping.call(domain)
URI::IDNA.lookup(mapped_domain)
#=> "xn--pss.cn"
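For intuition about why the mapping above turns U+2F24 into U+5927: the UTS46 mapping table is derived largely from Unicode compatibility normalization plus case folding, and for this character Ruby's built-in NFKC normalization gives the same result. The sketch below uses only the standard library. The helper name nfkc_lower is hypothetical, and it is only a rough approximation of UTS46 mapping, not a replacement for URI::IDNA::UTS46::Mapping.

```ruby
# Hypothetical helper for illustration only. NFKC plus downcase approximates
# UTS46 mapping for many characters, but the real table has exceptions.
def nfkc_lower(str)
  str.unicode_normalize(:nfkc).downcase
end

kangxi = "\u2F24"              # KANGXI RADICAL BIG
puts kangxi.downcase.ord       # 12068, downcase leaves it unchanged
puts nfkc_lower(kangxi).ord    # 22823, i.e. U+5927
puts nfkc_lower("\u2F24.CN")
```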
WHATWG's URL Standard uses the UTS46 algorithm to define its ToASCII and ToUnicode functions. It abstracts away all the available flags and exposes only one, the be_strict flag.
Note that the check_hyphens UTS46 option is set to false in this algorithm.
URI::IDNA.whatwg_to_ascii(domain_name, **options)
- be_strict: true – defines the values of the use_std3_ascii_rules and verify_dns_length UTS46 options.
require "uri/idna"
URI::IDNA.whatwg_to_ascii("Bloß.de")
#=> "xn--blo-7ka.de"
# The be_strict flag sets use_std3_ascii_rules and verify_dns_length UTS46 flags to its value
URI::IDNA.whatwg_to_ascii("2003_rules.com", be_strict: false)
#=> "2003_rules.com"
# By default be_strict is set to true
URI::IDNA.whatwg_to_ascii("2003_rules.com")
#<URI::IDNA::InvalidCodepointError: Codepoint U+005F at position 5 of "2003_rules" not allowed>
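The relationship between be_strict and the underlying UTS46 options can be written down as a small table. The sketch below is one way to express it, assuming the flag values from my reading of the URL Standard's "domain to ASCII" algorithm: use_std3_ascii_rules and verify_dns_length follow be_strict, check_hyphens is false, and the remaining flags are fixed. The helper name whatwg_to_ascii_flags is hypothetical and is not part of the gem's API.

```ruby
# Hypothetical helper for illustration only; not the gem's implementation.
# Returns the UTS46 options implied by the WHATWG be_strict flag.
def whatwg_to_ascii_flags(be_strict: true)
  {
    use_std3_ascii_rules: be_strict,  # follows be_strict
    check_hyphens: false,             # always off in the WHATWG algorithm
    check_bidi: true,                 # assumed fixed per the URL Standard
    check_joiners: true,              # assumed fixed per the URL Standard
    transitional_processing: false,   # assumed fixed per the URL Standard
    verify_dns_length: be_strict      # follows be_strict
  }
end

p whatwg_to_ascii_flags(be_strict: false)
```

This is why "2003_rules.com" is accepted with be_strict: false: the underscore is only rejected when the STD3 ASCII rules are applied.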
URI::IDNA.whatwg_to_unicode(domain_name, **options)
- be_strict: true – defines the value of the use_std3_ascii_rules UTS46 option.
require "uri/idna"
URI::IDNA.whatwg_to_unicode("xn--blo-7ka.de")
#=> "bloß.de"
The Punycode module performs conversion between Unicode and Punycode. Note that Punycode on its own is not IDNA2008 compliant: it is used only for conversion, and no validation is performed.
require "uri/idna/punycode"
URI::IDNA::Punycode.encode("ハロー・ワールド")
#=> "gdkl8fhk5egc"
URI::IDNA::Punycode.decode("gdkl8fhk5egc")
#=> "ハロー・ワールド"
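For readers curious about what the encoding step involves, here is a compact sketch of the RFC 3492 encoder using only the standard library. It follows the algorithm and parameters from the RFC (base 36, tmin 1, tmax 26, skew 38, damp 700, initial bias 72, initial n 128). The method names punycode_digit, punycode_adapt and punycode_encode are hypothetical; this is not the gem's implementation and it omits overflow checks.

```ruby
# Illustrative RFC 3492 encoder; hypothetical names, not the gem's code.

# Map a digit value to its character: 0..25 -> "a".."z", 26..35 -> "0".."9".
def punycode_digit(d)
  (d < 26 ? 97 + d : 22 + d).chr
end

# Bias adaptation function from RFC 3492, section 6.1.
def punycode_adapt(delta, num_points, first_time)
  base, tmin, tmax, skew, damp = 36, 1, 26, 38, 700
  delta = first_time ? delta / damp : delta / 2
  delta += delta / num_points
  k = 0
  while delta > ((base - tmin) * tmax) / 2
    delta /= base - tmin
    k += base
  end
  k + ((base - tmin + 1) * delta) / (delta + skew)
end

def punycode_encode(input)
  base, tmin, tmax = 36, 1, 26
  code_points = input.codepoints
  basic = code_points.select { |cp| cp < 0x80 }
  output = basic.pack("U*")          # basic code points are copied verbatim
  basic_count = handled = basic.length
  output << "-" if basic_count > 0   # delimiter between basic and encoded parts
  n, delta, bias = 128, 0, 72

  while handled < code_points.length
    m = code_points.select { |cp| cp >= n }.min  # next code point to insert
    delta += (m - n) * (handled + 1)
    n = m
    code_points.each do |cp|
      delta += 1 if cp < n
      next unless cp == n

      q = delta
      k = base
      loop do  # emit delta as a variable-length integer
        t = (k - bias).clamp(tmin, tmax)
        break if q < t

        output << punycode_digit(t + (q - t) % (base - t))
        q = (q - t) / (base - t)
        k += base
      end
      output << punycode_digit(q)
      bias = punycode_adapt(delta, handled + 1, handled == basic_count)
      delta = 0
      handled += 1
    end
    delta += 1
    n += 1
  end
  output
end

puts punycode_encode("bücher") # prints "bcher-kva"
```

As with the gem's Punycode module, the result has no "xn--" prefix and no IDNA validation has been applied.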
- RFC 5890 – Definitions and Document Framework
- RFC 5891 – Protocol
- RFC 5892 – The Unicode Code Points
- RFC 5893 – Bidi rule
- RFC 3492 – Punycode: A Bootstring encoding of Unicode
After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and the created tag, and push the .gem file to rubygems.org.
This gem uses Unicode data files to perform IDN conversion. To generate new Unicode data files, run bundle exec rake idna:generate.
To specify the Unicode version, use the VERSION environment variable, e.g. VERSION=15.1.0 bundle exec rake idna:generate.
By default, the Unicode version is the one used by the running Ruby version (RbConfig::CONFIG["UNICODE_VERSION"]).
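The default described above can be checked from Ruby itself. The sketch below shows one way such a fallback could be resolved, with an explicit VERSION value taking precedence over the interpreter's own Unicode version. The helper name unicode_version_for is hypothetical; the actual rake task may resolve the value differently.

```ruby
require "rbconfig"

# Hypothetical helper for illustration only; not taken from the gem's Rakefile.
# Prefers an explicit VERSION setting, otherwise falls back to the Unicode
# version that the running Ruby was built with.
def unicode_version_for(env = ENV)
  env.fetch("VERSION") { RbConfig::CONFIG["UNICODE_VERSION"] }
end

puts unicode_version_for
puts unicode_version_for({ "VERSION" => "15.1.0" })
```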
To set the directory for generated files, use the DEST_DIR environment variable, e.g. DEST_DIR=lib/uri/idna/data bundle exec rake idna:generate.
Unicode data is cached in the tmp directory by default. To change this, use the CACHE_DIR environment variable, e.g. CACHE_DIR=~/.cache/unicode_data bundle exec rake idna:generate.
Note: rake idna:generate might generate different results on different versions of Ruby because it relies on Ruby's built-in Unicode normalization methods.
To inspect Unicode data, run bundle exec rake 'idna:inspect[<HEX_CODE>]'.
To specify the Unicode version or the cache directory, use the VERSION or CACHE_DIR environment variables, e.g. VERSION=15.1.0 bundle exec rake 'idna:inspect[1f495]'.
To update the UTS46 test suite data, run bundle exec rake idna:update_uts46_test_suite.
Bug reports and pull requests are welcome on GitHub at https://github.com/skryukov/uri-idna.
The gem is available as open source under the terms of the MIT License.