URI::IDNA

An implementation of IDNA2008, UTS46, IDNA from the WHATWG URL Standard, and Punycode in pure Ruby.

This gem provides a number of functions for converting internationalized domain names (IDNs) between the Unicode and ASCII Compatible Encoding (ACE) forms.

Sponsored by Evil Martians

Installation

Add to your Gemfile:

gem "uri-idna"

And then run bundle install.

Usage

There are plenty of ways to convert IDNs between Unicode and ACE forms.

IDNA2008

RFC 5891 defines two protocols for IDN conversion: Registration and Domain Name Lookup.

Registration protocol

URI::IDNA.register(alabel:, ulabel:, **options)

Options
  • check_hyphens: true – whether to check hyphens according to Section 5.4.
  • leading_combining: true – whether to check leading combining marks according to Section 5.4.
  • check_joiners: true – whether to check CONTEXTJ code points according to Section 5.4.
  • check_others: true – whether to check CONTEXTO code points according to Section 5.4.
  • check_bidi: true – whether to check bidirectional characters according to Section 5.4.
require "uri/idna"

URI::IDNA.register(alabel: "xn--gdkl8fhk5egc.jp", ulabel: "ハロー・ワールド.jp")
#=> "xn--gdkl8fhk5egc.jp"

URI::IDNA.register(ulabel: "ハロー・ワールド.jp")
#=> "xn--gdkl8fhk5egc.jp"

URI::IDNA.register(alabel: "xn--gdkl8fhk5egc.jp")
#=> "xn--gdkl8fhk5egc.jp"

URI::IDNA.register(ulabel: "☕.us")
#<URI::IDNA::InvalidCodepointError: Codepoint U+2615 at position 1 of "☕" not allowed>
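
The hyphen restriction behind the check_hyphens option can be illustrated in plain Ruby. This is only a sketch of the rule as stated in RFC 5891 for a single Unicode label (no hyphen at the start or end, and no "--" in the third and fourth positions). The hyphens_ok? helper is hypothetical and is not part of the gem's API.

```ruby
# Hypothetical helper (not part of uri-idna): applies the RFC 5891 hyphen
# restrictions to one Unicode label.
def hyphens_ok?(label)
  # A label must not start or end with a hyphen.
  return false if label.start_with?("-") || label.end_with?("-")
  # A label must not have "--" in the third and fourth positions.
  return false if label[2, 2] == "--"

  true
end

puts hyphens_ok?("example")  # prints "true"
puts hyphens_ok?("-example") # prints "false"
puts hyphens_ok?("ab--cd")   # prints "false"
```

Note that the rule targets Unicode labels: an A-label such as "xn--gdkl8fhk5egc" intentionally carries "--" in those positions.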

Domain Name Lookup Protocol

URI::IDNA.lookup(domain_name, **options)

Options
  • check_hyphens: true – whether to check hyphens according to Section 4.2.3.1.
  • leading_combining: true – whether to check leading combining marks according to Section 4.2.3.2.
  • check_joiners: true – whether to check CONTEXTJ code points according to Section 4.2.3.3.
  • check_others: true – whether to check CONTEXTO code points according to Section 4.2.3.3.
  • check_bidi: true – whether to check bidirectional characters according to Section 4.2.3.4.
  • verify_dns_length: true – whether to check DNS length according to Section 4.4.
require "uri/idna"

URI::IDNA.lookup("ハロー・ワールド.jp")
#=> "xn--gdkl8fhk5egc.jp"

URI::IDNA.lookup("xn--gdkl8fhk5egc.jp")
#=> "xn--gdkl8fhk5egc.jp"

URI::IDNA.lookup("Ῠ.me")
#<URI::IDNA::InvalidCodepointError: Codepoint U+1FE8 at position 1 of "Ῠ" not allowed>
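
As a rough illustration of what the verify_dns_length option guards against, the sketch below checks the usual DNS limits on an already converted ASCII name: each label is 1 to 63 bytes and the whole name, ignoring a trailing root dot, is at most 253 bytes. The dns_length_ok? helper and the constant names are assumptions made for this example, not the gem's internals.

```ruby
# Hypothetical constants and helper (not part of uri-idna).
MAX_LABEL_LENGTH = 63
MAX_DOMAIN_LENGTH = 253

def dns_length_ok?(ascii_domain)
  # Ignore a single trailing root dot, e.g. "example.com."
  name = ascii_domain.chomp(".")
  return false if name.empty? || name.bytesize > MAX_DOMAIN_LENGTH

  # The -1 limit keeps empty labels so that "a..com" is rejected.
  name.split(".", -1).all? do |label|
    (1..MAX_LABEL_LENGTH).cover?(label.bytesize)
  end
end

puts dns_length_ok?("xn--gdkl8fhk5egc.jp")  # prints "true"
puts dns_length_ok?("#{'a' * 64}.com")      # prints "false"
```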

Unicode UTS46 (TR46)

Current revision: 31

UTS46 defines two IDN conversion functions: ToASCII and ToUnicode.

ToASCII

URI::IDNA.to_ascii(domain_name, **options)

Options
require "uri/idna"

URI::IDNA.to_ascii("Bloß.de")
#=> "xn--blo-7ka.de"

# UTS46 transitional processing is disabled by default,
# but can be enabled via option:
URI::IDNA.to_ascii("Bloß.de", transitional_processing: true)
#=> "bloss.de"

# Note that UTS46 processing is not fully IDNA2008 compliant:
URI::IDNA.to_ascii("☕.us")
#=> "xn--53h.us"

ToUnicode

URI::IDNA.to_unicode(domain_name, **options)

Options
require "uri/idna"

URI::IDNA.to_unicode("xn--blo-7ka.de")
#=> "bloß.de"

IDNA2008 compatibility

It's possible to apply UTS46 mapping first and then IDNA2008, so that the processing fully conforms to IDNA2008:

require "uri/idna"

# For example, UTS46 mapping can normalize characters that downcasing leaves untouched
char = "⼤" # "\u2F24"
char.ord
#=> 12068

# plain downcase doesn't help in this case
char.downcase.ord
#=> 12068

# but UTS46 mapping does its thing:
URI::IDNA::UTS46::Mapping.call(char).ord
#=> 22823

# so here is a full example:
domain = "⼤.cn" # "\u2F24.cn"
URI::IDNA.lookup(domain)
#<URI::IDNA::InvalidCodepointError: Codepoint U+2F24 at position 1 of "⼤" not allowed>

mapped_domain = URI::IDNA::UTS46::Mapping.call(domain)
URI::IDNA.lookup(mapped_domain)
#=> "xn--pss.cn"
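
For intuition only, the same code point can be examined with Ruby's built-in compatibility normalization. The sketch below uses String#unicode_normalize from the standard library, not the gem. NFKC happens to map U+2F24 to U+5927 as well, but UTS46 mapping relies on its own mapping table, so the two should not be treated as interchangeable.

```ruby
radical = "\u2F24" # KANGXI RADICAL BIG

# Case conversion leaves the character untouched.
downcased = radical.downcase

# Compatibility normalization replaces it with the unified ideograph U+5927.
normalized = radical.unicode_normalize(:nfkc)

puts downcased.ord  # prints "12068"
puts normalized.ord # prints "22823"
```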

WHATWG

WHATWG's URL Standard uses the UTS46 algorithm to define its ToASCII and ToUnicode functions. It abstracts away all available flags and exposes only one instead: the be_strict flag.

Note that the check_hyphens UTS46 option is set to false in this algorithm.

ToASCII

URI::IDNA.whatwg_to_ascii(domain_name, **options)

Options
  • be_strict: true – defines values of use_std3_ascii_rules and verify_dns_length UTS46 options.
require "uri/idna"

URI::IDNA.whatwg_to_ascii("Bloß.de")
#=> "xn--blo-7ka.de"

# The be_strict flag sets use_std3_ascii_rules and verify_dns_length UTS46 flags to its value
URI::IDNA.whatwg_to_ascii("2003_rules.com", be_strict: false)
#=> "2003_rules.com"

# By default be_strict is set to true
URI::IDNA.whatwg_to_ascii("2003_rules.com")
#<URI::IDNA::InvalidCodepointError: Codepoint U+005F at position 5 of "2003_rules" not allowed>
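
The rejection of "2003_rules" comes from the STD3 restriction to letters, digits and hyphens. A minimal sketch of that character check for a purely ASCII label is shown below. The first_non_ldh_position helper is hypothetical, handles ASCII input only, and is not how the gem reports errors internally.

```ruby
# Hypothetical helper (not part of uri-idna): returns the 1-based position of
# the first character that is not an ASCII letter, digit or hyphen, or nil
# when the label contains none. Intended for ASCII-only labels.
def first_non_ldh_position(label)
  index = label.index(/[^A-Za-z0-9-]/)
  index && index + 1
end

p first_non_ldh_position("2003_rules") # prints 5
p first_non_ldh_position("2003-rules") # prints nil
```

The position reported for the underscore matches the one in the error message above.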

ToUnicode

URI::IDNA.whatwg_to_unicode(domain_name, **options)

Options
  • be_strict: true – defines the value of the use_std3_ascii_rules UTS46 option.
require "uri/idna"

URI::IDNA.whatwg_to_unicode("xn--blo-7ka.de")
#=> "bloß.de"

Punycode

The Punycode module converts between Unicode and Punycode. Note that Punycode on its own is not IDNA2008 compliant: it only performs the conversion, without any validation.

require "uri/idna/punycode"

URI::IDNA::Punycode.encode("ハロー・ワールド")
#=> "gdkl8fhk5egc"

URI::IDNA::Punycode.decode("gdkl8fhk5egc")
#=> "ハロー・ワールド"
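
To show what the conversion involves, here is a minimal sketch of the encoding procedure from RFC 3492 in plain Ruby. It is not the gem's implementation: the PunycodeSketch module and its method names are made up for this example, and it performs no overflow checks or validation.

```ruby
# Illustrative Punycode encoder following RFC 3492 (sketch only).
module PunycodeSketch
  BASE = 36
  TMIN = 1
  TMAX = 26
  SKEW = 38
  DAMP = 700
  INITIAL_BIAS = 72
  INITIAL_N = 128

  module_function

  # Maps a digit value 0..35 to "a".."z" followed by "0".."9".
  def digit(value)
    (value < 26 ? value + 97 : value - 26 + 48).chr
  end

  # Bias adaptation (RFC 3492, Section 6.1).
  def adapt(delta, num_points, first_time)
    delta = first_time ? delta / DAMP : delta / 2
    delta += delta / num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) / 2
      delta /= BASE - TMIN
      k += BASE
    end
    k + (((BASE - TMIN + 1) * delta) / (delta + SKEW))
  end

  def encode(input)
    code_points = input.codepoints
    basic = code_points.select { |cp| cp < INITIAL_N }
    # Basic (ASCII) code points are copied first, followed by a delimiter.
    output = basic.map(&:chr).join
    output << "-" unless basic.empty?
    handled = basic.size
    n = INITIAL_N
    delta = 0
    bias = INITIAL_BIAS

    while handled < code_points.size
      # Jump to the next smallest code point that is not yet encoded.
      m = code_points.select { |cp| cp >= n }.min
      delta += (m - n) * (handled + 1)
      n = m
      code_points.each do |cp|
        delta += 1 if cp < n
        next unless cp == n

        # Emit delta as a generalized variable-length integer.
        q = delta
        k = BASE
        loop do
          t = (k - bias).clamp(TMIN, TMAX)
          break if q < t

          output << digit(t + ((q - t) % (BASE - t)))
          q = (q - t) / (BASE - t)
          k += BASE
        end
        output << digit(q)
        bias = adapt(delta, handled + 1, handled == basic.size)
        delta = 0
        handled += 1
      end
      delta += 1
      n += 1
    end
    output
  end
end

puts PunycodeSketch.encode("bloß")            # prints "blo-7ka"
puts PunycodeSketch.encode("ハロー・ワールド") # prints "gdkl8fhk5egc"
```

The results line up with the "xn--blo-7ka" and "xn--gdkl8fhk5egc" labels shown in the examples above, once the "xn--" ACE prefix is added.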

Full technical reference:

IDNA2008

Punycode

  • RFC 3492 – Punycode: A Bootstring encoding of Unicode

UTS46 (also referenced as TR46)

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and the created tag, and push the .gem file to rubygems.org.

Generating Unicode data

This gem uses Unicode data files to perform IDN conversion. To generate new Unicode data files, run bundle exec rake idna:generate.

To specify the Unicode version, use the VERSION environment variable, e.g. VERSION=15.1.0 bundle exec rake idna:generate.

By default, the Unicode version is the one supported by the running Ruby version (RbConfig::CONFIG["UNICODE_VERSION"]).
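
To see which version that default resolves to on a given Ruby, the value can be read directly. This is a small sketch using only the standard rbconfig library; the variable name is arbitrary.

```ruby
require "rbconfig"

# Unicode version supported by the running Ruby interpreter.
unicode_version = RbConfig::CONFIG["UNICODE_VERSION"]
puts "Default Unicode version: #{unicode_version}"
```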

To set the directory for generated files, use the DEST_DIR environment variable, e.g. DEST_DIR=lib/uri/idna/data bundle exec rake idna:generate.

Unicode data is cached in the tmp directory by default. To change this, use the CACHE_DIR environment variable, e.g. CACHE_DIR=~/.cache/unicode_data bundle exec rake idna:generate.

Note: rake idna:generate might generate different results on different versions of Ruby due to usage of built-in Unicode normalization methods.

Inspect Unicode data

To inspect Unicode data, run bundle exec rake 'idna:inspect[<HEX_CODE>]'.

To specify the Unicode version or the cache directory, use the VERSION or CACHE_DIR environment variables, e.g. VERSION=15.1.0 bundle exec rake 'idna:inspect[1f495]'.
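
The hex code expected by the task is the character's code point in hexadecimal. A quick way to obtain it in Ruby, shown here as a sketch with arbitrary variable names:

```ruby
char = "\u{1F495}" # TWO HEARTS

# Integer#to_s(16) renders the code point as lowercase hexadecimal.
hex_code = char.ord.to_s(16)

puts "bundle exec rake 'idna:inspect[#{hex_code}]'"
# prints "bundle exec rake 'idna:inspect[1f495]'"
```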

Update UTS46 test suite data

To update UTS46 test suite data, run bundle exec rake idna:update_uts46_test_suite.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/skryukov/uri-idna.

License

The gem is available as open source under the terms of the MIT License.
