The Experimental "PanCJKV" IVD Collection—Redux

News|Adobe Blogs|Dr. Ken Lunde 2016-02-20 15:24:53

Continuing where I left off with the first article about this subject, I'd like to point out some of the implementation details and their ramifications in this article.

Once again, I'd like to clearly state that the PanCJKV IVD Collection is experimental and completely unregistered, meaning that it would be unwise for anyone—besides yours truly, of course—to build and deploy an implementation based on it. The implementation that I built functions correctly because modern environments that support IVSes (Ideographic Variation Sequences) do so in a generic way whereby any valid, though not necessarily registered, sequence is recognized.

One important ramification that implementations need to consider is how unsupported regions are handled. For example, Source Han Sans (and Google's Noto Sans CJK) currently support four of the eight regions, specifically CN, TW, JP, and KR, in terms of including glyphs that are intended for those regions. In terms of CJK Unified Ideograph coverage, HK is also supported, though it is not yet claimed due to inappropriate glyphs for many of the characters. HK is expected to be a fifth region in the next major version.

So, what happens with IVSes that correspond to an unsupported region? The simple answer is that the VS (Variation Selector) is ignored—but not discarded—and the BC (Base Character), which is a CJK Unified Ideograph, is rendered according to the Format 12 (UTF-32) 'cmap' subtable of the selected font. In the case of the example implementation in the open source PanCJKV IVD Collection project, the default region of the 'cmap' table is CN, meaning that IVSes for unsupported regions will display according to CN conventions. However, if the same IVSes are rendered using a different Pan-CJK font that supports one or more of the unsupported (in the example implementation) regions, they will display as intended by that font. The image at the top of this article uses red to indicate IVSes that are unsupported in the example implementation, and are thus rendered according to the default region of the 'cmap' table whose glyph corresponds to the first one on each line.

My current plan is to submit a document for UTC #147 (May 9–13, 2016) that proposes that the UTC (Unicode Technical Committee) itself submit the "PanCJKV" IVD collection for registration according to what I have described in these two articles. There are two reasons for this plan:

Every CJK Unified Ideograph would require eight registered IVSes, meaning that as new CJK Unified Ideographs are added to Unicode, additional IVSes would need to be registered for them. This thus becomes a process that would need to be executed whenever new CJK Unified Ideographs are added in a new version of Unicode, such as what is likely for Unicode Version 10.0 in June of 2017.The third field of the IVD_Collections.txt file of the IVD requires a URL of a site describing the collection, and in the case of this IVD collection, Unicode's Unihan Database Lookup seems to be the most appropriate site for this purpose. The UTC no doubt would want to be involved if an IVD collection were to point to anything on unicode.org

In closing, perhaps the best way to describe the purpose and intent of this IVD collection is as follows: CJK Unified Ideographs are expected to be displayed according to the conventions and limitations of a particular implementation, in terms of which glyphs are supplied for the supported regions and which regions are supported, and that there is no guarantee that characters will display according to the Unicode code charts.

Once again, any and all feedback is welcome and encouraged.