[tex-k] Question about UNC (Universal Naming Convention) tests and Kanji

Wed Dec 27 01:19:43 CET 2017

All -

Several subroutines in kpathsea check the starts of paths, looking for a UNC (Universal Naming Convention) prefix, which is/was used only on WIN32 systems.  These routines are

  kpathsea_normalize_path() in "elt-dirs.c"
  normalize_filename() in "mingw32.c"  (only an initial check)
  xbasename() in "xbasename.c"
  xdirname() in "xdirname.c"

Each UNC prefix has the syntax

  <dir-sep> <dir-sep> <host-name> <dir-sep> <share-name> <dir-sep> <object-name>

where <dir-sep> is a backslash for WIN32, and <object-name> is basically a standard path string.

On non-WIN32 systems, the code testing for a possible UNC prefix is never invoked, because the macro IS_UNC_NAME() is defined to always be 0.

But when kpathsea is compiled for WIN32, the testing macro is defined (in "c-pathch.h") as

  #  define IS_UNC_NAME(name) (strlen(name)>=3 && IS_DIR_SEP(*name)  \
                               && IS_DIR_SEP(*(name+1)) && isalnum(*(name+2)))

(The initial strlen() test is entirely superfluous: && short-circuits, and a NUL byte is never a directory separator or alphanumeric, so the later tests can never read past the end of a short string.  But that's a separate efficiency issue.)

The last component of the macro test is whether the third character (name[2]) is an alphanumeric character (preceded by two directory separators).  If the result is true, then the various routines mentioned in the list above continue with an attempt to parse a legal UNC prefix.

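It appears that there's the usual improper signed char bug/problem when presenting arguments to isalnum() (or similar character class testers): where plain char is signed, a byte with its high-order bit set is passed as a negative value, and the behavior is undefined unless that value happens to equal EOF.  Should this not be the kpathsea wrapper macro ISALNUM()?  I would guess yes, but before answering that let's get to the main problem/question.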
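For reference, the usual way to make such a call well-defined is to convert the byte to unsigned char first.  A minimal sketch follows; ascii_isalnum() is a hypothetical helper of mine, not something in kpathsea, and I have not checked whether ISALNUM() is defined the same way:

```c
#include <ctype.h>

/* Hypothetical helper (not part of kpathsea): classify one byte of a
   path string without ever handing isalnum() a negative value. */
static int ascii_isalnum(char c)
{
    unsigned char uc = (unsigned char) c;   /* 0..255, never negative */

    /* Reject anything outside 7-bit ASCII, then ask isalnum(). */
    return uc < 0x80 && isalnum(uc) != 0;
}
```

With this, a byte such as 0x83 is simply reported as "not alphanumeric" regardless of whether char is signed on the platform.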

Consider this snippet of code in xbasename() (in "xbasename.c").  The snippet attempts to find the end of <host-name>:

    else if (IS_UNC_NAME(name)) {
        unsigned limit;

        for (limit = 2; name[limit] && !IS_DIR_SEP (name[limit]); limit++)
#if defined(WIN32) && defined (KPSE_COMPAT_API)
            if (IS_KANJI(name+limit)) limit++
#endif
            ;

Looking at this code, the first thing I noticed is that the variable |limit| in the loop is initialized to 2.  In an ASCII world, this means redundant work, because the IS_UNC_NAME() macro just guaranteed that name[2] is alphanumeric, which means it's not a NUL byte and not a directory separator, so the first iteration through the loop is guaranteed.  Thus, one would think that |limit| could safely be initialized to 3.  But ... there's that IS_KANJI() test in the loop body, as if the <host-name> could start with a Kanji set of bytes at name[2].  The first byte of a Kanji set of bytes has its high-order bit set (see the code in "knj.c"), and is thus not ASCII.  And if |name| is a pointer to a signed char (which it ambiguously might or might not be, see "simpletypes.h"), name[2] can evaluate to a negative integer.
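One reason a loop like this would need a double-byte test at all: in Shift_JIS (code page 932), the second byte of a two-byte character can be 0x5C, the same value as a backslash, so a byte-by-byte scan can mistake it for a directory separator.  Here is a minimal sketch of that situation.  The names is_sjis_lead(), is_sep() and host_end() are hypothetical, and the lead-byte ranges are the commonly documented ones for CP932; this is not the kpathsea code and does not reproduce what IS_KANJI() actually checks.

```c
#include <stddef.h>

/* Assumed CP932 lead-byte ranges: 0x81..0x9F and 0xE0..0xFC. */
static int is_sjis_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical stand-in for IS_DIR_SEP() on WIN32. */
static int is_sep(char c)
{
    return c == '/' || c == '\\';
}

/* Given a name that starts with two separators, return the index of
   the separator (or terminating NUL) that ends <host-name>. */
static size_t host_end(const char *name)
{
    size_t i = 2;

    while (name[i] != '\0' && !is_sep(name[i])) {
        /* Step over the trail byte too, so that a trail byte of
           0x5C is not taken for a backslash. */
        if (is_sjis_lead((unsigned char) name[i]) && name[i + 1] != '\0')
            i++;
        i++;
    }
    return i;
}
```

For the bytes `\\ab`, 0x95, 0x5C, `cd\share`, this scan stops at index 8 (the real separator after "cd"), whereas a scan without the lead-byte check would stop at index 5, in the middle of the two-byte character.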

The question is, can a <host-name> in a UNC-based path on a WIN32 system start with a Kanji sequence of bytes?  The fact that the |limit| variable is starting with 2 and then that first byte is tested with IS_KANJI() indicates that someone thought the answer is yes.  But if yes, then isn't the IS_UNC_NAME() macro going to incorrectly return false, because the first non-ASCII byte of a Kanji sequence is not going to pass either the isalnum() or ISALNUM() test?  (isalnum() is locale-dependent, but in the default "C" locale it accepts only ASCII letters and digits.)
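To make the concern concrete, here is a small sketch under stated assumptions.  looks_like_unc() is a hypothetical function that mirrors the quoted macro (minus the strlen() call, and with an unsigned char cast added so the call is well-defined); the bytes 0x8A 0xBF are what I take to be the Shift_JIS encoding of one kanji; and the program is assumed to run in the default "C" locale.

```c
#include <ctype.h>

/* Hypothetical function mirroring the shape of IS_UNC_NAME():
   two directory separators followed by an alphanumeric byte. */
static int looks_like_unc(const char *name)
{
    return (name[0] == '\\' || name[0] == '/')
        && (name[1] == '\\' || name[1] == '/')
        && isalnum((unsigned char) name[2]) != 0;
}
```

Under those assumptions, `\\host\share\f` is accepted, but the same path with a host name whose first byte is 0x8A is rejected before any host-name parsing (and hence any IS_KANJI() test) is reached.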

On the other hand, if a <host-name> must consist solely of ASCII alphanumeric bytes, then (a) the loop can initialize |limit| with 3, but (b) there need be no IS_KANJI() byte test at all inside it.  Which also doesn't make sense, given all the effort to use it for WIN32 compilations in the above subroutines.

I've tried to pin down what the syntax rules are for UNC names.  The Microsoft page at

  <https://msdn.microsoft.com/en-us/library/gg465305.aspx>

says a <host-name> is either an IP address or a <reg-name>.  A <reg-name> is defined in RFC3986 (as cited by Microsoft) at

  <http://www.rfc-editor.org/rfc/rfc3986.txt>

as including "unreserved" bytes.  Unreserved bytes, in turn, are ALPHA, DIGIT, or certain other ASCII values ("-", ".", "_", "~"), which implies that no non-ASCII 8-bit byte (i.e., a first Kanji byte) should appear in a hostname of a UNC prefix.  But the kpathsea code quoted above plainly assumes it does (perhaps because WIN32 did not feel inclined to conform with the rest of the world back in the day, whereas in later Windows systems MS decided to play nice?).  Sigh.
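As a sanity check on that reading, the "unreserved" set from RFC 3986 section 2.3 is small enough to write out directly.  This is only a sketch: is_rfc3986_unreserved() is a hypothetical name, it assumes an ASCII execution character set, and it covers just the "unreserved" alternative of <reg-name>, not the percent-encoded or sub-delimiter forms the RFC also allows there.

```c
/* RFC 3986, section 2.3:
   unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
   Hypothetical helper; assumes ASCII character codes. */
static int is_rfc3986_unreserved(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        return 1;                       /* ALPHA */
    if (c >= '0' && c <= '9')
        return 1;                       /* DIGIT */
    return c == '-' || c == '.' || c == '_' || c == '~';
}
```

Every byte this accepts is below 0x80, so a byte with its high-order bit set can never be part of an unreserved run.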

In any case, I'm wondering whether this is an obscure bug that's never been triggered, or something else.  Since I have no experience with either Windows programming or Japanese Kanji text in UNC names, I'm not sure what's going on (or whether any of this matters anymore).  Is the IS_UNC_NAME() macro the correct test?  I'm just trying to understand the code, in which it sure seems like some kind of inconsistency for UNC testing/parsing is going on.

Apologies for the noise should it be something obvious that I'm misunderstanding.

Onward, into the fog …

Doug McKenna