[XeTeX] Devanagari ASCII to Unicode mapping
ShreeDevi Kumar
shreeshrii at gmail.com
Sun Feb 18 10:10:35 CET 2018
Thank you for this info.
There is still a lot of Hindi content being produced in non-Unicode
fonts (much of the DTP software used in India still does not support
Unicode).
>> The LDC *might* still have the encoding converters laying around
somewhere.
These would be very useful if they could be made available. There is a need
for an easy way to convert legacy documents to Unicode. One application
someone was recently looking for them for was checking student projects
and theses for plagiarism.
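
As a rough illustration of why one converter is needed per font family, here
is a minimal Python sketch. The two mapping tables (TOY_FONT_A, TOY_FONT_B)
and the function name legacy_to_unicode are hypothetical and are not taken
from any real legacy font or existing tool; a real converter needs the full
table for the specific font. The sketch also shows one reordering step: legacy
fonts typically store the short-i vowel sign in visual order (before the
consonant), while Unicode stores it after the consonant.

```python
# Toy tables for illustration only: these are NOT real legacy font mappings.
TOY_FONT_A = {"d": "\u0915", "e": "\u092e", "k": "\u093e", "f": "\u093f"}
TOY_FONT_B = {"k": "\u0915", "m": "\u092e", "a": "\u093e", "i": "\u093f"}

VOWEL_SIGN_I = "\u093f"  # DEVANAGARI VOWEL SIGN I, drawn to the left of its consonant


def is_consonant(s):
    """True if s ends in a Devanagari consonant (U+0915..U+0939)."""
    return bool(s) and "\u0915" <= s[-1] <= "\u0939"


def legacy_to_unicode(text, table):
    """Convert font-encoded text to Unicode using a per-font table.

    Characters missing from the table are passed through unchanged.
    A short-i sign seen before a consonant is held back and emitted
    after the next consonant, as Unicode logical order requires.
    Conjuncts and other multi-glyph sequences are not handled here.
    """
    out = []
    pending_i = False
    for ch in text:
        mapped = table.get(ch, ch)
        if mapped == VOWEL_SIGN_I:
            pending_i = True
            continue
        out.append(mapped)
        if pending_i and is_consonant(mapped):
            out.append(VOWEL_SIGN_I)
            pending_i = False
    if pending_i:
        # No consonant followed; keep the sign rather than drop it.
        out.append(VOWEL_SIGN_I)
    return "".join(out)


if __name__ == "__main__":
    # Different byte sequences, same Unicode result, depending on the font.
    print(legacy_to_unicode("fde", TOY_FONT_A))
    print(legacy_to_unicode("ikm", TOY_FONT_B))
```

The same input character ("k" above) maps to a consonant in one table and to
a vowel sign in the other, which is why the font has to be known before
conversion.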
ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
On Sat, Feb 17, 2018 at 10:45 PM, Mike Maxwell <maxwell at umiacs.umd.edu>
wrote:
> On 2/17/2018 11:58 AM, ShreeDevi Kumar wrote:
>
>> Before Unicode, Devanagari fonts used the ASCII range (legacy fonts).
>> However, AFAIK there was no standardization in the mapping, though
>> various families of fonts had similar mappings.
>>
>> See http://hindi-fonts.com/tools for converters from different mappings
>> to Unicode.
>>
>> So, ASCII to Unicode mapping for Devanagari will change based on the
>> font used.
>>
>
> Indeed! In 2003, DARPA held a "surprise language exercise", the goal of
> which was to produce (very basic) MT etc. tools for Hindi, in a month's
> time. I had been involved in the prep for it to ensure that there would be
> no roadblocks (at the time, I was working at the LDC). One of the things
> that Bill Poser and I verified was that there was a Unicode encoding for
> Hindi/Devanagari. There was, but that was the wrong question.
>
> The right question was whether any Hindi website used Unicode. The answer
> to that was that the BBC and Colgate did, but hardly anyone else. A few
> Indian government sites used ISCII, which wouldn't have been bad, but most
> places used proprietary encodings that went along with a proprietary font.
> Worse, these were not simple code-point-to-character encodings; it was as
> if the Latin letter 'l' had been encoded as 'l', but then 'd' had been
> encoded as 'c' + 'l', 'b' as 'l' + a sort of backwards 'c', 'p' as a
> lowered 'l' + the backwards 'c', etc. It was a mess, and for a while it was
> unclear whether the exercise would fail because most of the data we needed
> was in these weird proprietary encodings. (It eventually succeeded.)
>
> There are some notes here--
>
> http://languagelog.ldc.upenn.edu/myl/ldc/hindi_fonts_and_conversions.html
> --that Mark Liberman of the LDC made at the time concerning some of the
> issues. Most of it is long out of date (and the links are probably
> broken), and these proprietary encodings have thankfully been replaced by
> Unicode; but if you're dealing with documents from that era, you might
> still run into them. The LDC *might* still have the encoding converters
> laying around somewhere.
> --
> Mike Maxwell
> "My definition of an interesting universe is
> one that has the capacity to study itself."
> --Stephen Eastmond
>