Damn Small Characters for Interlanguage Interchange

Forum: DSL Ideas and Suggestions
Topic: Damn Small Characters for Interlanguage Interchange
started by: newby

Posted by newby on Mar. 11 2008,14:04

Damn Small Characters for Interlingual Interchange (DaSCII)

-- a proposal --

THE PROBLEM

The majority languages, spoken by over 50% of people worldwide, are (in order): Chinese (962 million), English (322m), Spanish (266m), Russian (170m), Portuguese (170m), Japanese (125m), German (98m), Bengali (189m) & Hindi (182m). Neither 7-bit nor 8-bit systems provide enough characters to directly cover these languages.

THE SOLUTION

By examining the transliteration systems for each of the languages, the number of unique characters can be drastically reduced. The Dravidian languages (Bengali, Hindi, et cetera) present the greatest difficulty due to a great number of diacritical and other marks. Therefore:

1. Start with the IBM 8-bit character set.

2. Insert characters 177 - 250 from the Indian Script Code for Information Interchange.

3. Add the following characters:

158 - € (the Euro symbol)
166 - z (with a tail underneath, used in Arabic transliteration)
167 - t (with a tail underneath, used in Arabic transliteration)
169 - e rising tone (the other rising-tone characters are covered)
170 - a falling-rising tone
235 - e falling-rising tone
236 - i falling-rising tone
237 - o falling-rising tone
238 - u falling-rising tone
251 - s (with a tail underneath, used in Arabic transliteration)
252 - d (with a tail underneath, used in Arabic transliteration)
253 - o (with a shallow u-shaped mark above, used in Korean transliteration)
254 - u (with a shallow u-shaped mark above, used in Korean transliteration)
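The additions in step 3 can be written down as a lookup table. This is only a rough sketch: the Unicode equivalents are guesses made from the verbal descriptions (for example, "tail underneath" is assumed to mean the dot-below letters common in Arabic romanization), and the name DASCII_ADDITIONS is made up for illustration.

```python
# Hypothetical table of the 13 code points listed in step 3.
# The Unicode equivalents are assumptions, not part of the proposal.
DASCII_ADDITIONS = {
    158: "\u20ac",  # euro sign
    166: "\u1e93",  # z with dot below (assumed reading of "tail underneath")
    167: "\u1e6d",  # t with dot below
    169: "\u00e9",  # e with acute (rising tone)
    170: "\u01ce",  # a with caron (falling-rising tone)
    235: "\u011b",  # e with caron
    236: "\u01d0",  # i with caron
    237: "\u01d2",  # o with caron
    238: "\u01d4",  # u with caron (assumed to match the four lines above it)
    251: "\u1e63",  # s with dot below
    252: "\u1e0d",  # d with dot below
    253: "\u014f",  # o with breve (Korean romanization)
    254: "\u016d",  # u with breve (Korean romanization)
}

# Step 2 claims codes 177 through 250 for the ISCII characters.
ISCII_RANGE = range(177, 251)

# Which step-3 additions land inside the step-2 range?
overlap = sorted(code for code in DASCII_ADDITIONS if code in ISCII_RANGE)
print(overlap)  # [235, 236, 237, 238]
```

Four of the additions fall inside the range named in step 2, so they presumably reuse positions that the ISCII table leaves unassigned. That is an inference from the numbers, not something stated in the proposal.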

USAGE

1. Where a character is found in DaSCII, use it.
2. Where a character is not found in DaSCII, use transliterations, for example:

German - Use diphthongs suggested in the BGN/PCGN 2000 Agreement
Russian - Use diphthongs suggested in the BGN/PCGN 1947 Agreement
Romaji - Use the umlauted character for the high tone.
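The two usage rules amount to "look the character up, else fall back to a letter sequence". A minimal sketch, assuming two tiny stand-in tables: DASCII and FALLBACK are hypothetical names, the byte values 170 and 235 are taken from the list above, and the two Russian entries follow the BGN/PCGN romanization (zhe as "zh", shcha as "shch").

```python
# Hypothetical stand-ins: a small slice of the character table and of a
# transliteration table. A real table would be far larger.
DASCII = {"\u01ce": 170, "\u011b": 235}        # a and e with falling-rising mark
FALLBACK = {"\u0436": "zh", "\u0449": "shch"}  # Cyrillic zhe and shcha


def to_dascii(text):
    """Encode text one byte per character where possible (rule 1),
    otherwise spell the character out in plain letters (rule 2)."""
    out = bytearray()
    for ch in text:
        if ord(ch) < 128:
            out.append(ord(ch))            # plain ASCII passes through
        elif ch in DASCII:
            out.append(DASCII[ch])         # rule 1: character is in the set
        elif ch in FALLBACK:
            out.extend(FALLBACK[ch].encode("ascii"))  # rule 2: transliterate
        else:
            raise ValueError("no mapping for %r" % ch)
    return bytes(out)


print(to_dascii("m\u01ce"))   # b'm\xaa'
print(to_dascii("\u0436"))    # b'zh'
```

Note that rule 2 makes the encoding one-way: once a character has been spelled out as "zh", a reader of the bytes cannot tell it apart from a literal "z" followed by "h".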

== This will allow transliterations of over 50% of human languages with _one_ font ==

Posted by humpty on Mar. 12 2008,03:41

what about the other 50% ?

why can't everyone communicate with just one language ?

why add yet another system ?

Posted by newby on Mar. 12 2008,16:38

Quote (humpty @ Mar. 11 2008,22:41)

what about the other 50% ?

why can't everyone communicate with just one language ?

why add yet another system ?

Answering your questions in order:

1. Actually, the 50% figure came from looking at population statistics. Looking at the actual transliteration systems, the figure will be far greater than 50%.

2. Social evolution.

3. It's _not_ "another system." What I am proposing is to maximise the usefulness of the system we have, the 8-bit character set.

Remember, I am _not_ proposing a _language_ font, but a _transliteration_ font. Its purpose is to make Damn Small Linux accessible to the greatest number of people. And, therefore, maximally successful world-wide.

I've looked at the Anglo/American systems, the United Nations systems and the ex-Soviet Bloc systems. The difficulty comes from the addition of marks to the roman consonants - the increase is explosive! Use of diphthongs decreases the necessity for extra consonants.

The political choice is between providing a so-so solution that includes all the vowel variants and leaving South-Asia in the lurch versus including the South-Asian characters and using dipthongs to cover the rest of the systems.

Ultimately, there is a physical limit = 256 characters.

Posted by lucky13 on Mar. 12 2008,17:29

It *IS* another system. How many people who don't know English or use a Latin or Cyrillic alphabet use transliterated Latin characters to communicate?

Those already familiar with this particular alphabet most likely already speak the (pardon me) lingua franca of the Internet and of most programming, English. I don't see what's so bloody important about this subject that it requires at least two polls and yet another thread.

"Social evolution" isn't tied to transliteration but literacy and actual translation. If you want more people who don't speak English (or use Latin/Cyrillic alphabets) to use DSL, perhaps you can help add the characters they actually know and use.

Posted by newby on Mar. 12 2008,18:08

Quote (lucky13 @ Mar. 12 2008,12:29)

Actually, a lot of people learn transliterations, for the purpose of access to computers as a means to communicate with others. For example, the following is a true statement about myself: "Wo shi zhung wen xue shung." It is meaningless without the vowel marks. With the vowel marks, it is understandable to millions that "I am a Chinese language student."

Your use of "lingua franca" illustrates the issue: French was promoted as an "international" language when France was an economic power. Loss of economic power reduced linguistic dominance to the historic relic of lingua franca.

The US has been declining as a percentage of the gross international product since 1970. Soon, lingua yankee may be all that is left of our claim to linguistic dominance.

Ultimately, the issue is neither nations nor languages, but the limits of the 8-bit byte, which looks to be more durable than nations or languages.

It would have been much easier if we had gone with the PDP-8 and 12-bit bytes. 4096 characters could have phonetically covered the world.

At this point, this is becoming one of those "lighter-more filling" arguments that DSL seems to inspire. I come down on the "lighter" side (8-bit font) for initial access to DSL. Once one has discovered that DSL is really useful, one can switch to the "more filling" camp and use UNICODE.

Posted by u2musicmike on Mar. 12 2008,19:19

Just from my experience I don't think adding a few extra characters will work. When I was using a Windows PC in an internet cafe in Kiev, it had a button labeled "ru" on the icon bar. This button changed the keyboard and font to Cyrillic. Hit the "en" button and it went back to English. You could blend Russian and English in the same Word document, so if I was telling someone about a local restaurant I could type the name as it was.

Posted by lucky13 on Mar. 12 2008,22:27

Quote

Your use of "lingua franca" illustrates the issue

No, it doesn't and I didn't mean to set you off on another tangent. Users are dealing with this with extensions of native language fonts. Some have already been submitted, iirc.

If you have some extension to submit for MyDSL, please do. I don't think we need every rambling digression and obtuse angels-on-pins discussion about standards that weren't accepted in the process of your project. I don't think we need to adopt a new standard that -- if it really will require lots of "research" (as you'd written previously but seem to have since edited out) -- sounds too much like reinventing of wheels. Using it in the base as a default seems kind of stupid because the kludging that would likely be required to get every single application to work with a novel character set would take up more space than it's worth. We already have settled standards, we have NL fonts available. The apps already are compiled for those standards. We should use those instead of reinventing wheels.

Does this subject merit multiple threads and polls?

Posted by lucky13 on Mar. 12 2008,22:32

Quote

Just from my experience I don't think adding a few extra characters will work.

In the Linux/BSD/Unix/Solaris/OSX universe, there are myriad apps with myriad libraries those apps are compiled against. Most of those are based on set standards for languages. Introducing novel and non-standard character sets in such a meiotic environment introduces more confusion than developers need to deal with.

Posted by stupid_idiot on Mar. 13 2008,18:06

Hi newby:

I don't think the slight added convenience of tone marks (Chinese pinyin) and diacritics (Indic languages) justifies the major trouble of a new charset (research, testing, everyday use).

If 2 people want to communicate using phonetic representation, I think there is a much simpler solution:

In the case of Chinese, the user can just use numbers (as in "1, 2, 3, 4") to represent tone:
< http://en.wikipedia.org/wiki/Pinyin#Numbers_in_place_of_tone_marks >
This way, 2 people can communicate in Chinese using pinyin, without the aid of a Chinese font.
This is easier compared to using tone marks. When using tone marks, the user must know where (which vowel) to place the tone mark.
This is not required when using numbers:
e.g.
Wo3 shi4 zhong1 wen2 xue2 sheng1.
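To show what "knowing where to place the tone mark" involves, here is a rough sketch that converts numbered pinyin into tone-marked pinyin. The function name mark_syllable is made up, and the placement rule used (mark "a" or "e" if present, else the "o" in "ou", else the last vowel) is the commonly taught one. It handles tones 1 to 4 only, ignores the letter ü, and expects lowercase vowels.

```python
# Tone-marked forms of each vowel, indexed by tone number minus one.
TONE_MARKS = {
    "a": "āáǎà",
    "e": "ēéěè",
    "i": "īíǐì",
    "o": "ōóǒò",
    "u": "ūúǔù",
}


def mark_syllable(syllable):
    """Turn a numbered syllable such as 'zhong1' into 'zhōng'."""
    if not syllable or syllable[-1] not in "1234":
        return syllable                      # no tone number: leave as is
    body, tone = syllable[:-1], int(syllable[-1])
    lower = body.lower()
    if "a" in lower:
        pos = lower.index("a")               # 'a' always takes the mark
    elif "e" in lower:
        pos = lower.index("e")               # then 'e'
    elif "ou" in lower:
        pos = lower.index("o")               # in 'ou' the 'o' takes it
    else:
        vowels = [i for i, c in enumerate(lower) if c in "iou"]
        if not vowels:
            return body                      # nothing to mark
        pos = vowels[-1]                     # otherwise the last vowel
    marked = TONE_MARKS[lower[pos]][tone - 1]
    return body[:pos] + marked + body[pos + 1:]


sentence = "Wo3 shi4 zhong1 wen2 xue2 sheng1"
print(" ".join(mark_syllable(s) for s in sentence.split()))
# Wǒ shì zhōng wén xué shēng
```

The numbered form needs none of this logic, which is the point being made above: the writer only has to append a digit.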

Similarly, 2 people can communicate in Japanese using < Romaji >, without the aid of a Japanese font.

Indic languages might be harder to transliterate/romanize:

Quote

[newby:]
The Dravidian languages (Bengali, Hindi, et cetera) present the greatest difficulty due to a great number of diacritical and other marks.

Idea: There are ascii-only transliteration schemes for Bengali/Hindi/Tamil. But because I don't know any of these languages at all, I don't know whether Bengali/Hindi/Tamil speakers prefer the official schemes (with diacritics), or if they prefer the ascii-only schemes (more user-friendly??).

If I understand correctly, your goal is:
To input [any-language] characters using phonetic representation -- i.e. actual input, rather than transliteration/romanization.

If so, I think the existing combination of pinyin input methods (e.g. SCIM, and others) + Unicode/locale-specific font(s) is the only solution. (Your proposed new charset/font can be used to input other languages, but it cannot display them.)

For romanization: I think we should use ascii-only methods, rather than create a new charset. In the case of Indic languages (Bengali, Hindi, Tamil, etc), I don't know those languages at all; but in my uninformed opinion, I think the proposed new charset will be of very little use.
e.g.
If 2 people want to communicate in Tamil:
-- If no Tamil font is installed, then they would use the most compatible way of transliterating Tamil (ascii??). If so, they would not want to use an obscure charset...?
-- If a Tamil font is installed, then the proposed new font would be redundant?

Posted by newby on Mar. 13 2008,20:14

Quote (stupid_idiot @ Mar. 13 2008,13:06)

Quote

[newby:]
The Dravidian languages (Bengali, Hindi, et cetera) present the greatest difficulty due to a great number of diacritical and other marks.

"stupid_idiot",

I put your login in quotes, since it is obviously intended to be ironic, coming from one who responds so intelligently.

Yes, numbers can be used to indicate tones, that being Jyutping, Yale and Cantonese-Pinyin practice. I suspect that many find the 4 tone marks in Beijinghua to be more intuitive, since they are pictures of how the tones rise and fall. Personally, I would struggle through deciphering accented Pinyin, but would pass on the numeric notation unless I had a very strong motivation otherwise.

As for my intention, I'm focused on the first impression, a-la the first impression one receives when meeting another person. It is to make DSL more universally useful, without bloat.

Universally useful is similar to my standard test for new software. Boot it up and see if I can use it. If it is intuitive enough that I can do something useful, then I will read the "friendly" manual. I'd like to see a majority of people be able to boot DSL and use it well enough to seek further help and receive it.

Without bloat, to me, means an initial, small-footprint means of access that doesn't unduly burden the 50MB footprint. One 8-bit font, compared to N unicode fonts seems to be more in keeping with the DSL style.

Perhaps just including the Indic glyphs in the 128 - 255 range would be enough. I'm still looking at it.

What would be great is a system that allows someone to change languages and keyboards with a click of a button, as another poster mentioned. The $64,000 question: "Is that feasible within a 50MB footprint?" John and Robert are the ones to answer that question, and will wisely ignore it until there is something real to assess.

Posted by stupid_idiot on Mar. 14 2008,01:49

Quote

[newby:]
One 8-bit font, compared to N unicode fonts seems to be more in keeping with the DSL style.

< UTF-8 > is also 8-bit.
What are the reasons for making a new charset, instead of using a subset of the glyphs in UTF-8?
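A quick way to see what "8-bit" means for UTF-8: every character is stored as one or more 8-bit bytes, with plain ASCII taking one byte and other characters taking two to four. A small sketch (the helper name utf8_bytes is made up for illustration):

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of one character as a list of byte values."""
    return list(ch.encode("utf-8"))


# ASCII letter, a pinyin vowel with a tone mark, and a Chinese character.
for ch in ["a", "\u01ce", "\u4e2d"]:
    encoded = utf8_bytes(ch)
    print(repr(ch), len(encoded), [hex(b) for b in encoded])
```

So a UTF-8 text still travels through 8-bit channels, but a font covering it needs far more than 256 glyphs, which is a different constraint from the one-byte-per-character charset discussed in this thread.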

We could just as easily have one single UTF-8 font - rather than "N unicode fonts":
Last year, I tried a shareware Unicode font that includes all the glyphs from all major languages; it worked across all webpages (in Firefox) -- Russian, Japanese, Chinese, Tamil, Arabic...
The name of the font is Code2000.
The webpage is here:
< http://www.code2000.net/code2000_page.htm >
Quote from website:

Quote

The Code2000 download doesn't degrade or expire and there are no annoying pop-up screens. This has been left open-ended intentionally. In some cases, members of minority script user communities -- those who need a font like Code2000 the most -- can least afford it. Clearly, if registering the font means your family doesn't get enough food on the table, even for one meal, then it is not reasonable to register the font.

This is a great attitude!

I think this is a good idea for making a font -- even if we don't intend to make a full/complete font; only a smaller/essential one.
Other than that, I agree with all your points.
i.e. It should be easy to use; it should fit in a small footprint; etc...
Please by all means continue on this project!!
Good luck!!
(I cannot help you though -- I don't know programming at all.)