MathAction UnicodeIssues

This page has more details about the unicode issue introduced on the FriCASIssues page.

Support For Unicode

Mathematics makes heavy use of special symbols and of characters from other alphabets, such as the Greek alphabet. It seems very unfortunate that users of a panAxiom CAS are limited to a small set of characters from the ASCII character set. Some of the reasons for this are external to Axiom, such as the command line and Lisp code. One partial solution would be to use some escape sequence to encode Unicode characters in strings, variable names and operator names, and then let each OutputForm translate them into the appropriate coding. Each OutputForm may need a lookup table; for instance, to support the Greek alphabet in HTML-based formats we might need a table like this:

α = &alpha;
β = &beta;
Γ = &Gamma;  γ = &gamma;
Δ = &Delta;  δ = &delta;
ε = &epsilon;
ζ = &zeta;
η = &eta;
Θ = &Theta; θ = &theta;
ι = &iota;
κ = &kappa;
Λ = &Lambda; λ = &lambda;
μ = &mu;
ν = &nu;
Ξ = &Xi; ξ = &xi;
ο = &omicron;
Π = &Pi; π = &pi;
ρ = &rho;
Σ = &Sigma; σ = &sigma;
Τ = &Tau; τ = &tau;
Υ = &Upsilon; υ = &upsilon;
Φ = &Phi; φ = &phi;
χ = &chi;
Ψ = &Psi; ψ = &psi;
Ω = &Omega; ω = &omega;
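
A minimal sketch of such a lookup, written in Python rather than Spad purely for illustration. It uses the standard library table `html.entities.codepoint2name` instead of a hand-written one; the function name `to_html_entities` is hypothetical and is not part of any existing OutputForm code.

```python
from html.entities import codepoint2name


def to_html_entities(text: str) -> str:
    """Replace every non-ASCII character by an HTML character reference."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                         # plain ASCII passes through
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # named entity, e.g. &alpha;
        else:
            out.append(f"&#{cp};")                 # numeric reference as fallback
    return "".join(out)


print(to_html_entities("α^2 + Γ"))
```

A TeX or MathML formatter would need its own table with the same keys but different spellings.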

There is a possible hack suggested by Waldek, but it is output type dependent (not general) and involves messy code:

(3) -> )set output mathml on
(3) -> ("&alpha;"::Symbol)::Polynomial(Integer)

   (3)  &alpha;
 $α$ 

                                                    Type: Polynomial(Integer)
(4) -> )set output tex on
(4) -> )set output mathml off
(4) -> ("\alpha"::Symbol)::Polynomial(Integer)

   (4)  \alpha
$$
\alpha 
\leqno(4)
$$

                                                    Type: Polynomial(Integer)

What is needed is "complete" support for Unicode (or at least for a mathematical subset of Unicode). Namely, for something better than the hack above we really need a worked-out design.

Bill has suggested the following, although with it Unicode would be limited to the Symbol domain. See for example how subscripts are currently handled via

script:(Symbol, List List OutputForm) -> Symbol

When a Symbol is coerced to OutputForm and rendered as algebra, tex, mathml etc., subscripts and superscripts are displayed as appropriate to the presentation format. It seems to me that supporting Unicode symbols as part of the Symbol domain would not be so hard. So then for example:

_&alpha()^2

would be a polynomial and might render as

          2
alpha

\alpha^2

α²

etc.
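
The idea of one symbol with several format-specific renderings can be sketched as follows. This is Python, not Spad, and everything in it is an assumption made for illustration: the table `SPELLINGS`, the function `render_power` and the format names are hypothetical and do not describe how Symbol or OutputForm are implemented. Only non-negative integer exponents are handled.

```python
# Hypothetical per-format spellings of one symbol; not FriCAS data.
SPELLINGS = {
    "alpha": {
        "algebra": "alpha",
        "tex": r"\alpha",
        "html": "&alpha;",
        "unicode": "α",
    },
}

# Maps ASCII digits to the corresponding Unicode superscript digits.
SUPERSCRIPT_DIGITS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def render_power(name: str, exponent: int, fmt: str) -> str:
    """Render name^exponent in the requested presentation format."""
    base = SPELLINGS[name][fmt]
    if fmt == "tex":
        return f"{base}^{{{exponent}}}"
    if fmt == "html":
        return f"{base}<sup>{exponent}</sup>"
    if fmt == "unicode":
        return base + str(exponent).translate(SUPERSCRIPT_DIGITS)
    return f"{base}^{exponent}"  # one-line plain text fallback


for fmt in ("algebra", "tex", "html", "unicode"):
    print(fmt, render_power("alpha", 2, fmt))
```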

This would be very limited as it would not allow the use of unicode in interpreted and SPAD variable names.

The following is a very informative reply from Waldek on the forum:

First, the problem of processing "rich" character data appeared long ago and various approaches were tried. One worth mentioning was based on codepage switching: there are special sequences of characters which signal a change of the character set in use. That way it is possible to switch between 8-bit and 16-bit encodings (and theoretically 24-bit and wider ones are possible). And even using 8-bit encodings it is possible to use a rich character set. The Unicode movement began due to dissatisfaction with codepage switching: one of the design goals of Unicode was to avoid escape sequences and/or codepage switching. In other words, Unicode characters were supposed to have the same meaning regardless of neighbouring characters. Note that when escape sequences are in use you need to scan the whole string to determine whether a given character is part of an escape sequence (or whether its meaning is modified by one). So, what is wrong with escape sequences? Well, due to their stateful nature they very much complicate processing. To get some feel for the difficulties you may look at recent changes to the axiom/fricas script. The task was trivial: get a character string from one command line and put it in another command line. The problem was that on the way escape sequences were transformed ("expanded") by the shell and we had to effectively undo this transformation. The resulting code is not very complicated, but went through a few buggy iterations.

Why the rant above: as I wrote, escape sequences are a necessary evil when communicating via limited media, but should be avoided (at least logically) during processing (note that with escape sequences trivial tasks, like replacing all occurrences of the substring 'pha' in a given string by some other substring, become not so trivial). For processing one should use a format which is more or less free from escape sequences. Unicode seems to be quite good in this respect. Now, Unicode is not entirely free from escape sequences: there are so-called combining characters. Moreover, UTF-8 uses multiple octets to encode a single Unicode code point, and consequently the meaning of a given octet in a UTF-8 encoded string may depend on previous octets. However, Unicode (and UTF-8) were specially designed to avoid various bad effects. For example, the second octet of a multi-octet character cannot occur alone: just seeing this octet you know that there must be an octet before it belonging to the same character.
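
The property of UTF-8 mentioned last can be checked directly. The sketch below is in Python, chosen only for illustration; the helper names `is_continuation` and `char_start` are hypothetical. It relies on the fact that every octet after the first in a multi-octet UTF-8 character has the bit pattern 10xxxxxx.

```python
def is_continuation(octet: int) -> bool:
    """True for UTF-8 continuation octets, which have the bit pattern 10xxxxxx."""
    return (octet & 0xC0) == 0x80


def char_start(data: bytes, i: int) -> int:
    """Step back from index i to the first octet of the character containing it."""
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i


data = "a∑α".encode("utf-8")      # 1 + 3 + 2 octets
print([hex(b) for b in data])
print(char_start(data, 3))        # index 3 is inside the encoding of "∑"

# Every octet of a multi-octet character is >= 0x80, so a search for an
# ASCII substring can never match in the middle of such a character.
print(all(b >= 0x80 for b in "∑α".encode("utf-8")))
```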

Writing "Unicode is done" I meant that it makes no sense to invent a system of escape sequences to encode special characters _during processing_. Note: character entities on Web pages are logically a purely input/output mechanism. Logically, when a browser gets a page, character entities are replaced by characters. Then the page is parsed to get a tree and all interesting things happen at the level of the parse tree (or above).
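
A small illustration of why escapes belong at the input/output boundary, using the 'pha' replacement mentioned earlier in this reply. It is a Python sketch, not FriCAS code; `html.unescape` and the `xmlcharrefreplace` error handler are standard library features, and the sample string is made up.

```python
import html

raw = "&alpha; + alphabet"   # text as it arrives, containing an entity

# Processing the escaped form: the replacement also hits the entity name.
naive = raw.replace("pha", "PHA")

# Decode at input, process real characters, escape again only at output.
decoded = html.unescape(raw)
processed = decoded.replace("pha", "PHA")
encoded = processed.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(naive)
print(processed)
print(encoded)
```

In the naive version the entity itself is damaged, while the decoded version touches only the word it was meant to touch.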

Concerning Unicode in Spad: UTF-8 can be used in any 8-bit clean system. In particular using UTF-8 we can process Unicode in non-Unicode aware Lisp. And Unicode aware Lisp by definition can handle Unicode.

There are three problems with this. One is that all Unicode aware Lisps use UTF-32 encoding, which is different from UTF-8. Consequently we need different code when using a Unicode aware Lisp compared to a non-Unicode aware Lisp. The second problem is that Lisp typically performs some recoding on input/output and rejects invalid encodings. This is a problem because, at least theoretically, the needed encoding may not be installed. Also, at least clisp does not allow changing the encoding on an already open stream, which means that, for example, on standard input we are stuck with the encoding which was active when clisp started up. The third problem is that the current Spad character type is 8-bit. This means that with UTF-32 encoded strings we currently cannot extract characters from them without risking out-of-range character codes (but if one uses one-character long strings instead of characters, things should work fine). For UTF-8, 8-bit "characters" would work, but they would actually not be Unicode characters but octets.
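
Two of these points can be imitated in Python, which stands in here for a Lisp only by analogy: a strict decoder rejects invalid input, while an 8-bit clean representation carries UTF-8 octets unchanged. The variable names are made up for this sketch.

```python
stray = b"\xb1"   # a continuation octet with no lead octet before it
try:
    stray.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True   # a strict recoding layer refuses this input

# Latin-1 maps every octet to one 8-bit character, so it is "8-bit clean":
# UTF-8 data survives a round trip through it, but each "character" of the
# intermediate string is an octet, not a Unicode character.
as_octets = "α".encode("utf-8").decode("latin-1")
round_trip = as_octets.encode("latin-1").decode("utf-8")

print(rejected, len(as_octets), round_trip)
```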

Bottom line: to use Unicode we need to:

- decide what to do with the Spad character type (extending it to the 21 bits needed for UTF-32 looks trivial, but there may be some hidden gotcha). In particular we need to decide whether a Spad character corresponds to a Unicode code point or to a unit of the encoding (that is, an octet in the case of UTF-8).

- normalize ways of creating Unicode strings/characters. In particular how to represent them in source code.
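
The difference between code points and units of encoding, and the 21-bit figure, can be seen in a few lines. This is an illustrative Python sketch and says nothing about how the Spad character type is implemented.

```python
text = "∑α"

code_points = [ord(ch) for ch in text]             # one integer per Unicode character
utf8_octets = list(text.encode("utf-8"))           # units of the UTF-8 encoding
utf32_units = len(text.encode("utf-32-be")) // 4   # one 4-octet unit per code point

bits_needed = (0x10FFFF).bit_length()              # 0x10FFFF is the largest code point

print(code_points)
print(utf8_octets)
print(utf32_units, bits_needed)
```

With code points as characters the string has length 2; with UTF-8 octets as characters it has length 5.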

Currently we support Unicode in Unicode aware Lisps by using native support, and assume that otherwise we can store UTF-8 in String. We have extended the Spad character type to 21 bits.

Right now, using an sbcl-based FriCAS, you can do:

1) start FriCAS in Unicode setting, for example:

(export LC_CTYPE=en_US.UTF-8; fricas)

Make sure your console is set to UTF-8

2) In FriCAS:

char(8721)

You should see the summation sign ∑ (U+2211, which looks like a capital sigma) on the screen. In the same way you can create any Unicode character you want. To put several characters into a String, use 'ucodeToString' and 'concat', like:

concat("sigma: ", ucodeToString(8721))

Note: on a current Unicode aware Lisp you can just assign a character to a position in a string, but when UTF-8 is used to represent strings a single Unicode character is a multibyte sequence, and assigning to a portion of a string can garble it.
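
The garbling can be reproduced outside FriCAS. The Python sketch below treats a `bytearray` as a stand-in for a String holding UTF-8 octets; it is an analogy, not FriCAS behaviour.

```python
buf = bytearray("x∑".encode("utf-8"))   # 1 octet for "x", 3 octets for "∑"
buf[1] = ord("y")                        # overwrite only the lead octet of "∑"

try:
    buf.decode("utf-8")
    garbled = False
except UnicodeDecodeError:
    garbled = True    # two continuation octets are left without a lead octet

# Assigning at the level of code points keeps the string valid.
chars = list("x∑")
chars[1] = "y"
fixed = "".join(chars)

print(garbled, fixed)
```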

You can also do:

alpha := ucodeToString(945)::Symbol

alpha^2+1

In principle we could provide domain(s) which would give access to a large number of special symbols via names (I am not sure if this is really useful).

What does not work:

Functions from CharacterClass. More precisely, ASCII works as before and non-ASCII is ignored. Proper support requires getting a database of data about Unicode characters. Adapting the database for FriCAS is easy (IIUC the database can be obtained from the Unicode website), but adds bulk (IIUC it is several megabytes in size). Also, naively extending CharacterClass to full Unicode is likely to lead to poor performance (currently CharacterClass uses bit vectors of length 256; changing that to 1114112 means much bigger bit vectors and consequently much more work creating them and poorer cache utilization).
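
To get a feel for the sizes involved, here is a rough Python sketch of a bit-vector character class. The names `build_class` and `member` are hypothetical, and Python's built-in `str.isalpha` stands in for a real Unicode character database; this does not reproduce the CharacterClass implementation.

```python
def build_class(size: int, predicate) -> bytearray:
    """Bit vector with one bit per character code in range(size)."""
    bits = bytearray((size + 7) // 8)
    for code in range(size):
        if predicate(chr(code)):
            bits[code >> 3] |= 1 << (code & 7)
    return bits


def member(bits: bytearray, ch: str) -> bool:
    """Test membership; codes beyond the vector are treated as not present."""
    code = ord(ch)
    return code < 8 * len(bits) and bool(bits[code >> 3] & (1 << (code & 7)))


small = build_class(256, str.isalpha)        # 256 bits
large = build_class(0x110000, str.isalpha)   # 1114112 bits

print(len(small), len(large))                # sizes in octets
print(member(small, "α"), member(large, "α"))
```

Building the large vector visits every one of the 1114112 codes, which is the extra creation work mentioned above.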

For better support it is hard to give a timeline, because once you try to use Unicode you will find that some things do not work as expected, and some things suddenly are extremely slow and need to be rewritten to get reasonable speed.
