Odd newline and return characters in eTextPara


15 replies to this topic

#1 prl

  • Member
  • 20 posts

+1
Neutral

Posted 1 May 2018 - 09:16

I've been looking at some bugs in justification (dirBlock) in eTextPara (lib/gdi/font.cpp), and I've been puzzled by some odd alternatives for newline ('\n') and return ('\r') in eTextPara::renderString().

 

The alternatives for '\n' (U+000A) are U+008A (LINE TABULATION SET) and U+E08A (a character in the Unicode Private Use Area).

 

The alternatives for '\r' (U+000D) are U+0086 (START OF SELECTED AREA), U+0087 (END OF SELECTED AREA), U+E086 and U+E087 (characters in the Unicode Private Use Area).

 

They seem to be odd choices as alternatives for '\n' and '\r'. Are they actually used that way anywhere? Does anyone know the history of this? They seem to have been like that since 2003, so perhaps the answer is lost in the mists of time...
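
To make the grouping above concrete, here is a minimal standalone sketch of the classification. It is not the enigma2 code: the enum and the function name classify() are hypothetical, and it only restates which code points are treated like '\n' and which like '\r'.

```cpp
#include <cassert>
#include <cstdint>

// What a renderer might do with a decoded code point (hypothetical names).
enum class LineAction { Print, NewLine, CarriageReturn };

// Restates the alternatives listed above: '\n' plus its two odd aliases,
// and '\r' plus its four odd aliases. Everything else is printed.
LineAction classify(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF); // only valid Unicode code points are expected
    switch (cp)
    {
    case 0x000A: case 0x008A: case 0xE08A:
        return LineAction::NewLine;
    case 0x000D: case 0x0086: case 0x0087: case 0xE086: case 0xE087:
        return LineAction::CarriageReturn;
    default:
        return LineAction::Print;
    }
}
```

With this, classify(0x8A) and classify('\n') give the same result, which is exactly the behaviour that looks odd.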



Re: Odd newline and return characters in eTextPara #2 s3n0

  • Senior Member
  • 109 posts

+10
Neutral

Posted 1 May 2018 - 12:04

Hello.

 

I do not quite understand what you need to know. Can you please specify your query?

For decades, the escape sequence \n has been used to mark the move to the start of the next line. It comes from the history of Unix/Linux and the C programming language, and most programming and scripting languages still use it today to start a new line.

The difference lies mainly in how operating systems encode line endings in text files (including source code).

On Unix/Linux based systems:
- 1 byte is used
- LF (Line Feed)
- ASCII: 0Ah (10 in decimal)

On Windows systems:
- 2 bytes
- CR+LF (Carriage Return + Line Feed)
- ASCII: 0Dh + 0Ah (13 + 10 in decimal)
- i.e. return the "carriage" (the term comes from old typewriters) to the beginning of the line, then move down to the next line
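
As a small illustration of the two conventions listed above, this sketch converts Windows-style CR+LF pairs into a single LF. The function name normalizeLineEndings() is made up for this example and it works on plain bytes, not on decoded Unicode.

```cpp
#include <cassert>
#include <string>

// Convert CR+LF pairs (0Dh 0Ah) to a single LF (0Ah).
// A lone CR that is not followed by LF is left untouched.
std::string normalizeLineEndings(const std::string &in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] == '\r' && i + 1 < in.size() && in[i + 1] == '\n')
            continue; // drop the CR; the LF is copied on the next iteration
        out.push_back(in[i]);
    }
    assert(out.size() <= in.size()); // normalizing never grows the text
    return out;
}
```

Text that already uses LF only passes through unchanged, so the function can be applied to input from either kind of system.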

 

https://www.google.c...h?q=ascii table


Edited by s3n0, 1 May 2018 - 12:06.


Re: Odd newline and return characters in eTextPara #3 s3n0

  • Senior Member
  • 109 posts

+10
Neutral

Posted 1 May 2018 - 13:18

You will find even better information about ASCII and its history here:
https://en.wikipedia.org/wiki/ASCII


You can also look at the bottom of that page, from which I quote:
 

b.^ The Unicode characters from the area U+2400 to U+2421 reserved for representing control characters when it is necessary to print or display them rather than have them perform their intended function. Some browsers may not display these properly.

 

I personally do not use Unicode unless it is necessary. Unicode-specific control characters only cause additional compatibility issues, whereas the ASCII control codes (0 to 31) are still supported almost everywhere. For example, when you send a bell code (ASCII 07h, BEL) to a printer that is switched to "text mode", the printer really beeps.

In most programming languages, "\n" is now used to start a new line. "\r\n" is used only rarely.
Even in Visual Basic, the "\n" tag stands for the two control codes CR + LF. But the ASCII code for "\n" (0Ah) should mean only LF (not CR + LF).

 

https://www.develope...line-feed-chars :

13 years ago
by Darius

You can use Chr(10) & Chr(13)

or

vbCrLf = "\n" # Carriage returnlinefeed combination
vbCr = chr(13) # Carriage return character
vbLf = chr(10) # Linefeed character
vbNewLine = "\n" # Platform-specific new line character; whichever is appropriate for current platform

Edited by s3n0, 1 May 2018 - 13:20.


Re: Odd newline and return characters in eTextPara #4 prl

  • Member
  • 20 posts

+1
Neutral

Posted 1 May 2018 - 13:26

I know what the CR and LF characters do and their respective roles in Unix and DOS-based text files.

 

I'm asking why eTextPara::renderString() allows the use of Unicode characters that don't have a function that is anything like either CR or LF to act as CR and LF. I.e. why do U+008A (LINE TABULATION SET), U+0086 (START OF SELECTED AREA), U+0087 (END OF SELECTED AREA) and the Unicode Private Use Area characters U+E08A, U+E086 and U+E087 appear in this code from eTextPara::renderString():

int eTextPara::renderString(const char *string, int rflags, int border)
{
	...
			switch (chr)
			{
			...
			case 0x8A:
			case 0xE08A:
			case '\n':
newline:			isprintable=0;
				newLine(rflags);
				nextflags|=GS_ISFIRST;
				break;
			case '\r':
			case 0x86: case 0xE086:
			case 0x87: case 0xE087:
nprint:				isprintable=0;
				break;

They seem quite inappropriate character codes to give the same functions as CR and LF, unless there is a data source used by enigma2 somewhere that uses those characters in that way.

 

In other words, would anything break if the case entries for 0x8A, 0x86, 0x87, 0xE08A, 0xE086 and 0xE087 were removed from that switch statement?
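
One way to get evidence before removing anything would be to count how often those six values actually turn up in decoded text. The following is only a sketch under that assumption: isLegacyAlias() and countLegacyAliases() are hypothetical helpers, not functions that exist in enigma2, and the input is assumed to be already-decoded code points.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// True for the six extra case labels in the switch statement quoted above.
bool isLegacyAlias(std::uint32_t cp)
{
    switch (cp)
    {
    case 0x86: case 0x87: case 0x8A:
    case 0xE086: case 0xE087: case 0xE08A:
        return true;
    default:
        return false;
    }
}

// Count how often each alias occurs in already-decoded text.
std::map<std::uint32_t, int> countLegacyAliases(const std::vector<std::uint32_t> &text)
{
    std::map<std::uint32_t, int> hits;
    for (std::uint32_t cp : text)
        if (isLegacyAlias(cp))
            ++hits[cp];
    assert(hits.size() <= 6); // at most one entry per alias
    return hits;
}
```

If a counter like this stayed empty across real EPG and skin data, that would suggest (but not prove) that the extra case entries are dead code.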

 

I'm not sure what the Unicode characters U+2400 to U+2421 have to do with my question. I didn't ask about them, and they don't appear in the code I'm asking about.



Re: Odd newline and return characters in eTextPara #5 WanWizard

  • Forum Moderator
    PLi® Core member
  • 41,201 posts

+651
Excellent

Posted 1 May 2018 - 13:29

Just a wild guess: EPG data that is being rendered?


Many answers to your question can be found in our new and improved wiki.

Currently in active use: VU+Solo 4K (1xFBC, 2xS2), VU+Zero, Edision OS mini+, Amiko Viper 2TC, Zgemma H3.2TC, Zgemma H6

For testing purposes: XP1000, Formuler F1 (2xS2), Miraclebox Premium Micro (S2+C/T), ET7500 (S2), ET8500 (S2), Zgemma H2.H (S2+C), Zgemma H5.2TC, SAB TripleAlpha (S2+C/T), Galaxy 4K (FBC), VU Zero 4K, HD2400 (4xS2), ET10000 (4xS2), VU+Duo2 (1xS2), Edision OS nino


Re: Odd newline and return characters in eTextPara #6 s3n0

  • Senior Member
  • 109 posts

+10
Neutral

Posted 1 May 2018 - 15:03

Maybe this is UTF-8 encoding? UTF-8 uses a variable number of bytes for each Unicode character. Is that why there are so many codes there?

http://www.utf8-char...f8=0x&htmlent=1

 

The advantage of UTF-8 is that it uses a variable code length: each character is encoded in 1 to 4 bytes (8 to 32 bits), only as many as it needs.

A text file in standard Unicode encoding occupies much more disk space than the same file in UTF-8.

Standard Unicode uses 2 bytes for each character, expandable to 4 bytes. So a file containing only basic ASCII characters (0 to 127), such as program source code, will be twice the size. This may slow down interpreters such as Python, Java, Perl and so on, which execute code directly from a file instead of using compiled binaries. With classic Unicode file encoding, the interpreter must read much more data than it would from a plain ASCII or UTF-8 file.

The answer to your question is:

Maybe it's UTF-8? In that case this special code would take up to 3 bytes instead of the original 2 bytes of classic Unicode.
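
The 1 to 4 byte lengths mentioned above can be checked with a small encoder. This is a generic sketch of UTF-8 encoding, with encodeUtf8() as a made-up name; it is not taken from enigma2.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
std::string encodeUtf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);             // outside the Unicode range
    assert(cp < 0xD800 || cp > 0xDFFF); // surrogates cannot be encoded
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp); // ASCII: 1 byte
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6)); // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12)); // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18)); // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For the values in this thread, U+000A is the single byte 0A, U+008A becomes the two bytes C2 8A, and U+E08A becomes the three bytes EE 82 8A.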



Re: Odd newline and return characters in eTextPara #7 WanWizard

  • Forum Moderator
    PLi® Core member
  • 41,201 posts

+651
Excellent

Posted 1 May 2018 - 15:12

The default for everything in the image is UTF-8. But EPG data isn't always...




Re: Odd newline and return characters in eTextPara #8 Erik Slagter

  • PLi® Core member
  • 43,096 posts

+467
Excellent

Posted 1 May 2018 - 18:15

Just a wild guess: EPG data that is being rendered?

Exactly what I was going to say.

 

Going further, I think some EPG provider uses these completely non-standard codes, and someone thought it easy/quick/simple to "fix" it here, which I think is awful.

 

What does git blame say about these lines?


* Wavefrontier T90 with 28E/23E/19E/13E/9E/4.8E/0.8W/5W via SCR switches 2 x 2 x 6 user bands
* Ziggo digital cable TV (FTA)
I don't read PM -> if you have something to ask or to report, do it in the forum so others can benefit. I don't take freelance jobs.
Ik lees geen PM -> als je iets te vragen of te melden hebt, doe het op het forum, zodat anderen er ook wat aan hebben.

Re: Odd newline and return characters in eTextPara #9 Erik Slagter

  • PLi® Core member
  • 43,096 posts

+467
Excellent

Posted 1 May 2018 - 18:18

Maybe this is UTF-8 encoding? UTF-8 uses a variable number of bytes for each Unicode character. Is that why there are so many codes there?

http://www.utf8-char...f8=0x&htmlent=1

 

The advantage of UTF-8 is that it uses a variable code length: each character is encoded in 1 to 4 bytes (8 to 32 bits), only as many as it needs.

A text file in standard Unicode encoding occupies much more disk space than the same file in UTF-8.

Standard Unicode uses 2 bytes for each character, expandable to 4 bytes. So a file containing only basic ASCII characters (0 to 127), such as program source code, will be twice the size. This may slow down interpreters such as Python, Java, Perl and so on, which execute code directly from a file instead of using compiled binaries. With classic Unicode file encoding, the interpreter must read much more data than it would from a plain ASCII or UTF-8 file.

The answer to your question is:

Maybe it's UTF-8? In that case this special code would take up to 3 bytes instead of the original 2 bytes of classic Unicode.

There is no such thing as "standard Unicode". I think you are referring to UTF-16, which is just another Unicode transport format, and is commonly used by Windows (and as far as I know Windows only). The rest of the world uses the less wasteful UTF-8 transport format.

 

I think the "weird" EPG may be UTF-8 encoded just as well, just some "private" code points are used otherwise (properly or not).


Edited by Erik Slagter, 3 May 2018 - 18:38.


Re: Odd newline and return characters in eTextPara #10 prl

  • Member
  • 20 posts

+1
Neutral

Posted 2 May 2018 - 01:59

At the point in the code I'm asking about, the characters have already been decoded into Unicode (the decoding is done in eTextPara::renderString() before the switch statement is reached). This isn't a question about encoded character forms (like UTF-8), other than perhaps whether some data source, somewhere, is being decoded into these strange Unicode values and is using them to represent CR or LF.

 

When I talk about Unicode characters, I mean single Unicode codepoints in the allowed range U+0000 - U+10FFFF, not variable-length multibyte encodings of them like UTF-8. A single UTF-8 byte can't possibly have a value like 0xE08A.
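
To illustrate the distinction between encoded bytes and decoded codepoints, here is a rough sketch of a UTF-8 decoder. It is not the decoder used in eTextPara::renderString(); decodeUtf8() is a hypothetical name, and overlong forms and surrogates are deliberately not rejected to keep it short.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Decode UTF-8 bytes into Unicode code points.
// Malformed or truncated sequences are replaced by U+FFFD.
std::vector<std::uint32_t> decodeUtf8(const std::string &bytes)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        std::uint32_t cp;
        int extra; // number of continuation bytes still expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); continue; } // stray continuation byte

        bool ok = true;
        for (; extra > 0; --extra)
        {
            if (i >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80)
            {
                ok = false; // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i++]) & 0x3F);
        }
        out.push_back(ok ? cp : 0xFFFD);
    }
    assert(out.size() <= bytes.size()); // never more code points than bytes
    return out;
}
```

The three bytes EE 82 8A decode to the single codepoint U+E08A, while a lone byte 8A is not valid UTF-8 at all and comes out as U+FFFD here. So a switch on decoded values only sees 0xE08A if the input really contained that three-byte sequence, or if it was decoded from some other encoding.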

 

Also, I'm not encountering data from an EPG (or elsewhere) that decodes into strange Unicode codepoints like U+008A (LINE TABULATION SET). I'm asking why the code treats those values as equivalent to CR or LF, when they don't actually represent control functions that resemble either CR or LF.

 

Has anyone responding to my question looked at the code I'm referring to?

 

If the answer to my question is "no-one knows, just leave it as it is because removing those case entries from the switch statement might cause problems somewhere" that is a reasonable, though disappointing, answer.



Re: Odd newline and return characters in eTextPara #11 prl

  • Member
  • 20 posts

+1
Neutral

Posted 2 May 2018 - 02:03

There is no such thing as "standard Unicode".

 

What does: Julie D. Allen. The Unicode Standard, Version 6.0, The Unicode Consortium, Mountain View, 2011, ISBN 9781936213016 describe, then?

 

There are, though, a multiplicity of encodings of Unicode codepoints into multi-byte strings, the most common of which is UTF-8.



Re: Odd newline and return characters in eTextPara #12 Erik Slagter

  • PLi® Core member
  • 43,096 posts

+467
Excellent

Posted 3 May 2018 - 18:40

 

There is no such thing as "standard Unicode".

 

What does: Julie D. Allen. The Unicode Standard, Version 6.0, The Unicode Consortium, Mountain View, 2011, ISBN 9781936213016 describe, then?

 

There are, though, a multiplicity of encodings of Unicode codepoints into multi-byte strings, the most common of which is UTF-8.

That is the Unicode Standard. That is not quite the same as "standard Unicode", where one refers to one single transport encoding as being "the standard" which is simply not true.



Re: Odd newline and return characters in eTextPara #13 Erik Slagter

  • PLi® Core member
  • 43,096 posts

+467
Excellent

Posted 3 May 2018 - 18:41

Has anyone responding to my question looked at the code I'm referring to?

Yes, I did. And I think I understand what you're referring to; see #8, which you didn't react to:

 

 

Just a wild guess: EPG data that is being rendered?

Exactly what I was going to say.

 

Going further, I think some EPG provider uses these completely non-standard codes, and someone thought it easy/quick/simple to "fix" it here, which I think is awful.

 

What does git blame say about these lines?



Re: Odd newline and return characters in eTextPara #14 s3n0

  • Senior Member
  • 109 posts

+10
Neutral

Posted 4 May 2018 - 20:43

@Erik:

 

Common, standard, basic, main, unbundled, classical, original, old, ... it does not matter at all how I write it. Everybody understands it.

If we wanted to be engineers and scientists about it, we could spend a few days writing out the history, starting for example with ISO/IEC 10646, UCS-2 (BOM code), etc. :) . There is a lot of history. Unicode itself has been standardized several times (in versions). The Unicode Consortium, which publishes "The Unicode Standard", is the only organization entitled to define Unicode standards. Only in the last few years has Unicode finally become usable for almost all countries in the world. Many years ago, using Unicode was like calling into the void (see Unicode history). At that time, people simply said "ANSI" :) .

I just wanted to point out the distinction between UTF-8 and classic Unicode, so that it was clear what I meant. If there were a term "Unicode Classic", would you correct me again because no "classic Unicode" exists? That the correct term is "Unicode Classic" and not "classic Unicode"? Seriously?!



Re: Odd newline and return characters in eTextPara #15 Erik Slagter

  • PLi® Core member
  • 43,096 posts

+467
Excellent

Posted 5 May 2018 - 08:34

Indeed, that is exactly what I am going to say. You don't understand the point I am trying to make. There is no such thing as "classic" Unicode either.

As I said before, I think you're referring to UTF-16, one of the many transport formats of Unicode, which Microsoft (and afaik Microsoft only) uses. In this encapsulation, Unicode has never been a great success, just like e.g. UTF-32 and UTF-7. The success came with UTF-8, which enigma2 also uses (because it's more or less "standard" on Linux).

 

We (as in: Linux and Enigma2) have no relation to UTF-16 whatsoever, just the Unicode codepoints.



Re: Odd newline and return characters in eTextPara #16 prl

  • Member
  • 20 posts

+1
Neutral

Posted 6 May 2018 - 00:14

Sorry for the delay getting back on this. I've been busy with other stuff.

 

Just a wild guess: EPG data that is being rendered?

 

That was my guess for where the weird characters that were being handled as CR and LF were coming from, but so far no-one seems to know just what the source is, and perhaps that knowledge has been lost.

 

I found the odd characters when I was trying to work out why an EPG source (IceTV, Australian subscription EPG and series programming service) that had <CR><LF> in its description data wasn't rendering correctly for halign="block". I also found a few other bugs around non-printing characters in block alignment, and wondered just what these other odd CR & LF codes were for and whether they were really needed.
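
To show the kind of interaction involved, here is a simplified sketch of block alignment that treats '\r' as non-printing when splitting a line into words. It is not the eTextPara implementation: splitWords() and justifyGaps() are made-up names, and a fixed glyph width (charWidth) stands in for real proportional font metrics.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split one line into words, skipping '\r' so that the CR of a <CR><LF>
// pair never becomes part of a word or adds to its width.
std::vector<std::string> splitWords(const std::string &line)
{
    std::vector<std::string> words;
    std::string current;
    for (char c : line)
    {
        if (c == '\r')
            continue; // non-printing: takes no space
        if (c == ' ')
        {
            if (!current.empty())
            {
                words.push_back(current);
                current.clear();
            }
        }
        else
            current += c;
    }
    if (!current.empty())
        words.push_back(current);
    return words;
}

// Width of each gap between words so that the words fill lineWidth exactly.
std::vector<int> justifyGaps(const std::string &line, int lineWidth, int charWidth)
{
    const std::vector<std::string> words = splitWords(line);
    std::vector<int> gaps;
    if (words.size() < 2)
        return gaps; // nothing to stretch
    int used = 0;
    for (const std::string &w : words)
        used += static_cast<int>(w.size()) * charWidth;
    const int gapCount = static_cast<int>(words.size()) - 1;
    const int spare = lineWidth - used;
    assert(spare >= 0); // the words must fit on the line
    for (int i = 0; i < gapCount; ++i)
        gaps.push_back(spare / gapCount + (i < spare % gapCount ? 1 : 0));
    return gaps;
}
```

In this sketch a line ending in '\r' gets the same gaps as the same line without it, which is the behaviour I would expect from block alignment when the source text contains <CR><LF>.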

 

I suspect the answer to my question about whether the additional codes (0x8A, 0x86, 0x87 etc) could be safely removed from the code is "no-one knows, so best leave them there".

 

The bugs I was working on are issues 641-645 in the Beyonwiz issue tracker.





