RC 9.0 - Problems with Windows filenames that contain Umlauts


116 replies to this topic

Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #101 athoik

  • PLi® Core member
  • 8,458 posts

+327
Excellent

Posted 29 February 2024 - 22:40

>>> import os
>>> os.listdir()
['München.png', 'M\udc81nchen.png']

I was able to reproduce the issue, after extracting the zip from post #7.

 

The coreutils ls is much more informative.

ls
M?nchen.png  München.png
ls -b
M\201nchen.png  München.png

So M\201nchen.png is crashing e2.
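
For illustration, the `\udc81` in the Python listing and the `\201` (octal for 0x81) in the `ls -b` output are the same byte. A minimal sketch using only the standard library; the byte string is taken from the listing above, and the idea that a strict UTF-8 encode somewhere downstream is what trips over the name is an assumption, not something verified against the e2 code:

```python
# Raw name as stored on disk: in cp437, "ü" is the single byte 0x81.
raw = b"M\x81nchen.png"

# With a str path, os.listdir() decodes names using the surrogateescape
# handler, so the invalid byte 0x81 becomes the lone surrogate U+DC81.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))  # 'M\udc81nchen.png'

# The str works for file I/O, but it is not valid Unicode text:
# a strict UTF-8 encode of it raises UnicodeEncodeError.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)  # False
```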


Edited by athoik, 29 February 2024 - 22:41.

Wavefield T90: 0.8W - 1.9E - 4.8E - 13E - 16E - 19.2E - 23.5E - 26E - 33E - 39E - 42E - 45E on EMP Centauri DiseqC 16/1
Unamed: 13E Quattro - 9E Quattro on IKUSI MS-0916

Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #102 Huevos

  • PLi® Contributor
  • 4,247 posts

+158
Excellent

Posted 29 February 2024 - 23:18

https://github.com/O...cc543db95f7d8eb

 

That patch seems to work for the ePicload issue. The other problems are cured by the patches I already linked to.

# filename fixes in eServiceReference and MovieList (fixes enigma crash)
https://github.com/OpenViX/enigma2/commit/2c74d42983c6969c0d2ec87b3c48622ba0ff3a45
https://github.com/OpenViX/enigma2/commit/6efb844aba07357c3285e120bde4dc048e2bbad8
https://github.com/OpenViX/enigma2/commit/93fa129f5555bef5e24442d1654d28656359f5e2
https://github.com/OpenViX/enigma2/commit/d9f180b06300586bf7f1e9fa34eed310937e2bef
https://github.com/OpenViX/enigma2/commit/69bb2ba48b5376c5f303b46fd399d8bf53cf28c7

# similar fixes for file_eraser and Trashcan
https://github.com/OpenViX/enigma2/commit/29a92e82ce71df88b6d48af0968df2833cab022b
https://github.com/OpenViX/enigma2/commit/ab86bf81616e75aa3928a82be90d6a91bc5aa220
https://github.com/OpenViX/enigma2/commit/9bb758215a1d7996be4c4a4f880ed9ab1d5758ab
https://github.com/OpenViX/enigma2/commit/69bb2ba48b5376c5f303b46fd399d8bf53cf28c7
https://github.com/OpenViX/enigma2/commit/9624cf8a1df2eca5a11b6810b63f9b972e408076

 

 



Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #103 WanWizard

  • PLi® Core member
  • 68,625 posts

+1,739
Excellent

Posted 1 March 2024 - 13:47

But you have not applied OpenVix patches to eServiceReference? It was 2022 but I don't remember getpath() needed fixing in python. getItemDisplayNameText function was added to movielist.

 

I haven't done anything. If only because I haven't seen a proper fix yet, just a lot of workarounds...

 

And no, getPath() doesn't need a fix in Python, it needs a fix in the C code, like I've been saying all along. getPath() crashes as soon as it is called, because it returns a str object with invalid content.

 

This ( https://github.com/O...c48622ba0ff3a45 ), the change to the typemap, doesn't fix the problem, it only prevents the crash in getPath(), you still end up with invalid data in Python.

 

You could (and from the looks of it did) follow up with another workaround on the Python side by converting the str back to bytes, and then trying to detect the encoding and convert correctly, but ideally that should happen on the C side, so the data passed back to Python is valid.

 

This is true for both eServiceReference and ePicLoad (same issue).


Edited by WanWizard, 1 March 2024 - 13:54.

Currently in use: VU+ Duo 4K (2xFBC S2), VU+ Solo 4K (1xFBC S2), uClan Usytm 4K Pro (S2+T2), Octagon SF8008 (S2+T2), Zgemma H9.2H (S2+T2)

Due to my bad health, I will not be very active at times and may be slow to respond. I will not read the forum or PM on a regular basis.

Many answers to your question can be found in our new and improved wiki.


Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #104 Huevos

  • PLi® Contributor
  • 4,247 posts

+158
Excellent

Posted 1 March 2024 - 14:26

 

But you have not applied OpenVix patches to eServiceReference? It was 2022 but I don't remember getpath() needed fixing in python. getItemDisplayNameText function was added to movielist.

 

This ( https://github.com/O...c48622ba0ff3a45 ), the change to the typemap, doesn't fix the problem, it only prevents the crash in getPath(), you still end up with invalid data in Python.

 

That accepts a bytestring from Python for use in C++. Nothing to do with sending data to Python.


Edited by Huevos, 1 March 2024 - 14:30.


Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #105 WanWizard

  • PLi® Core member
  • 68,625 posts

+1,739
Excellent

Posted 1 March 2024 - 15:07

That accepts a bytestring from Python for use in C++. Nothing to do with sending data to Python.

 

Yes, I know.

 

It's been a while since I've looked into this, and as you know, my (mental) health isn't all that. My Asperger's doesn't really help once my brain has taken a left turn... ;)

 

Taking a step back, the root cause of this issue is that for functional purposes, the original path needs to be stored, otherwise the resource can't be accessed.

 

The directory content is enumerated in Python, which creates eServiceReference objects for each entry found. This passes on the incorrectly encoded filename to C++.

 

The typemap is there to make sure the resulting string is utf-8 encoded, and uses surrogateescape to ensure that when the string is passed back to Python ( via getPath() ), it is valid utf-8 (but not correct for display) and Python doesn't crash.

 

This allows the returned path to be used in Python ( which requires conversion back to Bytes, which works because surrogateescape is lossless ), and allows for encoding detection on that byte string, and consequently, proper conversion.
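
The round trip described here can be sketched in plain Python. This is only an illustration under assumptions: `to_raw` and `display_name` are hypothetical helpers, and the cp437 fallback is taken from the example file in this thread rather than from any real encoding detection.

```python
def to_raw(path: str) -> bytes:
    """Recover the original on-disk bytes from a surrogate-escaped str."""
    return path.encode("utf-8", errors="surrogateescape")


def display_name(path: str, fallback: str = "cp437") -> str:
    """Return a name suitable for display (hypothetical helper).

    Valid UTF-8 is returned unchanged; anything else is re-decoded with
    the fallback code page, which is an assumption, not a detected value.
    """
    raw = to_raw(path)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


escaped = "M\udc81nchen.png"     # the kind of value os.listdir() returns
raw = to_raw(escaped)            # lossless, still usable for file I/O
shown = display_name(escaped)    # readable name for the GUI
```

With the example name, `raw` is `b'M\x81nchen.png'` and `shown` is `'München.png'`; a name that is already valid UTF-8 passes through unchanged.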

 

Am I correct, @ocean04 ?




Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #106 Huevos

  • PLi® Contributor
  • 4,247 posts

+158
Excellent

Posted 1 March 2024 - 17:06

"The typemap is there to make sure the resulting string is utf-8 encoded".

 

So that is not true.

 

Let me try to explain with simple examples. Let's look at "fileeraser" and "trashcan":

	for root, dirs, files in os.walk(trash, topdown=False):
		for name in files:
			fn = os.path.join(root, name)
			try:
				fn = fn.encode(encoding="utf8", errors="ignore").decode(encoding="utf8")	# ensure string is all utf-8, if dataset name changed, erase will handle not found.  			
				enigma.eBackgroundFileEraser.getInstance().erase(fn)
			except Exception as e:
				print("[Trashcan] Failed to erase file:", name, "   ", e)
fn = fn.encode(encoding="utf8", errors="ignore").decode(encoding="utf8")	

So using "ignore" is not going to work: we end up with a filename that is wrong and therefore impossible to erase. Instead, we want to read the filenames as bytes:
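
To make the failure mode concrete, here is a small sketch of what the errors="ignore" round trip does to the example name from this thread. The directory in the path is hypothetical; only the encode/decode call mirrors the quoted line.

```python
# Name as os.walk() returns it for a str path (0x81 escaped as U+DC81).
fn = "/media/hdd/.Trash/M\udc81nchen.png"   # hypothetical location

# The escaped byte cannot be encoded as UTF-8, so "ignore" silently drops it.
cleaned = fn.encode(encoding="utf8", errors="ignore").decode(encoding="utf8")

print(cleaned)         # /media/hdd/.Trash/Mnchen.png
print(cleaned == fn)   # False
```

The cleaned string no longer names any file on disk, so the erase request cannot find its target.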

for root, dirs, files in os.walk(trash.encode(), topdown=False):	# handle non utf-8 filenames

To do that we encode the input.

trash.encode()

Now python returns bytes, not strings from the directory walk.
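
A minimal sketch of the bytes-based walk, run against a throwaway directory. `walk_files` is a hypothetical helper name and the real Trashcan code may be structured differently; the file name used here is plain ASCII so the sketch runs on any filesystem.

```python
import os
import tempfile


def walk_files(top: str):
    """Yield full file paths as bytes, so non-UTF-8 names survive unchanged."""
    # Passing bytes to os.walk() makes it return bytes for root, dirs and files.
    for root, dirs, files in os.walk(top.encode(), topdown=False):
        for name in files:
            yield os.path.join(root, name)


with tempfile.TemporaryDirectory() as trash:
    # Create one empty file so the walk has something to report.
    open(os.path.join(trash, "example.ts"), "wb").close()
    paths = list(walk_files(trash))
    print(paths)
```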

 

So now we want to pass that bytestring to "file_eraser.cpp". The result is a crash. C++ is expecting type std::string&, not bytes.

 

To accept the bytestring we add the typemap (as an extra, the old code still works on string inputs).

 

So now the bytestring from os.walk is fed to "file_eraser.cpp" and the file is successfully erased.

 

-------------------------------------------------------------------

 

You can play with the chardet code in the Python interpreter.

It stops the Python crash and displays the correct characters in the GUI.

 

 



Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #107 WanWizard

  • PLi® Core member
  • 68,625 posts

+1,739
Excellent

Posted 2 March 2024 - 15:05

I have no clue what you're on about. The typemap is implemented in eServiceReference, which has nothing to do with eBackgroundFileEraser ?

 

I don't need to be lectured about strings, bytes, encode() and decode(), I know what they do and how they work.

 

And I made a point of the fact I understand the need to preserve the original path because that is needed for all I/O operations (which include delete), and the fact that surrogateescape does that.




Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #108 Huevos

  • PLi® Contributor
  • 4,247 posts

+158
Excellent

Posted 2 March 2024 - 15:45

Just like eBackgroundFileEraser the typemap in eServiceReference follows exactly the same reasoning... and I am not lecturing anyone... you asked how it works.



Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #109 WanWizard

  • PLi® Core member
  • 68,625 posts

+1,739
Excellent

Posted 2 March 2024 - 17:50

No, I asked @ocean04 to confirm if my reasoning was correct, to which the correct answer would have been a very short "yes".

 

Your explanation didn't add anything new.




Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #110 Huevos

  • PLi® Contributor
  • 4,247 posts

+158
Excellent

Posted 2 March 2024 - 18:41

The answer is no. The type map is not to ensure that the input is correct utf8.

Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #111 WanWizard

  • PLi® Core member
  • 68,625 posts

+1,739
Excellent

Posted 2 March 2024 - 22:48

I didn't write that.




Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #112 Huevos

  • PLi® Contributor
  • 4,247 posts

+158
Excellent

Posted 2 March 2024 - 23:32

In post #105

The typemap is there to make sure the resulting string is utf-8 encoded

Maybe I didn't understand what you were asking, but it is not what the typemap is for.


Edited by Huevos, 2 March 2024 - 23:33.


Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #113 athoik

  • PLi® Core member
  • 8,458 posts

+327
Excellent

Posted 26 March 2024 - 22:29

diff --git a/lib/service/iservice.h b/lib/service/iservice.h
index 9ec1009f4c..be11e47582 100644
--- a/lib/service/iservice.h
+++ b/lib/service/iservice.h
@@ -57,6 +57,9 @@ public:
        std::string getPath() const { return path; }
        void setPath( const std::string &n ) { path=n; }

+       // getRawPath will return a bytes object in python
+       std::vector<char> getRawPath() const { return std::vector<char>(path.begin(), path.end()); }
+
        unsigned int getUnsignedData(unsigned int num) const
        {
                if ( num < sizeof(data)/sizeof(int) )

 

How about using a new property like getRawPath?

 

A returned std::vector<char> will be converted directly into bytes by SWIG.

 

Please note that os.listdir returns both unicode string and bytes depending on the input...

 

>>> os.listdir(b'.')
[b'M\xc3\xbcnchen.png', b'M\x81nchen.png']
>>> os.listdir(b'.')[1].decode("utf-8",errors='surrogateescape')
'M\udc81nchen.png'
>>> os.listdir(b'.')[1].decode('cp437')
'München.png'
...
>>> list(os.walk(b'.'))
[(b'.', [], [b'M\xc3\xbcnchen.png', b'M\x81nchen.png'])]
>>> list(os.walk('.'))
[('.', [], ['München.png', 'M\udc81nchen.png'])]

 

https://docs.python....icode-filenames

 

The os.listdir() function returns filenames, which raises an issue: should it return the Unicode version of filenames, or should it return bytes containing the encoded versions? os.listdir() can do both, depending on whether you provided the directory path as bytes or a Unicode string.

 

What do you think, would it be possible to use getRawPath().decode("utf-8",errors='surrogateescape')?
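
On the Python side, the proposal might look roughly like this. `FakeRef` is a stand-in written for this sketch (the real eServiceReference is a SWIG-wrapped C++ class), the path is hypothetical, and getRawPath() is the method proposed in the diff above, not an existing API.

```python
class FakeRef:
    """Stand-in for eServiceReference, only for this sketch."""

    def __init__(self, raw_path: bytes):
        self._raw = raw_path

    def getRawPath(self) -> bytes:
        # Proposed method: hand the stored path back as bytes, untouched.
        return self._raw


ref = FakeRef(b"/media/hdd/M\x81nchen.png")   # hypothetical path

# Decode the way os.listdir() does for str paths on a UTF-8 system.
path = ref.getRawPath().decode("utf-8", errors="surrogateescape")

# The result is a normal str, so existing str-based checks keep working...
is_png = path.endswith(".png")
# ...and the original bytes can always be recovered for file I/O.
same = path.encode("utf-8", errors="surrogateescape") == ref.getRawPath()
print(is_png, same)  # True True
```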



Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #114 littlesat

  • PLi® Core member
  • 56,274 posts

+691
Excellent

Posted 27 March 2024 - 07:03

I think this is the first reply that might lead to a structured solution.

WaveFrontier 28.2E | 23.5E | 19.2E | 16E | 13E | 10/9E | 7E | 5E | 1W | 4/5W | 15W


Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #115 WanWizard

  • PLi® Core member
  • 68,625 posts

+1,739
Excellent

Posted 27 March 2024 - 15:05

That depends on whether we want to work around it (which this does), or fix it.

 

A fix would be (with my non-existent C++ and SWIG knowledge):

-    std::string getPath() const { return path; }
+    std::vector<char> getPath() const { return std::vector<char>(path.begin(), path.end()); }

causing getPath() to return bytes again, like it did in Python2.

The rest can then be handled in the python code.
 

If we go for getRawPath(), we introduce a new method which requires changes through the entire E2 code (as getPath() is useless, you can only use it if you know up front it will contain valid utf8, so all getPath() calls have to be replaced).




Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #116 athoik

  • PLi® Core member
  • 8,458 posts

+327
Excellent

Posted 27 March 2024 - 19:32

Sure, changing getPath will solve the issue with corrupted filenames (or, better said, filenames created using an encoding other than UTF-8, such as cp437).

 

On the other hand, everywhere in the code we expect getPath to return a Unicode string.

 

So code that compares strings with bytes will start to fail.

 

>>> if 'hello' in b'hello': print("hello!")
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: a bytes-like object is required, not 'str'
>>> if b'hello' in 'hello': print("hello!")
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'in <string>' requires string as left operand, not bytes
>>> if 'hello' in b'hello'.decode('utf-8'): print("hello!")
...
hello!
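
One possible way to keep such checks working during a transition is to normalise at the boundary. `as_text` is a hypothetical helper, not something from the enigma2 code base, and the paths are made up for the example.

```python
def as_text(path) -> str:
    """Return a str for either bytes or str input (hypothetical helper)."""
    if isinstance(path, bytes):
        # surrogateescape keeps undecodable bytes recoverable later.
        return path.decode("utf-8", errors="surrogateescape")
    return path


candidates = (b"/media/hdd/movie/M\x81nchen.ts", "/media/hdd/movie/example.ts")
for candidate in candidates:
    # Plain str operations now work regardless of the input type.
    print("movie" in as_text(candidate))  # prints True for both entries
```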

 

One way or another, we need to make changes.

 

I think the code in the previous posts was trying to solve exactly that issue: let getPath return "valid" Unicode using surrogates in SWIG.

 

No code is bad, but if it can get simpler and clearer to maintain, then yes, let's go for it.

 

Let's sleep on it.



Re: RC 9.0 - Problems with Windows filenames that contain Umlauts #117 WanWizard

  • PLi® Core member
  • 68,625 posts

+1,739
Excellent

Posted 27 March 2024 - 19:59

True, I have no real preference as a non-E2 developer, as long as it is fixed ;).




