

athoik

Member Since 20 Sep 2012
Offline Last Active 01 Apr 2024 20:54

Posts I've Made

In Topic: RC 9.0 - Problems with Windows filenames that contain Umlauts

27 March 2024 - 19:32

Sure, changing getPath will solve the issue with corrupted filenames (or, better said, filenames created using an encoding other than unicode, like cp437).

 

On the other hand, everywhere else we expect getPath to return a unicode string.

 

So any code that compares str against bytes will start to fail.

 

>>> if 'hello' in b'hello': print("hello!")
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: a bytes-like object is required, not 'str'
>>> if b'hello' in 'hello': print("hello!")
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'in <string>' requires string as left operand, not bytes
>>> if 'hello' in b'hello'.decode('utf-8'): print("hello!")
...
hello!
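
One way to keep such comparisons working is to normalise at the boundary before comparing. A minimal sketch, assuming a hypothetical helper called to_text (it is not part of enigma2) and assuming UTF-8 with surrogateescape is the decoding we want:

```python
def to_text(value, encoding="utf-8"):
    """Return value as str; bytes are decoded, str is passed through.

    to_text is a hypothetical helper, for illustration only.
    """
    if isinstance(value, bytes):
        # surrogateescape keeps undecodable bytes instead of raising
        return value.decode(encoding, errors="surrogateescape")
    return value


def contains(needle, haystack):
    """Substring test that tolerates a mix of str and bytes."""
    return to_text(needle) in to_text(haystack)


print(contains("hello", b"hello"))
print(contains(b"hello", "hello"))
```

Both calls print True, because each side is converted to str before the `in` test runs.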

 

One way or another, we need to make changes.

 

I think the code in the previous posts was trying to solve exactly that issue: let getPath return "valid" unicode using "surrogate" in SWIG.
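
To show why "valid" is in quotes, here is a small sketch in plain Python of what the surrogateescape error handler does. It is only an illustration of the handler, not the SWIG code from the previous posts:

```python
raw = b"M\x81nchen.png"  # cp437 bytes, not valid UTF-8

text = raw.decode("utf-8", errors="surrogateescape")
# the undecodable byte 0x81 becomes the lone surrogate U+DC81
assert text == "M\udc81nchen.png"

# encoding with the same handler restores the original bytes
assert text.encode("utf-8", errors="surrogateescape") == raw

# a strict encode refuses the lone surrogate
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```

So the string is usable as a str and round-trips back to the same bytes, but it is not something a strict UTF-8 encoder will accept.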

 

The code is not bad, but if the if can be made simpler and clearer to maintain, then yes, let's go for it.

 

Let's sleep on it.


In Topic: RC 9.0 - Problems with Windows filenames that contain Umlauts

26 March 2024 - 22:29

diff --git a/lib/service/iservice.h b/lib/service/iservice.h
index 9ec1009f4c..be11e47582 100644
--- a/lib/service/iservice.h
+++ b/lib/service/iservice.h
@@ -57,6 +57,9 @@ public:
        std::string getPath() const { return path; }
        void setPath( const std::string &n ) { path=n; }

+       // getRawPath will return a bytes object in python
+       std::vector<char> getRawPath() const { return std::vector<char>(path.begin(), path.end()); }
+
        unsigned int getUnsignedData(unsigned int num) const
        {
                if ( num < sizeof(data)/sizeof(int) )

 

How about using a new property like getRawPath?

 

A returned std::vector<char> will be converted directly into bytes by SWIG.

 

Please note that os.listdir returns either unicode strings or bytes, depending on the input...

 

>>> os.listdir(b'.')
[b'M\xc3\xbcnchen.png', b'M\x81nchen.png']
>>> os.listdir(b'.')[1].decode("utf-8",errors='surrogateescape')
'M\udc81nchen.png'
>>> os.listdir(b'.')[1].decode('cp437')
'München.png'
...
>>> list(os.walk(b'.'))
[(b'.', [], [b'M\xc3\xbcnchen.png', b'M\x81nchen.png'])]
>>> list(os.walk('.'))
[('.', [], ['München.png', 'M\udc81nchen.png'])]

 

https://docs.python....icode-filenames

 

The os.listdir() function returns filenames, which raises an issue: should it return the Unicode version of filenames, or should it return bytes containing the encoded versions? os.listdir() can do both, depending on whether you provided the directory path as bytes or a Unicode string.

 

What do you think, is it possible to use getRawPath().decode("utf-8",errors='surrogateescape')?
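
A rough sketch of how that could look on the Python side. FakeServiceRef, path_for_os and path_for_display are hypothetical names for illustration, and I am assuming getRawPath arrives in Python as a bytes object, as in the diff above:

```python
class FakeServiceRef:
    """Hypothetical stand-in for a service reference; not the real class."""

    def __init__(self, raw):
        self._raw = raw

    def getRawPath(self):
        # assumption: the real getRawPath would hand back bytes
        return self._raw


def path_for_os(ref):
    """str that encodes back to the same bytes, for filesystem calls."""
    return ref.getRawPath().decode("utf-8", errors="surrogateescape")


def path_for_display(ref):
    """str that is safe to show; bad bytes become U+FFFD."""
    return ref.getRawPath().decode("utf-8", errors="replace")


good = FakeServiceRef(b"M\xc3\xbcnchen.png")
bad = FakeServiceRef(b"M\x81nchen.png")
print(path_for_os(good), ascii(path_for_os(bad)))
print(path_for_display(good), ascii(path_for_display(bad)))
```

The idea is to keep two views of the same bytes: one that maps back to the name on disk, and one that never contains lone surrogates and can be passed to code that encodes strictly.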


In Topic: RC 9.0 - Problems with Windows filenames that contain Umlauts

29 February 2024 - 22:40

>>> import os
>>> os.listdir()
['München.png', 'M\udc81nchen.png']

I was able to reproduce the issue after extracting the zip from post #7.

 

With coreutils, ls is much more informative.

ls
M?nchen.png  München.png
ls -b
M\201nchen.png  München.png

So M\201nchen.png is crashing e2.
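
For reference, the \201 printed by ls -b and the \udc81 printed by Python are the same byte seen two ways. A small sketch (octal_escape_to_byte is a hypothetical helper name):

```python
def octal_escape_to_byte(escape):
    """Turn an ls -b style escape such as '\\201' into its byte value."""
    return int(escape.lstrip("\\"), 8)


byte = octal_escape_to_byte("\\201")
# surrogateescape maps an undecodable byte 0xNN to code point U+DC00 + 0xNN
surrogate = 0xDC00 + byte
print(hex(byte), hex(surrogate))

decoded = bytes([byte]).decode("utf-8", errors="surrogateescape")
print(decoded == chr(surrogate))
```

Octal 201 is 0x81, and 0xDC00 + 0x81 is 0xDC81, which is exactly the surrogate in the os.listdir() output.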


In Topic: RC 9.0 - Problems with Windows filenames that contain Umlauts

25 February 2024 - 19:58

root@osmio4kplus:~# mount | grep DOSFAT

/dev/sdb1 on /media/DOSFAT type vfat (rw,relatime,gid=6,fmask=0007,dmask=0007,allow_utime=0020,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)

root@osmio4kplus:~# ls /media/DOSFAT

München.png

Attached file: picture.jpg (81.54KB)

 

I cannot reproduce it. How should the filename look on disk?


In Topic: RC 9.0 - Problems with Windows filenames that contain Umlauts

25 February 2024 - 14:43

Maybe on 8.X the iocharset mount option for FAT was iso8859-1.

 

Now on 9.X maybe the default mount options changed to utf8.

 

So ü is causing that crash.

 

I believe the problem comes from the default iocharset: ISO vs UTF8 nowadays.
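
To illustrate why the charset matters, here is how the single character ü looks as bytes in the encodings mentioned in this topic. This is only a sketch of the encodings themselves, not of what the kernel does when mounting:

```python
U_UMLAUT = "\u00fc"  # the character ü

encodings = ("cp437", "iso8859-1", "utf-8")
encoded = {name: U_UMLAUT.encode(name) for name in encodings}
for name, data in encoded.items():
    print(name, data)


def decodes_as_utf8(data):
    """True if data is valid UTF-8 under strict decoding."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for name, data in encoded.items():
    print(name, decodes_as_utf8(data))
```

Only the UTF-8 form survives a strict UTF-8 decode; the single-byte cp437 and iso8859-1 forms do not, so a name that reaches Python in one of those encodings needs special handling.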

 

FYI https://github.com/s...arset&type=code