subtitles problem

Started by imperia, 11 Oct 2015 15:36

12 replies to this topic

#1 imperia

Senior Member
140 posts

+2

Neutral

Posted 11 October 2015 - 15:36

Hello,

I was watching a movie in media player.

I tried to lower the volume, but I guess the remote sent some other command, and now my subtitles are corrupted. I tried restarting the STB, but that didn't help!

How do I fix this?

Attached Files

20151011_172245.jpg 101.92KB 10 downloads

Re: subtitles problem #2 WanWizard

PLi® Core member
68,559 posts

+1,737

Excellent

Posted 11 October 2015 - 15:47

If you play this same movie on the PC (using VLC for example), are the subtitles correct there?

Currently in use: VU+ Duo 4K (2xFBC S2), VU+ Solo 4K (1xFBC S2), uClan Usytm 4K Pro (S2+T2), Octagon SF8008 (S2+T2), Zgemma H9.2H (S2+T2)

Due to my bad health, I will not be very active at times and may be slow to respond. I will not read the forum or PM on a regular basis.

Many answers to your question can be found in our new and improved wiki.

Re: subtitles problem #3 athoik

PLi® Core member
8,458 posts

+327

Excellent

Posted 11 October 2015 - 15:54

If you are using external subtitles (.srt), make sure they are encoded as UTF-8 with BOM. You can use Notepad++ to change the encoding.
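
If you would rather script this than open every file in an editor, here is a minimal sketch in C. It assumes the subtitle data is already in memory and already UTF-8; the helper name add_utf8_bom is made up for illustration and is not part of any library.

```c
#include <stddef.h>
#include <string.h>

/* The three bytes of the UTF-8 byte order mark (U+FEFF encoded as UTF-8). */
static const unsigned char UTF8_BOM[3] = { 0xEF, 0xBB, 0xBF };

/* Hypothetical helper: copy `in` to `out`, putting a UTF-8 BOM in front
 * unless one is already there. Returns the number of bytes written,
 * or 0 if `out` (capacity `cap`) is too small. */
size_t
add_utf8_bom (const unsigned char *in, size_t len,
    unsigned char *out, size_t cap)
{
  int has_bom = len >= 3 && memcmp (in, UTF8_BOM, 3) == 0;
  size_t need = has_bom ? len : len + 3;

  if (cap < need)
    return 0;
  if (!has_bom) {
    memcpy (out, UTF8_BOM, 3);
    out += 3;
  }
  memcpy (out, in, len);
  return need;
}
```

Running it on a file that already starts with EF BB BF leaves the data unchanged, so it should be safe to apply to a whole folder of .srt files.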

Wavefield T90: 0.8W - 1.9E - 4.8E - 13E - 16E - 19.2E - 23.5E - 26E - 33E - 39E - 42E - 45E on EMP Centauri DiseqC 16/1
Unamed: 13E Quattro - 9E Quattro on IKUSI MS-0916

Re: subtitles problem #4 imperia

Senior Member
140 posts

+2

Neutral

Posted 11 October 2015 - 15:59

Everything was OK before I touched the remote. (I have to say that my remote is kind of failing. Even if I only touch the volume down button, the remote may send some other command.)

I always use .srt subtitles converted with Notepad++.

But until now I never converted them to UTF-8 with BOM, only to plain UTF-8. I never had a problem with plain UTF-8 before.

I will try the BOM version now.

Re: subtitles problem #5 imperia

Senior Member
140 posts

+2

Neutral

Posted 11 October 2015 - 16:25

Now it's working. The strange thing is that it was working with plain UTF-8 before I touched the remote. I never had problems with plain UTF-8, for years.

Re: subtitles problem #6 athoik

PLi® Core member
8,458 posts

+327

Excellent

Posted 11 October 2015 - 17:23

It's a known bug, sometimes it happens... https://bugzilla.gno...g.cgi?id=740784

Re: subtitles problem #7 imperia

Senior Member
140 posts

+2

Neutral

Posted 11 October 2015 - 17:45

Thank you for the help.

Re: subtitles problem #8 Erik Slagter

PLi® Core member
46,951 posts

+541

Excellent

Posted 12 October 2015 - 18:27

Athoik, can you explain to me what the problem actually is? I can't see it. UTF-8 = UTF-8; there is no "encoding" because UTF-8 IS the encoding. The only thing I can think of is byte order or even bit order. Byte order would surprise me because UTF-8 doesn't have a sense of "words", it's just "fifo", one byte at a time. Machines with reversed bit order, do they still exist? So?

* Wavefrontier T90 with 28E/23E/19E/13E via SCR switches 2 x 2 x 6 user bands
I don't read PM -> if you have something to ask or to report, do it in the forum so others can benefit. I don't take freelance jobs.
Ik lees geen PM -> als je iets te vragen of te melden hebt, doe het op het forum, zodat anderen er ook wat aan hebben.

Re: subtitles problem #9 athoik

PLi® Core member
8,458 posts

+327

Excellent

Posted 12 October 2015 - 19:15

Hi Erik,

When you have UTF-8 with BOM, the encoding is detected by the function detect_encoding, stored in detected_encoding, and then used in gst_convert_to_utf8 without any issues.

static gchar *
detect_encoding (const gchar * str, gsize len)
{
  if (len >= 3 && (guint8) str[0] == 0xEF && (guint8) str[1] == 0xBB
      && (guint8) str[2] == 0xBF)
    return g_strdup ("UTF-8");
....
....
self->detected_encoding = detect_encoding ((gchar *) map.data, map.size);
....
....
static gchar *
convert_encoding (GstSubParse * self, const gchar * str, gsize len,
    gsize * consumed)
{
....

  /* First try any detected encoding */
  if (self->detected_encoding) {
    ret =
        gst_convert_to_utf8 (str, len, self->detected_encoding, consumed, &err);
...
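
To show what a BOM check looks like outside of GStreamer, here is a simplified, GLib-free sketch. It is not the subparse code: the function name bom_encoding is made up, and the UTF-16/UTF-32 branches are just the standard Unicode byte order marks added for illustration, not a claim about what the elided lines above contain.

```c
#include <stddef.h>

/* Return the encoding name implied by a leading byte order mark,
 * or NULL when the buffer does not start with one. */
const char *
bom_encoding (const unsigned char *p, size_t len)
{
  if (len >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
    return "UTF-8";
  /* UTF-32 must be tested before UTF-16: FF FE 00 00 starts with FF FE. */
  if (len >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
    return "UTF-32BE";
  if (len >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
    return "UTF-32LE";
  if (len >= 2 && p[0] == 0xFE && p[1] == 0xFF)
    return "UTF-16BE";
  if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE)
    return "UTF-16LE";
  return NULL;
}
```

For UTF-8 the mark says nothing about byte order; it only acts as a signature that lets the parser skip the guessing.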

When the file is UTF-8 without a BOM, detected_encoding is NULL and the code checks whether the buffer is UTF-8 with g_utf8_validate. The problem is with g_utf8_validate: note that g_utf8_validate() returns FALSE if max_len is positive and any of the max_len bytes are nul. The buffers can contain nul bytes, so the patch discards them.

/* Otherwise check if it's UTF8 */
if (self->valid_utf8) {
  if (g_utf8_validate (str, len, NULL)) {
    GST_LOG_OBJECT (self, "valid UTF-8, no conversion needed");
    *consumed = len;
    return g_strndup (str, len);
  }
  ...

Also, we need to consume only the valid data reported by g_utf8_validate. E.g. we have 12 Greek characters (== 24 bytes), but the buffer contains 23 bytes, so only the first 22 bytes (== 11 characters) are valid UTF-8. The next time the buffer is filled, we have 1 byte left over from the previous run plus one more from the new read, and together they now form a valid Greek character (== 2 bytes).
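
The two points above (trailing nul bytes and a character cut in half at the end of the buffer) can be sketched in plain C. This is a simplified stand-in, not the patch itself and not GLib: utf8_valid_prefix and strip_trailing_nuls are hypothetical names, and the validator only mimics the g_utf8_validate behaviour described here.

```c
#include <stddef.h>

/* Simplified stand-in for g_utf8_validate(): returns how many leading bytes
 * of `s` form complete, well-formed UTF-8 sequences. Like the GLib function
 * with a positive max_len, it treats a NUL byte as invalid. */
size_t
utf8_valid_prefix (const unsigned char *s, size_t len)
{
  size_t i = 0;

  while (i < len) {
    unsigned char c = s[i];
    size_t n, k;
    unsigned long cp, min;

    if (c == 0x00)
      break;                            /* NUL: stop here */
    else if (c < 0x80) { n = 1; cp = c; min = 0; }
    else if ((c & 0xE0) == 0xC0) { n = 2; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { n = 3; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { n = 4; cp = c & 0x07; min = 0x10000; }
    else
      break;                            /* stray continuation or bad lead */

    if (i + n > len)
      break;                            /* sequence cut off by buffer end */
    for (k = 1; k < n; k++) {
      if ((s[i + k] & 0xC0) != 0x80)
        return i;                       /* not a continuation byte */
      cp = (cp << 6) | (s[i + k] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      break;                            /* overlong, out of range, surrogate */
    i += n;
  }
  return i;
}

/* Drop trailing NUL bytes so they do not make a valid buffer fail. */
size_t
strip_trailing_nuls (const unsigned char *s, size_t len)
{
  while (len > 0 && s[len - 1] == 0x00)
    len--;
  return len;
}
```

With 12 Greek letters (24 bytes) and a buffer of 23 bytes, utf8_valid_prefix reports 22, so the last byte can be kept and joined with the next read instead of failing the whole buffer.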

Here is the patch that solves those problems; it has not been accepted yet.

From: "Reynaldo H. Verdejo Pinochet" <reynaldo@osg.samsung.com>
Date: Fri, 28 Nov 2014 13:26:13 -0300
Subject: [PATCH] subparse: avoid false negatives dealing with UTF-8
g_utf8_validate() chokes at any NUL among max_len
bytes so we should avoid passing null character
terminators if present. Additionally, only part of
the available data might be valid UTF-8. For example
a byte at the end might be the start of a valid UTF-8
run (ie: d0) but not be a valid UTF-8 character by
itself. In this case, we consume only the valid portion
of the run.

https://bugzilla.gnome.org/show_bug.cgi?id=740784

Edited by athoik, 12 October 2015 - 19:17.

Re: subtitles problem #10 olive069

Member
14 posts

0

Neutral

Posted 12 October 2015 - 21:10

Some French subtitles have the same problem: letters are displayed wrongly. Will this patch fix that too?

Thanks.

Re: subtitles problem #11 Erik Slagter

PLi® Core member
46,951 posts

+541

Excellent

Posted 13 October 2015 - 13:44

Athoik, I still don't understand. What "encoding" are we talking about? Enigma and GStreamer always assume UTF-8 unless overridden, right? Once UTF-8 is established, there is no further encoding, byte ordering (byte order mark?) etc.

Re: subtitles problem #12 athoik

PLi® Core member
8,458 posts

+327

Excellent

Posted 13 October 2015 - 19:11

Erik Slagter wrote: "Athoik, I still don't understand. What "encoding" are we talking about? Enigma and GStreamer always assume UTF-8 unless overridden, right? Once UTF-8 is established, there is no further encoding, byte ordering (byte order mark?) etc."

OK, let me explain. Enigma2 knows nothing about .srt encoding.

The GStreamer subparse module is responsible for that (http://cgit.freedesk...e/gstsubparse.c)

  g_object_class_install_property (object_class, PROP_ENCODING,
      g_param_spec_string ("subtitle-encoding", "subtitle charset encoding",
          "Encoding to assume if input subtitles are not in UTF-8 or any other "
          "Unicode encoding. If not set, the GST_SUBTITLE_ENCODING environment "
          "variable will be checked for an encoding to use. If that is not set "
          "either, ISO-8859-15 will be assumed.", DEFAULT_ENCODING,
          G_PARAM_READWRITE | G_PARAM_STATIC_STRINGS));

1. By default everything is UTF-8

self->valid_utf8 = TRUE;

2. Unless a BOM is detected. If a BOM is detected, it uses the encoding indicated by the BOM and converts from that encoding to UTF-8 without validating the buffers.

self->detected_encoding = detect_encoding ((gchar *) map.data, map.size);
...
/* First try any detected encoding */
if (self->detected_encoding) {
  ret = gst_convert_to_utf8 (str, len, self->detected_encoding, consumed, &err);

3. If no BOM is detected, every buffer is validated as UTF-8

/* Otherwise check if it's UTF8 */
if (self->valid_utf8) {
  if (g_utf8_validate (str, len, NULL)) {
    GST_LOG_OBJECT (self, "valid UTF-8, no conversion needed");
    *consumed = len;
    return g_strndup (str, len);
  }
  GST_INFO_OBJECT (self, "invalid UTF-8!");
  self->valid_utf8 = FALSE;
}

4. If validation fails once, it will never try to validate UTF-8 again (valid_utf8 = FALSE, see step 3). From then on it treats every buffer as being in the fallback encoding.

  /* Else try fallback */
  encoding = self->encoding;
  if (encoding == NULL || *encoding == '\0') {
    encoding = g_getenv ("GST_SUBTITLE_ENCODING");
  }
  if (encoding == NULL || *encoding == '\0') {
    /* if local encoding is UTF-8 and no encoding specified
     * via the environment variable, assume ISO-8859-15 */
    if (g_get_charset (&encoding)) {
      encoding = "ISO-8859-15";
    }
  }

  ret = gst_convert_to_utf8 (str, len, encoding, consumed, &err);

  if (err) {
    GST_WARNING_OBJECT (self, "could not convert string from '%s' to UTF-8: %s",
        encoding, err->message);
    g_clear_error (&err);

    /* invalid input encoding, fall back to ISO-8859-15 (always succeeds) */
    ret = gst_convert_to_utf8 (str, len, "ISO-8859-15", consumed, NULL);
  }

  GST_LOG_OBJECT (self,
      "successfully converted %" G_GSIZE_FORMAT " characters from %s to UTF-8"
      "%s", len, encoding, (err) ? " , using ISO-8859-15 as fallback" : "");

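Steps 1 to 4 can be condensed into a small decision function. This is a reduced model for illustration only: SubState, ConvPath and choose_path are made-up names, the validation result is passed in as a flag instead of calling g_utf8_validate, and the error path of the detected-encoding conversion is left out.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { PATH_DETECTED, PATH_UTF8_AS_IS, PATH_FALLBACK } ConvPath;

/* Reduced model of the parser state used in steps 1-4. */
typedef struct {
  const char *detected_encoding;  /* set from a BOM, else NULL */
  bool valid_utf8;                /* starts true, cleared on first failure */
} SubState;

/* Decide how one buffer is handled. `buffer_validates` stands for the
 * result g_utf8_validate() would give for this buffer. */
ConvPath
choose_path (SubState *st, bool buffer_validates)
{
  if (st->detected_encoding != NULL)
    return PATH_DETECTED;         /* step 2: BOM wins, no validation */
  if (st->valid_utf8) {
    if (buffer_validates)
      return PATH_UTF8_AS_IS;     /* step 3: passed through unchanged */
    st->valid_utf8 = false;       /* step 4: never validate again */
  }
  return PATH_FALLBACK;
}
```

The flag is sticky: after a single failed buffer, a later buffer that would validate still takes the fallback path, while a file with a BOM never reaches the validation at all.
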
So, if we have UTF-8 without a BOM and g_utf8_validate fails once (e.g. because a buffer ends with a NUL byte), every later buffer, even valid UTF-8, is treated as the fallback encoding (ISO-8859-15****) and converted from it.

**** The only exception is for Greek, where it uses ISO-8859-7 (https://github.com/O...a72f2695c3a4de3)
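
To see why the result looks corrupted, here is a minimal sketch of what happens when UTF-8 bytes are read as ISO-8859-15 and converted to UTF-8 a second time. It is a hand-written converter for illustration, not gst_convert_to_utf8; the names latin9_to_codepoint and latin9_to_utf8 are made up.

```c
#include <stddef.h>

/* Map one ISO-8859-15 byte to its Unicode code point. The charset equals
 * ISO-8859-1 (byte value == code point) except for eight positions. */
static unsigned int
latin9_to_codepoint (unsigned char b)
{
  switch (b) {
    case 0xA4: return 0x20AC;   /* euro sign */
    case 0xA6: return 0x0160;   /* S with caron */
    case 0xA8: return 0x0161;   /* s with caron */
    case 0xB4: return 0x017D;   /* Z with caron */
    case 0xB8: return 0x017E;   /* z with caron */
    case 0xBC: return 0x0152;   /* OE ligature */
    case 0xBD: return 0x0153;   /* oe ligature */
    case 0xBE: return 0x0178;   /* Y with diaeresis */
    default:   return b;
  }
}

/* Treat `in` as ISO-8859-15 and write the UTF-8 result to `out`.
 * `out` must hold at least 3 * len bytes. Returns bytes written. */
size_t
latin9_to_utf8 (const unsigned char *in, size_t len, unsigned char *out)
{
  size_t i, o = 0;

  for (i = 0; i < len; i++) {
    unsigned int cp = latin9_to_codepoint (in[i]);

    if (cp < 0x80) {
      out[o++] = (unsigned char) cp;
    } else if (cp < 0x800) {
      out[o++] = (unsigned char) (0xC0 | (cp >> 6));
      out[o++] = (unsigned char) (0x80 | (cp & 0x3F));
    } else {
      out[o++] = (unsigned char) (0xE0 | (cp >> 12));
      out[o++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[o++] = (unsigned char) (0x80 | (cp & 0x3F));
    }
  }
  return o;
}
```

Feeding it the two UTF-8 bytes of "é" (C3 A9) gives four bytes, C3 83 C2 A9, which display as "Ã©". Every accented or non-Latin character is doubled up in this way, which would also explain wrongly displayed letters in French subtitles if they go down the same path.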

Re: subtitles problem #13 Erik Slagter

PLi® Core member
46,951 posts

+541

Excellent

Posted 14 October 2015 - 18:22

This sounds like very dirty workarounds for very dirty practices. I agree that you can't really determine the encoding of an SRT file (external or embedded). That is a major flaw in the design. So the problem is actually not UTF-8, but the use of encodings other than UTF-8. I mean, if everybody used UTF-8, there would be no problem. I guess the option of running every SRT through iconv to convert it to UTF-8 is not user-friendly?

I really wish all these encodings would be gone once and for all. Everyone (including Windows, please) should be using UTF-8.
