eepgcache and title searching


203 replies to this topic

#1 awx

  • Senior Member
  • 297 posts

+17
Neutral

Posted 7 January 2012 - 10:33

I am searching for a title string in eEPGCache, and the search does not seem to work correctly with special or accented characters.
Can anyone confirm whether I need to do something special for this to work?

Re: eepgcache and title searching #2 awx

Posted 7 January 2012 - 15:59

It looks like, to search for certain title strings, you need to re-encode the search string to one of the ISO 8859-1 through ISO 8859-16 encodings.
Is this really needed? Should I not just be able to pass a UTF-8 encoded string and have it work?
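
To make the re-encoding concrete, here is a minimal sketch of what "UTF-8 to ISO 8859-1" means at the byte level. The helper name utf8ToLatin1 is made up for this illustration and is not part of enigma2; it only handles code points up to U+00FF.

```cpp
#include <string>

// Hypothetical helper (not an enigma2 function): convert UTF-8 to ISO 8859-1.
// Code points above U+00FF cannot be represented and become '?'.
std::string utf8ToLatin1(const std::string &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80)
        {
            out += static_cast<char>(c); // ASCII is identical in both encodings
        }
        else if ((c & 0xE0) == 0xC0 && i + 1 < in.size())
        {
            // Two-byte UTF-8 sequence: 110xxxxx 10yyyyyy
            unsigned char c2 = static_cast<unsigned char>(in[++i]);
            unsigned int cp = ((c & 0x1Fu) << 6) | (c2 & 0x3Fu);
            out += cp <= 0xFF ? static_cast<char>(cp) : '?';
        }
        else
        {
            out += '?';
            // Skip the continuation bytes of a longer sequence.
            while (i + 1 < in.size() &&
                   (static_cast<unsigned char>(in[i + 1]) & 0xC0) == 0x80)
                ++i;
        }
    }
    return out;
}
```

For example, "Café" is the bytes 43 61 66 C3 A9 in UTF-8 but 43 61 66 E9 in ISO 8859-1, so a byte-for-byte comparison between the two forms can never match.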

Re: eepgcache and title searching #3 awx

Posted 9 January 2012 - 08:30

Can anyone tell me what convertDVBUTF8 in estring.h does?
Does this convert to UTF8 or from UTF8?

Its use in the epgcache search method would suggest that it converts from UTF-8:
/* custom encoding */
title = convertDVBUTF8((unsigned char*)titleptr, title_len, 0x40, 0);


Re: eepgcache and title searching #4 ims

  • PLi® Core member
  • 13,781 posts

+214
Excellent

Posted 9 January 2012 - 09:44

It converts strings from the DVB encoding to UTF-8.
Whoever does nothing, spoils nothing!
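
As background on what "DVB encoding" means: as far as I understand ETSI EN 300 468 (Annex A), a DVB text field may start with a selector byte below 0x20 that names the character table, and text without such a byte uses a default table. The sketch below only classifies that first byte. describeDvbTextEncoding is a made-up name, it covers just a few selector values, and it is not how enigma2 implements convertDVBUTF8.

```cpp
#include <string>

// Hypothetical helper: report which character table a DVB text field announces.
// Based on my reading of EN 300 468 Annex A; only a subset of values is handled.
std::string describeDvbTextEncoding(const unsigned char *data, int len)
{
    if (len <= 0)
        return "empty";
    unsigned char first = data[0];
    if (first >= 0x20)
        return "default table (no selector byte)";
    if (first >= 0x01 && first <= 0x0B && first != 0x08)
        return "ISO 8859-" + std::to_string(first + 4); // 0x01 -> 8859-5, ...
    if (first == 0x10 && len >= 3)
        return "ISO 8859-" + std::to_string((data[1] << 8) | data[2]);
    if (first == 0x11)
        return "ISO 10646 BMP (UCS-2)";
    if (first == 0x15)
        return "UTF-8";
    return "other or reserved";
}
```

The important case for this thread is the first one after the empty check: when there is no selector byte, the bytes alone do not say which table applies, so the receiver has to pick a default.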

Re: eepgcache and title searching #5 awx

Posted 9 January 2012 - 09:49

In that case there could be a bug in convertDVBUTF8, or the data sent by the provider is not completely correct.
Since some strings have to be encoded to ISO 8859 before the search returns results, this could also be because the strings are not normalized before comparison. Maybe someone could change the search function so that it normalizes both the search string and the event string before comparing them.

Re: eepgcache and title searching #6 ims

Posted 9 January 2012 - 10:20

Some providers must be converted with "recode" (multibyte), IMHO.

Re: eepgcache and title searching #7 awx

Posted 9 January 2012 - 10:33

UTF-8 is a multibyte encoding, so it should work for searches and comparisons. However, before two UTF-8 strings are compared they need to be normalized, and I believe that step is missing in the search function.
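
A small illustration of the normalization point, under the assumption that the two strings differ only in composed versus decomposed form. The function composeEAcute is a toy written for this post: it folds exactly one pair ("e" followed by U+0301) into U+00E9. A real implementation would use a full Unicode normalizer such as the one in ICU.

```cpp
#include <string>

// Toy normalization step (hypothetical, one character pair only):
// replace "e" + U+0301 COMBINING ACUTE ACCENT with precomposed U+00E9.
std::string composeEAcute(const std::string &s)
{
    const std::string decomposed = "e\xCC\x81"; // 'e' followed by U+0301
    const std::string composed = "\xC3\xA9";    // U+00E9
    std::string out = s;
    std::size_t pos = 0;
    while ((pos = out.find(decomposed, pos)) != std::string::npos)
    {
        out.replace(pos, decomposed.size(), composed);
        pos += composed.size();
    }
    return out;
}
```

Both forms render as "Café" on screen, yet they are different byte sequences, so a plain string comparison treats them as different titles until one side is normalized.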

I have also seen that titles returned from a search are not properly encoded, so not all characters show up correctly, whereas titles returned by the other lookup methods are correct.

Edited by awx, 9 January 2012 - 10:34.


Re: eepgcache and title searching #8 ims

Posted 9 January 2012 - 10:37

Not all providers broadcast in UTF-8; that is what the settings in encoding.conf are for.

Re: eepgcache and title searching #9 awx

Posted 9 January 2012 - 10:53

Yes, but that's why I asked whether the function converts to UTF-8.
That would mean that the title is converted to UTF-8 before the comparison.
Of course, I would think that every title retrieval should just return a UTF-8 string, but for now I am only dealing with the search functions on the box.

Re: eepgcache and title searching #10 awx

Posted 10 January 2012 - 12:50

Can anyone explain why the following scenario does not work?

1) Query epgcache for events.
2) Take a title (one with special characters) from one of the returned events and use it to search epgcache.
3) epgcache returns no results.
4) Take the same title, decode it from UTF-8, and encode it to one of the ISO character maps.
5) Search using the re-encoded title.
6) The search returns results.

The results have incorrect characters in them, even though the query for the currently playing events returned correct titles.
Even if we somehow get a match using another character mapping, why do the results come back in a strange encoding?
Why does the first title search not return at least the event the title was taken from?

I need help debugging this; if anyone has any ideas, please let me know.

Re: eepgcache and title searching #11 MiLo

  • PLi® Core member
  • 14,055 posts

+298
Excellent

Posted 10 January 2012 - 13:09

Probably the epgcache converts all results to UTF-8 before returning them, but when searching it doesn't do any conversion and looks for the exact byte sequence. EPG data is usually not UTF-8 encoded (unless you got it from XMLTV).
Real musicians never die - they just decompose
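
The hypothesis above can be reproduced with a toy model. Everything here is invented for illustration (ToyCache, latin1ToUtf8, the assumption that the broadcast bytes are ISO 8859-1); it is not the real eEPGCache, only a sketch of "convert on output, compare raw bytes on search".

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a DVB-to-UTF-8 conversion; handles ISO 8859-1 only.
std::string latin1ToUtf8(const std::string &in)
{
    std::string out;
    for (unsigned char c : in)
    {
        if (c < 0x80)
        {
            out += static_cast<char>(c);
        }
        else
        {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

struct ToyCache
{
    std::vector<std::string> rawTitles; // bytes as broadcast (assumed ISO 8859-1)

    // Lookup path: converts to UTF-8 before returning the title.
    std::string titleAt(std::size_t i) const { return latin1ToUtf8(rawTitles[i]); }

    // Search path as hypothesised: compares the needle with the raw bytes.
    int searchRaw(const std::string &needle) const
    {
        for (std::size_t i = 0; i < rawTitles.size(); ++i)
            if (rawTitles[i] == needle)
                return static_cast<int>(i);
        return -1;
    }

    // Search path that converts the stored title first.
    int searchConverted(const std::string &needleUtf8) const
    {
        for (std::size_t i = 0; i < rawTitles.size(); ++i)
            if (latin1ToUtf8(rawTitles[i]) == needleUtf8)
                return static_cast<int>(i);
        return -1;
    }
};
```

With a stored title of "Caf\xE9", titleAt returns the UTF-8 form, searchRaw with that UTF-8 form finds nothing, and searchRaw with the re-encoded ISO 8859-1 bytes finds the event, which matches steps 2 to 6 of the scenario in the previous post.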

Re: eepgcache and title searching #12 awx

Posted 10 January 2012 - 13:14

Yes, but it is doing the conversion before the title comparison, although maybe not correctly, as it differs from the conversion in eServiceEvent.
eServiceEvent does the following to get the title:
case SHORT_EVENT_DESCRIPTOR:
{
    const ShortEventDescriptor *sed = (ShortEventDescriptor*)*desc;
    std::string cc = sed->getIso639LanguageCode();
    std::transform(cc.begin(), cc.end(), cc.begin(), tolower);
    // default character table for this language code
    int table=encodingHandler.getCountryCodeDefaultMapping(cc);
    if ( lang == "---" || lang.find(cc) != -1)
    {
        // both conversions receive the table and the tsidonid
        m_event_name = replace_all(replace_all(convertDVBUTF8(sed->getEventName(), table, tsidonid), "\n", " "), "\t", " ");
        m_short_description = convertDVBUTF8(sed->getText(), table, tsidonid);
        retval=1;
    }
    break;
}

epgcache does the following:
for (descriptorMap::iterator it(eventData::descriptors.begin());
    it != eventData::descriptors.end() && descridx < 511; ++it)
{
    __u8 *data = it->second.second;
    if ( data[0] == 0x4D ) // short event descriptor
    {
        std::string title;
        const char *titleptr = (const char*)&data[6];
        int title_len = data[5];
        // only titles whose first byte is below 0x20 are converted here
        if (data[6] < 0x20)
        {
            /* custom encoding */
            title = convertDVBUTF8((unsigned char*)titleptr, title_len, 0x40, 0);
            titleptr = title.c_str();
            title_len = title.length();
        }

So the comparison is UTF-8 against UTF-8, or at least it should be.
But the conversion to UTF-8 in the epgcache search doesn't use table and tsidonid; it just hardcodes them to 0x40 and 0.

But I am not sure how to get those values, or whether that is really the problem.
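
To show why the table argument could matter, here is a toy decoder in which the same broadcast byte turns into different UTF-8 text depending on the table. The names encodeUtf8 and toyDecode and the table numbers 1 and 2 are assumptions made for this sketch; they are not the values convertDVBUTF8 uses, and I have not checked what 0x40 selects there.

```cpp
#include <string>

// Encode one code point below U+0800 as UTF-8 (enough for this example).
std::string encodeUtf8(unsigned int cp)
{
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Toy decoder (hypothetical): knows byte 0xE8 in two tables only.
// table 1 stands for ISO 8859-1, table 2 for ISO 8859-2.
std::string toyDecode(unsigned char byte, int table)
{
    if (byte < 0x80)
        return encodeUtf8(byte);
    if (byte == 0xE8)
        return encodeUtf8(table == 2 ? 0x010Du : 0x00E8u); // c-caron vs e-grave
    return "?";
}
```

Byte 0xE8 is "è" in ISO 8859-1 and "č" in ISO 8859-2, so if the search path and the lookup path decode with different tables, the resulting UTF-8 strings differ even though both came from the same stored bytes.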

Edited by awx, 10 January 2012 - 13:15.


Re: eepgcache and title searching #13 awx

Posted 10 January 2012 - 14:20

So I think this could be the problem. The nicest solution without rewriting everything would be to call convertDVBUTF8 on the short event descriptor when the data arrives (I think at that point we should still have the tsid and onid). The same should be done on epgData load. Then the other places that call convertDVBUTF8 on the short event descriptor would no longer be needed.
The alternative would be some sort of descriptor-to-tsid/onid mapping.

Unless someone knows how to find the tsid and onid from the descriptorMap, in which case none of this would be needed.

Re: eepgcache and title searching #14 awx

Posted 11 January 2012 - 19:11

I have a fix for the title searching available.
However, I would like to investigate modifying the eventData constructor to accept tsid and onid values and convert all text to UTF-8 as it comes in, so that we don't have to worry about the format later.
The eventData constructor seems to be called in only four places, and one of them would not need to change: it loads cached data, which, if saved correctly, will already be in the format we need.

Can anyone give any insight or help with this?

Re: eepgcache and title searching #15 pieterg

  • PLi® Core member
  • 32,766 posts

+245
Excellent

Posted 11 January 2012 - 21:00

One of the first steps toward a better EPG cache / storage would be to move away from the raw EIT format.
Converting to UTF-8 is indeed part of this.

There are a few locations where raw EIT is retrieved from the epgcache, one of which is the eit file stored with each recording.
However, an event description is also added to the meta file (single line though), and we could opt to store a plain-text description file instead of the raw EIT dump.

There may be other raw EIT users that need to be converted as well.

Re: eepgcache and title searching #16 awx

Posted 11 January 2012 - 22:34

Thanks. As a first step I will check whether it's possible to convert to UTF-8 when the eventData is first created. This would at least fix the text in most cases. I just have to get my dm build working, as I only have my PC build working right now.

Also, here is a patch. It's a bit of a workaround to get title searching working properly.

Attached Files



Re: eepgcache and title searching #17 pieterg

Posted 11 January 2012 - 22:43

Thanks. Could you please add a bit of comment to the commit, explaining what was broken and how you fixed it?
It is not a simple, self-explanatory +1/-1 line patch ;)

Also, at first glance the new code looks a lot like a similar block of code in the search function.
If it is indeed duplicated, can't we share the code?

Re: eepgcache and title searching #18 awx

Posted 11 January 2012 - 23:06

I have attached a new patch with a more detailed description.
The code is duplicated from the block below it, but I didn't find a nice way to merge it into that code without complicating things too much. My change could probably be refactored into that code, but I haven't attempted that yet. If I manage to convert all titles in eventData to UTF-8, this change will no longer be needed.

Attached Files



Re: eepgcache and title searching #19 awx

Posted 12 January 2012 - 09:26

Here is an untested, unbuilt, quick mock-up of a patch that tries to do the same as my original patch, but without duplicating the code.

Attached Files



Re: eepgcache and title searching #20 awx

Posted 12 January 2012 - 10:31

Can anyone tell me if ENABLE_PRIVATE_EPG is on by default in the OpenPLi image?

I think that, instead of the fix above, we only need to modify the eventData constructor
and call the modified constructor in two places.

Right now the eventData constructor is called in four places:
twice in void eEPGCache::sectionRead(const __u8 *data, int source, channel_data *channel)
once in void eEPGCache::load()
once in void eEPGCache::privateSectionRead(const uniqueEPGKey &current_service, const __u8 *data)

Since load only reads data we have previously saved, nothing needs to change there: the saved data will already be in the correct format.
If privateSectionRead is only compiled in when ENABLE_PRIVATE_EPG is defined, then we just have the restriction that private data must have its text encoded in UTF-8.

Therefore we only need to make sectionRead call a modified eventData constructor that takes the tsid and onid, so we can convert text when it first arrives.
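
A rough sketch of the "convert when it first arrives" idea, kept separate from the real eventData class. ToyEventData, the Converter type and stubLatin1ToUtf8 are all hypothetical; the converter's parameter list only mirrors how convertDVBUTF8 is called in the snippets earlier in this thread.

```cpp
#include <functional>
#include <string>

// Hypothetical converter type: (data, length, table, tsidonid) -> UTF-8 string.
using Converter = std::function<std::string(const unsigned char *, int, int, int)>;

// Stub converter for the example: treats the input as ISO 8859-1 and
// ignores table and tsidonid. A real converter would use both.
std::string stubLatin1ToUtf8(const unsigned char *data, int len, int, int)
{
    std::string out;
    for (int i = 0; i < len; ++i)
    {
        unsigned char c = data[i];
        if (c < 0x80)
        {
            out += static_cast<char>(c);
        }
        else
        {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

struct ToyEventData
{
    std::string titleUtf8;

    // Path for freshly received sections: convert once, while the
    // table and tsidonid are still known.
    ToyEventData(const unsigned char *raw, int len, int table, int tsidonid,
                 const Converter &toUtf8)
        : titleUtf8(toUtf8(raw, len, table, tsidonid)) {}

    // Path for loading saved data: the text is assumed to be UTF-8 already.
    explicit ToyEventData(const std::string &savedUtf8) : titleUtf8(savedUtf8) {}
};
```

With this split, every later consumer (search, lookup, display) compares and returns the same UTF-8 text, and only the two receiving call sites need to know about tables and transport stream ids.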

Would someone be willing to help try this?
Currently I am having trouble getting the build for my dm working, so I cannot test this properly.

