eepgcache and title searching
Re: eepgcache and title searching #2
Posted 7 January 2012 - 15:59
> I am searching for title string in eepgcache and it seems that it does not work correctly with special or accented characters.
> Could anyone confirm if there is something special I need to do for this to work?
It looks like for searching certain title strings you need to re-encode the string to something like one of the ISO8859-1 through ISO8859-16 encodings.
Is this really needed? Should I not just be able to pass a UTF-8 encoded string and have it work?
Re: eepgcache and title searching #3
Posted 9 January 2012 - 08:30
Can anyone tell me what convertDVBUTF8 in estring.h does?
Does this convert to UTF-8 or from UTF-8?
Its use in the epgcache search method would suggest it converts from UTF-8:
/* custom encoding */ title = convertDVBUTF8((unsigned char*)titleptr, title_len, 0x40, 0);
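For context on the "custom encoding" check: as far as I understand ETSI EN 300 468 (Annex A), the first byte of a DVB text field can act as a selector for the character table, and values below 0x20 are such selectors. The sketch below only names the table for a few common selectors. `describeDvbEncoding` is a made-up helper for illustration, not enigma2 code, and it does not attempt to explain the 0x40 argument used above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not enigma2 code): names the character table selected
// by the first byte(s) of a DVB SI text field, per my reading of
// EN 300 468 Annex A. Only a few common selectors are covered.
std::string describeDvbEncoding(const unsigned char *data, std::size_t len)
{
    if (len == 0)
        return "empty";
    const unsigned char first = data[0];
    if (first >= 0x20)
        return "default table (no selector byte)";
    if (first >= 0x01 && first <= 0x05)
        // 0x01..0x05 select ISO 8859-5..ISO 8859-9
        return "ISO 8859-" + std::to_string(first + 4);
    if (first == 0x10 && len >= 3)
        // 0x10 is followed by a 16-bit value n that selects ISO 8859-n
        return "ISO 8859-" + std::to_string((data[1] << 8) | data[2]);
    if (first == 0x11)
        return "ISO 10646 (UCS-2)";
    if (first == 0x15)
        return "UTF-8";
    return "selector not covered by this sketch";
}
```

If that reading is right, a title whose first byte is below 0x20 has to be decoded with the table it announces before it can be compared with a UTF-8 search string.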
Re: eepgcache and title searching #4
Re: eepgcache and title searching #5
Posted 9 January 2012 - 09:49
> convert strings from dvb to utf8
In that case there could be a bug in the function convertDVBUTF8, or the data sent down by the provider is not completely correct.
Since it is needed to encode some strings to ISO8859 to get the search to return results, this could also be due to the fact that the strings are not normalized before comparison. Maybe someone could make a change to normalize the search string and the event string in the search function before comparing them.
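To illustrate the normalization point: the same visible character can be written as two different UTF-8 byte sequences, so a byte-wise comparison fails unless both sides are normalized first. The sketch below is only a toy. C++ has no standard normalization routine, so `composeEAcute` handles exactly one sequence and is purely hypothetical; a real fix would need a proper Unicode library.

```cpp
#include <cstddef>
#include <string>

// "é" precomposed: U+00E9, UTF-8 bytes C3 A9.
const std::string kComposed = "caf\xC3\xA9";
// "é" decomposed: 'e' followed by U+0301 (combining acute), bytes 65 CC 81.
const std::string kDecomposed = "cafe\xCC\x81";

// Toy normalizer, for illustration only: rewrites the single sequence
// "e" + U+0301 to U+00E9. It is not a general Unicode normalizer.
std::string composeEAcute(std::string s)
{
    const std::string from = "e\xCC\x81";
    const std::string to = "\xC3\xA9";
    for (std::size_t pos = s.find(from); pos != std::string::npos;
         pos = s.find(from, pos + to.size()))
        s.replace(pos, from.size(), to);
    return s;
}
```

Comparing `kComposed` with `kDecomposed` directly reports a mismatch, while comparing after `composeEAcute` reports a match. Whether the broadcast data actually contains decomposed characters is an open question; this only shows why normalization could matter.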
Re: eepgcache and title searching #6
Re: eepgcache and title searching #7
Posted 9 January 2012 - 10:33
> some providers must be converted with "recode" (multibyte) imho.
UTF-8 is a multibyte encoding and this should work for the searches and comparisons, but before you compare two UTF-8 strings they need to be normalized, which I believe is missing in the search function.
I have also seen that titles returned from search are not properly encoded, so not all characters show up correctly, whereas titles returned by other lookup methods are correct.
Edited by awx, 9 January 2012 - 10:34.
Re: eepgcache and title searching #8
Re: eepgcache and title searching #9
Posted 9 January 2012 - 10:53
> not all providers broadcasting in utf8 ... for it is there in encoding.conf settings...
Yes, but that's why I asked if the function converts to UTF-8.
That would mean that the title is converted to UTF-8 before comparison.
Of course I would think that all title retrieval should just return a UTF-8 string, but for this I am only dealing with the search functions on the box.
Re: eepgcache and title searching #10
Posted 10 January 2012 - 12:50
1) Query epgcache for events.
2) Take a title (one with special characters) retrieved from one of the events and use this title to search epgcache.
3) epgcache returns no results.
4) Take the same title, decode it from UTF-8 and encode it to one of the ISO character maps.
5) Search using the encoded title.
6) The search returns results.
The results have incorrect characters in them, even though the query to get currently playing events returned correct results.
Even if we somehow get a match using another character mapping, why do the results come back with a strange encoding?
Why does the first title search not return at least the event the title was taken from?
I need help debugging this, if anyone has any ideas please let me know.
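The behaviour in steps 1 to 6 is what you would expect if the search compares raw bytes: a title stored in an ISO 8859 table has different bytes from its UTF-8 form. Here is a minimal sketch using ISO 8859-1 as the example table (which table the provider really uses is an assumption). Both `latin1ToUtf8` and `rawContains` are hypothetical helpers, not enigma2 functions.

```cpp
#include <string>

// Hypothetical helper: converts ISO 8859-1 bytes to UTF-8. Latin-1 code
// points equal the byte values, so bytes >= 0x80 become two-byte sequences.
std::string latin1ToUtf8(const std::string &in)
{
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Byte-wise substring search, standing in for a search that does no
// character set conversion at all.
bool rawContains(const std::string &stored, const std::string &needle)
{
    return stored.find(needle) != std::string::npos;
}
```

With a stored title of `"Caf\xE9 Europa"` (Latin-1), searching for the UTF-8 needle `"Caf\xC3\xA9"` finds nothing, while the Latin-1 needle `"Caf\xE9"` matches. Converting the stored title with `latin1ToUtf8` first makes the UTF-8 needle match.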
Re: eepgcache and title searching #11
Posted 10 January 2012 - 13:09
Re: eepgcache and title searching #12
Posted 10 January 2012 - 13:14
> Probably the epgcache converts all results to utf-8 before returning them, but upon searching, it doesn't do any conversion and looks for the exact byte sequence. EPG data is usually not utf-8 encoded (unless you got it from XMLTV).
Yes, but it is doing the conversion before the title comparison. Although maybe this is not correct, as it differs from the one in serviceevent.
eServiceEvent does the following for getting the title name:
    case SHORT_EVENT_DESCRIPTOR:
    {
        const ShortEventDescriptor *sed = (ShortEventDescriptor*)*desc;
        std::string cc = sed->getIso639LanguageCode();
        std::transform(cc.begin(), cc.end(), cc.begin(), tolower);
        int table = encodingHandler.getCountryCodeDefaultMapping(cc);
        if (lang == "---" || lang.find(cc) != -1)
        {
            m_event_name = replace_all(replace_all(convertDVBUTF8(sed->getEventName(), table, tsidonid), "\n", " "), "\t", " ");
            m_short_description = convertDVBUTF8(sed->getText(), table, tsidonid);
            retval = 1;
        }
        break;
    }
epgcache does the following:
    for (descriptorMap::iterator it(eventData::descriptors.begin());
        it != eventData::descriptors.end() && descridx < 511; ++it)
    {
        __u8 *data = it->second.second;
        if (data[0] == 0x4D) // short event descriptor
        {
            std::string title;
            const char *titleptr = (const char*)&data[6];
            int title_len = data[5];
            if (data[6] < 0x20)
            {
                /* custom encoding */
                title = convertDVBUTF8((unsigned char*)titleptr, title_len, 0x40, 0);
                titleptr = title.c_str();
                title_len = title.length();
            }
So the comparisons are comparing UTF-8 to UTF-8, or at least should be.
But the conversion to UTF-8 in the epgcache search doesn't use tsidonid and table; it just hardcodes them to 0x40 and 0.
But I am not sure how to get the values and if that is really the problem.
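To show why the table argument could matter: the same raw byte maps to different characters in different ISO 8859 tables, so decoding with a fixed table in the search and a per-language table elsewhere can give different UTF-8 for the same event. The toy sketch below covers only byte 0xA4 in two tables. `decodeWithTable` is hypothetical and says nothing about how convertDVBUTF8 is actually implemented.

```cpp
#include <string>

// Hypothetical decoder for illustration: supports only ISO 8859-1 (table 1)
// and ISO 8859-15 (table 15). Of the bytes where the two tables differ, only
// 0xA4 is handled; everything else is treated as Latin-1 (a simplification).
std::string decodeWithTable(const std::string &raw, int table)
{
    std::string out;
    for (unsigned char c : raw) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c == 0xA4 && table == 15) {
            out += "\xE2\x82\xAC"; // U+20AC EURO SIGN
        } else {
            // Latin-1 mapping: code point equals byte value.
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

Decoding `"5\xA4"` with table 1 and with table 15 produces two different UTF-8 strings, so a search string built from one decoding would not match a title produced by the other.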
Edited by awx, 10 January 2012 - 13:15.
Re: eepgcache and title searching #13
Posted 10 January 2012 - 14:20
So I think the hardcoded table and tsidonid could be the problem, and the nicest solution without rewriting everything would be to call convertDVBUTF8 on the short event descriptor when the data arrives (I think at that point we should still have the tsid and onid). This should also be done on epgData load. Then the places that currently call convertDVBUTF8 on the short event descriptor would no longer need to.
The alternative would be to have some sort of descriptor to tsid/onid mapping.
Unless someone knows how to find the tsid and onid from the descriptorMap, in which case none of this would be needed.
Re: eepgcache and title searching #14
Posted 11 January 2012 - 19:11
However I would like to investigate modifying the eventData constructor to accept tsid and onid values and convert all text to UTF-8 when it comes in, so that we don't have to worry about the format later.
The eventData constructor seems to be called in only 4 places, one of which would not need to change: it is used for loading cached data, which, if done correctly, will already be in the format we need, so no conversion is needed there.
Can anyone give any insight or help with this?
Re: eepgcache and title searching #15
Posted 11 January 2012 - 21:00
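One of the first steps for a better epg cache / storage would be to move away from the raw eit format.
Converting to utf-8 is indeed part of this.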
There are a few locations where raw eit is retrieved from the epgcache, one of which is for the eit file stored with each recording.
However, an event description is also added to the meta file (single line though), and we could opt to store a plain text description file instead of the raw eit dump.
There are perhaps other raw eit users as well, which would need to be converted.
Re: eepgcache and title searching #16
Posted 11 January 2012 - 22:34
> one of the first steps for a better epg cache / storage would be to move away from the raw eit format.
Thanks, as a first step I will take a look at whether it is possible to convert to UTF-8 when the eventData is first created. This would at least fix text for most cases. I just have to get my dm build working, as I only have my pc build working right now.
Also, here is a patch. It's a bit of a workaround to get title searching working properly.
Re: eepgcache and title searching #17
Posted 11 January 2012 - 22:43
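thx. Could you please add a bit of comment in the commit, explaining what was broken, and how you fixed it?
As it is not a simple self-explanatory +1/-1 line patch.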
Also, the new code looks a lot like a similar block of code in the search function, at first glance.
If it is indeed duplicated, can't we share the code?
Re: eepgcache and title searching #18
Posted 11 January 2012 - 23:06
> thx. Could you please add a bit of comment in the commit, explaining what was broken, and how you fixed it?
I have attached a new patch with a more detailed description.
The code is duplicated from below, but I didn't find a nice way to add this to that code without complicating things too much. My change could probably be refactored into that code, but I haven't attempted that yet. If I manage to convert all titles in eventData to UTF-8 then this change will no longer be needed.
Re: eepgcache and title searching #19
Posted 12 January 2012 - 09:26
Here is an untested, unbuilt, quick mock up of a patch that basically tries to do the same as I have done in the original patch but not duplicating the code.
Re: eepgcache and title searching #20
Posted 12 January 2012 - 10:31
I think instead of using the fix above we only need to modify the eventData constructor and call the modified constructor in only two places.
Right now we call the eventData constructor in 4 places:
twice in void eEPGCache::sectionRead(const __u8 *data, int source, channel_data *channel)
once in void eEPGCache::load()
once in void eEPGCache::privateSectionRead(const uniqueEPGKey &current_service, const __u8 *data)
Since load only loads the data we have previously saved, we don't need to do anything here, as our saved data would already be in the correct format.
If privateSectionRead is only enabled when ENABLE_PRIVATE_EPG is used, then we just have the restriction that private data must have its text encoded in UTF-8.
Therefore we only need to make sectionRead call a modified version of the eventData constructor with tsid and onid so we can convert text when it first arrives.
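A rough sketch of the idea, to make the discussion concrete. None of this is the real eventData class: `EventRecord`, `Decoder` and `titleContains` are invented names, and the decoder is passed in as a callback only so the example stays self-contained. The point is just that one constructor decodes on arrival, the other trusts already-converted saved data, and the search then needs no conversion of its own.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a text decoder such as convertDVBUTF8; the real
// function's signature is not reproduced here.
using Decoder = std::function<std::string(const std::string &raw)>;

// Sketch of an event record that stores its title as UTF-8 from the start.
struct EventRecord {
    std::string titleUtf8;

    // For data arriving from the broadcast: decode once, on ingest.
    EventRecord(const std::string &rawTitle, const Decoder &decode)
        : titleUtf8(decode(rawTitle)) {}

    // For previously saved data, which is assumed to be UTF-8 already.
    explicit EventRecord(std::string savedUtf8)
        : titleUtf8(std::move(savedUtf8)) {}
};

// With every stored title in UTF-8, the search is a plain substring match.
bool titleContains(const EventRecord &ev, const std::string &needleUtf8)
{
    return ev.titleUtf8.find(needleUtf8) != std::string::npos;
}
```

In the real code the decoder would presumably be called with the table and tsid/onid that are known in sectionRead; how those values are obtained there is exactly the part that still needs checking.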
Would someone be willing to help in trying this?
Currently I am having trouble getting the build for my dm working and cannot properly test this.