When I compare the two files, I see that the difference starts at position hex 2E0000.
The "corrupt" file ends at hex 33DF80 and the "good" file at hex 35DF80, a difference of hex 20000
(dec 3399552 and dec 3530624 = difference 131072)
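The numbers above can be cross-checked with plain shell arithmetic. This is only a small sketch: it assumes an erase block size of 131072 bytes (the value reported for the Solo2 later in this thread, which may not hold for the affected Solo), and the function names are made up for illustration.

```sh
#!/bin/sh
# Assumption: 128 KiB erase blocks (hex 20000); not confirmed for the affected box.
ERASE_BLOCK=131072

# Difference in bytes between two sizes or offsets (hex or decimal accepted).
size_diff() {
    echo $(( $1 - $2 ))
}

# Index of the erase block that contains a given offset (block 0 is the first).
block_index() {
    echo $(( $1 / ERASE_BLOCK ))
}

size_diff 0x35DF80 0x33DF80      # good dump minus corrupt dump
block_index 0x2E0000             # block in which the two files start to differ
```

Under that assumption the size difference is exactly one erase block and the first differing offset is the start of block 23, which would fit a single skipped block.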
Posted 24 April 2014 - 18:36
Yes, you're right. So nanddump thinks that there is a bad block and skips it (the default setting). If you add the "--bb=dumpbad" parameter, the bad block is not skipped. And it seems that it doesn't contain bad data, because I could decompress that file. The smaller file threw an error.
$ gzip -d -t vmlinux.gz
gzip: vmlinux.gz: decompression OK, trailing garbage ignored
$ gzip -d -t vmlinux2.gz
gzip: vmlinux2.gz: invalid compressed data--format violated
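The same check could be wrapped in a small helper so that a script can tell a usable dump from a broken one. This is a sketch that assumes GNU gzip exit codes (0 = OK, 2 = warning such as trailing garbage, anything else = error); the gzip on the box may behave differently, and check_kernel_gz is a made-up name.

```sh
#!/bin/sh
# Classify a gzip-compressed kernel dump by the exit status of "gzip -t".
# Assumption: GNU gzip semantics (0 = OK, 2 = warning, other = error).
check_kernel_gz() {
    status=0
    gzip -t "$1" 2>/dev/null || status=$?
    case $status in
        0) echo "OK" ;;
        2) echo "OK, trailing data after the gzip stream" ;;
        *) echo "CORRUPT"; return 1 ;;
    esac
}
```

A dump padded up to the end of an erase block would fall into the second case, which matches the "trailing garbage ignored" message above.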
So why does nanddump think that the block is bad? The bootloader seems to see it in a different way or ignores it.
Can you execute this:
mtdinfo -M /dev/mtd1
And if you like, you can test the nand memory with the command below (I don't know what happens when it really marks the block as bad. Then perhaps your box won't boot!):
nandtest -k /dev/mtd1
-k means, according to the help, "Restore existing contents after test"
Posted 24 April 2014 - 20:14
I have the same problem with backup on a vu+ solo, and once I replace kernel_cfe_auto.bin with the bigger file produced as described in post 267, restoration works fine. This only happened on the openpli image; backup and restore on black hole, openvix and original vu+ work fine without any file replacements. By the way, I am using openpli 3.0.
Edited by nohimx, 24 April 2014 - 20:15.
Posted 25 April 2014 - 06:34
It could be that the kernel of the other images is smaller. Then the bad block is not used.
I already tried 5 different receivers from 2 different suppliers with the same result. Also, the kernel_cfe_auto.bin size from the original black hole image is 4063232, and the openpli original is 3507770.
Posted 25 April 2014 - 21:58
@scottyboyz & nohimx
Could you please try one more thing? It is a bit of a long shot, but maybe it works.
Could you try the command
/usr/sbin/nanddump -l 0x00400000 /dev/mtd1 > /tmp/kernel_cfe_auto.bin
and report the exact size?
Maybe the kernel will be dumped in its full size; if so, your problem could be solved relatively easily.
Edited by Pedro_Newbie, 25 April 2014 - 22:01.
Posted 26 April 2014 - 07:08
In the next version I will implement a check to compare the dumped kernel size with the real kernel size; if the sizes aren't equal, the kernel will be dumped again including bad blocks.
But first I need to know what the result is of the command in the post above on the solo with problems.
Posted 26 April 2014 - 17:29
First I have to apologize: I only checked the ipk from the beginning of the thread and found the old nanddump binary, and as you still use the old cli options I was assuming that you still use the old binaries.
But you didn't read my entire reply and follow my suggestion to use nand_check to see if and where the bad block is located, which means you are trying to fix something without nailing down and verifying the likely root cause. If you don't like my binary, nand_check.c is easy to find and compile yourself, and it nicely displays all bad blocks and their locations so you can see if one is inside or outside of the kernel data in flash.
Just doing size check(s) of the resulting file, as you intend to do, is NOT really a proper solution. If you used nand_check to see if the kernel contains bad blocks and then extracted differently, it would make more sense in my opinion. Or do an unzip to see if it is OK and then decide what to do.
But the whole way you currently extract the kernel is not bulletproof either, as the kernel can have any size and is NOT always aligned to an erase block. This means your way of extracting works only most of the time, because the kernel is zipped and the decompression will either drop empty space you might have extracted or assume it is compressed to nothing (= all 0).
A correct extraction should get data only until empty space starts and this could be even within an erase block. Please check the -t(runcate) option that I added to nanddump to correctly extract the secondstage loader from Dreamboxes. I posted the patch adding this option in the DMM board.
And finally, if you have something on a raw device and the bad block marking is wrong, such strange things can happen. Only when you use a flash filesystem like jffs2 or ubifs will the filesystem layer handle these situations; in your case you are relying on the flash tool and the bios, as there is no further crc checking in a filesystem to detect and handle corruption beyond the basic mtdblock handling.
DMM implemented the recover bad blocks option in their bios to re-check the blocks by marking all good, re-writing and crc checking, and then either keeping them as good or re-marking them as bad, but most other boxes ignore this problem on their raw partitions. There are mtd tools which can intentionally mark blocks as good (or as bad) and verify all flash blocks too. So there is a way of handling this from the operating system, but this will not really help you if the problem causes the kernel not to boot anymore. Then you are depending on the flash erase and write functionality of the bios.
Ciao
gutemine
Edited by gutemine, 26 April 2014 - 17:33.
Posted 26 April 2014 - 18:01
Thanks for your answer. I will look into it, but as I said I'm absolutely no coder/programmer, so I'm not hindered by any knowledge (as you'll have seen in the IPK).
You'll have to help me a bit with how to solve this (I think).
I will search for the binaries, study them and try to solve it, but I think I will report back here for some advice if you don't mind.
For me it is hard to test something when I don't have the machine which is causing the trouble, so I can't see if I'm on the right track.
Posted 26 April 2014 - 18:50
OK found the nand_check and the patched nanddump
All I see when I run nand_check is
root@vusolo2:~# /home/root/nand_check /dev/mtd2
Flash type is 4
Block size 131072, page size 2048, OOB size 64
7340032 bytes, 56 blocks
==========================..............
................root@vusolo2:~#
I don't know what the output is on a "faulty" kernel, and I also don't know how this binary works in detail, because there is no --help available.
I also tried nanddump, and this gives the following result on the Solo2:
root@vusolo2:~# /home/root/nanddump -t /dev/mtd2 > /tmp/kernel.bin
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 0
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00700000...
truncate at 0x003364e0...
root@vusolo2:~#
This works nicely and the kerneldump is 3.368.160 bytes
But again, I can only guess what the output is on the faulty system, as I'm not able to test.
Could you enlighten me a bit on
- how to check the kernel mtd with nand_check, what the output is in case bad blocks are detected, and whether this is repaired automatically?
- does nanddump -t work on the solo with the probably affected mtd1, or does it only work after a repair with nand_check?
Posted 27 April 2014 - 10:02
See dFlash help:
B Bad block
.. Empty block
- Partially filled block
= Full block
and there is also a mark for blocks with jffs2 summary data, but this is not needed in your case.
So normally the kernel will be just a column of = and maybe the last one will be - and now the question is if on the box with the problem a B is shown within the ====B====
But you could implement a simple check by just nand_checking kernel partition and grepping for B and if one is found you issue a warning that the kernel could be corrupt.
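The grep check suggested above could look roughly like the sketch below. It assumes that the block map lines consist only of the characters B . - = and it deliberately skips the header lines, because "Block size ..." itself contains a capital B and a plain grep for B would always match. The function name is made up, and the mark for jffs2 summary blocks is not handled.

```sh
#!/bin/sh
# Read nand_check output on stdin; succeed if the block map contains a "B".
# Assumption: map lines contain only the characters B . - = (nothing else).
map_has_bad_block() {
    grep -E '^[B.=-]+$' | grep -q 'B'
}

# Possible use on the box (path and partition as used in this thread):
#   /home/root/nand_check /dev/mtd1 | map_has_bad_block \
#       && echo "Warning: bad block in kernel partition, dump may be corrupt"
```

Filtering on whole map lines keeps the check independent of the exact header text.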
Then the next question is whether it is really bad or just marked bad, because this decides whether it should be unmarked to get proper behaviour, dumped anyway despite the bad mark as you tried (not a good idea in my opinion, as you would need to check at least the crc of this block to be sure that it is not really bad) or really ignored.
And if the kernel is erase block aligned it should not make any difference if you use -t or not on nanddump, BUT you can easily check this by md5summing the extracted one, an unzipped and re-zipped one, and finally an original one where you know it is correct.
BUT I have to be a little careful with my advice on how to repair maybe intentionally corrupted/bad blocks, as the VU clone attack partly uses this method to brick the boxes, and I will definitely NOT give anybody an escape out of this problem that they caused themselves by buying a clone.
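A variation on the md5sum comparison, as a sketch: instead of re-zipping (which can produce different bytes even for identical data), it compares the MD5 of the decompressed kernel, so padding after the gzip stream does not matter. This is a swapped-in approach, not the exact procedure described above, and both function names are hypothetical.

```sh
#!/bin/sh
# Print the MD5 of the decompressed kernel inside a gzip dump.
# Trailing padding after the gzip stream does not change the result.
payload_md5() {
    gzip -dc "$1" 2>/dev/null | md5sum | cut -d ' ' -f 1
}

# Succeed if two dumps decompress to identical data.
same_payload() {
    [ "$(payload_md5 "$1")" = "$(payload_md5 "$2")" ]
}
```

Two dumps that are both broken in the same way would still compare equal, so this only makes sense next to a known good original or together with a gzip -t check.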
gutemine
Edited by gutemine, 27 April 2014 - 10:06.
Posted 27 April 2014 - 15:15
But in the case of the solo with troubles:
- Will nand_check report a bad block if it is not already marked as such, or is nand_check able to detect a bad block by testing and mark this block as such, or does this have to be done with nandtest?
- If there is a bad block in the mtd reported by nand_check, shouldn't this block be unused/skipped in the first place?
But all of this is of little use if there isn't someone with an affected Solo to test some things.
Maybe I'll have to dump the kernel with nanddump, compare the size of the dumped file with the size of /dev/mtdx, and if they differ display a message and then test for bad blocks in /dev/mtdx.
If there are no bad blocks, then dump again, maybe with the length specified and the --dumpbad parameter, AND show the message that it could be possible that the image can't be restored with this back-up (and maybe the advice to check /dev/mtdx with nandtest -m -k )
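The decision flow sketched in this post could be written down as a small pure function, so it can be tried out without the affected box. Everything here is illustrative: the function name and the three result strings are made up, the example sizes are the Solo2 partition size from earlier in the thread (and that size minus one 131072-byte block), and what to do when a bad block is actually marked is left open.

```sh
#!/bin/sh
# Decide the next backup step from three inputs (all names are illustrative):
#   $1 = size of the dumped file in bytes
#   $2 = size of the mtd partition in bytes
#   $3 = 1 if the block map shows a marked bad block, else 0
next_step() {
    if [ "$1" -eq "$2" ]; then
        echo "keep-dump"
    elif [ "$3" -eq 1 ]; then
        echo "warn-bad-block-marked"
    else
        echo "redump-with-bad-blocks-and-warn"
    fi
}

next_step 7340032 7340032 0    # sizes match
next_step 7208960 7340032 0    # one block short, nothing marked bad
```

Keeping the decision separate from the actual nanddump call makes it easy to test the logic on a box that does not show the problem.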
Posted 27 April 2014 - 16:15
If nobody with this problem is willing to provide inputs then why should you fix it?
I just tried to show you how to investigate and which workarounds are correct and which are incorrect.
And no, nand_check doesn't test/fix anything, it just scans the blocks and displays what it finds. So this is just good for investigation, and that is exactly why I ship it as part of dFlash: to get useful inputs in case people have problems.
nandwrite would have all the options to handle problems during write (like --markbad or --noskipbad), but if the bios doesn't, it is not your fault.
I would have binaries which can switch blocks intentionally bad to test such things out (and also to revert it), but as I said they could be misused, so you will have to help yourself if you want to test it on your own box.
Edited by gutemine, 27 April 2014 - 16:17.
Posted 27 April 2014 - 19:40
Hi gutemine and pedro,
I find this issue very interesting, but without having feedback, we currently don't know anything. Well only an own research could perhaps help... Or I'll ask the guys in the other forum whether they want to test some things.
@Gutemine: Nanddump says there is no bad block (see post http://openpli.org/f...ndpost&p=418785), but it doesn't dump the whole partition. So I guess there is a bad block which is not marked as a bad block. But I thought that you can only know that a block is bad when you write to that block and it then doesn't contain what you have written. And I guess that nanddump only reads the flash and doesn't write to it.
Posted 27 April 2014 - 20:42
I read a little bit in the code. The first statistic shown by nanddump is an ECC statistic. In our case it shows 0 errors. Afterwards, mtd_is_bad() is called for every erase block. Because nanddump doesn't dump one of the blocks, I guess that it is marked as bad.
@scottyboyz & nohimx: Could you please execute:
mtdinfo -M /dev/mtd1
or use nand_check?
And I would really like to know whether the ECC statistic looks at blocks which are marked as bad blocks or not.