On this topic of bit flips on reads,

The logic:

    if (!bi->gc_prioritise) {
        bi->gc_prioritise = 1;
        dev->has_pending_prioritised_gc = 1;

Is going to tell the Garbage Collection routine(s) to GC this block.

1. That process will result in the movement/refreshing of that block's data, correct?

2. If this is correct, when is GC performed? Is it on any write operation, or does a separate thread have to be provided to call the GC routines?

If the act of doing GC on that block will perform the refresh operation, then the logic:

    bi->chunk_error_strikes++;
    if (bi->chunk_error_strikes > 3) {
        bi->needs_retiring = 1; /* Too many strikes, so retire */
        yaffs_trace(YAFFS_TRACE_ALWAYS,
            "yaffs: Block struck out");

Is not valid here, as the operation was to refresh the block, not say that it is bad.

The check of

    tags->ecc_result > YAFFS_ECC_RESULT_NO_ERROR

has to be changed to say:

    if (tags->ecc_result == EUCLEAN)
        - indicate to GC this block
    else if (tags->ecc_result > YAFFS_ECC_RESULT_NO_ERROR)
        - this exceeded the threshold and the data read is bad.

The problem is that, should another read from that location occur BEFORE the GC of the block happens, you may get a failure. That is why the block needs to get moved ASAP. (See question 2.)
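To make the proposed split concrete, here is a minimal sketch in C. It is not the actual Yaffs code: every name prefixed with sketch_ is a hypothetical stand-in (the real struct yaffs_block_info, struct yaffs_dev and enum yaffs_ecc_result are not reproduced here), and keeping the strike policy for uncorrectable reads is my assumption, not something the existing code or this thread confirms.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the real Yaffs definitions. */
enum sketch_ecc_result {
	SKETCH_ECC_NO_ERROR,      /* clean read */
	SKETCH_ECC_NEEDS_REFRESH, /* corrected, but driver reported -EUCLEAN */
	SKETCH_ECC_UNCORRECTABLE  /* the data read is bad */
};

struct sketch_block_info {
	unsigned gc_prioritise : 1;
	unsigned needs_retiring : 1;
	unsigned chunk_error_strikes;
};

struct sketch_dev {
	unsigned has_pending_prioritised_gc : 1;
};

/* Ask GC to move this block's data as soon as it next runs. */
void sketch_prioritise_gc(struct sketch_dev *dev,
			  struct sketch_block_info *bi)
{
	if (!bi->gc_prioritise) {
		bi->gc_prioritise = 1;
		dev->has_pending_prioritised_gc = 1;
	}
}

/* Returns 1 if the data read can be used, 0 if it must be treated as bad. */
int sketch_handle_read_result(struct sketch_dev *dev,
			      struct sketch_block_info *bi,
			      enum sketch_ecc_result res)
{
	assert(dev != NULL && bi != NULL);

	switch (res) {
	case SKETCH_ECC_NO_ERROR:
		return 1;
	case SKETCH_ECC_NEEDS_REFRESH:
		/* Refresh only: queue for GC, no strike is counted. */
		sketch_prioritise_gc(dev, bi);
		return 1;
	case SKETCH_ECC_UNCORRECTABLE:
	default:
		/* Assumption: real failures still use the strike policy. */
		sketch_prioritise_gc(dev, bi);
		bi->chunk_error_strikes++;
		if (bi->chunk_error_strikes > 3)
			bi->needs_retiring = 1;
		return 0;
	}
}
```

With this shape, a block that only ever reports "needs refresh" is queued for GC again and again but never accumulates strikes, so it is never retired for bit flips alone.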

Can anyone answer how GC works and when?

Chris Gofforth / Pr Software Engineer

MS 131-102, Cedar Rapids, IA, USA

Phone: 319-295-0373 Fax: 319-295-8100

Chris.Gofforth@rockwellcollins.com

www.rockwellcollins.com

On Wed, Aug 6, 2014 at 2:26 AM, bpqw <bpqw@micron.com> wrote:

Hi Charles,
We recommend that if the bitflips exceed the threshold, the block should just be refreshed, not retired.
So we doubt whether it is reasonable to judge a block as bad just because its bitflips have exceeded mtd->bitflip_threshold more than three times.

Br
White Ding
____________________________
EBU APAC Application Engineering
Tel:86-21-38997078
Mobile: 86-13761729112
Address: No 601 Fasai Rd, Waigaoqiao Free Trade Zone Pudong, Shanghai, China

-----Original Message-----
From: Charles Manning [mailto:cdhmanning@gmail.com]
Sent: Wednesday, August 06, 2014 8:21 AM
To: yaffs@lists.aleph1.co.uk
Cc: bpqw
Subject: Re: [Yaffs] bad block management

On Friday 25 July 2014 16:50:25 bpqw wrote:
> Hi
>
> I have reviewed the yaffs2 source code and have a doubt. See the following.
>
>
>
> In Yaffs2 the read interface is yaffs_rd_chunk_tags_nand:
>
> int yaffs_rd_chunk_tags_nand(struct yaffs_dev *dev, int nand_chunk,
>                              u8 *buffer, struct yaffs_ext_tags *tags)
> {
>     .........
>     result = dev->tagger.read_chunk_tags_fn(dev, flash_chunk,
>                                             buffer, tags);
>     if (tags && tags->ecc_result > YAFFS_ECC_RESULT_NO_ERROR) {
>         struct yaffs_block_info *bi;
>
>         bi = yaffs_get_block_info(dev,
>                                   nand_chunk /
>                                   dev->param.chunks_per_block);
>         yaffs_handle_chunk_error(dev, bi);
>     }
>     return result;
> }
>
>
>
> The yaffs_rd_chunk_tags_nand will call the mtd interface mtd_read_oob
>
>
>
> int mtd_read_oob(struct mtd_info *mtd, loff_t from, struct mtd_oob_ops *ops)
> {
>     int ret_code;
>
>     ops->retlen = ops->oobretlen = 0;
>     if (!mtd->_read_oob)
>         return -EOPNOTSUPP;
>     /*
>      * In cases where ops->datbuf != NULL, mtd->_read_oob() has semantics
>      * similar to mtd->_read(), returning a non-negative integer
>      * representing max bitflips. In other cases, mtd->_read_oob() may
>      * return -EUCLEAN. In all cases, perform similar logic to mtd_read().
>      */
>     ret_code = mtd->_read_oob(mtd, from, ops);
>     if (unlikely(ret_code < 0))
>         return ret_code;
>     if (mtd->ecc_strength == 0)
>         return 0; /* device lacks ecc */
>     return ret_code >= mtd->bitflip_threshold ? -EUCLEAN : 0;
> }
>
>
>
> So if the number of bitflips reaches mtd->bitflip_threshold, mtd_read_oob
> will return -EUCLEAN and tags->ecc_result will be > YAFFS_ECC_RESULT_NO_ERROR.
>
> Then we will call yaffs_handle_chunk_error:
>
> void yaffs_handle_chunk_error(struct yaffs_dev *dev,
>                               struct yaffs_block_info *bi)
> {
>     if (!bi->gc_prioritise) {
>         bi->gc_prioritise = 1;
>         dev->has_pending_prioritised_gc = 1;
>         bi->chunk_error_strikes++;
>
>         if (bi->chunk_error_strikes > 3) {
>             bi->needs_retiring = 1; /* Too many strikes, so retire */
>             yaffs_trace(YAFFS_TRACE_ALWAYS,
>                 "yaffs: Block struck out");
>         }
>     }
> }
>
>
>
> From the code we can see that if the number of bitflips reaches
> mtd->bitflip_threshold, the block is marked for gc, and if that happens
> more than three times, the block is marked as a bad block.
>
>
>
> We define a bad block as one where an erase or program operation has failed.
>
> So is it reasonable to judge a block as bad just because its bitflips have
> exceeded mtd->bitflip_threshold more than three times?
>
> What's your opinion about my doubts?

Hello White Ding

I apologise for taking a while to get back to looking at this.

First let me explain the history behind what is there.

In the beginning, there was SLC and Yaffs only distinguished three levels:
* Good: No ECC errors.
* Single bit ECC error: data is recoverable, but we are worried about a future failure.
* Multi-bit ECC error: bad.

In the beginning, the concern was that blocks with a single bit error were on their way to going bad, so we had better retire them soon.

Then bits got a bit worse, so we modified the policy slightly. A block with a single bit error got rewritten, but if too many errors were observed then we retired the block.

Then with MLC and multi-bit ECC errors we moved up to a new step. Single bit errors became common. Yaffs kept the same basic policy, but the drivers (at mtd level) started telling "lies".

For example in a multi-bit ECC system that fixes 4 bits, we might see:
0-2 bit errors are reported as zero errors.
3-4 bit errors are reported as -EUCLEAN.

This is essentially the logic you are talking about here, but I need to dig into the mtd terminology a bit better to understand this fully.

Some flash parts (e.g. Micron MT29F8Gxxx parts) with built-in ECC do not report the number of bit errors, but just a "please refresh" indicator.

I think we are now getting to a point where increasing numbers of bit errors are expected and should not be treated as a failure.

Thus we probably need a new level that does a refresh, but does not apply the three strikes failure policy.

For example, for something that supports 6-bit correction we might want something like this:
0-2: These are expected, do nothing.
3-4: Refresh. Do not retire.
5-6: It looks like the block is failing. Suck the data off and retire if this happens too often.
7+: Data is corrupted.

If there are enough bits to make bands like this then it makes sense. However parts that hide the bad bits behind an ONFI-like interface do not really give us the data we need to make fine grained decisions.
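The banding above could be sketched roughly as follows. This is only an illustration, assuming the driver can report a bit-flip count at all; the type and function names (sketch_bands, sketch_action, sketch_classify_bitflips) are hypothetical and do not exist in Yaffs or mtd.

```c
#include <assert.h>

/* Hypothetical policy actions, one per band in the example above. */
enum sketch_action {
	SKETCH_DO_NOTHING,         /* expected level of bit flips */
	SKETCH_REFRESH,            /* rewrite the data, do not retire */
	SKETCH_REFRESH_AND_STRIKE, /* move data off, count towards retiring */
	SKETCH_DATA_CORRUPT        /* beyond what the ECC can correct */
};

/* Upper limits (inclusive) of each band, in corrected bits. */
struct sketch_bands {
	unsigned ignore_max;      /* e.g. 2 */
	unsigned refresh_max;     /* e.g. 4 */
	unsigned correctable_max; /* e.g. 6, the ECC strength */
};

enum sketch_action sketch_classify_bitflips(const struct sketch_bands *b,
					    unsigned bitflips)
{
	/* The bands only make sense if their limits are ordered. */
	assert(b->ignore_max <= b->refresh_max);
	assert(b->refresh_max <= b->correctable_max);

	if (bitflips <= b->ignore_max)
		return SKETCH_DO_NOTHING;
	if (bitflips <= b->refresh_max)
		return SKETCH_REFRESH;
	if (bitflips <= b->correctable_max)
		return SKETCH_REFRESH_AND_STRIKE;
	return SKETCH_DATA_CORRUPT;
}
```

For the 6-bit example the limits would be {2, 4, 6}. A part that only gives a "please refresh" indicator cannot feed a function like this, which is the limitation described above.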

I hope that helps.

-- Charles

_______________________________________________
yaffs mailing list
yaffs@lists.aleph1.co.uk
http://lists.aleph1.co.uk/cgi-bin/mailman/listinfo/yaffs