Re: [Yaffs] bad block management

Author: Charles Manning
Date:  
To: Chris Gofforth
CC: bpqw, yaffs@lists.aleph1.co.uk
Subject: Re: [Yaffs] bad block management
On Thu, Aug 7, 2014 at 2:50 AM, Chris Gofforth wrote:

> On this topic of bit flips on reads,
>
> The logic:
>
>       if (!bi->gc_prioritise) {
>             bi->gc_prioritise = 1;
>             dev->has_pending_prioritised_gc = 1;
>
> Is going to tell the Garbage Collection routine(s) to GC this block.
>
> 1.    That process will result in the movement/refreshing of
> that block's data, correct?

>
>

That is correct.

Doing a gc on a block finds all the live data on the block and writes it
elsewhere. Thus, gc can be used to do the heavy lifting for data refreshing.
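
As a rough illustration of that idea, here is a toy model in C. It is not the Yaffs implementation: the struct, the function names and the "fewest live chunks" fallback are assumptions made for the sketch, and it ignores any ordering rules a real file system may impose on block selection. It only shows a block flagged gc_prioritise being chosen first and its live chunks being copied to another block.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CHUNKS_PER_BLOCK 4

/* Hypothetical, simplified block record; not the real struct yaffs_block_info. */
struct toy_block {
	int gc_prioritise;               /* set after an ECC event on a read */
	int live[TOY_CHUNKS_PER_BLOCK];  /* 1 = chunk holds live data */
	int data[TOY_CHUNKS_PER_BLOCK];
};

static int toy_live_count(const struct toy_block *b)
{
	int i, n = 0;

	for (i = 0; i < TOY_CHUNKS_PER_BLOCK; i++)
		n += b->live[i];
	return n;
}

/*
 * Pick a gc victim, never the destination block 'dst'.
 * A prioritised block always wins; otherwise fall back to the block
 * with the fewest live chunks (an assumed heuristic for this toy).
 */
static int toy_pick_gc_block(const struct toy_block *blocks, int nblocks, int dst)
{
	int i, best = -1;

	assert(blocks != NULL);
	for (i = 0; i < nblocks; i++) {
		if (i == dst)
			continue;
		if (blocks[i].gc_prioritise)
			return i;
		if (best < 0 ||
		    toy_live_count(&blocks[i]) < toy_live_count(&blocks[best]))
			best = i;
	}
	return best;
}

/*
 * Copy every live chunk of 'src' into free slots of 'dst', then treat
 * 'src' as erased. Returns the number of chunks moved, or -1 if 'dst'
 * does not have enough room.
 */
static int toy_gc_block(struct toy_block *src, struct toy_block *dst)
{
	int i, j = 0, moved = 0;

	assert(src != NULL && dst != NULL && src != dst);
	if (toy_live_count(src) + toy_live_count(dst) > TOY_CHUNKS_PER_BLOCK)
		return -1;
	for (i = 0; i < TOY_CHUNKS_PER_BLOCK; i++) {
		if (!src->live[i])
			continue;
		while (dst->live[j])
			j++;                 /* room was checked above */
		dst->data[j] = src->data[i]; /* the rewrite is the refresh */
		dst->live[j] = 1;
		src->live[i] = 0;
		moved++;
	}
	src->gc_prioritise = 0;
	return moved;
}
```

In this toy, a block with three live chunks that has been flagged is picked ahead of a block with only one live chunk, and after toy_gc_block() its data sits in the destination block.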

Prioritising this block tells the gc to select this block as soon as it can.
With yaffs1 mode, the block can always be selected as the next block for
gc. With yaffs2 that is not always possible because there are certain rules
required to prevent "violating history". There is some text describing
"shrink headers" in the HowYaffsWorks doc that explains these rules.


> 2.    If this is correct, when is GC performed? Is it on any write
> operation, or does a separate thread have to be provided to call the GC
> routines?

>
>

There are two paths to gc:
* gc happens parasitically as part of any operation which writes to the
flash ( file/directory creation/deletion, writing a file,...)
* [optional] gc can also be performed in a gc thread.
The benefit of using a gc thread in addition to the parasitic gc is that
this cleans up things in background so the normal flash operations run
faster.
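
The two paths can be pictured with the following toy sketch. Everything in it (struct toy_dev, its fields, the function names) is hypothetical and only models the bookkeeping: a bounded gc step that is called both from the write path and from a function that an optional background thread would run in a loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device state; the field names are illustrative only. */
struct toy_dev {
	int dirty_blocks;      /* blocks waiting to be collected */
	int gc_in_write_path;  /* gc steps paid for by foreground writes */
	int gc_in_background;  /* gc steps done by the optional thread */
	int writes_done;
};

/* One bounded unit of gc work; returns 1 if a block was collected. */
static int toy_gc_step(struct toy_dev *dev)
{
	assert(dev != NULL);
	if (dev->dirty_blocks == 0)
		return 0;
	dev->dirty_blocks--;
	return 1;
}

/* Path 1: parasitic gc, run as a side effect of any operation that writes. */
static void toy_write_chunk(struct toy_dev *dev)
{
	dev->gc_in_write_path += toy_gc_step(dev);
	dev->writes_done++;    /* the actual flash write would go here */
}

/* Path 2: one pass of an optional background thread's loop. */
static void toy_background_gc_pass(struct toy_dev *dev)
{
	dev->gc_in_background += toy_gc_step(dev);
}
```

With three dirty blocks, two background passes followed by two writes leave only one gc step to be paid for in the write path, which is the sense in which the thread makes foreground operations faster.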

>
>
> If the act of doing GC on that block will perform the refresh operation,
> then the logic:
>
>
>             bi->chunk_error_strikes++;
>
>             if (bi->chunk_error_strikes > 3) {
>                   bi->needs_retiring = 1; /* Too many strikes, so retire */
>                   yaffs_trace(YAFFS_TRACE_ALWAYS,
>                         "yaffs: Block struck out");
> Is not valid here, as the operation was to refresh the block, not say that
> it is bad.

>


The rationale here is that we're taking the cautious approach.

We do not want to wait until data is lost before we retire blocks.

Instead, we're trying to identify blocks that need a lot of refreshing and
weed them out.
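
To make the effect of that policy concrete, here is a self-contained toy version of the strike logic quoted in this thread. The names are hypothetical, the dev->has_pending_prioritised_gc flag is left out, and the sketch assumes that a gc of the block clears the priority flag but leaves the strike count alone.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_STRIKES 3  /* matches the "> 3" test quoted in this thread */

/* Hypothetical cut-down block record with only the fields used here. */
struct toy_block_info {
	int gc_prioritise;
	int chunk_error_strikes;
	int needs_retiring;
};

/* Same shape as the quoted handler: one strike per refresh request. */
static void toy_handle_chunk_error(struct toy_block_info *bi)
{
	assert(bi != NULL);
	if (bi->gc_prioritise)
		return;  /* already queued for gc: no extra strike */
	bi->gc_prioritise = 1;
	bi->chunk_error_strikes++;
	if (bi->chunk_error_strikes > TOY_MAX_STRIKES)
		bi->needs_retiring = 1;
}

/* Assumption: gc clears the priority flag and keeps the strike count. */
static void toy_block_refreshed(struct toy_block_info *bi)
{
	assert(bi != NULL);
	bi->gc_prioritise = 0;
}
```

Under these assumptions, repeated ECC events on a block that is still waiting for gc count once, and a block is marked for retirement on its fourth refresh request, before any data has actually been lost.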


>
> the check of
>
> tags->ecc_result > YAFFS_ECC_RESULT_NO_ERROR
>
> has to be changed to say:
>
> if (tags->ecc_result == EUCLEAN )
> - indicate to GC this block
> else
> if tags->ecc_result > YAFFS_ECC_RESULT_NO_ERROR
> - this exceeded the threshold and the data read is bad.
>
>
> The problem is, should another read from that location occur BEFORE the GC
> of the block happens, you may get a failure. That is why the block needs to
> get moved ASAP. (See question 2).
>


Yup that's correct.
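
A split of this kind could be sketched as below, using the example bands for a 6-bit-correcting part from the earlier message quoted further down in this thread (0-2, 3-4, 5-6, 7+). The enum and function names are hypothetical, the band limits are only that example's numbers, and the sketch assumes the driver reports an actual bitflip count, which some parts do not.

```c
#include <assert.h>

/* Hypothetical action codes; nothing here exists in Yaffs or mtd. */
enum toy_ecc_action {
	TOY_ECC_NOTHING,            /* 0-2 flips: expected, do nothing */
	TOY_ECC_REFRESH,            /* 3-4 flips: gc the block, no strike */
	TOY_ECC_REFRESH_AND_STRIKE, /* 5-6 flips: gc and count towards retiring */
	TOY_ECC_DATA_LOST           /* 7+ flips: beyond what the ECC can fix */
};

/* Map a reported bitflip count to an action for a 6-bit-correcting part. */
static enum toy_ecc_action toy_classify_bitflips(int bitflips)
{
	assert(bitflips >= 0);
	if (bitflips <= 2)
		return TOY_ECC_NOTHING;
	if (bitflips <= 4)
		return TOY_ECC_REFRESH;
	if (bitflips <= 6)
		return TOY_ECC_REFRESH_AND_STRIKE;
	return TOY_ECC_DATA_LOST;
}
```

Only the third band would feed the strike counter, so a block that merely needs refreshing now and then is never retired for it.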


>
> Can anyone answer how GC works and when?
>
>
> Chris Gofforth / Pr Software Engineer
>
> MS 131-102, Cedar Rapids, IA, USA
>
> Phone: 319-295-0373 Fax: 319-295-8100
>
>
>
> www.rockwellcollins.com
>
>
>
> On Wed, Aug 6, 2014 at 2:26 AM, bpqw wrote:
>
>> Hi Charles,
>> We recommend that if the bitflip count is over the threshold, the block
>> should just be refreshed, not retired.
>> So we doubt whether it is reasonable to judge a block bad just because its
>> bitflips exceeded mtd->bitflip_threshold more than three times.
>>
>> Br
>> White Ding
>> ____________________________
>> EBU APAC Application Engineering
>> Tel:86-21-38997078
>> Mobile: 86-13761729112
>> Address: No 601 Fasai Rd, Waigaoqiao Free Trade Zone Pudong, Shanghai,
>> China
>>
>> -----Original Message-----
>> From: Charles Manning [mailto:cdhmanning@gmail.com]
>> Sent: Wednesday, August 06, 2014 8:21 AM
>> To:
>> Cc: bpqw
>> Subject: Re: [Yaffs] bad block management
>>
>> On Friday 25 July 2014 16:50:25 bpqw wrote:
>> > Hi
>> >
>> > I have reviewed the yaffs2 source code and have a doubt. See the following.
>> >
>> >
>> >
>> > In Yaffs2 the read interface is yaffs_rd_chunk_tags_nand:
>> >
>> > int yaffs_rd_chunk_tags_nand(struct yaffs_dev *dev, int nand_chunk,
>> >                        u8 *buffer, struct yaffs_ext_tags *tags) {
>> >       .........
>> >       result = dev->tagger.read_chunk_tags_fn(dev, flash_chunk,
>> >                                               buffer, tags);
>> >
>> >       if (tags && tags->ecc_result > YAFFS_ECC_RESULT_NO_ERROR) {
>> >             struct yaffs_block_info *bi;
>> >
>> >             bi = yaffs_get_block_info(dev,
>> >                                 nand_chunk /
>> >                                 dev->param.chunks_per_block);
>> >             yaffs_handle_chunk_error(dev, bi);
>> >       }
>> >
>> >       return result;
>> > }
>> >
>> >
>> >
>> > The yaffs_rd_chunk_tags_nand will call the mtd interface mtd_read_oob
>> >
>> >
>> >
>> > int mtd_read_oob(struct mtd_info *mtd, loff_t from, struct mtd_oob_ops *ops)
>> > {
>> >       int ret_code;
>> >
>> >       ops->retlen = ops->oobretlen = 0;
>> >       if (!mtd->_read_oob)
>> >             return -EOPNOTSUPP;
>> >       /*
>> >        * In cases where ops->datbuf != NULL, mtd->_read_oob() has semantics
>> >        * similar to mtd->_read(), returning a non-negative integer
>> >        * representing max bitflips. In other cases, mtd->_read_oob() may
>> >        * return -EUCLEAN. In all cases, perform similar logic to mtd_read().
>> >        */
>> >       ret_code = mtd->_read_oob(mtd, from, ops);
>> >       if (unlikely(ret_code < 0))
>> >             return ret_code;
>> >       if (mtd->ecc_strength == 0)
>> >             return 0;   /* device lacks ecc */
>> >       return ret_code >= mtd->bitflip_threshold ? -EUCLEAN : 0;
>> > }

>> >
>> >
>> >
>> > So if the number of bitflips reaches mtd->bitflip_threshold, mtd_read_oob
>> > will return -EUCLEAN and tags->ecc_result > YAFFS_ECC_RESULT_NO_ERROR.
>> >
>> > Then yaffs_handle_chunk_error is called.
>> >
>> > void yaffs_handle_chunk_error(struct yaffs_dev *dev,
>> >                         struct yaffs_block_info *bi)
>> > {
>> >       if (!bi->gc_prioritise) {
>> >             bi->gc_prioritise = 1;
>> >             dev->has_pending_prioritised_gc = 1;
>> >             bi->chunk_error_strikes++;
>> >
>> >             if (bi->chunk_error_strikes > 3) {
>> >                   bi->needs_retiring = 1; /* Too many strikes, so retire */
>> >                   yaffs_trace(YAFFS_TRACE_ALWAYS,
>> >                         "yaffs: Block struck out");
>> >             }
>> >       }
>> > }
>> >
>> >
>> >
>> > From the code we can see that if the bitflip count reaches
>> > mtd->bitflip_threshold, the block is marked for gc, and if that happens
>> > more than three times, the block is marked as a bad block.
>> >
>> >
>> >
>> > We define a bad block as one where an erase or program operation failed.
>> >
>> > So is it reasonable to judge a block bad just because its bitflips
>> > reached mtd->bitflip_threshold more than three times?
>> >
>> > What's your opinion about my doubts?
>>
>> Hello White Ding
>>
>> I apologise for taking a while to get back to looking at this.
>>
>> First let me explain the history behind what is there.
>>
>> In the beginning, there was SLC and Yaffs only supported three levels:
>> * Good: No ECC errors.
>> * Single bit ECC error: data is recoverable, but we are worried about a
>> future failure.
>> * Multi-bit ECC error: bad.
>>
>> In the beginning, the concern was that blocks with a single bit error
>> were on their way to going bad, so we had better retire them soon.
>>
>> Then bits got a bit worse, so we modified the policy slightly. A block
>> with a single bit error got rewritten, but if too many errors were observed
>> then we retired the block.
>>
>> Then with MLC and multi-bit ECC errors we moved up to a new step. Single
>> bit errors became common. Yaffs kept the same basic policy, but the drivers
>> (at mtd level) started telling "lies".
>>
>> For example in a multi-bit ECC system that fixes 4 bits, we might see:
>> 0-2 bit errors are reported as zero errors.
>> 3-4 bit errors are reported as -EUCLEAN.
>>
>> This is essentially the logic you are talking about here, but I need to
>> dig into the mtd terminology a bit better to understand this fully.
>>
>> Some flash parts (eg Micron MT29F8Gxxx parts) with built-in ECC do not
>> report the number of bit errors, but just a "please refresh" indicator.
>>
>> I think we are now getting to a point where increasing numbers of bit
>> errors are expected and should not be treated as a failure.
>>
>> Thus we probably need a new level that does a refresh, but does not apply
>> the three strikes failure policy.
>>
>> For example, for something that supports 6-bit correction we might want
>> something like this:
>> 0-2: These are expected, do nothing.
>> 3-4: Refresh. Do not retire.
>> 5-6: It looks like the block is failing. Suck the data off and retire if
>> this happens too often.
>> 7+: Data is corrupted.
>>
>> If there are enough bits to make bands like this then it makes sense.
>> However parts that hide the bad bits behind an ONFI-like interface do not
>> really give us the data we need to make fine grained decisions.
>>
>> I hope that helps.
>>
>> -- Charles
>>
>>
>> _______________________________________________
>> yaffs mailing list
>>
>> http://lists.aleph1.co.uk/cgi-bin/mailman/listinfo/yaffs
>>
>
>