On this topic of bit flips on reads, the logic:

    if (!bi->gc_prioritise) {
        bi->gc_prioritise = 1;
        dev->has_pending_prioritised_gc = 1;

is going to tell the garbage collection routine(s) to GC this block.

1. That process will result in the movement/refreshing of that block's
   data, correct?
2. If this is correct, when is GC performed? Is it on any write
   operation, or does a separate thread have to be provided to call the
   GC routines?

If the act of doing GC on that block performs the refresh operation,
then the logic:

    bi->chunk_error_strikes++;
    if (bi->chunk_error_strikes > 3) {
        bi->needs_retiring = 1; /* Too many strikes, so retire */
        yaffs_trace(YAFFS_TRACE_ALWAYS, "yaffs: Block struck out");

is not valid here, as the operation was to refresh the block, not to say
that it is bad. The check of

    tags->ecc_result > YAFFS_ECC_RESULT_NO_ERROR

has to be changed to say:

    if (tags->ecc_result == EUCLEAN)
        - indicate to GC this block
    else if tags->ecc_result > YAFFS_ECC_RESULT_NO_ERROR
        - this exceeded the threshold and the data read is bad.

The problem is, should another read from that location occur BEFORE the
GC of the block happens, you may get a failure. That is why the block
needs to get moved ASAP (see question 2).

Can anyone answer how GC works and when?

Chris Gofforth / Pr Software Engineer
MS 131-102, Cedar Rapids, IA, USA
Phone: 319-295-0373
Fax: 319-295-8100
Chris.Gofforth@rockwellcollins.com
www.rockwellcollins.com

On Wed, Aug 6, 2014 at 2:26 AM, bpqw wrote:
> Hi Charles,
> We recommend that if the bitflips are over the threshold we just refresh
> the block and do not retire it.
> So is it reasonable to judge the block a bad block just because the
> bitflips went over mtd->bitflip_threshold more than three times?
>
> Br
> White Ding
> ____________________________
> EBU APAC Application Engineering
> Tel:86-21-38997078
> Mobile: 86-13761729112
> Address: No 601 Fasai Rd, Waigaoqiao Free Trade Zone Pudong, Shanghai,
> China
>
> -----Original Message-----
> From: Charles Manning [mailto:cdhmanning@gmail.com]
> Sent: Wednesday, August 06, 2014 8:21 AM
> To: yaffs@lists.aleph1.co.uk
> Cc: bpqw
> Subject: Re: [Yaffs] bad block management
>
> On Friday 25 July 2014 16:50:25 bpqw wrote:
> > Hi
> >
> > I have reviewed the yaffs2 source code and have a doubt. See the
> > following.
> >
> > In Yaffs2 the read interface is yaffs_rd_chunk_tags_nand:
> >
> > int yaffs_rd_chunk_tags_nand(struct yaffs_dev *dev, int nand_chunk,
> >                              u8 *buffer, struct yaffs_ext_tags *tags)
> > {
> >         .........
> >         result = dev->tagger.read_chunk_tags_fn(dev, flash_chunk,
> >                                                  buffer, tags);
> >         if (tags && tags->ecc_result > YAFFS_ECC_RESULT_NO_ERROR) {
> >                 struct yaffs_block_info *bi;
> >
> >                 bi = yaffs_get_block_info(dev,
> >                                           nand_chunk /
> >                                           dev->param.chunks_per_block);
> >                 yaffs_handle_chunk_error(dev, bi);
> >         }
> >         return result;
> > }
> >
> > The yaffs_rd_chunk_tags_nand will call the mtd interface mtd_read_oob:
> >
> > int mtd_read_oob(struct mtd_info *mtd, loff_t from,
> >                  struct mtd_oob_ops *ops)
> > {
> >         int ret_code;
> >         ops->retlen = ops->oobretlen = 0;
> >         if (!mtd->_read_oob)
> >                 return -EOPNOTSUPP;
> >         /*
> >          * In cases where ops->datbuf != NULL, mtd->_read_oob() has
> >          * semantics similar to mtd->_read(), returning a non-negative
> >          * integer representing max bitflips. In other cases,
> >          * mtd->_read_oob() may return -EUCLEAN. In all cases, perform
> >          * similar logic to mtd_read().
> >          */
> >         ret_code = mtd->_read_oob(mtd, from, ops);
> >         if (unlikely(ret_code < 0))
> >                 return ret_code;
> >         if (mtd->ecc_strength == 0)
> >                 return 0; /* device lacks ecc */
> >         return ret_code >= mtd->bitflip_threshold ? -EUCLEAN : 0;
> > }
> >
> > So if the number of bitflips is over mtd->bitflip_threshold,
> > mtd_read_oob will return -EUCLEAN and
> > tags->ecc_result > YAFFS_ECC_RESULT_NO_ERROR.
> >
> > Then we will call yaffs_handle_chunk_error.
> >
> > void yaffs_handle_chunk_error(struct yaffs_dev *dev,
> >                               struct yaffs_block_info *bi)
> > {
> >         if (!bi->gc_prioritise) {
> >                 bi->gc_prioritise = 1;
> >                 dev->has_pending_prioritised_gc = 1;
> >                 bi->chunk_error_strikes++;
> >
> >                 if (bi->chunk_error_strikes > 3) {
> >                         /* Too many strikes, so retire */
> >                         bi->needs_retiring = 1;
> >                         yaffs_trace(YAFFS_TRACE_ALWAYS,
> >                                     "yaffs: Block struck out");
> >                 }
> >         }
> > }
> >
> > From the code we can see that if the number of bitflips is over
> > mtd->bitflip_threshold we mark this block for GC, and if that happens
> > more than three times we mark this block as a bad block.
> >
> > Our definition of a bad block is: if an erase or program fails, we can
> > mark the block as a bad block.
> >
> > So is it reasonable to judge the block a bad block just because the
> > bitflips went over mtd->bitflip_threshold more than three times?
> >
> > What's your opinion about my doubts?
>
> Hello White Ding
>
> I apologise for taking a while to get back to looking at this.
>
> First let me explain the history behind what is there.
>
> In the beginning, there was SLC and Yaffs only supported two levels:
> * Good: No ECC errors.
> * Single bit ECC error: data is recoverable, but we are worried about a
>   future failure.
> * Multi-bit ECC error: bad.
>
> In the beginning, the concern was that the blocks with a single bit error
> were on their way to going bad, so we had better retire them soon.
>
> Then bits got a bit worse, so we modified the policy slightly. A block
> with a single bit error got rewritten, but if too many errors were
> observed then we retired the block.
>
> Then with MLC and multi-bit ECC errors we moved up to a new step. Single
> bit errors became common. Yaffs kept the same basic policy, but the
> drivers (at mtd level) started telling "lies".
>
> For example in a multi-bit ECC system that fixes 4 bits, we might see:
> 0-2 bit errors are reported as zero errors.
> 3-4 bit errors are reported as -EUCLEAN.
>
> This is essentially the logic you are talking about here, but I need to
> dig into the mtd terminology a bit better to understand this fully.
>
> Some flash parts (eg Micron MT29F8Gxxx parts) with built in ECC do not
> report the number of bit errors, but just a "please refresh" indicator.
>
> I think we are now getting to a point where increasing numbers of bit
> errors are expected and should not be treated as a failure.
>
> Thus we probably need a new level that does a refresh, but does not apply
> the three strikes failure policy.
>
> For example, say something that supports 6 bit correcting we might want
> something like this:
> 0-2: These are expected, do nothing.
> 3-4: Refresh. Do not retire.
> 5-6: It looks like the block is failing. Suck the data off and retire if
>      this happens too often.
> 7+: Data is corrupted.
>
> If there are enough bits to make bands like this then it makes sense.
> However parts that hide the bad bits behind an ONFI-like interface do not
> really give us the data we need to make fine grained decisions.
>
> I hope that helps.
>
> -- Charles
>
>
> _______________________________________________
> yaffs mailing list
> yaffs@lists.aleph1.co.uk
> http://lists.aleph1.co.uk/cgi-bin/mailman/listinfo/yaffs
>
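
The banded policy Charles outlines for a 6-bit-correcting part could be
sketched in C roughly as follows. This is only an illustration under stated
assumptions: the enum, struct block_state (a stand-in for struct
yaffs_block_info), both function names and MAX_STRIKES are hypothetical and
are not part of Yaffs. The band edges are the ones from his example, and
counting a strike on every 5-6 bit read is also an assumption.

```c
#include <stdbool.h>

/* Hypothetical strike limit, mirroring the "> 3" test in the quoted code. */
#define MAX_STRIKES 3

/* Hypothetical action levels; none of these names exist in Yaffs. */
enum flip_action {
    FLIP_IGNORE,   /* 0-2 flips: expected, do nothing */
    FLIP_REFRESH,  /* 3-4 flips: refresh the block, do not retire */
    FLIP_STRIKE,   /* 5-6 flips: refresh and count a strike */
    FLIP_CORRUPT   /* 7+ flips: beyond the ECC, data is corrupted */
};

/* Hypothetical stand-in for the relevant fields of struct yaffs_block_info. */
struct block_state {
    bool gc_prioritise;
    bool needs_retiring;
    int chunk_error_strikes;
};

/* Map a non-negative bitflip count onto the bands for 6-bit correction. */
enum flip_action classify_bitflips(int flips)
{
    if (flips <= 2)
        return FLIP_IGNORE;
    if (flips <= 4)
        return FLIP_REFRESH;
    if (flips <= 6)
        return FLIP_STRIKE;
    return FLIP_CORRUPT;
}

/*
 * Update the block state for one read and return the chosen action so the
 * caller can report a read failure on FLIP_CORRUPT.
 */
enum flip_action handle_read_result(struct block_state *bi, int flips)
{
    enum flip_action action = classify_bitflips(flips);

    switch (action) {
    case FLIP_IGNORE:
        break;
    case FLIP_REFRESH:
        /* Ask GC to move the data, but do not count a strike. */
        bi->gc_prioritise = true;
        break;
    case FLIP_STRIKE:
        /* Move the data and retire the block if this happens too often. */
        bi->gc_prioritise = true;
        if (++bi->chunk_error_strikes > MAX_STRIKES)
            bi->needs_retiring = true;
        break;
    case FLIP_CORRUPT:
        /* Block state is left alone here; the caller handles the error. */
        break;
    }
    return action;
}
```

With these bands a read that sees 4 flips only sets gc_prioritise, while the
fourth read that sees 5 or 6 flips sets needs_retiring. Such a scheme needs
the driver to report an actual bitflip count, which parts that only give a
"please refresh" indicator cannot do.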