Welcome


Welcome to my blog for all things related to business quality (processes, systems and ways of working), products and product quality, manufacturing and operations management.

This blog is a mixture of real-world experience, ideas, comments and observations that I hope you'll find interesting.

July 2010

The real meaning of MTBF

Ignore some of the more disparaging descriptions of what ‘M.T.B.F.’ means; it actually stands for Mean Time Between Failures (or, for products that can’t be repaired, the term Mean Time To Failure is often used instead). If the failure rate is constant, the MTBF is simply the inverse of the failure rate: a failure rate expressed per year gives an MTBF in years.

And it isn’t quite what you might think.

What is the MTBF of a 25-year-old human being? 70 years? 80? No, it’s actually over 800 years, which highlights the difference between lifetime and MTBF. Take a large population of, say, 500,000 25-year-olds, follow them for a year, and count how many ‘failed’ (died) in that year – e.g. 600. The failure rate is 600 per 500,000 ‘people-years’, i.e. 0.12% per year, and the MTBF is the inverse of that, which is roughly 830 years. An individual won’t last that long; they will wear out long before then (unless they are Doctor Who). But for the population as a whole, in that ‘high reliability’ portion of their lifespan, it holds true – in a typical year you will only have to ‘replace’ 600 of them.
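The arithmetic above can be sketched in a few lines of Python. This is only an illustration using the round figures from the paragraph (500,000 people, 600 deaths in a year); the function name is my own and the constant-failure-rate assumption is baked in.

```python
def mtbf_years(population, failures_per_year):
    """MTBF in years for a population observed for one year.

    Assumes a constant failure rate, so the MTBF is simply the inverse
    of the annual failure rate (failures per unit-year).
    """
    if failures_per_year <= 0:
        raise ValueError("need at least one failure to estimate MTBF")
    annual_failure_rate = failures_per_year / population
    return 1.0 / annual_failure_rate


# Figures from the example: 600 'failures' among 500,000 25-year-olds.
rate = 600 / 500_000
print(f"{rate:.2%} per year")            # 0.12% per year
print(round(mtbf_years(500_000, 600)))   # 833, which the text rounds to 830
```

Note that nothing here says how long any one individual lasts; it only describes how often failures turn up across the whole population.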

So why measure MTBF? “If you can’t measure it you can’t manage it” – knowing your MTBF allows you to benchmark yourself against competitors and can be a marketing asset; many customers expect you to know and disclose your figures. It also allows you to improve the weak spots in your product range, and is useful feedback for the design process.

There are two main methods for calculating MTBF:

MTBF Prediction is a mathematical model of reliability. It takes the individual MTBFs (or failure rates) of the product’s constituent parts and subassemblies, gleaned from manufacturers’ data or libraries of standard figures, and combines them mathematically into an overall figure. MIL-HDBK-217 (often loosely called MIL-STD-217) was one of the first methods and is still very well known, although other schemes have since come into common usage, such as Telcordia’s SR-332, BT’s HRD5, and others. There are software tools available, from free to megabucks, that help you make the calculations.
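As a rough sketch of what the combining step looks like in the simplest case: if every part has a constant failure rate and any single part failing counts as a product failure (a series model), the part failure rates add up and the product MTBF is the inverse of the total. The part names and MTBF figures below are made up for illustration; they are not taken from MIL-HDBK-217 or any other handbook, and the real prediction standards apply further quality, stress and environment factors on top.

```python
def predicted_mtbf_hours(parts):
    """Series-model MTBF from per-part MTBFs.

    parts maps a part name to (quantity, MTBF of one part in hours).
    Assumes constant failure rates and that any single part failure
    fails the product, so the failure rates simply add up.
    """
    total_failure_rate = sum(qty / mtbf for qty, mtbf in parts.values())
    return 1.0 / total_failure_rate


# Hypothetical bill of materials; the figures are illustrative only.
bom = {
    "decoupling capacitor": (300, 30_000_000.0),
    "op-amp": (40, 4_000_000.0),
    "connector": (10, 1_000_000.0),
}
print(round(predicted_mtbf_hours(bom)))  # 33333
```

Even with very reliable individual parts, a high part count drags the predicted figure down quickly, which is one reason large designs can come out with surprisingly poor predictions.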

These theoretical methods are supposedly based on empirical evidence but have a number of flaws, primarily that (a) the individual parts never actually have the MTBFs you expect of them, and (b) combining them mathematically ignores many of the real-world effects that dominate the MTBF of the whole product. I once designed a large audio mixing desk whose predicted MTBF according to MIL-HDBK-217 was less than 8 minutes; I’m glad to say that, in practice, it was a great deal longer than that!

MTBF Measurement sounds simple in principle; count how many failures you have in a given period of product usage and some easy maths gives you the MTBF. The Devil is in the detail, though – doing statistically meaningful averages over large volumes and long periods is easy, but what about small populations, and what if you need answers quickly rather than waiting for several years?

In practice you have to make some assumptions, the main one being that your failure rate is constant. Now this may not be true; if we take the classic bathtub reliability curve you may have a long drawn-out leading edge with a high level of infant mortality, or you may have a long trailing edge where products start to fail prematurely after relatively little life in the field, but both of these are problems that you would need to do something about urgently. The norm is to have a fairly long period of constant reliability – bumping along the bottom of the bathtub – and in this zone the failure rate over a short period can be extrapolated to the rate that would be achieved over a much longer period… as long as it is within the published lifetime of the product (the MTBF of an 80 year old human is not 830 years!).

So take the date that you shipped a unit to a customer, add a little time for the customer to put it into service, then open up a ‘sampling window’ in time of, say, 6 months to look for any failures. If the failure rate is constant then the number of failures per year is twice the number of failures in the 6 month window. If the units are used 24/7, the MTBF in years equals the number of units in service divided by the number of failures per year (back to 500,000 25 year old humans, divided by 600 failures a year, equals 830 years MTBF). For periodic use, say 8 hours a day, the MTBF expressed in operating hours is scaled down accordingly, because the units have clocked up fewer operating hours per failure.
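Here is a minimal sketch of that sampling-window sum, the sort of thing you would otherwise put in a spreadsheet. The function name, parameter names and the fleet figures are my own assumptions for illustration, and the whole thing relies on the constant-failure-rate assumption described above.

```python
HOURS_PER_YEAR = 8760  # 24 hours x 365 days


def measured_mtbf(units_in_service, failures_in_window, window_months=6.0,
                  hours_used_per_day=24.0):
    """Return (MTBF in calendar years, MTBF in operating hours).

    Assumes a constant failure rate, so the count seen in the sampling
    window is scaled up linearly to a full year.
    """
    if failures_in_window <= 0:
        raise ValueError("no failures observed; MTBF cannot be estimated")
    failures_per_year = failures_in_window * (12.0 / window_months)
    mtbf_calendar_years = units_in_service / failures_per_year
    duty_cycle = hours_used_per_day / 24.0
    mtbf_operating_hours = mtbf_calendar_years * HOURS_PER_YEAR * duty_cycle
    return mtbf_calendar_years, mtbf_operating_hours


# Hypothetical figures: 2,000 units in service, 5 failures in 6 months.
years, hours = measured_mtbf(2000, 5)
print(years, hours)           # 200.0 1752000.0

# Same failure count, but the units only run 8 hours a day.
years, hours = measured_mtbf(2000, 5, hours_used_per_day=8)
print(years, round(hours))    # 200.0 584000
```

The calendar figure is unchanged in the second case, but the MTBF in operating hours is a third of the 24/7 value because each unit has clocked up a third of the running time.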

Don’t be too harsh on yourself, by the way; you wouldn’t normally expect to count units returned as faulty but that turned out to be No Fault Found, or units damaged by the customer or in transit, or units that were prototypes and not expected to have the performance and longevity of production units, or units that had not been properly serviced or maintained or had reached their published end of life, so you can normally exclude these from the calculations.

And how do you define a failure – does the malfunction of a single dashboard bulb in a car mean the whole vehicle has failed? You will want to have sensible, defensible criteria for “fail”.

Now, I plead guilty to dramatically simplifying the subject; what about Mean Time To Repair, what about non-linear failure rates, what about the difference between constant failure rate and constant failure density, what about adding normalising or scaling factors to match different environments? All valid questions and, I’m sorry to say, beyond the scope of this short blog.

However, the key message is that you can calculate MTBF quite easily with a little patience and a simple spreadsheet, and it’s a very useful figure to have.


18 comments to The real meaning of MTBF

  • dear sir,

    I have a FIT rate of 1, so I converted this to an annual failure rate in ppm, which is 8.76 ppm. Can I multiply this by 10 if I need the 10-year failure rate in ppm, so that I get 87.6 ppm over 10 years? Is this a correct way to make this calculation, or is there another formula available somewhere?
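The arithmetic in this question can be checked with a short sketch, assuming a constant failure rate throughout. One FIT is one failure per 10^9 device-hours; multiplying by the number of hours is the linear approximation, and the exact cumulative fraction for a constant rate λ over time t is 1 − e^(−λt). The function name is my own.

```python
import math

HOURS_PER_YEAR = 8760


def fit_to_ppm(fit, years):
    """Convert a FIT rate (failures per 10^9 device-hours) into the
    expected failed fraction, in ppm, after the given number of years.

    Returns (linear approximation, exact value for a constant rate).
    """
    exposure = fit * 1e-9 * years * HOURS_PER_YEAR   # lambda * t
    linear = exposure                                # simple multiplication
    exact = -math.expm1(-exposure)                   # 1 - e^(-lambda * t)
    return linear * 1e6, exact * 1e6


for years in (1, 10):
    linear, exact = fit_to_ppm(1, years)
    print(f"{years:>2} years: linear {linear:.2f} ppm, exact {exact:.2f} ppm")
#  1 years: linear 8.76 ppm, exact 8.76 ppm
# 10 years: linear 87.60 ppm, exact 87.60 ppm
```

At rates this small the two figures agree to two decimal places, so multiplying by 10 is a fair approximation; the gap only becomes noticeable when a sizeable fraction of the population is expected to fail within the period.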

  • Steve

    I’m not sure why but every other place I’ve read about these calculations made it incredibly difficult. Thank you for such a clear and concise explanation.

  • Tom G

    Thanks for the comments, Steve, very kind.

  • Lee

    Dear Tom,
    Amongst all the esoteric stuff in MIL-HDBK-217, the Quality Factor (PiQ) for passives seems the most obscure. I have not yet found a simple way of taking a 100nF 0603 ceramic capacitor (used for decoupling) from Farnell and justifying a particular PiQ to use in calculations. Using the worst PiQ value for a board with over 300 such capacitors can yield very unrealistically poor MTBF figures.
    Any ideas ?

  • Tom G

    Hi Lee

    I’m afraid MIL-HDBK-217, although still used by some people, is a largely discredited process for producing MTBF. The component performance figures (including Quality Factor, PiQ) that are used are rarely accurate and the final result of the analysis often gives ridiculously high or ridiculously low figures as you have found. The component figures for large ICs are often particularly out of date.

    The production of MTBF figures is much more accurate and useful if it is based on actual field failure data, even if future products vary somewhat from past products for which you have actual MTBF measurements.

    Those organisations that do still use predictive MTBF sometimes use modified sets of component figures, e.g. variations of the Bellcore (Telcordia) figures / standard, that are specific to their type of applications rather than using the ‘standard’ published figures; there are specific variants for some different industry and product types. However, this makes it difficult for different industries to share component data. Initiatives such as Vita 51.1 have recently been taken to try to address the shortcomings of MIL-HDBK-217.

    So I don’t have a magic set of better numbers that will resolve your problem; 300 commercial capacitors using the standard figures will give a low MTBF.

    I will send you a few more details by email.

    Thanks for contacting the Quality and Products blog.

  • Neil Camargo

    Dear Tom,

    Knowing the lifetime of the product, is it possible to obtain the MTBF (following your reasoning, but in reverse)?

    Thanks.

  • Tom G

    Hi Neil

    Unfortunately not or, at least, not under most circumstances. The reasoning is that MTBF usually measures faults whereas lifetime usually measures… lifetime! i.e. when does the product wear out? Sometimes they are the same. Often they are not.

    Take a domestic light bulb. One 40W mains bulb of a certain make and type often lasts about the same amount of time as another; if you replace two identical bulbs in a light fixture at the same time (assuming they both come on at the same time whenever it’s used) they will often blow within a short time of each other at their end of life. Light bulbs are pretty reliable; if they don’t blow on first turn-on they usually last for their prescribed number of hours. MTBF will be similar to lifetime and knowing one figure means you know the other.

    But take my human analogy, or even a hard disk drive, where the MTBF is very much higher than the lifetime unlike the light bulb. If you know that the lifetime of a human is, say, 80 years you can’t get from that figure to the MTBF of a 25 year old because the failure mechanisms are usually different. As the human ages, the failure mechanisms and figures converge (the MTBF of an 80 year old human is very similar to the lifetime), but at 25 years old the main failure mechanism is accidents not old age!

    Does this help?

    Tom

  • Neil Camargo

    OK Tom, I understood perfectly, but what about an electromechanical component?

    example:

    I am studying the life of a slip ring and the only information I have is:

    Operating life >= 50,000,000 revolutions

    We use it at 15 rpm, which gives a lifetime of 55,555 hours

    I think I could not use an exponential distribution, OK? Which distribution would be the most appropriate? Weibull?

    Could I consider the lifetime to be equal to the MTBF?

    Thank you very much.

  • Tom G

    Hi Neil

    Well, a simple mechanical bearing or slip-ring will have quite a lot in common with my light bulb analogy as in there will be a fairly small % of early life fails (flaws in the material, contamination, assembly errors etc) but then a fairly predictable life. If you ignore the small % of early life fails then MTBF will be very similar to product lifetime; assuming competent manufacturing, even if you include the early fails they should be similar numbers.

    Complex products (electronic chips or circuit boards) have much more to go wrong between early infant mortality and wear-out; their MTBF is far less predictable and often quite different to product lifetime.

    With your slip-ring, do you have a maintenance regime? For instance, if the mean operating life is 55,000 hours you could instruct the user to replace the slip-ring every year, say (roughly 9,000 hours of continuous use), and MTBF will shoot up as you will be left with only the early life fails and those that wear out unusually quickly (probably because of other manufacturing or material defects). This is how data centres get such high hard disk drive reliability: they don’t run them until they fail, they replace them after a planned number of hours of operation (and monitor performance, and have redundancy, and run them cool, and… but I digress). It’s similar to the 830 years MTBF for a 25 year old human argument: if you replaced every factory worker when they reached 25 you would achieve a very high MTBF for your workforce; if you let them get older and older so that they ran out of lifetime, the net workforce MTBF wouldn’t reach 830, it would be somewhat less than their lifetime of 80 years.

    So what distribution should be used to model end-of-life failures? Well, as with everything it depends on the characteristics of the item you are modelling, i.e. the nature of the data, but Weibull is often used as the preferred technique for lifetime modelling and failure analysis (and requires fewer data samples than some techniques) so is the place I would start. However, I’m sure you are aware that you can get all sorts of shapes and results from Weibull, so make sure you understand the parameters and are applying it correctly. And remember, there are “lies, damned lies and statistics”!

    I wish you every success with the study.

    Tom
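To illustrate the point about Weibull producing all sorts of shapes, here is a minimal sketch of the standard two-parameter Weibull formulas. The characteristic life of 55,000 hours is a hypothetical value chosen purely for illustration, not a fitted figure for any real slip-ring, and the three shape values are simply picked to show the three regimes.

```python
import math


def weibull_reliability(t, shape, scale):
    """Fraction of units still working at time t: exp(-(t/scale)^shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate: (shape/scale) * (t/scale)^(shape-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_mean_life(shape, scale):
    """Mean time to failure: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)


SCALE_HOURS = 55_000.0  # hypothetical characteristic life, for illustration

for shape in (0.5, 1.0, 3.0):
    early = weibull_hazard(1_000.0, shape, SCALE_HOURS)
    late = weibull_hazard(50_000.0, shape, SCALE_HOURS)
    if late < early:
        trend = "falling (infant mortality)"
    elif late > early:
        trend = "rising (wear-out)"
    else:
        trend = "constant (exponential case)"
    mean_life = weibull_mean_life(shape, SCALE_HOURS)
    print(f"shape={shape}: failure rate {trend}, mean life {mean_life:,.0f} h")
```

With a shape parameter of 1 the Weibull collapses to the exponential distribution and the mean life equals the scale parameter; shapes below 1 model early-life failures and shapes above 1 model wear-out, which is the regime a slip-ring nearing its rated life would be expected to fall into.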

  • Don Meaker

    If you have a complex piece of equipment, and a maintenance concept in which failed parts are replaced with new parts, the ages of the components get mixed. In that situation, which is quite common for end items but not at all common as yet for humans or spacecraft, Drenick’s Theorem asserts that the failure rate quickly approaches an average, and the overall times between failures are exponentially distributed (i.e. the failure rate is constant). The failure distributions of the various components don’t roll up into a failure distribution for the end item. Rudy Drenick proved it around 1960. The proof is complex, but it sure simplifies the math for reliability engineers.

  • Tom G

    Interesting, thanks for the contribution Don. Yes, it could be quite difficult to do conventional maths for a complex, random, ‘replace all parts as they fail (or reach end of nominal life)’ scenario.

    Tom

  • Pankaj Rana

    Hi Tom G,

    Thanks for sharing your views on this topic; MTBF is not as simple as it sounds. Great contribution, I would say.
    I would like to know about MIL-217F: although it is old and unreliable for today’s products, it is still preferred for MTBF calculations. Can you please shed some light on this?
    And furthermore, can you please suggest an alternative?

    Best Regards
    Pankaj Rana

  • Tom G

    Hi Pankaj

    MIL-HDBK-217 (MIL-STD-217) was the original although, as I mention above, there are more recent approaches such as Vita 51.1 (which I know of, but am not an expert in by any means) that try to provide greater accuracy and relevance to today’s technology. A lot of the success of these predictive approaches comes down to the reliability models you use and whether there are any unknown / unquantified defect mechanisms that dominate reliability.

    An interesting approach is to run a predictive model then refine it over a long time-frame based on actual, real-world measured reliability data, to fine-tune its parameters or algorithms. But that takes a lot of time and a lot of data and really only works for big, long-established corporations or those with access to similar large data sets.

    I tend to just use actual MTBF figures then extrapolate to the new products (‘twice as complex’, or ‘much lower operating temperature’, or… etc.) to make an estimate of a comparable new product. I have clients who ask me from time to time about MIL-HDBK-217 because their customers require predictions based on it, but when we use actual achieved MTBF that usually satisfies them, even if some extrapolation is required, because it is based on real-world observations.

    But maybe someone reading this blog will take issue with me and tell us all about a high success rate MTBF predictor!

    Best regards

    Tom

  • ashu

    Dear friend, I want to know the meaning of 55555 in environmental stress screening tests or in relation to the reliability of equipment.

  • Tom G

    I believe that JSS 55555 is an Indian specification for Environmental Tests for Electronic and Electrical Equipment used in defence.

  • Brett F

    Hi Tom, I work in healthcare and we are looking at developing some easy graphical information for owners of medical equipment to determine the health of their equipment. We want to explore MTBF but have some concerns with this. We want to look at an annual MTBF and a cumulative MTBF. Both of these are based solely on corrective maintenance, by the way. The cumulative value will change over time as the years add up over the correctives performed… this is easy. The one we are concerned about is the annual average MTBF, which would indicate when equipment is wearing out, for example. The problem is with conducting the simple average over a fixed year. If the equipment breaks down once that year, on day 182, then no problem: the MTBF is 182 days. But what are your thoughts if the equipment breaks down on day 2 but runs well for the rest of the year (MTBF = 182), or if the equipment doesn’t break down for 2 years and then breaks down 6 times in year three? Would years one and two have an MTBF of 365 and year three 60? On a graph this may be overshadowed by the 365, but would that be correct? If you think about it too hard it makes your head hurt 🙂

  • Tom G

    Hi Brett.

    Yes, you can have ‘fun’ with MTBF! I have been doing a lot of that recently.

    At the risk of over-simplifying it, why not think of it not in terms of the year in which it fails but the year in which it was built? So you count the time that the equipment has clocked up in operational use, year in and year out, and record the failures, then do the MTBF sums. So you can see the MTBF per year of build and spot any trends over time (albeit with a delay).

    So the equipment that breaks down several times in a short period will accumulate a lot of short MTBF figures and will show a poor net MTBF. But it will refer back to the year of build not the year of failure.

    Does this make any sense or have I just confused the situation further?!

    Tom

  • Brett

    Hi Tom, yes it does make sense. Our team met yesterday and we came to the same conclusion. We decided that we will try to plot the MTBF on a line chart every time there is a corrective. Instead of doing yearly snapshots we will plot cumulative hours over cumulative correctives and see what that looks like. The X axis may still represent the year so that users can understand the timeline better. It will be interesting to see the result.
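The cumulative-hours-over-cumulative-correctives idea could be sketched as below. This is only one possible reading of the approach described in the comment; the function name and the maintenance log are hypothetical, with the log loosely following the scenario of two trouble-free years and then six breakdowns in year three on equipment running 24/7.

```python
def cumulative_mtbf(corrective_hours):
    """Running MTBF after each corrective maintenance event.

    corrective_hours: cumulative operating hours at which each
    corrective was logged, in ascending order.
    Returns a list of (hours, MTBF so far) points for a line chart.
    """
    points = []
    for count, hours in enumerate(corrective_hours, start=1):
        points.append((hours, hours / count))
    return points


# Hypothetical log: nothing for two years (17,520 hours), then six
# correctives spread through year three.
log = [17_600, 18_500, 19_900, 21_700, 23_900, 26_000]
for hours, mtbf in cumulative_mtbf(log):
    print(f"{hours:>6} h  cumulative MTBF = {mtbf:,.0f} h")
```

Plotted as a line, the figure starts high and falls steadily as the correctives accumulate, so a run of failures shows up as a clear downward trend rather than being masked by a good year.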
