In this article, Colin Hargis, chief engineer at Control Techniques, looks at the relationship between reliability and failure in variable speed drives.
The meaning of reliability might seem obvious, but it is necessary to think carefully about what we mean by it. This is not to cloud the issue, just to make sure we focus on what really matters. For example, there is often confusion between reliability and life expectancy, both of which are important but are not necessarily related.
If you purchase an item of equipment then you hope that it will work correctly for as long as it is required. Nothing is perfect, so you accept that there is a small chance that it might fail earlier than expected. It might be damaged, or it might fail for no apparent reason. If the latter occurs within the warranty period then you expect it to be replaced free of charge. You also accept that it will eventually wear out or become obsolete. Depending on the nature of the equipment, you might accept that it requires periodic maintenance, possibly including replacement of wearing parts.
The term “fail” means that the equipment no longer does what you need. It might simply stop working completely, or it might change its characteristics so that it no longer meets a necessary aspect of its specification.
It is possible that the equipment might stop working because it has encountered a situation in which it is not designed to work. For example, the ambient temperature might be too high, or some other aspect of the environment might be outside the intended range, so a protective trip occurs which can be reset when the situation is corrected. For a variable speed drive (VSD), it might be that the load torque was too high, or the power supply was disturbed, or an unexpected control signal state occurred. If this happens frequently then the user might consider the equipment to be unreliable even though it is not faulty and meets its specification. The perceived reliability might depend to some extent on how clear and helpful the diagnostic data generated by the equipment is. It might also be that careful setting of the VSD could have avoided the trip by adjusting to the situation, for example by generating and using an alarm which warns that some parameter is moving close to the trip level, or by changing the operating mode so that the drive automatically attempts to re-start after a trip. For applications where high availability is needed it is important to review the abnormal conditions which might occur and to ensure that the drive is set up to behave appropriately when they are encountered.
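The idea of an alarm which warns before a trip level is reached can be sketched in a few lines. This is an illustrative sketch only: the function name, the heatsink temperature example and the threshold values are assumptions made here, not parameters of any particular drive.

```python
def classify_reading(reading: float, trip_level: float, alarm_margin: float) -> str:
    """Classify a monitored value as 'ok', 'alarm' or 'trip'.

    The alarm band starts `alarm_margin` below the trip level, so the
    operator gets a warning while the drive is still running.
    """
    if alarm_margin < 0:
        raise ValueError("alarm_margin must not be negative")
    if reading >= trip_level:
        return "trip"
    if reading >= trip_level - alarm_margin:
        return "alarm"
    return "ok"


# Hypothetical heatsink temperatures in degrees C, trip at 95, alarm from 85.
for temperature in (70.0, 88.0, 96.0):
    print(temperature, classify_reading(temperature, trip_level=95.0, alarm_margin=10.0))
```

The alarm state gives the plant a chance to reduce the load or improve cooling before production is interrupted.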
If you are responsible for a large plant with many items of equipment operating together then you have to accept that at any one time a certain proportion of the equipment will be faulty and undergoing repair or replacement. In this case you probably have a target for the availability of the whole plant and of each piece of equipment, and you plan the maintenance and repair process based on known statistics and the availability target. This is the situation where data such as Mean Time Between Failures (MTBF) is most often used.
The concept of MTBF implies a large number of identical items of equipment operating continuously, each replaced immediately if it fails. The MTBF is then simply the mean time between successive failure events multiplied by the number of items, usually given in hours. If you also know the mean time to repair/replace then you can plan for a target mean availability, or arrange a degree of redundancy to give increased availability if required.
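The arithmetic can be illustrated with a short sketch. The fleet size, failure count and repair time below are invented figures, and the availability formulas are the usual steady-state approximations for independent units; they are assumptions added here rather than figures from the article.

```python
def fleet_mtbf_hours(units: int, hours_observed: float, failures: int) -> float:
    """MTBF = accumulated unit-hours divided by the number of failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to estimate an MTBF")
    return units * hours_observed / failures


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one unit which is repaired on failure."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(unit_availability: float, n_parallel: int) -> float:
    """Availability when any one of n independent parallel units is sufficient."""
    return 1.0 - (1.0 - unit_availability) ** n_parallel


# Hypothetical figures: 500 drives running for one year (8760 h), 10 failures,
# and a mean time to repair/replace of 24 h.
mtbf = fleet_mtbf_hours(units=500, hours_observed=8760, failures=10)
a_single = availability(mtbf, mttr_hours=24)
a_pair = redundant_availability(a_single, n_parallel=2)
print(f"MTBF {mtbf:.0f} h, availability {a_single:.6f}, duplicated {a_pair:.9f}")
```

With these figures the fleet accumulates 4,380,000 unit-hours for 10 failures, giving an MTBF of 438,000 hours.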
Mean Time To Failure (MTTF) is strictly the more correct measure for a single item of equipment, whereas MTBF applies to a complete system which is repaired on failure. For electronic equipment where the repair/replace time is far shorter than the MTTF, the difference between these measures is negligible.
Mature electronic equipment exhibits random failures at a constant rate during its working life. Every failure actually has a root cause, so it is not strictly random. However since the equipment typically contains large numbers of small components, each of which has a very low failure rate, the overall effect is of a low but random failure pattern.
MTTF/MTBF data is only useful if the failure rate is constant, which means that failures are random in time. If items are wearing out, or exhibit a raised rate of early-life failure, or an outside event occasionally triggers multiple failures, then the simple statistics do not work. The calculation of MTTF is discussed further below, under MTTF/MTBF by calculation and by field failure analysis.
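A constant failure rate corresponds to the exponential model of reliability, in which the probability of a unit surviving for a time t is exp(-t / MTTF). The sketch below assumes that model and uses invented figures; the function names are chosen here for illustration.

```python
import math


def survival_probability(hours: float, mttf_hours: float) -> float:
    """Probability that one unit runs for `hours` without a random failure,
    assuming a constant failure rate of 1 / MTTF."""
    return math.exp(-hours / mttf_hours)


def expected_failures(units: int, hours: float, mttf_hours: float) -> float:
    """Expected number of failures in a fleet whose failed units are
    replaced immediately, under the same constant-rate assumption."""
    return units * hours / mttf_hours


# Hypothetical drive with an MTTF of 1,000,000 h, operated for 5 years.
five_years = 5 * 8760
print(f"survival over 5 years: {survival_probability(five_years, 1_000_000):.4f}")
print(f"expected failures in a fleet of 200: {expected_failures(200, five_years, 1_000_000):.2f}")
```

Under this model only about 37% of units (exp(-1)) would survive for a time equal to the MTTF, which is one reason why an MTTF figure for random failures should not be read as a life expectancy.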
Some equipment has a life expectancy which is limited by one or more wear-out mechanisms. This is most common for moving parts which wear. In electronic drives this applies to cooling fans and possibly relays. Some electronic components have significant wear-out mechanisms; this particularly applies to electrolytic capacitors but it may also apply to power semiconductors and even connectors etc. where there is mechanical wear or fatigue caused by thermal cycling.
The life expectancy of individual items has random variation between samples, so a statistical measure is required. One measure of life expectancy for a device with a known wear-out mechanism is the L10 parameter, which is the time of operation until 10% of a large sample of devices has failed. Occasionally L1 data may be available. Sometimes an MTTF figure is also given to indicate life expectancy. There is great scope for confusion here, because devices of any complexity also exhibit random failures during their working life. It must therefore be made clear whether an MTTF figure refers to end of life or to random failures in normal service.
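The L10 definition can be applied directly to a set of measured lifetimes. The sketch below uses a simple rank-based estimate on an invented sample of 20 fan lifetimes; a real analysis would normally use a much larger sample and often a fitted distribution, which is not attempted here.

```python
import math


def l_life(lifetimes_hours, percent_failed: float = 10.0) -> float:
    """Operating time by which `percent_failed` percent of the sample has
    failed (L10 when percent_failed is 10, L1 when it is 1)."""
    if not lifetimes_hours:
        raise ValueError("lifetimes_hours must not be empty")
    ordered = sorted(lifetimes_hours)
    # Number of failed units needed to reach the requested percentage.
    failed = max(1, math.ceil(len(ordered) * percent_failed / 100.0))
    return ordered[failed - 1]


# Hypothetical lifetimes, in hours, for a sample of 20 cooling fans.
FAN_LIFETIMES = [
    41000, 52000, 38000, 60500, 47000, 55000, 49500, 58000, 44000, 51000,
    36500, 62000, 53500, 46000, 57000, 50000, 43000, 59000, 48000, 54000,
]

print("L10:", l_life(FAN_LIFETIMES), "h")
```

For this sample the second failure occurs at 38,000 hours, so that is the L10 estimate.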
Wear-out and random failure are separate mechanisms. It is therefore possible for a device to have good reliability (a long random MTTF) and a short life expectancy, or the reverse.
Wear-out is also quite likely to depend on the operating environment. For example, the life expectancy of a fan is very dependent on the air temperature, the speed of operation and the presence of dust or other contaminants.
Equipment can exhibit a raised failure rate when it is new or little used. This is caused by parts which have flaws which were not revealed by testing but are revealed by use or time. One of the skills of the equipment manufacturer is to design a product test routine which reveals flaws as effectively as possible, but without using excessive stress which might actually cause failures or incipient failures.
Maintenance is an opportunity to manage parts with known limited life expectancy, at the cost of down-time and labour. Replacement can be according to a simple schedule or by measuring some indicative parameter (on-condition maintenance). The equipment should be designed to facilitate maintenance, either by easy access to life-limited parts or easy exchange of the whole equipment.
Equipment reliability is always sensitive to its environment. For electrical equipment, the temperature is important because many component degradation mechanisms are accelerated by increased temperature. Other critical parameters which have to be controlled are humidity, the presence of corrosive or electrically conducting substances, mechanical effects such as air blockages, shock and vibration, and electromagnetic influences of many kinds. The equipment will have a specification for these parameters. The specification needs to be reasonable for the intended application.
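One common way of estimating how temperature accelerates a degradation mechanism is the Arrhenius model. It is shown here only as an illustration; the article does not name a specific model, and the activation energy of 0.7 eV used as the default is an assumed value which in practice depends on the component and the mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def acceleration_factor(t_ref_c: float, t_actual_c: float,
                        activation_energy_ev: float = 0.7) -> float:
    """Arrhenius acceleration factor for running at `t_actual_c` instead of
    `t_ref_c` (both in degrees C). Values above 1 mean faster degradation.
    The default activation energy is an assumption for illustration."""
    t_ref_k = t_ref_c + 273.15
    t_actual_k = t_actual_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_ref_k - 1.0 / t_actual_k)
    return math.exp(exponent)


# Hypothetical comparison: component rated at 40 degrees C but running at 50.
print(f"acceleration factor: {acceleration_factor(40.0, 50.0):.2f}")
```

A factor greater than 1 indicates that the mechanism proceeds faster at the higher temperature, so the corresponding life would be shortened by roughly that factor under the model's assumptions.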
There is a tradition in some industries for purchase specifications to require MTBF or MTTF data. As explained above, planning the availability of large or critical installations or networks requires a maintenance plan which takes account of expected failure rates. The traditional “calculation” technique for electronic equipment uses a large database of failure rates for commonly used electronic components, together with stress factors which account for the influence of relevant stresses such as temperature and voltage. The database is compiled from industry analyses of failed equipment.
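In outline, the calculation sums the failure rate of every component, each weighted by its stress factor, and inverts the total. The sketch below shows that structure only: the part types, quantities, base failure rates and stress factors are all invented for illustration and are not taken from any real database.

```python
# Each entry: (description, quantity, base failure rate in FIT, stress factor).
# 1 FIT = 1 failure per 1e9 device-hours. All values are invented.
PARTS = [
    ("resistor", 200, 0.5, 1.0),
    ("ceramic capacitor", 150, 1.0, 1.2),
    ("electrolytic capacitor", 6, 20.0, 2.0),
    ("power semiconductor", 6, 50.0, 1.5),
    ("integrated circuit", 12, 10.0, 1.1),
]


def total_failure_rate_fit(parts) -> float:
    """Sum of quantity x base rate x stress factor over all part types."""
    return sum(qty * base * stress for _, qty, base, stress in parts)


def calculated_mtbf_hours(parts) -> float:
    """Calculated MTBF in hours, assuming every part has a constant failure
    rate and any single part failure counts as an equipment failure."""
    return 1e9 / total_failure_rate_fit(parts)


print(f"total rate: {total_failure_rate_fit(PARTS):.0f} FIT")
print(f"calculated MTBF: {calculated_mtbf_hours(PARTS):.0f} h")

# Doubling every quantity halves the calculated figure.
doubled = [(name, 2 * qty, base, stress) for name, qty, base, stress in PARTS]
print(f"with twice the parts: {calculated_mtbf_hours(doubled):.0f} h")
```

Because the result is simply the inverse of a sum over parts, the figure scales directly with the number of components counted, whatever their actual quality.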
The MTBF calculation produces a figure which is sensitive to parts count. The databases are very mature in their coverage of conventional components and common integrated circuits. They do not really address special-purpose large-scale ASICs, nor recently released specialist devices such as the advanced power semiconductors which are used in VSDs. The calculation can be helpful to the designer because it demands stress calculations for every component, which might occasionally reveal unexpectedly high stresses and therefore trigger a design improvement. However, the results are very far from reflecting reality.
For example, equipment using a small number of ASICs and highly integrated devices such as Intelligent Power Modules (IPMs), resulting in a low parts count, shows a superior calculated MTBF to equipment using large numbers of mature, simpler components and discrete power devices. In reality this difference is false. Either design could offer better reliability in practice, depending on the quality of the components and the design. The low parts count approach to product design can be very effective, as it has the obvious benefit of reducing the number of parts and solder joints, and also the number of manufacturing operations in the end product manufacture, any of which might fail. But it is risky because the ASICs are purpose-designed and complex, so they are difficult to test fully and are usually not proven in use, whilst the IPM restricts the freedom of the designer to adjust and control the working conditions of the critical power semiconductors. The MTBF data will not distinguish the truly reliable design.
The traditional calculated MTBF figure is really of very limited use, and has been discredited to some extent, as is illustrated by the USA military database having been discontinued. Control Techniques does not offer such data for its products.
A completely different approach to MTBF is for the manufacturer to track field failures by customer returns. This gives a very realistic picture of the overall quality of the product from the point of view of the customers’ experiences. Typically MTBFs derived from field failures are between 10 and 100 times longer than the “calculated” values.
Most reputable manufacturers track customer returns data closely and have targets for continuing improvement, as well as processes for detecting and reacting to any increase in the return rate. The actual return rates are commercially sensitive and manufacturers are understandably reluctant to reveal them. Control Techniques will supply long-term field failure rate data on special request.
Any equipment manufacturer has the experience that a certain proportion of customer-returned products are found to be working correctly when tested. These are known as No Fault Found (NFF) returns. If manufacturers track return rates in order to improve their manufacturing quality then they will ignore the NFFs because they are irrelevant to manufacturing. In terms of customer satisfaction and perceived reliability, however, the NFFs may be important. They mean that the product failed to meet customer expectations. Somewhere along the line there was a mismatch between the real requirements and the real capabilities.
Some failures occur because the customer has used the product wrongly, so that it has failed to operate correctly, or has even been damaged. Sometimes this is caused by simple human carelessness. Sometimes the working conditions have turned out to be different from what was expected and this could not reasonably have been foreseen. Those cases have to be filtered out when the manufacturer is working to improve the quality of its manufacturing process. However, the manufacturer always has to consider carefully whether the data and instructions were clear enough.
One example with variable speed drives is the small but persistent number of field failures caused by the installer connecting the mains supply to the output rather than the input. To a drive designer this is obviously a gross error which is most likely to result in major damage to the drive, and suggests incompetence. However if you consider an electrical installer working under time pressure, who is more familiar with simpler electrical devices like circuit breakers, contactors and motors, it might be more understandable. The manufacturer has to try to help the installer to avoid such an error. It is not possible to design a drive to be proof against this error without adding unacceptable cost, but at least it can be ensured that terminals are clearly labelled.
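The effect of separating returns by cause can be sketched as follows. The category names, the return counts and the accumulated unit-hours are all hypothetical; the point is only that the same set of returns gives different figures depending on which categories are counted.

```python
from collections import Counter


def field_mttf_hours(unit_hours_in_service: float, returns, counted_categories) -> float:
    """Field MTTF = accumulated unit-hours in service divided by the number
    of returns whose category is in `counted_categories`."""
    counts = Counter(returns)
    failures = sum(counts[category] for category in counted_categories)
    if failures == 0:
        raise ValueError("no returns in the counted categories")
    return unit_hours_in_service / failures


# Hypothetical return log: one category label per returned unit.
RETURNS = ["confirmed_fault"] * 40 + ["no_fault_found"] * 25 + ["misuse"] * 15
UNIT_HOURS = 80_000_000  # assumed accumulated operating hours of the installed base

# View used for manufacturing quality: confirmed faults only.
manufacturing_view = field_mttf_hours(UNIT_HOURS, RETURNS, {"confirmed_fault"})
# View closer to the customer's experience: every return counts.
customer_view = field_mttf_hours(
    UNIT_HOURS, RETURNS, {"confirmed_fault", "no_fault_found", "misuse"}
)
print(f"manufacturing view: {manufacturing_view:.0f} h, customer view: {customer_view:.0f} h")
```

With these invented numbers the manufacturing view is twice as favourable as the customer view, which shows why it matters to state which returns a quoted field figure includes.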
It is clear from the foregoing that it is important to specify whether given MTTF/MTBF data is derived by calculation or from field data, since the two cannot be compared. It also needs to be confirmed that the data refers to random in-service failures and not to life expectancy.
If the product is correctly selected and the working environment is as expected then the practical failure rate should be similar to the field failure rate data. If the failure rate turns out to be much worse than this then it is likely that some unexpected aspect of the operating conditions or environment is affecting the reliability. For a new application it may be quite difficult to anticipate the wide range of unexpected effects which might result in reduced reliability, and it is the user’s responsibility to understand as well as possible all aspects of the place and mode of use which might affect reliability. The manufacturer has to try to ensure that the required operating conditions for the drive are specified as clearly and comprehensively as possible, and also that they match the reasonably expected actual conditions for the intended application area.
There is a special field of applications where drive functions are safety-related, i.e. they must work correctly in order to ensure personnel safety. Examples are the Safe Torque Off function and the SI-Safety system integration module. The design uses special high-integrity hardware and (usually) software. The integrity level is defined by the Safety Integrity Level (SIL) or Performance Level (PL), which require calculations of the probability of failure of the safety function in the dangerous direction, caused by hardware faults. The failure data is expressed either as PFH (probability per hour of a failure of the function) or MTTFD (mean time to failure, in hours, in the dangerous direction). This data is calculated according to an approved protocol for the safety of control systems of machines and is not related to the reliability of the drive as discussed above.
Field failure MTTF data for current products is available on request from Technical Department in Newtown, UK. For the reasons explained above, the company does not generate calculated MTTF/MTBF data.