For every safety-critical system, FTT (Fault Tolerant Time) requirements are specified by the OEM. What we often notice is that SW teams do not have the complete picture of how these timings are fulfilled. They tend to think that if testing proves that the FTT is met, that is sufficient. Then, even though the Safety SW or HW of the System has not changed, FTT measurements from testing suddenly start to change noticeably. In some cases these measurements exceed the required FTT, and the team is left wondering how this happened. "We did not change the Safety SW. It looks like the non-safety part of the system is causing this change, but we do not have control over it," they say.
If you look at it from the ISO 26262 Part 6 V-model perspective, the left side of the V is the architecture and design, and the right side of the V is the verification and testing.
The argument of proving via testing that FTT was met belongs to the right side of the V. However, what we really need to focus on is the left side of the V, which is to understand how FTT is influenced by various factors in the design. With that understanding, one can not only gain control over FTT by controlling these factors, but also design better test cases for verifying FTT.
Let us begin by taking a simple FTT requirement as an example.
If a fault in safety output ‘X’ is detected, a safe state shall be reached within 2 seconds.
In this case, the Safety output is any safety-relevant output generated by the System. For example, it may be a digital port output, a CAN output, a sound pulse, an image or a software output generated by the System. Let’s assume that to activate the Safety output, some input conditions must be met, such as a digital or analog input having a certain value, the state of Ignition or a CAN signal value. Once these conditions are met, the SW and HW components/subsystems within the system process these conditions and trigger a data flow in the Safety path, all the way until the safety output is activated.
Figure 1: Safety output activation
Now, let’s assume that some components in this path are QM and may fail. Hence, the System monitors the actual safety output against the expected output and triggers a Safe state in case of a mismatch. A series of SW/HW components is involved in the monitoring path and the safe state path. The system can now be represented like this.
Figure 2: System view of a monitoring concept
In a timeline view, the FTT of this system will look like this:
Figure 3: Breaking FTT into smaller parts
The FTT timeline broadly has 3 parts in this case, bounded by the required FTTI:
Safety output activation time (Tact): This is the time from when the Safety input conditions are met until the Safety output is activated. Note that since we are considering the case of a fault, we assume that even though activation of the Safety output was attempted, the output is not actually active.
Fault detection time interval (FDTI): This is the time available to detect the fault once the output is triggered. In order not to mistakenly trigger a safe state for short transient faults, the fault shall be confirmed, or debounced, by checking several consecutive samples of the actual output.
Fault reaction time interval (FRTI): This is the time available to reach a safe state once the fault is confirmed.
Required Fault Tolerant Time Interval given by the OEM (FTTI): This is the maximum allowed fault tolerant time, beyond which a hazard will occur. Hence, the safe state must be reached within this time.
Tact + FDTI + FRTI < FTTI
Ideally, Tact and FRTI must be much less than FDTI so that the system has time to read a sufficient number of samples of the faulty output before it confirms the fault.
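The debouncing idea can be sketched as a small counter. This is only a minimal illustration, not a production monitor: the names `FaultDebouncer` and `confirmation_time_ms`, the threshold of 5 samples and the 100 ms monitoring period are all assumptions made for the example.

```python
class FaultDebouncer:
    """Confirms a fault only after N consecutive mismatching samples.

    Hypothetical helper for illustration; names and thresholds are assumptions.
    """

    def __init__(self, samples_to_confirm: int) -> None:
        if samples_to_confirm < 1:
            raise ValueError("samples_to_confirm must be >= 1")
        self.samples_to_confirm = samples_to_confirm
        self.mismatch_count = 0

    def update(self, expected: int, actual: int) -> bool:
        """Feed one monitoring sample; return True once the fault is confirmed."""
        if expected != actual:
            self.mismatch_count += 1
        else:
            # A matching sample resets the counter, so short transients are ignored.
            self.mismatch_count = 0
        return self.mismatch_count >= self.samples_to_confirm


def confirmation_time_ms(samples_to_confirm: int, monitor_period_ms: int) -> int:
    """Worst-case time to confirm a persistent fault, assuming the fault
    appears just after a sample was taken and ignoring task jitter."""
    return samples_to_confirm * monitor_period_ms


debouncer = FaultDebouncer(samples_to_confirm=5)
# Output expected ON (1): a 2-sample transient mismatch, one good sample,
# then a persistent fault.
actual_samples = [0, 0, 1, 0, 0, 0, 0, 0]
results = [debouncer.update(expected=1, actual=a) for a in actual_samples]
print(results)
print(confirmation_time_ms(5, 100), "ms")
```

In this sketch the transient at the start never confirms a fault, and the persistent fault is confirmed on its fifth consecutive sample. The number of samples multiplied by the monitoring period is what consumes the FDTI, which is why Tact and FRTI need to stay small.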
How do we realize this timing concept in our system? To understand that, let us map the SW/HW components of the System in figure 2 to the timeline in figure 3. This would look like this:
Figure 4: Mapping the FTT timeline with its smaller parts to the System component view in figure 2
You may begin to see that each of the timing components in the above picture, such as Tact, FDTI and FRTI, is influenced by various factors such as:
- The run time of SW components in this chain (e.g., TRuntime_SW_CMP1, TRuntime_SW_CMP2, TRuntime_SW_CMPn)
- The processing time required by HW (e.g., TADCprocessing, TGPIOprocessing)
- The latencies between the SW and HW components in this chain (Tlatency_SW_CMP_1_to_SW_CMP_2, Tlatency_SW_CMP_2_to_SW_CMP_3, etc.)
- The priorities of the tasks in which the SW components run
- How frequently these tasks run
- The priority, run time and frequency of interrupts if the activation or monitoring is implemented using interrupts
Figure 5: Factors influencing FTT
Ideally, you can and should go deeper and break down Tact, FDTI and FRTI as a sum of these 6 factors.
However, some of these factors may be insignificant compared to others. For example, let’s say the Tact is 20 ms, and 2 of the components in this chain each have a run time of 3 ms and a latency of 3 ms, but the time required for HW to do the activation is only 1 µs. In this case, the HW activation time is negligible for achieving the overall timing requirement.
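As a rough sketch of such a breakdown, the contributions can be listed and summed, which also shows which ones matter. The numbers are the ones from the example above, with the 20 ms Tact treated as the budget (an assumption about how to read the example). The names `Contribution` and `summarize`, and the label `Tlatency_SW_CMP_2_to_HW`, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contribution:
    name: str
    worst_case_us: int  # worst-case time in microseconds


def summarize(contributions, budget_us):
    """Return (total_us, slack_us, shares); shares maps name -> % of the budget."""
    total_us = sum(c.worst_case_us for c in contributions)
    shares = {c.name: 100.0 * c.worst_case_us / budget_us for c in contributions}
    return total_us, budget_us - total_us, shares


T_ACT_BUDGET_US = 20_000  # Tact of 20 ms from the example, used as the budget

path = [
    Contribution("TRuntime_SW_CMP1", 3_000),
    Contribution("Tlatency_SW_CMP_1_to_SW_CMP_2", 3_000),
    Contribution("TRuntime_SW_CMP2", 3_000),
    Contribution("Tlatency_SW_CMP_2_to_HW", 3_000),  # hypothetical label
    Contribution("TGPIOprocessing", 1),               # 1 us HW activation
]

total_us, slack_us, shares = summarize(path, T_ACT_BUDGET_US)
print(f"total = {total_us} us, slack = {slack_us} us")
for name, share in shares.items():
    print(f"{name:32s} {share:7.3f} % of budget")
```

Each SW run time and latency takes 15 % of the budget here, while the HW activation takes 0.005 %, so the analysis effort is better spent on the SW contributions.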
So, how do we design our system to consistently achieve FTT?
- The first step is to decide how many faulty samples must be checked in order to confirm the fault, i.e., to fix the FDTI. With this, you can work your way back to determine the maximum Tact and FRTI that you can allow. In some cases, Tact is also specified by the OEM.
- Design the ASIL SW Components in the Safety path such that they run at a sufficiently high priority and frequency.
- CPU performance should be analyzed in depth to understand where the CPU’s time is going. Based on this analysis, design improvements should be made in aspects such as process definition, methods of inter-task communication, priority of tasks/threads and priorities for inter-CPU communication (if relevant for that system), in order to minimize the latencies in the safety path.
- Worst case HW timings must be considered in the FTTI calculation. In cases where there is a HW output to be activated or a monitoring done by HW, there may be timing delays that might need to be provided or HW sequences that must be respected.
- Always design the system for a value lower than the FTTI required by the OEM, just in case of any unexpected behavior. For example, if the FTTI is 2000ms, design the system for 1700ms or 1800ms.
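The first and last steps above amount to simple budget arithmetic, sketched below. The FTTI of 2000 ms and the design target of 1800 ms come from the example in the text; the 10 samples and the 100 ms monitoring period are assumptions chosen only for illustration, and `remaining_budget_ms` is a hypothetical helper.

```python
def remaining_budget_ms(ftti_ms, design_margin_ms, samples_to_confirm, monitor_period_ms):
    """Work backward from the FDTI to the time left for Tact + FRTI.

    Returns (design_target_ms, fdti_ms, remaining_ms).
    """
    design_target_ms = ftti_ms - design_margin_ms
    # Assumption: FDTI is approximated as samples x monitoring period.
    fdti_ms = samples_to_confirm * monitor_period_ms
    remaining_ms = design_target_ms - fdti_ms
    if remaining_ms <= 0:
        raise ValueError("FDTI alone exceeds the design target; "
                         "reduce the sample count or the monitoring period")
    return design_target_ms, fdti_ms, remaining_ms


design_target_ms, fdti_ms, remaining_ms = remaining_budget_ms(
    ftti_ms=2000,            # FTTI required by the OEM (from the example)
    design_margin_ms=200,    # design for 1800 ms instead of 2000 ms
    samples_to_confirm=10,   # assumed debounce depth
    monitor_period_ms=100,   # assumed monitoring task period
)
print(f"design target = {design_target_ms} ms")
print(f"FDTI          = {fdti_ms} ms")
print(f"left for Tact + FRTI = {remaining_ms} ms")
```

With these assumed values, 800 ms remain to be shared between Tact and FRTI. If Tact is specified by the OEM, the remainder after subtracting it is the maximum FRTI.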
Verification of FTT
Now that we have discussed how to design our system for FTT, i.e., the left side of the V, let us complete the picture by going back to the right side of the V: verification of FTT.
We should look at FTT verification not just as a “requirement” verification, but as a design verification. We should verify whether FTT is achieved by measuring it in scenarios with the worst-case latencies and worst-case processing times.
The question we should ask is “What is the scenario in which the worst-case timing will happen?”
Going back to figure 5, one should ask this question for each of the factors that influence FTT: “When will this SW component have its worst-case run time?”, “When will this HW have its worst-case processing time?”, “When will this SW component have its worst-case latency?”, and then simulate such scenarios while measuring FTT.
Here are some examples:
- Triggering high-performance and/or hard-real-time scenarios and letting them run at full load.
- In case there is communication between two microcontrollers or two hardware ICs in the Safety path, creating full-load traffic on this path such that the transfer of the Safety data is delayed as much as possible.
- Simulating scenarios to create the highest possible interrupt traffic.
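One possible way to evaluate such measurements is to judge the worst scenario against both the design target and the required FTTI, rather than just checking a single nominal run. The sketch below assumes the 2000 ms FTTI and 1800 ms design target from the earlier example. The scenario names and measured values are made up purely for illustration and are not real measurement results; `evaluate_ftt_measurements` is a hypothetical helper.

```python
def evaluate_ftt_measurements(measurements_ms, design_target_ms, ftti_ms):
    """Judge the worst measured FTT across all load scenarios.

    Returns (worst_scenario, worst_ms, verdict).
    """
    worst_scenario, worst_ms = max(measurements_ms.items(), key=lambda kv: kv[1])
    if worst_ms >= ftti_ms:
        verdict = "FAIL"             # requirement violated
    elif worst_ms > design_target_ms:
        verdict = "MARGIN_CONSUMED"  # requirement met, but design margin is eaten up
    else:
        verdict = "PASS"
    return worst_scenario, worst_ms, verdict


# Illustrative, invented values (ms); one entry per worst-case scenario.
measurements_ms = {
    "nominal_load": 1510,
    "full_cpu_load": 1690,
    "inter_mcu_bus_flooded": 1760,
    "max_interrupt_traffic": 1840,
}

worst_scenario, worst_ms, verdict = evaluate_ftt_measurements(
    measurements_ms, design_target_ms=1800, ftti_ms=2000
)
print(worst_scenario, worst_ms, verdict)
```

A result that meets the FTTI but exceeds the design target is flagged separately here, because that is the early sign that a change in the non-safety part of the system is eroding the margin.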
Conclusion
We took one specific example of an FTT requirement to explain the underlying principle. The FTT requirement that you have in your system may be different. For example, you may have an FTT requirement to detect microcontroller faults such as memory or CPU faults, where there is no activation time Tact as in the above example. However, the fundamentals of how to design the System for FTTI are the same. The key message is: do not leave meeting FTT requirements to chance! “I tested the FTT time and it was within the requirement” is not a sufficient argument. FTT must be designed very consciously by diving deep into every SW/HW component in the safety path and considering their worst-case execution times and latencies. With a deep inspection at the design stage, it is possible to come up with worst-case test scenarios for FTTI, i.e., scenarios that create the worst-case latencies and worst-case run times, and to measure FTT under these conditions.