Fernando Fernandes dos Santos, computer scientist, working on computer fault tolerance

- Published on 11/01/22

Dr. Fernando Fernandes dos Santos, computer scientist, begins his project “Toward reliable deep neural network hardware for safety-critical applications” (TRELA) at Inria this January.

BIENVENÜE team: Hello Fernando, how did you become interested in your research area?

Dr Fernando Fernandes dos Santos: I became interested in my research topic during my Master’s degree, when I worked on computer reliability and fault tolerance. Errors may happen, and we have to ensure that they do not affect users and systems. Not all computer systems are equal. For example, if an error occurs during a video call, there are no big consequences, as you can simply restart it. But suppose this happens in a safety-critical system such as a plane or a self-driving car. In that case, it can cause an accident and harm or even kill people.

Of course, there are different types of computers, and therefore different characteristics and fault tolerance requirements. During my Ph.D., I focused on errors caused by ionizing radiation. Such errors occur when energetic particles hit the hardware, the physical parts of the computer, and in turn affect the software, the application that runs on that hardware to perform tasks.

What is the TRELA project?

The TRELA project will deal with DNNs – Deep Neural Networks – which are used to perform simple tasks humans do every day, such as voice recognition, object detection, and classification. I will work on the fault tolerance of DNNs to make them more reliable. For example, an error in the DNN responsible for human detection in a self-driving car can lead to a dramatic accident, which we want to avoid.

The originality of my project lies in its multilevel approach, which covers both hardware and software. How can we improve the system reliability at the software level, even if there is some issue with the hardware?

The coolest part of the project, in my view, is the radiation experiments I will lead. Their purpose is to reproduce error scenarios with radiation: I will hit hardware with a neutron beam and see how the software is impacted. This will be done at the ChipIR facility at the Rutherford Appleton Laboratory, UK, where I already did a summer internship during my Ph.D. I also did another summer internship at Los Alamos National Laboratory, USA. I may also do some experiments there this year if the situation improves. With COVID, things are a bit delayed, but I should be able to do the experiments this year.

Besides the reliability issue, how does your project go towards less energy-consuming technologies?

All fault tolerance techniques come with an overhead. It can take the form of more hardware or a longer application execution time. Either way, energy consumption increases. For example, in a safety-critical system, a very standard fault-tolerance technique is Duplication with Comparison. The software and the hardware of the system are duplicated. Both versions execute the same application. At the end of the computation, we compare the output of the two systems, and if there is a mismatch, we know that there is a fault, and we can make a decision based on that. It is a highly effective technique, but as you have noticed, it consumes twice the energy because the system is duplicated. So if we develop fault tolerance that is more efficient in terms of hardware and software, we can save energy in the end.
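
As an editorial illustration of the Duplication with Comparison idea described above, here is a minimal Python sketch. It is not code from the TRELA project: the function names (`duplicate_and_compare`, `checksum`, `faulty_checksum`) and the simulated bit flip are assumptions made purely for this example, and both copies run in software on the same machine rather than on duplicated hardware.

```python
from typing import Callable, List, Tuple


def duplicate_and_compare(
    primary: Callable[[List[int]], int],
    replica: Callable[[List[int]], int],
    data: List[int],
) -> Tuple[int, bool]:
    """Run both copies on the same input and report whether the outputs match."""
    out_primary = primary(data)
    out_replica = replica(data)  # second execution: this is the doubled cost
    return out_primary, out_primary == out_replica


def checksum(data: List[int]) -> int:
    """Hypothetical stand-in for the real application."""
    return sum(data)


def faulty_checksum(data: List[int]) -> int:
    """Same workload, with a simulated single bit flip in the result."""
    return sum(data) ^ (1 << 3)


# Fault-free run: both copies agree, so the output can be trusted.
result, ok = duplicate_and_compare(checksum, checksum, [1, 2, 3])

# Faulty run: the mismatch reveals that one copy was corrupted.
_, ok_faulty = duplicate_and_compare(checksum, faulty_checksum, [1, 2, 3])
```

Note that a mismatch only signals that a fault occurred; it does not say which copy is wrong. The sketch also makes the overhead visible: the workload is executed twice for every result.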

Why is it important for you to implement the project at Inria?

I have read several papers by Olivier Sentieys and Angeliki Kritikakou, now my supervisors, which I found particularly interesting. Their group has a strong background in computer architecture and hardware, while I have worked more on the software side. I thought we could collaborate well, so I scheduled a meeting with them. That was the starting point of the TRELA project.

What motivates you on a day-to-day basis?

I like to think that my research will someday be used to improve safety-critical system reliability. Of course, it will not be used alone, but with many other contributions from different researchers.

Do you have recommendations for a reader eager to learn more about computer fault tolerance?

I have two recommendations. The first is this video by Veritasium, which is very accessible to the general public and a good introduction to the topic.

The second video is from the lab where I did the experiments during my Ph.D. You can even see my Ph.D. advisor, Prof. Paolo Rech, and some of my work: I actually made the simulations of the pedestrian detection errors you can see in the video!

Thank you, Fernando!
