We introduce the data source and the experimental environment configuration. Then, we compare RFDDN with the state-of-the-art segmentation algorithms and perform ablation experiments. Finally, we analyze and discuss the experimental results.

### Data source and experimental environment

We use two groups of 3D retinal OCT data with the gold standard for experiments. The gold standard is annotated by professional physicians. The resolution of each 3D volume is $1024 \times 512 \times 128$; that is, each 3D volume consists of 128 2D slices, and the resolution of each 2D slice is $1024 \times 512$. The first group includes 43 3D retinal OCT volumes with the gold standard, containing 5504 2D slices in total, of which 2816 are training data, 1408 are validation data, and the rest are test data. The second group includes 40 3D retinal OCT volumes with the gold standard, containing 5120 2D slices in total, of which 2560 are training data, 1280 are validation data, and the rest are test data. There are four categories of regions in OCT fundus images, namely SRF, PED, retinal edema area (REA), and background. SRF accounts for about 0.7% of the total area, PED for about 0.03%, and REA for about 61%; the rest is background. The abbreviations that appear in this section are listed in Table 1.
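As a quick sanity check on these counts, the slice totals and the implied test-split sizes can be derived from the figures above (the function name is ours, purely for illustration):

```python
# Each 3D volume contributes 128 2D slices; the test split is the
# remainder after the stated training and validation splits.
def split_sizes(num_volumes, slices_per_volume, n_train, n_val):
    total = num_volumes * slices_per_volume
    return total, total - n_train - n_val

total1, test1 = split_sizes(43, 128, 2816, 1408)   # first group
total2, test2 = split_sizes(40, 128, 2560, 1280)   # second group
print(total1, test1)  # 5504 total slices, 1280 test slices
print(total2, test2)  # 5120 total slices, 1280 test slices
```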

The hardware environment of the experiments was a server with an Intel Xeon CPU, 16GB of memory, and an NVIDIA Tesla V100 PCIe GPU (11GB of video memory). The visualization, programming, simulation, testing, and numerical calculation of the experiments were implemented in Python 3.8.

### Evaluation of segmentation performance

We compare RFDDN with five state-of-the-art OCT image segmentation algorithms, namely U-Net^{16}, ReLayNet^{20}, CE-Net^{25}, MultiResUNet^{26}, and ISCLNet^{8}. We use DSC, the 95th-percentile Hausdorff distance (HD95), the average symmetric surface distance (ASD), sensitivity, and specificity as evaluation indicators. DSC is calculated as follows:

$$\begin{aligned} DSC=\frac{2 \times \vert P \cap T \vert }{\vert P \vert + \vert T \vert }, \end{aligned}$$

(16)

where $P$ represents the predicted segmentation result, $T$ represents the ground-truth segmentation result, $\vert \cdot \vert$ is the cardinality of a set, and $0 \le DSC \le 1$. The larger the DSC, the better the segmentation effect. HD95 is calculated as follows:
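To make Eq. (16) concrete, here is a minimal sketch for binary masks represented as sets of pixel coordinates (the set representation and function name are illustrative, not from the paper):

```python
def dsc(pred, truth):
    """Dice similarity coefficient, Eq. (16).

    pred, truth: sets of (row, col) tuples marking foreground pixels.
    Returns 1.0 for a perfect match and 0.0 for disjoint masks.
    """
    if not pred and not truth:
        return 1.0  # both masks empty: treat as a perfect match
    return 2 * len(pred & truth) / (len(pred) + len(truth))

P = {(0, 0), (0, 1), (1, 0)}   # predicted foreground pixels
T = {(0, 0), (0, 1), (1, 1)}   # ground-truth foreground pixels
print(dsc(P, T))  # 2*2/(3+3) = 0.666...
```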

$$\begin{aligned} HD95= & {} 95\% \times \max \left\{ d_{XY}, d_{YX}\right\} , \end{aligned}$$

(17)

$$\begin{aligned} d_{XY}= & {} \max _{x \in X}\left\{ \min _{y \in Y}\Vert x-y\Vert \right\} , \end{aligned}$$

(18)

$$\begin{aligned} d_{YX}= & {} \max _{y \in Y}\left\{ \min _{x \in X}\Vert y-x\Vert \right\} , \end{aligned}$$

(19)

where $X$ and $Y$ represent the point sets of the ground-truth and predicted segmentation results, respectively, and $\Vert \cdot \Vert$ is a distance function between two points. The smaller the HD95, the better the segmentation effect. ASD is calculated as follows:

$$\begin{aligned} ASD=\frac{\sum _{x \in X} \min _{y \in Y}\Vert x-y\Vert +\sum _{y \in Y} \min _{x \in X}\Vert y-x\Vert }{\vert X \vert + \vert Y \vert }. \end{aligned}$$

(20)
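The surface-distance measures of Eqs. (17)-(20) can be sketched in a few lines of plain Python over point sets of coordinate tuples (function names are ours; note that we follow the paper's definition of HD95 as 95% of the symmetric Hausdorff distance, whereas many toolkits instead take the 95th percentile of all nearest-neighbour distances):

```python
import math

def directed_max_min(X, Y):
    """d_XY of Eq. (18): max over x in X of the distance to its nearest y in Y."""
    return max(min(math.dist(x, y) for y in Y) for x in X)

def hd95(X, Y):
    """HD95 as defined in Eq. (17)."""
    return 0.95 * max(directed_max_min(X, Y), directed_max_min(Y, X))

def asd(X, Y):
    """Average symmetric surface distance, Eq. (20)."""
    s_xy = sum(min(math.dist(x, y) for y in Y) for x in X)
    s_yx = sum(min(math.dist(y, x) for x in X) for y in Y)
    return (s_xy + s_yx) / (len(X) + len(Y))

X = [(0, 0), (1, 0)]   # ground-truth surface points (toy example)
Y = [(0, 0), (0, 3)]   # predicted surface points (toy example)
print(hd95(X, Y))  # 0.95 * max(1, 3) = 2.85
print(asd(X, Y))   # (0 + 1 + 0 + 3) / 4 = 1.0
```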

The smaller the ASD, the better the segmentation effect. We use a mini-batch stochastic gradient descent optimizer with momentum to update the network parameters, with the momentum set to 0.9. The gradient threshold is 0.005, the L2 regularization parameter is 1e-4, the number of epochs is 20, the number of iterations is 3k, and the batch size is 16. The initial learning rate is 1e-3, and we reduce the learning rate to 0.1 times the current value after 1k and 2.5k iterations, respectively. The segmentation results of the six methods on the first group of data are shown in Table 2.
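The piecewise learning-rate schedule described above can be sketched as follows (a minimal illustration; the function name and iteration indexing are our assumptions, not from the paper):

```python
def learning_rate(iteration, base_lr=1e-3, drops=(1000, 2500), factor=0.1):
    """Piecewise-constant schedule: multiply the learning rate by `factor`
    after each iteration count in `drops` (1k and 2.5k of 3k total)."""
    lr = base_lr
    for drop in drops:
        if iteration >= drop:
            lr *= factor
    return lr

print(learning_rate(0))     # initial learning rate, 1e-3
print(learning_rate(1500))  # ~1e-4 after the first drop at 1k iterations
print(learning_rate(2999))  # ~1e-5 after the second drop at 2.5k iterations
```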

RFDDN is superior to the other five methods on all evaluation indicators, with a DSC, HD95, ASD, sensitivity, and specificity of 0.97, 0.63, 0.13, 0.99, and 0.92, respectively. ISCLNet obtains the second-best results, inferior only to RFDDN. The DSC, sensitivity, and specificity of RFDDN are 3.2%, 3.1%, and 5.7% higher than those of ISCLNet, respectively, and its HD95 and ASD are 7.4% and 23.5% lower. The segmentation results of RFDDN before and after feature discretization on the first group of data are shown in Table 3.

The DSC, sensitivity, and specificity of RFDDN are 2.1%, 2.1%, and 3.4% higher than those of UnFD-Net, respectively, and its HD95 and ASD are 3.1% and 13.3% lower. The segmentation results of the six methods on the second group of data are shown in Table 4.

RFDDN is superior to the other five methods on all evaluation indicators, with a DSC, HD95, ASD, sensitivity, and specificity of 0.95, 0.65, 0.16, 0.97, and 0.89, respectively. ISCLNet obtains the second-best results, inferior only to RFDDN. The DSC, sensitivity, and specificity of RFDDN are 3.3%, 2.1%, and 8.5% higher than those of ISCLNet, respectively, and its HD95 and ASD are 5.8% and 15.8% lower. The segmentation results of RFDDN before and after feature discretization on the second group of data are shown in Table 5.

The DSC, sensitivity, and specificity of RFDDN are 2.2%, 1%, and 6% higher than those of UnFD-Net, respectively, and its HD95 and ASD are 3% and 11.1% lower. The results show that feature discretization based on the rough fuzzy set can improve the segmentation precision of the deep neural network. The rough-fuzzy-set-based discretization results of the two groups of 3D retinal OCT data are shown in Fig. 6.

The number of breakpoints and the data inconsistency are the two major evaluation indicators for rough-fuzzy-set-based discretization; the smaller both are, the better the discretization effect. Each 3D retinal OCT volume consists of 128 2D slices, and the brightness value of each 2D slice is 8 bits (ranging from 0 to 255), giving 256 candidate breakpoints per slice. Thus, the initial number of breakpoints for each 3D retinal OCT volume is 32768. After feature discretization based on the rough fuzzy set, the average number of breakpoints in the first group is 8826 (a reduction of 73.1%), and that in the second group is 7520 (a reduction of 77.1%); the overall data scale decreases by 75.1%. The data inconsistency of both groups is 0. Therefore, the computational efficiency of the model is improved. The significance of DSC on the two groups of 3D retinal OCT data is shown in Fig. 7.
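The reduction figures quoted above follow directly from the breakpoint counts; a quick arithmetic check (variable and function names are ours):

```python
SLICES_PER_VOLUME = 128
BREAKPOINTS_PER_SLICE = 256   # 8-bit brightness values, 0..255

# Initial breakpoint count per 3D volume: 128 * 256 = 32768.
INITIAL = SLICES_PER_VOLUME * BREAKPOINTS_PER_SLICE

def reduction(after):
    """Percentage reduction from the initial breakpoint count."""
    return 100 * (INITIAL - after) / INITIAL

r1 = reduction(8826)   # first group
r2 = reduction(7520)   # second group
print(round(r1, 1), round(r2, 1), round((r1 + r2) / 2, 1))  # 73.1 77.1 75.1
```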

We use one-way ANOVA to analyze the significance of the DSC differences among the six methods on the two groups of 3D retinal OCT data, with a significance level threshold of 0.05. In the box plots, the lines represent the median and the 25th and 75th percentiles, and **** indicates $P < 0.0001$. The $P$ value is less than 0.05, which indicates a statistically significant difference in DSC among the six methods on the two groups of 3D retinal OCT data.

### Ablation experiment

The initial learning rate is an important hyperparameter of RFDDN. The segmentation results of RFDDN under different initial learning rates are shown in Table 6.

An initial learning rate that is too large causes the gradient to oscillate around the minimum, while one that is too small results in slow convergence. RFDDN achieves the best evaluation indicator values when the initial learning rate is 0.001, but the values under different initial learning rates differ little; therefore, RFDDN is not sensitive to the initial learning rate. We then conduct an ablation experiment with four models: (a) the baseline with only the encoder and decoder; (b) the model with only the deep supervision mechanism (DS); (c) the model with only the double attention block (DAB); (d) the model with only the attention refinement block (ARB). These models have the same pre-trained weights as RFDDN. The segmentation results of RFDDN and the four models are shown in Table 7.

The baseline has the worst DSC, HD95, and ASD, at 0.78, 1.31, and 0.5, respectively. Although DS can alleviate the negative impact of data imbalance, the performance of the model with only DS is still unsatisfactory; its DSC, HD95, and ASD are 0.81, 1.13, and 0.36, respectively. The introduction of the attention mechanism enables both the model with only DAB and the model with only ARB to obtain better segmentation results than the baseline and the model with only DS. The DSC of RFDDN is 7.8% better than that of the model with only ARB, and its HD95 and ASD are 17.1% and 53.6% lower. RFDDN has a strong ability to capture contextual information and therefore obtains the best segmentation results. The visual segmentation results of the ablation experiment are shown in Fig. 8.

The baseline has the worst segmentation effect. Owing to its weak ability to capture information, the model with only DS has poor segmentation ability and is prone to false segmentation. Although the segmentation performance of the models with only DAB or only ARB is improved, they still struggle to segment some regions with blurred boundaries. The baseline, DS, DAB, and ARB models produce artifacts in the areas surrounded by green lines and fail to clearly display some detailed characteristics of the slices in the areas surrounded by red lines. Compared with these four models, RFDDN generates smaller fuzzy regions and clearly has the best segmentation effect.

### Discussion

U-Net combines low-level detail information with high-level semantic information by concatenating feature maps from different levels to improve segmentation accuracy. However, U-Net is prone to overfitting during training because of its shallow layers and few parameters. Furthermore, its consecutive pooling and strided convolution operations lead to the loss of some spatial information. ReLayNet uses a contracting path of encoders to learn a hierarchy of contextual features, followed by an expansive path of decoders for semantic segmentation; however, the convolutional blocks employed by these encoders and decoders have a limited ability to capture important features. CE-Net adopts a pre-trained ResNet block in the feature encoder and integrates a dense atrous convolution block and residual multi-kernel pooling into a ResNet-modified U-Net structure to capture more high-level features and preserve more spatial information. Although CE-Net achieves better segmentation accuracy than U-Net, it still faces the problem that its convolutional blocks have a limited ability to capture important features. MultiResUNet uses Res paths to reconcile the two incompatible sets of features from the encoder and the decoder, and designs MultiRes blocks to augment U-Net with the ability of multi-resolution analysis; however, it still suffers from the loss of spatial information caused by consecutive pooling and strided convolution operations. ISCLNet learns the intra-slice fluid-background similarity and the fluid-retinal-layer dissimilarity within an OCT slice, and builds an inter-slice contrastive learning architecture to learn the similarity among adjacent OCT slices; however, it relies on complete OCT volumes that may be difficult to obtain in the clinic. In addition, none of the above methods has a special mechanism for dealing with noise and uncertain information. RFDDN introduces a deep supervised attention mechanism into the network and greatly reduces the negative impact of redundant information and noise through feature discretization based on the rough fuzzy set. Therefore, RFDDN can achieve higher segmentation accuracy.