Next: Combining visual and textual Up: Enhancement of Textual Images Previous: Classification using textual features


Classification using visual features

Definitions of global and local visual features

Figure 1: The five visual features for one image

Figure 2: Selection of the ROI

We chose the simplest possible visual features: the color histograms (red ($A_1$), blue ($A_2$) and green ($A_3$)), the brightness histogram ($A_4$) and the direction histogram ($A_5$)[*]. After normalisation, these histograms are used as visual vectors. To cope with variations in image scale, we extract the visual features both from the whole image (the ``global level'') and from four local regions. The segmentation approach we adopt, proposed in [16], performs a fast, unsupervised segmentation based on Canny edge detection [4]. The local Regions Of Interest (ROI) of four different orders are extracted automatically from the global image as follows. After the edge matrix of the global image has been computed, the first-order ROI is taken as the fixed-size rectangular window that contains the largest number of edges. The second-order ROI is then extracted in the same way from the edge matrix in which the edges belonging to the first ROI have been removed. The third- and fourth-order ROIs are obtained by iterating this procedure. For this experiment, the surface of each ROI is fixed at 25%[*] of the surface of the global image. The extraction of the first two ROIs is illustrated in figure 2.
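The iterative ROI selection described above can be sketched as follows. This is a minimal illustration, not the implementation of [16]: the function name `extract_rois`, its parameters, the choice of a window with the same aspect ratio as the image, and the use of an integral image to count edges are all assumptions of this sketch. The binary edge map is taken as given (for instance, the output of a Canny detector).

```python
import numpy as np

def extract_rois(edges, n_rois=4, area_fraction=0.25):
    """Return n_rois windows as (top, left, height, width) tuples.

    edges: 2-D array of 0/1 values (binary edge map).
    Each window covers about area_fraction of the image surface and keeps
    the image aspect ratio (an assumption of this sketch).
    """
    edges = np.asarray(edges, dtype=np.int64).copy()  # do not modify the input
    rows, cols = edges.shape
    scale = np.sqrt(area_fraction)
    h = max(1, int(round(rows * scale)))
    w = max(1, int(round(cols * scale)))

    rois = []
    for _ in range(n_rois):
        # Integral image: the edge count of any window costs four lookups.
        integral = np.zeros((rows + 1, cols + 1), dtype=np.int64)
        integral[1:, 1:] = edges.cumsum(axis=0).cumsum(axis=1)
        counts = (integral[h:, w:] - integral[:-h, w:]
                  - integral[h:, :-w] + integral[:-h, :-w])

        # Window with the largest number of edge pixels (first one on ties).
        top, left = np.unravel_index(np.argmax(counts), counts.shape)
        rois.append((int(top), int(left), h, w))

        # Remove the edges of the selected window before the next pass.
        edges[top:top + h, left:left + w] = 0
    return rois
```

Calling `extract_rois(edge_map)` with the default arguments yields four windows ordered from the first-order ROI to the fourth-order ROI; the feature histograms would then be computed on each window and on the whole image.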

Visual classification

Visual classification follows the same criterion as for textual vectors, simply using visual vectors instead of textual ones. Let DKL $_{A_j}(r_t,r_e)$ be the distance, for the visual feature $A_j$, between the ROI $r_t$ of an image $d_T$ of the test set and the ROI $r_e$ of an image $d_E$ of the reference set. We start by computing the distances between areas of interest of equal order; table 3 shows the results. In general, the global features give the lowest error rates, except for direction, where the first-order ROI performs better: area 1 contains the most edges and is therefore the most significant. For the green attribute, the good result obtained with ROI 2 is explained by an artifact of the data (one class contains more green than the others). Our assumption that the most descriptive local areas are those containing the most edges is nevertheless supported, since in most cases areas 1 and 2 have the lowest error rates among the local regions.
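As an illustration of a histogram distance of this kind, here is a minimal sketch that assumes DKL denotes the Kullback-Leibler divergence between two normalised histograms. The function name `kl_distance`, the direction of the divergence (test histogram first) and the smoothing constant `eps` are assumptions of this sketch, not details given in this section.

```python
import numpy as np

def kl_distance(p, q, eps=1e-10):
    """Kullback-Leibler divergence D(p || q) between two histograms.

    eps is a hypothetical smoothing term that avoids log(0) and division
    by zero when a bin is empty; both histograms are renormalised.
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))
```

The divergence is zero for identical histograms and grows as they differ. It is not symmetric, so the order of the arguments matters.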

Table 3: Error rates (ER in %) between the areas of interest of equal order
               DKL$(r1,r1)$  DKL$(r2,r2)$  DKL$(r3,r3)$  DKL$(r4,r4)$  DKL$(g,g)$
ER Red         81.17         79.21         81.17         82.35         73.33
ER Green       83.13         78.03         86.66         80.78         78.43
ER Blue        82.35         80.39         83.92         84.70         74.50
ER Brightness  80.39         81.17         81.56         83.52         76.40
ER Direction   79.60         81.56         80.00         84.31         85.49

Early fusion of visual features: reduction of computation time

For a given visual attribute $A$, each image has 5 histograms (r1, r2, r3, r4 and g, also denoted r5). For an image $d_T$ of the test set $B_{Test}$ and an image $d_E$ of the reference set $B_{Ex}$, there are $5\times 5$ distances. If one considers only the first $L$ ROIs, with $L \in [1,5]$, there are $L \times L$ distances between the areas of the two images. To reduce the complexity of the system, we define a distance between the visual features of two images that keeps only the best scores and requires few computations. Let $\mathrm{moymin_K}$ be the function that maps a set of values to the mean of its $K$ smallest elements:

\begin{displaymath}\mathrm{moymin_K}:\{\alpha_1,\alpha_2,\dots,\alpha_M\}\to (\alpha_{min1}+\alpha_{min2}+\dots+\alpha_{minK})/K.\end{displaymath}

To compute the visual distance between an image $d_{T}$ of $B_{Test}$ and an image $d_{E}$ of $B_{Ex}$, we compute the $L^2$ possible distances and average the $N$ smallest values ($N \in [1, L^2]$). For each pair of images we obtain the distance:

\begin{displaymath}\gamma_A(d_T,d_E)=\mathrm{moymin_N}(\{DKL_A(i,j);\forall i,j\in \{1,\dots,L\}\}).\end{displaymath}

Now, considering the distances between an image $d_T$ of $B_{Test}$ and all the images of a class $C_k$ of $B_{Ex}$, the final distance between $d_T$ and $C_k$ is obtained by averaging only the $I$ smallest distances:

\begin{displaymath}\delta_A(d_T,C_k)=\mathrm{moymin_I}(\{\gamma_A(d_T,d_{E_k});\forall d_{E_k} \in C_k \})\end{displaymath}

where $d_{E_k}$ is an element of the class $C_k$ of the base of examples and $I \in [1,\mathrm{card(C_k)}]$ is the number of minimal values kept among the $\mathrm{card(C_k)}$ distances. As before, the class assigned to $d_{T}$ for feature $A$ is the class $C_k$ that minimizes $\delta_A(d_T,C_k)$.

This method discards the largest distances, which would penalize the system, and keeps the smallest ones, which increases the probability of assigning the correct class.
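The early-fusion distances defined above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names `moymin`, `gamma`, `delta` and `classify` and the data layout (a matrix of ROI-to-ROI distances per image pair, and a dictionary of image-to-image distances per class) are hypothetical, and clipping $I$ to the class size is a choice made here to respect $I \in [1,\mathrm{card}(C_k)]$.

```python
import numpy as np

def moymin(values, k):
    """Mean of the k smallest elements of values."""
    v = np.sort(np.asarray(values, dtype=float).ravel())
    if not 1 <= k <= v.size:
        raise ValueError("k must lie between 1 and the number of values")
    return float(v[:k].mean())

def gamma(dist_matrix, L, N):
    """Image-to-image distance for one feature.

    dist_matrix[i, j] holds the distance between ROI i of the test image
    and ROI j of the reference image (the global image is the last ROI).
    Only the first L ROIs of each image are used, and the N smallest of
    the L*L distances are averaged.
    """
    return moymin(np.asarray(dist_matrix, dtype=float)[:L, :L], N)

def delta(gammas_for_class, I):
    """Image-to-class distance: mean of the I smallest gamma values."""
    # Assumption of this sketch: I is clipped to the class size.
    return moymin(gammas_for_class, min(I, len(gammas_for_class)))

def classify(gammas_by_class, I):
    """Return the label of the class with the smallest delta.

    gammas_by_class maps a class label to the list of gamma distances
    between the test image and every reference image of that class.
    """
    return min(gammas_by_class, key=lambda c: delta(gammas_by_class[c], I))
```

With $L=5$, `gamma` examines the 25 distances of one image pair; the parameters $N$ and $I$ then control how many of the smallest values are averaged at the image level and at the class level.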

Results of early fusion of visual features

Tables 4, 5 and 6 give the error rates obtained by early fusion for varying values of the parameters $N$, $I$ and $L$. Table 4 shows the influence of $N$ for the values of $I$ and $L$ that give the best results. $N$ has little influence for the Red, Blue, Green and Brightness attributes; for direction, on the other hand, the ER improves markedly as $N$ grows. Table 5 shows that it is better to check whether the test image is similar to several images of the same class rather than to only one. Lastly, table 6 shows that the first-order ROI alone ($L=1$) is not sufficient and that, as expected, the fourth-order ROI brings little information: for most features the ER for $L=4$ is no better than for $L=3$.

Table 4: Influence of the parameter $N$ on the Error Rates (ER in %) $(I=4, L=5)$
N              1      2      3      4      5      6      7      8
ER Red         71.76  72.54  72.54  73.72  76.47  77.64  77.64  76.07
ER Green       76.07  77.64  77.64  76.86  76.86  76.47  78.82  78.82
ER Blue        77.64  77.25  79.60  80.00  79.60  81.56  81.96  81.96
ER Brightness  77.64  79.21  77.64  77.64  79.21  79.21  78.82  78.03
ER Direction   83.52  80.39  80.39  80.00  79.21  78.82  78.43  76.86

Table 5: Influence of the parameter $I$ on the Error Rates (ER in %) $(L=5)$
I              1      2      3      4
ER Red         75.68  74.50  71.76  71.76
ER Green       79.60  78.03  76.86  76.07
ER Blue        78.03  77.64  78.03  77.25
ER Brightness  79.21  78.03  76.07  77.64
ER Direction   84.70  78.03  76.86  76.86

Table 6: Influence of the parameter $L$ on the Error Rates (ER in %) $(I=4)$
L                1      2      3      4      4+g
Dimension $L^2$  1      4      9      16     25
ER Red           81.17  78.82  76.07  76.07  71.76
ER Green         83.13  78.82  75.68  79.60  76.07
ER Blue          82.35  80.00  79.60  81.56  77.25
ER Brightness    80.39  79.60  78.03  77.64  77.64
ER Direction     79.60  78.03  76.07  76.47  76.86

For $L=5$ (4+g), the global features clearly improve the ER, except for the direction feature, as could be expected. Compared with the results of table 3, the ER falls by about 5% to 10% relative to the local features and improves by about 2% relative to the global ones. Moreover, early fusion reduces the computation time.

Figure 3: Error rate of the different systems for various values of the p factor: text only, and combinations of textual and various visual contents (see text for details).

Tollari Sabrina 2003-08-28