next up previous
Next: Discussion and Conclusion Up: Enhancement of Textual Images Previous: Classification using visual features

Combining visual and textual classifications

We now merge the textual and visual indices in order to improve the results obtained with textual classification alone. The two main fusion strategies are early and late fusion. The first is common in CBIR [17]; the second allows more freedom for adaptive weighting in a stochastic framework [7]. We choose late fusion in this study. For each image $d_T$ and each class $C_k$, we compute the textual distance $DKL(\vec{d_T}^*, \vec{C_k}^*)$ as explained in section 3. It is then normalized, and we estimate the probability of membership in class $C_k$ as:

\begin{displaymath}
P^t_{d_T}(C_k)=1-\frac{DKL(\vec{d_T}^*,\vec{C_k}^*)}{\sum_{k}DKL(\vec{d_T}^*,\vec{C_k}^*)}.
\end{displaymath}
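This normalization can be sketched as follows (a minimal NumPy sketch, not the authors' code; it assumes the per-class DKL distances are already computed, and the distance values below are hypothetical):

```python
import numpy as np

def membership_prob(distances):
    """Normalize per-class distances into membership scores:
    P(C_k) = 1 - d_k / sum_j d_j, so a smaller distance gives a
    higher score (note the scores sum to c - 1, not to 1)."""
    d = np.asarray(distances, dtype=float)
    return 1.0 - d / d.sum()

# Hypothetical DKL distances to three classes; the second class is closest.
p = membership_prob([0.4, 0.1, 0.5])
```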

We use the same formula for each of the 5 visual features $A$:

\begin{displaymath}
P^v_{d_T}(C_k\vert A)=1-\frac{\delta_A(d_T,C_k)}{\sum_{k}\delta_A(d_T,C_k)}.
\end{displaymath}

The combined posterior is then given by:

\begin{displaymath}
P^{v\vee t}_{d_T}(C_k)=\sum_{j=1}^{5}P^v_{d_T}(C_k\vert A_j) \times \omega'(A_j)+ P^t_{d_T}(C_k) \times \omega'(A_6)
\end{displaymath}

where $\omega'(A_j)=\frac{\omega(A_j)^p}{\sum_{i=1}^{6}\omega(A_i)^p}$, $\omega(A_j)=\frac{1-TE(j)}{\sum^6_{i=1}(1-TE(i))}$, and $TE(j)$ is the error rate (ER) obtained with $A_j$ alone. The parameter $p$ increases the contrast between the weights. The final class is given by:

\begin{displaymath}
C^{v\vee t}(d_T)=\mathrm{argmax}_{k\in\{1,2,\dots,c\}}P^{v\vee t}_{d_T}(C_k).
\end{displaymath}
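The weighting and decision steps above can be sketched as follows (a minimal NumPy sketch, not the authors' code; the error rates and posterior values below are hypothetical, with the textual feature placed last among the six):

```python
import numpy as np

def fusion_weights(error_rates, p=2):
    """omega(A_j) is proportional to 1 - TE(j); raising it to the power p
    sharpens the contrast between weights, which are then renormalized
    to sum to 1."""
    te = np.asarray(error_rates, dtype=float)
    omega = (1.0 - te) / (1.0 - te).sum()
    return omega ** p / (omega ** p).sum()

def fuse_and_classify(visual_probs, textual_prob, weights):
    """Weighted sum of the five visual posteriors (rows of visual_probs,
    shape (5, c)) and the textual posterior (shape (c,)); the textual
    weight is the last entry of weights. Returns the argmax class index."""
    v = np.asarray(visual_probs, dtype=float)
    t = np.asarray(textual_prob, dtype=float)
    w = np.asarray(weights, dtype=float)
    fused = w[:5] @ v + w[5] * t  # weighted sum of the six posteriors
    return int(fused.argmax())

# Hypothetical error rates: five visual features, textual last (13.72%).
w = fusion_weights([0.40, 0.35, 0.30, 0.45, 0.38, 0.1372], p=2)

# Three hypothetical classes: the visual posteriors favour class 1,
# while the textual posterior strongly favours class 0.
visual = [[0.5, 0.6, 0.4]] * 5
textual = [0.9, 0.3, 0.4]
winner = fuse_and_classify(visual, textual, w)
```

Because the textual feature has the lowest error rate, it receives the largest weight, and a large enough $p$ lets it dominate the decision.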

Figure 3 shows the results obtained by fusing the textual classification without thesaurus (ER 13.72%) with several visual classifications. The first result (T+Vis[Local]) uses only the best classifications from early fusion of the ROI indices ($L\in[1,4]$). The second (T+Vis[Global]) considers only classifications on the global indices. The third (T+Vis[Local+Global]) uses the best parameters of early fusion of the local and global indices ($L \in [1,5]$). The last (T+Vis[Dir+Global]) takes into account the global features for the red, green, blue and brightness attributes, together with the local direction calculated by DKL(r1,r1). The figure shows that our simple ROI function generally improves classification compared to Global for the same $p$. Naturally, all methods converge to the textual ER for $p>8$. Table 7 summarizes the improvement of the textual classification brought by the visual classification.

Table 7: Results of the late fusion of the visual and textual classifications (error rates in %)

Textual ER (without thesaurus)   Visuo-textual fusion ER   Gain
13.72                            6.27                      +54.3



Tollari Sabrina 2003-08-28