Notes on building a bag-of-words representation for image classification

Time: 2020-05-21

Recently I have been trying to implement bag-of-words image classification in MATLAB. I am recording some notes here and hope to exchange ideas with others working on the same thing.


1、 Filter responses

Pass the image through each filter in the filter bank and collect the responses.

Input:
     W × H × 3 image
     filter bank with N filters
Output:
     W × H × N × 3 matrix

This step is relatively simple; you can do it with convn.

In addition, I first save the result of each filter into a cell array, and finally use vertcat (vertical matrix concatenation) to merge the results into a W × H × N × 3 matrix.
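Here is a minimal MATLAB sketch of this step, assuming img is a W × H × 3 double image and filterBank is a cell array of N 2-D filters (I assemble the 4-D array with cat/permute rather than vertcat, but the idea is the same):

    % Apply every filter to every colour channel and stack the results
    [W, H, ~] = size(img);
    N = numel(filterBank);
    responses = cell(N, 1);
    for i = 1:N
        % convn with a 2-D kernel filters each of the 3 channels independently
        responses{i} = convn(img, filterBank{i}, 'same');   % W x H x 3
    end
    filterResponses = permute(cat(4, responses{:}), [1 2 4 3]);   % W x H x N x 3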


2、 Dictionary

This step builds the dictionary.

The dictionary is generated by running k-means on the filter responses of the pixels of a stack of images: each pixel, after passing through the N filters, contributes one feature vector (with my setup, 60 values per pixel).

Because there are far too many pixels to cluster all of them, a random subset of pixels is sampled from each image.

The output dictionary size is K × 3N, where K is the number of clusters the pixels are grouped into.

BTW: this step takes a long time. If your computer has multiple cores, using parfor makes it much faster!

I used 20 filters and K = 300 clusters, so the dictionary is 300 × 60. (In the end I sampled 200 random pixels per image and set K = 305; these numbers need to be tuned through repeated testing.)
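A rough sketch of the dictionary step, assuming trainImagePaths holds the training image paths and extractFilterResponses is the step-1 function (both names are my own placeholders); the sampling and k-means parameters follow the numbers above:

    alpha = 200;      % random pixels sampled per image
    K = 300;          % number of visual words
    pixelStack = [];  % will end up (alpha*T) x 3N
    for t = 1:numel(trainImagePaths)
        img = im2double(imread(trainImagePaths{t}));
        resp = extractFilterResponses(img, filterBank);   % W x H x N x 3
        [W, H, ~, ~] = size(resp);
        resp = reshape(resp, W*H, []);                    % one row per pixel, 3N columns
        idx = randperm(W*H, alpha);                       % random pixel sample
        pixelStack = [pixelStack; resp(idx, :)];          %#ok<AGROW>
    end
    % this is the slow part the parfor tip refers to; kmeans returns a K x 3N dictionary
    [~, dictionary] = kmeans(pixelStack, K, 'EmptyAction', 'drop');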


3、 Visual words

With the dictionary in hand, the next step is to turn every pixel of every image into a dictionary “word”.

In short, for each image, compare every row of its filter response matrix (each row is the response of one pixel, i.e. 60 columns, with W × H rows in total) against the dictionary. The dictionary row closest to a pixel is that pixel's visual word. Rather than storing the word vector itself, each pixel is represented by the dictionary row index (ID), so each image becomes a W × H wordmap matrix, which is saved as a .mat file.

The pdist2 function is used here.

D = pdist2(X,Y,distance)
    Each row of X or Y is one element; D(i,j) is the distance between row i of X and row j of Y.
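A hedged sketch of how each wordmap could be computed with pdist2 (it assumes filterResponses from step 1 and the K × 3N dictionary are already in the workspace):

    % Map every pixel's filter response to the ID of the nearest dictionary word
    [W, H, ~, ~] = size(filterResponses);
    resp = reshape(filterResponses, W*H, []);        % (W*H) x 3N, one row per pixel
    D = pdist2(resp, dictionary, 'euclidean');       % (W*H) x K distances
    [~, wordIDs] = min(D, [], 2);                    % nearest word per pixel
    wordMap = reshape(wordIDs, W, H);                % W x H wordmap
    save('wordMap.mat', 'wordMap');
    imagesc(wordMap);                                % quick visualisation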

imagesc(wordMap) can be used to visualize the visual-word representation of an image.

When I started, I got stuck on how to visualize this: how do I know which RGB values each visual word represents? If you are wondering the same thing, congratulations, you went down the same rabbit hole I did. In fact, there is no need to recover the RGB values behind each visual word; each word only needs its own color, so simply displaying each pixel's word ID directly is enough.


4、 Image features

How do we get a feature vector for each image? A histogram works here.

Having obtained the wordmap of each image in the previous step, count how many times each word appears in that image; the resulting histogram is the image's feature.

The histogram is very easy to compute. Input: the W × H wordmap matrix; output: a K × 1 vector. Remember to normalize it.
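A short sketch of the plain whole-image histogram, assuming the dictionary has K rows:

    % K x 1 normalised histogram of visual words in one wordmap
    K = size(dictionary, 1);
    h = histcounts(wordMap(:), 0.5:1:K+0.5)';   % counts of words 1..K
    h = h / sum(h);                             % normalise so image size doesn't matter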

★ However, a histogram over the whole image carries no spatial information, so we use Spatial Pyramid Matching (SPM). In short, each layer divides the image into n × n cells, a histogram is computed for each cell, and the cell histograms are concatenated, giving a K·(4^LayerNum − 1)/3 × 1 vector.

When concatenating, layers 0 and 1 are weighted by 2^(−L), and the remaining layers by 2^(l−L−1), where L is the index of the finest layer.

mat2cell can be used to split the wordmap directly. Note that the image size is not always divisible by 2 or 4, so I floor the cell sizes for the leading cells and give all the remaining pixels to the last cell.

When LayerNum = 3, i.e. the finest layer is L = 2, the resulting feature is 6300 × 1. (I ended up using LayerNum = 4.)
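Here is a rough sketch of the SPM feature using the weighting and the mat2cell splitting described above; the function name and the final normalisation are my own choices, not necessarily how the original assignment defines them:

    function hist = getImageFeaturesSPM(wordMap, K, layerNum)
        % Spatial Pyramid Matching feature for one wordmap
        L = layerNum - 1;                     % index of the finest layer
        hist = [];
        for l = 0:L
            nCells = 2^l;                     % layer l is an nCells x nCells grid
            [W, H] = size(wordMap);
            % floor the leading cell sizes; the last cell takes the remainder
            rows = [repmat(floor(W/nCells), 1, nCells-1), W - floor(W/nCells)*(nCells-1)];
            cols = [repmat(floor(H/nCells), 1, nCells-1), H - floor(H/nCells)*(nCells-1)];
            cells = mat2cell(wordMap, rows, cols);
            if l <= 1
                w = 2^(-L);                   % weight for layers 0 and 1
            else
                w = 2^(l - L - 1);            % weight for the finer layers
            end
            for c = 1:numel(cells)
                h = histcounts(cells{c}(:), 0.5:1:K+0.5)';
                hist = [hist; w * h];         %#ok<AGROW>
            end
        end
        hist = hist / sum(hist);              % normalise the concatenated vector
    end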


5、 Recognition system

Some tutorials write the distance code first at this point, but I found that before building the recognition system I couldn't really understand what the distance function's inputs and outputs should be, so I wrote the recognition system first.

The recognition system step produces a vision.mat file that includes:

        filterBank
        dictionary
        train_features
        train_labels

Among them, filterBank, dictionary and train_labels can be loaded directly from the data produced earlier; the only thing that actually needs to be computed here is train_features.

What is train_features? It is the matrix collecting the SPM histograms computed for each of the T training images. Its size is K·(4^LayerNum − 1)/3 × T, with one column per image.

It looks very simple: load each wordmap from its saved wordMap.mat file, compute the SPM histogram for each one, concatenate them all horizontally (horzcat), and it's done!
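A sketch of assembling train_features, assuming trainWordMapFiles lists the saved wordMap.mat files and train_labels already exists (both names are placeholders of mine):

    train_features = [];
    for t = 1:numel(trainWordMapFiles)
        data = load(trainWordMapFiles{t});                % each .mat holds a wordMap
        h = getImageFeaturesSPM(data.wordMap, K, layerNum);
        train_features = horzcat(train_features, h);      %#ok<AGROW>  one column per image
    end
    save('vision.mat', 'filterBank', 'dictionary', 'train_features', 'train_labels');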


6、 Distance

Now we can look back at distance. The distance function is used at test time to measure the gap between a test image's feature vector and each column of train_features.

I use histogram-intersection similarity, i.e. sum(min(h, train_features)). Of course that is just the idea; the actual code is not written exactly like that, I use a bsxfun operation.
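A minimal sketch of that similarity with bsxfun, where h is one test feature (M × 1) and train_features is M × T:

    % Histogram-intersection similarity against every training column
    sim = sum(bsxfun(@min, h, train_features), 1);   % 1 x T
    dists = 1 - sim;                                 % larger intersection = smaller distance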


7、 Testing

Select some test images, run them through the system, and compare the predicted class with each image's label. My accuracy is around 58%–64%.
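A sketch of the evaluation loop, reusing the sketches above (getVisualWords stands in for the step-3 wordmap computation; the test paths, labels and numeric label format are assumptions):

    correct = 0;
    for t = 1:numel(testImagePaths)
        img = im2double(imread(testImagePaths{t}));
        wordMap = getVisualWords(img, filterBank, dictionary);
        h = getImageFeaturesSPM(wordMap, K, layerNum);
        sim = sum(bsxfun(@min, h, train_features), 1);
        [~, nearest] = max(sim);                     % most similar training image
        if train_labels(nearest) == testLabels(t)    % assumes numeric class labels
            correct = correct + 1;
        end
    end
    accuracy = correct / numel(testImagePaths);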



I'm still trying to figure out how to improve the accuracy, and I would welcome any suggestions.