As seen in the previous post, deep nets read the correct letter with an accuracy of 99%.

Now let’s go further and extract precise position information about the license plate and its letters, as explained in the Faster R-CNN publication from Microsoft Research two weeks ago.

I will re-use the first 2 convolution layers to create a feature map, over which I will slide a 3x3 window on top of which two new nets will operate :

  • a box classification layer, giving the probability that a license plate is centered on this point

  • a box regression layer, giving the size of the box at that point

These two new nets share a common innerproduct layer “ip1-rpn”, followed by an innerproduct layer specific to each of them.

Instead of training these nets on the feature map, I will train the full net composed of the first 2 convolution layers and the two new nets, but with a learning rate set to 0 for the first 2 layers. At the end, the effective receptive field on the input image will be of size XxX.

During testing and deployment, the sliding ‘inner product net’ on a 3x3 window will be replaced with a simple ‘convolution net’ of kernel 3 sharing the same parameters.
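For instance, here is a sketch of what this equivalence could look like in prototxt, assuming “ip1-rpn” has 500 outputs and that the extracted rectangles are sized so that pool2 produces a 3x3 map (both are assumptions, only the layer names come from this post) :

# train net : inner product over the whole 3x3 pool2 output of an extracted rectangle
layer {
  name: "ip1-rpn"
  type: "InnerProduct"
  bottom: "pool2"
  top: "ip1-rpn"
  inner_product_param {
    num_output: 500   # assumed size
  }
}

# deploy net : the same weights, reshaped into a kernel-3 convolution sliding over the full feature map
layer {
  name: "ip1-rpn-conv"
  type: "Convolution"
  bottom: "pool2"
  top: "ip1-rpn-conv"
  convolution_param {
    num_output: 500
    kernel_size: 3
  }
}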

As a dataset, I labeled the letters on each image, from which I can easily extract the plate zone. The statistics of the license plates are :

              Average   Max   Min
Width         200       673   18
Height        40        127   15
Orientation   -         20    -26


So I will consider the output of the nets to predict the probability and regression of 5 anchors at 5 different scales / widths : 660x330 - 560x187 - 460x153 - 360x120 - 260x87 - 160x53 - 60x20.

Train net

Here are the different steps for the train net :

  1. re-use the previous net parameters for the shared layers, which keep the same names : conv1, pool1, conv2, pool2.

  2. fix their learning rate at 0 :

     param {
       lr_mult: 0
       decay_mult: 0
     }
    
  3. keep the dropout layer after the convolution layers

  4. rename the innerproduct layer “ip1” to “ip1-rpn” so that it is trained with new random weights, and replace the innerproduct layer “ip2” with 2 sibling layers (see the sketch after this list) :

    • “cls_score” with 5 x 2 outputs (the probability of being a plate or not, for each anchor)

    • “bbox_pred” with 5 x 4 outputs : t_x, t_y, t_w (width) and t_o (orientation)

  5. add a “SoftmaxWithLoss” layer on top of cls_score and a “SmoothL1Loss” layer on top of the bounding box regression layer.

  6. set up the data layer (detailed in the next section).
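A minimal sketch of what steps 4 and 5 could look like in the train prototxt, assuming 5 anchors as in the 5 x 2 / 5 x 4 counts above, and writing the two heads as innerproduct layers to be converted to convolutions later, like “ip1-rpn” ; the loss bottoms use the blob names produced by the data layer described in the next section :

# classification head : plate / not-plate score for each of the 5 anchors (names assumed)
layer {
  name: "cls_score"
  type: "InnerProduct"
  bottom: "ip1-rpn"
  top: "cls_score"
  inner_product_param {
    num_output: 10   # 5 anchors x 2 classes
  }
}
# regression head : t_x, t_y, t_w, t_o for each of the 5 anchors
layer {
  name: "bbox_pred"
  type: "InnerProduct"
  bottom: "ip1-rpn"
  top: "bbox_pred"
  inner_product_param {
    num_output: 20   # 5 anchors x 4 values
  }
}
# classification loss on the anchor labels
layer {
  name: "loss_cls"
  type: "SoftmaxWithLoss"
  bottom: "cls_score"
  bottom: "labels"
  top: "loss_cls"
}
# SmoothL1 loss on the box regression (layer type from the Fast R-CNN Caffe fork)
layer {
  name: "loss_bbox"
  type: "SmoothL1Loss"
  bottom: "bbox_pred"
  bottom: "bbox_targets"
  bottom: "bbox_loss_weights"
  top: "loss_bbox"
}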

Training set

I will feed the data layer with the extracted rectangles and, for each rectangle, the label and the rectangle coordinates x, y, w and o. Since repeated fields preserve their order, I can simply add new repeated fields to the caffe::Datum message format :

message Datum {
  optional int32 channels = 1;
  optional int32 height = 2;
  optional int32 width = 3;
  // the actual image data, in bytes
  optional bytes data = 4;
  optional int32 label = 5;
  // Optionally, the datum could also hold float data.
  repeated float float_data = 6;
  // If true data contains an encoded image that need to be decoded
  optional bool encoded = 7 [default = false];
  // ROI
  repeated int32 roi_x = 8;
  repeated int32 roi_y = 9;
  repeated int32 roi_w = 10;
  repeated int32 roi_h = 11;
  repeated int32 roi_label = 12;
}

Since optional is compatible with repeated in the protobuf format, I could also have changed the label field to repeated, but this would require more changes in the code.

With this configuration, I can use caffe::Datum either in the ‘old way’, without the ROI fields above, in a 1-ROI way, where I add one rectangle to each image, or in a multiple-ROIs-per-image way, where I add several rectangles to each image.
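For example, a Datum carrying two ROIs could look like this in protobuf text format (all values are made up for the illustration ; the i-th entry of each repeated field describes the i-th rectangle, and the image bytes in the data field are omitted here) :

channels: 1
height: 64
width: 200
label: 1
# first rectangle
roi_x: 12
roi_y: 20
roi_w: 150
roi_h: 40
roi_label: 1
# second rectangle
roi_x: 60
roi_y: 22
roi_w: 30
roi_h: 35
roi_label: 1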

The input layer will produce the corresponding new tops labels, bbox_targets and bbox_loss_weights (initialized to one) for the loss layers :

layer {
  name: "MyData"
  type: "ROIData"
  top: "data"
  top: "label"
  top: "labels"
  top: "bbox_targets"
  top: "bbox_loss_weights"
  include {
    phase: TRAIN  
  }
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "train_lmdb"
    batch_size: 64
    backend: LMDB
  }
}

During this first training, I do not use an ROI max pooling layer in the training net : I prefer not to feed images that are too big, and send the extracted rectangles rather than the full image.

Feature map net

Let’s train the new model with the previously learned parameters :

~/technologies/caffe/build/tools/caffe train --solver=lenet_train_test_position.prototxt -weights=lenet_iter_2000.caffemodel -gpu 0

Once trained, I convert the innerproduct layers into convolution layers to get a feature map, as seen in the previous post.
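As an illustration, the deploy net could then apply the two heads as 1x1 convolutions on top of the converted “ip1-rpn” (sketched earlier as a kernel-3 convolution) ; the “-conv” layer names and the output counts matching the 5-anchor assumption are mine :

# classification head applied convolutionally at every position of the feature map
layer {
  name: "cls_score-conv"
  type: "Convolution"
  bottom: "ip1-rpn-conv"
  top: "cls_score-conv"
  convolution_param {
    num_output: 10   # 5 anchors x 2 classes
    kernel_size: 1
  }
}
# regression head applied convolutionally at every position of the feature map
layer {
  name: "bbox_pred-conv"
  type: "Convolution"
  bottom: "ip1-rpn-conv"
  top: "bbox_pred-conv"
  convolution_param {
    num_output: 20   # 5 anchors x 4 values
    kernel_size: 1
  }
}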

Feature map

Test/deploy net

On top of the feature map layers, add an NMS layer and a Top-N layer, and an ROI pooling layer in place of the dropout layer :

layer {
  name: "roi_pool3"
  type: "ROIPooling"
  bottom: "pool2"
  bottom: "rois"
  top: "pool3"
  roi_pooling_param {
    pooled_w: 7
    pooled_h: 7
    spatial_scale: 0.0625 # 1/16
  }
}

Creating our own NMS and Top-N layers is the next step.
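Since these layers do not exist in stock Caffe, here is only a rough sketch of how they could be wired between the box outputs and roi_pool3 ; the layer types “NMS” and “TopN”, their ordering, and the blob names cls_prob and nms_boxes are hypothetical :

# keep the boxes surviving non-maximum suppression (custom layer to be written)
layer {
  name: "nms"
  type: "NMS"
  bottom: "cls_prob"
  bottom: "bbox_pred-conv"
  top: "nms_boxes"
}
# keep only the N best scoring boxes and output them as ROIs (custom layer to be written)
layer {
  name: "top_n"
  type: "TopN"
  bottom: "nms_boxes"
  top: "rois"
}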