Very Deep Convolutional Networks for Large-Scale Image Recognition (2014), K. Simonyan and A. Zisserman
arXiv:1409.1556v6 [cs.CV] 10 Apr 2015
Published as a conference paper at ICLR 2015

VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
Karen Simonyan* & Andrew Zisserman+
Visual Geometry Group, University of Oxford

[…] the best-performing submissions to the ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer. Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales (Sermanet et al., 2014; Howard, 2014). In this paper, we address another important aspect of ConvNet architecture design: its depth. To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3×3) convolution filters in all layers.

As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as a part of a relatively simple pipeline (e.g. deep features classified by a linear SVM without fine-tuning). We have released our two best-performing models¹ to facilitate further research.

The rest of the paper is organised as follows. In Sect. 2, we describe our ConvNet configurations. The details of the image classification training and evaluation are then presented in Sect. 3, and the configurations are compared on the ILSVRC classification task in Sect. 4. Sect. 5 concludes the paper. For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B. Finally, Appendix C contains the list of major paper revisions.

* current affiliation: Google DeepMind
+ current affiliation: University of Oxford and Google DeepMind
¹ http://www.robots.ox.ac.uk/~vgg/research/very_deep/

2 CONVNET CONFIGURATIONS

To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011); Krizhevsky et al. (2012). In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2). Our design choices are then discussed and compared to the prior art in Sect. 2.3.

2.1 ARCHITECTURE

During training, the input to our ConvNets is a fixed-size 224×224 RGB image. The only pre-processing we do is subtracting the mean RGB value, computed on the training set, from each pixel. The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3×3 (which is the smallest size to capture the notion of left/right, up/down, center). In one of the configurations we also utilise 1×1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3×3 conv. layers.
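The stride and padding arithmetic above can be sanity-checked with the standard output-size formula for a convolution or pooling window, floor((n + 2p - k) / s) + 1. A minimal sketch (the helper name is ours, not from the paper):

```python
def out_size(n, k, s=1, p=0):
    """Spatial output size for input width n, kernel k, stride s, padding p."""
    return (n + 2 * p - k) // s + 1

# A 3x3 conv with stride 1 and padding 1 preserves spatial resolution:
assert out_size(224, 3, 1, 1) == 224

# A 2x2 window with stride 2 (as used for max-pooling here) halves it:
assert out_size(224, 2, 2, 0) == 112

# Five such halvings take the 224x224 input down to a 7x7 map:
n = 224
for _ in range(5):
    n = out_size(n, 2, 2, 0)
print(n)  # 7
```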
Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2×2 pixel window, with stride 2.

A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.

All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity. We note that none of our networks (except for one) contain Local Response Normalisation (LRN) (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al., 2012).

2.2 CONFIGURATIONS

The ConvNet configurations, evaluated in this paper, are outlined in Table 1, one per column. In the following we will refer to the nets by their names (A–E). All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth: from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers). The width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.

In Table 2 we report the number of parameters for each configuration. In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).

2.3 DISCUSSION

Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014). Rather than using relatively large receptive fields in the first conv. layers (e.g. 11×11 with stride 4 in (Krizhevsky et al., 2012), or 7×7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3×3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5; three such layers have a 7×7 effective receptive field.
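The effective receptive field claim can be checked numerically: each additional stride-1 k×k layer grows the field by k − 1 pixels. A small sketch (our own helper, not from the paper):

```python
def effective_rf(num_layers, k=3):
    """Effective receptive field of num_layers stacked k x k convolutions
    with stride 1 and no pooling in between."""
    rf = 1
    for _ in range(num_layers):
        rf += k - 1  # each layer extends the field by k-1 pixels
    return rf

print(effective_rf(2))  # 5 -> two 3x3 layers see a 5x5 region
print(effective_rf(3))  # 7 -> three 3x3 layers see a 7x7 region
```

This is why a stack of three 3×3 layers can replace a single 7×7 layer while adding two extra non-linearities.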
Table 1: ConvNet configurations (shown in columns). The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as "conv<receptive field size>-<number of channels>". The ReLU activation function is not shown for brevity.

                              ConvNet Configuration
    A           A-LRN       B           C           D           E
    11 weight   11 weight   13 weight   16 weight   16 weight   19 weight
    layers      layers      layers      layers      layers      layers
    ------------------- input (224×224 RGB image) -------------------
    conv3-64    conv3-64    conv3-64    conv3-64    conv3-64    conv3-64
                LRN         conv3-64    conv3-64    conv3-64    conv3-64
    --------------------------- maxpool -----------------------------
    conv3-128   conv3-128   conv3-128   conv3-128   conv3-128   conv3-128
                            conv3-128   conv3-128   conv3-128   conv3-128
    --------------------------- maxpool -----------------------------
    conv3-256   conv3-256   conv3-256   conv3-256   conv3-256   conv3-256
    conv3-256   conv3-256   conv3-256   conv3-256   conv3-256   conv3-256
                                        conv1-256   conv3-256   conv3-256
                                                                conv3-256
    --------------------------- maxpool -----------------------------
    conv3-512   conv3-512   conv3-512   conv3-512   conv3-512   conv3-512
    conv3-512   conv3-512   conv3-512   conv3-512   conv3-512   conv3-512
                                        conv1-512   conv3-512   conv3-512
                                                                conv3-512
    --------------------------- maxpool -----------------------------
    conv3-512   conv3-512   conv3-512   conv3-512   conv3-512   conv3-512
    conv3-512   conv3-512   conv3-512   conv3-512   conv3-512   conv3-512
                                        conv1-512   conv3-512   conv3-512
                                                                conv3-512
    --------------------------- maxpool -----------------------------
                                FC-4096
                                FC-4096
                                FC-1000
                                soft-max
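Table 2 itself is not reproduced in this excerpt, but the per-configuration weight counts it reports can be recomputed from Table 1. Below is a sketch for configurations D and E; the layer lists are transcribed from Table 1, the counts include biases, and the totals match the commonly quoted sizes (138M weights for D, 144M for E). Treat this as our own reconstruction, not the authors' code:

```python
# Conv layers of configuration D as (in_channels, out_channels); all are 3x3.
D_CONVS = [(3, 64), (64, 64), (64, 128), (128, 128),
           (128, 256), (256, 256), (256, 256),
           (256, 512), (512, 512), (512, 512),
           (512, 512), (512, 512), (512, 512)]

# E adds one conv3-256 and two conv3-512 layers; for counting parameters
# the positions of the extra layers do not matter.
E_CONVS = D_CONVS + [(256, 256), (512, 512), (512, 512)]

# FC layers; the flattened input is 7x7x512 (224 halved by five max-pools).
FC = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]

def num_params(convs):
    """Total weights + biases for a configuration's conv and FC layers."""
    conv = sum(3 * 3 * cin * cout + cout for cin, cout in convs)
    fc = sum(cin * cout + cout for cin, cout in FC)
    return conv + fc

print(round(num_params(D_CONVS) / 1e6))  # 138  (16-layer net D)
print(round(num_params(E_CONVS) / 1e6))  # 144  (19-layer net E)
```

Note that the FC layers dominate: the first FC layer alone contributes roughly 103M of the total, which is why the deep stacks of 3×3 convolutions stay cheap in parameters.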