Christopher Bourez's blog
//christopher5106.github.io/
<h1>Understand batch matrix multiplication</h1>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">2</span><span class="p">))</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">2</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">7</span><span class="p">))</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
<p>returns</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ValueError: Dimensions must be equal, but are 2 and 3 for 'MatMul' (op: 'MatMul') with input shapes: [60,2], [3,70].
</code></pre></div></div>
<p>It looks like <script type="math/tex">60 = 3 \times 4 \times 5</script> and <script type="math/tex">70 = 2 \times 5 \times 7</script>: each tensor has been flattened into a matrix, and the error occurs because the last dimension of <code class="highlighter-rouge">a</code> (2) does not match the before-last dimension of <code class="highlighter-rouge">b</code> (3).</p>
<p>What’s happening?</p>
<h3 id="matrix-multiplication-when-tensors-are-matrices">Matrix multiplication when tensors are matrices</h3>
<p>The matrix multiplication is performed with <code class="highlighter-rouge">tf.matmul</code> in TensorFlow or <code class="highlighter-rouge">K.dot</code> in Keras:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">))</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">))</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
<p>or</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">))</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">))</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
<p>returns a tensor of shape (3,5) in both cases. Simple.</p>
<h3 id="keras-dot">Keras dot</h3>
<p>If I add a dimension:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">))</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">7</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">))</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
<p>returns a tensor of shape (2, 3, 7, 5).</p>
<p>The matrix multiplication is performed along the shared dimension of size 4, which is:</p>
<ul>
<li>
<p>the last dimension of the first tensor</p>
</li>
<li>
<p>the before-last dimension of the second tensor</p>
</li>
</ul>
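<p>The same contraction can be reproduced in plain NumPy (a sketch of the behavior, not the Keras implementation) with <code class="highlighter-rouge">np.tensordot</code>, contracting the last axis of the first tensor with the before-last axis of the second:</p>

```python
import numpy as np

a = np.ones((2, 3, 4))
b = np.ones((7, 4, 5))

# Contract the last axis of a (size 4) with the before-last axis of b (size 4)
c = np.tensordot(a, b, axes=([2], [1]))

print(c.shape)  # (2, 3, 7, 5)
```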
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span> <span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">8</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
<p>returns a tensor of size</p>
<ul>
<li><code class="highlighter-rouge">a.shape</code> minus last dimension => (1,2,3)</li>
</ul>
<p>concatenated with</p>
<ul>
<li><code class="highlighter-rouge">b.shape</code> minus the before last dimension => (8,7,5)</li>
</ul>
<p>hence: (1, 2, 3, 8, 7, 5)</p>
<p>where each value is given by the formula:</p>
<script type="math/tex; mode=display">c_{a,b,c,i,j,k} = \sum_r a_{a,b,c,r} b_{i,j, r, k}</script>
<p>Not very easy to visualize when ranks of tensors are above 2 :).</p>
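<p>The formula can be checked numerically in NumPy with <code class="highlighter-rouge">einsum</code> (a sketch, independent of Keras):</p>

```python
import numpy as np

a = np.random.rand(1, 2, 3, 4)
b = np.random.rand(8, 7, 4, 5)

# c[a,b,c,i,j,k] = sum_r a[a,b,c,r] * b[i,j,r,k]
c = np.einsum('abcr,ijrk->abcijk', a, b)
print(c.shape)  # (1, 2, 3, 8, 7, 5)

# Same result with tensordot: contract a's last axis with b's before-last axis
c2 = np.tensordot(a, b, axes=([3], [2]))
print(np.allclose(c, c2))  # True
```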
<p><strong>Note that this behavior is specific to Keras dot</strong>. It is a reproduction of Theano behavior.</p>
<p>In particular, it makes it possible to perform a kind of dot product:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">8</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
<p>returns a tensor of shape (1, 2, 8, 7, 5).</p>
<h3 id="batch-matrix-multiplication--tfmatmul-or-kbatch_dot">Batch Matrix Multiplication : tf.matmul or K.batch_dot</h3>
<p>There is another operator, <code class="highlighter-rouge">K.batch_dot</code>, that works the same way as <code class="highlighter-rouge">tf.matmul</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">9</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">9</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">batch_dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
<p>or</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">9</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">9</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
<p>returns a tensor of shape (9, 8, 7, 4, 5) in both cases.</p>
<script type="math/tex; mode=display">c_{a,b,c,i,j} = \sum_r a_{a,b,c,i,r} b_{a,b,c, r, j}</script>
<p>So here the multiplication has been performed treating (9, 8, 7) as the batch dimensions. These could be positions in an image (B, H, W), where for each position we’d like to multiply two matrices.</p>
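<p>NumPy’s <code class="highlighter-rouge">matmul</code> broadcasts over the leading batch dimensions in the same way; a sketch checking the formula with <code class="highlighter-rouge">einsum</code>:</p>

```python
import numpy as np

a = np.random.rand(9, 8, 7, 4, 2)
b = np.random.rand(9, 8, 7, 2, 5)

c = np.matmul(a, b)                      # matrix product on the last two axes
c2 = np.einsum('...ir,...rj->...ij', a, b)

print(c.shape)             # (9, 8, 7, 4, 5)
print(np.allclose(c, c2))  # True
```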
<p>In CNTK, the same operations will produce an array of dimensions (9, 8, 7, 4, 9, 8, 7, 5), which might not be desired. Here is the trick: in CNTK all operators can be batched, as soon as you declare that the first dimension is the batch dimension (dynamic axis) with <code class="highlighter-rouge">C.to_batch()</code>, and batch multiplication can be written this way:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="kn">import</span> <span class="nn">cntk</span> <span class="k">as</span> <span class="n">C</span>
<span class="k">def</span> <span class="nf">cntk_batch_dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
<span class="n">a_shape</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">int_shape</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="nb">list</span><span class="p">(</span><span class="n">a_shape</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">:]))</span>
<span class="n">b_shape</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">int_shape</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="nb">list</span><span class="p">(</span><span class="n">b_shape</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">:]))</span>
<span class="n">res</span> <span class="o">=</span> <span class="n">C</span><span class="o">.</span><span class="n">times</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="k">return</span> <span class="n">K</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">res</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="nb">list</span><span class="p">(</span><span class="n">a_shape</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="o">+</span> <span class="p">[</span><span class="n">b_shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]])</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">9</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">9</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">C</span><span class="o">.</span><span class="n">to_batch</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">C</span><span class="o">.</span><span class="n">to_batch</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">cntk_batch_dot</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
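<p>The reshape trick itself is framework-independent; here is a minimal NumPy sketch of the same idea (flatten the batch dimensions into one, multiply, restore the shape):</p>

```python
import numpy as np

def reshaped_batch_dot(a, b):
    # Flatten all leading batch dimensions into a single one,
    # perform the batched matrix product, then restore the batch shape.
    a_shape, b_shape = a.shape, b.shape
    a2 = a.reshape((-1,) + a_shape[-2:])   # (504, 4, 2)
    b2 = b.reshape((-1,) + b_shape[-2:])   # (504, 2, 5)
    res = np.matmul(a2, b2)                # (504, 4, 5)
    return res.reshape(a_shape[:-1] + (b_shape[-1],))

a = np.ones((9, 8, 7, 4, 2))
b = np.ones((9, 8, 7, 2, 5))
print(reshaped_batch_dot(a, b).shape)  # (9, 8, 7, 4, 5)
```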
<p><strong>Well done!</strong></p>
Sun, 28 Oct 2018 05:00:00 +0000
//christopher5106.github.io/deep/learning/2018/10/28/understand-batch-matrix-multiplication.html
<h1>Course 4: Encoder decoder architectures, generative networks and adversarial training!</h1>
<p>Under construction:</p>
<p>Encoder decoder architectures:
<img src="//christopher5106.github.io/img/encoderdecoder.png" /></p>
<p>Adversarial training:
<img src="//christopher5106.github.io/img/advtraining.png" /></p>
<p>The example of unsupervised translation:
<img src="//christopher5106.github.io/img/unsupervisedtranslation.png" /></p>
<p>The discriminator works as a trained loss:
<img src="//christopher5106.github.io/img/trainedloss.png" /></p>
Sun, 21 Oct 2018 10:00:00 +0000
//christopher5106.github.io/deep/learning/2018/10/21/course-four-encoder-decoder-architectures-generative-networks-and-adversarial-training.html
<h1>Course 3: natural language and deep learning!</h1>
<p>Here is my course of deep learning in 5 days only! You might first check <a href="//christopher5106.github.io/deep/learning/2018/10/20/course-zero-deep-learning.html">Course 0: deep learning!</a>, <a href="//christopher5106.github.io/deep/learning/2018/10/20/course-one-programming-deep-learning.html">Course 1: program deep learning!</a> and <a href="//christopher5106.github.io/deep/learning/2018/10/20/course-two-build-deep-learning-networks.html">course 2: build deep learning networks!</a> if you have not read them.</p>
<p>In this article, I develop techniques for natural language. As in computer vision, tasks include text classification (finding the sentiment, positive or negative, the language, …), segmentation (POS tagging, Named Entity extraction, …), translation, …</p>
<p>But, while computer vision deals with <strong>continuous</strong> (pixel values) and <strong>fixed-length</strong> inputs (images, usually resized or cropped to a fixed dimension), natural language consists of <strong>variable-length</strong> sequences (words, sentences, paragraphs, documents) of <strong>discrete</strong> inputs, either characters or words, belonging to a fixed-size dictionary (the alphabet or the vocabulary respectively), depending on whether we work at character level or word level.</p>
<p>There are two challenges to overcome:</p>
<ul>
<li>
<p>transforming discrete inputs into continuous representations or vectors</p>
</li>
<li>
<p>transforming variable-length sequences into a fixed-length representation</p>
</li>
</ul>
<h1 id="the-dictionary-of-symbols">The dictionary of symbols</h1>
<p>Texts are sequences of characters or words, depending on whether we work at character level or word level. It is possible to work at both levels and concatenate the representations at a higher level in the neural network. The <strong>dictionary</strong> is a fixed-size list of symbols found in the input data: at character level, we call it an alphabet, while at word level it is usually called a vocabulary. While the words incorporate a semantic meaning, the vocabulary has to be limited to a few tens of thousands of entries, so many out-of-vocabulary tokens cannot be represented.</p>
<p>There exists a better encoding for natural language, the <strong>Byte-Pair-Encoding (BPE)</strong>, a compression algorithm that iteratively replaces the most frequent pair of symbols in sequences by a new symbol: initially, the symbol dictionary is composed of all characters plus the ‘space’ or ‘end-of-word’ symbol, then the algorithm recursively counts each pair of symbols (without crossing word boundaries) and replaces the most frequent pair by a new symbol.</p>
<p>For example, if (‘T’, ‘H’) is the most frequent pair of symbols, we replace all instances of the pair by the new symbol ‘TH’. Later on, if (‘TH’, ‘E’) is the most frequent pair, we create a new symbol ‘THE’. The process stops when the desired target size for the dictionary has been achieved. At the end, the symbols composing the dictionary are essentially characters, bigrams, trigrams, as well as the most common words or word parts.</p>
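<p>To make the merge loop concrete, here is a toy sketch of a single BPE iteration in Python (assuming, for illustration, that words are pre-split into space-separated symbols with their corpus frequencies):</p>

```python
from collections import Counter

def bpe_merge_step(words):
    """One BPE iteration: find the most frequent adjacent pair and merge it."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    best = max(pairs, key=pairs.get)
    bigram = ' '.join(best)
    # Replace the winning pair by its concatenation in every word
    merged = {w.replace(bigram, ''.join(best)): f for w, f in words.items()}
    return best, merged

# Words as space-separated symbols, with '</w>' as the end-of-word symbol
words = {'t h e </w>': 5, 't h i s </w>': 2, 't h a t </w>': 3}
best, words = bpe_merge_step(words)
print(best)   # ('t', 'h') -- most frequent pair, creating the symbol 'th'
print(words)
```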
<p>The advantage of BPE is that it enables an <strong>open vocabulary</strong> through a fixed-size symbol dictionary, while still working at a coarser level than characters. In particular, names, compounds and loanwords which do not belong to a language’s word vocabulary can still be represented by BPE. A second advantage is that the BPE preprocessing works for all languages.</p>
<p>In translation, better results are achieved by joint BPE, encoding both the target and source languages with the same encoding dictionary. For languages using different alphabets, characters are transliterated from one alphabet to the other. This helps in particular to copy Named Entities which do not belong to a dictionary.</p>
<p>Last, it is possible to relax the greedy and deterministic symbol replacement of BPE, by using a <strong>unigram language model</strong> that assumes that each symbol is an unobserved latent variable of the sequence and occurs independently in the sequence. Given this assumption, the probability of a sequence of symbols <script type="math/tex">(x_1, ..., x_M)</script> is given by :</p>
<script type="math/tex; mode=display">P(x_1, ..., x_M) = \prod_{i=1}^M p(x_i)</script>
<p>and the probability of a sentence or text S to occur is given by the sum of the probabilities of each encoding:</p>
<script type="math/tex; mode=display">P(S) = \sum_{(x_1, ..., x_M)\,:\,x_1 \cdots x_M = S} P(x_1, ..., x_M)</script>
<p>So it is possible to compute a dictionary of the desired size that maximizes (locally) the likelihood by an iterative algorithm starting from a huge dictionary of the most frequent substrings, estimating the expectation as in the EM algorithm, and removing the subwords with the least impact on the likelihood. Also, multiple decodings of a text into a sequence of symbols are possible, and the model gives each of them a probability. At training time, it is possible to sample a decoding of the input given the symbol distribution. At inference, it is possible to compute the predictions using multiple decodings, and choose the most confident prediction. Such a technique, called <strong>subword regularization</strong>, augments the training data stochastically and improves accuracy and robustness in natural language tasks.</p>
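<p>As a toy illustration of the unigram model, assuming a hypothetical symbol dictionary with given unigram probabilities, one can enumerate every segmentation of a string into dictionary symbols and sum their probabilities:</p>

```python
def segmentations(s, vocab):
    """All ways to split s into symbols from the dictionary."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        if s[:i] in vocab:
            for rest in segmentations(s[i:], vocab):
                yield [s[:i]] + rest

# Hypothetical unigram probabilities p(x) for each symbol
p = {'h': 0.05, 'e': 0.05, 'l': 0.05, 'o': 0.05,
     'he': 0.1, 'll': 0.1, 'hell': 0.2, 'hello': 0.3}

total = 0.0
for seg in segmentations('hello', p):
    prob = 1.0
    for x in seg:
        prob *= p[x]       # P(x_1, ..., x_M) = product of p(x_i)
    total += prob
    print(seg, prob)
print(total)               # P(S): sum over all segmentations
```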
<p>Have a look at <a href="https://github.com/google/sentencepiece">SentencePiece</a>. Note that the ‘space’ character is treated as a symbol, and pretokenization of the sequences is not necessary.</p>
<h1 id="distributed-representations-of-the-symbols">Distributed representations of the symbols</h1>
<p>Now that we have a dictionary, each text block can be represented by a sequence of token ids. Such a representation is discrete and does not encode the semantic meaning of the tokens. In order to do so, we associate each token with a vector of dimension d to be learned. All tokens are represented by an embedding matrix</p>
<script type="math/tex; mode=display">W \in \mathbb{R}^{V \times d}</script>
<p>where V is the size of the dictionary.</p>
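<p>Concretely, the embedding layer is a lookup table: row i of W is the learned vector for token id i. A NumPy sketch with a random (untrained) matrix:</p>

```python
import numpy as np

V, d = 10000, 300            # dictionary size and embedding dimension
W = np.random.randn(V, d)    # embedding matrix, learned during training

token_ids = [42, 7, 42, 99]  # a text block as a sequence of token ids
vectors = W[token_ids]       # one d-dimensional vector per token
print(vectors.shape)         # (4, 300)
```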
<p>Two architectures were proposed:</p>
<ul>
<li>
<p>the <strong>Continuous Bag-of-Words (CBOW)</strong> to predict the current word based on the context, and</p>
</li>
<li>
<p>the <strong>Continuous Skip-gram</strong> to predict the context words based on the current word,</p>
</li>
</ul>
<p>with a simple feedforward model :</p>
<script type="math/tex; mode=display">\text{Softmax}( (\hat{W} \times X') \cdot (W \times X))</script>
<p>where X and X’ <script type="math/tex">\in \mathbb{R}^V</script> are the one-hot encoding vectors of the input and output words (with a 1 where the word occurs in the input and output respectively), and W and <script type="math/tex">\hat{W}</script> are the input and output embedding matrices.</p>
<p><img src="//christopher5106.github.io/img/Learning-architecture-of-the-CBOW-and-Skip-gram-models-of-Word2vec-Mikolov-et-al.png" /></p>
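<p>As a sketch of the formula above, the CBOW forward pass for a single context word reduces to a row lookup followed by a softmax (toy dimensions and random weights, for illustration only):</p>

```python
import numpy as np

# Toy sizes: vocabulary of 6 words, embeddings of dimension 3
V, d = 6, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))       # input embedding matrix
W_hat = rng.normal(size=(V, d))   # output embedding matrix

context_id = 2                    # one-hot X reduces W x X to a row lookup
h = W[context_id]
scores = W_hat @ h                # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()   # softmax
print(probs.shape)                # (6,)
```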
<p>One difficulty is the size of the output, equal to the size of the dictionary. Several solutions have been proposed: hierarchical softmax with Huffman binary trees, avoiding normalization during training, or stochastic negative mining (Noise Contrastive Estimation and Negative Sampling with a sampling distribution avoiding frequent words).</p>
<p>Once trained, such weights W can be refined on high level tasks such as similarity prediction, translation, etc. They can be found under the names <strong>word2vec</strong>, <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a>, Paragram-SL999, …, and used to initialize the first layer of neural networks. In any application, they can be either fixed or trained further.</p>
<p>The next step is to address representations for blocks of text, i.e. sentences, paragraphs or documents. Before that, I first present new layers initially designed for natural language.</p>
<h1 id="recurrent-neural-networks-rnn">Recurrent neural networks (RNN)</h1>
<p>A recurrent network can be defined as a feedforward (non recurrent) network with two inputs, the input at time step t and previous hidden state <script type="math/tex">h_{t-1}</script>, to produce a new state at time t, <script type="math/tex">h_t</script>.</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL21.png" /></p>
<p>Recurrent models have been used in the past to transform a sequence of vectors (the learned word representations seen in the previous section) into one vector, reducing a variable-length representation to a fixed-length representation. Now they are outdated and have been replaced by Transformers.</p>
<p>It is very frequent to find a BiLSTM, where an LSTM (a type of RNN) is applied twice to the sequence, in natural and reversed order, and both outputs are concatenated.</p>
<h1 id="attention-self-attention-and-transformer">Attention, self-attention and Transformer</h1>
<p>The <strong>attention mechanism</strong> takes as input a query and an input sequence transformed into keys and values. It consists of finding the relevant entries in the input sequence, given a <em>query</em> represented by a (512,) vector, as presented in the following figure:
<img src="//christopher5106.github.io/img/attention.png" /></p>
<p>The query is compared with <em>keys</em>, a matrix of dimension (T, 512) for 1 sample, or (B, T, 512) for a batch of samples, and the dot product is used as comparison operator. Be careful to normalize output of the scalar product by <script type="math/tex">\frac{1}{\sqrt{d}}</script>, where d is the dimensionality of the data, in order to keep variance 1.</p>
<p>The result is passed through a softmax layer in order to get a probability <script type="math/tex">p_t</script> for each of the T positions in the sequence.
The <em>values</em>, another matrix of dimension (T, 512) or (B, T, 512), are used to compute the final result, of dimension (512,):</p>
<script type="math/tex; mode=display">r = \sum_{t=0}^T p_t v_t</script>
<p><strong>Self-attention</strong> is when each position of the input sequence is used as a query. The output is of shape (T, 512), the same length as the input, T.
<img src="//christopher5106.github.io/img/selfattention.png" />
Such a mechanism enables to take into account each word of the context to compute the representation of a word.</p>
<p>The <strong>Transformer</strong> is an architecture based on self-attention and position-wise feed-forward layers (which can be seen as 1×1 convolutions), described <a href="https://arxiv.org/abs/1706.03762">here</a>.</p>
<p>For classification tasks, such as <em>sentiment analysis</em>, the output at the start token (or first token) is taken to represent the sequence and be further processed by a simple classifier. For segmentation tasks, such as question answering (QA) or named entity recognition (NER), all other outputs are used as input to the classifier:
<img src="//christopher5106.github.io/img/nlptasks.png" /></p>
<p>In translation, both encoder and decoder incorporate a Transformer:
<img src="//christopher5106.github.io/img/translationtransformer.png" />
<img src="//christopher5106.github.io/img/translationtransformer2.png" /></p>
<p>Self-attention enables computational efficiency and reduces the depth that contributes to the vanishing gradient problem:
<img src="//christopher5106.github.io/img/transformerefficiency.png" /></p>
<h1 id="sentences-paragraphs-or-documents">Sentences, paragraphs or documents</h1>
<p>There is a long history of attempts to represent blocks of text.</p>
<p>The first idea, the simplest one, was to compute the average of the word representations. First results were convincing, and were improved by the introduction of a weighted average, reweighting each word vector by the <em>smooth inverse frequency</em> <script type="math/tex">\frac{a}{a+p(w)}</script>, where <script type="math/tex">p(w)</script> is the word’s corpus frequency, so that very frequent words, stop words, are less important than rare words.</p>
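<p>A sketch of this weighted average, assuming hypothetical word vectors and corpus frequencies (the values below are made up for illustration):</p>

```python
import numpy as np

d, a = 4, 1e-3   # embedding dimension and smoothing parameter
# Hypothetical word vectors and corpus frequencies
vectors = {'the': np.ones(d), 'cat': 2 * np.ones(d), 'sat': 3 * np.ones(d)}
freq = {'the': 5e-2, 'cat': 1e-4, 'sat': 2e-4}

sentence = ['the', 'cat', 'sat']
weights = [a / (a + freq[w]) for w in sentence]   # smooth inverse frequency
emb = sum(w * vectors[t] for w, t in zip(weights, sentence)) / len(sentence)
print(emb.shape)   # (4,)
# The frequent word 'the' gets a weight near 0.02, rare words near 0.9
```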
<p>Other techniques include the computation of a <em>paragraph vector</em>, which can be considered a special embedding, as if one special word had been added to each sentence to supplement the word embeddings and account for the semantic information in the paragraph. While the initial idea required lots of gradient descent steps during inference, newer techniques, such as <a href="https://arxiv.org/abs/1708.00107">CoVe</a>, are computed directly by a forward pass of a two-layer BiLSTM on the sentence. The output of the trained BiLSTM at each word position is concatenated with the classical word representations (GloVe for example).</p>
<p>While the CoVe BiLSTM is trained on a translation task, <a href="https://arxiv.org/abs/1802.05365">ELMo</a>’s stack of BiLSTMs is trained on a monolingual language model (LM) task, and the output of each BiLSTM layer is included in the concatenation with the GloVe word representations.</p>
<p>The current state of the art for deep contextualized representations is <a href="https://arxiv.org/abs/1810.04805">BERT</a>. Instead of a BiLSTM, it uses the state-of-the-art Transformer, and masks the words to predict at the output. This way, the Transformer uses all other words to perform the predictions, while an LSTM could only use the previous ones (or the next ones in the reverse order). In addition, BERT is trained on pairs of sentences, either consecutive in the text or randomly chosen, to also predict whether sentence B follows sentence A. Last, a <a href="https://github.com/google-research/bert/blob/master/multilingual.md">multilingual BERT model</a> trained on 104 languages is just less than 3% off the single-language model.</p>
<p><img src="//christopher5106.github.io/img/bert-Arch-2-300x271.png" /></p>
<p>By simply adding a linear layer on top of BERT’s output, state-of-the-art results have been achieved on the SQuAD question answering dataset, the SWAG dataset, and the GLUE benchmarks, demonstrating that these embeddings capture well the semantics of blocks of text.</p>
<p>BERT has a <a href="https://github.com/facebookresearch/XLM">cross-lingual version</a>, trained with translation parallel data to build representations independent of language:
<img src="//christopher5106.github.io/img/bertxlm.png" /></p>
<p>The training of BERT and XLM embedding is performed with one of the following losses:</p>
<ul>
<li>the causal language model (CLM): predicting the next word in the sentence</li>
<li>the masked language model (MLM): predicting the randomly masked or replaced words.</li>
<li>the translation language model (TLM): like an MLM, but applied to a sequence consisting of 2 sentences in 2 different languages that are translations of each other, with reset position embeddings, so that the model can use both sentences indifferently to predict the masked or replaced words, leading to aligned representations across languages.</li>
</ul>
<p><img src="//christopher5106.github.io/img/embeddingtraining.png" /></p>
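<p>The masked language model corruption can be sketched in a few lines (a minimal illustration of the 80/10/10 replacement scheme described in the BERT paper; the toy vocabulary and token names are purely illustrative, not BERT's actual WordPiece machinery):</p>

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]  # toy vocabulary

def mlm_corrupt(tokens, rng, mask_prob=0.15):
    """Return (corrupted tokens, labels). Labels keep the original token at
    corrupted positions and None elsewhere: only those positions are scored
    by the MLM loss."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            r = rng.random()
            if r < 0.8:                      # 80%: replace with [MASK]
                corrupted.append(MASK)
            elif r < 0.9:                    # 10%: replace with a random token
                corrupted.append(rng.choice(VOCAB))
            else:                            # 10%: keep the original token
                corrupted.append(tok)
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"] * 50
corrupted, labels = mlm_corrupt(tokens, random.Random(0))
```

The model then has to reconstruct the original token at every position where the label is not None, using the full bidirectional context.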
<h1 id="metrics-for-translation">Metrics for translation</h1>
<p>Models are trained to maximize the log likelihood, but this does not give any idea of the final quality of the model. Since human evaluation is costly and slow, metrics have been developed that correlate well with human judgments on a given set of reference translations, for the task of translation for example.</p>
<p>ChrF3, one of them and the simplest, is the <a href="//christopher5106.github.io/deep/learning/2018/10/20/course-one-programming-deep-learning.html#training-curves-and-metrics">F3-score</a> on character n-grams. The best correlations are obtained with <script type="math/tex">n=6</script>. ChrF3 has a recall bias.</p>
<p>Another metric, BLEU, has a precision bias. It is the precision on word n-grams, where the count of each n-gram match is clipped so as not to exceed the number of occurrences of that n-gram in the reference translation, in order to penalize translation systems that generate the same word multiple times. The metric can be case-sensitive, to take Named Entities into account for example.</p>
<p>ROUGE-n is the recall equivalent metric.</p>
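<p>The clipped-precision idea can be illustrated on word unigrams (a toy sketch only: the full BLEU also combines several n-gram orders with a geometric mean and a brevity penalty):</p>

```python
from collections import Counter

def clipped_precision(candidate, reference, n=1):
    """Precision of candidate n-grams, with each n-gram count clipped
    to its count in the reference (penalizes repeated words)."""
    ngrams = lambda toks: Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    matched = sum(min(count, ref[g]) for g, count in cand.items())
    return matched / max(1, sum(cand.values()))

# A degenerate candidate repeating one word: "the" occurs twice in the
# reference, so only 2 of the 4 candidate "the" count.
p = clipped_precision("the the the the".split(), "the cat on the mat".split())
```

Without clipping, the degenerate candidate above would score a perfect precision of 1.0; clipping brings it down to 0.5.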
<p>There exist metrics on single words (unigrams) as well, such as</p>
<ul>
<li>
<p>WER (Word Error Rate), the Levenshtein distance at word level.</p>
</li>
<li>
<p>METEOR, the ChrF3 equivalent at word level, reweighted by a non-alignment (fragmentation) penalty.</p>
</li>
</ul>
<p><a href="//christopher5106.github.io/deep/learning/2018/10/21/course-four-encoder-decoder-architectures-generative-networks-and-adversarial-training.html">Next course ! encoder decoder architectures, generative networks, and adversarial training !</a></p>
<p><strong>Well done!</strong></p>
Sat, 20 Oct 2018 10:00:00 +0000
//christopher5106.github.io/deep/learning/2018/10/20/course-three-natural-language-and-deep-learning.html
//christopher5106.github.io/deep/learning/2018/10/20/course-three-natural-language-and-deep-learning.htmldeeplearningCourse 2: build deep learning neural networks in 5 days only!<p>Here is my course of deep learning in 5 days only!</p>
<p>You might first check <a href="//christopher5106.github.io/deep/learning/2018/10/20/course-zero-deep-learning.html">Course 0: deep learning!</a> and <a href="//christopher5106.github.io/deep/learning/2018/10/20/course-one-programming-deep-learning.html">Course 1: program deep learning!</a> if you have not read them.</p>
<h1 id="common-layers-for-deep-learning">Common layers for deep learning</h1>
<p>After the <strong>Dense</strong> layer seen in Courses 0 and 1, also commonly called the <strong>fully connected (FC)</strong> or <strong>Linear</strong> layer, let’s go further with new layers.</p>
<h4 id="convolutions">Convolutions</h4>
<p>Convolution layers are locally linear layers, defined by a kernel consisting of weights working on a local zone of the input.</p>
<p>On a 1-dimensional input, a 1D-convolution of kernel k=3 is defined by 3 weights <script type="math/tex">w_1, w_2, w_3</script>. The first output is computed:</p>
<script type="math/tex; mode=display">y_1 = w_1 x_1 + w_2 x_2 + w_3 x_3</script>
<p>The next output is the result of shifting the previous computation by one position:</p>
<script type="math/tex; mode=display">y_2 = w_1 x_2 + w_2 x_3 + w_3 x_4</script>
<p>And</p>
<script type="math/tex; mode=display">y_3 = w_1 x_3 + w_2 x_4 + w_3 x_5</script>
<p>If the input is of length <script type="math/tex">l_I</script>, a 1D-convolution of kernel 3 can only produce values for</p>
<script type="math/tex; mode=display">l_{Out} = l_{In} - \text{kernel_size} +1</script>
<p>positions, hence <script type="math/tex">l_{In} - 2</script> positions in the case of kernel 3.</p>
<p>It is also possible to define the stride of the convolution, for example with stride 2, the convolution is shifted by 2 positions, leading to <script type="math/tex">l_{Out} = \text{floor}((l_{In} - 3) / 2) +1</script> output positions.</p>
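<p>The output-length rule above can be wrapped in a small helper (a sketch; the padding and dilation parameters anticipate the variants discussed below, and deep learning frameworks apply the same floor rule):</p>

```python
def conv1d_out_len(l_in, kernel, stride=1, padding=0, dilation=1):
    """Number of valid output positions of a 1D convolution:
    l_out = floor((l_in + 2*padding - effective_kernel) / stride) + 1."""
    effective_kernel = dilation * (kernel - 1) + 1  # reach of a dilated kernel
    return (l_in + 2 * padding - effective_kernel) // stride + 1

# kernel 3, stride 1: l_in - 2 positions
assert conv1d_out_len(10, 3) == 8
# stride 2: floor((10 - 3) / 2) + 1
assert conv1d_out_len(10, 3, stride=2) == 4
# "same" padding for kernel 3, stride 1 preserves the length
assert conv1d_out_len(10, 3, padding=1) == 10
```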
<p>Last, a 1D-convolution can also be applied on matrices, where the first dimension is the length of the sequence and the second dimension is the dimensionality of the data, and modify the dimensionality of the data (number of channels):</p>
<script type="math/tex; mode=display">\text{shape}_{In}=(l_{In}, d_{In}) \rightarrow \text{shape}_{Out} = (l_{Out}, d_{Out})</script>
<script type="math/tex; mode=display">y_{1,j} = \sum_{i=1}^{d_{In}} \left( w_{1,i,j}\, x_{1,i} + w_{2,i,j}\, x_{2,i} + w_{3,i,j}\, x_{3,i} \right)</script>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL24.png" /></p>
<p>A 2D-convolution performs the same kind of computations on 2-dimensional inputs, with a 2-dimensional kernel <script type="math/tex">(k_1, k_2)</script>:</p>
<script type="math/tex; mode=display">\text{shape}_{In}=(h_{In}, w_{In}, d_{In}) \rightarrow \text{shape}_{Out} = (h_{Out}, w_{Out}, d_{Out})</script>
<p>Contrary to Linear/Dense layers, which have one weight (or 2, with the bias) per pair of input and output values, which can be huge, a 1D-convolution has only <script type="math/tex">k \times d_{In} \times d_{Out}</script> weights and a 2D-convolution <script type="math/tex">k_1 \times k_2 \times d_{In} \times d_{Out}</script> weights:</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL26.png" /></p>
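<p>The difference in weight counts can be checked with a quick computation (the layer sizes below are illustrative, not taken from a specific network):</p>

```python
def dense_params(d_in, d_out, bias=True):
    """Weights of a fully connected layer: one per (input, output) pair."""
    return d_in * d_out + (d_out if bias else 0)

def conv1d_params(kernel, d_in, d_out, bias=True):
    """Weights of a 1D convolution: kernel x input channels x output channels."""
    return kernel * d_in * d_out + (d_out if bias else 0)

# A Dense layer connecting two feature maps of length 100 with 64 channels,
# versus a kernel-3 convolution between the same channel counts:
dense = dense_params(100 * 64, 100 * 64)  # 40,966,400 weights
conv = conv1d_params(3, 64, 64)           # 12,352 weights
```

The convolution shares its few weights across all positions, which is where the enormous saving comes from.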
<p>One of the difficulties with convolutions comes from the fact that the output feature map shape depends on the kernel size and the stride:</p>
<script type="math/tex; mode=display">\text{floor}((\text{dimension} - \text{kernel}) / \text{stride}) +1</script>
<p>since the last of the <script type="math/tex">n</script> output positions must satisfy</p>
<script type="math/tex; mode=display">\text{kernel} + (n-1) \times \text{stride} \leq \text{dim}</script>
<p>To avoid that, it is possible to pad the input with zeros to create a larger input, so that input and output feature maps keep the same size. This simplifies the design of architectures:</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL25.png" /></p>
<p>Last, convolutions can be dilated, in order to change the sampling scheme and the reach / the receptive field of the network. A dilated convolution of kernel 3 looks like a normal convolution of kernel 5 for which 2/5 of the kernel weights have been set to 0:</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL29.png" /></p>
<p>Note that a convolution of kernel 1 or (1x1) is called a <strong>pointwise convolution</strong> or <strong>projection convolution</strong> or <strong>expansion convolution</strong>, because it is used to change the dimensionality of the data (number of channels) without changing the length or the width and height of the data.</p>
<p>A convolution that applies a filter to each channel separately, without connections between channels, is called a <strong>depthwise convolution</strong>. The combination of a depthwise convolution followed by a projection (pointwise) convolution helps reduce the number of computations while roughly maintaining accuracy.</p>
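<p>The saving from a depthwise convolution followed by a pointwise projection can be quantified (bias terms omitted for clarity; the channel counts are illustrative):</p>

```python
def standard_conv2d_params(k, d_in, d_out):
    """A k x k filter per (input channel, output channel) pair."""
    return k * k * d_in * d_out

def depthwise_separable_params(k, d_in, d_out):
    depthwise = k * k * d_in  # one k x k filter per input channel
    pointwise = d_in * d_out  # 1x1 projection across channels
    return depthwise + pointwise

standard = standard_conv2d_params(3, 64, 128)       # 73,728 weights
separable = depthwise_separable_params(3, 64, 128)  # 8,768 weights
ratio = standard / separable                        # roughly 8x fewer weights
```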
<p>Last, convolutions with kernel (k,1) or (1,k), processing only one dimension, either height or width, are called <strong>separable convolutions</strong>.</p>
<h4 id="pooling">Pooling</h4>
<p>Pooling operations are like Dense and Convolution layers, but have no weights. Instead, they perform a max or an averaging operation on the input.</p>
<p>There exist MaxPooling and AveragePooling layers, in 1D and 2D, as for convolutions:</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL28.png" /></p>
<p>Usually used with a stride of 2, a MaxPooling of size 2 downsamples the input by 2, which helps summarize the information towards the output, increases the invariance to small translations, and reduces the number of operations in the layers above.</p>
<p>There exist GlobalAveragePooling and GlobalMaxPooling layers, working on the full input, as Dense layers do:</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL27.png" /></p>
<p>When a computer vision network transforms an image of shape (h,w,3) to an output of shape (H,W,C) where <script type="math/tex">H \ll h</script> and <script type="math/tex">W \ll w</script>, a global average pooling layer takes the average over all HxW positions: this helps build networks less sensitive to large translations of the object of interest in the image.</p>
<h4 id="normalization">Normalization</h4>
<p>Normalization layers are placed between other layers to ensure robustness of the trained neural network.</p>
<p>A first category of normalization layers aims at reducing the <em>internal covariate shift</em> during training, i.e. keeping the statistics of each layer’s outputs stable during training so that the next layer can perform better.</p>
<p>The mean and variance of the outputs are brought back to 0 and 1, by subtraction and division, after which the normalization learns new scale and bias parameters.</p>
<p>The statistics can be computed over different sets:</p>
<ul>
<li>
<p>per channel but for all data in batch: batch normalization</p>
</li>
<li>
<p>all channels but for one sample : layer normalization</p>
</li>
<li>
<p>per channel and sample : instance normalization</p>
</li>
<li>
<p>for a group of channels and per sample: group norm</p>
</li>
</ul>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL60.png" /></p>
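<p>The four variants differ only in the axes over which the mean and variance are computed. With channels-last data of shape (N, H, W, C), this can be sketched with NumPy (only the normalization step; the learned scale and bias that follow it are omitted):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 4, 16))  # (batch N, height H, width W, channels C)

def normalize(x, axes, eps=1e-5):
    """Bring the mean/variance over `axes` to 0/1 (before the learned
    scale and bias parameters are applied)."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

batch_norm    = normalize(x, axes=(0, 1, 2))  # per channel, over the whole batch
layer_norm    = normalize(x, axes=(1, 2, 3))  # per sample, over all channels
instance_norm = normalize(x, axes=(1, 2))     # per sample and per channel
```

Group normalization would simply split the channel axis into groups and normalize each (sample, group) slice the same way.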
<p>Another type of normalization layer is the <strong>dropout</strong> layer, which drops some values by setting them to zero with a <em>dropout probability</em>, in order to make the neural network more robust and regularized. In <strong>stochastic depth training</strong>, complete layers are dropped, a technique used in very deep networks such as ResNets and DenseNets.</p>
<h1 id="image-classification">Image classification</h1>
<p>I’ll present some architectures of neural networks for computer vision. The primary trend was to build deeper networks with convolutional and maxpooling layers. Then, began the search for new efficient structures or modules to stack, rather than single layers. While the number of parameters and operations grew, another trend emerged to search for light-weight architectures for mobile devices and embedded applications.</p>
<h4 id="alexnet-2012">AlexNet (2012)</h4>
<p>The network was made up of 5 convolution layers, max-pooling layers, dropout layers, and 3 fully connected layers (60 million parameters and 650,000 neurons).</p>
<p>It was the first neural network to outperform the state of the art image classification of that time and it won the 2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge).</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL44.png" /></p>
<p>Main Points:</p>
<ul>
<li>
<p>Used ReLU (Rectified Linear Unit) for the nonlinearity functions (Found to decrease training time as ReLUs are several times faster than the conventional tanh function).</p>
</li>
<li>
<p>Used data augmentation techniques that consisted of image translations, horizontal reflections, and patch extractions.</p>
</li>
<li>
<p>Implemented dropout layers in order to combat the problem of overfitting to the training data.</p>
</li>
<li>
<p>Trained the model using batch stochastic gradient descent, with specific values for momentum and weight decay.</p>
</li>
</ul>
<h4 id="vggnet-2014">VGGNet (2014)</h4>
<p>The VGG architecture reduced the filter size in each layer but increased the overall depth of the network (up to 16–19 layers), reinforcing the idea that convolutional neural networks have to be deep in order to work well on visual data.</p>
<p>It finished in first and second place in the localisation and classification tasks respectively at the 2014 ILSVRC.</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL48.png" /></p>
<p>Main Points</p>
<ul>
<li>
<p>The use of only 3x3 sized filters is quite different from AlexNet’s 11x11 filters in the first layer: the combination of two 3x3 conv layers has an effective receptive field of 5x5.</p>
</li>
<li>
<p>It simulates a larger filter while decreasing the number of parameters.</p>
</li>
<li>
<p>As the spatial size of the input volumes at each layer decreases (a result of the conv and pool layers), the depth of the volumes increases due to the growing number of filters as you go down the network.</p>
</li>
<li>
<p>It is interesting to notice that the number of filters doubles after each maxpool layer. This reinforces the idea of shrinking spatial dimensions but growing depth.</p>
</li>
</ul>
<h4 id="googlenet--inception-2015">GoogLeNet / Inception (2015)</h4>
<p>GoogLeNet increases the depth of networks with much lower complexity: it is one of the first models to introduce the idea that CNN layers don’t always have to be stacked up sequentially, creating instead smaller networks or “modules”:</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL45.png" /></p>
<p>that could be repeated inside the network:</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL45.jpg" /></p>
<p>GoogLeNet is a 22-layer CNN and was the winner of ILSVRC 2014.</p>
<p>Main Points</p>
<ul>
<li>
<p>Used 9 Inception modules in the whole architecture, with over 100 layers in total!</p>
</li>
<li>
<p>No use of fully connected layers: it uses an average pooling instead, to go from a 7x7x1024 volume to a 1x1x1024 volume. This saves a huge number of parameters.</p>
</li>
<li>
<p>Uses 12x fewer parameters than AlexNet.</p>
</li>
<li>
<p>During testing, multiple crops of the same image are fed into the network, and the softmax probabilities are averaged to give the final prediction.</p>
</li>
<li>
<p>There are updated versions of the Inception module.</p>
</li>
</ul>
<p><strong>This work was a first step towards the development of grouped convolutions in the future</strong>.</p>
<h4 id="resnet-2015">ResNet (2015)</h4>
<p>ResNet utilizes efficient bottleneck structures and learns them as residual functions with reference to the layer inputs, instead of learning unreferenced functions :</p>
<script type="math/tex; mode=display">x \rightarrow x + R(x)</script>
<p>where R is a residual to learn.</p>
<p>Residuals are stacks of convolutions.</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL46.png" /></p>
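<p>The shortcut <script type="math/tex">x \rightarrow x + R(x)</script> can be sketched in NumPy, with plain linear maps standing in for the convolution stack (an illustrative sketch, not the actual ResNet layers):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = x + R(x), where the residual R is a small (linear, ReLU, linear)
    stack standing in for the convolutions. The shortcut forces R to
    preserve the feature dimension."""
    r = relu(x @ w1) @ w2
    return x + r

d = 16
x = rng.normal(size=(4, d))            # a batch of 4 feature vectors
w1 = rng.normal(size=(d, d)) * 0.1
w2 = np.zeros((d, d))                  # R == 0  =>  the block is the identity
y = residual_block(x, w1, w2)
```

Note that driving the residual weights to zero makes the block an exact identity: this is why residual networks can grow very deep without degrading, the extra blocks only having to learn a correction.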
<p>That structure is then sequentially stacked several tens of times (or more).</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL47.png" /></p>
<p>The winner of the classification task of ILSVRC 2015 is a ResNet network.</p>
<p>Main Points</p>
<ul>
<li>
<p>On the ImageNet dataset, residual nets with a depth of up to 152 layers were used, 8x deeper than VGG nets, while still having lower complexity.</p>
</li>
<li>
<p>Residual networks are easier to optimize than traditional networks and can gain accuracy from considerably increased depth. In fact, they do not suffer from vanishing gradients even when the depth becomes very large.</p>
</li>
<li>
<p>Residual networks have a natural limit: a 1202-layer network was trained but got a lower test accuracy, presumably due to overfitting.</p>
</li>
</ul>
<h4 id="squeezenet-2016">SqueezeNet (2016)</h4>
<p>Up to this point, research had focused on improving the accuracy of neural networks; the SqueezeNet team instead took the path of designing smaller models while keeping the accuracy unchanged.</p>
<p>The result is SqueezeNet, a convolutional neural network architecture that has 50 times fewer parameters than AlexNet while maintaining its accuracy on ImageNet.</p>
<p>3 main strategies are used while designing that new architecture:</p>
<ul>
<li>
<p>Replace the majority of 3x3 filters with 1x1 filters (to fit within a budget of a certain number of convolution filters).</p>
</li>
<li>
<p>Decrease the number of input channels to 3x3 filters, using dedicated filters named squeeze layers.</p>
</li>
<li>
<p>Downsample late in the network so that convolution layers have large activation maps.</p>
</li>
</ul>
<p>The first 2 strategies are about judiciously decreasing the quantity of parameters in a CNN while attempting to preserve accuracy.</p>
<p>The last one is about maximizing accuracy on a limited budget of parameters.</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL46.jpg" /></p>
<h4 id="preactresnet-2016">PreActResNet (2016)</h4>
<p>PreActResNet stands for Pre-Activation Residual Net and is an evolution of the ResNet described above.</p>
<p>The residual unit structure is changed from (a) to (b) (BN: Batch Normalization) :</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL49.png" /></p>
<p>The activation functions ReLU and BN are now seen as “pre-activation” of the weight layers, in contrast to conventional view of “post-activation” of the weighted output.</p>
<p>This change made it possible to further increase the depth of the network while improving its accuracy, on ImageNet for instance.</p>
<h4 id="densenet-2016">DenseNet (2016)</h4>
<p>DenseNet is a network architecture where each layer is directly connected to every other layer in a feed-forward fashion (within each dense block). For each layer:</p>
<ul>
<li>
<p>the feature maps of all preceding layers are treated as separate inputs,</p>
</li>
<li>
<p>its own feature maps are passed on as inputs to all subsequent layers.</p>
</li>
</ul>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL49.jpeg" /></p>
<p>This connectivity pattern yields accuracies as good as its predecessors on CIFAR10/100 (with or without data augmentation) and on SVHN. On the large-scale ILSVRC 2012 (ImageNet) dataset, DenseNet achieves a similar accuracy to ResNet, but using less than half the number of parameters and roughly half the number of FLOPs.</p>
<h4 id="resnext-2016">ResNeXt (2016)</h4>
<p>The neural network architecture ResNeXt is based upon 2 strategies :</p>
<ul>
<li>
<p>stacking building blocks of the same shape (strategy inherited from VGG and ResNet)</p>
</li>
<li>
<p>the “split-transform-merge” strategy that is derived from the Inception models and all its variations (split the input, transform it, merge the transformed signal with the original input).</p>
</li>
</ul>
<p>That design introduces group convolutions and exposes a new dimension, called “cardinality” (the size of the set of parallel transformations), as an essential factor in addition to the dimensions of depth and width.</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL51.png" /></p>
<p>It is empirically observed that :</p>
<ul>
<li>
<p>when keeping the complexity constant, increasing cardinality improves classification accuracy.</p>
</li>
<li>
<p>increasing cardinality is more effective than going deeper or wider when we increase the capacity.</p>
</li>
</ul>
<p>ResNeXt finished at the second place in the classification task at 2016 ILSVRC.</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL50.png" /></p>
<h4 id="xception-2016">Xception (2016)</h4>
<p>Xception introduces depthwise separable convolutions that generalize the concept of separable convolutions in Inception.</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL50.jpeg" /></p>
<h4 id="dpn-2017">DPN (2017)</h4>
<p>DPN is a family of convolutional neural networks that intends to efficiently merge residual networks and densely connected networks to get the benefits of both architectures:</p>
<ul>
<li>
<p>residual networks implicitly reuse features, but are not good at exploring new ones,</p>
</li>
<li>
<p>densely connected networks keep exploring new features but suffer from higher redundancy.</p>
</li>
</ul>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL52.png" /></p>
<p>Consequently, DPN (Dual Path Network) contains 2 paths:</p>
<ul>
<li>
<p>a residual alike path (green path, similar to the identity function),</p>
</li>
<li>
<p>a densely connected alike path (blue path, similar to a dense connection within each dense block).</p>
</li>
</ul>
<h4 id="nasnet-2017">NASNet (2017)</h4>
<p>NASNet learns the architecture of the neural network itself, through the Neural Architecture Search (NAS) framework, which uses a reinforcement learning search method to optimize the architecture configuration.</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL53.png" /></p>
<p>Each network proposed by the RL controller network is trained, and its accuracy serves as the reward. To keep the computational effort affordable, the training is performed on a subset of the complete dataset.</p>
<p>The key principle of this model is the design of a new search space (named the “NASNet search space”) which enables transferability from the smaller dataset to the complete one.</p>
<p>Two new modules have been found to achieve state-of-the-art accuracy.</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL54.png" /></p>
<h4 id="senet-2017">SENet (2017)</h4>
<p>A typical convolutional neural network builds a description of its input image by progressively capturing patterns in each of its layers. For each of them, a set of filters is learned to express local spatial connectivity patterns from its input channels. Convolutional filters capture informative combinations by fusing spatial and channel-wise information together within local receptive fields.</p>
<p>SENet (Squeeze-and-Excitation Networks) focuses on the relation between channels and recalibrates its features at each transformation step, so that informative features are emphasized and less useful ones suppressed (independently of their spatial location).</p>
<p>To do so, SENet uses a new architectural unit that consists of 2 steps :</p>
<ul>
<li>
<p>First, squeeze the block input (typically the output of any other convolutional layer) into a global descriptor using global average pooling,</p>
</li>
<li>
<p>Then, “excite” the most informative features using adaptive recalibration.</p>
</li>
</ul>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL55.png" /></p>
<p>The adaptive recalibration is done as follows:</p>
<ul>
<li>
<p>reduce the dimension of its input using a fully connected layer (noted FC below),</p>
</li>
<li>
<p>go through a non-linearity function (ReLU function),</p>
</li>
<li>
<p>restore the dimension of its data using another fully connected layer,</p>
</li>
<li>
<p>use sigmoid function to transform each output into a scale parameter between 0 and 1,</p>
</li>
<li>
<p>linearly rescale each original input of the SE unit according to the scale parameters.</p>
</li>
</ul>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL55.jpg" /></p>
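<p>The whole SE unit fits in a few lines of NumPy (a sketch with channels-last data and plain matrices for the two FC layers; the weights here are random, whereas a real SE block learns them):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w_reduce, w_expand):
    """Squeeze-and-Excitation on a channels-last feature map (N, H, W, C):
    squeeze by global average pooling, excite through FC -> ReLU -> FC ->
    sigmoid, then rescale each channel by its importance in (0, 1)."""
    squeezed = x.mean(axis=(1, 2))                 # (N, C) global descriptor
    hidden = np.maximum(squeezed @ w_reduce, 0.0)  # FC + ReLU, reduced dimension
    scale = sigmoid(hidden @ w_expand)             # (N, C) channel scales
    return x * scale[:, None, None, :]             # channel-wise rescaling

n, h, w, c, r = 2, 4, 4, 16, 4                     # reduction ratio r = 4
x = rng.normal(size=(n, h, w, c))
w_reduce = rng.normal(size=(c, c // r)) * 0.1
w_expand = rng.normal(size=(c // r, c)) * 0.1
y = se_block(x, w_reduce, w_expand)
```

Since each scale lies strictly between 0 and 1, every channel is attenuated according to its learned importance, and the spatial layout is untouched.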
<p>The winner of the classification task of ILSVRC 2017 is a modified ResNeXt integrating SE blocks.</p>
<h4 id="mobilenets-and-mobilenetv2">MobileNets and Mobilenetv2</h4>
<p>There has also been some recent effort to adapt neural networks to less powerful hardware such as mobile devices, leading to the creation of a class of networks named MobileNet, with a new module optimized for mobile:</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL56.png" /></p>
<p>The diagram below illustrates that, by accepting a (slightly) lower accuracy than the state of the art, it is possible to create networks much less demanding in terms of resources (note that the multiply/add axis is on a logarithmic scale).</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL59.png" /></p>
<h4 id="shufflenet">ShuffleNet</h4>
<p>ShuffleNet goes one step further:</p>
<ul>
<li>
<p>by replacing the pointwise convolutions, which become the main cost, with pointwise group convolutions,</p>
</li>
<li>
<p>shuffling the output of group convolutions so that information can flow from each group to each group of the next group convolution:</p>
</li>
</ul>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL61.png" /></p>
<p>While (a) is the bottleneck unit from Xception, grouping the features in the first pointwise convolution requires shuffling the data before the next one (b). When the module is asked to reduce the feature map size, the depthwise convolution has a stride of 2, an average pooling of stride 2 is applied to the shortcut connection, and the output channels of both paths are concatenated rather than added, in order to augment the channel dimension (c).</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL57.png" /></p>
<p>Group convolutions reduce the computation budget, so feature maps can be extended with more channels, bringing better accuracy in smaller models.</p>
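<p>The channel shuffle itself is just a reshape and a transpose (sketched here with channels-last data; actual implementations apply the same trick on their own memory layout):</p>

```python
import numpy as np

def channel_shuffle(x, groups):
    """ShuffleNet channel shuffle on a channels-last map (N, H, W, C):
    reshape C into (groups, C // groups), swap the two axes and flatten
    back, so the next group convolution sees channels from every group."""
    n, h, w, c = x.shape
    assert c % groups == 0
    x = x.reshape(n, h, w, groups, c // groups)
    x = x.transpose(0, 1, 2, 4, 3)  # interleave the groups
    return x.reshape(n, h, w, c)

# Channels 0..3 belong to group 0, channels 4..7 to group 1:
x = np.arange(8, dtype=np.float64).reshape(1, 1, 1, 8)
y = channel_shuffle(x, groups=2)  # channels become 0, 4, 1, 5, 2, 6, 3, 7
```

After the shuffle, every group of the next group convolution receives channels coming from both original groups, which is what restores the information flow.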
<p><strong>Exercise</strong>: use a Pytorch model to predict the class of an image.</p>
<h1 id="object-detection">Object detection</h1>
<p>For object detection, the first layers of image classification networks serve as a basis, as “features”, on top of which new neural network parts are learned, using different techniques: Faster-RCNN, R-FCN, SSD, …. The pretrained layers of image classification networks have learned a “representation” of the data on a high volume of images, which helps train object detection architectures on specialized datasets. Below is a diagram presenting different object detection techniques with different feature networks. When the feature network is more efficient for image classification, results in object detection are also better.</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL58.png" /></p>
<h1 id="segmentation">Segmentation</h1>
<h1 id="audio">Audio</h1>
<p><strong>Well done!</strong></p>
<p><a href="//christopher5106.github.io/deep/learning/2018/10/20/course-three-natural-language-and-deep-learning.html">Next course ! natural language and deep learning</a></p>
Sat, 20 Oct 2018 08:00:00 +0000
//christopher5106.github.io/deep/learning/2018/10/20/course-two-build-deep-learning-networks.html
//christopher5106.github.io/deep/learning/2018/10/20/course-two-build-deep-learning-networks.htmldeeplearningCourse 1: learn to program deep learning in Pytorch, MXnet, CNTK, Tensorflow and Keras!<p>Here is my course of deep learning in 5 days only!</p>
<p>You might first check <a href="//christopher5106.github.io/deep/learning/2018/10/20/course-zero-deep-learning.html">Course 0: deep learning!</a> if you have not read it. A great article about cross-entropy and its generalization.</p>
<p>In this article, I’ll go for the introduction to deep learning and programming, coding few functions under deep learning technologies: Pytorch, Keras, Tensorflow, MXNet, CNTK.</p>
<h1 id="your-programming-environment">Your programming environment</h1>
<p>Deep learning demands heavy computations, so all deep learning libraries offer the possibility of parallel computing on GPU rather than CPU, and of distributed computing on multiple GPUs or instances.</p>
<p>The use of specific hardware such as GPUs requires installing an up-to-date driver in the operating system first.</p>
<p>While OpenCL (not to be confused with OpenGL for graphics or OpenCV for images) is an open standard for scientific GPU programming, the most used GPU library is CUDA, a proprietary library by NVIDIA, to be used on NVIDIA GPUs only.</p>
<p>cuDNN is a second library, coming with CUDA, that provides more optimized operators.</p>
<p>Once installed on your system, these libraries will be called by higher level deep learning frameworks, such as Caffe, Tensorflow, MXNet, CNTK, Torch or Pytorch.</p>
<p>The command <code class="highlighter-rouge">nvidia-smi</code> enables you to check the status of your GPUs, as with <code class="highlighter-rouge">top</code> or <code class="highlighter-rouge">ps</code> commands for CPUs.</p>
<p>The most recent GPU architectures are Pascal and Volta. The more memory the GPU has, the better. Operations are usually performed in single precision <code class="highlighter-rouge">float32</code> rather than double precision <code class="highlighter-rouge">float64</code>, and the new Volta architecture offers Tensor Cores specialized in half precision (<code class="highlighter-rouge">float16</code>) operations.</p>
<p>One of the main difficulties comes from the fact that different deep learning frameworks are not available and tested for all CUDA versions, CUDNN versions, or even all OSes. CUDA versions are not available for all driver versions and OSes either.</p>
<p>Solutions are:</p>
<ul>
<li>
<p>use Docker containers, which reduce the choice on the host operating system to the driver version alone; the compliant CUDA and CUDNN versions, as well as the deep learning frameworks, are installed inside the Docker container.</p>
</li>
<li>
<p>or use environment managers such as <code class="highlighter-rouge">conda</code> or <code class="highlighter-rouge">virtualenv</code>. A few commands to know:</p>
</li>
</ul>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#create an environment named pytorch</span>
conda create <span class="nt">-n</span> pytorch <span class="nv">python</span><span class="o">=</span>3.4
<span class="c"># activate it</span>
<span class="nb">source </span>activate pytorch
<span class="c"># install pytorch in this environment</span>
conda install pytorch
<span class="c"># you are done!</span>
<span class="c"># Run either a python shell</span>
python
<span class="c"># or jupyter notebook</span>
conda install jupyter
jupyter notebook
</code></pre></div></div>
<p>The Jupyter UI lets you choose the Conda environment inside the notebook.</p>
<p>It is possible to combine CUDA and OpenGL for <a href="//www.nvidia.com/content/gtc/documents/1055_gtc09.pdf">graphical applications requiring deep learning predictions: the image data is fully processed on GPU</a>.</p>
<h1 id="the-batch">The batch</h1>
<p>When applying the update rule, the best would be to compute the gradients on the whole dataset, but it is too costly. Usually we use a batch of training examples: it is a trade-off that performs better than a single example and is not too long to compute.</p>
<p>The learning rate needs to be adjusted depending on the batch size: the bigger the batch is, the bigger the learning rate can be.</p>
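<p>A common heuristic for this adjustment is the linear scaling rule: grow the learning rate by the same factor as the batch size (a rule of thumb, usually combined with a warmup phase, not a guarantee for every model; the base values below are illustrative):</p>

```python
def scaled_learning_rate(base_lr, base_batch, batch):
    """Linear scaling rule: multiply the learning rate by the same factor
    as the batch size. A widely used heuristic, not a universal law."""
    return base_lr * batch / base_batch

# Scaling a reference setup (lr 0.1 at batch 256) up to batch 1024:
lr = scaled_learning_rate(base_lr=0.1, base_batch=256, batch=1024)  # -> 0.4
```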
<p>All deep learning programs and frameworks consider the first dimension of your data as the batch size. All other dimensions are the data dimensionality. For an image, it is <code class="highlighter-rouge">BxHxWxC</code>, written as a shape <code class="highlighter-rouge">(B, H, W, C)</code>. After a few layers, the shape of the data will change to <code class="highlighter-rouge">(B, h, w, c)</code>: the batch size remains constant, the number of channels usually increases (<script type="math/tex">c \geq C</script>) with network depth while the feature map size decreases (<script type="math/tex">h \leq H, w \leq W</script>) for the top layers’ outputs.</p>
<p>This format is very common and is called <em>channel last</em>. Some deep learning frameworks work with <em>channel first</em>, such as CNTK, or allow changing the format, as in Keras, to <code class="highlighter-rouge">(B, C, H, W)</code>.</p>
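<p>As a quick sketch of the two conventions with Numpy (the same permutation applies to framework tensors), converting a channel-last batch into a channel-first one is a transposition of the axes:</p>

```python
import numpy as np

# a batch of 8 RGB images of size 32x32 in channel-last format (B, H, W, C)
x = np.ones((8, 32, 32, 3))
# reorder the axes to channel-first format (B, C, H, W)
y = np.transpose(x, (0, 3, 1, 2))
print(y.shape)  # (8, 3, 32, 32)
```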
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL14.png" /></p>
<p>To distribute the training over multiple GPUs or instances, the easiest way is to split along the batch dimension, which we call <em>data parallelism</em>, and dispatch the different splits to their respective instance/GPU. The parameter update step requires synchronizing the gradient computations to some degree. NVIDIA provides fast multi-GPU collectives in its NCCL library, and fast hardware connections between GPUs with NVLink 2.0.</p>
<p>At the basis of the training is the sample (the example, the datapoint). A batch or minibatch groups multiple samples trained in one iteration or step. An epoch is usually the number of iterations needed to see the whole training dataset, although in some training programs, for very large datasets, an epoch is defined as a fixed number of iterations after which the model is evaluated to monitor the training and follow metrics. The dataset needs to be reshuffled at each epoch.
<img src="//christopher5106.github.io/img/epochs.png" /></p>
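<p>This bookkeeping can be sketched in a few lines of Numpy (the dataset and batch sizes are arbitrary illustrations): shuffle the sample indices at each epoch and slice them into batches:</p>

```python
import numpy as np

dataset_size, batch_size = 10, 4
for epoch in range(2):
    # reshuffle the sample order at each epoch
    indices = np.random.permutation(dataset_size)
    # one iteration (step) per batch; the last batch may be smaller
    for start in range(0, dataset_size, batch_size):
        batch = indices[start:start + batch_size]
        print(epoch, batch)  # the forward/backward pass would go here
```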
<h1 id="training-curves-and-metrics">Training curves and metrics</h1>
<p>As we have seen on <a href="//christopher5106.github.io/deep/learning/2018/10/20/course-zero-deep-learning.html">Course 0</a>, we use a <em>cost function</em> to fit the model to the goal.</p>
<p>So, during training of a model, we usually plot the <strong>training loss</strong>, and if there is no bug, it is not surprising to see it decreasing as the number of training steps or iterations grows.</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL16.png" /></p>
<p>Nevertheless, we usually keep 2 to 10 percent of the training set aside from the training process, which we call the <strong>validation dataset</strong>, and compute the loss on this set as well. Depending on whether the model has enough capacity, the <strong>validation loss</strong> might increase after a certain step: we call this situation <strong>overfitting</strong>, where the model has memorized the training dataset but does not generalize to unseen examples. To avoid it, we monitor the validation metrics and stop the training process when the validation loss starts to increase, since the model only gets worse afterwards.</p>
<p>On top of the loss, it is possible to monitor other metrics. Metrics might not be differentiable, and minimizing the loss might not optimize the metrics. In image classification, a very classical one is the <strong>accuracy</strong>, that is, the ratio of correctly classified examples in the dataset; its complement is the <strong>error rate</strong>.</p>
<p>We also usually compute the <strong>precision/recall curve</strong>: precision is the fraction of true positives among the examples predicted as positive by the model (true positives + false positives), while recall is the fraction of true positives among the total number of actual positives (true positives + false negatives). For some applications, such as document retrieval, we prefer a higher recall; for others, such as automatic document classification, we prefer a high precision for the automatically classified documents, leaving ambiguities to human operators.</p>
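<p>These two definitions translate directly into code; a minimal sketch from the raw counts:</p>

```python
def precision_recall(tp, fp, fn):
    # precision: fraction of predicted positives that are correct
    precision = tp / (tp + fp)
    # recall: fraction of actual positives that are retrieved
    recall = tp / (tp + fn)
    return precision, recall

print(precision_recall(8, 2, 8))  # (0.8, 0.5)
```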
<p>In order to summarize the quality of the model into one value, one can compute:</p>
<ul>
<li>
<p>either the <strong>Area Under the Curve (AUC)</strong> instead of the full precision/recall curve,</p>
</li>
<li>
<p>or the <strong>F1-score</strong>, which is <script type="math/tex">2 \times \frac{\text{precision} \times \text{recall}}{ \text{precision} + \text{recall} }</script></p>
</li>
<li>
<p>or more generally the <script type="math/tex">F_\beta = (1+\beta^2) \times \frac{\text{precision} \times \text{recall}}{ \beta^2 \times \text{precision} + \text{recall} }</script></p>
</li>
</ul>
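<p>Both scores follow directly from precision and recall; a small sketch, where <script type="math/tex">\beta = 1</script> recovers the F1-score:</p>

```python
def f_beta(precision, recall, beta=1.0):
    # weighted harmonic mean of precision and recall;
    # beta > 1 weights recall higher, beta < 1 weights precision higher
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.8, 0.5))  # F1 = 2 * 0.4 / 1.3
```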
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL11.png" /></p>
<h1 id="a-library-for-deep-learning">A library for deep learning</h1>
<p>A deep learning library offers the following characteristics:</p>
<ol>
<li>
<p>It works very well with Numpy arrays. Arrays are called <strong>Tensors</strong>, and operations on Tensors follow many Numpy conventions.</p>
</li>
<li>
<p>It provides abstract classes, such as Tensors, for parallel computing on GPU rather than CPU, and for distributed computing on multiple GPUs or instances, since deep learning demands huge computations.</p>
</li>
<li>
<p>Operators have a ‘backward’ implementation that computes the gradients for you, with respect to the inputs or the parameters.</p>
</li>
</ol>
<p>Let’s load the Pytorch module into a Python shell, together with the Numpy library, and check that the Pytorch version is correct and that the CUDA library is correctly installed (only if you have a GPU):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="k">print</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">__version__</span><span class="p">)</span> <span class="c"># 0.4.1</span>
<span class="k">print</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">cuda</span><span class="o">.</span><span class="n">is_available</span><span class="p">())</span> <span class="c"># True</span>
</code></pre></div></div>
<h4 id="1-numpy-compatibility">1. Numpy compatibility</h4>
<p>You can easily check the following commands in Pytorch and Numpy:</p>
<table>
<thead>
<tr>
<th>Command</th>
<th>Numpy</th>
<th>Pytorch</th>
</tr>
</thead>
<tbody>
<tr>
<td>Array conversion</td>
<td>np.array([[1,2]])</td>
<td>torch.Tensor([[1,2]])</td>
</tr>
<tr>
<td>5x3 matrix, uninitialized</td>
<td>x = np.empty((5,3))</td>
<td>x = torch.Tensor(5,3)</td>
</tr>
<tr>
<td>initialized with ones</td>
<td>x = np.ones((5,3))</td>
<td>x = torch.ones(5,3)</td>
</tr>
<tr>
<td>initialized with zeros</td>
<td>x = np.zeros((5,3))</td>
<td>x = torch.zeros(5,3)</td>
</tr>
<tr>
<td>uniformly randomly initialized matrix</td>
<td>x = np.random.rand(5,3)</td>
<td>x = torch.rand(5, 3)</td>
</tr>
<tr>
<td>normal randomly initialized matrix</td>
<td>x = np.random.randn(5,3)</td>
<td>x = torch.randn(5, 3)</td>
</tr>
<tr>
<td>Shape/size</td>
<td>x.shape</td>
<td>x.size()</td>
</tr>
<tr>
<td>Elementwise Addition</td>
<td>+/np.add</td>
<td>+/torch.add</td>
</tr>
<tr>
<td>Elementwise multiplication</td>
<td>*</td>
<td>*</td>
</tr>
<tr>
<td>In-place addition</td>
<td>x+=</td>
<td>x.add_()</td>
</tr>
<tr>
<td>Second column</td>
<td>x[:, 1]</td>
<td>x[:, 1]</td>
</tr>
<tr>
<td>Matrix multiplication</td>
<td>.matmul()</td>
<td>.mm()</td>
</tr>
<tr>
<td>Matrix-Vector multiplication</td>
<td>-</td>
<td>.mv()</td>
</tr>
<tr>
<td>Reshape</td>
<td>.reshape(shape)</td>
<td>.view(size)</td>
</tr>
<tr>
<td>Transpose</td>
<td>np.transpose(,(1,0))</td>
<td>torch.transpose(,0,1)</td>
</tr>
<tr>
<td>Concatenate</td>
<td>np.concatenate([])</td>
<td>torch.cat([])</td>
</tr>
<tr>
<td>Stack</td>
<td>np.stack([], 1)</td>
<td>torch.stack([], 1)</td>
</tr>
<tr>
<td>Add a dimension</td>
<td>np.expand_dims(, axis)</td>
<td>.unsqueeze(axis)</td>
</tr>
<tr>
<td>Squeeze a dimension</td>
<td>np.squeeze(, axis)</td>
<td>.squeeze(axis)</td>
</tr>
<tr>
<td>Range of values</td>
<td>np.arange()</td>
<td>torch.arange()</td>
</tr>
<tr>
<td>Maximum of the array</td>
<td>np.amax(, axis)</td>
<td>torch.max(, axis)</td>
</tr>
<tr>
<td>Elementwise max</td>
<td>np.maximum(a,b)</td>
<td>torch.max(a,b)</td>
</tr>
</tbody>
</table>
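<p>As a spot-check of a few rows of the Numpy column (the Pytorch calls mirror them one-for-one):</p>

```python
import numpy as np

x = np.arange(6).reshape(2, 3)        # range of values + reshape
print(x.shape)                        # (2, 3)
print(np.expand_dims(x, 0).shape)     # add a dimension -> (1, 2, 3)
print(np.transpose(x, (1, 0)).shape)  # transpose -> (3, 2)
print(np.amax(x, 1))                  # maximum along axis 1 -> [2 5]
print(np.maximum(x, 3))               # elementwise max with 3
```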
<p>You can link a Numpy array and a Torch Tensor, either with</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">numpy_array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
<span class="n">torch_array</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">from_numpy</span><span class="p">(</span><span class="n">numpy_array</span><span class="p">)</span>
</code></pre></div></div>
<p>or</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">torch_array</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
<span class="n">numpy_array</span> <span class="o">=</span> <span class="n">torch_array</span><span class="o">.</span><span class="n">numpy</span><span class="p">()</span>
</code></pre></div></div>
<p>both of which keep pointers to the original values, so that a change to one is reflected in the other:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">numpy_array</span><span class="o">+=</span> <span class="mi">1</span>
<span class="k">print</span><span class="p">(</span><span class="n">torch_array</span><span class="p">)</span> <span class="c"># tensor([2., 2., 2., 2., 2.])</span>
<span class="n">torch_array</span><span class="o">.</span><span class="n">add_</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">numpy_array</span><span class="p">)</span> <span class="c"># [3. 3. 3. 3. 3.]</span>
</code></pre></div></div>
<p><strong>Exercise</strong>: find the equivalent operations under Tensorflow, Keras, CNTK, MXNET</p>
<p><strong>Solution</strong>: <a href="//christopher5106.github.io/img/deeplearningcourse/tensorflow_commands.txt">tensorflow</a>, <a href="//christopher5106.github.io/img/deeplearningcourse/keras_commands.txt">keras</a>, <a href="//christopher5106.github.io/img/deeplearningcourse/mxnet_commands.txt">mxnet</a>, <a href="//christopher5106.github.io/img/deeplearningcourse/cntk_commands.txt">cntk</a></p>
<h4 id="2-gpu-computing">2. GPU computing</h4>
<p>It is possible to transfer tensors between devices, i.e. between RAM and each GPU’s memory:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">5</span><span class="p">,)</span> <span class="c"># tensor([1., 1., 1., 1., 1.])</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">5</span><span class="p">,)</span>
<span class="n">a_</span> <span class="o">=</span> <span class="n">a</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span> <span class="c"># tensor([1., 1., 1., 1., 1.], device='cuda:0')</span>
<span class="n">b_</span> <span class="o">=</span> <span class="n">b</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">a_</span> <span class="o">+</span> <span class="n">b_</span> <span class="c"># tensor([2., 2., 2., 2., 2.], device='cuda:0')</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="mi">1</span> <span class="c"># tensor([3., 3., 3., 3., 3.], device='cuda:0')</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">y</span><span class="o">.</span><span class="n">cuda</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="c"># tensor([3., 3., 3., 3., 3.], device='cuda:1')</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">z</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span> <span class="c"># tensor([3., 3., 3., 3., 3.])</span>
</code></pre></div></div>
<p>but keep in mind that synchronization is lost (contrary to the Numpy arrays and Torch Tensors above, they cannot share pointers since the values are no longer on the same device):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span><span class="o">.</span><span class="n">add_</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">a_</span><span class="p">)</span> <span class="c"># tensor([1., 1., 1., 1., 1.], device='cuda:0')</span>
</code></pre></div></div>
<p>Tensors behave as classical non-reference programming variables, and their content is copied from one device to the other. This way, you decide when to transfer the data.</p>
<p>Contrary to other frameworks, Pytorch does not require you to build a graph of operators and execute the graph on a device. Pytorch programming is just normal Python programming.</p>
<h4 id="3-automatic-differentiation">3. Automatic differentiation</h4>
<p>To compute the gradient automatically, you need to wrap the tensors in Variable objects:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">torch.autograd</span> <span class="kn">import</span> <span class="n">Variable</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
<span class="c">#tensor([2.])</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">grad</span><span class="p">)</span>
<span class="c"># None</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">grad_fn</span><span class="p">)</span>
<span class="c"># None</span>
</code></pre></div></div>
<p>The Variable contains the original data in the <code class="highlighter-rouge">data</code> attribute. The API for Tensors is also available for Variables.</p>
<p>When you add an operation, such as the square operator, the newly created Variable is populated with the result in its <code class="highlighter-rouge">data</code> attribute, as well as with a history <code class="highlighter-rouge">grad_fn</code> function:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">y</span> <span class="o">=</span> <span class="n">x</span> <span class="o">**</span> <span class="mi">2</span>
<span class="k">print</span><span class="p">(</span><span class="n">y</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
<span class="c"># tensor([4.])</span>
<span class="k">print</span><span class="p">(</span><span class="n">y</span><span class="o">.</span><span class="n">grad</span><span class="p">)</span>
<span class="c"># None</span>
<span class="k">print</span><span class="p">(</span><span class="n">y</span><span class="o">.</span><span class="n">grad_fn</span><span class="p">)</span>
<span class="c"># <PowBackward0 object at 0x7ffaf1ae0160></span>
</code></pre></div></div>
<p>The <code class="highlighter-rouge">grad_fn</code> Function contains the history, i.e. the link back to x, which enables computing the derivative via the Variable’s <code class="highlighter-rouge">backward()</code> method:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">y</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
<span class="c"># tensor([2.])</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">grad</span><span class="p">)</span>
<span class="c"># tensor([4.])</span>
</code></pre></div></div>
<p>The gradient of the cost y with respect to the input x is placed in <code class="highlighter-rouge">x.grad</code> and since <script type="math/tex">\frac{\partial x^2}{\partial x} = 2 x</script>, its value is 4 at x=2.</p>
<p><strong>Exercise</strong>: with Torch, compute the gradient of <script type="math/tex">z = 3 x + 2 y</script> with respect to x=5 and y=2.</p>
<p>Calling <code class="highlighter-rouge">y.backward()</code> a second time can lead to a <code class="highlighter-rouge">RuntimeError</code>, because the intermediate buffers of the graph are freed after the first call. In order to keep them and accumulate the gradients into <code class="highlighter-rouge">x.grad</code> over several calls, set <code class="highlighter-rouge">retain_graph=True</code> during the first backward call.</p>
<p>Let’s confirm this in a case where the input is multi-dimensional and reduced into a scalar with a sum operator:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">y</span><span class="o">.</span><span class="n">backward</span><span class="p">(</span><span class="n">retain_graph</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">grad</span><span class="p">)</span>
<span class="c"># tensor([1., 1.])</span>
<span class="n">y</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">grad</span><span class="p">)</span>
<span class="c"># tensor([2., 2.])</span>
<span class="n">y</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">grad</span><span class="p">)</span>
<span class="c"># tensor([3., 3.])</span>
</code></pre></div></div>
<p>Since <script type="math/tex">\frac{\partial}{\partial x_1} (x_1 + x_2) = 1</script> and <script type="math/tex">\frac{\partial}{\partial x_2} (x_1 + x_2) = 1</script>, the <code class="highlighter-rouge">x.grad</code> tensor is populated with ones.</p>
<p>Applying the <code class="highlighter-rouge">backward()</code> method multiple times accumulates the gradients into the <code class="highlighter-rouge">grad</code> attribute.</p>
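<p>When accumulation is not desired, the usual remedy is to zero the gradient buffer between backward calls; a minimal sketch in the same Variable style:</p>

```python
import torch
from torch.autograd import Variable  # in recent Pytorch, Variable is a thin alias of Tensor

x = Variable(torch.ones(2), requires_grad=True)
y = x.sum()
y.backward(retain_graph=True)
print(x.grad)   # tensor([1., 1.])
x.grad.zero_()  # reset the buffer instead of accumulating
y.backward()
print(x.grad)   # tensor([1., 1.]) again, not tensor([2., 2.])
```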
<p>It is also possible to apply the <code class="highlighter-rouge">backward()</code> method to something other than a cost (scalar), for example to a layer or operation with a multi-dimensional output, as in the middle of a neural network. In this case, you need to provide as argument to the <code class="highlighter-rouge">backward()</code> method <script type="math/tex">\Big( \nabla_{I_{t+1}} \text{cost} \Big)</script>, the gradient of the cost with respect to the output of the current operator/layer (written here as the input of the operator/layer above). It is multiplied by <script type="math/tex">\Big( \nabla_{\theta_t} L_t \Big)</script>, the gradient of the current operator/layer’s output with respect to its parameters, to produce the gradient of the cost with respect to the current layer’s parameters:</p>
<script type="math/tex; mode=display">\nabla_{\theta_t} \text{cost} = \nabla_{\theta_t} \Big[ ( \text{cost} \circ S \circ ... \circ L_{t+1} ) \circ L_t \Big] = \Big( \nabla_{I_{t+1}} \text{cost} \Big) \times \nabla_{\theta_t} L_t</script>
<p>as given by the chaining rule seen in <a href="//christopher5106.github.io/deep/learning/2018/10/20/course-zero-deep-learning.html">Course 0</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">x</span> <span class="o">**</span> <span class="mi">2</span>
<span class="k">print</span><span class="p">(</span><span class="n">y</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
<span class="c"># tensor([1., 1.])</span>
<span class="n">y</span><span class="o">.</span><span class="n">backward</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="o">-</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">]))</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">grad</span><span class="p">)</span>
<span class="c"># tensor([-2., 2.])</span>
</code></pre></div></div>
<p>The gradient of the final cost with respect to the output of the current operator/layer indicates how to combine the derivatives of the different output values of the current layer when producing the derivative with respect to each parameter.</p>
<p>As Pytorch does not require introducing complex graph operators as in other technologies (switches, comparisons, dependency controls, scans/loops…), it lets you program as usual, and gradients are propagated correctly through your Python code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">x</span>
<span class="k">while</span> <span class="n">y</span> <span class="o"><</span> <span class="mi">10</span><span class="p">:</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">y</span><span class="o">**</span><span class="mi">2</span>
<span class="k">print</span><span class="p">(</span><span class="n">y</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
<span class="c"># tensor([16.])</span>
<span class="n">y</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">grad</span><span class="p">)</span>
<span class="c"># tensor([32.])</span>
</code></pre></div></div>
<p>which is fantastic. In this case, <script type="math/tex">y\vert_{x=2} = ( x^2 )^2 = x^4</script> and <script type="math/tex">\frac{\partial y}{\partial x} \big\vert_{x=2} = 4 x^3 = 32</script>.</p>
<p>Note that gradients are computed by backpropagation until reaching either a Variable with no <code class="highlighter-rouge">grad_fn</code> (an input Variable set by the user) or a Variable with <code class="highlighter-rouge">requires_grad</code> set to <code class="highlighter-rouge">False</code>, which helps save computations.</p>
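<p>A minimal sketch of this behaviour: gradients flow into the Variable that requires them, while the one with <code class="highlighter-rouge">requires_grad</code> set to <code class="highlighter-rouge">False</code> keeps an empty <code class="highlighter-rouge">grad</code>:</p>

```python
import torch
from torch.autograd import Variable

w = Variable(torch.ones(2), requires_grad=True)   # a learnable parameter
x = Variable(torch.ones(2), requires_grad=False)  # e.g. input data
y = (w * x).sum()
y.backward()
print(w.grad)  # tensor([1., 1.])
print(x.grad)  # None: backpropagation stops at x, saving computations
```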
<p><strong>Exercise</strong>: compute the derivative with Keras, Tensorflow, CNTK, MXNet</p>
<h1 id="training-loop">Training loop</h1>
<p>Let’s go back to our <a href="//christopher5106.github.io/deep/learning/2018/10/20/course-zero-deep-learning.html">Course 0</a> perceptron and implement its training directly with Pytorch tensors and operators, without any other package.
Let’s consider a 20-dimensional input and 32 outputs for each dense layer.</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL15.png" /></p>
<p>Pytorch only requires implementing the forward pass of our perceptron. Each Dense layer is composed of two learnable parameter tensors, a weight matrix and a bias:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">theta1</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span><span class="mi">20</span><span class="p">)</span> <span class="o">*</span><span class="mf">0.1</span><span class="p">,</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">bias1</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">32</span><span class="p">)</span><span class="o">*</span><span class="mf">0.1</span><span class="p">,</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">theta2</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span><span class="mi">32</span><span class="p">)</span><span class="o">*</span><span class="mf">0.1</span><span class="p">,</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">bias2</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">32</span><span class="p">)</span><span class="o">*</span><span class="mf">0.1</span><span class="p">,</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="c"># affine operation of the first Dense layer</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">theta1</span><span class="o">.</span><span class="n">mv</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">bias1</span>
    <span class="c"># ReLu activation</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">Variable</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">([</span><span class="mi">0</span><span class="p">])))</span>
    <span class="c"># affine operation of the second Dense layer</span>
    <span class="k">return</span> <span class="n">theta2</span><span class="o">.</span><span class="n">mv</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">+</span> <span class="n">bias2</span>
</code></pre></div></div>
<p>As a first loss function, let’s use the square of the sum of the outputs. We can take whatever we want, as long as it returns a scalar value to minimize:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">cost</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">z</span><span class="p">))</span> <span class="o">**</span> <span class="mi">2</span>
</code></pre></div></div>
<p>A training loop iterates over a dataset of training examples, and each iteration consists of:</p>
<ul>
<li>
<p>a forward pass : propagate the input values through layers from bottom to top, until the cost/loss</p>
</li>
<li>
<p>a backward pass : propagate the gradients from top to bottom and into each of the parameters</p>
</li>
<li>
<p>apply the parameter update rule <script type="math/tex">\theta_L \leftarrow \theta_L - \lambda \nabla_{\theta_L} \text{cost}</script> for each layer L</p>
</li>
</ul>
<p>Let’s train this network on random inputs, one sample at a time:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span>
    <span class="n">lr</span> <span class="o">=</span> <span class="mf">0.001</span> <span class="o">*</span> <span class="p">(</span><span class="o">.</span><span class="mi">1</span> <span class="o">**</span> <span class="p">(</span> <span class="nb">max</span><span class="p">(</span><span class="n">i</span> <span class="o">-</span> <span class="mi">500</span> <span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">//</span> <span class="mi">100</span><span class="p">))</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">20</span><span class="p">),</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">z</span> <span class="o">=</span> <span class="n">forward</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">cost</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"cost {} - learning rate {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">item</span><span class="p">(),</span> <span class="n">lr</span><span class="p">))</span>
    <span class="c"># compute the gradients</span>
    <span class="n">c</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="c"># apply the gradients</span>
    <span class="n">theta1</span><span class="o">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">theta1</span><span class="o">.</span><span class="n">data</span> <span class="o">-</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">theta1</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">data</span>
    <span class="n">bias1</span><span class="o">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">bias1</span><span class="o">.</span><span class="n">data</span> <span class="o">-</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">bias1</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">data</span>
    <span class="n">theta2</span><span class="o">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">theta2</span><span class="o">.</span><span class="n">data</span> <span class="o">-</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">theta2</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">data</span>
    <span class="n">bias2</span><span class="o">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">bias2</span><span class="o">.</span><span class="n">data</span> <span class="o">-</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">bias2</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">data</span>
    <span class="c"># clear the grad</span>
    <span class="n">theta1</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">zero_</span><span class="p">()</span>
    <span class="n">bias1</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">zero_</span><span class="p">()</span>
    <span class="n">theta2</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">zero_</span><span class="p">()</span>
    <span class="n">bias2</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">zero_</span><span class="p">()</span>
</code></pre></div></div>
<p><strong>Exercise</strong>: check the norm of the parameters at every iteration.</p>
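<p>A minimal sketch for this exercise, printing the L2 norm of each parameter (the tensors below are illustrative stand-ins for the <code class="highlighter-rouge">theta1</code>/<code class="highlighter-rouge">bias1</code> parameters of the loop above):</p>

```python
import torch

# Hypothetical stand-ins for the learnable parameters above
params = {"theta1": torch.randn(2, 12) * 0.01,
          "bias1": torch.randn(12) * 0.01}
for name, p in params.items():
    # L2 norm of the parameter tensor; print it at every iteration
    # to see whether the weights grow, shrink, or stabilize
    print("{} - norm {:.4f}".format(name, p.norm().item()))
```

<p>Inside the training loop, the same lines placed after the update step show how the parameter magnitudes evolve with the learning rate schedule.</p>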
<p>In a classification task,</p>
<ul>
<li>
<p>training is usually performed on a batch of samples, instead of a single sample, at each iteration, both to speed up training and to reduce the cost of transferring data to the GPU</p>
</li>
<li>
<p>the model is required to output a number of values equal to the number of classes, normalized by a softmax activation</p>
</li>
<li>
<p>the loss is the cross-entropy</p>
</li>
</ul>
<p>as we have seen in <a href="//christopher5106.github.io/deep/learning/2018/10/20/course-zero-deep-learning.html">Course 0</a>.</p>
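<p>As a quick numeric check of the softmax / cross-entropy recipe described above (a NumPy sketch with made-up logits for one sample):</p>

```python
import numpy as np

# One sample with 3 class scores (made-up values)
logits = np.array([[2.0, 1.0, 0.1]])
# softmax: exponentiate and normalize so each row sums to 1
# (subtracting the row max first for numerical stability)
e = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)
# cross-entropy: negative log-probability of the true class (label 0 here)
label = np.array([0])
loss = -np.log(probs[np.arange(len(label)), label]).mean()
print(probs.round(3), loss)
```

<p>The loss is small when the probability assigned to the true class is close to 1, and grows without bound as it approaches 0.</p>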
<p>Let us consider a toy dataset, in which the label of a sample depends on its position in 2D, with 3 labels corresponding to 3 zones:</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL18.png" /></p>
<p>The dataset creation or preprocessing is usually performed with Numpy:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">dataset_size</span> <span class="o">=</span> <span class="mi">200000</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="n">dataset_size</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">labels</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">dataset_size</span><span class="p">)</span>
<span class="n">labels</span><span class="p">[</span><span class="n">x</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">></span> <span class="n">x</span><span class="p">[:,</span><span class="mi">1</span><span class="p">]]</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">labels</span><span class="p">[</span><span class="n">x</span><span class="p">[:,</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">x</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">></span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">x</span><span class="p">[:,</span><span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">labels</span> <span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL42.png" /></p>
<p>Let’s convert the Numpy arrays to Torch Tensors:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">from_numpy</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="nb">type</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">FloatTensor</span><span class="p">)</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">from_numpy</span><span class="p">(</span><span class="n">labels</span><span class="p">)</span><span class="o">.</span><span class="nb">type</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">LongTensor</span><span class="p">)</span>
</code></pre></div></div>
<p>The input is defined by a position, i.e. a vector of dimension 2 for each sample, leading to a Tensor of size <code class="highlighter-rouge">(B, 2)</code>, where B is the batch size.</p>
<p>Let’s choose a hidden dimension of 12 (the number of outputs of the first layer, i.e. the number of inputs of the second layer):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">theta1</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">12</span><span class="p">)</span> <span class="o">*</span><span class="mf">0.01</span><span class="p">,</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">bias1</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">12</span><span class="p">)</span><span class="o">*</span><span class="mf">0.01</span><span class="p">,</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">theta2</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span><span class="o">*</span><span class="mf">0.01</span><span class="p">,</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">bias2</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span><span class="o">*</span><span class="mf">0.01</span><span class="p">,</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">mm</span><span class="p">(</span><span class="n">theta1</span><span class="p">)</span> <span class="o">+</span> <span class="n">bias1</span> <span class="c"># (B, 2) x (2, 12) + (B, 12) => (B, 12)</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">Variable</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">([</span><span class="mi">0</span><span class="p">])))</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">y</span><span class="o">.</span><span class="n">mm</span><span class="p">(</span><span class="n">theta2</span><span class="p">)</span> <span class="o">+</span> <span class="n">bias2</span> <span class="c"># (B, 12) x (12, 3) + (B, 3) => (B, 3)</span>
    <span class="k">return</span> <span class="n">y</span>
<span class="k">def</span> <span class="nf">softmax</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="n">e</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">e</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">keepdim</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">e</span> <span class="o">/</span> <span class="n">s</span>
<span class="k">def</span> <span class="nf">crossentropy</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">l</span><span class="p">):</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">torch</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="n">l</span><span class="p">,</span><span class="o">-</span><span class="mi">1</span><span class="p">))</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>
    <span class="k">return</span> <span class="o">-</span><span class="n">torch</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">v</span><span class="p">)</span>
</code></pre></div></div>
<p>For more efficiency, let’s train on 20 samples at each step, hence a batch size of 20:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">batch_size</span> <span class="o">=</span> <span class="mi">20</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">dataset_size</span><span class="p">,</span> <span class="mi">100000</span><span class="p">)</span> <span class="o">//</span> <span class="n">batch_size</span> <span class="p">):</span>
    <span class="n">lr</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="p">(</span><span class="o">.</span><span class="mi">1</span> <span class="o">**</span> <span class="p">(</span> <span class="nb">max</span><span class="p">(</span><span class="n">i</span> <span class="o">-</span> <span class="mi">100</span> <span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">//</span> <span class="mi">1000</span><span class="p">))</span>
    <span class="n">batch</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">batch_size</span><span class="o">*</span><span class="n">i</span><span class="p">:</span><span class="n">batch_size</span><span class="o">*</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)]</span> <span class="c"># size (batchsize, 2)</span>
    <span class="n">z</span> <span class="o">=</span> <span class="n">forward</span><span class="p">(</span><span class="n">Variable</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">False</span><span class="p">))</span>
    <span class="n">z</span> <span class="o">=</span> <span class="n">softmax</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">crossentropy</span><span class="p">(</span><span class="n">z</span><span class="p">,</span> <span class="n">Variable</span><span class="p">(</span><span class="n">Y</span><span class="p">[</span><span class="n">batch_size</span><span class="o">*</span><span class="n">i</span><span class="p">:</span><span class="n">batch_size</span><span class="o">*</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)]))</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"iter {} - cost {} - learning rate {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">loss</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">item</span><span class="p">(),</span> <span class="n">lr</span><span class="p">))</span>
    <span class="c"># compute the gradients</span>
    <span class="n">loss</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="c"># apply the gradients</span>
    <span class="n">theta1</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">sub_</span><span class="p">(</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">theta1</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">data</span> <span class="p">)</span>
    <span class="n">bias1</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">sub_</span><span class="p">(</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">bias1</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
    <span class="n">theta2</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">sub_</span><span class="p">(</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">theta2</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">data</span> <span class="p">)</span>
    <span class="n">bias2</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">sub_</span><span class="p">(</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">bias2</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">data</span> <span class="p">)</span>
    <span class="c"># clear the grad</span>
    <span class="n">theta1</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">zero_</span><span class="p">()</span>
    <span class="n">bias1</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">zero_</span><span class="p">()</span>
    <span class="n">theta2</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">zero_</span><span class="p">()</span>
    <span class="n">bias2</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">zero_</span><span class="p">()</span>
<span class="c"># iter 4999 - cost 0.07717917114496231 - learning rate 5.000000000000001e-05</span>
</code></pre></div></div>
<p>The network converges.</p>
<p>When <code class="highlighter-rouge">loss.backward()</code> is called, the derivatives are propagated through all Variables in the graph, and their <code class="highlighter-rouge">.grad</code> attribute is accumulated with the gradient (except for those with <code class="highlighter-rouge">requires_grad</code> set to False):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">loss</span><span class="o">.</span><span class="n">grad_fn</span><span class="p">)</span> <span class="c"># NegBackward</span>
<span class="k">print</span><span class="p">(</span><span class="n">loss</span><span class="o">.</span><span class="n">grad_fn</span><span class="o">.</span><span class="n">next_functions</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span> <span class="c"># MeanBackward1</span>
<span class="k">print</span><span class="p">(</span><span class="n">loss</span><span class="o">.</span><span class="n">grad_fn</span><span class="o">.</span><span class="n">next_functions</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">next_functions</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span> <span class="c"># LogBackward</span>
<span class="k">print</span><span class="p">(</span><span class="n">loss</span><span class="o">.</span><span class="n">grad_fn</span><span class="o">.</span><span class="n">next_functions</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">next_functions</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">next_functions</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span> <span class="c"># GatherBackward</span>
</code></pre></div></div>
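<p>The accumulation behavior explains the <code class="highlighter-rouge">grad.zero_()</code> calls in the loop: without them, gradients from successive <code class="highlighter-rouge">backward()</code> calls add up instead of being overwritten. A small sketch with current PyTorch, where no Variable wrapper is needed:</p>

```python
import torch

w = torch.ones(1, requires_grad=True)
(2 * w).sum().backward()
print(w.grad)            # tensor([2.])
(2 * w).sum().backward() # second call adds to the existing gradient
print(w.grad)            # tensor([4.])
w.grad.zero_()           # reset before the next iteration
print(w.grad)            # tensor([0.])
```

<p>This is why every iteration of the training loop above ends by zeroing the four gradient tensors.</p>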
<p>To check that everything is fine, one can compute the accuracy, a classical metric for classification problems:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">accuracy</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">nb</span> <span class="o">=</span> <span class="mi">10000</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">dataset_size</span><span class="p">,</span> <span class="n">nb</span><span class="p">)):</span>
    <span class="n">z</span> <span class="o">=</span> <span class="n">forward</span><span class="p">(</span><span class="n">Variable</span><span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">],</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">False</span><span class="p">))</span>
    <span class="n">p</span> <span class="o">=</span> <span class="n">softmax</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
    <span class="n">l</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>
    <span class="k">if</span> <span class="n">l</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">numpy</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="n">labels</span><span class="p">[</span><span class="n">i</span><span class="p">]:</span>
        <span class="n">accuracy</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">print</span><span class="p">(</span><span class="s">"accuracy {}</span><span class="si">%</span><span class="s">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">round</span><span class="p">(</span><span class="n">accuracy</span> <span class="o">/</span> <span class="nb">min</span><span class="p">(</span><span class="n">dataset_size</span><span class="p">,</span> <span class="n">nb</span><span class="p">)</span> <span class="o">*</span> <span class="mi">100</span><span class="p">,</span><span class="mi">2</span><span class="p">)))</span>
<span class="c"># accuracy 99.46%</span>
</code></pre></div></div>
<p><strong>Exercise</strong>: compute precision, recall, and AUC.</p>
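<p>For the precision and recall part of the exercise, here is a NumPy sketch treating one class as the positive class (the labels and predictions below are made up; AUC additionally requires the predicted scores, not only the predicted labels):</p>

```python
import numpy as np

# Hypothetical ground-truth labels and predicted labels, class 1 as "positive"
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives

precision = tp / (tp + fp)  # of the predicted positives, how many are right
recall = tp / (tp + fn)     # of the actual positives, how many are found
print(precision, recall)    # 1.0 0.666...
```

<p>For the 3-class problem above, the same computation is done once per class (one-vs-rest) and the results can then be averaged.</p>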
<p>Note that convergence is strongly influenced</p>
<ul>
<li>
<p>by the art of choosing the learning rate
<img src="//christopher5106.github.io/img/deeplearningcourse/DL20.png" /></p>
</li>
<li>
<p>by the art of choosing the right layer initialization: a small variance, with positive and negative values to dissociate the neuron outputs (neurons that fire together wire together), helps. In fact, we’ll see in the next section that Pytorch provides packages with a correct implementation of this variance choice, given the number of input and output connections:
<img src="//christopher5106.github.io/img/deeplearningcourse/DL32.png" /></p>
</li>
</ul>
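<p>One standard instance of this variance choice is the Xavier/Glorot scheme, which scales the weights by the number of input and output connections; a sketch of what the packaged initializers compute (in PyTorch, <code class="highlighter-rouge">torch.nn.init.xavier_uniform_</code> does this for you):</p>

```python
import torch

fan_in, fan_out = 2, 12
# uniform bound chosen so that Var(theta) = 2 / (fan_in + fan_out)
bound = (6.0 / (fan_in + fan_out)) ** 0.5
theta = torch.empty(fan_in, fan_out).uniform_(-bound, bound)
# values are small, centered on zero, with both signs represented
print(theta.min().item(), theta.max().item())
```

<p>Compared with the ad hoc <code class="highlighter-rouge">torch.randn(...) * 0.01</code> used above, this keeps the variance of activations roughly constant from layer to layer.</p>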
<p>To improve the results, it is possible to train the network multiple times from scratch, and average the predictions coming from the ensemble of trained networks.</p>
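<p>A sketch of such prediction averaging, with made-up softmax outputs from two hypothetical trained networks:</p>

```python
import numpy as np

# softmax outputs of two independently trained networks for one sample (made up)
preds = np.array([[[0.7, 0.2, 0.1]],
                  [[0.4, 0.5, 0.1]]])
ensemble = preds.mean(axis=0)        # average the class probabilities
print(ensemble, ensemble.argmax(axis=1))
```

<p>The averaged distribution still sums to 1 per sample, and its argmax gives the ensemble prediction.</p>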
<p>Note also that, while there is a backward function for every operation, there is no forward function: evaluation is performed when the operator is applied to the variable, as in a classical program. It is up to you to write your own forward pass as ordinary code. The backward function works as a kind of history of the operations, used to backpropagate the gradient. So, it is very different from the concept of a “graph of operators”.</p>
<p><strong>Exercise</strong>: program a training loop with Keras, Tensorflow, CNTK, and MXNet.</p>
<p><strong>Solution</strong>: <a href="//christopher5106.github.io/img/deeplearningcourse/cntk_training.txt">cntk training</a>, <a href="//christopher5106.github.io/img/deeplearningcourse/mxnet_training.txt">mxnet training</a>, <a href="//christopher5106.github.io/img/deeplearningcourse/keras_training.txt">keras training</a>, <a href="//christopher5106.github.io/img/deeplearningcourse/tensorflow_training.txt">tensorflow training</a></p>
<p>Pytorch and MXNet work about the same. In MXNet, call <code class="highlighter-rouge">attach_grad()</code> on the <code class="highlighter-rouge">NDArray</code> with respect to which you’d like to compute the gradient of the cost, start recording the history of operations with <code class="highlighter-rouge">with mx.autograd.record()</code>, and then call <code class="highlighter-rouge">backward()</code> directly. There is no wrapping in a Variable object as in Pytorch.</p>
<p>In CNTK, Tensorflow and Keras, you build a graph, so you do not get the result of an operation instantly: for example, an addition does not give a result but an object that will be evaluated in a session, on a device, by feeding data into the inputs.</p>
<ul>
<li>in CNTK, the <code class="highlighter-rouge">Parameter</code> and <code class="highlighter-rouge">input_variable</code> are subclasses of the Variable class, so you do not need to wrap them into a Variable object as in Pytorch; but since you build a graph, you have to call the <code class="highlighter-rouge">eval</code> or <code class="highlighter-rouge">grad</code> methods on any element of the graph, with input values, to evaluate it.</li>
<li>Tensorflow is the most complex, but leaves lots of freedom. You need to instantiate a session on the device yourself and call the initialization of your variables in the session. The gradient, as well as the assignment of values, are available as operators, so that everything can be designed in the graph, but the benefits are extremely small.</li>
<li>Keras is an abstraction over Tensorflow and CNTK, so the points discussed above reappear in its implementation.</li>
</ul>
<p>Tensorflow has an <code class="highlighter-rouge">eager</code> mode option, which enables you to get the result of an operator instantly, as in Pytorch and MXNet.</p>
<h1 id="modules">Modules</h1>
<p>A module is an object that encapsulates learnable parameters and is specifically suited to design deep learning neural networks.</p>
<p>A layer is the smallest module: it has weights and a forward function:</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL31.png" /></p>
<p>The composition of multiple modules builds a new module:</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL30.png" /></p>
<p>that can be reused at multiple places in the network architecture.</p>
<p>The organization into modules helps interoperability and reuse of modules into a deep neural network definition.</p>
<p>Then, calling the forward or backward propagations, transferring the module to GPU, and saving or loading weights are applied to all submodules without extra code.</p>
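<p>For example, saving and restoring the weights of every submodule takes a single call each; a sketch with an arbitrary <code class="highlighter-rouge">nn.Sequential</code> model (serialized to an in-memory buffer here, though a file path works the same way):</p>

```python
import io
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 12), nn.ReLU(), nn.Linear(12, 3))

buf = io.BytesIO()
torch.save(net.state_dict(), buf)      # one call saves every submodule's weights
buf.seek(0)

net2 = nn.Sequential(nn.Linear(2, 12), nn.ReLU(), nn.Linear(12, 3))
net2.load_state_dict(torch.load(buf))  # ...and one call restores them all
```

<p>The same applies to <code class="highlighter-rouge">net.cuda()</code> or <code class="highlighter-rouge">net.to(device)</code>: the transfer recurses through all submodules automatically.</p>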
<p>Let’s rewrite the previous model as a module, an interface provided by the <code class="highlighter-rouge">torch.nn</code> package, and compose it with prebuilt submodules from the <code class="highlighter-rouge">torch.nn.functional</code> package:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="k">as</span> <span class="n">F</span>
<span class="k">class</span> <span class="nc">SimpleNetTest</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">(</span><span class="n">SimpleNetTest</span><span class="p">,</span> <span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">lin1</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">12</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">lin2</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">input</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">lin1</span><span class="p">(</span><span class="nb">input</span><span class="p">))</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">lin2</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">x</span>
<span class="n">net</span> <span class="o">=</span> <span class="n">SimpleNetTest</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">net</span><span class="p">)</span>
<span class="c"># SimpleNetTest(</span>
<span class="c"># (lin1): Linear(in_features=2, out_features=12, bias=True)</span>
<span class="c"># (lin2): Linear(in_features=12, out_features=3, bias=True)</span>
<span class="c"># )</span>
</code></pre></div></div>
<p>The learnable parameters are returned by <code class="highlighter-rouge">net.parameters()</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">params</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">net</span><span class="o">.</span><span class="n">parameters</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="n">params</span><span class="p">)</span>
<span class="c"># [Parameter containing:</span>
<span class="c"># tensor([[ 0.5115, -0.1418],</span>
<span class="c"># [-0.5533, -0.1273],</span>
<span class="c"># [-0.2584, -0.0393],</span>
<span class="c"># [-0.6614, -0.3380],</span>
<span class="c"># [-0.1831, 0.4581],</span>
<span class="c"># [ 0.3085, -0.6811],</span>
<span class="c"># [-0.4236, -0.6968],</span>
<span class="c"># [ 0.2943, -0.2573],</span>
<span class="c"># [ 0.4532, -0.3313],</span>
<span class="c"># [ 0.0415, -0.6035],</span>
<span class="c"># [ 0.0736, -0.0780],</span>
<span class="c"># [ 0.3948, 0.4727]], requires_grad=True), Parameter containing:</span>
<span class="c"># tensor([ 0.3995, 0.2957, 0.4611, -0.6316, -0.4317, 0.3888, -0.2252, 0.2357,</span>
<span class="c"># 0.0351, -0.0223, -0.2179, -0.0943], requires_grad=True), Parameter containing:</span>
<span class="c"># tensor([[-0.1178, 0.0759, 0.2238, -0.1543, 0.2471, 0.2617, 0.0897, -0.1238,</span>
<span class="c"># -0.2371, 0.2220, -0.2427, -0.0141],</span>
<span class="c"># [ 0.2623, 0.2131, 0.0291, -0.1194, -0.1685, -0.1901, -0.0905, 0.1825,</span>
<span class="c"># -0.0384, 0.2694, 0.0682, -0.0157],</span>
<span class="c"># [ 0.2674, 0.0229, -0.0429, 0.1274, 0.1928, 0.1575, 0.2514, -0.1529,</span>
<span class="c"># -0.0460, 0.0187, -0.1481, -0.1473]], requires_grad=True), Parameter containing:</span>
<span class="c"># tensor([-0.2622, 0.0747, -0.2832], requires_grad=True)]</span>
</code></pre></div></div>
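<p>These four tensors hold all the learnable scalars of the network; counting them is a one-liner over <code class="highlighter-rouge">net.parameters()</code> (sketched here with a module of the same layer shapes as above):</p>

```python
import torch.nn as nn

# Hypothetical module with the same shapes as SimpleNetTest above
net = nn.Sequential(nn.Linear(2, 12), nn.Linear(12, 3))
# total number of learnable scalars across all submodules
total = sum(p.numel() for p in net.parameters())
print(total)  # 2*12 + 12 + 12*3 + 3 = 75
```
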
<p>In place of our previous <code class="highlighter-rouge">forward(batch)</code> function, we simply apply the batch to the module with <code class="highlighter-rouge">net(batch)</code> and loop over the parameters to update them:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">batch_size</span> <span class="o">=</span> <span class="mi">20</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">dataset_size</span><span class="p">,</span> <span class="mi">100000</span><span class="p">)</span> <span class="o">//</span> <span class="n">batch_size</span> <span class="p">):</span>
    <span class="n">lr</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="p">(</span><span class="o">.</span><span class="mi">1</span> <span class="o">**</span> <span class="p">(</span> <span class="nb">max</span><span class="p">(</span><span class="n">i</span> <span class="o">-</span> <span class="mi">100</span> <span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">//</span> <span class="mi">1000</span><span class="p">))</span>
    <span class="n">batch</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">batch_size</span><span class="o">*</span><span class="n">i</span><span class="p">:</span><span class="n">batch_size</span><span class="o">*</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)]</span> <span class="c"># size (batchsize, 2)</span>
    <span class="n">z</span> <span class="o">=</span> <span class="n">net</span><span class="p">(</span><span class="n">Variable</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">False</span><span class="p">))</span>
    <span class="n">z</span> <span class="o">=</span> <span class="n">softmax</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">crossentropy</span><span class="p">(</span><span class="n">z</span><span class="p">,</span> <span class="n">Variable</span><span class="p">(</span><span class="n">Y</span><span class="p">[</span><span class="n">batch_size</span><span class="o">*</span><span class="n">i</span><span class="p">:</span><span class="n">batch_size</span><span class="o">*</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)]))</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"iter {} - cost {} - learning rate {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">loss</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">item</span><span class="p">(),</span> <span class="n">lr</span><span class="p">))</span>
    <span class="c"># compute the gradients</span>
    <span class="n">loss</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="c"># apply the gradients</span>
    <span class="k">for</span> <span class="n">param</span> <span class="ow">in</span> <span class="n">net</span><span class="o">.</span><span class="n">parameters</span><span class="p">():</span>
        <span class="n">param</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">sub_</span><span class="p">(</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">param</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">data</span> <span class="p">)</span>
        <span class="n">param</span><span class="o">.</span><span class="n">grad</span><span class="o">.</span><span class="n">zero_</span><span class="p">()</span>
</code></pre></div></div>
<p>For the same training on the GPU, let’s move our datasets as well as the module to the GPU:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">Y</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span>
<span class="n">net</span><span class="o">.</span><span class="n">cuda</span><span class="p">()</span>
<span class="n">params</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">net</span><span class="o">.</span><span class="n">parameters</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="n">params</span><span class="p">)</span>
<span class="c"># [Parameter containing:</span>
<span class="c"># tensor([[ 2.9374, -2.9921],</span>
<span class="c"># [-1.8221, -0.3139],</span>
<span class="c"># [-2.7522, -0.8614],</span>
<span class="c"># [-0.6614, -0.3380],</span>
<span class="c"># [-0.1840, 0.4384],</span>
<span class="c"># [-1.9955, -3.9197],</span>
<span class="c"># [-0.4236, -0.6968],</span>
<span class="c"># [ 2.0039, 0.7968],</span>
<span class="c"># [ 2.5727, -3.4312],</span>
<span class="c"># [ 0.0331, -0.6036],</span>
<span class="c"># [ 0.0736, -0.0780],</span>
<span class="c"># [ 3.9596, 5.2725]], device='cuda:0', requires_grad=True), Parameter containing:</span>
<span class="c"># tensor([ 0.2908, 1.7123, 2.7156, -0.6316, -0.4515, 3.9756, -0.2252, -0.6254,</span>
<span class="c"># 0.6207, -0.0313, -0.2179, -3.2287],</span>
<span class="c"># device='cuda:0', requires_grad=True), Parameter containing:</span>
<span class="c"># tensor([[-2.8967, 1.5755, 2.2599, -0.1543, 0.2469, 0.7531, 0.0897, -1.1861,</span>
<span class="c"># -3.3350, 0.2220, -0.2427, -2.1395],</span>
<span class="c"># [ 2.1709, -0.9695, -1.7979, -0.1194, -0.1682, -3.0841, -0.0905, 1.1191,</span>
<span class="c"># 1.7238, 0.2693, 0.0682, 4.0227],</span>
<span class="c"># [ 1.1377, -0.2940, -0.2520, 0.1274, 0.1927, 2.5601, 0.2514, -0.0273,</span>
<span class="c"># 1.2897, 0.0188, -0.1481, -2.0602]],</span>
<span class="c"># device='cuda:0', requires_grad=True), Parameter containing:</span>
<span class="c"># tensor([ 0.7422, -0.9477, -0.2653], device='cuda:0', requires_grad=True)]</span>
</code></pre></div></div>
<p>All the parameters appear on the first GPU (cuda:0). Note that we transferred the full dataset to the GPU; in most applications this is not possible since GPU memory is limited, so we only transfer the current batch at each iteration.</p>
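<p>A common per-batch transfer pattern can be sketched as follows (a minimal illustration with a made-up dataset and a tiny linear model, not the course’s <code class="highlighter-rouge">net</code>; it falls back to CPU when no GPU is available):</p>

```python
import torch

# pick the GPU if one is available, otherwise the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

X = torch.randn(1000, 2)           # dataset stays on the CPU
Y = torch.randint(0, 3, (1000,))
net = torch.nn.Linear(2, 3).to(device)  # only the model lives on the device

batch_size = 20
for i in range(len(X) // batch_size):
    # transfer only the current batch to the device at each iteration
    batch = X[batch_size * i : batch_size * (i + 1)].to(device)
    labels = Y[batch_size * i : batch_size * (i + 1)].to(device)
    z = net(batch)                 # forward pass runs on the device
    loss = torch.nn.functional.cross_entropy(z, labels)
```

<p>This keeps GPU memory usage proportional to the batch size instead of the dataset size.</p>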
<p>When the GPU has been used for training, it is good practice to use it for inference on the test data as well, so we rewrite the inference loop to process batches of samples (here of size 1) rather than individual samples:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">accuracy</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">nb</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">dataset_size</span><span class="p">,</span> <span class="n">nb</span><span class="p">)):</span>
    <span class="n">z</span> <span class="o">=</span> <span class="n">net</span><span class="p">(</span><span class="n">Variable</span><span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">],</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">False</span><span class="p">))</span>
    <span class="n">l</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="n">z</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>
    <span class="k">if</span> <span class="n">l</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span><span class="o">.</span><span class="n">numpy</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span> <span class="o">==</span> <span class="n">Y</span><span class="p">[</span><span class="n">i</span><span class="p">]:</span>
        <span class="n">accuracy</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">print</span><span class="p">(</span><span class="s">"accuracy {}</span><span class="si">%</span><span class="s">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">round</span><span class="p">(</span><span class="n">accuracy</span> <span class="o">/</span> <span class="nb">min</span><span class="p">(</span><span class="n">dataset_size</span><span class="p">,</span> <span class="n">nb</span><span class="p">)</span> <span class="o">*</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">2</span><span class="p">)))</span>
<span class="c"># accuracy 99.2%</span>
</code></pre></div></div>
<p><strong>Exercise</strong>: program the training loop with packages in Keras, TensorFlow, CNTK and MXNet</p>
<p><strong>Solution</strong>: <a href="https://github.com/christopher5106/exercices/blob/master/cntk-ex_with_packages.py">cntk training</a>, <a href="https://github.com/christopher5106/exercices/blob/master/cntk-ex_with_packages-step2.py">cntk further packaging</a>, <a href="https://github.com/christopher5106/exercices/blob/master/keras-ex_with_packages.py">keras training</a>, <a href="https://github.com/christopher5106/exercices/blob/master/keras-ex_with_packages-step2.py">keras further packaging</a> …</p>
<h1 id="packages">Packages</h1>
<p>Packages help you reuse common functions for deep learning. We already introduced the <strong>torch.nn</strong> package containing the module interface as well as prebuilt modules.</p>
<p>Let us rewrite the training loop using the <code class="highlighter-rouge">torch.optim</code> package (zeroing gradients + applying the gradients with an update rule), plot the training curves (loss,…) and try different update rules/optimizers.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">criterion</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">CrossEntropyLoss</span><span class="p">()</span>
<span class="kn">import</span> <span class="nn">torch.optim</span> <span class="k">as</span> <span class="n">optim</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">net</span><span class="o">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">momentum</span><span class="o">=</span><span class="mf">0.9</span><span class="p">)</span>
</code></pre></div></div>
<p>The <code class="highlighter-rouge">optimizer</code> provides a method <code class="highlighter-rouge">zero_grad()</code> to clear the previous gradient values and a <code class="highlighter-rouge">step()</code> method to apply the update rule to the parameters:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loss_curve</span> <span class="o">=</span> <span class="nb">list</span><span class="p">()</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">500</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">dataset_size</span><span class="p">,</span> <span class="mi">1000000</span><span class="p">)</span> <span class="o">//</span> <span class="n">batch_size</span> <span class="p">):</span>
    <span class="n">batch</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">batch_size</span><span class="o">*</span><span class="n">i</span><span class="p">:</span><span class="n">batch_size</span><span class="o">*</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)]</span>
    <span class="n">batch</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">batchLabel</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">Y</span><span class="p">[</span><span class="n">batch_size</span><span class="o">*</span><span class="n">i</span><span class="p">:</span><span class="n">batch_size</span><span class="o">*</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)])</span>
    <span class="c"># zero the parameter gradients</span>
    <span class="n">optimizer</span><span class="o">.</span><span class="n">zero_grad</span><span class="p">()</span>
    <span class="c"># forward</span>
    <span class="n">outputs</span> <span class="o">=</span> <span class="n">net</span><span class="p">(</span><span class="n">batch</span><span class="p">)</span>
    <span class="n">loss</span> <span class="o">=</span> <span class="n">criterion</span><span class="p">(</span><span class="n">outputs</span><span class="p">,</span> <span class="n">batchLabel</span><span class="p">)</span>
    <span class="c"># backward</span>
    <span class="n">loss</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="c"># update network parameters</span>
    <span class="n">optimizer</span><span class="o">.</span><span class="n">step</span><span class="p">()</span>
    <span class="c"># print("iter {} - cost {}".format(i, loss.data.item()))</span>
    <span class="n">loss_curve</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">loss</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">item</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="s">"final cost {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">round</span><span class="p">(</span><span class="n">loss</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">item</span><span class="p">(),</span> <span class="mi">2</span><span class="p">)))</span>
<span class="c"># final cost 0.04</span>
</code></pre></div></div>
<p>Let’s plot the training loss:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">loss_curve</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">),</span> <span class="n">loss_curve</span><span class="p">,</span> <span class="s">'ro'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL43.png" /></p>
<p>To compute the accuracy, we can also forward the full dataset and use efficient matrix operations on the final tensors, removing the for loop:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">accuracy</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">net</span><span class="p">(</span><span class="n">Variable</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">False</span><span class="p">))</span>
<span class="n">l</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="n">z</span><span class="p">,</span> <span class="mi">1</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">ll</span> <span class="o">=</span> <span class="n">Variable</span><span class="p">(</span><span class="n">Y</span><span class="p">)</span>
<span class="n">accuracy</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">eq</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">ll</span><span class="p">)</span><span class="o">.</span><span class="nb">type</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">cuda</span><span class="o">.</span><span class="n">LongTensor</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"accuracy {}</span><span class="si">%</span><span class="s">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="nb">round</span><span class="p">(</span><span class="n">accuracy</span> <span class="o">/</span> <span class="n">dataset_size</span> <span class="o">*</span> <span class="mi">100</span><span class="p">,</span><span class="mi">2</span><span class="p">)))</span>
<span class="c"># accuracy 96.73%</span>
</code></pre></div></div>
<p><strong>Exercise</strong>: try various optimizers and learning rates</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">optim</span><span class="o">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">net</span><span class="o">.</span><span class="n">parameters</span><span class="p">())</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">optim</span><span class="o">.</span><span class="n">Adadelta</span><span class="p">(</span><span class="n">net</span><span class="o">.</span><span class="n">parameters</span><span class="p">())</span>
</code></pre></div></div>
<p>and confirm that Adadelta achieves the best accuracy: 99.62%.</p>
<p><strong>Exercise</strong>: replace your functions with package functions in MXNet, Keras, Tensorflow, CNTK</p>
<p><strong>Well done!</strong></p>
<p>Now let’s go to next course: <a href="//christopher5106.github.io/deep/learning/2018/10/20/course-two-build-deep-learning-networks.html">Course 2: building deep learning networks!</a></p>
Sat, 20 Oct 2018 06:00:00 +0000
//christopher5106.github.io/deep/learning/2018/10/20/course-one-programming-deep-learning.html
Course 0: Why targets 0 and 1 in machine learning?<p>Here is my deep learning course, in only 5 days!</p>
<p>First, we’ll begin with the basic concepts in this article, because the main deep learning concepts come from data science. This is also a great article for an experienced data scientist.</p>
<p>I hope you’ll get some intuitions about deep learning that you cannot get from reading elsewhere.</p>
<p>First, let’s recap the basics of machine learning.</p>
<h1 id="loss-functions">Loss functions</h1>
<p>When we fit a model, we use <em>loss functions</em>, also called <em>cost functions</em> or <em>objective functions</em>. The main purpose of machine learning is to predict from data: say, predict the <script type="math/tex">y</script> given some observations <script type="math/tex">x</script>, through a function:</p>
<script type="math/tex; mode=display">f : x \rightarrow y</script>
<p>Usually, f is called a model and is parametrized, let’s say by a list of parameters <script type="math/tex">\theta</script></p>
<script type="math/tex; mode=display">f = f_\theta</script>
<p>and the goal of machine learning is to find the best parameters.</p>
<p>In supervised learning we know the real value we want to predict <script type="math/tex">\tilde{y}</script>, for example the class of the object we want to predict, what we call the <strong>ground truth</strong> or the <strong>target</strong>.</p>
<p>We want to measure the difference between what the model <script type="math/tex">f</script> predicts and what it should predict. For that, there are several cost functions, depending on the problem you want to address, measuring how far our predictions are from their targets.</p>
<script type="math/tex; mode=display">L(y, \tilde{y})</script>
<p>In order to minimize the cost, we prefer differentiable loss functions, for which simpler optimization algorithms exist.</p>
<p>The two most important loss functions are:</p>
<ul>
<li>
<p><strong>Mean Squared Error (MSE)</strong>, usually used for regression: <script type="math/tex">\sum_i (y_i - \tilde{y_i})^2</script></p>
</li>
<li>
<p><strong>Cross Entropy for probabilities</strong>, in particular for classification where the model predicts the probability of the observed object x for each class or “label”:</p>
</li>
</ul>
<center> $$ x \rightarrow \{p_c\}_c $$ with $$ \sum_c p_c = 1 $$ </center>
<p>Coming from information theory, cross-entropy measures the distance between two probability distributions by:</p>
<script type="math/tex; mode=display">\text{CrossEntropy}(p, \tilde{p}) = - \sum_c \tilde{p}_c \log(p_c)</script>
<p>which can be read as the expected negative log-likelihood of the predicted probabilities under the true label distribution; we want this cost to be as low as possible (minimization).</p>
<p>Note that <strong>a loss function always outputs a scalar value</strong>. This scalar value is a measure of fit of the model to the real values. Since a loss function outputs a scalar, its gradient with respect to an input has the same shape as the input, for example for an input of rank 4:</p>
<script type="math/tex; mode=display">\nabla f = \Big[ \frac{\partial f}{\partial I_{a,b,c,d}} \Big]_{a,b,c,d}</script>
<p>In <strong>conclusion</strong> of this section, the goal of machine learning is to have a function that fits the real world; and to make it fit well, we use a loss function to measure the distance that remains to be reduced.</p>
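<p>As a quick sanity check, both losses can be computed by hand on toy values (illustrative numbers only, not from the article):</p>

```python
import math

def mse(y, y_true):
    # Mean Squared Error: sum of squared differences, as in the formula above
    return sum((a - b) ** 2 for a, b in zip(y, y_true))

def cross_entropy(p, p_true):
    # cross entropy between a predicted distribution p and a target distribution p_true
    return -sum(t * math.log(q) for q, t in zip(p, p_true))

# toy regression example
print(mse([1.0, 2.0], [1.5, 2.0]))                   # 0.25

# toy classification example: 3 classes, true class is the first one
p = [0.7, 0.2, 0.1]
print(round(cross_entropy(p, [1.0, 0.0, 0.0]), 4))   # -log(0.7) ≈ 0.3567
```
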
<p><strong>Exercise</strong>: discover many more loss functions with my <a href="//christopher5106.github.io/deep/learning/2016/09/16/about-loss-functions-multinomial-logistic-logarithm-cross-entropy-square-errors-euclidian-absolute-frobenius-hinge.html">full article about loss functions</a>.</p>
<h1 id="cross-entropy-in-practice">Cross entropy in practice</h1>
<p>Two problems arise: first, most mathematical functions do not output a probability vector whose values sum to 1. Second, how do we estimate the true distribution <script type="math/tex">\tilde{p}</script>?</p>
<h4 id="normalizing-model-outputs">Normalizing model outputs</h4>
<p>Transforming a model output into a probability vector that sums to 1 is usually performed with a softmax function on top of the model outputs:</p>
<script type="math/tex; mode=display">x \xrightarrow{f} o = f(x) = \{o_c\}_c \xrightarrow{softmax} \{p_c\}_c</script>
<p>The softmax normalization function is defined by:</p>
<script type="math/tex; mode=display">\text{Softmax}(o) = \Big\{ \frac{ e^{o_i} }{ \sum_c e^{o_c}} \Big\}_i</script>
<p>Note that for the softmax to predict probability for C classes, it requires the output <script type="math/tex">o = f(x)</script> to be C-dimensional. <script type="math/tex">\{o_c\}_c</script> are called the <strong>logits</strong>.</p>
<p>Softmax is the equivalent of the <code class="highlighter-rouge">sigmoid()</code> in binary classification:</p>
<script type="math/tex; mode=display">x \rightarrow \frac{1}{1+e^{-x}}</script>
<p>in the multi-class case. If you do not remember which one of the softmax and the sigmoid has a negative sign in the exponent, i.e. <script type="math/tex">e^x</script> or <script type="math/tex">e^{-x}</script>, remember that Softmax and Sigmoid are both <strong>monotonically increasing</strong> functions.</p>
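<p>Both functions are short enough to sketch in a few lines of plain Python (a toy illustration; subtracting the maximum logit for numerical stability is an addition, not part of the formulas above):</p>

```python
import math

def softmax(o):
    # Softmax(o)_i = exp(o_i) / sum_c exp(o_c); subtracting max(o) avoids overflow
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logits = [2.0, 1.0, 0.1]
p = softmax(logits)
print([round(v, 3) for v in p])     # three probabilities
print(abs(sum(p) - 1.0) < 1e-12)    # True: they sum to 1

# in the binary case, softmax over the logits [x, 0] recovers the sigmoid
x = 1.3
print(abs(softmax([x, 0.0])[0] - sigmoid(x)) < 1e-12)  # True
```
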
<h4 id="estimating-the-true-distribution">Estimating the true distribution</h4>
<p>Cross entropy is usually mentioned without explanations.</p>
<p>In fact, to understand cross-entropy, you need to rewrite its theoretical definition (1):</p>
<script type="math/tex; mode=display">\text{CrossEntropy} = - \sum_c \tilde{p_c} \log p_c = - \mathbb{E} \Big( \log p_c \Big)</script>
<p>because <script type="math/tex">\tilde{p_c}</script> is the true label distribution: cross entropy is the expectation of the negative log-probability predicted by the model, taken under the true distribution.</p>
<p>Then, we use the formula for the empirical estimation of the expectation:</p>
<script type="math/tex; mode=display">\text{CrossEntropy} \approx - \frac{1}{N} \sum_{x \sim D} \log p_{\hat{c}(x)}(x) = \text{EmpiricalCrossEntropy}(p)</script>
<p>where D is the real sample distribution, N is the number of samples on which the cross entropy is estimated (<script type="math/tex">N \gg 1</script>) and <script type="math/tex">\hat{c}(x)</script> is the true class of x.</p>
<p>When we compute the cross-entropy, we set an empirical cross-entropy for one sample (N=1) to</p>
<script type="math/tex; mode=display">\text{CrossEntropy} \approx - \log p_\hat{c}(x)</script>
<p>so that when we average the individual losses over more samples for stability, we recover the desired empirical estimate:</p>
<script type="math/tex; mode=display">\frac{1}{N} \sum_{x \sim D} L(x) = - \frac{1}{N} \sum_{x \sim D} \log p_{\hat{c}(x)}(x) = \text{EmpiricalCrossEntropy}(p)</script>
<p>In what follows, we adopt the following formulation for a single sample:</p>
<script type="math/tex; mode=display">\text{CrossEntropy}(p) = - \log p_\hat{c}(x)</script>
<p>which is equivalent to setting the <script type="math/tex">\tilde{p}</script> probability in the theoretical cross-entropy definition (1) with:</p>
<script type="math/tex; mode=display">% <![CDATA[
\tilde{p}_c(x) = \begin{cases}
1, & \text{if } c = \hat{c}(x), \\
0, & \text{otherwise}.
\end{cases} %]]></script>
<p>which we will write <script type="math/tex">\tilde{p}_c(x) = \delta(c,\hat{c})</script>. That is why the target for the predicted probability for the true class is 1.</p>
<p><script type="math/tex">\tilde{p}</script> is a vector of zero values except for the true class, where it has a one: we name <script type="math/tex">\tilde{p}</script> the <strong>one-hot encoding</strong>.</p>
<p>In <strong>conclusion</strong>, for classification we compute the cross-entropy with “true” probabilities that are either 0 or 1 at the sample level: if x is the image of a cat, we want the model output <script type="math/tex">\{p_c\}_c</script> to fit the empirical class probability <script type="math/tex">\{\tilde{p}_c\}_c</script>, which we set to <script type="math/tex">\tilde{p}_\hat{c} = 1</script> for the real object class <script type="math/tex">\hat{c}</script> “cat” and <script type="math/tex">\tilde{p}_c = 0</script> for all other classes <script type="math/tex">c \neq \hat{c}</script>. At the dataset level, averaging these values leads to the empirical estimates of the true class probabilities we are used to.</p>
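<p>A short sketch (with made-up probabilities) showing that cross-entropy against a one-hot target reduces to the negative log-probability of the true class:</p>

```python
import math

def cross_entropy(p, p_true):
    # theoretical definition: -sum_c p_true_c * log(p_c)
    return -sum(t * math.log(q) for q, t in zip(p, p_true))

def one_hot(c, num_classes):
    # one-hot encoding: 1 at the true class index, 0 elsewhere
    return [1.0 if i == c else 0.0 for i in range(num_classes)]

p = [0.1, 0.8, 0.1]               # made-up model output for one sample
true_class = 1
target = one_hot(true_class, 3)   # [0.0, 1.0, 0.0]

# against a one-hot target, all terms vanish except the true class:
print(cross_entropy(p, target))
print(-math.log(p[true_class]))   # same value
```
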
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL1.png" /></p>
<h1 id="the-gradient-descent">The Gradient Descent</h1>
<p>To minimize the cost function, the most used technique is the Gradient Descent.</p>
<p>It consists in following the gradient to descend to the minima:</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL2.png" /></p>
<p>In other words, we follow the negative slope of the mountain to find the lowest point of the valley. In order to avoid local minima, initialization (the choice of initial values for the parameters, i.e. where to start the descent from) is very important, and multiple runs can also help find the best solution.</p>
<p>It is an iterative process in which the update rule simply consists in:</p>
<script type="math/tex; mode=display">\theta_{t+1} = \theta_{t} - \lambda \nabla_\theta \text{cost}</script>
<p>where our cost is defined as a result of the previous section</p>
<script type="math/tex; mode=display">\text{cost}_\theta (x, \tilde{p}) = \text{CrossEntropy} ( \text{Softmax}( f_\theta (x) ) , \tilde{p})</script>
<p>We call <script type="math/tex">\tilde{p}</script> the target (in fact the target distribution). We usually omit the fact that the cost is a function of the input, the target and the model parameters, and write it directly as “cost”. We can also write it with the composition symbol:</p>
<script type="math/tex; mode=display">\text{cost} = \text{CrossEntropy} ( \cdot, \tilde{p}) \circ \text{Softmax} \circ f_\theta</script>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL3.png" /></p>
<p>An <strong>update rule</strong> is how to update the parameters of the model to minimize the cost function.</p>
<p>Since the cost is a scalar, the Jacobian has the same shape as the parameters, and gives a derivative value with respect to every parameter, telling us how to update it.</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL9.png" /></p>
<p><script type="math/tex">\lambda</script> is the learning rate and has to be set carefully. The figure below shows the effect of various learning rates on convergence (image credit: cs231n):</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL4.png" /></p>
<p>This simple method is named SGD, for <em>Stochastic Gradient Descent</em>. There are many improvements over this simple rule: ADAM, ADADELTA, RMSProp, … All of them use first-order derivatives only. Some are adaptive, such as ADAM or ADADELTA, where the learning rate is adapted to each parameter automatically.</p>
<p>There also exist second-order optimization methods, but they are not very common in practice.</p>
<p><strong>Conclusion</strong>: once we have the loss function, the goal is to minimize it. For this, we have a very simple update rule: gradient descent.</p>
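<p>The update rule can be sketched on a toy one-dimensional cost, here a quadratic chosen only for illustration, whose gradient is known in closed form:</p>

```python
# minimize cost(theta) = (theta - 3)^2 with plain gradient descent
def grad(theta):
    # d/dtheta of (theta - 3)^2 is 2 * (theta - 3)
    return 2.0 * (theta - 3.0)

theta = 0.0   # initialization: where we start the descent
lr = 0.1      # the learning rate lambda
for _ in range(100):
    # theta_{t+1} = theta_t - lambda * gradient
    theta = theta - lr * grad(theta)

print(round(theta, 4))  # 3.0, the minimum of the cost
```

<p>With a well-chosen learning rate the iterates converge to the minimum; a learning rate that is too large would make them diverge, as the cs231n figure above illustrates.</p>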
<h1 id="backpropagation">Backpropagation</h1>
<p>Computing the gradients to use in the SGD update rule is known as <em>backpropagation</em>.</p>
<p>The reason for this name is the <em>chain rule</em> used when computing gradients of function compositions.</p>
<p>First, adding the softmax and cross-entropy to the model outputs is a composition of 3 functions:</p>
<script type="math/tex; mode=display">\text{cost} = \text{CrossEntropy} \circ \text{Softmax} \circ f_\theta</script>
<p>where the composition means</p>
<script type="math/tex; mode=display">\text{cost}(x) = \text{CrossEntropy} (\text{Softmax} ( f_\theta(x) ) )</script>
<p>But models are also composed of multiple functions, for example let us consider a model composed of 2 dense layers:</p>
<script type="math/tex; mode=display">\text{cost} = \text{CrossEntropy} \circ \text{Softmax} \circ \text{Dense}_{\theta_2}^2 \circ \text{ReLu} \circ \text{Dense}_{\theta_1}^1</script>
<script type="math/tex; mode=display">\text{cost}(x) = \text{CrossEntropy} (\text{Softmax} (\text{Dense}_{\theta_2}^2 ( \text{ReLu} ( \text{Dense}_{\theta_1}^1 (x) ) ) ) )</script>
<p>This model is called a <strong>multi-layer perceptron</strong> and the output of the first Dense layer is a <strong>hidden representation</strong> of the data, while the second layer reduces the number of outputs to the number of classes, to predict class probabilities with a softmax normalization:</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL15.png" /></p>
<p>Without a ReLu activation in the middle, the two Dense layers would be mathematically equivalent to only 1 Dense layer.</p>
<p>The model has two sets of parameters:</p>
<script type="math/tex; mode=display">\theta = [ \theta_1, \theta_2 ]</script>
<p>The chain rule gives the gradient of the composition of two functions f and g:</p>
<script type="math/tex; mode=display">\frac{ \partial }{ \partial x_i} (f \circ g )_k = \sum_c \frac{\partial f_k}{\partial g_c} \cdot \frac{\partial g_c}{\partial x_i}</script>
<p>which is a simple matrix multiplication</p>
<script type="math/tex; mode=display">\nabla (f \circ g) = \nabla f \times \nabla g</script>
<p>where the Jacobian matrix of a function is noted</p>
<script type="math/tex; mode=display">\nabla g = \Big\{ \frac{\partial g_i}{\partial x_j} \Big\}_{i,j}</script>
<p>What does that mean for deep learning and gradient descent ? In fact, for each layer, to follow the negative slope, we need to compute</p>
<script type="math/tex; mode=display">\nabla_{\theta_{\text{Layer}}} \text{cost}</script>
<p>to update each layer’s parameters <script type="math/tex">\theta_{\text{Layer}}</script>.</p>
<p>So, let’s consider a two layer network composed of g and f. The functions g and f have to be considered as functions of both inputs and parameters:</p>
<script type="math/tex; mode=display">g(x) = g_{\theta_g} (x) = g(x,\theta_g)</script>
<script type="math/tex; mode=display">f(y) = f_{\theta_f} (y) = f(y,\theta_f)</script>
<p>And the composition becomes</p>
<script type="math/tex; mode=display">(f \circ g)(x) = f_{\theta_f} ( g(x,\theta_g), \theta_f)</script>
<p>It is possible to compute the derivatives of g and f with respect to either the parameters or the inputs, which we distinguish in notation the following way:</p>
<script type="math/tex; mode=display">\nabla_{\theta_g} g = \Big\{ \frac{\partial g_i}{\partial {\theta_g}_j } \Big\}_{i,j}</script>
<script type="math/tex; mode=display">\nabla_I g = \Big\{ \frac{\partial g_i}{\partial x_j} \Big\}_{i,j}</script>
<p>To update the parameters of g, we need to compute the derivative of the cost with respect to the parameters of g, that is :</p>
<script type="math/tex; mode=display">\nabla_{\theta_g} (f \circ g_{\theta_g}) = \nabla_I f \times \nabla_{\theta_g} g</script>
<p>all other parameters (<script type="math/tex">\theta_f</script>) and inputs (x) being constant.</p>
<p>For example, for the layer <script type="math/tex">\text{Dense}^2</script>,</p>
<script type="math/tex; mode=display">x \xrightarrow{ \text{Dense}^1 } \xrightarrow{ \text{ReLu} } y \xrightarrow{ \text{Dense}^2 } \xrightarrow{ \text{Softmax}} \xrightarrow{ \text{CrossEntropy} } \text{cost}</script>
<p>the gradient is given by</p>
<script type="math/tex; mode=display">\nabla_{\theta_2} \text{cost} = \nabla_{\theta_2} \Big( \text{CrossEntropy} \circ \text{Softmax} \circ \text{Dense}^2 \Big)</script>
<script type="math/tex; mode=display">= \Big(\nabla_I \text{CrossEntropy} \times \nabla_I \text{Softmax}\Big) \times \nabla_{\theta_2} \text{Dense}^2</script>
<p>and for the layer <script type="math/tex">\text{Dense}^1</script>,</p>
<script type="math/tex; mode=display">\nabla_{\theta_1} \text{cost} = \Big(\nabla_I \text{CrossEntropy} \times \nabla_I \text{Softmax}\Big) \times \nabla_I \text{Dense}^2 \times \nabla_I \text{ReLu} \times \nabla_{\theta_1} \text{Dense}^1</script>
<p>We see that <script type="math/tex">\Big(\nabla_I \text{CrossEntropy} \times \nabla_I \text{Softmax}\Big)</script> is common to the computation of <script type="math/tex">\nabla_{\theta_1} \text{cost}</script> and <script type="math/tex">\nabla_{\theta_2} \text{cost}</script>, and it is possible to compute them once.</p>
<p>So, to reduce the number of matrix multiplications, it is better to compute the gradients from the top layer down to the bottom layer, reusing the products already computed for the layers above.</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL10.png" /></p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL8.png" /></p>
<p>In <strong>conclusion</strong>, gradient computation with respect to each layer’s parameters is performed via matrix multiplications with the gradients of the layers above, so it is more efficient to begin computing gradients from the top layers, a process we call <em>backpropagation</em>.</p>
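<p>The backward pass for the 2-layer network above can be sketched in plain numpy: the upstream term <script type="math/tex">p - \tilde{p}</script> is computed once and reused for both layers. Sizes and initializations are illustrative:</p>

```python
import numpy as np

# Forward and backward pass for Dense -> ReLU -> Dense -> Softmax+CrossEntropy,
# with explicit matrix products. Shapes are illustrative.

rng = np.random.default_rng(0)
x = rng.normal(size=4)                           # input
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)    # Dense^1 parameters
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)    # Dense^2 parameters
c_hat = 1                                        # true class index

# forward pass, keeping intermediate activations
h = W1 @ x + b1
y = np.maximum(h, 0.0)                           # ReLU
o = W2 @ y + b2
p = np.exp(o - o.max())
p /= p.sum()                                     # softmax probabilities

# backward pass: the upstream gradient (CrossEntropy o Softmax) is computed ONCE
delta2 = p.copy()
delta2[c_hat] -= 1.0                             # p - one_hot, shared by both layers
grad_W2 = np.outer(delta2, y)                    # gradient for Dense^2 weights
delta1 = (W2.T @ delta2) * (h > 0)               # propagate through Dense^2 and ReLU
grad_W1 = np.outer(delta1, x)                    # gradient for Dense^1 weights
```

Note that <code class="highlighter-rouge">delta2</code> appears in both <code class="highlighter-rouge">grad_W2</code> and <code class="highlighter-rouge">grad_W1</code>: computing it once is exactly the reuse that makes top-down gradient computation efficient.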
<h1 id="cross-entropy-with-softmax">Cross Entropy with Softmax</h1>
<p>Let’s come back to the global view: an observation X, a model f depending on parameters <script type="math/tex">\theta</script>, a softmax to normalize the model’s output, and last, our cross entropy outputting a final scalar, measuring the distance between the prediction and the expected value:</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL5.png" /></p>
<p>The cross entropy works very well with the softmax function, and the two are usually implemented as a single layer, for numerical stability and efficiency.</p>
<p>Let us see why, and study the combination of softmax and cross entropy:</p>
<script type="math/tex; mode=display">o = f(x) = \{o_c\}_c \xrightarrow{ \text{Softmax}} \xrightarrow{ \text{CrossEntropy} } \text{cost}</script>
<p>Mathematically,</p>
<script type="math/tex; mode=display">\text{cost} = - \log(p_\hat{c}) = - \log \Big( \frac{ e^{o_\hat{c}} }{ \sum_c e^{o_i}} \Big)</script>
<script type="math/tex; mode=display">= - \log e^{o_\hat{c}} + \log \sum_c e^{o_i}</script>
<script type="math/tex; mode=display">= - o_\hat{c} + \log \sum_c e^{-o_i}</script>
<p>Let’s take the derivative with respect to the model output (before softmax normalization):</p>
<script type="math/tex; mode=display">\frac{\partial \text{cost}}{\partial o_c} = - \delta_{c,\hat{c}} + \frac{ e^{o_i} }{\sum_c e^{o_i}}</script>
<script type="math/tex; mode=display">= - \delta_{c,\hat{c}} + p_c</script>
<p>which is very easy to compute and can simply be rewritten:</p>
<script type="math/tex; mode=display">\nabla_o \text{cost} = p - \tilde{p}</script>
<p>Note that the gradient lies between -1 and 1: for a negative class (<script type="math/tex">\tilde{p}_c = 0</script>), the derivative is positive and equal to the predicted probability <script type="math/tex">p_c</script>, so the more confidently wrong the prediction, the larger the derivative; for the positive class, the derivative <script type="math/tex">p_\hat{c} - 1</script> is always negative, and the lower the predicted probability, the more negative the derivative.</p>
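<p>We can verify numerically that the gradient of Softmax+CrossEntropy with respect to the logits is exactly <script type="math/tex">p - \tilde{p}</script>, the predicted probabilities minus the one-hot target. The logit values below are illustrative:</p>

```python
import numpy as np

# Analytic gradient p - one_hot versus a finite-difference gradient.

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def cost(o, c_hat):
    return -np.log(softmax(o)[c_hat])

o = np.array([2.0, 1.0, 0.1])   # toy logits
c_hat = 0                        # true class

analytic = softmax(o).copy()
analytic[c_hat] -= 1.0           # p - one_hot

# central finite differences on each logit
eps = 1e-6
numeric = np.array([
    (cost(o + eps * np.eye(3)[i], c_hat) - cost(o - eps * np.eye(3)[i], c_hat)) / (2 * eps)
    for i in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```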
<p>Since Softmax is monotonic, computing the Softmax output is not required for inference: the highest logit corresponds to the highest probability… except if you need the probability itself to estimate the confidence of the predicted class.</p>
<p><strong>Conclusion</strong>: it is easier to backpropagate the gradient of Softmax+CrossEntropy computed together rather than to backpropagate each separately. The derivative of Softmax+CrossEntropy with respect to the model output for the right class, let’s say the “cat” class, will be 0.8 - 1 = -0.2 if the model has predicted a probability of 0.8 for this class, and the update will follow the negative slope to encourage increasing this prediction; the derivative with respect to the output for a different class will be 0.4 if the model has predicted a probability of 0.4 for it, encouraging the model to decrease this value.</p>
<p><strong>Exercise</strong>: Compute the derivative of Sigmoid+BinaryCrossEntropy combined.</p>
<p><strong>Exercise</strong>: Compute the derivative of Sigmoid+MSE combined.</p>
<h1 id="example-with-a-dense-layer">Example with a Dense layer</h1>
<p>Let’s take, as model, a very simple one with only one Dense layer with 2 filters in one dimension and an input of dimension 2:</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL7.png" /></p>
<p>This is the smallest model we can ever imagine. Such a layer produces a vector of 2 scalars:</p>
<script type="math/tex; mode=display">f_\theta : \{x_j\} \rightarrow \Big\{ o_i = \sum_j \theta_{i,j} x_j + b_i \Big\}_i</script>
<p>Please keep in mind that it is not possible to descend the gradient directly on this output because it is composed of two scalars. We need a loss function, that returns a scalar, and tells us how to combine these two outputs. For example, the Softmax+CrossEntropy we have seen previously:</p>
<script type="math/tex; mode=display">L :o \rightarrow \text{CrossEntropy}(\text{Softmax}(o))</script>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL6.png" /></p>
<p>Now we can backpropagate the gradients. Since we are in the case of 2 outputs, we call this problem a binary classification problem, let’s say: cats and dogs.</p>
<p>Let us compute the derivatives of the dense layer:</p>
<script type="math/tex; mode=display">% <![CDATA[
\frac{\partial o_i}{\partial \theta_{k,j}} = \begin{cases}
x_j, & \text{if } k = i, \\
0, & \text{otherwise}.
\end{cases} %]]></script>
<p>so</p>
<script type="math/tex; mode=display">\frac{\partial}{\partial \theta_{i,j}} ( L \circ f_\theta )= \sum_c \frac{\partial L}{\partial o_c} \cdot \frac{\partial o_c}{\partial \theta_{i,j}} = \frac{\partial L}{\partial o_i} \cdot \frac{\partial o_i}{\partial \theta_{i,j}} = ( \delta_{ i, \hat{c}} - L(o_i)) \cdot x_j</script>
<p>A Dense layer with 4 outputs:</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL17.png" /></p>
<h1 id="generalize-beyond-cross-entropy">Generalize beyond cross entropy</h1>
<h4 id="re-weighting-probabilities">Re-weighting probabilities</h4>
<p>Cross-entropy is built upon the probability of the label:</p>
<script type="math/tex; mode=display">\text{CrossEntropy} = - \sum_c \tilde{p_c} \log p_c</script>
<p>In our section on practical cross-entropy, we considered that we knew the true label with certainty, and that the goal was to maximize the objective under the real distribution of labels, even if they are unbalanced in the dataset, which leads to a strong bias. In practice, we can go one step further: rebalance these probabilities as Bayes’ rule would suggest, or integrate a notion of uncertainty in the ground-truth label to reduce the influence of noise. Here are a few techniques we can use in practice. It is possible:</p>
<ul>
<li>to rebalance the dataset with <script type="math/tex">\alpha_{c}</script> the inverse class frequency :</li>
</ul>
<script type="math/tex; mode=display">\text{CrossEntropy}(p, \tilde{p}) = - \alpha_{\hat{c}} \times \log p_\hat{c}</script>
<p>This could also be performed by replacing the current sampling scheme (<script type="math/tex">\tilde{p}</script>) with a two-step one: sample a class uniformly first, then a sample belonging to this class.</p>
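<p>A minimal sketch of the rebalancing weights, with illustrative class counts (the renormalization step is a common convenience, not part of the formula above):</p>

```python
import numpy as np

# alpha_c is the inverse class frequency; the loss of a sample of class c_hat
# is scaled by alpha[c_hat]. Counts and probabilities are illustrative.

counts = np.array([900, 80, 20])            # unbalanced dataset
alpha = 1.0 / (counts / counts.sum())       # inverse class frequencies
alpha = alpha / alpha.sum() * len(alpha)    # optional: rescale so weights average to 1

p = np.array([0.7, 0.2, 0.1])               # predicted probabilities
c_hat = 2                                   # a rare class
loss = -alpha[c_hat] * np.log(p[c_hat])     # rare classes weigh more in the loss
```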
<ul>
<li>to train a model with smoother values than 0 and 1 for negatives and positives, for example 0.1 or 0.9, which helps achieve better performance. This technique of <em>label smoothing</em> or <em>soft labels</em> in particular re-introduces the outputs for the negative classes, preserving a symmetry between the negative and positive labels:</li>
</ul>
<script type="math/tex; mode=display">\text{CrossEntropy}(p, \tilde{p}) = - 0.9 \times \log p_\hat{c} - 0.1 \times \sum_{c \neq \hat{c}} \log p_c</script>
<p>This technique reduces the confidence in the targets, and the network’s overfitting. It discourages too large differences between the logits of the true class and those of the other classes.</p>
<p>It is also possible to regularize with label smoothing, by drawing with probability <script type="math/tex">\epsilon</script> a class among C classes uniformly:</p>
<script type="math/tex; mode=display">\tilde{p}' (c) =(1-\epsilon) \delta_{c,\hat{c}} + \frac{\epsilon}{C}</script>
<script type="math/tex; mode=display">\text{CrossEntropy'}(p, \tilde{p}) = (1-\epsilon) \text{CrossEntropy}(p, \tilde{p}) + \epsilon \text{CrossEntropy}(p, \text{uniform})</script>
<ul>
<li>
<p>to use smoother values than 0 and 1 when the labels in the groundtruth are less certain,</p>
</li>
<li>
<p>to focus more on wrongly classified examples</p>
</li>
</ul>
<script type="math/tex; mode=display">\text{CrossEntropy}(p, \tilde{p}) = - ( 1 - p_\hat{c} )^\gamma \times \log p_\hat{c}</script>
<p>as in the Focal Loss for object detection, where background negatives are too numerous and tend to take over the positives. This technique replaces <strong>hard negative mining</strong>.</p>
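<p>A sketch of the focal weighting term, showing how well-classified examples are down-weighted compared to plain cross-entropy (the <script type="math/tex">\gamma</script> value of 2 is the common choice from the Focal Loss paper; the probabilities are illustrative):</p>

```python
import numpy as np

# Focal loss term: (1 - p)^gamma * -log(p). Confident correct predictions
# (p close to 1) contribute almost nothing, hard examples keep a large loss.

def focal_loss(p_true_class, gamma=2.0):
    return -((1.0 - p_true_class) ** gamma) * np.log(p_true_class)

easy = focal_loss(0.95)   # well classified  -> near-zero loss
hard = focal_loss(0.1)    # badly classified -> close to plain -log(0.1)
print(easy < 0.001, hard > 1.8)  # True True
```

With <script type="math/tex">\gamma = 0</script> this reduces to the plain cross-entropy, so the focal term is a strict generalization.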
<h4 id="reinforcement">Reinforcement</h4>
<p>That is where the magic happens ;-)</p>
<p>As you might have understood, the cross entropy comes from the theory of information, but the definition of the probabilities and their weighting scheme can be adapted to the problem we want to solve. That is where we leave theory for practice.</p>
<p>Still, there is a very important theoretical generalization of cross-entropy through reinforcement learning which is very easy to understand.</p>
<p>In reinforcement learning, given an observation <script type="math/tex">x_t</script> at a certain timestep, you perform an action, for example when driving a car, going right, left or straight, or in a game, pressing some keyboard commands. Then the environment is modified and you need to decide on the next action… and sometimes you get a reward <script type="math/tex">r_t</script>, a feedback from the environment: good news or bad news, gaining some points… In reinforcement learning, we do not have any labels as targets or ground truth. We just want to get the best expected reward:</p>
<script type="math/tex; mode=display">R = \mathbb{E}_{\text{seq}} \sum_{t} r_t</script>
<p>over all possible sequences of actions.</p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL35.png" /></p>
<p>The probability of a sequence is</p>
<script type="math/tex; mode=display">p({a_0, a_1, ..., a_T}) = p (a_0 | x_0) p( a_1 | a_0, x_0, x_1)... = \prod_{t=0}^T p(a_t | a_0, x_0, ... x_T)</script>
<p>so R can be written</p>
<script type="math/tex; mode=display">R = \sum_{a_0, a_1, ..., a_T} \big( \prod_{t=0}^T p(a_t | a_0, x_0, ... x_T) \big) (\sum_{t=0}^T r_t)</script>
<script type="math/tex; mode=display">= \mathbb{E}_{a_0 \sim p(\cdot |x_0)} \mathbb{E}_{a_1 \sim p(\cdot | a_0, x_0, x_1)} ... \mathbb{E}_{a_T \sim p(\cdot | a_0, x_0, ... x_T)} \sum_{t=0}^T r_t</script>
<p>When my actions are parametrized by a model <script type="math/tex">p = f_\theta</script>, the whole problem becomes parametrized by <script type="math/tex">\theta</script> and it is possible to find the <script type="math/tex">\theta</script> that maximizes the reward by following the gradient of the expected reward:</p>
<script type="math/tex; mode=display">R(\theta) = \mathbb{E}_{\text{seq} \sim p_\theta} \sum_{t=0}^T r_t = \sum_{\text{seq}} p_\theta(\text{seq}) \sum_{t=0}^T r_t</script>
<p>In order to maximize the expected reward <script type="math/tex">R(\theta)</script>, we compute its derivative:</p>
<script type="math/tex; mode=display">\nabla_\theta R = \sum_{\text{seq}} \frac{\partial p_\theta(\text{seq})}{\partial \theta} (\sum_{t=0}^T r_t)
= \sum_{\text{seq}} p_\theta(\text{seq}) \frac{\partial \log p_\theta(\text{seq})}{\partial \theta} (\sum_{t=0}^T r_t)
= \mathbb{E}_{\text{seq}} \frac{\partial \log p_\theta(\text{seq})}{\partial \theta} (\sum_{t=0}^T r_t)</script>
<p>because <script type="math/tex">\frac{\partial f}{\partial \theta} = f \times \frac{1}{f} \frac{\partial f}{\partial \theta} = f \times \frac{\partial \log f}{\partial \theta}</script></p>
<p>which looks exactly like the derivative of the cross entropy (up to the sign, since we maximize the reward but minimize the cross entropy):</p>
<script type="math/tex; mode=display">\text{CE} = - \sum_c \tilde{p_c} \log(p_c)</script>
<script type="math/tex; mode=display">\frac{\partial \text{CE}}{\partial \theta} = - \sum_c \tilde{p_c} \frac{\partial \log(p_c)}{\partial \theta}</script>
<p>except that the total reward plays the role of the true label probability:</p>
<script type="math/tex; mode=display">\log p_\theta(\text{seq}) = \log \prod_{t=0}^T p_\theta(a_t | a_0, x_0, ... x_t)</script>
<script type="math/tex; mode=display">= \sum_{t=0}^T \log p_\theta (a_t| a_0, x_0, ... x_t)</script>
<script type="math/tex; mode=display">\frac{\partial \log p_\theta(\text{seq}) }{ \partial \theta } = \sum_{t=0}^T \frac{\partial \log p_\theta (a_t| a_0, x_0, ... x_t)}{\partial \theta}</script>
<script type="math/tex; mode=display">\nabla_\theta R = \mathbb{E}_{\text{seq}} ( \sum_{t=0}^T \frac{\partial \log p_\theta (a_t| a_0, x_0, ... x_t)}{\partial \theta} ) \times ( \sum_{t=0}^T r_t )</script>
<p>In <strong>conclusion</strong>, in place of the 1 and 0 of the classification case, reinforcement learning proposes to use the global reward R as the target label for each timestep that led to this reward, considering each timestep as an individual sample.</p>
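<p>Here is a minimal sketch of this idea on a one-step toy problem: a softmax policy updated with the reward times the gradient of the log-probability of the sampled action. The environment, reward and learning-rate values are illustrative:</p>

```python
import numpy as np

# One-step policy gradient sketch: grad R ~= reward * grad log p_theta(action).
# The gradient of log softmax(theta)[a] w.r.t. the logits is one_hot(a) - p.

rng = np.random.default_rng(0)
theta = np.zeros(3)                     # logits of a 3-action policy

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def reward(action):
    return 1.0 if action == 2 else 0.0  # toy environment: only action 2 pays off

lr = 0.5
for _ in range(200):
    p = softmax(theta)
    a = rng.choice(3, p=p)              # sample an action from the policy
    grad_log_p = -p
    grad_log_p[a] += 1.0                # one_hot(a) - p
    theta += lr * reward(a) * grad_log_p  # gradient ASCENT on the expected reward

print(np.argmax(softmax(theta)))  # 2: the rewarded action dominates
```

Compare with classification: the reward plays the role of the label, and the update only reinforces actions that were actually rewarded.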
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL40.png" /></p>
<p><img src="//christopher5106.github.io/img/deeplearningcourse/DL41.png" /></p>
<p>This is a fantastic result [Williams, 1992]: a re-weighting of the cross-entropy, where the eligibility of each parameter to the gradient is multiplied by the reward, that is, by the progress towards the goal we want to achieve.</p>
<p>The classical cross-entropy definition can be seen as a specific case of this more global formulation.</p>
<p><strong>Well done!</strong></p>
<p>Now let’s go to next course: <a href="http://christopher5106.github.io/deep/learning/2018/10/20/course-one-programming-deep-learning.html">Course 1: programming deep learning!</a></p>
Sat, 20 Oct 2018 05:00:00 +0000
//christopher5106.github.io/deep/learning/2018/10/20/course-zero-deep-learning.html
//christopher5106.github.io/deep/learning/2018/10/20/course-zero-deep-learning.htmldeeplearningUnderstand shape inference in deep learning technologies<p>Run the following code in your Python shell with Keras installed,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Input</span><span class="p">,</span> <span class="n">Embedding</span>
<span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">Input</span><span class="p">((</span><span class="mi">12</span><span class="p">,</span> <span class="p">))</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">trainable</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">input_length</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">"word_embedding"</span><span class="p">)(</span><span class="n">a</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">K</span><span class="o">.</span><span class="n">int_shape</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">get_shape</span><span class="p">()</span><span class="o">.</span><span class="n">as_list</span><span class="p">())</span>
</code></pre></div></div>
<p>And you’ll be surprised: you’ll get two different results:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(None, 10, 10)
[None, 12, 10]
</code></pre></div></div>
<p>If you are not sure why, this article is for you!</p>
<h2 id="static-shapes">Static Shapes</h2>
<p>In Tensorflow, the static shape is given by <code class="highlighter-rouge">.get_shape()</code> method of the Tensor object, which is equivalent to <code class="highlighter-rouge">.shape</code>.</p>
<p>The static shape is an object of type <code class="highlighter-rouge">tensorflow.python.framework.tensor_shape.TensorShape</code>.</p>
<p>With <code class="highlighter-rouge">None</code> instead of an integer, it leaves the possibility for partially defined shapes :</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">Input</span><span class="p">((</span><span class="bp">None</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
<p>returns <code class="highlighter-rouge">(?, ?, 10)</code> where unknown dimensions <code class="highlighter-rouge">None</code> are printed with an question mark <strong>?</strong>. Since we are using the Keras <code class="highlighter-rouge">Input</code> layer,
the first dimension is systematically <code class="highlighter-rouge">None</code> for the batch size and cannot be set.</p>
<p>The TensorShape object has 2 public attributes, <code class="highlighter-rouge">dims</code> and <code class="highlighter-rouge">ndims</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">shape</span><span class="o">.</span><span class="n">dims</span><span class="p">)</span> <span class="c"># [Dimension(None), Dimension(None), Dimension(10)]</span>
<span class="k">print</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">shape</span><span class="o">.</span><span class="n">ndims</span><span class="p">)</span> <span class="c"># 3</span>
<span class="k">print</span><span class="p">(</span><span class="n">K</span><span class="o">.</span><span class="n">ndim</span><span class="p">(</span><span class="n">a</span><span class="p">))</span> <span class="c"># 3</span>
</code></pre></div></div>
<h2 id="dynamic-shapes">Dynamic Shapes</h2>
<p>Of course, during the run of the graph with input values, all shapes become known. Shapes at run time are named <em>dynamic shapes</em>.</p>
<p>To access their values at run time, you can use either the Tensorflow operator <code class="highlighter-rouge">tf.shape()</code> or the Keras wrapper <code class="highlighter-rouge">K.shape()</code>.</p>
<p>As graph operators, they both return a Tensor of type <code class="highlighter-rouge">tensorflow.python.framework.ops.Tensor</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">Input</span><span class="p">((</span><span class="bp">None</span><span class="p">,</span> <span class="p">))</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">function</span><span class="p">([</span><span class="n">a</span><span class="p">],</span> <span class="p">[</span><span class="n">K</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">a</span><span class="p">)])</span>
<span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="p">(</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">((</span><span class="mi">3</span><span class="p">,</span><span class="mi">11</span><span class="p">))</span> <span class="p">]))</span>
</code></pre></div></div>
<p>returns <code class="highlighter-rouge">[array([ 3, 11], dtype=int32)]</code>. 3 is the batch size, and 11 the value for the unknown data dimension at graph definition.</p>
<p>The equivalent of <code class="highlighter-rouge">a.get_shape().as_list()</code> for static shapes is <code class="highlighter-rouge">tf.unstack(tf.shape(a))</code>.</p>
<p>The number of dimensions is returned by <code class="highlighter-rouge">tf.rank()</code> which returns a tensor of rank zero (scalar) and for that reason is very different from <code class="highlighter-rouge">K.ndim</code> or <code class="highlighter-rouge">ndims</code> methods with integer return.</p>
<h2 id="shape-setting-for-operators">Shape setting for Operators</h2>
<p>Most operators have an output shape function that makes it possible to infer the static shape, <strong>without running the graph</strong>, given the shapes of the operators’ input tensors.</p>
<p>For example, in Tensorflow you can check how to define the <a href="https://www.tensorflow.org/extend/adding_an_op#shape_functions_in_c">Shape functions in C++</a> for any operator.</p>
<p>Nevertheless, shape functions do not cover all cases correctly, and in some cases, it is impossible to infer the shape without knowing more about your intent.</p>
<p>Let’s take an example, where automatic shape inference is not possible.</p>
<p>Let’s define a reshaping operation given a reshaping tensor for which the values are not known at graph definition, but only at run time.</p>
<p>Such a reshaping tensor can be for example depending on input tensor dimensions, or any other dynamic shapes:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">Input</span><span class="p">((</span><span class="bp">None</span><span class="p">,))</span>
<span class="n">reshape_tensor</span> <span class="o">=</span> <span class="n">Input</span><span class="p">((),</span> <span class="n">dtype</span><span class="o">=</span><span class="s">"int32"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c"># reshape the first input tensor given the second input tensor</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">reshape_tensor</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c"># build the model</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">K</span><span class="o">.</span><span class="n">function</span><span class="p">([</span><span class="n">a</span><span class="p">,</span> <span class="n">reshape_tensor</span><span class="p">],</span> <span class="p">[</span><span class="n">x</span><span class="p">])</span>
<span class="c"># eval the model on input data</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">f</span><span class="p">([</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">((</span><span class="mi">3</span><span class="p">,</span><span class="mi">10</span><span class="p">)),</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span><span class="mi">5</span><span class="p">]</span> <span class="p">])</span>
</code></pre></div></div>
<p>prints</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(?, ?)
<unknown>
</code></pre></div></div>
<p>The shape of variable <code class="highlighter-rouge">a</code> is of rank 2, but both dimensions are not known: it is named <em>partially known shape</em>. Its value is <code class="highlighter-rouge">TensorShape([None, None])</code>.</p>
<p>The shape of variable <code class="highlighter-rouge">x</code> is <em>unknown</em>, neither the rank nor the dimensions are known. Its value is <code class="highlighter-rouge">TensorShape(None)</code>.</p>
<p>That is where, if possible, setting the shape manually can help get a more precise shape than <code class="highlighter-rouge">TensorShape([None, None])</code> or <code class="highlighter-rouge">TensorShape(None)</code>.</p>
<p>To set the unknown shape dimensions:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">Input</span><span class="p">((</span><span class="bp">None</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span>
<span class="n">a</span><span class="o">.</span><span class="n">set_shape</span><span class="p">((</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">a</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
<p>Note that shape setting requires preserving the tensor’s rank and known dimensions. If I write:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span><span class="o">.</span><span class="n">set_shape</span><span class="p">((</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span>
</code></pre></div></div>
<p>it leads to a value error</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ValueError: Shapes (?, ?, 10) and (10, 10, 11) are not compatible
</code></pre></div></div>
<p>Setting the shape enables further operators to compute the shape.</p>
<p>Since Tensorflow does not have a concept of Layers (it is much more based on Operators and Scopes), the <code class="highlighter-rouge">set_shape()</code> function is the method for shape inference without running the graph.</p>
<h2 id="shape-setting-for-layers">Shape setting for Layers</h2>
<p>Let’s come back to the initial example, where a layer, the <code class="highlighter-rouge">Embedding</code> layer, is a concept involved in the middle of the Keras graph definition.</p>
<p>The concept of layers gives a structure to the neural networks, making it possible to iterate over the layers later on:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Model</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">Input</span><span class="p">((</span><span class="mi">11</span><span class="p">,))</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">trainable</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">input_length</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">"word_embedding"</span><span class="p">)(</span><span class="n">a</span><span class="p">)</span>
<span class="n">m</span> <span class="o">=</span> <span class="n">Model</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="n">x</span><span class="p">)</span>
<span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="n">m</span><span class="o">.</span><span class="n">layers</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">l</span><span class="o">.</span><span class="n">name</span><span class="p">)</span>
</code></pre></div></div>
<p>returns the name of each layer: <code class="highlighter-rouge">input_1, word_embedding</code>.</p>
<p>Still, when a shape cannot be inferred, it is also possible to set it, so that further layers benefit from the output shape information.</p>
<p>Let’s see in practice, with a simple custom concatenate Layer.</p>
<p>For the purpose, let me introduce an error in the <code class="highlighter-rouge">compute_output_shape()</code> function, adding 2 to the last shape dimension, as I did in the <code class="highlighter-rouge">Embedding</code> layer at the beginning of this article by setting <code class="highlighter-rouge">input_length</code> to 10 instead of 12:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">keras.engine.topology</span> <span class="kn">import</span> <span class="n">Layer</span>
<span class="k">class</span> <span class="nc">MyConcatenateLayer</span><span class="p">(</span><span class="n">Layer</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">call</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inputs</span><span class="p">):</span>
<span class="k">return</span> <span class="n">K</span><span class="o">.</span><span class="n">concatenate</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">compute_output_shape</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_shape</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="n">input_shape</span><span class="p">)</span> <span class="c"># [(None, 10), (None, 12)]</span>
<span class="k">return</span> <span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="n">input_shape</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">input_shape</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="mi">2</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Input</span><span class="p">,</span> <span class="n">Embedding</span>
<span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">Input</span><span class="p">((</span><span class="mi">10</span><span class="p">,))</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">Input</span><span class="p">((</span><span class="mi">12</span><span class="p">,))</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">MyConcatenateLayer</span><span class="p">()([</span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="c"># (?, 22)</span>
<span class="k">print</span><span class="p">(</span><span class="n">c</span><span class="o">.</span><span class="n">_keras_shape</span><span class="p">)</span> <span class="c"># (?, 24)</span>
</code></pre></div></div>
<p>The code runs without error, and so does the graph evaluation on input values.</p>
<p>As you can see, Keras adds attributes to Tensors, such as <code class="highlighter-rouge">_keras_shape</code>, to be able to retrieve layer information. This can be useful, for example, for saving layer weights.</p>
<p>The Keras <code class="highlighter-rouge">K.int_shape()</code> method relies on the <code class="highlighter-rouge">_keras_shape</code> attribute to return its result, so the error propagates.</p>
<p>Since shapes can vary in rank and their values can be <code class="highlighter-rouge">None</code>, it is difficult, even on such a simple concatenation example, to be sure that the shape inference function covers all cases, and this leads to errors.</p>
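<p>A more defensive version of this shape inference (a sketch, assuming concatenation on the last axis) propagates <code class="highlighter-rouge">None</code> instead of guessing, so that an unknown input dimension yields an unknown output dimension rather than a wrong one:</p>

```python
def concat_output_shape(input_shapes):
    """Infer the output shape of a last-axis concatenation.
    Leading (batch) dimensions are assumed compatible; the
    concatenated axis is None if any input axis is unknown."""
    ref = input_shapes[0]
    for shape in input_shapes[1:]:
        if len(shape) != len(ref):
            raise ValueError("Inputs must have the same rank")
    last_dims = [shape[-1] for shape in input_shapes]
    # if any last dimension is unknown, the output last dimension is unknown
    last = None if any(d is None for d in last_dims) else sum(last_dims)
    return ref[:-1] + (last,)

print(concat_output_shape([(None, 10), (None, 12)]))    # (None, 22)
print(concat_output_shape([(None, 10), (None, None)]))  # (None, None)
```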
<h2 id="a-note-on-cntk">A note on CNTK</h2>
<p>CNTK distinguishes two categories of unknown dimensions:</p>
<ul>
<li>
<p>the <em>inferred dimensions</em> whose value is to be inferred by the system and is printed with <strong>-1</strong> instead of the question mark <strong>?</strong>. For example, in the matrix multiplication <em>A x B</em> between tensors A and B, the last dimension of A can be inferred by the system given the first dimension of B. See <a href="https://docs.microsoft.com/en-us/cognitive-toolkit/parameters-and-constants#automatic-dimension-inference">here</a></p>
</li>
<li>
<p>the <em>free dimensions</em> whose value is known only when data is bound to the variable and is printed with <strong>-3</strong></p>
</li>
</ul>
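<p>The inferred-dimension mechanism can be illustrated with a toy resolution function (a sketch of the principle, not CNTK code): a <strong>-1</strong> in one operand’s shape is replaced by the dimension imposed by the matrix multiplication constraint.</p>

```python
INFERRED = -1  # CNTK prints inferred dimensions as -1

def resolve_matmul_shapes(a_shape, b_shape):
    """Resolve inferred dimensions in A x B: the last dimension of A
    and the first dimension of B must match, so one can be inferred
    from the other."""
    a_last, b_first = a_shape[-1], b_shape[0]
    if a_last == INFERRED and b_first == INFERRED:
        raise ValueError("Cannot infer: both dimensions are unknown")
    if a_last == INFERRED:
        a_last = b_first            # infer A's last dim from B
    elif b_first == INFERRED:
        b_first = a_last            # infer B's first dim from A
    elif a_last != b_first:
        raise ValueError("Incompatible dimensions %d and %d" % (a_last, b_first))
    return a_shape[:-1] + (a_last,), (b_first,) + b_shape[1:]

print(resolve_matmul_shapes((3, INFERRED), (5, 7)))  # ((3, 5), (5, 7))
```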
<p><strong>Well done!</strong></p>
<p>Now you are aware of the why of shape setting, along with its advantages and its risks.</p>
Fri, 19 Oct 2018 00:00:00 +0000
//christopher5106.github.io/deep/learning/2018/10/19/understand-shape-inference-in-deep-learning-technologies.html
//christopher5106.github.io/deep/learning/2018/10/19/understand-shape-inference-in-deep-learning-technologies.htmldeeplearningObject detection deep learning frameworks for Optical Character Recognition and Document Pretrained Features<p>Working as an AI architect at the Ivalua company, I’m happy to announce the open-source release of my code for optical character recognition using object detection deep learning techniques.</p>
<p>The main purpose of this work is to compute pretrained features that can serve as early layers in more complex deep learning nets for document analysis, segmentation, classification or reading.</p>
<p>The object detection task has been improving a lot with the rise of new deep learning models such as R-CNN, Fast-RCNN, Faster-RCNN, Mask-RCNN, Yolo, SSD, RetinaNet… These models have been developed for natural images, on datasets such as COCO and Pascal VOC, and have been applied to the task of digit recognition in natural images, such as the Street View House Numbers (SVHN) dataset.</p>
<p>These object detection models have sometimes been applied to documents at the global document scale, to extract document zones such as tables or document layouts. But documents are very different from natural images: mainly black and white, with very strong image gradients and very different gradient patterns. These object detection models are all based on pretrained features coming from classification networks trained on natural images, such as AlexNet, VGG or ResNets, including in the document cases quoted above. Such features might not be well suited to documents, in particular at the character scale, since these are not the standard architectures to recognize characters, but probably also at the global document scale.</p>
<p>The classification network basis for document images could certainly be better found in networks developed for the MNIST digit dataset, such as LeNet. The idea is to explore the application of object detection networks to character recognition with architectural backbones inspired by these digit classification networks, better suited to document image data.</p>
<p>The code has first been developed on toy examples built with MNIST data:</p>
<p><img src="//christopher5106.github.io/img/ocr/res1.png" height="500" /> <img src="//christopher5106.github.io/img/ocr/res2.png" height="500" /></p>
<p>Training on the full document images is challenging, since</p>
<ul>
<li>
<p>characters can hardly be read when the document is less than 1000 pixels high,</p>
</li>
<li>
<p>training deep learning networks for image tasks is classically performed on small images, less than 300 pixels high and wide (224, 256, …), to fit on the GPU.</p>
</li>
</ul>
<p>So, in order to keep a resolution good enough to read the characters while keeping the document image small enough to fit on the GPU, we first used crops, as is classical in object detection, as well as multiple layers, as in SSD, to recognize characters at different font sizes:</p>
<p><img src="//christopher5106.github.io/img/ocr/res3.png" height="500" /> <img src="//christopher5106.github.io/img/ocr/res4.png" height="500" /></p>
<p>Once the results were good enough, the document image resolution and the batch size were decreased to fit the image and the network on the GPU, and the following result was achieved, dropping characters that are too big or too small:</p>
<p><img src="//christopher5106.github.io/img/ocr/res5.png" height="1000" /></p>
<p>The full experiment settings and results are described in the <a href="//christopher5106.github.io/img/ocr/Object_detection_deep_learning_networks_for_Optical_Character_Recognition.pdf">PDF paper</a>.</p>
<p>By releasing our work, we hope to help the open source community use these nets and work with us on them, to invent more accurate or more efficient networks for full document processing that could serve as early layers for further document tasks.</p>
<p><a href="https://github.com/Ivalua/object_detection_ocr">The code on Ivalua’s github</a></p>
<p><strong>Well done!</strong></p>
Tue, 26 Jun 2018 00:00:00 +0000
//christopher5106.github.io/deep/learning/2018/06/26/object-detection-deep-learning-for-optical-character-recognition-and-document-pretrained-features.html
//christopher5106.github.io/deep/learning/2018/06/26/object-detection-deep-learning-for-optical-character-recognition-and-document-pretrained-features.htmldeeplearningConfigure Windows 10 for Ubuntu and server X<p>In Windows 10, it is now possible to run an Ubuntu Bash shell, without dual boot or virtual machine, directly using the Windows kernel’s new capabilities. This feature is named <strong>Windows Subsystem for Linux (WSL)</strong>.</p>
<p>In this tutorial, I’ll give you the commands to install and use the Ubuntu shell on a typical enterprise Windows computer.</p>
<h1 id="install-ubuntu-shell">Install Ubuntu Shell</h1>
<p>First, in <strong>Settings > Update and security > For developers</strong>, activate <strong>Developer mode</strong>:</p>
<p><img src="//christopher5106.github.io/img/windows_developer_mode.PNG" alt="" /></p>
<p>Second, in <strong>Settings > Applications > Applications and features</strong>, click on <strong>Programs and features</strong>,</p>
<p><img src="//christopher5106.github.io/img/windows_programs_and_features.PNG" alt="" /></p>
<p>open the <strong>Turn Windows features on or off</strong> panel:</p>
<p><img src="//christopher5106.github.io/img/windows_features_activation.PNG" alt="" /></p>
<p>enable the “Windows Subsystem for Linux” optional feature (you can also enable the feature with <code class="highlighter-rouge">Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux</code> in PowerShell as administrator).</p>
<p>Third, reboot.</p>
<p>Last, run <code class="highlighter-rouge">lxrun /install</code> in the Windows Command Prompt to install Ubuntu Bash, without requiring the activation of the Windows Store of applications.</p>
<p>You’ll find the Ubuntu bash under <strong>Bash</strong> in the Windows Command prompt:</p>
<p><img src="//christopher5106.github.io/img/windows_bash.PNG" alt="" /></p>
<h1 id="install-a-server-x">Install a server X</h1>
<p>It is possible to run graphical applications from Ubuntu; for that purpose, you need to install <a href="https://sourceforge.net/projects/xming/">Xming X Server for Windows</a>. Then, run the Xming server and set the <code class="highlighter-rouge">DISPLAY</code> environment variable in the Ubuntu Bash shell:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export DISPLAY=localhost:0.0
</code></pre></div></div>
<p>Now you can run <code class="highlighter-rouge">firefox</code> in your Ubuntu Bash terminal.</p>
<p>In the Ubuntu Bash terminal under Windows, it is also possible to get the GUI environment from a remote server as under Linux, with command <code class="highlighter-rouge">ssh -X</code>. To enable this, install SSH, XAUTH and XORG:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo apt install ssh xauth xorg
sudo vi /etc/ssh/ssh_config
</code></pre></div></div>
<p>and edit the <strong>ssh_config</strong> file, uncommenting or adding the following lines:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Host *
ForwardAgent yes
ForwardX11 yes
ForwardX11Trusted yes
Port 22
Protocol 2
GSSAPIDelegateCredentials no
XauthLocation /usr/bin/xauth
</code></pre></div></div>
<p>Now, with the display set, you can access your remote Ubuntu server through the X server running on your Windows computer:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh -X ...
</code></pre></div></div>
<h3 id="line-endings">Line endings</h3>
<p>You will probably find that files created with the Atom editor have strange line endings on your Linux servers:</p>
<p><img src="//christopher5106.github.io/img/windows_line_endings.PNG" alt="" /></p>
<p>To avoid that, you can install <a href="https://atom.io/packages/windows-carriage-return-remover">Windows Carriage Return Removers</a> to remove Windows line endings.</p>
<p>You can also install the <a href="https://atom.io/packages/line-ending-selector">Line Ending Selector package</a> to write new files with Linux line endings.</p>
<p><img src="//christopher5106.github.io/img/linux_line_endings.PNG" alt="" /></p>
<p><strong>Well done!</strong></p>
Fri, 02 Feb 2018 00:00:51 +0000
//christopher5106.github.io/admin/2018/02/02/configure-windows-10-for-ubuntu.html
//christopher5106.github.io/admin/2018/02/02/configure-windows-10-for-ubuntu.htmladminPython Dask: evaluate true skill of reinforcement learning agents with a distributed cluster of instances<p>Reinforcement learning requires a high number of matches for an agent to learn from a game. Once multiple agents have been trained, evaluating their quality requires letting them play multiple times against each other.</p>
<p>Microsoft has released and patented a <a href="https://www.microsoft.com/en-us/research/project/trueskill-ranking-system/">Bayesian-based ranking system</a> named <a href="http://trueskill.org/">TrueSkill</a> for Xbox Live that can also be used to compute the skill of agents given the scores they achieved in multi-player matches. Bayesian theory gives a framework to update the skill distribution of an agent each time the agent is involved in a game whose result is known. Under Gaussian assumptions, each update modifies the mean skill value and sharpens the distribution, which means reducing the uncertainty of the new (a posteriori) skill value. The library is available as a <a href="http://trueskill.org/">Python package</a>.</p>
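<p>The intuition behind these updates can be sketched with a one-dimensional Gaussian toy model (a simplified illustration of the Bayesian principle, not the TrueSkill algorithm itself, which also models draws and performance noise): each observation both moves the mean and sharpens the distribution.</p>

```python
def gaussian_posterior(prior_mu, prior_sigma2, obs, obs_sigma2):
    """Conjugate update of a Gaussian skill belief by a noisy
    observation of the skill: a precision-weighted average."""
    prior_prec = 1.0 / prior_sigma2
    obs_prec = 1.0 / obs_sigma2
    post_prec = prior_prec + obs_prec        # precisions add up
    post_mu = (prior_mu * prior_prec + obs * obs_prec) / post_prec
    return post_mu, 1.0 / post_prec

mu, sigma2 = 25.0, (25.0 / 3) ** 2  # TrueSkill-like default prior
mu, sigma2 = gaussian_posterior(mu, sigma2, 30.0, 20.0)
print(mu, sigma2)  # mean pulled toward 30, variance reduced
```

<p>Each further observation keeps shrinking the variance, which is exactly the “reducing the uncertainty” behavior described above.</p>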
<p>When the number of agents is important, the number of plays also becomes important. Moreover, agents and games might require lots of computation, in particular when they are based on deep neural networks. Let’s see in practice an implementation that distributes the agents’ plays to evaluate their true skills.</p>
<p>To follow this article, the full code can be cloned from <a href="https://github.com/christopher5106/distributed-trueskill-eval-of-agents">here</a>.</p>
<h3 id="python-package-manager">Python package manager</h3>
<p>It is a good practice to create a <code class="highlighter-rouge">requirements.txt</code> file to list the Python modules required to run the code. In this project, we’ll use the TrueSkill library, as well as Dask for code distribution:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>trueskill
dask
distributed
paramiko
</code></pre></div></div>
<p>To install the modules on your local computer, run:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip3 install <span class="nt">-r</span> requirements.txt
</code></pre></div></div>
<h3 id="emulate-a-cluster-on-the-local-computer">Emulate a cluster on the local computer</h3>
<p>Let’s create a virtual cluster on our local computer using Docker, which enables launching multiple containers that behave like virtual machines. A local cluster is tremendously useful to develop fast. We also suppose that the instances have the required Python modules installed.</p>
<p>First, I write a <code class="highlighter-rouge">Dockerfile</code> to create a Docker image based on Ubuntu 17.04 with an SSH server and the Python modules that are required by my code:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FROM ubuntu:17.04
RUN apt-get update
RUN apt-get install <span class="nt">-y</span> openssh-server vim man
RUN apt-get install <span class="nt">-y</span> python3-pip
COPY requirements.txt <span class="nb">.</span>
RUN pip3 install <span class="nt">-r</span> requirements.txt
ENV <span class="nv">HOME</span><span class="o">=</span>/root
WORKDIR /root/
COPY docker.py docker.py
RUN python3 <span class="nt">-c</span> <span class="s2">"from docker import setup_ssh; setup_ssh()"</span>
RUN <span class="nb">echo</span> <span class="s2">"export LC_ALL=C.UTF-8 && export LANG=C.UTF-8"</span> <span class="o">>></span> ~/.bashrc
COPY <span class="nb">.</span> /root/
</code></pre></div></div>
<p>Then, I’m writing a small Python script <code class="highlighter-rouge">docker.py</code> to run Docker commands</p>
<ul>
<li><code class="highlighter-rouge">docker build . -t distributed</code> to build the Docker image under the name “distributed”</li>
<li><code class="highlighter-rouge">docker run -d --rm distributed /bin/bash -c "/etc/init.d/ssh start && while [ ! -f /root/ips.txt ]; do sleep 1s; done && ls -l && cat ips.txt && sleep 60m"</code> to run multiple containers with this image and let them wait for the script to get their IPs</li>
<li><code class="highlighter-rouge">docker inspect --format '{{ .NetworkSettings.IPAddress }}' DOCKER_ID</code> on each container ID to get their IPs</li>
</ul>
<p>The list of IPs is written to the <code class="highlighter-rouge">ips.txt</code> file. From now on, we can use this file, containing the list of hostnames or IPs, to build generic code that will work whether the cluster is virtual (as here with Docker) or a real cluster of different physical instances.</p>
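<p>Reading this hostfile can be done with a small helper (a sketch; the one-hostname-per-line format is simply the convention chosen here, and the first host will act as the scheduler):</p>

```python
def read_hosts(path="ips.txt"):
    """Return the list of hostnames/IPs, one per line, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# hosts = read_hosts()       # e.g. ["172.17.0.2", "172.17.0.3"]
# scheduler_ip = hosts[0]    # first instance hosts the dask-scheduler
```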
<h3 id="dask-a-python-library-for-distribution">Dask, a Python library for distribution</h3>
<p>In the past, I wrote many articles about PySpark, which is a great library for distributing computations on multiple instances.</p>
<p>Since Spark, new libraries have appeared, developed directly in and for Python to reduce the overhead of running Python inside Java.</p>
<p>Dask, for example, extends the Python multiprocessing library, which is aimed at parallelizing code execution on the multiple cores of a computer. Dask keeps its API very close to the multiprocessing library and the PySpark collection methods, which greatly reduces the time needed to learn it and become familiar with it.</p>
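<p>This resemblance can be seen with the standard library’s <code class="highlighter-rouge">concurrent.futures</code>, whose <code class="highlighter-rouge">submit</code>/<code class="highlighter-rouge">result</code> interface Dask’s <code class="highlighter-rouge">Client</code> mirrors closely (a local sketch using only the standard library; with Dask, the executor would be replaced by a <code class="highlighter-rouge">Client</code> connected to the scheduler):</p>

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def play_match(pair):
    a, b = pair
    return a + b  # stand-in for a real match between two agents

# submit the jobs, then collect results as they complete
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(play_match, (i, i + 1)) for i in range(8)]
    results = [f.result() for f in as_completed(futures)]

print(sorted(results))  # [1, 3, 5, 7, 9, 11, 13, 15]
```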
<p>So, to launch a Dask cluster on the provided instances, real or virtual, there are two options.</p>
<p>The first option is to connect manually to each instance, launch one Dask worker on each of them with <code class="highlighter-rouge">dask-worker</code> command, and run a Dask scheduler on the first instance with <code class="highlighter-rouge">dask-scheduler</code> command. Since the image contains the Python modules <code class="highlighter-rouge">dask</code> and <code class="highlighter-rouge">distributed</code>, the commands will be available.</p>
<p>The second option is to use <code class="highlighter-rouge">dask-ssh</code> that will do it for you, given the <code class="highlighter-rouge">ips.txt</code> file listing the IPs or hostnames of different instances:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dask-ssh <span class="nt">--hostfile</span> ips.txt
</code></pre></div></div>
<p>Now, our Dask cluster is set.</p>
<h3 id="build-a-test-game">Build a test game</h3>
<p>Let’s build a test game that will enable us to verify that the implementation works without bugs, i.e.:</p>
<p>1- it is resilient to failures</p>
<p>2- it predicts the correct skills</p>
<p>In order to see whether the implementation predicts the skills correctly, let us assign a game level to each created agent in a <code class="highlighter-rouge">game.py</code> file:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Agent</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">r</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">RATING_RANGE</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>and implement a simple game, for which it seems logical that:</p>
<p>1- a better agent should win more often when playing with a lower agent</p>
<p>2- a better agent should have harder times winning when it was close in level to the lower agent</p>
<p>3- in the case the best agent does not win against the lower one, there might be some tie / drawn matches.</p>
<p>Moreover, there might be some unresponsive tasks in a real game evaluation, and we’ll need to emulate such failures. Let’s use the <code class="highlighter-rouge">sleep</code> method and random errors (exceptions or invalid results), on top of a stochastic strategy that will not always give success to the best agent:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">play</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">agent0</span><span class="p">,</span> <span class="n">agent1</span><span class="p">):</span>
<span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="c"># emulate an unresponsive task</span>
<span class="k">if</span> <span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">()</span> <span class="o"><</span> <span class="mf">0.1</span><span class="p">:</span> <span class="c"># emulate errors with low probability</span>
<span class="k">raise</span> <span class="nb">Exception</span>
<span class="k">if</span> <span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">()</span> <span class="o"><</span> <span class="mf">0.01</span><span class="p">:</span> <span class="c"># return some non-valid values</span>
<span class="k">return</span> <span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">([</span><span class="s">""</span><span class="p">,</span> <span class="o">-</span><span class="mi">10</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="p">{}])</span>
<span class="n">res</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">agent0</span><span class="o">.</span><span class="n">r</span> <span class="o">-</span> <span class="n">agent1</span><span class="o">.</span><span class="n">r</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="c"># result of the match if it was deterministic</span>
<span class="k">if</span> <span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">()</span> <span class="o"><</span> <span class="n">MAX_RAND_PROB</span> <span class="o">-</span> <span class="nb">float</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">agent0</span><span class="o">.</span><span class="n">r</span> <span class="o">-</span> <span class="n">agent1</span><span class="o">.</span><span class="n">r</span><span class="p">))</span> <span class="o">/</span> <span class="n">RATING_RANGE</span><span class="p">:</span> <span class="c"># add some randomness</span>
<span class="k">if</span> <span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">()</span> <span class="o"><</span> <span class="n">MAX_TIE_PROB</span> <span class="o">-</span> <span class="nb">float</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">agent0</span><span class="o">.</span><span class="n">r</span> <span class="o">-</span> <span class="n">agent1</span><span class="o">.</span><span class="n">r</span><span class="p">))</span> <span class="o">/</span> <span class="n">RATING_RANGE</span> <span class="p">:</span> <span class="c"># when players have close level, return tie game with a certain probability</span>
<span class="k">return</span> <span class="bp">None</span>
<span class="k">return</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">res</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="n">res</span>
</code></pre></div></div>
<p>The implementation of our distributed framework should deliver a prediction of the skill values of the agents that should be more or less in the same order as the ground-truth levels we chose behind the scenes.</p>
<p>Note that, once the test game has proved the implementation is correct, any game with the same interface can be used in place of this test game.</p>
<h3 id="running-the-game-matches">Running the game matches</h3>
<p>To submit the game matches to play to the cluster, let’s create a <code class="highlighter-rouge">sketch.py</code> in a Python 3 environment, using Dask API to connect to the cluster:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dask.distributed</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">Client</span><span class="p">(</span><span class="n">scheduler_IP</span> <span class="o">+</span> <span class="s">':8786'</span><span class="p">)</span>
<span class="n">client</span><span class="o">.</span><span class="n">upload_file</span><span class="p">(</span><span class="s">'game.py'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_plays</span><span class="p">):</span>
<span class="n">jobs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">client</span><span class="o">.</span><span class="n">submit</span><span class="p">(</span><span class="n">play</span><span class="p">,</span> <span class="n">game</span><span class="p">,</span> <span class="n">agents</span><span class="p">))</span>
</code></pre></div></div>
<p>Let’s run the matches:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>></span> python3 sketch.py
Connecting to cluster scheduler 172.17.0.2 with workers:
tcp://172.17.0.3:46703 8 cores
tcp://172.17.0.2:34123 8 cores
1000/1000 - Pending: 0, Error: 109, Completed: 891, Elapsed <span class="nb">time</span>: 2.96
Game run <span class="k">in </span>3.97
Skills computed <span class="k">in </span>4.81
Accuracy of the ratings: 1.0
</code></pre></div></div>
<p>More precisely, the final script <code class="highlighter-rouge">sketch.py</code> does the following actions:</p>
<ul>
<li>
<p>it reads the hostname file <code class="highlighter-rouge">ips.txt</code> that acts as our configuration file. The scheduler is considered to be set on the first instance in the list.</p>
</li>
<li>
<p>it prints the cluster configuration, i.e. the available workers and the number of threads per worker. In the default settings, one thread is used per core, in a thread pool on each instance.</p>
</li>
<li>
<p>it runs the matches in a distributed way, connecting to Dask. The elapsed time does not reflect a real game setting: I had difficulty emulating a heavy-computation game with the <code class="highlighter-rouge">sleep</code> method or with operations based on elapsed time, since the OS scheduler, as well as Dask, considers them non-responsive tasks and rotates them with other tasks (that’s my current interpretation of the very sublinear growth of computation times). The message can be seen in the Dask logs: <code class="highlighter-rouge">distributed.core - WARNING - Event loop was unresponsive for 2.86s.</code></p>
</li>
<li>
<p>it computes the skills with the TrueSkill library. We can notice that the time to compute these skills confirms it is not necessary to distribute this computation, which will stay very small compared to the game times in a real-world setting.</p>
</li>
<li>
<p>it estimates the accuracy of the predicted skills. For this estimation, I’m using a very simple algorithm in which I pick all pairs of agents and check whether their skills are aligned with the levels they were assigned. I could not simply compare orderings, since a small error in ordering one pair of players could have an impact on the complete ordering.</p>
</li>
</ul>
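<p>The pairwise accuracy estimation described above can be sketched as follows (a toy illustration with a hypothetical <code class="highlighter-rouge">pairwise_accuracy</code> helper, not the actual code of the repository; <code class="highlighter-rouge">levels</code> are the ground-truth levels and <code class="highlighter-rouge">skills</code> the predicted ratings):</p>

```python
from itertools import combinations

def pairwise_accuracy(levels, skills):
    """Fraction of agent pairs whose predicted skill ordering agrees
    with their ground-truth level ordering (ties count as agreement)."""
    pairs = list(combinations(range(len(levels)), 2))
    correct = sum(
        1 for i, j in pairs
        if (levels[i] - levels[j]) * (skills[i] - skills[j]) >= 0
    )
    return correct / len(pairs)

print(pairwise_accuracy([1, 2, 3, 4], [10.0, 12.0, 11.0, 20.0]))  # 5/6 of the pairs agree
```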
<p>To get more information about the parameters to run the code:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> python3 sketch.py <span class="nt">--help</span>
usage: sketch.py <span class="o">[</span><span class="nt">-h</span><span class="o">]</span> <span class="o">[</span><span class="nt">--num-agents</span> NUM_AGENTS] <span class="o">[</span><span class="nt">--num-matches</span> NUM_MATCHES]
<span class="o">[</span><span class="nt">--ip-file</span> IP_FILE]
optional arguments:
<span class="nt">-h</span>, <span class="nt">--help</span> show this <span class="nb">help </span>message and <span class="nb">exit</span>
<span class="nt">--num-agents</span> NUM_AGENTS
number of players
<span class="nt">--num-matches</span> NUM_MATCHES
number of matches to play
<span class="nt">--ip-file</span> IP_FILE location of the nodes
</code></pre></div></div>
<p>Let’s check that if I use fewer plays, the accuracy drops:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> python3 sketch.py <span class="nt">--num-matches</span><span class="o">=</span>10
Connecting to cluster scheduler 172.17.0.2 with workers:
tcp://172.17.0.2:34123 8 cores
tcp://172.17.0.3:46703 8 cores
10/10 - Pending: 0, Error: 1, Completed: 9, Elapsed <span class="nb">time</span>: 1.03
Game run <span class="k">in </span>2.03
Skills computed <span class="k">in </span>0.05
Accuracy of the ratings: 0.8
</code></pre></div></div>
<p>The accuracy is still high due to the way the algorithm computes it: there are lots of redundant evaluations, since all pairs are checked against each other in <code class="highlighter-rouge">O(n**2)</code>, while a sorting algorithm would only require <code class="highlighter-rouge">O(n log n)</code> comparisons. Anyway, an accuracy of 1 indicates that the order is fully correct.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> python3 sketch.py <span class="nt">--num-matches</span><span class="o">=</span>100
Connecting to cluster scheduler 172.17.0.2 with workers:
tcp://172.17.0.3:46703 8 cores
tcp://172.17.0.2:34123 8 cores
100/100 - Pending: 0, Error: 13, Completed: 87, Elapsed <span class="nb">time</span>: 0.27
Game run <span class="k">in </span>1.27
Ratings computed <span class="k">in </span>0.47
Accuracy of the ratings: 0.91
</code></pre></div></div>
<h3 id="about-my-implementation">About my implementation</h3>
<p>Please note the following points:</p>
<ul>
<li>
<p>there might be other cases to test for distribution reliability, and finding them comes with experience: for example, a play that never finishes. The <code class="highlighter-rouge">check_status()</code> method can exit after a timeout, or once a certain percentage of the matches has finished.</p>
</li>
<li>
<p>the reliability of the cluster itself is the job of the <code class="highlighter-rouge">Dask</code> team</p>
</li>
<li>
<p>here, I do not re-submit plays that failed. In this setting, such failures remain irrelevant and do not disturb the final skill accuracy computation. But in a real-world setting, it is always important to understand all sources of failure, because they might hide bigger problems with a huge impact on the final result.</p>
</li>
<li>
<p>if TrueSkill were computationally costly, or the agents heavy to move from one machine to another, we could have partitioned the tournament, as in a real-world competition, with semi-finals, finals, … and shuffled the partitions after a few matches inside each partition.</p>
</li>
<li>
<p>it is not so easy to test the correctness of the implementation other than by implementing a test game, in particular for tie matches. For example, after a tie game between an agent and itself, starting from the initial skill of 25.0, the TrueSkill skill becomes 25.000000000000004 or 24.999999999999993.</p>
</li>
</ul>
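<p>As an illustration of the first point, a <code class="highlighter-rouge">check_status()</code> loop with a timeout and a completion threshold could look like this (a sketch only; the parameter names are hypothetical and the real method in <code class="highlighter-rouge">sketch.py</code> differs):</p>

```python
import time

def check_status(futures, timeout=60.0, min_completed=0.95, poll=0.5):
    """Wait for the submitted plays, but give up after `timeout` seconds
    or once a fraction `min_completed` of the matches has finished."""
    start = time.time()
    while True:
        done = sum(f.done() for f in futures)
        if done == len(futures):
            return True   # every play finished
        if done / len(futures) >= min_completed:
            return True   # enough plays finished, ignore the stragglers
        if time.time() - start > timeout:
            return False  # some plays never finished
        time.sleep(poll)
```

<p>Anything that exposes a <code class="highlighter-rouge">done()</code> method, such as Dask futures, fits this loop.</p>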
<p>For me, there are still some open questions. I’m not sure:</p>
<ul>
<li>
<p>how TrueSkill updates the skills in a tie / drawn game between an agent and itself, so I removed this case in my implementation, and</p>
</li>
<li>
<p>about the behavior of Dask’s default distributed scheduler and the Python interpreter in the case of unresponsive tasks (a <code class="highlighter-rouge">sleep</code> call or a time-based algorithm). It looks like the Python interpreter optimizes the game’s sequence of instructions, and the pool of threads takes such unresponsive tasks into account. I also tested the resource-limitation feature implemented in Dask, simulating limited-resource workers with the <code class="highlighter-rouge">dask-worker 172.17.0.2:8786 --resources "GPU=8"</code> command line and adding <code class="highlighter-rouge">resources={'GPU': 1}</code> to the job submission, but this did not make the scheduler block and wait. So the only way would be to use really costly operations to emulate a heavy game.</p>
</li>
</ul>
<p>Last but not least, it would be interesting to use TrueSkill’s match quality estimation to reduce the number of matches required to estimate the skills: the more likely a match is to end in a draw, the more informative it is for updating the skills.</p>
<p><strong>Well done!</strong></p>
Fri, 01 Dec 2017 00:00:51 +0000
//christopher5106.github.io/reinforcement/learning/2017/12/01/python-dash-evaluate-true-skills-reinforcement-learning-agents-distributed-cluster.html