Computing and Information Systems - Theses

Efficient orthogonal parametrisation of recurrent neural networks using Householder reflections
Mhammedi, Zakaria (2017)
In machine learning, Recurrent Neural Networks (RNNs) have been used successfully in many applications. They are particularly well suited to problems involving time series, because RNNs process input sequences one element at a time and thus take the chronological order of the elements into account. In many practical prediction tasks involving time series, long-term dependencies in the data are crucial, and these are precisely the features that RNNs need to learn. However, learning long-term dependencies with RNNs remains a major challenge, mainly because of the exploding and vanishing gradient problems, which become more prominent the longer the input sequences are. Recent methods address this problem by constraining the transition matrix to be unitary during training, which ensures that its spectral norm is exactly one. This places an upper bound on the norm of the back-propagated gradients and prevents them from growing exponentially. However, existing methods either have limited expressiveness or scale poorly with the size of the network compared with the simple RNN, especially when using stochastic gradient descent. Our contributions are as follows. We first show that constraining the transition matrix to be unitary is a special case of an orthogonal constraint, so it may not be necessary to work with complex-valued matrices. We then present a new parametrisation of the transition matrix which allows efficient training of an RNN while ensuring that the matrix is always orthogonal. Furthermore, a good trade-off between speed and expressiveness can be chosen by selecting the number of reflection vectors, the vectors used in the new parametrisation. In particular, when the number of reflection vectors equals the size of the hidden layer, the transition matrix can span the full set of orthogonal matrices. Using our approach, one stochastic gradient step can, in the worst case, be performed in time $\mathcal{O}(Tn^2)$, where $T$ and $n$ are the length of the input sequence and the size of the hidden layer, respectively. This is the same time complexity as that of the simple RNN. Finally, we test the new parametrisation on problems with long-term dependencies. Our results suggest that the orthogonal constraint on the transition matrix has benefits similar to those of the unitary constraint.
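
The central computational idea in the abstract is an orthogonal transition matrix represented implicitly as a product of Householder reflections, each applied to the hidden state in linear time. The sketch below is an illustrative NumPy reading of that idea, not the thesis implementation; the function names, sizes, and the simple tanh RNN step are assumptions for illustration only.

```python
import numpy as np

def apply_householder(h, u):
    """Apply the Householder reflection H(u) = I - 2 u u^T / (u^T u) to h.

    H(u) is never formed explicitly; applying it to a hidden state of
    size n costs O(n).
    """
    return h - 2.0 * u * (u @ h) / (u @ u)

def orthogonal_transition(h, reflection_vectors):
    """Apply W h, where W is a product of k Householder reflections.

    Each reflection is orthogonal, so W is orthogonal by construction.
    The total cost is O(k n); per the abstract, taking k equal to the
    hidden size lets W span the full set of orthogonal matrices.
    """
    for u in reflection_vectors:
        h = apply_householder(h, u)
    return h

def rnn_step(h, x, reflection_vectors, V, b):
    """One simple-RNN step, h_t = tanh(W h_{t-1} + V x_t + b), with W orthogonal."""
    return np.tanh(orthogonal_transition(h, reflection_vectors) + V @ x + b)

# Hypothetical sizes, chosen only for illustration.
n, m, k = 8, 4, 8                      # hidden size, input size, number of reflections
rng = np.random.default_rng(0)
reflections = [rng.standard_normal(n) for _ in range(k)]
V = 0.1 * rng.standard_normal((n, m))
b = np.zeros(n)

h = np.zeros(n)
for x in rng.standard_normal((5, m)):  # a length-5 input sequence
    h = rnn_step(h, x, reflections, V, b)
print(h.shape)                          # (8,)
```

In this reading, a smaller number of reflections k gives a cheaper O(kn) transition with restricted expressiveness, while k equal to the hidden size costs O(n^2) per step, the same as a dense matrix-vector product, but keeps the matrix orthogonal by construction, consistent with the worst-case $\mathcal{O}(Tn^2)$ cost per stochastic gradient step quoted in the abstract.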