Abstract
Recurrent neural networks are powerful models for processing sequential data, but they are generally plagued by vanishing and exploding gradient problems. Unitary recurrent neural networks (uRNNs), which use unitary recurrence matrices, have recently been proposed as a means to avoid these issues. However, in previous experiments, the recurrence matrices were restricted to be a product of parameterized unitary matrices, and an open question remains: when does such a parameterization fail to represent all unitary matrices, and how does this restricted representational capacity limit what can be learned? To address this question, we propose full-capacity uRNNs that optimize their recurrence matrix over all unitary matrices, leading to significantly improved performance over uRNNs that use a restricted-capacity recurrence matrix. Our contribution consists of two main components. First, we provide a theoretical argument to determine if a unitary parameterization has restricted capacity. Using this argument, we show that a recently proposed unitary parameterization has restricted capacity for hidden state dimension greater than 7. Second, we show how a complete, full-capacity unitary recurrence matrix can be optimized over the differentiable manifold of unitary matrices. The resulting multiplicative gradient step is very simple and does not require gradient clipping or learning rate adaptation. We confirm the utility of our claims by empirically evaluating our new full-capacity uRNNs on both synthetic and natural data, achieving superior performance compared to both LSTMs and the original restricted-capacity uRNNs.
1 Introduction
Deep feed-forward and recurrent neural networks have been shown to be remarkably effective in a wide variety of problems. A primary difficulty in training using gradient-based methods has been the so-called vanishing or exploding gradient problem, in which the instability of the gradients over multiple layers can impede learning [1, 2]. This problem is particularly acute for recurrent networks, since the repeated use of the recurrent weight matrix can magnify any instability.
This problem has been addressed in the past by various means, including gradient clipping [3], using orthogonal matrices for initialization of the recurrence matrix [4, 5], or by using pioneering architectures such as long short-term memory (LSTM) recurrent networks [6] or gated recurrent units [7]. Recently, several innovative architectures have been introduced to improve information flow in a network: residual networks, which directly pass information from previous layers up in a feed-forward network [8], and attention networks, which allow a recurrent network to access past activations [9]. The idea of using a unitary recurrent weight matrix was introduced so that the gradients are inherently stable and do not vanish or explode [10]. The resulting unitary recurrent neural network (uRNN) is complex-valued and uses a complex form of the rectified linear activation function. However, this idea was investigated using, as we show, a potentially restricted form of unitary matrices.
The two main components of our contribution can be summarized as follows:
1) We provide a theoretical argument to determine the smallest dimension N for which any parameterization of the unitary recurrence matrix does not cover the entire set of all unitary matrices. The argument relies on counting real-valued parameters and using Sard's theorem to show that the smooth map from these parameters to the unitary manifold is not onto. Using this argument, we show that a previously proposed parameterization [10] cannot represent all unitary matrices larger than 7×7; such a parameterization results in what we refer to as a restricted-capacity unitary recurrence matrix.
2) To overcome the limitations of restricted-capacity parameterizations, we propose a new method for stochastic gradient descent for training the unitary recurrence matrix, which constrains the gradient to lie on the differentiable manifold of unitary matrices. This approach allows us to directly optimize a complete, or full-capacity, unitary matrix. Neither restricted-capacity nor full-capacity unitary matrix optimization requires gradient clipping. Furthermore, full-capacity optimization still achieves good results without adaptation of the learning rate during training.
To test the limitations of a restricted-capacity representation and to confirm that our full-capacity uRNN does have practical implications, we test restricted-capacity and full-capacity uRNNs on both synthetic and natural data tasks. These tasks include synthetic system identification, long-term memorization, frame-to-frame prediction of speech spectra, and pixel-by-pixel classification of handwritten digits. Our proposed full-capacity uRNNs generally achieve equivalent or superior performance on synthetic and natural data compared to both LSTMs [6] and the original restricted-capacity uRNNs [10].
In the next section, we give an overview of unitary recurrent neural networks. Section 3 presents our first contribution: the theoretical argument to determine if any unitary parameterization has restricted capacity. Section 4 describes our second contribution, where we show how to optimize a full-capacity unitary matrix. We confirm our results with simulated and natural data in Section 5 and present our conclusions in Section 6.
2 Unitary recurrent neural networks
The uRNN proposed by Arjovsky et al. [10] consists of the following nonlinear dynamical system that has real- or complex-valued inputs $x_t$ of dimension $M$, complex-valued hidden states $h_t$ of dimension $N$, and real- or complex-valued outputs $y_t$ of dimension $L$:
$$h_t = \sigma_b\!\left(W h_{t-1} + V x_t\right), \quad (1)$$
$$y_t = U h_t + c,$$
where $y_t = \mathrm{Re}\{U h_t + c\}$ if the outputs $y_t$ are real-valued. The element-wise nonlinearity $\sigma_b$ is
$$[\sigma_b(z)]_i = \begin{cases} (|z_i| + b_i)\dfrac{z_i}{|z_i|}, & \text{if } |z_i| + b_i > 0,\\ 0, & \text{otherwise.} \end{cases} \quad (2)$$
Note that this nonlinearity is a soft-thresholding of the magnitude using the bias vector $b$; hard-thresholding would instead set the output of $\sigma_b$ to $z_i$ whenever $|z_i| + b_i > 0$. The parameters of the uRNN are as follows: $W \in \mathbb{U}(N)$, the unitary hidden state transition matrix; $V \in \mathbb{C}^{N\times M}$, the input-to-hidden transformation; $b \in \mathbb{R}^N$, the nonlinearity bias; $U \in \mathbb{C}^{L\times N}$, the hidden-to-output transformation; and $c \in \mathbb{C}^L$, the output bias.
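To make the soft-thresholding in (2) concrete, here is a minimal NumPy sketch of the nonlinearity; the small constant `eps`, the function name, and the vectorized form are our own assumptions rather than part of the original implementation.

```python
import numpy as np

def mod_relu(z, b, eps=1e-8):
    """Soft-threshold the magnitude of each complex entry z_i by the bias b_i, as in (2).

    z: complex array, the pre-activation W h_{t-1} + V x_t
    b: real array of the same length, the trainable bias
    eps guards against division by zero when |z_i| = 0 (an implementation detail, not from the paper).
    """
    mag = np.abs(z)
    scale = np.maximum(mag + b, 0.0) / (mag + eps)  # zero whenever |z_i| + b_i <= 0
    return scale * z
```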
Arjovsky et al. [10] propose the following parameterization of the unitary matrix W:
$$W_u(\theta_u) = D_3 R_2 F^{-1} D_2 P R_1 F D_1, \quad (3)$$
where the $D$ are diagonal unitary matrices, the $R$ are Householder reflection matrices [11], $F$ is a discrete Fourier transform (DFT) matrix, and $P$ is a permutation matrix. The resulting matrix $W_u$ is unitary because all its component matrices are unitary. This decomposition is efficient because multiplication by diagonal, reflection, and permutation matrices costs $O(N)$, and DFTs can be computed in $O(N\log N)$ time using the fast Fourier transform (FFT). The parameter vector $\theta_u$ consists of $7N$ real-valued parameters: $N$ parameters for each of the 3 diagonal matrices, where $D_{i,i} = e^{j\theta_i}$, and $2N$ parameters for each of the 2 Householder reflection matrices, namely the real and imaginary parts of the complex reflection vectors $u_i$: $R_i = I - 2\frac{u_i u_i^H}{\langle u_i, u_i\rangle}$.
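As an illustration only, the following NumPy sketch assembles a matrix of the form (3) from its 7N real parameters and checks that the result is unitary; the unitary DFT normalization, variable names, and helper functions are our assumptions, not the reference code of [10].

```python
import numpy as np

def householder(u):
    # Complex Householder reflection R = I - 2 u u^H / <u, u>
    u = u.reshape(-1, 1)
    return np.eye(len(u)) - 2.0 * (u @ u.conj().T) / (u.conj().T @ u)

def restricted_unitary(theta1, theta2, theta3, u1, u2, perm):
    """Assemble W_u = D3 R2 F^{-1} D2 P R1 F D1 from 7N real parameters, as in (3)."""
    N = len(theta1)
    D1, D2, D3 = (np.diag(np.exp(1j * t)) for t in (theta1, theta2, theta3))
    F = np.fft.fft(np.eye(N)) / np.sqrt(N)   # unitary DFT matrix (normalization assumed)
    P = np.eye(N)[perm]                      # fixed permutation matrix
    R1, R2 = householder(u1), householder(u2)
    return D3 @ R2 @ F.conj().T @ D2 @ P @ R1 @ F @ D1

# Sanity check: the composition is unitary for any parameter setting.
rng = np.random.default_rng(0)
N = 8
thetas = [rng.uniform(-np.pi, np.pi, N) for _ in range(3)]
u1 = rng.standard_normal(N) + 1j * rng.standard_normal(N)
u2 = rng.standard_normal(N) + 1j * rng.standard_normal(N)
W = restricted_unitary(*thetas, u1, u2, rng.permutation(N))
assert np.allclose(W.conj().T @ W, np.eye(N), atol=1e-12)
```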
3 Estimating the representation capacity of structured unitary matrices
In this section, we state and prove a theorem that can be used to determine when any particular unitary parameterization does not have the capacity to represent all unitary matrices. As an application of this theorem, we show that the parameterization (3) does not have the capacity to cover all N×N unitary matrices for N>7. First, we establish an upper bound on the number of real-valued parameters required to represent any N×N unitary matrix. Then, we state and prove our theorem.
Lemma 3.1
The set of all N×N unitary matrices is a manifold of dimension $N^2$.
Proof: The set of all N×N unitary matrices is the well-known unitary Lie group $\mathbb{U}(N)$ [12, §3.4]. A Lie group identifies group elements with points on a differentiable manifold [12, §2.2]. The dimension of the manifold is equal to the dimension of the Lie algebra $\mathfrak{u}$, which is a vector space that is the tangent space at the identity element [12, §4.5]. For $\mathbb{U}(N)$, the Lie algebra consists of all skew-Hermitian matrices $A$ [12, §5.4]. A skew-Hermitian matrix is any $A \in \mathbb{C}^{N\times N}$ such that $A = -A^H$, where $(\cdot)^H$ denotes the conjugate transpose. To determine the dimension of $\mathbb{U}(N)$, we can determine the dimension of $\mathfrak{u}$. Because of the skew-Hermitian constraint, the diagonal elements of $A$ are purely imaginary, which corresponds to $N$ real-valued parameters. Also, since $A_{i,j} = -A_{j,i}^*$, the strictly upper triangular part of $A$ determines the strictly lower triangular part and is parameterized by $\frac{N(N-1)}{2}$ complex numbers, which corresponds to an additional $N^2 - N$ real parameters. Thus, $\mathbb{U}(N)$ is a manifold of dimension $N^2$.
Theorem 3.2
If a family of N×N unitary matrices is parameterized by $P$ real-valued parameters with $P < N^2$, then it cannot contain all N×N unitary matrices.
Proof: We consider a family of unitary matrices that is parameterized by $P$ real-valued parameters through a smooth map $g: \mathcal{P} \to \mathbb{U}(N)$ from the space of parameters $\mathcal{P}$ to the space of all N×N unitary matrices $\mathbb{U}(N)$. The parameter space $\mathcal{P}$ is a $P$-dimensional manifold, while $\mathbb{U}(N)$ is an $N^2$-dimensional manifold by Lemma 3.1. Then, if $P < N^2$, Sard's theorem [13] implies that the image $g(\mathcal{P})$ of $g$ is of measure zero in $\mathbb{U}(N)$, and in particular $g$ is not onto. Since $g$ is not onto, there must exist a unitary matrix $W \in \mathbb{U}(N)$ for which there is no parameter vector $p \in \mathcal{P}$ such that $W = g(p)$. Thus, if $P < N^2$, the parameterization cannot represent all unitary matrices in $\mathbb{U}(N)$.
We now apply Theorem 3.2 to the parameterization (3). Note that the parameterization (3) has $P = 7N$ real-valued parameters. If we solve $7N < N^2$ for $N$, we get $N > 7$. Thus, the parameterization (3) cannot represent all unitary matrices for dimension $N > 7$.
4 Optimizing full-capacity unitary matrices on the Stiefel manifold
In this section, we show how to get around the limitations of restricted-capacity parameterizations and directly optimize a full-capacity unitary matrix. We consider the Stiefel manifold of all N×N complex-valued matrices whose columns are $N$ orthonormal vectors in $\mathbb{C}^N$ [14]. Mathematically, the Stiefel manifold is defined as
$$V_N(\mathbb{C}^N) = \left\{ W \in \mathbb{C}^{N\times N} : W^H W = I_{N\times N} \right\}. \quad (4)$$
For any $W \in V_N(\mathbb{C}^N)$, any matrix $Z$ in the tangent space $T_W V_N(\mathbb{C}^N)$ of the Stiefel manifold satisfies $Z^H W + W^H Z = 0$ [14]. The Stiefel manifold becomes a Riemannian manifold when its tangent space is equipped with an inner product. Tagare [14] suggests using the canonical inner product, given by
$$\langle Z_1, Z_2 \rangle_c = \operatorname{tr}\!\left(Z_1^H \left(I - \tfrac{1}{2} W W^H\right) Z_2\right). \quad (5)$$
Under this canonical inner product on the tangent space, the gradient in the Stiefel manifold of the loss function $f$ with respect to the matrix $W$ is $AW$, where $A = G W^H - W G^H$ is a skew-Hermitian matrix and $G$, with $G_{i,j} = \frac{\partial f}{\partial W_{i,j}}$, is the usual gradient of the loss function $f$ with respect to the matrix $W$ [14]. Using these facts, Tagare [14] suggests a descent curve along the Stiefel manifold at training iteration $k$, given by the matrix product of the Cayley transformation of $A^{(k)}$ with the current solution $W^{(k)}$:
$$Y^{(k)}(\lambda) = \left(I + \tfrac{\lambda}{2} A^{(k)}\right)^{-1}\left(I - \tfrac{\lambda}{2} A^{(k)}\right) W^{(k)}, \quad (6)$$
where $\lambda$ is a learning rate and $A^{(k)} = G^{(k)} W^{(k)H} - W^{(k)} G^{(k)H}$. Gradient descent proceeds by performing the updates $W^{(k+1)} = Y^{(k)}(\lambda)$. Tagare [14] suggests an Armijo-Wolfe search along the curve to adapt $\lambda$, but such a procedure would be expensive for neural network optimization since it requires multiple evaluations of the forward model and gradients. We found that simply using a fixed learning rate $\lambda$ often works well. Also, RMSprop-style scaling of the gradient $G^{(k)}$ by a running average of the previous gradients' norms [15] before applying the multiplicative step (6) can improve convergence. The only additional substantial computation required beyond the forward and backward passes of the network is the N×N matrix inverse in (6).
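A minimal sketch of one multiplicative update (6) is given below, assuming `G` holds the usual gradient of the loss with respect to W obtained by backpropagation; this is our own illustration, and in practice the optional RMSprop-style scaling described above would be applied to `G` before forming A.

```python
import numpy as np

def stiefel_cayley_step(W, G, lr):
    """One full-capacity update W <- (I + lr/2 A)^{-1} (I - lr/2 A) W, as in eq. (6).

    W:  current N x N unitary recurrence matrix (complex)
    G:  gradient of the loss with respect to W (same shape as W)
    lr: fixed learning rate lambda
    """
    A = G @ W.conj().T - W @ G.conj().T          # skew-Hermitian: A^H = -A
    N = W.shape[0]
    I = np.eye(N, dtype=W.dtype)
    # Solve (I + lr/2 A) W_new = (I - lr/2 A) W instead of forming the inverse explicitly.
    return np.linalg.solve(I + (lr / 2.0) * A, (I - (lr / 2.0) * A) @ W)
```

Because A is skew-Hermitian, $I + \frac{\lambda}{2}A$ is always invertible, so the linear solve is well posed and the updated matrix stays (numerically) unitary.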
5 Experiments
All models are implemented in Theano [16], based on the implementation of restricted-capacity uRNNs by [10], available from https://github.com/amarshah/complex_RNN. All code to replicate our results is available from https://github.com/stwisdom/urnn. All models use RMSprop [15] for optimization, except that full-capacity uRNNs optimize their recurrence matrices with a fixed learning rate using the update step (6) and optional RMSprop-style gradient normalization.
5.1 Synthetic data
First, we compare the performance of full-capacity uRNNs to restricted-capacity uRNNs and LSTMs on two tasks with synthetic data. The first task is synthetic system identification, where a uRNN must learn the dynamics of a target uRNN given only samples of the target uRNN’s inputs and outputs. The second task is the copy memory problem, in which the network must recall a sequence of data after a long period of time.
System identification
For the task of system identification, we consider the problem of learning the dynamics of a nonlinear dynamical system that has the form (1), given a dataset of inputs and outputs of the system. We will draw a true system Wsys randomly from either a constrained set Wu of restricted-capacity unitary matrices using the parameterization Wu(θu) in (3) or from a wider set Wg of restricted-capacity unitary matrices that are guaranteed to lie outside Wu. We sample from Wg by taking a matrix product of two unitary matrices drawn from Wu.
We use a sequence length of T=150, and we set the input dimension M and output dimension L both equal to the hidden state dimension N. The input-to-hidden transformation V and hidden-to-output transformation U are both set to identity, the output bias c is set to 0, the initial state is set to 0, and the hidden bias b is drawn from a uniform distribution on the interval [−0.11,−0.09]. The hidden bias has a mean of −0.1 to ensure stability of the system outputs. Inputs are generated by sampling length-T i.i.d. sequences of N-dimensional circular complex-valued Gaussian vectors with zero mean and identity covariance. The outputs are created by running the system (1) forward on the inputs.
We compare a restricted-capacity uRNN using the parameterization from (3) and a full-capacity uRNN using Stiefel manifold optimization with no gradient normalization, as described in Section 4. We choose hidden state dimensions N ∈ {4, 6, 7, 8, 16} to probe the critical dimension of $W_u(\theta_u)$ in (3) predicted by our argument in Section 3. These dimensions are chosen to test below, at, and above the critical dimension of 7.
For all experiments, the number of training, validation, and test sequences are 20000, 1000, and 1000, respectively.Mean-squared error (MSE) is used as the loss function. The learning rate is 0.001 with a batch size of 50 for all experiments.Both models use the same matrix drawn from Wu as initialization. To isolate the effect of unitary recurrence matrix capacity, we only optimize W, setting all other parameters to true oracle values. For each method, we report the best test loss over 100 epochs and over 6 random initializations for the optimization.
The results are shown in Table 1. “Wsys init.” refers to the initialization of the true system unitary matrix Wsys, which is sampled from either the restricted-capacity set Wu or the wider set Wg.
Notice that for N<7, the restricted-capacity uRNN achieves comparable or better performance than the full-capacity uRNN. At N=7, the restricted-capacity and full-capacity uRNNs achieve relatively comparable performance, with the full-capacity uRNN achieving slightly lower error. For N>7, the full-capacity uRNN always achieves better performance than the restricted-capacity uRNN. This result confirms our theoretical argument that the restricted-capacity parameterization in (3) lacks the capacity to model all matrices in the unitary group for N>7, and it indicates the advantage of using a full-capacity unitary recurrence matrix.
Copy memory problem
The experimental setup follows the copy memory problem from [10], which itself was based on the experiment from [6]. We consider alternative hidden state dimensions and extend the sequence lengths to T=1000 and T=2000, which are longer than the maximum length of T=750 considered in previous literature.
In this task, the data is a vector of length T+20 consisting of elements from 10 categories. The vector begins with a sequence of 10 symbols sampled uniformly from categories 1 to 8. The next T−1 elements of the vector are the ninth, 'blank' category, followed by an element from the tenth category, the 'delimiter'. The remaining ten elements are 'blank'. The task is to output T+10 blank characters followed by the sequence from the beginning of the vector. We use average cross entropy as the training loss function. The baseline solution outputs the blank category for T+10 time steps and then guesses a random symbol uniformly from the first eight categories. This baseline has an expected average cross entropy of $\frac{10\log(8)}{T+20}$.
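For illustration, here is a sketch of how one copy-memory example and the baseline cross entropy could be generated; the particular category indices and encoding are our own assumptions.

```python
import numpy as np

def copy_memory_example(T, rng, seq_len=10, n_symbols=8):
    """One input/target pair for the copy memory problem with delay T.

    Categories 0..7 are data symbols, 8 is 'blank', 9 is the 'delimiter'.
    """
    symbols = rng.integers(0, n_symbols, size=seq_len)
    blank, delimiter = n_symbols, n_symbols + 1
    x = np.full(T + 2 * seq_len, blank, dtype=np.int64)
    x[:seq_len] = symbols                      # sequence to remember
    x[T + seq_len - 1] = delimiter             # delimiter after T-1 blanks
    y = np.full(T + 2 * seq_len, blank, dtype=np.int64)
    y[-seq_len:] = symbols                     # recall the sequence at the end
    return x, y

# Baseline: output 'blank' everywhere, then guess uniformly over the 8 symbols.
T = 1000
baseline_ce = 10 * np.log(8) / (T + 20)        # ~0.020 for T=1000, ~0.010 for T=2000
x, y = copy_memory_example(T, np.random.default_rng(0))
```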
The full-capacity uRNN uses a hidden state size of N=128 with no gradient normalization. To match the number of parameters (≈22k), we use N=470 for the restricted-capacity uRNN and N=68 for the LSTM. The training set size is 100000 and the test set size is 10000. The results of the T=1000 experiment are shown in the left half of Figure 1. The full-capacity uRNN converges to a solution with zero average cross entropy after about 2000 training iterations, whereas the restricted-capacity uRNN settles to the baseline solution of 0.020. The results of the T=2000 experiment are shown in the right half of Figure 1. The full-capacity uRNN hovers around the baseline solution for about 5000 training iterations, after which it drops down to zero average cross entropy. The restricted-capacity uRNN again settles down to the baseline solution of 0.010. These results demonstrate that the full-capacity uRNN is very effective for problems requiring very long memory.
5.2 Speech data
We now apply restricted-capacity and full-capacity uRNNs to real-world speech data and compare their performance to LSTMs.The main task we consider is predicting the log-magnitude of future frames of a short-time Fourier transform (STFT).The STFT is a commonly used feature domain for speech enhancement, and is defined as the Fourier transform of short windowed frames of the time series. In the STFT domain, a real-valued audio signal is represented as a complex-valued F×T matrix composed of T frames that are each composed of F=Nwin/2+1 frequency bins, where Nwin is the duration of the time-domain frame. Most speech processing algorithms use the log-magnitude of the complex STFT values and reconstruct the processed audio signal using the phase of the original observations.
The frame prediction task is as follows: given all the log-magnitudes of STFT frames up to time t, predict the log-magnitude of the STFT frame at time t+1. We use the TIMIT dataset [17]. Following common practice [18], we use a training set with 3690 utterances from 462 speakers, a validation set of 400 utterances, and an evaluation set of 192 utterances. Training, validation, and evaluation sets have distinct speakers. Results are reported on the evaluation set using the network parameters that perform best on the validation set in terms of the loss function over three training trials. All TIMIT audio is resampled to 8 kHz. The STFT uses a Hann analysis window of 256 samples (32 milliseconds) and a window hop of 128 samples (16 milliseconds).
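As a hedged illustration of the data preparation, the sketch below builds next-frame prediction pairs from the log-magnitude STFT using SciPy; the exact STFT scaling, padding, and floor constant `eps` are assumptions and may differ from the paper's feature extraction.

```python
import numpy as np
from scipy.signal import stft

def frame_prediction_pairs(audio, fs=8000, n_win=256, hop=128, eps=1e-8):
    """Build (input, target) pairs for next-frame log-magnitude STFT prediction.

    Returns log-magnitude frames of shape (T, F), F = n_win // 2 + 1;
    inputs are frames 0..T-2 and targets are frames 1..T-1.
    """
    _, _, Z = stft(audio, fs=fs, window='hann', nperseg=n_win,
                   noverlap=n_win - hop)          # Z has shape (F, T)
    logmag = np.log(np.abs(Z) + eps).T            # (T, F); eps avoids log(0)
    return logmag[:-1], logmag[1:]
```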
The LSTM requires gradient clipping during optimization, while the restricted-capacity and full-capacity uRNNs do not. The hidden state dimensions N of the LSTM are chosen to match the number of parameters of the full-capacity uRNN. For the restricted-capacity uRNN, we run models that match either N or number of parameters. For the LSTM and restricted-capacity uRNNs, we use RMSprop [15] with a learning rate of 0.001, momentum 0.9, and averaging parameter 0.1. For the full-capacity uRNN, we also use RMSprop to optimize all network parameters, except for the recurrence matrix, for which we use stochastic gradient descent along the Stiefel manifold using the update (6) with a fixed learning rate of 0.001 and no gradient normalization.
Results are shown in Table 2, and Figure 2 shows example predictions of the three types of networks. Results in Table 2 are given in terms of the mean-squared error (MSE) loss function and several metrics computed on the time-domain signals, which are reconstructed from the predicted log-magnitude and the original phase of the STFT. These time-domain metrics are segmental signal-to-noise ratio (SegSNR), short-time objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ). SegSNR, computed using [19], uses a voice activity detector to avoid measuring SNR in silent frames.STOI is designed to correlate well with human intelligibility of speech, and takes on values between 0 and 1, with a higher score indicating higher intelligibility [20]. PESQ is the ITU-T standard for telephone voice quality testing [21, 22], and is a popular perceptual quality metric for speech enhancement [23]. PESQ ranges from 1 (bad quality) to 4.5 (no distortion).
Note that full-capacity uRNNs generally perform better than restricted-capacity uRNNs with the same number of parameters, and both types of uRNN significantly outperform LSTMs.
5.3 Pixel-by-pixel MNIST
As another challenging long-term memory task with natural data, we test the performance of LSTMs and uRNNs on pixel-by-pixel MNIST and permuted pixel-by-pixel MNIST, first proposed by [5] and used by [10] to test restricted-capacity uRNNs. For permuted pixel-by-pixel MNIST, the pixels are shuffled, thereby creating some non-local dependencies between pixels in an image. Since the MNIST images are 28×28 pixels, the resulting pixel-by-pixel sequences are T=784 elements long. We use 5000 of the 60000 training examples as a validation set to perform early stopping with a patience of 5. The loss function is cross-entropy. Weights with the best validation loss are used to process the evaluation set. The full-capacity uRNN uses RMSprop-style gradient normalization.
Learning curves are shown in Figure 3, and a summary of classification accuracies is shown in Table 3. For the unpermuted task, the LSTM with N=256 achieves the best evaluation accuracy of 98.2%. For the permuted task, the full-capacity uRNN with N=512 achieves the best evaluation accuracy of 94.1%, which is state-of-the-art on this task. Both uRNNs outperform LSTMs on the permuted case, achieving their best performance after fewer training epochs and using an equal or smaller number of trainable parameters. This performance difference suggests that LSTMs are only able to model local dependencies, while uRNNs have superior long-term memory capabilities. Despite not representing all unitary matrices, the restricted-capacity uRNN with N=512 still achieves an impressive test accuracy of 93.3% with only 1/16 of the trainable parameters, outperforming the full-capacity uRNN with N=116 that matches its number of parameters. This result suggests that further exploration into the potential trade-off between hidden state dimension N and capacity of unitary parameterizations is necessary.
6 Conclusion
Unitary recurrent matrices prove to be an effective means of addressing the vanishing and exploding gradient problems. We provided a theoretical argument to quantify the capacity of constrained unitary matrices. We also described a method for directly optimizing a full-capacity unitary matrix by constraining the gradient to lie on the differentiable manifold of unitary matrices. The effect of restricting the capacity of the unitary weight matrix was tested on system identification and memory tasks, in which full-capacity unitary recurrent neural networks (uRNNs) outperformed restricted-capacity uRNNs from [10] as well as LSTMs. Full-capacity uRNNs also outperformed restricted-capacity uRNNs on log-magnitude STFT prediction of natural speech signals and classification of permuted pixel-by-pixel images of handwritten digits, and both types of uRNN significantly outperformed LSTMs. In future work, we plan to explore more general forms of restricted-capacity unitary matrices, including constructions based on products of elementary unitary matrices such as Householder operators or Givens operators.
Acknowledgments:We thank an anonymous reviewer for suggesting improvements to our proof in Section 3 and Vamsi Potluru for helpful discussions. Scott Wisdom and Thomas Powers were funded by U.S. ONR contract number N00014-12-G-0078, delivery orders 13 and 24. Les Atlas was funded by U.S. ARO grant W911NF-15-1-0450.
References
- Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
- S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, eds., A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
- R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. arXiv:1211.5063, Nov. 2012.
- A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, Dec. 2013.
- Q. V. Le, N. Jaitly, and G. E. Hinton. A simple way to initialize recurrent networks of rectified linear units. arXiv:1504.00941, Apr. 2015.
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259, 2014.
- K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, Dec. 2015.
- V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Advances in Neural Information Processing Systems (NIPS), pp. 2204–2212, 2014.
- M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning (ICML), Jun. 2016.
- A. S. Householder. Unitary triangularization of a nonsymmetric matrix. Journal of the ACM, 5(4):339–342, 1958.
- R. Gilmore. Lie Groups, Physics, and Geometry: An Introduction for Physicists, Engineers and Chemists. Cambridge University Press, 2008.
- A. Sard. The measure of the critical values of differentiable maps. Bulletin of the American Mathematical Society, 48(12):883–890, 1942.
- H. D. Tagare. Notes on optimization on Stiefel manifolds. Technical report, Yale University, 2011.
- T. Tieleman and G. Hinton. Lecture 6.5 - RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv:1605.02688, May 2016.
- J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett. DARPA TIMIT acoustic-phonetic continuous speech corpus. Technical Report NISTIR 4930, National Institute of Standards and Technology, 1993.
- A. K. Halberstadt. Heterogeneous acoustic measurements and multiple classifiers for speech recognition. PhD thesis, Massachusetts Institute of Technology, 1998.
- M. Brookes. VOICEBOX: Speech processing toolbox for MATLAB, 2002. [Online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html.
- C. Taal, R. Hendriks, R. Heusdens, and J. Jensen. An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Trans. on Audio, Speech, and Language Processing, 19(7):2125–2136, Sep. 2011.
- A. Rix, J. Beerends, M. Hollier, and A. Hekstra. Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. In Proc. ICASSP, vol. 2, pp. 749–752, 2001.
- ITU-T P.862. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, 2000.
- P. C. Loizou. Speech Enhancement: Theory and Practice. CRC Press, Boca Raton, FL, Jun. 2007.
Abstract
We consider the problem of sampling from posterior distributions for Bayesian models where some parameters are restricted to be orthogonal matrices. Such matrices are sometimes used in neural network models for regularization and for stabilizing the training procedure, and they can also parameterize matrices of bounded rank, positive-definite matrices, and others. Byrne & Girolami (2013) have already considered sampling from distributions over manifolds using exact geodesic flows in a scheme similar to Hamiltonian Monte Carlo (HMC). We propose a new sampling scheme for the set of orthogonal matrices that is based on the same approach, uses ideas from Riemannian optimization, and does not require exact computation of geodesic flows. The method is theoretically justified by a proof of symplecticity for the proposed iteration. In experiments we show that the new scheme is comparable or faster in time per iteration and more sample-efficient compared to conventional HMC with an explicit orthogonal parameterization and to Geodesic Monte-Carlo. We also provide promising results of Bayesian ensembling for orthogonal neural networks and low-rank matrix factorization.
Hamiltonian Monte-Carlo for Orthogonal Matrices
Viktor Yanush, Dmitry Kropotov
Correspondence to: Viktor Yanush <[email protected]>.

Introduction
In this paper we consider the problem of sampling from posterior distributions for Bayesian models where some parameters are restricted to be orthogonal matrices. Examples of such models include orthogonal recurrent neural networks (Arjovsky et al., 2016), where an orthogonal hidden-to-hidden matrix is used to mitigate the problems of exploding or vanishing gradients, and orthogonal feed-forward neural networks (Huang et al., 2017), where orthogonal matrices in fully-connected or convolutional layers provide additional regularization and stabilize training. Besides, orthogonal matrices can be used in parameterizations of matrices of bounded rank, positive-definite matrices, or tensor decompositions.
It is known that the set of orthogonal matrices forms a Riemannian manifold (Tagare, 2011). From an optimization point of view, exploiting the properties of such manifolds can accelerate optimization procedures in many cases. Examples include optimization with respect to low-rank matrices (Vandereycken, 2013) and tensors in the tensor train format (Steinlechner, 2015), the K-FAC approach for training neural networks of different architectures (Martens & Grosse, 2015; Martens et al., 2018), and many others. Hence, it is desirable to take the Riemannian manifold structure of orthogonal matrices into account within sampling procedures as well.
One of the major approaches for sampling in Bayesian models is Hamiltonian Monte Carlo (HMC) (Neal et al., 2011). It can be applied in the stochastic setting (Chen et al., 2014) and can account for Riemannian geometry (Ma et al., 2015). However, applying these methods to sampling with respect to orthogonal matrices requires an explicit parameterization of the corresponding manifold. In the case of square matrices, one possible unambiguous parameterization is given in (Lucas, 2011); however, it requires computing a matrix exponential, which is impractical for large matrices. Another option is the parameterization $Q = X(X^TX)^{-1/2}$, where X is an arbitrary (possibly rectangular) matrix. This parameterization is non-unique and, as we show in this paper, can lead to slow exploration of the distribution in practice.
In this paper we propose a new sampling method for Bayesian models with orthogonal matrices. We extend the HMC approach using ideas from Riemannian optimization for orthogonal matrices (Tagare, 2011). The main idea of the proposed method is as follows. The basic Riemannian HMC approach (Girolami & Calderhead, 2011) introduces auxiliary momentum variables that are restricted to lie in the tangent space at the current parameter values. Given a transformation of the parameters, we apply a vector transport that maps the momentum variables to the tangent space at the new parameter values. Our method is an extension of the previously known Geodesic Monte-Carlo (Geodesic HMC, gHMC) (Byrne & Girolami, 2013); the major difference is that the step along the geodesic is replaced with a retraction step, which is much cheaper computationally. We prove that the proposed transformation of parameters and momentum variables is symplectic, a sufficient condition for the correctness of the HMC sampling procedure.
We show in experiments that the proposed sampling method is considerably more sample-efficient than standard HMC with the parameterization $Q = X(X^TX)^{-1/2}$, both in the non-stochastic and stochastic regimes. We also consider Bayesian orthogonal neural networks for the MNIST and CIFAR-10 datasets and show that Bayesian ensembling can improve classification accuracy. For the sake of generality, we also consider the low-rank matrix factorization problem and show that Bayesian ensembling improves prediction quality there as well.
Method

In this section we first review the conventional Hamiltonian Monte-Carlo approach and discuss some properties of the manifold of orthogonal matrices. Then we describe our methods, called Orthogonal HMC (oHMC) and Orthogonal Stochastic Gradient HMC (oSGHMC), for the non-stochastic and stochastic regimes, respectively.
Hamiltonian Monte-Carlo

Suppose we want to sample from a distribution $\pi(\theta) \propto \exp(-U(\theta))$ which we know only up to a normalizing constant. For example, we may want to sample from the posterior distribution of model parameters given data, i.e. $U(\theta) = -\log p(\mathrm{Data} \mid \theta) - \log p(\theta)$. In this situation we can resort to the Hamiltonian Monte-Carlo approach, which is essentially a Metropolis algorithm with a special proposal distribution constructed using Hamiltonian dynamics. Consider the following joint distribution of $\theta$ and an auxiliary variable $r$:
$$\pi(\theta, r) \propto \exp\!\left(-U(\theta) - \tfrac{1}{2} r^T M^{-1} r\right) = \exp(-H(\theta, r)).$$
The function $H(\theta, r)$ is called the Hamiltonian. To sample from $\pi(\theta)$, it is enough to sample from the joint distribution $\pi(\theta, r)$ and then discard the auxiliary variable $r$. Since the joint distribution factorizes as
$$\pi(\theta, r) = \pi(r)\,\pi(\theta \mid r),$$
we can sample $r \sim \mathcal{N}(0, M)$ and then sample $\theta$ from $\pi(\theta \mid r)$. To do this we simulate the Hamiltonian dynamics:
$$\frac{d\theta}{dt} = \frac{\partial H}{\partial r} = M^{-1} r, \quad (1)$$
$$\frac{dr}{dt} = -\frac{\partial H}{\partial \theta} = \nabla_\theta \log\pi(\theta). \quad (2)$$
It can be shown that the solution of these differential equations preserves the value of the Hamiltonian, i.e. $H(\theta(t), r(t)) = \mathrm{const}$. Hence we can start from $r(0) = r$ and $\theta(0)$ equal to the previous $\theta$, take $(\theta(t), r(t))$ at some $t$ as the new proposal, and accept or reject it using the standard Metropolis formula. Since $H(\theta(t), r(t)) = H(\theta(0), r(0))$, the new point would be accepted with probability $p_{\text{accept}} = 1$.
In practice, the differential equations (1)-(2) cannot be integrated analytically for arbitrary distributions $\pi(\theta)$. However, there exists a special class of numerical integrators for Hamiltonian equations, called symplectic integrators. Such methods are known to approximately conserve the Hamiltonian of the system and produce quite accurate solutions (Hairer, 2006). The most popular symplectic method for HMC is the leapfrog integrator. One iteration of this method looks as follows:
$$r_{n+\frac{1}{2}} = r_n + \frac{\varepsilon}{2}\nabla_\theta \log\pi(\theta_n), \quad (3)$$
$$\theta_{n+1} = \theta_n + \varepsilon M^{-1} r_{n+\frac{1}{2}}, \quad (4)$$
$$r_{n+1} = r_{n+\frac{1}{2}} + \frac{\varepsilon}{2}\nabla_\theta \log\pi(\theta_{n+1}). \quad (5)$$
Since numerical methods introduce some error, we need to calculate the Metropolis acceptance probability using the old and new Hamiltonians:
$$\rho = \exp\!\left(H(\theta_{\text{old}}, r_{\text{old}}) - H(\theta_{\text{new}}, r_{\text{new}})\right).$$
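For reference, here is a minimal sketch of one HMC transition using the leapfrog steps (3)-(5) and the Metropolis correction, assuming an identity mass matrix M = I; the function names are illustrative.

```python
import numpy as np

def hmc_proposal(theta, log_pi, grad_log_pi, eps, n_steps, rng):
    """One HMC transition with leapfrog integration and Metropolis acceptance (M = I)."""
    r = rng.standard_normal(theta.shape)                    # r ~ N(0, I)
    H_old = -log_pi(theta) + 0.5 * np.sum(r ** 2)
    th = theta.copy()
    for _ in range(n_steps):
        r = r + 0.5 * eps * grad_log_pi(th)                 # eq. (3)
        th = th + eps * r                                   # eq. (4), M = I
        r = r + 0.5 * eps * grad_log_pi(th)                 # eq. (5)
    H_new = -log_pi(th) + 0.5 * np.sum(r ** 2)
    if rng.uniform() < np.exp(H_old - H_new):               # accept with probability rho
        return th
    return theta
```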
If $\pi(\theta)$ is the posterior distribution of model parameters $\theta$ given $\mathrm{Data} = \{x_i\}_{i=1}^N$, then $\nabla_\theta \log\pi(\theta)$ can be written as follows:
$$\nabla_\theta \log\pi(\theta) = \sum_{i=1}^{N} \nabla_\theta \log p(x_i \mid \theta) + \nabla_\theta \log p(\theta).$$
When $N$ is large, this sum becomes expensive to compute. The straightforward idea is to compute a stochastic gradient over a minibatch of size $B$ instead:
$$\nabla_\theta \log\pi(\theta) = \frac{N}{B}\sum_{i=1}^{B} \nabla_\theta \log p(x_i \mid \theta) + \nabla_\theta \log p(\theta).$$
However, this introduces noise into the Hamiltonian equations and forces one to change the integration method to counteract it. Moreover, exact calculation of the Metropolis acceptance probability becomes intractable, since we would need to sum the log-likelihood over all objects. This forces one to use a very small step size so that the acceptance probability is approximately one. Stochastic algorithms can be found in Chen et al. (2014).
Suppose now that we want to sample from a distribution $\pi(\theta)$ where $\theta$ lies on a Riemannian manifold. This means that we are given a Riemannian metric tensor $G(\theta)$ defining the inner product at every point $\theta$:
$$\langle r_1, r_2\rangle_\theta = \langle G(\theta) r_1, r_2\rangle.$$
This inner product directly influences how distances are computed in $\theta$-space, and an efficient sampler should take it into account in order to make larger steps in parameter space. This was done in Girolami & Calderhead (2011) and in Ma et al. (2015). The main idea is to adapt the Hamiltonian to account for the Riemannian structure:
$$H(\theta, r) = U(\theta) + \tfrac{1}{2} r^T G(\theta) r.$$
Also, in this case the Leapfrog should be replaced with the Generalized Leapfrog.
It’s important to note that both conventional HMC and Riemannian HMC treat parameters θ and r as unconstrained variables. If this is not true, HMC steps could move parameters off the manifold. Let’s consider an example, when we have some distribution π(e), e∈E, where E is two-dimensional ellipse with axes 2a and 2b. We can parameterize the points on the ellipse at least in two ways:
- $e = (a\cos t,\ b\sin t)$ with unconstrained $t \in \mathbb{R}$;
- $e = (x, y) \in \mathbb{R}^2$ such that $\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$.
In the first case we can apply HMC or Riemannian HMC directly, since the variable $t$ and the auxiliary variable are unconstrained; in this case $G(t) = a^2\sin^2 t + b^2\cos^2 t$. In the second case, $(x, y)$ must lie on the ellipse. As the auxiliary variable $r$ is, in a sense, the velocity of the point, it cannot have a non-zero component orthogonal to the tangent space. Moreover, even if $r$ lies in the tangent space, modifying $(x, y)$ using equation (4) would move the point off the manifold.
For the case of rectangular orthogonal matrices it is hard to find an unambiguous unconstrained parameterization. This means that for sampling on this manifold we need iterations that 1) always leave $\theta$ on the manifold and 2) force $r$ to lie in the tangent space at the current point $\theta$. Fortunately, this can be done using known properties of the orthogonal matrix manifold.
Manifold Preliminaries

In this subsection we present some useful definitions and concepts about manifolds. A more complete discussion of this topic can be found in Absil & Malick (2012).
On a manifold, the concept of a straight line in Euclidean space is extended by the notion of a geodesic, a curve $\gamma(t)$ that locally minimizes arc length. Given a smooth manifold $M$, there exists a tangent space $T_X M$ at every point $X \in M$. Given $X \in M$, $U \in T_X M$, and $t \in \mathbb{R}$, we can define the exponential map $\mathrm{Exp}(X, U)$ as $\Gamma(1, X, U)$, where $\Gamma(t, X, U)$ is the curve with the properties
$$\Gamma(0, X, U) = X, \qquad \frac{d}{dt}\Gamma(t, X, U)\Big|_{t=0} = U. \quad (6)$$
It is known that $\mathrm{Exp}(X, tU) = \Gamma(t, X, U)$. Byrne & Girolami (2013) use the exponential map to define their sampling scheme. However, in many cases exact geodesics are quite hard to compute. Any map $R(X, U)$ for which the curve $R(t, X, U) = R(X, tU)$ satisfies equations (6) is called a retraction. Retractions are known to be (at least) first-order approximations of the exponential map, and a retraction is usually computationally cheaper than the corresponding exponential map.
Along with retractions there is also the concept of vector transport. Intuitively, when a point on the manifold moves along a curve, its tangent space and all vectors in it move as well; this process is described by the notion of vector transport. Formally, for $X, Y \in M$ it is defined as a function $\mathcal{T}_{X\to Y}: T_X M \to T_Y M$ that is linear in its argument. A canonical example of such a transformation is the so-called parallel transport. More information can be found in Sato & Iwai (2015).
The Stiefel Manifold

In this subsection we present some properties of the manifold of orthogonal matrices that will be useful later. Many of them can be found in Tagare (2011).
Let $V_p(\mathbb{R}^n) = \{X \in \mathbb{R}^{n\times p} \mid X^T X = I\}$, where $n \ge p$, denote the Stiefel manifold. There is a convenient representation for the tangent space:
$$T_X V_p(\mathbb{R}^n) = \left\{ XA + X_\perp B \;\middle|\; A^T = -A \in \mathbb{R}^{p\times p},\ B \in \mathbb{R}^{(n-p)\times p},\ [X\ X_\perp] \text{ is orthogonal} \right\},$$
where $X_\perp$ is an orthogonal complement of the orthonormal columns of $X$.
For an arbitrary matrix $G \in \mathbb{R}^{n\times p}$, the projection onto the tangent space at the point $X$ can be computed analytically:
$$\operatorname{proj}_{T_X V_p(\mathbb{R}^n)} G = G - X G^T X.$$
The projection is taken with respect to the canonical inner product:
$$\langle Z_1, Z_2\rangle_X = \operatorname{tr}\!\left(Z_1^T \left(I - \tfrac{1}{2} X X^T\right) Z_2\right).$$
For the Stiefel manifold, the exponential map and parallel transport $(X, U) \to (X', U')$ along the geodesic with parameter $t$ are computed simultaneously as follows (Byrne & Girolami, 2013):
$$[X',\ U'] = [X,\ U]\,\exp\!\left(t\begin{bmatrix} A & -S \\ I & A\end{bmatrix}\right)\begin{bmatrix}\exp(-tA) & 0 \\ 0 & \exp(-tA)\end{bmatrix}, \quad (7)$$
where $A = X^T U = -U^T X$ and $S = U^T U$. The time complexity of this update includes at least two matrix exponentials of size p×p, which can be computed in O(p³) time, plus the cost of matrix multiplication, which is O(np²) in this case. While for large n and small p the computation time of the matrix exponentials is almost negligible, this is not the case for square matrices.
Finally, for the Stiefel manifold there are many examples of retractions; most of them can be found in Absil et al. (2009). One particularly important example is based on the Cayley transform. Let $A^T = -A \in \mathbb{R}^{n\times n}$, $X \in V_p(\mathbb{R}^n)$, and let
$$Y(\varepsilon) = \left(I + \tfrac{\varepsilon}{2} A\right)^{-1}\left(I - \tfrac{\varepsilon}{2} A\right) X. \quad (8)$$
Then it is true that
$$Y(\varepsilon)^T Y(\varepsilon) = I, \qquad Y(0) = X, \qquad \frac{d}{d\varepsilon} Y(\varepsilon)\Big|_{\varepsilon=0} = -AX.$$
Thus, if we take $A = -(GX^T - XG^T)$, then
$$-AX = G - XG^TX = \operatorname{proj}_{T_X V_p(\mathbb{R}^n)} G.$$
This retraction is useful because it makes our sampling scheme symplectic, which is desirable.
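The following small NumPy check, our own illustration for the real Stiefel manifold, verifies the two key properties of the Cayley retraction (8): it stays on the manifold, and its initial velocity equals the projected gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eps = 6, 3, 1e-3

# Random point on V_p(R^n) via QR, and an arbitrary matrix G.
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
G = rng.standard_normal((n, p))

A = -(G @ X.T - X @ G.T)                       # skew-symmetric by construction

def cayley(e):
    I = np.eye(n)
    return np.linalg.solve(I + 0.5 * e * A, (I - 0.5 * e * A) @ X)   # eq. (8)

Y = cayley(eps)
print(np.allclose(Y.T @ Y, np.eye(p), atol=1e-10))            # stays on the manifold
deriv = (cayley(1e-6) - cayley(-1e-6)) / 2e-6                  # d/de Y(e) at e = 0
print(np.allclose(deriv, G - X @ G.T @ X, atol=1e-4))          # equals the projection of G
```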
Orthogonal HMC

Suppose we want to sample from a distribution $\pi(\theta)$ over orthogonal matrices. The main idea is to adapt the leapfrog integration (3)-(5) so as to maintain the orthogonality constraint on $\theta$ and, importantly, preserve the symplecticity of the update. First, we note that at every iteration we should have $\theta \in V_p(\mathbb{R}^n)$ and $r \in T_\theta V_p(\mathbb{R}^n)$. Second, we note that if $\theta \in V_p(\mathbb{R}^n)$, then the Riemannian gradient $\widehat{\nabla}_\theta \log\pi(\theta) \in T_\theta V_p(\mathbb{R}^n)$ is defined as a particular vector in the tangent space. Given that, the update (3) keeps the momentum $r$ in the same tangent space. For updating $\theta$ we can use the retraction defined in equation (8), so the update formula (4) transforms into the following one:
$$\theta_{n+1} = Q(\theta_n, r_n, \varepsilon)\,\theta_n, \quad (9)$$
$$Q(\theta, r, \varepsilon) = \left(I - \tfrac{\varepsilon}{2}\left(r\theta^T - \theta r^T\right)\right)^{-1}\left(I + \tfrac{\varepsilon}{2}\left(r\theta^T - \theta r^T\right)\right). \quad (10)$$
However, this step changes the tangent space, and we lose the property that $r \in T_\theta V_p(\mathbb{R}^n)$. We can fix this by applying a vector transport that carries $r$ to the new tangent space. An orthogonal matrix $X \in V_p(\mathbb{R}^n)$ is an orthonormal system of $p$ vectors on a sphere in $\mathbb{R}^n$. A tangent matrix $Z \in T_X V_p(\mathbb{R}^n)$ is a matrix of tangent vectors that indicate the direction of rotation. Intuitively, if the points on the sphere are rotated in some direction, then their velocities should be rotated in the same direction too. So $\theta$ and $r$ should be updated at the same time to ensure $r \in T_\theta V_p(\mathbb{R}^n)$. Finally, we obtain the following update:
$$\theta_{n+1} = Q(\theta_n, r_n, \varepsilon)\,\theta_n, \qquad r_{n+1} = Q(\theta_n, r_n, \varepsilon)\,r_n. \quad (11)$$
Combining these update rules we come to the final scheme:
$$r_{n+\frac{1}{2}} = r_n + \tfrac{\varepsilon}{2}\,\widehat{\nabla}_\theta \log\pi(\theta_n),$$
$$\theta_{n+1} = Q(\theta_n, r_{n+\frac{1}{2}}, \varepsilon)\,\theta_n,$$
$$r_{n+1} = Q(\theta_n, r_{n+\frac{1}{2}}, \varepsilon)\,r_{n+\frac{1}{2}} + \tfrac{\varepsilon}{2}\,\widehat{\nabla}_\theta \log\pi(\theta_{n+1}). \quad (12)$$
This scheme is symmetric (it can be reversed in time) and symplectic. The full proof of symplecticity is given in the supplementary material. The sketch of the proof is as follows: first, we compute the Jacobian $J = \frac{\partial(\theta_{n+1}, r_{n+1})}{\partial(\theta_n, r_n)}$, using the Woodbury identity several times to rewrite the inverse matrices. Having done this, we need to prove that $J^T A J = A$, where
$$A = \begin{bmatrix} 0 & I \\ -I & 0 \end{bmatrix}.$$
This can be verified directly by multiplying these matrices blockwise.
In summary, the algorithm works like HMC but uses the new scheme for orthogonal matrices. The leapfrog consists of three steps, so if there are several groups of parameters, constrained or unconstrained, we can run each leapfrog step independently on each group. The pseudocode is shown in Algorithm 1. The matrix inverse in equation (10) can be computed more efficiently using the Woodbury identity, which reduces the time complexity from O(n³) to O(p³). The overall time complexity is O(p³+np²)=O(np²), which suggests that the method is best suited to tall matrices of small rank. While asymptotically this method has the same time complexity as Geodesic Monte-Carlo, in practice it is faster and more stable, as shown in the experiments.
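A minimal sketch of one step of the scheme (12) is given below, assuming `grad_log_pi` returns the Euclidean gradient $\nabla_\theta \log\pi(\theta)$ and that the Riemannian gradient is its tangent-space projection; this is an illustration of the update, not the authors' implementation (Algorithm 1), and it does not use the Woodbury-based speed-up mentioned above.

```python
import numpy as np

def riemannian_grad(theta, G):
    # Projection of the Euclidean gradient G onto the tangent space at theta.
    return G - theta @ G.T @ theta

def cayley_Q(theta, r, eps):
    # Q(theta, r, eps) from eq. (10); S is skew-symmetric, so I - eps/2 S is invertible.
    S = r @ theta.T - theta @ r.T
    I = np.eye(theta.shape[0])
    return np.linalg.solve(I - 0.5 * eps * S, I + 0.5 * eps * S)

def ohmc_leapfrog_step(theta, r, grad_log_pi, eps):
    """One step of scheme (12): momentum half-step, joint Cayley rotation, half-step."""
    r = r + 0.5 * eps * riemannian_grad(theta, grad_log_pi(theta))
    Q = cayley_Q(theta, r, eps)
    theta_new = Q @ theta                                   # stays on the Stiefel manifold
    r_new = Q @ r                                           # transported to the new tangent space
    r_new = r_new + 0.5 * eps * riemannian_grad(theta_new, grad_log_pi(theta_new))
    return theta_new, r_new
```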
The stochastic version of oHMC is introduced in the same way as SGHMC, replacing the leapfrog with our scheme. The pseudocode is shown in Algorithm 2.
Experiments

In this section we give experimental results for the oHMC and oSGHMC methods. The first experiments are toy problems intended to show that the proposed method works and samples from the correct distribution. We also compare our method with Geodesic HMC (Byrne & Girolami, 2013) and show its improved time per iteration and numerical stability, along with a better effective sample size. Next, we apply oSGHMC to Bayesian neural networks where the parameters of fully-connected and convolutional layers are orthogonal matrices. Finally, we experiment with Bayesian models containing low-rank matrices parameterized by the singular value decomposition; here we consider both a simple low-rank matrix factorization model and a neural network with low-rank fully-connected layers.
Toy Experiment

Here we consider a mixture of matrix normal distributions as the distribution to sample from. To introduce orthogonal matrices, we represent the matrix using its QR decomposition. In other words, the distribution is defined in the following way:
$$\pi(Q, R) = \sum_{i=1}^{m} \pi_i\,\mathcal{N}\!\left(QR \mid M_i, \sigma^2 I\right),$$
where $Q \in \mathbb{R}^{n\times p}$ is orthogonal and $R \in \mathbb{R}^{p\times p}$ is upper-triangular. We choose the following parameters: $m = 16$, $\pi_i = \frac{1}{m}$, $\sigma = 0.3$. We consider two setups, $n = p = 2$ and $n = 3$, $p = 2$, to show that our method works both for square and rectangular matrices. Each mode $M_i$ is a matrix whose elements are either 1 or 2. The initial point for sampling is $M_1$. We run oHMC for $n_{\text{samples}} = 20000$ samples, skipping the first $n_{\text{burn}} = 10000$ samples as burn-in, and then plot every tenth sample on some pair of coordinates. The result is shown in Figure 1. Results for square and rectangular matrices are similar, so we show them only for square matrices. To obtain samples from the true distribution, we sample from the usual matrix normal mixture and then compute the QR decomposition of every sampled matrix. As can be seen from the figure, all the modes are covered by the sampled points.
We also compare our method with HMC where $Q$ is parameterized as $Q = X(X^TX)^{-1/2}$. This parameterization is ambiguous, which can diminish the efficiency of the sampler. We compare our method with Geodesic HMC as well. All methods cover the density well enough; however, our method achieves a better effective sample size (ESS), reported in Table 1. The time per iteration of our method is 1.5 times lower than that of Geodesic HMC and comparable to standard HMC, as reported in Table 2.
Geodesic HMC is slower because it needs to compute the matrix exponential twice in (7). While the asymptotic time complexity of this operation is the same as that of a matrix inverse, in practice it is more expensive. Another issue with the matrix exponential is that it is not very stable numerically: in this experiment, we could not use a larger step size $\varepsilon$ for Geodesic HMC because the matrix exponential computation failed. Therefore the effective sample size for gHMC is lower than for oHMC.
Stochastic Toy Experiment

In this experiment we validate that the stochastic version also works correctly. We take the same model as in the previous experiment, but consider a unimodal distribution with $m = 1$ and $\sigma = 1$. We run both oHMC and oSGHMC. For the stochastic method we add artificial noise to the gradients:
$$\widetilde{\nabla} \log\pi(Q, R) = \nabla \log\pi(Q, R) + \mathcal{N}\!\left(0, \sigma_{\text{noise}}^2 I\right),$$
where $\sigma_{\text{noise}} = 0.1$. As the stochastic method needs more iterations to explore the distribution, we take $n_{\text{samples}} = 50000$ and $n_{\text{burn}} = 10000$. Results are shown in Figure 2. As we can see, the stochastic method still covers the density well, and there is no collapse to a single mode.
We also compare SGHMC with oSGHMC in terms of ESS. Our method achieves a significantly better ESS, which shows that it explores the distribution faster. Results are given in Table 3.
Orthogonal Neural Networks

In this experiment we demonstrate that oSGHMC can also be applied to neural networks. For this we train an orthogonal VGG-16 (Simonyan & Zisserman, 2014) on CIFAR-10 and a small orthogonal multilayer perceptron (MLP) on MNIST. By an orthogonal network we mean a network where all weight matrices except the one in the last layer are replaced with orthogonal ones. For a convolutional kernel $W \in \mathbb{R}^{o\times i\times w\times h}$ we reshape it into a matrix $W \in \mathbb{R}^{o\times iwh}$ and require $W$ to be orthogonal. To make sure we measure the performance of sampling rather than optimization, we first find a good initial point to start the sampler from. To do this we optimize the network using the oHMC method with the noise switched off, so that the method effectively turns into an SGD-with-momentum-like optimization method over orthogonal matrices.
For the MNIST dataset we take an orthogonal MLP with the architecture 784-300-300-10, where each number is the number of units in the corresponding layer and the first number is the input dimensionality. The last layer is a usual (non-orthogonal) fully-connected layer. This choice is motivated by the fact that orthogonal matrices preserve the norm of the input, so we need to be able to scale activations arbitrarily before the softmax. Indeed, if we make the last fully-connected layer orthogonal, accuracy drops significantly, which supports this intuition.
We run the optimization for 50 epochs and achieve 97.26% accuracy on the test set. After that we run oSGHMC for $n_{\text{samples}} = 25000$ samples, discarding the first $n_{\text{burn}} = 5000$. For the Bayesian ensemble we take every 1000th sample. Other parameters are $\varepsilon = 0.00015$ and batch size 256. The results are given in Table 4.
For CIFAR-10 we optimize VGG-16 for 200 epochs and get 92.70±0.05% accuracy on the test set. In comparison, the same VGG-16 with usual weight matrices needs 300 epochs to reach the same quality. Note that our network has reduced capacity compared to the usual VGG-16 because of the orthogonality constraints; nevertheless, it achieves almost the same quality. We take learning rate $\varepsilon = 0.1$, divide it by 10 after 150 epochs, and use batch size 128.
After that, we run our sampler for $n_{\text{all}} = 24000$ samples, discard the first $n_{\text{burn}} = 4000$ as burn-in, and keep the last $n_{\text{samples}} = 20000$. We take every 2000th sample to form an ensemble of 10 networks. Using the ensemble we get 92.85±0.03% test accuracy. Here we take a rather small $\varepsilon = 0.00001$ to make sure the method does not diverge. All results are gathered in Table 5; results for non-orthogonal networks are taken from Izmailov et al. (2018).
Low-Rank Matrix Factorization

In this experiment we show how orthogonal parameters can be used to parameterize low-rank matrices. The straightforward way to obtain a low-rank matrix $W \in \mathbb{R}^{m\times n}$ is to use a bottleneck:
$$W = W_1 W_2, \quad \text{where } W_1 \in \mathbb{R}^{m\times r},\ W_2 \in \mathbb{R}^{r\times n}.$$
This factorization is ambiguous, since multiplying $W_1$ by a constant and dividing $W_2$ by the same constant leaves $W$ unchanged. This property makes optimization and sampling harder. On the other hand, there exists a unique factorization:
$$W = U \Sigma V^T, \quad (13)$$
$$U \in \mathbb{R}^{m\times r}, \quad U^T U = I, \quad (14)$$
$$V \in \mathbb{R}^{n\times r}, \quad V^T V = I, \quad (15)$$
$$\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0. \quad (16)$$
This is the well-known singular value decomposition (SVD). This parameterization is non-redundant and can be much more efficient for optimization and sampling.
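As a sketch of the idea (with our own shapes and names), the following builds a rank-r matrix from the SVD parameterization (13)-(16); the orthonormal factors are obtained here via QR purely for illustration.

```python
import numpy as np

def low_rank_from_svd(U, sigma, V):
    """Assemble W = U diag(sigma) V^T per (13)-(16).

    U: (m, r) with U^T U = I;  V: (n, r) with V^T V = I;  sigma: non-negative, sorted."""
    return U @ np.diag(sigma) @ V.T

m, n, r = 20, 15, 5
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((m, r)))     # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
sigma = np.sort(rng.uniform(0.5, 1.5, size=r))[::-1]  # sigma_1 >= ... >= sigma_r >= 0
W = low_rank_from_svd(U, sigma, V)
assert np.linalg.matrix_rank(W) == r
```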
We apply these two methods to factorize the rating matrix from the MovieLens-100k dataset, taking rank 10. The preprocessing is as follows: we center all ratings to lie in [−2, 2], defining the new rating as $\hat r = r - 3$. After that we run optimization to find a good starting point for the sampler, and then run our sampler for $n_{\text{samples}} = 5000$ with $\varepsilon = 0.0001$, discarding the first $n_{\text{burn}} = 1000$ samples. We take every 500th sample to get an ensemble of 10 factorizations. The results are collected in Table 6.
Low-Rank Neural Networks

In this experiment we show that the low-rank matrix parameterization is applicable to deep models as well. Suppose we want linear layers with low-rank weight matrices; this reduces the number of layer parameters and makes the network smaller, hopefully without quality deterioration.
The dataset of choice here is MNIST, a dataset of 60000 images of size 28x28; the first 50000 images form the training part and the rest the test part. We construct a multilayer perceptron (MLP) using only low-rank layers. We have two hidden layers of size 784, which is quite large; however, for the first layer we bound the rank by 100 and for the second by 10, while the output layer is full-rank. Our architecture can thus be written as 784-784,100-784,10-10,10. We test our method against the same perceptron in the bottleneck parameterization, training both networks for 20 epochs. The results on the train/test set are given in Table 7. The accuracy is quite reasonable for a network without convolutional layers. Moreover, we can see in Figure 3 that training goes much faster in the SVD parameterization.
We also ran sampling from the posterior distribution for our network; however, it did not help much, probably because the accuracy is already quite high for an MLP.
Discussion

In this paper we have considered a new sampling scheme for the manifold of orthogonal matrices. The results show that oHMC improves the effective sample size compared to other methods. We have shown that our method is applicable not only to toy problems, but also to harder tasks such as Bayesian inference for neural networks.
One possible concern about the proposed method is irreducibility. Roughly speaking, this property means that the Markov chain constructed by the method can travel through the whole parameter space. For square matrices it is clear that our method is not irreducible, since the retraction step X→R(X,Z) preserves det X, while real orthogonal matrices have two possible determinants, 1 and −1. This issue can be resolved by working with unitary matrices, as they constitute a connected manifold.
For rectangular matrices we do not know the answer. However, there is the following intuition: for any rectangular orthogonal matrix X we can find an orthogonal complement such that the resulting square orthogonal matrix has a determinant of the desired sign. Therefore, any two rectangular orthogonal matrices could be connected using our retraction. Although this is not a formal proof, it gives some hints about the irreducibility of the method.
Conclusion

In this paper we have presented a new HMC sampler for Bayesian models with orthogonal matrices. Experimental results have shown that the new sampling scheme can be much more sample-efficient than conventional HMC with an ambiguous parameterization of orthogonal matrices. Using the proposed method, Bayesian ensembling becomes possible for orthogonal deep neural networks.
Orthogonal matrices can also be used in parameterizations of low-rank matrices, positive-definite matrices, and others. It is known that the set of fixed-rank matrices forms a Riemannian manifold (Vandereycken, 2013). It should be noted that the proposed oHMC and oSGHMC methods do not fully account for this manifold, since the matrices U and V in the SVD parameterization are allowed to change independently. We consider finding a proper Riemannian HMC sampler for low-rank matrices to be future work. In the same way, it would be interesting to generalize our sampling scheme to tensors in the tensor train format (Steinlechner, 2015). This family of tensor representations is becoming increasingly popular and, in particular, can be used within neural networks (Novikov et al., 2015).
References
- Absil & Malick (2012) Absil, P.-A. and Malick, J. Projection-like retractions on matrix manifolds. SIAM Journal on Optimization, 22(1):135–158, 2012.
- Absil et al. (2009) Absil, P.-A., Mahony, R., and Sepulchre, R. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
- Arjovsky et al. (2016) Arjovsky, M., Shah, A., and Bengio, Y. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pp. 1120–1128, 2016.
- Byrne & Girolami (2013) Byrne, S. and Girolami, M. Geodesic Monte Carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4):825–845, 2013.
- Chen et al. (2014) Chen, T., Fox, E., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pp. 1683–1691, 2014.
- Girolami & Calderhead (2011) Girolami, M. and Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
- Hairer (2006) Hairer, E. Long-time energy conservation of numerical integrators. 2006.
- Huang et al. (2017) Huang, L., Liu, X., Lang, B., Yu, A. W., and Li, B. Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. arXiv preprint arXiv:1709.06079, 2017.
- Izmailov et al. (2018) Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
- Lucas (2011) Lucas, A. Lagrangian mechanics on Lie groups: a pedagogical approach. arXiv preprint arXiv:1111.1275, 2011.
- Ma et al. (2015) Ma, Y.-A., Chen, T., and Fox, E. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pp. 2917–2925, 2015.
- Martens & Grosse (2015) Martens, J. and Grosse, R. Optimizing neural networks with Kronecker-factored approximate curvature. arXiv preprint arXiv:1503.05671, 2015.
- Martens et al. (2018) Martens, J., Ba, J., and Johnson, M. Kronecker-factored curvature approximations for recurrent neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HyMTkQZAb.
- Neal et al. (2011) Neal, R. M. et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
- Novikov et al. (2015) Novikov, A., Podoprikhin, D., Osokin, A., and Vetrov, D. P. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pp. 442–450, 2015.
- Sato & Iwai (2015) Sato, H. and Iwai, T. A new, globally convergent Riemannian conjugate gradient method. Optimization, 64(4):1011–1031, 2015.
- Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Steinlechner (2015) Steinlechner, M. Riemannian optimization for high-dimensional tensor completion. Technical report, MATHICSE 5.2015, EPFL, Lausanne, Switzerland, 2015.
- Tagare (2011) Tagare, H. D. Notes on optimization on Stiefel manifolds. Technical report, Yale University, 2011.
- Vandereycken (2013) Vandereycken, B. Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236, 2013.
This project is the Torch implementation of the paper: Orthogonal Weight Normalization: Solution to Optimization over Multiple Dependent Stiefel Manifolds in Deep Neural Networks (arXiv:1709.06079)
- bibtex:
Requirements and Dependency
- Install Torch with CUDA GPU
- Install cudnn v5
- Install the dependent Lua package optnet by running: luarocks install optnet
- Install Magma (you can follow the instructions in the file 'Install Magma.txt'). Note: Magma is used for the SVD on the GPU. If you do not install Magma, you cannot run the code on a GPU (all the CNN experiments are run on the GPU).
Experiments in the paper
1. Reproduce the results for solving the OMSDM problem:
- Run script:
This script will download the MNIST dataset automatically; put the 'mnist.t7' directory into the './dataset/' directory. Based on this script, you can try a smaller learning rate, add more layers, or use a different batch size.
2. Reproduce the results on MLP architecture:
- Dataset preparation: download the PIE dataset and put the data files in the './dataset/' directory (the final paths of the data files are './dataset/PIE/PIE_train.dat' and './dataset/PIE/PIE_test.dat')
- Execute:
Note that the experiment above uses an MLP and runs on the CPU, so it is not necessary to install Magma for it.
3. Reproduce the results for the VGG-style, BN-Inception, and Wide Residual Network models on the CIFAR datasets:
- Dataset preparation: download the CIFAR-10 and CIFAR-100 datasets and put the data files in the './dataset/' directory
- To reproduce the experimental results, you can run the script below, which includes all the experimental configuration details:
The Inception model is based on the project at https://github.com/soumith/imagenet-multiGPU.torch.
The wide residual network model is based on the facebook torch project: https://github.com/szagoruyko/wide-residual-networks
4. Run the experiments on the ImageNet dataset.
- (1) Clone the Facebook residual network project from: https://github.com/facebook/fb.resnet.torch
- (2) Download the ImageNet dataset and put it in the '/tmp/dataset/imageNet/' directory (you can also change the path, which is set in 'opts_imageNet.lua')
- (3) Copy 'opts_imageNet.lua', 'exp_Conv_imageNet_expDecay.lua', 'train_expDecay.lua', 'module' and 'models' to the project's root path.
- (4) Execute:
You can train the other models by using the parameter '-model'.
Contact
[email protected]. Any discussions and suggestions are welcome!
Abstract
In this paper, we introduce McTorch, a manifold optimization library for deep learning that extends PyTorch1 . It aims to lower the barrier for users wishing to use manifold constraints in deep learning applications, i.e., when the parameters are constrained to lie on a manifold. Such constraints include the popular orthogonality and rank constraints, and have been recently used in a number of applications in deep learning. McTorch follows PyTorch’s architecture and decouples manifold definitions and optimizers, i.e., once a new manifold is added it can be used with any existing optimizer and vice-versa. McTorch is available at https://github.com/mctorch.
1 Introduction
Manifold optimization refers to nonlinear optimization problems of the form
min_{x∈M} f(x),        (1)
where f is the loss function and the parameter search space M is a smooth Riemannian manifold (Absil et al., 2008). Note that the Euclidean space is trivially a manifold. Conceptually, manifold optimization translates the constrained optimization problem (1) into an unconstrained optimization over the manifold M, thereby generalizing many of the standard nonlinear optimization algorithms with guarantees (Absil et al., 2008; Bonnabel, 2013; Sato & Iwai, 2013; Sato et al., 2017; Zhang et al., 2016; Kasai et al., 2018; Boumal et al., 2018). A few ingredients of manifold optimization are the matrix representations of the tangent space (linearization of the search space at a point), an inner product (to define a metric structure to compute the Riemannian gradient and Hessian efficiently), and a notion of a straight line (a characterization of the geodesic curves to maintain strict feasibility of the parameters). Two popular examples2 of smooth manifolds are i) the Stiefel manifold, which is the set of n×p matrices whose columns are orthonormal, i.e., M := {X ∈ R^{n×p} : X^⊤X = I} (Edelman et al., 1998), and ii) the symmetric positive definite manifold, which is the set of symmetric positive definite matrices, i.e., M := {X ∈ R^{n×n} : X ≻ 0} (Bhatia, 2009).
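As a small, self-contained illustration of these two constraint sets (independent of McTorch's own API), one can produce a point on the Stiefel manifold with a reduced QR factorization and verify the defining conditions numerically:

    import torch

    def project_to_stiefel(X):
        # Reduced QR yields an n x p matrix with orthonormal columns.
        Q, _ = torch.linalg.qr(X)
        return Q

    n, p = 8, 3
    X = project_to_stiefel(torch.randn(n, p))
    print(torch.allclose(X.T @ X, torch.eye(p), atol=1e-5))    # Stiefel: X^T X = I

    A = torch.randn(n, n)
    S = A @ A.T + n * torch.eye(n)                             # symmetric positive definite
    print(bool((torch.linalg.eigvalsh(S) > 0).all()))          # SPD: all eigenvalues > 0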
Manifold optimization has gained significant interest in computer vision (Kovnatsky et al., 2016; Tron & Daniilidis, 2017), Gaussian mixture models (Hosseini & Sra, 2015), multilingual embeddings (Jawanpuria et al., 2018), matrix/tensor completion (Boumal et al., 2014; Jawanpuria & Mishra, 2018; Kasai & Mishra, 2016; Kressner et al., 2013; Mishra et al., 2014; Nimishakavi et al., 2018; Vandereycken, 2013), metric learning (Meyer et al., 2011; Zadeh et al., 2016), and phase synchronization (Boumal, 2016; Zhong & Boumal, 2018), to name a few.
Deep learning refers to machine learning methods with multiple layers of processing to learn effective representations of data. These methods have led to state-of-the-art results in computer vision, speech, and natural language processing. Recently, manifold optimization has been applied successfully in various deep learning applications (Nickel & Kiela, 2018; Arjovsky et al., 2016; Roy et al., 2018; Huang et al., 2017; Huang et al., 2018; Ozay & Okatani, 2018; Badrinarayanan et al., 2015).
On the practical implementation front, there exist popular toolboxes, namely Manopt (Boumal et al., 2014), Pymanopt (Townsend et al., 2016), and ROPTLIB (Huang et al., 2016), that allow rapid prototyping without the burden of being well-versed in manifold-related notions. However, these toolboxes are more suitable for handling standard nonlinear optimization problems and are not particularly well suited for deep learning applications. On the other hand, PyTorch (Paszke et al., 2017), a Python-based deep learning library, supports tensor computations on GPU and provides a dynamic tape-based autograd system for creating neural networks. PyTorch provides a flexible format to define and train deep learning networks. Currently, however, PyTorch lacks manifold optimization support. The proposed McTorch3 library aims to bridge this gap between the standard manifold toolboxes and PyTorch by extending the latter's functionality.
McTorch builds upon the PyTorch library for tensor computation and GPU acceleration, and derives manifold definitions and optimization methods from the toolboxes above (Boumal et al., 2014; Huang et al., 2016; Townsend et al., 2016). McTorch is well integrated with PyTorch, which allows users to apply manifold optimization in a straightforward way.
2 Overview of McTorch
The McTorch library has been implemented by extending a PyTorch fork to closely follow its architecture. All manifold definitions reside in the module torch.nn.manifold and are derived from the parent class Manifold, which defines the manifold structure (i.e., the expressions of manifold-related notions) that any manifold must implement. A few of these expressions are listed below; an illustrative sketch follows the list.
- retr: to retract a tangent vector onto the manifold,
- egrad2rgrad: to convert the back-propagated gradient to the Riemannian gradient.
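A minimal sketch of what these two operations can look like for the Stiefel manifold, using a QR retraction and the standard tangent-space projection (illustrative only; McTorch's actual implementations may differ):

    import torch

    class StiefelSketch:
        # Illustrative Stiefel manifold operations; not McTorch's own Manifold class.

        def retr(self, X, Z):
            # Retract the tangent vector Z at X back onto the manifold via QR,
            # with a sign correction so that diag(R) > 0.
            Q, R = torch.linalg.qr(X + Z)
            return Q * torch.sign(torch.diagonal(R))

        def egrad2rgrad(self, X, G):
            # Project the Euclidean gradient G onto the tangent space at X:
            # rgrad = G - X sym(X^T G), where sym(A) = (A + A^T) / 2.
            XtG = X.T @ G
            sym = (XtG + XtG.T) / 2
            return G - X @ sym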
To facilitate creation of a manifold-constrained parameter, PyTorch's native Parameter class is modified to accept an extra argument on initialization to specify the manifold type and size. A Parameter can be initialized to a random point on the manifold or to a particular value provided by the user. The Parameter also holds the attribute rgrad (which stands for the Riemannian gradient), which gets updated with every back-propagated gradient step.
The existing optimizers in the module torch.optim are modified to support updates on the manifold. An optimization step is a function of a parameter's current value, its gradient, and the optimizer state. In manifold optimization, the gradient is the Riemannian gradient and the update is performed with the retraction operation.
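Concretely, a manifold-aware SGD step combines these ingredients: convert the back-propagated (Euclidean) gradient to the Riemannian gradient, scale it by the learning rate, and retract the result back onto the manifold. The sketch below reuses the illustrative StiefelSketch operations from above and is not McTorch's optimizer code:

    def riemannian_sgd_step(manifold, X, egrad, lr=1e-2):
        # Replace the Euclidean update X - lr * egrad with a manifold update.
        rgrad = manifold.egrad2rgrad(X, egrad)   # gradient in the tangent space at X
        return manifold.retr(X, -lr * rgrad)     # step and stay on the manifold

    # Example usage with the StiefelSketch class defined earlier:
    # X_new = riemannian_sgd_step(StiefelSketch(), X, X.grad)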
To use manifolds in PyTorch layers (in torch.nn.Module), we have added the property weight_manifold to the linear and convolutional layers, which constrains the weight tensor of the layer to a specified manifold. As the shape of the weight tensor is calculated from the inputs to the layer, we have added a ManifoldShapeFactory that creates a manifold object for a given tensor shape such that it obeys the initialization conditions of that manifold.
All the numerical methods are implemented using the tensor functions of PyTorch and support both CPU and GPU computations. As the implementation modifies and appends to the PyTorch code, all the user-facing APIs are similar to PyTorch's. McTorch currently supports:
- Optimizers: SGD, Adagrad, ConjugateGradient,
3 McTorch Usage
Orthogonal weight normalization in a multi-layer perceptron (Huang et al., 2017). An example showing the creation and optimization of a McTorch module with the Stiefel manifold.
    # Note: the listing below restores lines lost in extraction; the train_loader
    # variable and the negative log-likelihood loss are illustrative assumptions.
    import torch
    import torch.nn as nn

    # A McTorch module using manifold-constrained linear layers.
    class OrthogonalWeightNormalizationNet(nn.Module):
        def __init__(self, input_size, hidden_sizes, output_size):
            super(OrthogonalWeightNormalizationNet, self).__init__()
            layer_sizes = [input_size] + hidden_sizes + [output_size]
            self.layers = []
            for i in range(1, len(layer_sizes)):
                # Linear layer whose weight matrix is constrained to the Stiefel manifold.
                self.layers.append(nn.Linear(in_features=layer_sizes[i-1],
                                             out_features=layer_sizes[i],
                                             weight_manifold=nn.Stiefel))
                if i != len(layer_sizes) - 1:
                    self.layers.append(nn.ReLU())
                else:
                    # LogSoftmax at the output layer
                    self.layers.append(nn.LogSoftmax(dim=1))
            self.model = nn.Sequential(*self.layers)

        def forward(self, x):
            return self.model(x)

    # Create module object.
    model = OrthogonalWeightNormalizationNet(input_size=1024,
        hidden_sizes=[128, 128, 128, 128, 128], output_size=68)

    # Optimize with the Adagrad algorithm.
    optimizer = torch.optim.Adagrad(params=model.parameters(), lr=1e-2)
    for data, target in train_loader:  # batches of training data (assumed)
        optimizer.zero_grad()
        output = model(data)
        cost = nn.functional.nll_loss(output, target)  # assumed loss for LogSoftmax outputs
        cost.backward()
        optimizer.step()
In the above example, a new PyTorch module is defined by inheriting from nn.Module. It requires defining an __init__ function to set up the layers and a forward function to define the forward pass of the module. The backward pass for back-propagation of gradients is computed automatically. In the layer definition, Linear layers with the Stiefel (orthogonal) manifold constraint are initialized. The model also adds ReLU nonlinearities between layers and a log-softmax at the output. The forward function of the model performs a forward pass over the sequence of layers. To optimize, torch.optim.Adagrad is initialized, followed by multiple epochs of forward and backward passes of the module on batched training data.
4 Roadmap and Conclusion
We are actively working on adding support for more manifolds and optimization algorithms. We are also curating a collection of code examples and benchmarks to showcase various uses of manifold optimization in deep learning research and applications.
McTorch is released under the BSD-3 Clause license and all the codes and examples are available on the GitHub repository of the project at https://github.com/mctorch/mctorch.
Footnotes
- PyTorch is available at https://pytorch.org/ and introduced in Paszke et al. (2017).
- A comprehensive list of manifolds is at http://manopt.org/tutorial.html.
- McTorch stands for Manifold-constrained Torch.
References
- P.-A. Absil, R. Mahony, and R. Sepulchre.Optimization Algorithms on Matrix Manifolds.Princeton University Press, 2008.
- M. Arjovsky, A. Shah, and Y. Bengio.Unitary evolution recurrent neural networks.In ICML, 2016.
- V. Badrinarayanan, B. Mishra, and R. Cipolla.Symmetry-invariant optimization in deep networks.Technical report, arXiv preprint arXiv:1511.01754, 2015.
- R. Bhatia.Positive definite matrices, volume 24.Princeton university press, 2009.
- S. Bonnabel.Stochastic gradient descent on Riemannian manifolds.IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
- N. Boumal.Nonconvex phase synchronization.SIAM Journal on Optimization, 26(4):2355–2377, 2016.
- N. Boumal, P.-A. Absil, and C. Cartis.Global rates of convergence for nonconvex optimization on manifolds.IMA Journal of Numerical Analysis, 2018.
- N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre.Manopt, a Matlab toolbox for optimization on manifolds.Journal of Machine Learning Research, 15(Apr):1455–1459, 2014.
- A. Edelman, T.A. Arias, and S.T. Smith.The geometry of algorithms with orthogonality constraints.SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
- R. Hosseini and S. Sra.Matrix manifold optimization for Gaussian mixtures.In NIPS, 2015.
- L. Huang, X. Liu, B. Lang, A. W. Yu, Y. Wang, and B. Li.Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks.In AAAI, 2017.
- W. Huang, P.-A. Absil, K. A. Gallivan, and P. Hand.ROPTLIB: an object-oriented C++ library for optimization on Riemannian manifolds.Technical Report FSU16-14.v2, Florida State University, 2016.
- Z. Huang, J. Wu, and L. Van Gool.Building deep networks on Grassmann manifolds.In AAAI, 2018.
- P. Jawanpuria, A. Balgovind, A. Kunchukuttan, and B. Mishra.Learning multilingual word embeddings in latent metric space: a geometric approach.Technical report, arXiv preprint arXiv:1808.08773, 2018.
- P. Jawanpuria and B. Mishra.A unified framework for structured low-rank matrix learning.In ICML, 2018.
- H. Kasai and B. Mishra.Low-rank tensor completion: a Riemannian manifold preconditioning approach.In ICML, 2016.
- H. Kasai, H. Sato, and B. Mishra.Riemannian stochastic recursive gradient algorithm.In ICML, 2018.
- A. Kovnatsky, K. Glashoff, and M. M. Bronstein.Madmm: a generic algorithm for non-smooth optimization on manifolds.In ECCV, 2016.
- D. Kressner, M. Steinlechner, and B. Vandereycken.Low-rank tensor completion by Riemannian optimization.BIT Numerical Mathematics, 2013.Doi: 10.1007/s10543-013-0455-z.
- G. Meyer, S. Bonnabel, and R. Sepulchre.Linear regression under fixed-rank constraints: a Riemannianapproach.In ICML, 2011.
- B. Mishra, G. Meyer, S. Bonnabel, and R. Sepulchre.Fixed-rank matrix factorizations and Riemannian low-rank optimization.Computational Statistics, 29(3–4):591–621, 2014.
- M. Nickel and D. Kiela.Learning continuous hierarchies in the Lorentz model of hyperbolic geometry.In ICML, 2018.
- M. Nimishakavi, P. Jawanpuria, and B. Mishra.A dual framework for low-rank tensor completion.In NIPS, 2018.
- M. Ozay and T. Okatani.Training CNNs with normalized kernels.In AAAI, 2018.
- A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer.Automatic differentiation in PyTorch.In The NIPS workshop on the future of gradient-based machine learning software & techniques, 2017.
- S. K. Roy, Z. Mhammedi, and M. Harandi.Geometry aware constrained optimization techniques for deep learning.In CVPR, 2018.
- H. Sato and T. Iwai.A new, globally convergent Riemannian conjugate gradient method.Optimization: A Journal of Mathematical Programming and Operations Research, 64(4):1011–1031, 2013.
- H. Sato, H. Kasai, and B. Mishra.Riemannian stochastic variance reduced gradient.arXiv preprint: arXiv:1702.05594, 2017.
- J. Townsend, N. Koep, and S. Weichwald.Pymanopt: A Python toolbox for optimization on manifolds usingautomatic differentiation.Journal of Machine Learning Research, 17(137):1–5, 2016.
- R. Tron and K. Daniilidis.The space of essential matrices as a Riemannian quotient manifold.SIAM Journal Imaging Sciences, 10(3):1416–1445, 2017.
- B. Vandereycken.Low-rank matrix completion by Riemannian optimization.SIAM Journal on Optimization, 23(2):1214–1236, 2013.
- P. Zadeh, R. Hosseini, and S. Sra.Geometric mean metric learning.In ICML, 2016.
- H. Zhang, S. J. Reddi, and S. Sra.Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds.In NIPS, 2016.
- Y. Zhong and N. Boumal.Near-optimal bounds for phase synchronization.SIAM Journal on Optimization, 28(2):989–1016, 2018.