On the relationship between disentanglement and multitask learning
Abstract
One of the main arguments behind studying disentangled representations is the assumption that they can be easily reused in different tasks. At the same time, finding a joint, adaptable representation of data is one of the key challenges in the multitask learning setting. In this paper, we take a closer look at the relationship between disentanglement and multitask learning based on hard parameter sharing. We perform a thorough empirical study of the representations obtained by neural networks trained on automatically generated supervised tasks. Using a set of standard metrics, we show that disentanglement appears naturally during the process of multitask neural network training.
1 Introduction
Disentangled representations have recently become an important topic in the deep learning community (Eastwood and Williams, 2018; Locatello et al., 2019a; Ma et al., 2019; Sanchez et al., 2019; Do and Tran, 2020). The main assumption in this problem is that the data encountered in the real world is generated by a few independent and explanatory factors of variation. It is commonly accepted that such representations are not only more interpretable and robust, but also perform better in tasks related to transfer learning and one-shot learning (Bengio, 2013; Lake et al., 2017; Schölkopf et al., 2012; Locatello et al., 2019c).
Intuitively, a disentangled representation encompasses all the factors of variation and as such can be used for various tasks based on the same input space. On the other hand, non-disentangled representations, such as those learned by vanilla neural networks, might focus only on one or a few factors of variation that are relevant to the current task, while discarding the rest. Such a representation may fail when encountering different tasks that rely on distinct aspects of variation which have not been captured.
Exploiting prevalent features and differences across tasks is also the paradigm of multitask learning. In a standard formulation of a multitask setting, a model is given one input and has to return predictions for multiple tasks at once. The neural network might therefore be implicitly regularized to capture more factors of variation than a network that learns only a single task. Based on this intuition, we hypothesize that disentanglement is likely to occur in the latent representations in this type of problem.
This paper aims to test this hypothesis empirically. We investigate whether the use of disentangled representations improves the performance of a multitask neural network and whether disentanglement itself is achieved naturally during the training process in such a setting.
Our key contributions are:

Construction of synthetic datasets that allow studying the relationship between multitask and disentanglement learning.

Study of the effect of multitask learning with hard parameter sharing on the level of disentanglement obtained in the latent representation of the model.

Analysis of the informativeness of the latent representation obtained in the single and multitask training.

Inspection of the effect of disentangled representations on the performance of a multitask model.
We verify our hypotheses by training multiple models in single- and multitask settings and investigating the level of disentanglement achieved in their latent representations. In our experiments, we find that in a hard-parameter-sharing scenario, multitask learning indeed seems to encourage disentanglement. However, it is inconclusive whether disentangled representations have a clear positive impact on the model's performance, as our results in this matter vary across datasets.
2 Related Work
2.1 Disentanglement
Over the recent years, many methods that directly encourage disentanglement have been proposed. This includes algorithms based on variational and Wasserstein autoencoders (Kim and Mnih, 2018; Higgins et al., 2017; Kumar et al., 2017; Brakel and Bengio, 2017; Spurek et al., 2020), flow networks (Dinh et al., 2014; Sorrenson et al., 2020) or generative adversarial networks (Chen et al., 2016). The main interest behind disentanglement learning lies in the assumption that such a transformation unravels the semantically meaningful factors of variation present in the observations, and is thus desirable when training deep learning models. In particular, disentanglement is believed to allow for an informative compression of the data that results in a structured, interpretable representation, which is easily adaptable to new tasks (Bengio, 2013; Lake et al., 2017; Schmidhuber, 1992; Lipton, 2018).
Several of these properties have been confirmed experimentally in many domains, including video processing (Hsieh et al., 2018), recommendation systems (Ma et al., 2019) and abstract reasoning (Van Steenkiste et al., 2019; Steenbrugge et al., 2018). Moreover, recent research in reinforcement learning concludes that disentangling embeddings of skills allows for faster retraining and better generalization (Petangoda et al., 2019). Finally, disentanglement also seems to be positively correlated with fairness when sensitive variables are not observed (Locatello et al., 2019a). On the other hand, some empirical studies suggest that one should be cautious when interpreting the properties of disentangled representations. For instance, recent studies in the unsupervised learning domain indicate that increased disentanglement does not lead to decreased sample complexity in downstream tasks (Locatello et al., 2019b).
Another key challenge in studying disentangled representations is that measuring the quality of disentanglement is a non-trivial task (Do and Tran, 2020; Eastwood and Williams, 2018; Kim and Mnih, 2018), especially in an unsupervised setting (Locatello et al., 2019b). This motivates research on the practical advantages of disentangled representations and their impact on the studied problem in possible future applications, which is the main focus of our work in the case of multitask learning.
2.2 Multitask Learning
Multitask learning aims at simultaneously solving multiple tasks by exploiting common information (Ruder, 2017). The approaches predominantly used for this problem are soft (Duong et al., 2015) and hard (Caruana, 1993) parameter sharing. In hard parameter sharing, the weights of the model are divided into those shared by all tasks and those that are task-specific. In deep learning, this idea is typically implemented by sharing the consecutive layers of the network that are responsible for learning a joint data representation. In soft parameter sharing, each task is given its own set of parameters. Constraints are then imposed by information-sharing mechanisms or by regularizing the distance between the parameters, e.g. by adding an appropriate term to the optimization objective.
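As a minimal illustration of hard parameter sharing, the following numpy sketch computes a joint representation once in a shared trunk and feeds it to several task-specific heads. All sizes are hypothetical and the linear heads stand in for real task networks; this is a sketch of the idea, not the model used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only.
IN_DIM, HIDDEN, LATENT, N_TASKS = 16, 32, 10, 3

# Shared trunk: all tasks reuse these weights (hard parameter sharing).
W1 = rng.normal(0, 0.1, (IN_DIM, HIDDEN))
W2 = rng.normal(0, 0.1, (HIDDEN, LATENT))

# Task-specific heads: one small linear map per task.
heads = [rng.normal(0, 0.1, (LATENT, 1)) for _ in range(N_TASKS)]

def relu(x):
    return np.maximum(x, 0.0)

def forward(x):
    # The joint representation is computed once and reused by every head.
    z = relu(relu(x @ W1) @ W2)
    return [z @ Wh for Wh in heads]

x = rng.normal(size=(8, IN_DIM))
preds = forward(x)          # one prediction array per task
```

In soft parameter sharing, by contrast, each task would keep its own copy of `W1` and `W2`, with a penalty on the distance between the copies added to the loss.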
Multitask learning is widely used in the Deep Learning community, for instance in applications related to natural language processing (Liu et al., 2019), computer vision (Misra et al., 2016) or molecular property prediction modeled by graph neural networks (Capela et al., 2019). One may observe that the premises of multitask and disentanglement learning are related to each other and thus it is interesting to investigate whether the joint data representation obtained in a multitask problem exhibits some disentanglementrelated properties.
3 Methods
In this section, we describe the methods and datasets used for conducting the experiments.
3.1 Dataset Creation
In order to investigate the relationship between multitask learning and disentanglement, we require a dataset that fulfills two conditions:

It provides access to the true (disentangled) generative factors from which the observations are created.

It proposes multiple tasks for a supervised learner by providing labels which nonlinearly depend on the true factors.
The first condition is required in order to measure how well the learned representations approximate the true latent factors. Access to the true factors allows for full control over the experimental setting and permits a fair comparison through the use of supervised disentanglement metrics. Note that even though unsupervised metrics have been proposed in the literature as well, they typically yield less reliable results, as we further discuss in Section 3.3.
The second condition is needed to train a network on multiple non-trivial tasks, approximating the real-world setting of multitask learning.
To the best of our knowledge, no non-trivial datasets exist that abide by both of these requirements. Most of the available disentanglement datasets, such as dSprites, Shapes3D, and MPI3D, fulfill the first condition, as they provide pairs of observations and their true generative factors. However, these datasets do not offer any kind of challenging task on which our model could be trained. On the other hand, many datasets used for supervised multitask learning fulfill the second condition by providing pairs of observations and task labels, but do not equip the researcher with the ground-truth latent factors, failing the first condition.
Thus, we create our own datasets, which fulfill both conditions, by incorporating non-trivial tasks into standard disentanglement datasets. Since multitask approaches often try to solve tens of tasks at once, designing the tasks by hand is infeasible, so we generate them automatically in a principled way. In particular, since a supervised learning task may be formalized as finding a good approximation of an unknown function f given a set of points (x, f(x)), we generate random functions which are then used to obtain targets for our dataset (see Figure 1).
We require the generated functions to be both non-trivial (i.e. nonlinear and nonconvex) and sufficiently smooth, so as to approximate the nature of real-life tasks. In order to find a family of functions that fulfills these conditions, we take inspiration from the field of extreme learning, which finds that features obtained from randomly initialized neural networks are useful for training linear models on various real-world problems (Huang et al., 2011). As such, randomly initialized networks should be able to approximate such tasks up to a linear operation.
In particular, in order to generate the dataset, we first define a neural network architecture g. For this purpose, we used an MLP with four hidden layers of 300 units each, tanh activations, and an output layer which returns a single number. We then sample k weight initializations θ_1, …, θ_k of this network from a zero-mean Gaussian distribution. Each of the networks g_{θ_j} obtained by random initialization defines a single task in our approach. Thus, for a given dataset containing observations x_i and their true generative factors v_i, we obtain a dataset for multitask supervised learning by applying:

y_i = (g_{θ_1}(v_i), …, g_{θ_k}(v_i)),

where y_i is a vector of stacked target values for each task, whose j-th element is given by y_i^(j) = g_{θ_j}(v_i).
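The task-generation procedure can be sketched in numpy as follows. The factor dimensionality, the number of tasks, and the weight scale SIGMA are illustrative assumptions (the exact initialization variance is not given in this excerpt); only the architecture, four tanh hidden layers of 300 units with a scalar output, follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FACTORS, N_TASKS, SIGMA = 5, 10, 0.2   # SIGMA is an illustrative choice

def sample_task(rng, sigma):
    """One task = one randomly initialised 4-hidden-layer tanh MLP (300 units)."""
    dims = [N_FACTORS, 300, 300, 300, 300, 1]
    return [rng.normal(0.0, sigma, (a, b)) for a, b in zip(dims[:-1], dims[1:])]

def apply_task(weights, v):
    h = v
    for W in weights[:-1]:
        h = np.tanh(h @ W)               # tanh on every hidden layer
    return h @ weights[-1]               # linear output: one number per sample

tasks = [sample_task(rng, SIGMA) for _ in range(N_TASKS)]

v = rng.uniform(-1, 1, size=(100, N_FACTORS))      # stand-in for true factors
y = np.hstack([apply_task(t, v) for t in tasks])   # stacked targets, one column per task
```

Each column of `y` plays the role of one artificial regression task; the network is then trained to predict all columns at once.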
We treat this data as a regression problem, i.e. for a given neural network h parameterized by φ, the goal is to find

φ* = argmin_φ Σ_i || h_φ(x_i) − y_i ||²,

i.e. to minimize the squared error between the network predictions and the stacked task targets.
We use this process to create multitask supervised versions of dSprites, Shapes3D, and MPI3D, with ten tasks for each dataset.
3.2 Models
3.2.1 Multitask model
We investigate the relation between disentanglement and multitask learning based on a hard parameter sharing approach. In this setting, several consecutive hidden layers of the model are shared across all tasks in order to produce a joint data representation. This representation is then propagated to separate taskspecific layers which are responsible for computing the final predictions.
In particular, we use a network consisting of a shared convolutional encoder and a separate fully-connected head for each of the tasks. The encoder learns the joint representation by transforming the inputs into a low-dimensional latent space. (We provide the full model summary in Appendix A.) The architecture of the encoder follows the one from (Abdi et al., 2019), which adapts the work of (Locatello et al., 2019b) to the PyTorch framework; we use the implementations from https://github.com/amirabdi/disentanglement-pytorch. The heads are implemented as 4-layer MLPs with ReLU activations, in order to match the capacity of the networks used as task-generating functions. An overview of the model is illustrated in Figure 2.
3.2.2 Autoencoder model
In the second part of our experiments, we want to understand whether a disentangled representation provides benefits for the multitask problem. In order to produce disentangled representations, we use three different representation-learning algorithms: a vanilla autoencoder, the (β-)variational autoencoder (Kingma and Welling, 2013; Higgins et al., 2017), and FactorVAE (Kim and Mnih, 2018).
All these variants of the autoencoder share a similar framework: an autoencoder imposes a bottleneck in the network which forces a compressed representation of the original input. Some of these models, such as the VAE and FactorVAE, additionally constrain the latent variables to be highly informative and independent, which further correlates with disentanglement. We use the latent representations from these models to train task-specific heads and evaluate whether disentanglement helps to decrease the error on a given task.
The vanilla autoencoder is also used in Section 4.2, where we add a decoder with transposed convolutions to the pretrained encoders from Section 4.1. This procedure aims to decode the information captured by each encoder in the most efficient way. As such, we find autoencoders to be a useful tool for investigating disentanglement.
3.3 Disentanglement Metrics
Measuring the qualitative and quantitative properties of the disentangled representation discovered by a model is a non-trivial task. Since the true generating factors of a given dataset are usually unknown, one may assume that the decomposition can be recovered only to some extent.
Commonly used unsupervised metrics are based on correlation coefficients, which measure the intrinsic dependencies between the latent components. Such measures are widely used in independent component analysis (Hyvarinen and Morioka, 2016, 2017; Hirayama et al., 2017; Brakel and Bengio, 2017; Spurek et al., 2020; Bedychaj et al., 2020). However, uncorrelatedness does not imply stochastic independence. Furthermore, metrics based on linear correlations may not capture higher-order dependencies and are often ineffective in high-dimensional or overdetermined spaces. All this makes the use of such unsupervised metrics questionable.
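A quick numerical illustration of why uncorrelatedness is weaker than independence: below, y is a deterministic function of x, yet their linear correlation is close to zero, so a correlation-based metric would wrongly treat the two variables as unrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = x ** 2 - 1.0          # fully determined by x, yet linearly uncorrelated

# E[xy] = E[x^3 - x] = 0 for a standard normal, so the correlation vanishes.
corr = np.corrcoef(x, y)[0, 1]
```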
An alternative solution would be to use supervised metrics, which usually are more reliable (Locatello et al., 2019b). This is of course only possible after assuming access to the true generative factors. Such an assumption is rarely valid for realworld datasets, however, it is satisfied for synthetic datasets. Synthetic datasets present therefore a reasonable baseline for benchmarking disentanglement algorithms.
Frequently used supervised metrics include the mutual information gap (MIG) (Chen et al., 2018), the FactorVAE metric (Kim and Mnih, 2018), the Separated Attribute Predictability (SAP) score (Kumar et al., 2018), and disentanglement-completeness-informativeness (DCI) (Eastwood and Williams, 2018). In order to comprehensively assess the level of disentanglement in our experiments, we use all of the above-mentioned metrics to validate our results. A more detailed description of these metrics is available in Appendix B.
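To make the flavour of such supervised metrics concrete, the following sketch implements a simplified histogram-based variant of MIG: for each ground-truth factor, the normalized gap between the two latent dimensions that carry the most mutual information about it. This is an illustration under simplifying assumptions (equal-width binning, discrete factors), not the reference implementation used in our experiments.

```python
import numpy as np

def discrete_mi(a, b):
    """Mutual information (nats) between two discrete label arrays."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        joint[i, j] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def entropy(a):
    p = np.bincount(a) / len(a)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def mig(latents, factors, bins=20):
    """latents: (n, d) continuous codes; factors: (n, k) discrete ground truth."""
    # Discretise each latent dimension into equal-width bins.
    disc = np.stack([np.digitize(z, np.histogram_bin_edges(z, bins)[1:-1])
                     for z in latents.T], axis=1)
    gaps = []
    for k in range(factors.shape[1]):
        mis = sorted(discrete_mi(disc[:, j], factors[:, k])
                     for j in range(disc.shape[1]))
        # Gap between the two most informative latents, normalized by H(v_k).
        gaps.append((mis[-1] - mis[-2]) / entropy(factors[:, k]))
    return float(np.mean(gaps))

rng = np.random.default_rng(0)
v = rng.integers(0, 5, size=(5000, 2))                       # two 5-valued factors
z_good = np.hstack([v + 0.01 * rng.standard_normal(v.shape),  # axis-aligned code
                    rng.standard_normal((5000, 1))])          # plus one noise dim
```

An axis-aligned code like `z_good`, where each factor is captured by exactly one latent, yields a MIG close to 1; an entangled code spreads the information and drives the gap toward 0.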
4 Results and Discussion
In this section, we describe the performed experiments and discuss the obtained results. For more details on the training regime and experimental setup please refer to Appendix C.
4.1 Does Hard Parameter Sharing Encourage Disentanglement?
One of the most common approaches to multitask learning is hard parameter sharing. The key challenge in this method is to learn a joint representation of the data which is at the same time informative about the input and can be easily processed in more than one task. It is therefore tempting to verify whether disentanglement arises in those representations implicitly, as a consequence of hard parameter sharing.
In order to investigate this problem, we build the simple multitask model described in Section 3.2 and evaluate it on the three datasets discussed in Section 3.1: dSprites, Shapes3D, and MPI3D, each with ten artificial tasks. After the training is complete, we calculate each of the disentanglement metrics described in Section 3.3 on the latent representation of the input data. (We use the metric implementations of Locatello et al. (2019b), which are available at https://github.com/google-research/disentanglement_lib.) We compare the obtained results with the same metrics computed for an untrained (randomly initialized) network and for single-task models. In all cases we use the same architecture and training regime. Note that in the single-task scenario we train a separate model for each of the tasks, which is implemented by utilizing only one, dedicated head in the optimization process. We train all models three times, using different random seeds for parameter initialization. We report the mean results and standard deviations in Figure 3.
We observe that the disentanglement metrics computed for the representations obtained in the multitask setting are typically significantly better than the values obtained for single-task or random representations. Note that even the maximum mean result over all ten single-task models is, in almost every case, more than one standard deviation away from the multitask mean. Moreover, this holds for all the tested datasets.
Let us also point out that instead of using separate heads for each of the tasks in the multitask model one could simply use one head with the output dimension equal to the number of tasks and perform standard multivariate regression (with no parameter sharing). As presented in Figure 4, the latent representations emerging in such a scenario are less disentangled (in terms of the considered metrics) than the representations obtained when utilizing hard parameter sharing. However, the achieved values are still better than in singletask models. This suggests that even though the increase in the metrics may be partially caused by simply training the network on higherdimensional targets, the positive influence of hard parameter sharing cannot be ignored. This advocates in favor of the hypothesis that multitask representations are indeed more disentangled than the ones arising in singletask learning.
4.2 What Are the Properties of the Learned Representations?
The previous section discussed the obtained representations by analyzing quantitative disentanglement metrics. Here, we provide more insights into the characteristics of latent encodings.
4.2.1 UMAP embeddings
In order to gain intuition behind the differences between the representations obtained in the previous experiment we compute a 2Dembedding of the latent encodings using the UMAP algorithm (McInnes et al., 2018). The results are presented in Figure 5.
The embeddings obtained for the multitask representations are much more semantically meaningful, with easily distinguishable separate clusters. Moreover, the position and internal structure of the clusters correspond to different values of the true factors. This cannot be observed for the untrained or singletask representations, suggesting that the multitask representations are indeed more successful in encompassing the information about the real values of the generative sources of the data.
[Figure 5: UMAP embeddings of the latent representations on Shapes3D, for the input data and the random, single-task, and multitask encoders.]
4.2.2 Latent space traversal
Providing qualitative results of the retrieved factors is a common practice in disentanglement learning (Locatello et al., 2019c; Kumar et al., 2017; Sanchez et al., 2019; Sorrenson et al., 2020; Locatello et al., 2019b). In particular, visual presentation of the interpolations over the latent space allows assessing — from a human perspective — the informativeness and decomposition of the obtained representations. Note that such analysis is possible only after adding and training a suitable decoder network, which maps the retrieved factors back to the image space.
In our setting, the decoder mirrors the architecture of the encoder (the convolutions are replaced by transposed convolutions of the same size — see Appendix A). Given the latent representations as an input, the decoder optimizes the reconstruction error (as measured by MSE) between its outputs and the original images. We train three separate decoders corresponding to the different encoders from the previous section — a randomly initialized encoder, an encoder produced by one of the singletask models, and a multitask encoder.
First, let us discuss the reconstruction quality achieved by each of the tested decoders. Results of this experiment are presented in Figure 6. (Numerical values for the reconstruction errors are given in Appendix D.2.) Reconstructions produced for the multitask encodings are clearly superior to those obtained for the single-task encodings: in the first case, the resulting images are sharp and contain almost no noise, whereas the single-task reconstructions are blurry and similar to the ones produced for the randomly initialized encoder. We would like to emphasise that all the decoders used the same architecture and that during their optimization the parameters of the corresponding encoders were kept fixed. The quality of the reconstruction is therefore an important property of a latent representation, as it allows us to assess the representation's compression capacity. From this perspective, the compression obtained in the multitask scenario is much more informative about the input than the one obtained in the single-task scenario.
Another approach to visualising the latent variables is to perform interpolations (traversals) in the latent space. We start by selecting a random sample from the dataset and computing its encoding z. By varying a single component of the vector z over a fixed interval with a constant step, while leaving the remaining components unchanged, we produce a traversal along that particular factor. We repeat this procedure for all the factors in order to capture their impact on the decoded example. Results of such traversals for the dSprites dataset are shown in Figure 7.
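The traversal procedure can be sketched as follows. The decoder here is a hypothetical stand-in (a single random linear-tanh map), and the traversal range and step count are illustrative assumptions, since the exact values are not given in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT = 10

# Hypothetical stand-in for the trained decoder; in the paper this is a
# transposed-convolution network mapping codes back to images.
def decode(z):
    return np.tanh(z @ decode.W)       # (batch, 64*64) "image" for illustration
decode.W = rng.normal(0, 0.1, (LATENT, 64 * 64))

def traverse(z0, dim, lo=-2.0, hi=2.0, steps=9):
    """Vary one latent component over [lo, hi]; keep the others fixed."""
    zs = np.repeat(z0[None, :], steps, axis=0)
    zs[:, dim] = np.linspace(lo, hi, steps)
    return decode(zs)

z0 = rng.standard_normal(LATENT)       # encoding of one random sample
frames = traverse(z0, dim=3)           # one decoded image per traversal step
```

Repeating `traverse` for every `dim` and laying the decoded frames out in a grid produces plots like Figure 7.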
Note that since the models were not trained directly for disentanglement but only to solve a supervision task, it is not surprising that the representations are not as clearly factorized as in specialized methods such as FactorVAE. However, for the multitask model, certain latent dimensions still appear to be disentangled, and one can easily spot the difference in quality between the single- and multitask representations. In the multitask traversals, we can identify components that are responsible for the position and scale of a given figure (in Figure 7c, see the 5th and 7th factors, respectively). In contrast, the results for the single-task representations show that even a slight change in any single latent dimension leads to a degradation of the reconstructed examples. As expected, this effect is even more evident for the random (untrained) representations, where the corruption across latent factors is even more pronounced than in the single-task traversals.
4.3 Does Disentanglement Help in Training Multitask Models?
In the previous sections, we studied whether multitask learning encourages disentanglement. Here we consider the inverse question, asking whether using a disentangled representation helps in multitask learning. To investigate this issue, we train autoencoder-based models devised specifically to produce disentangled latent representations without access to the true latent factors. Next, we freeze the model parameters and use the encoder to transform the inputs. The obtained latent encodings are then passed directly to the heads of a multitask network, which minimizes the average regression loss given the target values of the artificial tasks.
We consider the three autoencoder-based algorithms described in Section 3.2.2: a vanilla autoencoder (AE), a variational autoencoder (VAE), and FactorVAE. The vanilla autoencoder does not directly enforce latent disentanglement during training. In the VAE model, the normal prior with identity covariance matrix implies some disentanglement. Finally, FactorVAE adds a module to the VAE architecture that explicitly induces an informative decomposition. The representations obtained from each subsequent model should therefore be naturally ordered by the level of disentanglement achieved. For the exact values of the calculated metrics, please refer to Appendix F. In addition, we also study a scenario in which we explicitly provide the true source factors. We trained all regression models three times, using different random seeds for parameter initialization.
Table 1: Multitask regression loss (mean ± standard deviation over three runs) for models trained on different latent representations.

Dataset       dSprites         Shapes3D         MPI3D
Ground Truth  150.235 ± 3.754  72.979 ± 0.193   108.568 ± 0.285
AE            80.062 ± 0.341   114.939 ± 0.160  150.190 ± 0.097
VAE           63.260 ± 0.260   132.072 ± 0.169  194.865 ± 15.61
FactorVAE     91.937 ± 0.199   118.396 ± 0.423  151.646 ± 0.336
Table 1 summarizes the performance of the multitask model trained on the representations obtained with the above-discussed methods. Although the representations obtained from FactorVAE are more disentangled (see, for instance, the MIG or DCI measures in Appendix F) than those from the VAE and AE, the encodings produced by the vanilla AE perform best among the tested models, exceeding the others on Shapes3D and MPI3D and coming second on dSprites. Note that these results coincide with observations presented in the literature. For example, (Locatello et al., 2019b) compared different models that enforce disentanglement during training and showed that even a high value of that property does not guarantee better model performance. However, on two out of three datasets, the use of the ground-truth factors seems to significantly improve the obtained results. This may suggest that the representations produced by the considered disentanglement methods are not fully factorized. It is therefore inconclusive whether the discrepancy between the obtained results is due to shortcomings of the methods used or a manifestation of the impracticality of disentanglement.
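The frozen-encoder evaluation protocol described in this section can be sketched as follows. The encoder here is a hypothetical random stand-in for a pretrained AE/VAE/FactorVAE encoder, and closed-form ridge-regression heads stand in for the trained MLP heads; all sizes and the regularization strength are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_encoder(x):
    # Stand-in for a pretrained, frozen encoder; its parameters never change.
    return np.tanh(x @ frozen_encoder.W)
frozen_encoder.W = rng.normal(0, 0.3, (16, 10))

x = rng.normal(size=(500, 16))
y = rng.normal(size=(500, 3))          # targets of three artificial tasks

z = frozen_encoder(x)                  # encode once with fixed parameters
# One closed-form ridge head per task (a simple stand-in for the MLP heads).
lam = 1e-2
heads = np.linalg.solve(z.T @ z + lam * np.eye(10), z.T @ y)

mse = float(((z @ heads - y) ** 2).mean())   # average regression loss
```

Comparing this `mse` across encoders (AE, VAE, FactorVAE, ground-truth factors) is the comparison reported in Table 1, with trained MLP heads in place of the ridge solution.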
5 Conclusions
In this paper, we studied the relationship between multitask learning and disentangled representation learning. A fair evaluation of our hypothesis is impossible on real-world datasets, which do not provide ground-truth factors. To evaluate our results, we had to introduce synthetic datasets that have all the properties necessary to serve as a benchmark in this field. Next, we studied the effects of multitask learning with hard parameter sharing on representation learning. We found that non-trivial disentanglement appears in the representations learned in a multitask setting. The obtained factors have intuitive interpretations and correspond to the actual ground-truth components. Finally, we inverted the question and investigated the hypothesis that a disentangled representation is needed for multitask learning; however, the results are not conclusive. We found that multitask models benefit from disentanglement only on specific datasets, and we cannot name an indicator of when this unambiguously applies.
References
Abdi et al. (2019). Variational learning with disentanglement-pytorch. arXiv preprint arXiv:1912.05184.
Bedychaj et al. (2020). WICA: nonlinear weighted ICA. arXiv preprint arXiv:2001.04147.
Bengio (2013). Deep learning of representations: looking forward. arXiv preprint arXiv:1305.0445.
Brakel and Bengio (2017). Learning independent features with adversarial nets for non-linear ICA. arXiv preprint arXiv:1710.05050.
Capela et al. (2019). Multitask learning on graph neural networks applied to molecular property predictions. arXiv preprint arXiv:1910.13124.
Caruana (1993). Multitask learning: a knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning, pp. 41–48.
Chen et al. (2018). Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620.
Chen et al. (2016). InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems 29, pp. 2172–2180.
Dinh et al. (2014). NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
Do and Tran (2020). Theory and evaluation metrics for learning disentangled representations. In International Conference on Learning Representations.
Duong et al. (2015). Low resource dependency parsing: cross-lingual parameter sharing in a neural network parser. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 845–850.
Eastwood and Williams (2018). A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations.
Higgins et al. (2017). beta-VAE: learning basic visual concepts with a constrained variational framework. In ICLR.
Hirayama et al. (2017). SPLICE: fully tractable hierarchical extension of ICA with pooling. In Proceedings of the International Conference on Machine Learning, Vol. 70, pp. 1491–1500.
Hsieh et al. (2018). Learning to decompose and disentangle representations for video prediction. In Advances in Neural Information Processing Systems, pp. 517–526.
Huang et al. (2011). Extreme learning machines: a survey. International Journal of Machine Learning and Cybernetics 2(2), pp. 107–122.
Hyvarinen and Morioka (2016). Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems, pp. 3765–3773.
Hyvarinen and Morioka (2017). Nonlinear ICA of temporally dependent stationary sources.
Kim and Mnih (2018). Disentangling by factorising. arXiv preprint arXiv:1802.05983.
Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kingma and Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Kumar et al. (2017). Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848.
Kumar et al. (2018). Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848.
Lake et al. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences 40.
Lipton (2018). The mythos of model interpretability. Queue 16(3), pp. 31–57.
Liu et al. (2019). Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.
Locatello et al. (2019a). On the fairness of disentangled representations. In Advances in Neural Information Processing Systems, pp. 14611–14624.
Locatello et al. (2019b). Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359.
Locatello et al. (2019c). Disentangling factors of variation using few labels. arXiv preprint arXiv:1905.01258.
Ma et al. (2019). Learning disentangled representations for recommendation. In Advances in Neural Information Processing Systems, pp. 5711–5722.
McInnes et al. (2018). UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
Misra et al. (2016). Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3994–4003.
Petangoda et al. (2019). Disentangled skill embeddings for reinforcement learning. arXiv preprint arXiv:1906.09223.
Ruder (2017). An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
Sanchez et al. (2019). Learning disentangled representations via mutual information estimation. arXiv preprint arXiv:1912.03915.
Schmidhuber (1992). Learning factorial codes by predictability minimization. Neural Computation 4(6), pp. 863–879.
Schölkopf et al. (2012). On causal and anticausal learning. arXiv preprint arXiv:1206.6471.
Sorrenson et al. (2020). Disentanglement by nonlinear ICA with general incompressible-flow networks (GIN). arXiv preprint arXiv:2001.04872.
Spurek et al. (2020). Non-linear ICA based on Cramer-Wold metric. In International Conference on Neural Information Processing, pp. 294–305.
Steenbrugge et al. (2018). Improving generalization for abstract reasoning tasks using disentangled feature representations. arXiv preprint arXiv:1811.04784.
Van Steenkiste et al. (2019). Are disentangled representations helpful for abstract visual reasoning? In Advances in Neural Information Processing Systems, pp. 14245–14258.
Appendix A Summary of the architecture of the multitask model
The architecture of the convolutional encoder is provided in Table 2, together with the architecture of the corresponding decoder, which was used in the experiments in Section 4.2. For the fully-connected heads, we used the same architecture as the one utilized during dataset creation, presented in Table 3.
Table 2: Encoder (left) and decoder (right) architectures.

Encoder:
  Type     Kernel  Stride  Outputs
  Conv 2d  4       2       32
  Conv 2d  4       2       32
  Conv 2d  4       2       64
  Conv 2d  4       2       128
  Conv 2d  4       2       256
  Conv 2d  4       2       256
  Dense    -       -       output_dim

Decoder:
  Type               Kernel  Stride  Outputs
  Conv 2d            1       2       256
  Conv Transpose 2d  4       2       256
  Conv Transpose 2d  4       2       128
  Conv Transpose 2d  4       2       128
  Conv Transpose 2d  4       2       64
  Conv Transpose 2d  4       2       64
  Conv Transpose 2d  3       1       num_channels
Table 3: Architecture of the fully-connected heads.

  Type   Output shape
  Dense  300
  Dense  300
  Dense  300
  Dense  10
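The encoder in Table 2 can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions: the table does not list padding or activations, so a padding of 1 per convolution (which halves a 64x64 input at each layer) and ReLU activations are assumptions, and `num_channels` / `output_dim` are placeholders.

```python
import torch
import torch.nn as nn

def make_encoder(num_channels: int = 1, output_dim: int = 10) -> nn.Sequential:
    """Convolutional encoder following Table 2 (padding and activations assumed)."""
    return nn.Sequential(
        nn.Conv2d(num_channels, 32, kernel_size=4, stride=2, padding=1),  # 64 -> 32
        nn.ReLU(),
        nn.Conv2d(32, 32, 4, 2, 1),    # 32 -> 16
        nn.ReLU(),
        nn.Conv2d(32, 64, 4, 2, 1),    # 16 -> 8
        nn.ReLU(),
        nn.Conv2d(64, 128, 4, 2, 1),   # 8 -> 4
        nn.ReLU(),
        nn.Conv2d(128, 256, 4, 2, 1),  # 4 -> 2
        nn.ReLU(),
        nn.Conv2d(256, 256, 4, 2, 1),  # 2 -> 1
        nn.ReLU(),
        nn.Flatten(),                  # (N, 256, 1, 1) -> (N, 256)
        nn.Linear(256, output_dim),
    )
```

With these assumptions, a batch of 64x64 inputs is reduced to a 1x1 spatial map with 256 channels before the final dense layer.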
Appendix B Disentanglement metrics
In our experiments, we decided to use four measures of disentanglement to comprehensively validate our results. For the convenience of the reader, in this part of the appendix we briefly describe the measures used (for broader context, we encourage the reader to refer to the original papers).
B.1 Mutual Information Gap (MIG)

MIG computes the mutual information $I(v_k; z_j)$ between each ground truth component $v_k$ and each disentangled factor $z_j$. Next, for each component $v_k$, the latent dimension with the maximum mutual information is identified (denoted by $z_{j_k}$), along with the second-best latent dimension for the same score. The difference between those values gives a gap, which is finally normalized with respect to the total information associated with the studied component, i.e. its entropy $H(v_k)$:

$$\mathrm{MIG} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{H(v_k)} \Big( I(v_k; z_{j_k}) - \max_{j \neq j_k} I(v_k; z_j) \Big),$$

where $K$ is the dimension of the ground truth components space. To report one score, we average the gaps over all components.
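The computation above can be sketched as follows, assuming the $K \times J$ matrix of mutual-information values and the component entropies have already been estimated (the helper name is hypothetical):

```python
import numpy as np

def mig_score(mi: np.ndarray, entropies: np.ndarray) -> float:
    """MIG from a (K components x J latents) mutual-information matrix.

    mi[k, j] = I(v_k; z_j); entropies[k] = H(v_k).
    """
    top2 = np.sort(mi, axis=1)[:, -2:]            # two largest I(v_k; z_j) per component
    gaps = (top2[:, 1] - top2[:, 0]) / entropies  # normalized gap per component
    return float(gaps.mean())
```

For instance, with `mi = [[1.0, 0.2], [0.1, 0.8]]` and unit entropies, the per-component gaps are 0.8 and 0.7, giving a MIG of 0.75.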
B.2 FactorVAE metric

We start by normalizing the retrieved factors by their respective standard deviations computed over the dataset. For a subset of the dataset, a ground truth component is then randomly selected and fixed at a random value. The variance of each normalized retrieved factor is then computed over this subset. Next, the lowest-variance factor, i.e. the one that should most closely track the fixed ground truth component, is associated with that component.

This procedure of selecting a subset and fixing one of its ground truth components is repeated multiple times. The resulting associations between disentangled factors and ground truth components are used as inputs to a majority-vote classifier. The FactorVAE metric is the mean accuracy of this classifier.
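Given the vote counts collected by the procedure above, the final score reduces to the accuracy of the majority-vote classifier. A minimal sketch (the helper name is hypothetical; collecting the votes themselves requires the dataset and encoder):

```python
import numpy as np

def factor_vae_score(votes: np.ndarray) -> float:
    """Accuracy of the majority-vote classifier.

    votes[j, k] counts how often latent j had the lowest normalized
    variance while ground truth component k was held fixed.
    """
    # Each latent dimension votes for its most frequent component;
    # all samples matching that component count as correct.
    correct = votes.max(axis=1).sum()
    return float(correct / votes.sum())
```

For example, with votes `[[10, 2], [1, 7]]`, the classifier maps latent 0 to component 0 and latent 1 to component 1, giving an accuracy of 17/20 = 0.85.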
B.3 Separated Attribute Predictability (SAP)

SAP attributes a score $S_{k,j}$ to every pair of ground truth component $v_k$ and disentangled factor $z_j$. For continuous components, a linear regression predicts the component from the single factor, and $S_{k,j}$ is the coefficient of determination ($R^2$) of the regression. In the case of categorical components, SAP fits a decision tree and reports the balanced classification accuracy. The final SAP score is obtained by averaging, over all components, the difference between the two highest scores:

$$\mathrm{SAP} = \frac{1}{K} \sum_{k=1}^{K} \big( S_{k, j_k} - S_{k, j'_k} \big),$$

where $K$ is the dimension of the ground truth components space, $S_{k, j_k}$ is the highest score for component $v_k$, and $S_{k, j'_k}$ is the second highest score for the same component.
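Once the score matrix is available, the aggregation step is a one-liner; a sketch, assuming the per-pair $R^2$ / accuracy scores have already been computed (the helper name is hypothetical):

```python
import numpy as np

def sap_score(scores: np.ndarray) -> float:
    """SAP from a (K components x J latents) score matrix.

    scores[k, j] is the R^2 (continuous) or balanced accuracy
    (categorical) obtained when predicting component k from latent j alone.
    """
    top2 = np.sort(scores, axis=1)[:, -2:]       # two best latents per component
    return float((top2[:, 1] - top2[:, 0]).mean())
```

For example, `scores = [[0.9, 0.1], [0.2, 0.6]]` yields per-component gaps of 0.8 and 0.4, and a SAP of 0.6.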
B.4 Disentanglement, Completeness, and Informativeness (DCI)

Unlike the previous measures, DCI is a complete framework that allows verifying several properties of the achieved representation. A regressor is trained to predict each ground truth component from the latent factors, and disentanglement and completeness are estimated by inspecting the regressor's parameters to derive an importance weight $R_{jk}$ for each pair of latent factor $z_j$ and ground truth component $v_k$.

The completeness for ground truth component $v_k$ is given by

$$C_k = 1 - H_J(P_{\cdot k}), \qquad H_J(P_{\cdot k}) = -\sum_{j=1}^{J} P_{jk} \log_J P_{jk},$$

where $J$ stands for the latent space dimension and $P_{jk}$ is the probability that disentangled factor $z_j$ is important to predict $v_k$. These probabilities are obtained by dividing each importance weight by the sum of all importance weights related to a given component:

$$P_{jk} = \frac{R_{jk}}{\sum_{j'} R_{j'k}}.$$

The final completeness score is an average of the completeness scores over all components.

Disentanglement for retrieved factor $z_j$ is given by

$$D_j = 1 - H_K(\tilde{P}_{j \cdot}), \qquad H_K(\tilde{P}_{j \cdot}) = -\sum_{k=1}^{K} \tilde{P}_{jk} \log_K \tilde{P}_{jk},$$

where $K$ is the dimension of the ground truth components space and $\tilde{P}_{jk}$ is the probability that the latent factor $z_j$ is important to predict only the component $v_k$. Analogously to completeness, these probabilities are normalized, this time over the components a given latent factor helps to predict:

$$\tilde{P}_{jk} = \frac{R_{jk}}{\sum_{k'} R_{jk'}}.$$

The final disentanglement score is a weighted average of the individual disentanglement scores:

$$D = \sum_{j=1}^{J} \rho_j D_j, \qquad \rho_j = \frac{\sum_k R_{jk}}{\sum_{j,k} R_{jk}}.$$

If a disentangled factor $z_j$ is irrelevant for predicting any component, then its weight $\rho_j$ (and thus its contribution to the overall disentanglement) will be near zero.

Finally, the prediction error of the regressor measures the informativeness of the representation. Normalizing the inputs and outputs makes it possible to compute the estimation error of a completely random mapping and use it to normalize the score between 0 and 1.
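The disentanglement and completeness computations can be sketched as follows, assuming a $J \times K$ importance matrix (e.g. the feature importances of the per-component regressors) has already been obtained; the helper name and the small epsilon guard are assumptions:

```python
import numpy as np

def dci_scores(R: np.ndarray, eps: float = 1e-12):
    """Disentanglement and completeness from a (J latents x K components)
    importance matrix R."""
    J, K = R.shape
    # Completeness: distribute importance over latents for each component.
    P = R / (R.sum(axis=0, keepdims=True) + eps)               # columns sum to ~1
    H_comp = -(P * np.log(P + eps)).sum(axis=0) / np.log(J)    # entropy in base J
    completeness = float((1.0 - H_comp).mean())
    # Disentanglement: distribute importance over components for each latent.
    Pt = R / (R.sum(axis=1, keepdims=True) + eps)              # rows sum to ~1
    H_dis = -(Pt * np.log(Pt + eps)).sum(axis=1) / np.log(K)   # entropy in base K
    rho = R.sum(axis=1) / R.sum()                              # relative latent importance
    disentanglement = float((rho * (1.0 - H_dis)).sum())
    return disentanglement, completeness
```

A perfectly axis-aligned importance matrix (one latent per component, e.g. the identity) gives both scores close to 1, while a uniform matrix gives scores close to 0.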
Appendix C Training regime and experimental setup
C.1 The multitask model — experiment 4.1

We train the multitask model to minimize the sum of the task errors. The training is performed for a fixed number of epochs with constant learning rate and batch size, using the Adam optimizer (Kingma and Ba, 2014). We repeat this procedure three times with different random seed initializations, and report the mean and standard deviation of the disentanglement metrics.
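The objective above — hard parameter sharing with a summed per-task error — can be sketched in PyTorch as follows. This is a minimal sketch, not the paper's exact implementation: the class and helper names are hypothetical, and single-output linear heads stand in for the fully-connected heads of Table 3.

```python
import torch
import torch.nn as nn

class HardSharingMultitask(nn.Module):
    """Shared encoder (hard parameter sharing) with one head per task."""

    def __init__(self, encoder: nn.Module, latent_dim: int, num_tasks: int):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleList(nn.Linear(latent_dim, 1) for _ in range(num_tasks))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)                        # one shared latent representation
        return torch.cat([head(z) for head in self.heads], dim=1)

def multitask_loss(preds: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Sum over tasks of the per-task mean squared errors."""
    return ((preds - targets) ** 2).mean(dim=0).sum()
```

A single optimizer step on `multitask_loss` updates both the task-specific heads and the shared encoder, which is what lets all tasks shape the common representation.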
C.2 Latent visualisations — experiment 4.2

The encoder architecture was taken from the experiments in Section 4.3. The multitask model for each experiment was randomly selected from one of the seeds of the 10-task setting. Additionally, one of the single-task encoders was selected from the ones trained with the same seed. The random encoder was initialized with the default initialization of the PyTorch library.

The decoder was trained by minimizing the mean squared error between the decoded and input images, using mini-batches of images and a learning rate that was gradually reduced over the course of training.
C.3 Classification based on latent factors — experiment 4.3

We used the same autoencoder and multitask architectures as in the previous experiments (defined in Section A), however with the tanh nonlinearity. We trained all autoencoders for 100 epochs, using batch size 64, learning rate 0.0001, the Adam optimizer (Kingma and Ba, 2014), and a latent dimension of 8. Other hyperparameter settings were adapted from (Abdi et al., 2019). Multitask networks were trained for 30 epochs, using batch size 64, learning rate 0.0001, and the Adam optimizer. In order to average the scores over different runs, we repeated the multitask network training 3 times.
Appendix D Visualisations of decoded representations
D.1 UMAP Embeddings

In order to visualize the latent representations obtained for the random (untrained), single-task, and multitask models, we embed them into a two-dimensional space using the UMAP algorithm. The results are shown in Figure 8. It may be observed that the embeddings obtained for the multitask representations are much more semantically meaningful. This is especially evident for the dSprites and Shapes3D datasets. The MPI3D dataset poses a significantly more difficult problem, and although the multitask embeddings seem to be correlated with some of the true factors, the difference is not as visible in this case.
D.2 Reconstructions
Table 4: Reconstruction errors per dataset and encoder.

            random   single-task  multitask
  dSprites  308.04   326.30       35.97
  Shapes3D  0.044    0.082        0.008
  MPI3D     0.0021   0.0061       0.0009
As described in Section 4.2, we trained decoders on the latent spaces produced by the encoders from the experiment in Section 4.1. We provide the numerical values of the reconstruction error in Table 4 and qualitative images of the reconstructed examples in Figure 9. It may be observed that the latent representations produced by the random and single-task encoders do not allow the decoder to successfully restore the input examples. Moreover, the decoder trained on the single-task latents performs even worse (in terms of reconstruction) than the one trained on random latents.
Figure 9 layout: rows — input, random, single-task, multitask reconstructions; columns — dSprites, Shapes3D, MPI3D.
D.3 Traversals in latent space

In parallel to the study of the quality of the reconstructions, we have also explored traversals in the latent space. Given the latent representation z of an arbitrary image, we compute a traversal along each component of z, as described in Section 4.2. Each traversal shows how the image changes when only one component is slightly modified. This procedure provides a visually qualitative way of assessing the level of disentanglement in the obtained representations.
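Generating the latent codes for one traversal can be sketched as follows (a hypothetical helper; the actual images are obtained by passing each row through the decoder):

```python
import numpy as np

def latent_traversal(z: np.ndarray, dim: int, values: np.ndarray) -> np.ndarray:
    """Copies of latent code z in which only component `dim` sweeps over `values`.

    Returns an array of shape (len(values), len(z)); decoding each row
    visualizes the effect of varying that single latent component.
    """
    batch = np.repeat(z[None, :], len(values), axis=0)
    batch[:, dim] = values
    return batch
```

For a latent space of dimension 8, a full traversal grid is obtained by calling this helper once per dimension, e.g. with `values = np.linspace(-1.5, 1.5, 10)`.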
To complement the discussion conducted in Section 4.2, we also present here the traversals for the Shapes3D and MPI3D datasets (in Figures 11 and 12, respectively). One may observe that the results align with the quantitative studies of the disentanglement metrics from Figure 3, where we showed that the most disentangled representation is obtained in the multitask scenario. Note that the most informative changes of a particular feature of a given object are observed in the multitask traversals. One may also notice that the object factors, although not totally disentangled, change independently of each other.

The same is not true for the single-task traversals. In the example from the Shapes3D dataset (Figure 11), we observe that the single-task traversals capture only the color change of the background wall. It is also not surprising that the least informative traversals come from the randomly initialized encoder.
Appendix E Disentanglement and Hard Parameter Sharing
In Section 4.1 we discuss the influence of hard parameter sharing on disentanglement learning. Here we present the computed metrics for all models (including regression) in a tabulated manner in Table 5. In addition, we also present the average MSE loss on the test dataset in Figure 13.
Appendix F Disentangled representations as bases for multitask training

In Section 4.3 we discussed how disentanglement influences multitask training. In this section, we present the numerical results of all computed disentanglement metrics across the trained encoders. It is not surprising that the FactorVAE representations are the most disentangled in the majority of cases. What may come as a surprise is that the FactorVAE representations are never the best in terms of the root mean squared error of the model trained on them.