Normalized Wasserstein Distance for Mixture Distributions with Applications in Adversarial Learning and Domain Adaptation
Abstract
Understanding proper distance measures between distributions is at the core of several learning tasks such as generative models, domain adaptation, and clustering. In this work, we focus on mixture distributions that arise naturally in several application domains where the data contains different subpopulations. For mixture distributions, established distance measures such as the Wasserstein distance do not take into account imbalanced mixture proportions. Thus, even if two mixture distributions have identical mixture components but different mixture proportions, the Wasserstein distance between them will be large. This often leads to undesired results in distance-based learning methods for mixture distributions. In this paper, we resolve this issue by introducing the Normalized Wasserstein distance. The key idea is to introduce mixture proportions as optimization variables, effectively normalizing mixture proportions in the Wasserstein formulation. Using the proposed Normalized Wasserstein distance, instead of the vanilla one, leads to significant gains when working with mixture distributions with imbalanced mixture proportions. We demonstrate the effectiveness of the proposed distance in GANs, domain adaptation, adversarial clustering and hypothesis testing over mixtures of Gaussians and the MNIST, CIFAR-10, CelebA and VISDA datasets.
1 Introduction
Quantifying distances between probability distributions is a fundamental problem in machine learning and statistics, with several applications in generative models, domain adaptation, hypothesis testing, etc. Popular probability distance measures include optimal transport measures such as the Wasserstein distance (Villani, 2008) and divergence measures such as the Kullback-Leibler (KL) divergence (Cover & Thomas, 2012).
Classical distance measures, however, can lead to some issues for mixture distributions. A mixture distribution is the probability distribution of a random variable $X$ where $X = X_i$ with probability $\pi_i$, for $1 \le i \le k$. Here, $k$ is the number of mixture components and $\pi = (\pi_1, \ldots, \pi_k)$ is the vector of mixture (or mode) proportions. The probability distribution of each $X_i$ is referred to as a mixture component (or, a mode). Mixture distributions arise naturally in different applications where the data contains two or more subpopulations. For example, image datasets with different labels can be viewed as mixture (or, multimodal) distributions, where samples with the same label characterize a specific mixture component. Another prominent example is the Mixture of Gaussians (MoG), where every mixture component has a Gaussian distribution.
If two mixture distributions have exactly the same mixture components (i.e. the same $X_i$'s) but different mixture proportions (i.e. different $\pi$'s), classical distance measures between the two will be large. This can lead to undesired results in several distance-based machine learning methods. To illustrate this issue, consider the Wasserstein distance between two distributions $P_X$ and $P_Y$, defined as (Villani, 2008)
$$W(P_X, P_Y) := \min_{P_{X,Y}} \; \mathbb{E}_{(X,Y) \sim P_{X,Y}} \left[ \|X - Y\| \right] \quad (1)$$
where the minimum is over the joint distribution (or coupling) $P_{X,Y}$ whose marginal distributions are equal to $P_X$ and $P_Y$. When no confusion arises and to simplify notation, in some equations we write $W(X, Y)$ instead of $W(P_X, P_Y)$.
The Wasserstein distance optimization is over all joint distributions (couplings) $P_{X,Y}$ whose marginal distributions match exactly with the input distributions $P_X$ and $P_Y$. This requirement can cause issues when $P_X$ and $P_Y$ are mixture distributions with different mixture proportions. In this case, due to the marginal constraints, samples belonging to very different mixture components will have to be coupled together in $P_{X,Y}$ (e.g. Figure 1(a)). Using this distance measure can then lead to undesirable outcomes in problems such as domain adaptation. This motivates the need for a new distance measure that takes into account mode imbalances in mixture distributions.
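To see the marginal-constraint problem numerically, the following sketch (our own illustration, not from the paper) estimates the 1-Wasserstein distance between two 1-D mixtures that share components but not proportions; in 1-D the optimal coupling simply matches sorted samples.

```python
import numpy as np

def w1_empirical(x, y):
    """Empirical 1-Wasserstein distance between equal-size 1-D samples.
    In 1-D the optimal coupling matches order statistics, so W1 is the
    mean absolute difference of the sorted samples."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

rng = np.random.default_rng(0)

def mixture_sample(n, proportions, centers, scale=0.01):
    """Draw n samples from a 1-D Gaussian mixture with tight components."""
    modes = rng.choice(len(centers), size=n, p=proportions)
    return centers[modes] + scale * rng.standard_normal(n)

centers = np.array([0.0, 10.0])
# Identical components, very different mixture proportions.
x = mixture_sample(4000, [0.9, 0.1], centers)
y = mixture_sample(4000, [0.1, 0.9], centers)

# About 80% of the mass must travel between the two distant modes,
# so W1 is close to 0.8 * 10 = 8 even though the components coincide.
print(w1_empirical(x, y))
```

Even though the two mixtures are built from the same pair of components, the marginal constraints force most of the mass to move between distant modes.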
In this paper, we propose a new distance measure that resolves the issue of imbalanced mixture proportions for multimodal distributions. Our development focuses on a class of optimal transport measures, namely the Wasserstein distance of Eq (1). However, our ideas extend naturally to other distance measures (e.g. adversarial distances (Ganin & Lempitsky, 2015)) as well.
Let $\mathbf{G} = (G_1, \ldots, G_k)$ be an array of generator functions, with each component defined as $G_i : \mathbb{R}^r \to \mathbb{R}^d$. Let $P_{\mathbf{G}, \pi}$ be a mixture probability distribution for a random variable $X$ where $X = G_i(Z)$ with probability $\pi_i$, for $1 \le i \le k$. Throughout the paper, we assume that $Z$ has a normal distribution.
By relaxing the marginal constraints of the classical Wasserstein distance (1), we introduce the Normalized Wasserstein distance (NW distance) as follows:
$$W_N(P_X, P_Y) := \min_{\mathbf{G}, \pi^{(1)}, \pi^{(2)}} \; W(P_X, P_{\mathbf{G}, \pi^{(1)}}) + W(P_Y, P_{\mathbf{G}, \pi^{(2)}})$$
There are two key ideas in this definition that help resolve mode imbalance issues for mixture distributions. First, instead of directly measuring the Wasserstein distance between $P_X$ and $P_Y$, we construct two intermediate (and potentially mixture) distributions, namely $P_{\mathbf{G}, \pi^{(1)}}$ and $P_{\mathbf{G}, \pi^{(2)}}$. These two distributions have the same mixture components (i.e. the same $\mathbf{G}$) but can have different mixture proportions (i.e. $\pi^{(1)}$ and $\pi^{(2)}$ can be different). Second, the mixture proportions $\pi^{(1)}$ and $\pi^{(2)}$ are considered as optimization variables. This effectively normalizes mixture proportions before the Wasserstein distance computation. See Figure 1 (b, c) for a visualization of $P_{\mathbf{G}, \pi^{(1)}}$ and $P_{\mathbf{G}, \pi^{(2)}}$ and the renormalization step.
In this paper, we show the effectiveness of the proposed Normalized Wasserstein distance in four application domains. In each case, the performance of our proposed method improves significantly over baselines when input datasets are mixture distributions with imbalanced mixture proportions. Below, we briefly highlight these results:
GANs: In Section 3, we use the normalized Wasserstein distance in GAN’s formulation to train mixture models with varying mode proportions. We show that such a generative model can help capture rare modes, decrease the complexity of the generator, and renormalize an imbalanced dataset.
Domain Adaptation: In Section 4, we formulate the problem of domain adaptation by minimizing the normalized Wasserstein distance between source and target feature embeddings. On classification tasks with imbalanced datasets, our method significantly outperforms baselines (e.g. on MNIST → MNIST-M adaptation, and on synthetic-to-real adaptation on the VISDA dataset). Our method also improves over the baseline on an unsupervised image-denoising task.
Adversarial Clustering: In Section 5, we formulate the clustering problem as an adversarial learning task. First, we train a generative mixture model over the input data using the proposed normalized Wasserstein distance. Then, we assign each sample to a generative component (cluster) using a minimum-distance optimization. We observe that our method obtains high-quality clusters on an imbalanced MNIST dataset with three digits, and significantly improves over the baselines.
Hypothesis Testing: In Section 6, we propose a test using a combination of the Wasserstein and normalized Wasserstein distances to identify whether two mixture distributions differ in mode components or in mode proportions. Such a test can provide a better understanding when comparing mixture distributions.
2 Normalized Wasserstein Distance
In this section, we introduce the normalized Wasserstein distance and discuss its properties. Recall that $\mathbf{G} = (G_1, \ldots, G_k)$ is an array of generator functions defined as $G_i : \mathbb{R}^r \to \mathbb{R}^d$ for $1 \le i \le k$. Let $\mathcal{G}$ be the set of all possible function arrays $\mathbf{G}$. Let $\pi$ be a discrete probability mass function with $k$ elements, i.e. $\pi = (\pi_1, \ldots, \pi_k)$ where $\sum_{i=1}^{k} \pi_i = 1$ and $\pi_i \ge 0$. Let $\Pi$ be the set of all possible $\pi$'s.
Let $P_{\mathbf{G}, \pi}$ be a mixture probability distribution, i.e. the probability distribution of a random variable $X$ such that $X = G_i(Z)$ with probability $\pi_i$, for $1 \le i \le k$. We assume that $Z$ has a normal distribution, i.e. $Z \sim \mathcal{N}(0, I)$. We refer to $\{P_{G_i(Z)}\}_{i=1}^{k}$ and $\pi$ as mixture components and proportions, respectively. The set of all such mixture distributions is defined as follows:
$$\mathcal{P}_{\mathbf{G}, k} := \left\{ P_{\mathbf{G}, \pi} \; : \; \mathbf{G} \in \mathcal{G}, \; \pi \in \Pi \right\} \quad (2)$$
where $k$ is the number of mixture components. This set captures a rich family of probability distributions. Given two distributions $P_X$ and $P_Y$ belonging to the family of mixture distributions $\mathcal{P}_{\mathbf{G}, k}$, we are interested in defining a distance measure that is agnostic to differences in mode proportions but sensitive to shifts in mode components; i.e., the distance function should have high values only when the mode components of $P_X$ and $P_Y$ differ. If $P_X$ and $P_Y$ have the same mode components but differ only in mode proportions, the distance should be low.
The main idea is to introduce mixture proportions as optimization variables in the Wasserstein distance formulation (1). This leads to the following distance measure, which we refer to as the Normalized Wasserstein distance (NW distance), $W_N(P_X, P_Y)$, defined as:
$$W_N(P_X, P_Y) := \min_{\mathbf{G} \in \mathcal{G}, \; \pi^{(1)}, \pi^{(2)} \in \Pi} \; W(P_X, P_{\mathbf{G}, \pi^{(1)}}) + W(P_Y, P_{\mathbf{G}, \pi^{(2)}}) \quad (3)$$
Since the normalized Wasserstein optimization (3) includes the mixture proportions $\pi^{(1)}$ and $\pi^{(2)}$ as optimization variables, if two mixture distributions have the same mixture components but different mixture proportions (i.e. the same $\mathbf{G}$ but $\pi^{(1)} \neq \pi^{(2)}$), the Wasserstein distance between the two can be large while the introduced normalized Wasserstein distance between the two will be zero. Note that $W_N$ is defined with respect to a set of generator functions $\mathcal{G}$. However, to simplify the notation, we keep this dependency implicit.
To compute the NW distance, we use an alternating gradient descent approach similar to the dual computation of the Wasserstein distance (Arjovsky et al., 2017). Moreover, we impose the simplex constraints on $\pi^{(1)}$ and $\pi^{(2)}$ using a softmax function. For details, see Section 2 of the Supplementary material.
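A common way to keep $\pi$ on the probability simplex during gradient updates, as mentioned above, is to optimize unconstrained parameters and pass them through a softmax; a minimal numpy sketch:

```python
import numpy as np

def softmax(a):
    """Map unconstrained parameters to the probability simplex
    (entries positive, summing to one), so pi can be updated by
    plain gradient steps on the free parameters `a`."""
    e = np.exp(a - a.max())  # subtract max for numerical stability
    return e / e.sum()

a = np.array([0.3, -1.2, 2.0])  # free optimization variables
pi = softmax(a)                 # valid mixture proportions
print(pi.sum())
```

Gradient steps act on the free parameters, so no projection back onto the simplex is ever needed.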
To illustrate how the NW distance is agnostic to mode imbalances between distributions, consider an unsupervised domain adaptation problem with MNIST-2 (i.e. a dataset with two classes: digits 1 and 2 from MNIST) as the source dataset, and noisy MNIST-2 (i.e. a noisy version of it) as the target dataset (details of this example are presented in Section 4.2). The source and target datasets contain the two digits in different, imbalanced proportions. The couplings produced by estimating the Wasserstein distance between the two distributions are shown as yellow lines in Figure 1(a). We observe that there are many couplings between samples from incorrect mixture components. The normalized Wasserstein distance, on the other hand, constructs intermediate mode-normalized distributions $P_{\mathbf{G}, \pi^{(1)}}$ and $P_{\mathbf{G}, \pi^{(2)}}$, which are coupled to the correct modes of the source and target distributions, respectively (see panels (b) and (c) in Figure 1).
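The renormalization step can be mimicked in a toy 1-D version of this example. In the sketch below (our own illustration: two tight Gaussian components stand in for the two digit classes, and the minimization over proportions is done by grid search rather than gradient descent), the NW distance collapses to near zero while the vanilla W1 stays large:

```python
import numpy as np

rng = np.random.default_rng(1)
centers = np.array([0.0, 10.0])

def w1_empirical(x, y):
    """1-D empirical W1 via sorted-sample matching (equal sizes)."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def mixture_sample(n, proportions, scale=0.01):
    modes = rng.choice(2, size=n, p=proportions)
    return centers[modes] + scale * rng.standard_normal(n)

def fit_proportion(target, n=2000, scale=0.01):
    """Grid-search the proportion p that minimizes
    W1(target, p * comp0 + (1 - p) * comp1) with the components fixed."""
    best = np.inf
    for p in np.linspace(0.0, 1.0, 101):
        n0 = int(round(p * n))
        model = np.concatenate([
            centers[0] + scale * rng.standard_normal(n0),
            centers[1] + scale * rng.standard_normal(n - n0),
        ])
        best = min(best, w1_empirical(target, model))
    return best

x = mixture_sample(2000, [0.9, 0.1])  # imbalanced "source"
y = mixture_sample(2000, [0.1, 0.9])  # oppositely imbalanced "target"

nw = fit_proportion(x) + fit_proportion(y)  # both terms near zero
w = w1_empirical(x, y)                      # vanilla W1 stays large
print(nw, w)
```

Because each term of the NW objective is free to pick its own proportions, the intermediate mixtures match the source and target almost exactly.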
In what follows, we demonstrate applications of the introduced normalized Wasserstein distance in four machine learning problems: generative models, domain adaptation, hypothesis testing, and adversarial clustering for mixture distributions.
3 Normalized Wasserstein GAN
Learning a probability model from data is a fundamental problem in statistics and machine learning. Building on the success of deep learning, a recent approach to this problem is using Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). GANs view this problem as a game between a generator whose goal is to generate fake samples that are close to the real data training samples, and a discriminator whose goal is to distinguish between the real and fake samples. The generator and the discriminator functions are typically implemented as deep neural networks.
Most GAN frameworks can be viewed as methods that minimize a distance between the observed probability distribution $P_X$ and the generative probability distribution $P_{G(Z)}$, where $Z \sim \mathcal{N}(0, I)$. $G$ is referred to as the generator function. In several GAN formulations, the distance between $P_X$ and $P_{G(Z)}$ is itself formulated as another optimization, which characterizes the discriminator. Several GAN architectures have been proposed in the last couple of years. A summarized list includes GANs based on optimal transport measures (e.g. Wasserstein GAN + weight clipping (Arjovsky et al., 2017), WGAN + gradient penalty (Gulrajani et al., 2017), GAN + spectral normalization (Miyato et al., 2018), WGAN + truncated gradient penalty (Petzka et al., 2017), relaxed WGAN (Guo et al., 2017)), GANs based on divergence measures (e.g. the original GAN formulation (Goodfellow et al., 2014), DCGAN (Radford et al., 2015), f-GAN (Nowozin et al., 2016)), GANs based on moment matching (e.g. MMD-GAN (Dziugaite et al., 2015; Li et al., 2015)), and other formulations (e.g. Least-Squares GAN (Mao et al., 2016), Boundary Equilibrium GAN (Berthelot et al., 2017), BigGAN (Brock et al., 2018), etc.).
If the observed distribution is a mixture, the proposed normalized Wasserstein distance (3) can be used to compute a generative model. Instead of estimating a single generator as done in standard GANs, we estimate a mixture distribution $P_{\mathbf{G}, \pi}$ using the proposed NW distance. We refer to this GAN as the Normalized Wasserstein GAN (NWGAN), formulated as the following optimization:
$$\min_{\mathbf{G} \in \mathcal{G}, \; \pi \in \Pi} \; W_N(P_X, P_{\mathbf{G}, \pi}) \quad (4)$$
In this case, the NW distance simplifies as
$$\min_{\mathbf{G} \in \mathcal{G}, \; \pi \in \Pi} \; W(P_X, P_{\mathbf{G}, \pi}) \quad (5)$$
There are a couple of differences between the proposed NWGAN and existing GAN architectures. The generator in the proposed NWGAN is a mixture of $k$ models, each producing a $\pi_i$ fraction of the generated samples. We select $k$ a priori based on the application domain, while $\pi$ is computed within the NW distance optimization. Modeling the generator as a mixture of neural networks has also been investigated in some recent works (Hoang et al., 2018; Ghosh et al., 2017). However, these methods assume that the mixture proportions are known beforehand and are held fixed during training. In contrast, our approach is more general, as the mixture proportions are also optimized. Estimating mode proportions has several important advantages: (1) we can estimate rare modes, (2) an imbalanced dataset can be renormalized, and (3) by allowing each $G_i$ to focus on only one part of the distribution, the quality of the generative model can be improved while the complexity of the generator can be reduced. In the following, we highlight these properties of NWGAN on different datasets.
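Sampling from such a mixture generator amounts to drawing a mode index from $\pi$ and passing Gaussian noise through the corresponding generator. A hedged numpy sketch with hypothetical affine generators (the matrices, offsets, and proportions below are made-up stand-ins for trained values):

```python
import numpy as np

rng = np.random.default_rng(2)

k, r, d = 3, 2, 2                          # modes, latent dim, output dim
A = 0.1 * rng.standard_normal((k, d, r))   # hypothetical affine generators:
b = np.array([[0.0, 0.0],                  #   G_i(z) = A_i z + b_i
              [5.0, 5.0],
              [-5.0, 5.0]])
pi = np.array([0.7, 0.2, 0.1])             # stand-in for trained proportions

def sample_mixture_generator(n):
    """Draw z ~ N(0, I), pick a mode i ~ pi, and output G_i(z)."""
    z = rng.standard_normal((n, r))
    modes = rng.choice(k, size=n, p=pi)
    return np.einsum('nij,nj->ni', A[modes], z) + b[modes], modes

samples, modes = sample_mixture_generator(5000)
print(np.bincount(modes, minlength=k) / 5000)  # roughly [0.7, 0.2, 0.1]
```

The empirical mode frequencies track $\pi$, which is exactly what lets a trained $\pi$ reproduce imbalanced datasets.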
3.1 Mixture of Gaussians
First, we illustrate results of training NWGAN on a two-dimensional mixture of Gaussians. The input data is a mixture of Gaussians, each centered at a vertex of a grid, as shown in Figure 2. The mean and the covariance matrix for each mode are randomly chosen, and the mode proportions are chosen to be non-uniform so that the dataset contains rare modes (with $\sum_i \pi_i = 1$).
Generations produced by NWGAN using affine generator models on this dataset are shown in Figure 2. We also compare our method with the vanilla WGAN (Arjovsky et al., 2017) and MGAN (Hoang et al., 2018). Since MGAN does not optimize over $\pi$, we assume uniform mode proportions ($\pi_i = 1/k$ for all $i$). To train the vanilla WGAN, a nonlinear generator function is used, since a single affine function cannot model a mixture of Gaussians.
To evaluate the generative models, we report the following quantitative scores: (1) the average mean error, which is the mean-squared error (MSE) between the mean vectors of real and generated samples per mode, averaged over all modes; (2) the average covariance error, which is the MSE between the covariance matrices of real and generated samples per mode, averaged over all modes; and (3) the $\pi$ estimation error, which is the normalized MSE between the mode-proportion vectors $\pi$ of real and generated samples. Note that computing these metrics requires mode assignments for generated samples. This is done based on the closeness of generated samples to the ground-truth means.
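The mode-assignment step used by these metrics, together with the per-mode mean error and proportion estimate, can be sketched as follows (a minimal numpy illustration, with synthetic data standing in for GAN generations):

```python
import numpy as np

def mode_metrics(samples, true_means):
    """Assign each generated sample to its nearest ground-truth mean,
    then return the mean error (MSE of per-mode sample means vs. true
    means, averaged over occupied modes) and estimated proportions."""
    dists = np.linalg.norm(samples[:, None, :] - true_means[None, :, :], axis=-1)
    assign = dists.argmin(axis=1)
    counts = np.bincount(assign, minlength=len(true_means))
    errs = [np.mean((samples[assign == i].mean(axis=0) - mu) ** 2)
            for i, mu in enumerate(true_means) if counts[i] > 0]
    return float(np.mean(errs)), counts / counts.sum()

rng = np.random.default_rng(3)
means = np.array([[0.0, 0.0], [5.0, 5.0]])
fake = np.vstack([means[0] + 0.1 * rng.standard_normal((300, 2)),
                  means[1] + 0.1 * rng.standard_normal((700, 2))])

err, pi_hat = mode_metrics(fake, means)
print(err, pi_hat)  # small mean error; proportions roughly [0.3, 0.7]
```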
We report these error terms for different GANs in Table 1. We observe that the proposed NWGAN achieves the best scores compared to the other two approaches. Also, from Figure 2, we observe that the generative model trained by MGAN misses some of the rare modes in the data. This is because of the error induced by assuming fixed mixture proportions when the ground truth is non-uniform. Since our proposed NWGAN estimates $\pi$ in the optimization, even rare modes in the data are not missed. This shows the importance of estimating mixture proportions, especially when the input dataset has imbalanced modes.
Method  Avg. mean error  Avg. covariance error  π estimation error

WGAN  0.007  0.0003  0.0036
MGAN  0.007  0.0002  0.7157
NWGAN  0.002  0.0001  0.0001
3.2 A Mixture of CIFAR-10 and CelebA
One benefit of learning mixture generative models is to disentangle the data distribution into multiple components, where each component represents one mode of the input distribution. Such disentanglement is useful in many tasks, such as the clustering task we discuss in Section 5. To test the effectiveness of NWGAN in performing such disentanglement, we consider a mixture of images from the CIFAR-10 and CelebA (Liu et al., 2015) datasets as our input distribution. All images are reshaped to a common resolution.
To highlight the importance of optimizing mixture proportions to produce disentangled generative models, we compare the performance of NWGAN with a variant of NWGAN in which the mode proportions are held fixed at $\pi_i = 1/k$ (the uniform distribution). Sample generations produced by both models are shown in Figure 3. When $\pi$ is held fixed, the model does not produce disentangled representations (in the second mode, we observe a mix of CIFAR-10 and CelebA generations). However, when we optimize $\pi$, each generator produces distinct modes.
4 Normalized Wasserstein in Domain Adaptation
In this section, we demonstrate the effectiveness of the NW distance in Unsupervised Domain Adaptation (UDA), both for supervised (e.g. classification) and unsupervised (e.g. denoising) tasks. Note that the term unsupervised in UDA means that label information in the target domain is unknown, while an unsupervised task means that label information in the source domain is also unknown.
First, we consider domain adaptation for a classification task. Let $X_s$ represent the source domain and $X_t$ the target domain. Since we deal with the classification setup, the labels take values in $\{1, \ldots, k\}$. A common formulation of the domain adaptation problem is to transform $X_s$ and $X_t$ to a feature space where the distance between the source and target feature distributions is sufficiently small, while a good classifier can be computed for the source domain in that space (Ganin & Lempitsky, 2015). In this case, one solves the following optimization:
$$\min_{f, h} \; \mathcal{L}\big(h(f(X_s)), Y_s\big) + \lambda \, \mathrm{dist}\big(f(X_s), f(X_t)\big) \quad (6)$$
where $f$ is the feature transformation, $h$ is the classifier, $\lambda$ is an adaptation parameter, and $\mathcal{L}$ is the empirical classification loss function (e.g. the cross-entropy loss). The distance function between distributions can be an adversarial distance (Ganin & Lempitsky, 2015; Tzeng et al., 2017), the Wasserstein distance (Shen et al., 2018), or an MMD-based distance (Long et al., 2015, 2016).
When $P_{f(X_s)}$ and $P_{f(X_t)}$ are mixture distributions (which is often the case, as each label corresponds to one mixture component) with different mixture proportions, the use of these classical distance measures can lead to the computation of inappropriate transformation and classification functions. In this case, we propose to use the NW distance as follows:
$$\min_{f, h} \; \mathcal{L}\big(h(f(X_s)), Y_s\big) + \lambda \, W_N\big(f(X_s), f(X_t)\big) \quad (7)$$
Computing the NW distance requires training the mixture components $\mathbf{G}$ and the mode proportions $\pi^{(1)}$ and $\pi^{(2)}$. To simplify the computation, we make use of the fact that labels for the source domain (i.e. $Y_s$) are known, so source mixture components can be identified using these labels. Using this information, we can avoid computing $\mathbf{G}$ directly and use the conditional source feature distributions as a proxy for the mixture components as follows:
$$P_{G_i(Z)} \equiv P_{f(X_s) \mid Y_s = i}, \quad 1 \le i \le k \quad (8)$$
where $\equiv$ denotes matching distributions. Using (8), the formulation (7) can be simplified as
$$\min_{f, h} \; \mathcal{L}\big(h(f(X_s)), Y_s\big) + \lambda \min_{\pi \in \Pi} W\Big( \sum_{i=1}^{k} \pi_i P_{f(X_s) \mid Y_s = i}, \; P_{f(X_t)} \Big) \quad (9)$$
The above formulation can be seen as a version of instance weighting, as source samples belonging to class $i$ are weighted by $\pi_i$. Instance weighting mechanisms have been well studied for domain adaptation (Yan et al., 2017; Yu & Szepesvári, 2012). However, in contrast to these approaches, we train the mode proportion vector $\pi$ in an end-to-end fashion using neural networks and integrate the instance weighting into a Wasserstein optimization. Of more relevance to our work is the method proposed in (Chen et al., 2018), where the instance weights are trained end-to-end in a neural network. However, in (Chen et al., 2018), instance weights are maximization variables with respect to the Wasserstein loss, while we show that the mixture proportions need to be minimization variables in order to normalize mode mismatches. Moreover, our NW distance formulation can handle the case where mode assignments for source embeddings are unknown (as we discuss in Section 4.2). This case cannot be handled by the formulation of (Chen et al., 2018).
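The instance-weighting view is easy to make concrete: a source sample with label $i$ receives weight $\pi_i$ divided by the empirical source frequency of class $i$. A small numpy sketch (the class counts and $\pi$ below are illustrative, not from the paper's experiments):

```python
import numpy as np

def instance_weights(source_labels, pi):
    """Per-sample weights that make the reweighted source class
    distribution match the estimated target proportions pi."""
    labels = np.asarray(source_labels)
    freq = np.bincount(labels, minlength=len(pi)) / len(labels)
    return pi[labels] / freq[labels]

labels = np.array([0] * 80 + [1] * 20)  # imbalanced source: 80/20
pi = np.array([0.5, 0.5])               # estimated target proportions
w = instance_weights(labels, pi)

# After reweighting, both classes carry equal total mass.
print(w[labels == 0].sum(), w[labels == 1].sum())
```

Here the weights are fixed once $\pi$ is given; in the formulation above, $\pi$ itself is learned jointly with the feature network.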
For unsupervised tasks when mode assignments for source samples are unknown, we cannot use the simplified formulation of (8). In that case, we use a domain adaptation method solving the following optimization:
$$\min_{f} \; \mathcal{L}_{\mathrm{unsup}}\big(f(X_s)\big) + \lambda \, W_N\big(f(X_s), f(X_t)\big) \quad (10)$$
where $\mathcal{L}_{\mathrm{unsup}}$ is the loss corresponding to the desired unsupervised learning task on the source domain data.
4.1 UDA for supervised tasks
In this section, we present experiments on a domain adaptation problem for a classification task with imbalanced source and target domain datasets.
4.1.1 MNIST → MNIST-M
In the first set of experiments, we consider adaptation between the MNIST and MNIST-M datasets, one of the benchmark tasks for the domain adaptation problem. We consider three settings with different numbers of modes and imbalanced class proportions in the source and target datasets. More details can be found in Table 7 of the Supplementary material.
Following (Ganin & Lempitsky, 2015), we use a modified LeNet architecture for the feature network, and a two-layer MLP for the domain discriminator. We compare our method with the following approaches: (1) Source-only, a baseline model trained only on the source domain with no domain adaptation performed; (2) DANN (Ganin & Lempitsky, 2015), a method in which an adversarial distance between source and target distributions is minimized; and (3) Wasserstein (Shen et al., 2018), in which the Wasserstein distance between source and target distributions is minimized. Table 2 summarizes the results of this experiment. We observe that performing domain adaptation using the adversarial distance or the Wasserstein distance leads to a decrease in performance compared to the baseline model. This is partly because mode imbalances are not accounted for, resulting in negative transfer: samples belonging to incorrect classes are coupled and pushed close together in the embedding space. Our proposed NW distance, however, accounts for mode imbalances and leads to a significant boost in performance in all three settings.
Method  Setting 1  Setting 2  Setting 3

Source only  66.63  67.44  63.17
DANN  62.34  57.56  59.31
Wasserstein  61.75  60.56  58.22
NW distance  75.06  76.16  68.57
4.1.2 VISDA
In the experiments of Section 4.1.1 on digit datasets, models were trained from scratch. However, a common practice in domain adaptation is to transfer knowledge from a pretrained network (e.g. models trained on ImageNet) and fine-tune it on the desired task. To evaluate the performance of our approach in such settings, we consider adaptation on the VISDA dataset (Peng et al., 2017), a recently proposed benchmark for adapting from synthetic to real images.
We consider a subset of the entire VISDA dataset containing the following three classes: aeroplane, horse, and truck. The class proportions of the source and target domains are chosen to be different and imbalanced. We use a ResNet-18 model pretrained on ImageNet as our feature network. As shown in Table 3, our approach significantly improves domain adaptation performance over the baseline and the other compared methods.
Method  Accuracy (in %)

Source only  53.19 
DANN  68.06 
Wasserstein  64.84 
Normalized Wasserstein  73.23 
4.2 UDA for unsupervised tasks
For unsupervised tasks on mixture datasets, we use the formulation of Eq (10) to perform domain adaptation. To empirically validate this formulation, we consider the image-denoising problem. The source domain consists of digits 1 and 2 from the MNIST dataset, as shown in Fig 4(a); note that the digit colors are inverted. The target domain is a noisy version of the source, i.e. source images are perturbed with random Gaussian noise to obtain target images. The two digits appear in different, imbalanced proportions in the source and target domains. The task is to perform image denoising by dimensionality reduction, i.e., given a target domain image, we need to reconstruct the corresponding clean image that looks like the source. We assume that no (source, target) correspondence is available in the dataset.
To perform denoising when the (source, target) correspondence is unavailable, a natural choice is to minimize the reconstruction loss in the source domain while minimizing the distance between the source and target embedding distributions. We use the NW distance as our choice of distance measure. This results in the following optimization:
$$\min_{f, g} \; \mathbb{E}\left[ \| g(f(X_s)) - X_s \|^2 \right] + \lambda \, W_N\big(f(X_s), f(X_t)\big)$$
where $f$ is the encoder and $g$ is the decoder.
As our baseline, we consider a model trained only on the source using a quadratic reconstruction loss. Fig 4(b) shows the source and target embeddings produced by this baseline. In this case, the source and target embeddings are distant from each other. However, as shown in Fig 4(c), using the NW distance formulation, the distributions of the source and target embeddings match closely, with mode proportions estimated by the NW optimization. As a quantitative evaluation measure, we report the reconstruction loss on the target domain. The values for different approaches are shown in Table 4. We observe that our method outperforms the compared approaches.
Method  Target reconstruction loss

Source only  0.31 
Wasserstein  0.52 
Normalized Wasserstein  0.18 
Training on target (Oracle)  0.08 
5 Adversarial Clustering
In this section, we use the proposed NW distance to formulate an adversarial clustering approach. More specifically, let the input data distribution have $k$ underlying modes (each representing a cluster), which we intend to recover. The use of generative models for clustering has been explored in some recent works. In (Locatello et al., 2018), VAEs are used to perform clustering. Clustering using GANs is performed in (Yu & Zhou, 2018) using an Expectation-Maximization approach. Different from these, our approach makes use of the proposed NWGAN for clustering, and thus can explicitly handle data with imbalanced modes.
Let $P_X$ be the observed empirical distribution. Let $\mathbf{G}^*$ and $\pi^*$ be optimal solutions of the NWGAN optimization (5). For a given point $x$, the cluster assignment is computed using the closest distance to a mode, i.e.,
$$i^*(x) = \arg\min_{1 \le i \le k} \; \min_{z} \; \| x - G_i^*(z) \| \quad (11)$$
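In practice, the inner minimization over the latent variable can be approximated by searching over a finite batch of generated samples per mode (an approximation we introduce here for illustration):

```python
import numpy as np

def assign_cluster(x, generated_per_mode):
    """Assign x to the mode whose generated samples come closest to it,
    approximating the minimization over z by a finite batch of
    generations from each mode."""
    dists = [np.linalg.norm(g - x, axis=1).min() for g in generated_per_mode]
    return int(np.argmin(dists))

rng = np.random.default_rng(4)
g0 = 0.1 * rng.standard_normal((500, 2))                         # mode near the origin
g1 = np.array([6.0, 6.0]) + 0.1 * rng.standard_normal((500, 2))  # mode near (6, 6)

print(assign_cluster(np.array([5.8, 6.1]), [g0, g1]))  # -> 1
```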
To perform effective clustering, we require each generator to capture exactly one mode of the data distribution. Without any regularization, and with rich generator functions, a single model can capture multiple modes of the data distribution. To prevent this, we introduce a regularization term that maximizes the weighted average of Wasserstein distances between distinct generated modes:
$$\mathcal{R}(\mathbf{G}, \pi) = \sum_{i \ne j} \pi_i \pi_j \, W\big(P_{G_i(Z)}, P_{G_j(Z)}\big)$$
This term encourages diversity among the generated modes. With this regularization term, the optimization objective of the regularized NWGAN becomes
$$\min_{\mathbf{G} \in \mathcal{G}, \; \pi \in \Pi} \; W(P_X, P_{\mathbf{G}, \pi}) - \lambda_{\mathrm{reg}} \, \mathcal{R}(\mathbf{G}, \pi)$$
where $\lambda_{\mathrm{reg}}$ is the regularization parameter.
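For 1-D components the diversity regularizer can be sketched directly (our own toy illustration; the empirical W1 uses sorted-sample matching):

```python
import numpy as np

def w1_empirical(x, y):
    """1-D empirical W1 via sorted-sample matching (equal sizes)."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def diversity(mode_samples, pi):
    """Weighted sum of pairwise W1 distances between generated modes.
    Maximizing it discourages two generators from collapsing onto
    the same region of the data distribution."""
    k = len(mode_samples)
    return sum(pi[i] * pi[j] * w1_empirical(mode_samples[i], mode_samples[j])
               for i in range(k) for j in range(i + 1, k))

rng = np.random.default_rng(5)
pi = np.array([0.5, 0.3, 0.2])
separated = [rng.normal(c, 0.1, 1000) for c in (0.0, 5.0, 10.0)]
collapsed = [rng.normal(0.0, 0.1, 1000) for _ in range(3)]

print(diversity(separated, pi), diversity(collapsed, pi))
```

Well-separated generators score high while collapsed ones score near zero, which is why the term is subtracted from (i.e. added as a reward to) the NWGAN objective.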
We test the proposed adversarial clustering method on an imbalanced MNIST dataset containing three digits in imbalanced proportions. We compare our approach with k-means clustering and a Gaussian Mixture Model (GMM). Cluster purity, NMI, and ARI scores are used as quantitative metrics (refer to Appendix Section C.3 for more details). The performance of our approach in comparison with the other two techniques is shown in Table 5. Our clustering technique achieves substantially better performance than the other two methods.
Method  Cluster Purity  NMI  ARI 

k-means  0.82  0.49  0.43
GMM  0.75  0.28  0.33 
NW distance  0.98  0.94  0.97 
6 Hypothesis Testing
Suppose $P_X$ and $P_Y$ are two mixture distributions with the same mixture components but different mode proportions, i.e., $P_X$ and $P_Y$ both belong to $\mathcal{P}_{\mathbf{G}, k}$. In this case, depending on the difference between $\pi^{(1)}$ and $\pi^{(2)}$, the Wasserstein distance between the two distributions can be arbitrarily large. Thus, using the Wasserstein distance, we can only conclude that the two distributions are different. In some applications, it can be informative to have a test that determines whether two distributions differ only in mode proportions. We propose a test based on a combination of the Wasserstein and NW distances for this task. The procedure is shown in Table 6. We note that the computation of p-values for the proposed test is beyond the scope of this paper.
Wasserstein distance  NW distance  Conclusion 

High  High  Distributions differ in mode components 
High  Low  Distributions have the same components, but differ in mode proportions 
Low  Low  Distributions are the same 
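Given a decision threshold, the rule in Table 6 reduces to a few comparisons. The sketch below encodes it with a hypothetical threshold tau (the paper does not specify one):

```python
def mixture_test(w_dist, nw_dist, tau):
    """Decision rule of Table 6 with an assumed threshold tau.
    High NW implies the mode components themselves differ; high W
    with low NW points to a proportion mismatch only."""
    if nw_dist > tau:
        return "distributions differ in mode components"
    if w_dist > tau:
        return "same components, different mode proportions"
    return "distributions are the same"

# Illustrative calls with a made-up threshold tau = 0.3:
print(mixture_test(1.51, 0.06, 0.3))  # proportions differ
print(mixture_test(1.56, 0.44, 0.3))  # components differ
```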
We demonstrate this test on 2D mixtures of Gaussians. We perform experiments in two settings, each involving two datasets $P_X$ and $P_Y$ that are mixtures of Gaussians:
Setting 1: $P_X$ and $P_Y$ have the same mode components.
Setting 2: $P_X$ and $P_Y$ have shifted mode components (one mode of $P_Y$ is shifted relative to the corresponding mode of $P_X$).
In both settings, the mode proportions of $P_X$ and $P_Y$ are different. We use samples from $P_X$ and $P_Y$ to compute the Wasserstein and NW distances in the primal form by solving a linear program. The computed distance values are reported in Table 7. In Setting 1, we observe that the Wasserstein distance is large while the NW distance is small. Thus, one can conclude that the two distributions differ only in mode proportions. In Setting 2, both the Wasserstein and NW distances are large. Thus, in this case, the distributions differ in mixture components as well.
Setting  Wasserstein Distance  NW Distance 

Setting 1  1.51  0.06 
Setting 2  1.56  0.44 
7 Conclusion
In this paper, we first showed that the standard Wasserstein distance, due to its marginal constraints, can lead to undesired results when applied to mixture distributions with imbalanced mixture proportions. To resolve this issue, we proposed a new distance measure called the normalized Wasserstein distance. The key idea is to optimize the mixture proportions in the Wasserstein formulation, effectively normalizing the mixture imbalance. We demonstrated the usefulness of the normalized Wasserstein distance in four machine learning tasks: GANs, domain adaptation, adversarial clustering, and hypothesis testing for mixture distributions. Strong empirical results on all four problems highlight the effectiveness of the proposed distance measure. The normalized Wasserstein distance can also be used in other machine learning problems involving distance computation between distributions. Moreover, similar ideas can be extended to other distance measures, such as divergence measures.
References
 Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 Berthelot et al. (2017) Berthelot, D., Schumm, T., and Metz, L. BEGAN: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
 Brock et al. (2018) Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. CoRR, abs/1809.11096, 2018.
 Chen et al. (2018) Chen, Q., Liu, Y., Wang, Z., Wassell, I., and Chetty, K. Reweighted adversarial adaptation network for unsupervised domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 Cover & Thomas (2012) Cover, T. M. and Thomas, J. A. Elements of information theory. John Wiley & Sons, 2012.
 Dziugaite et al. (2015) Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.
 Ganin & Lempitsky (2015) Ganin, Y. and Lempitsky, V. Unsupervised domain adaptation by backpropagation. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1180–1189, Lille, France, 07–09 Jul 2015. PMLR.
 Ghosh et al. (2017) Ghosh, A., Kulharia, V., Namboodiri, V. P., Torr, P. H., and Dokania, P. K. Multiagent diverse generative adversarial networks. CoRR, abs/1704.02906, 6:7, 2017.
 Goodfellow et al. (2014) Goodfellow, I., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
 Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.
 Guo et al. (2017) Guo, X., Hong, J., Lin, T., and Yang, N. Relaxed Wasserstein with applications to GANs. arXiv preprint arXiv:1705.07164, 2017.
 Hoang et al. (2018) Hoang, Q., Nguyen, T. D., Le, T., and Phung, D. MGAN: training generative adversarial nets with multiple generators. In International Conference on Learning Representations, 2018.
 Li et al. (2015) Li, Y., Swersky, K., and Zemel, R. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1718–1727, 2015.
 Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
 Locatello et al. (2018) Locatello, F., Vincent, D., Tolstikhin, I. O., Rätsch, G., Gelly, S., and Schölkopf, B. Clustering meets implicit generative models. CoRR, abs/1804.11130, 2018.
 Long et al. (2015) Long, M., Cao, Y., Wang, J., and Jordan, M. I. Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on Machine Learning, pp. 97–105, 2015.
 Long et al. (2016) Long, M., Wang, J., and Jordan, M. I. Unsupervised domain adaptation with residual transfer networks. CoRR, abs/1602.04433, 2016.
 Mao et al. (2016) Mao, X., Li, Q., Xie, H., Lau, R. Y., and Wang, Z. Multiclass generative adversarial networks with the l2 loss function. arXiv preprint arXiv:1611.04076, 2016.
 Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
 Nowozin et al. (2016) Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.
 Peng et al. (2017) Peng, X., Usman, B., Kaushik, N., Hoffman, J., Wang, D., and Saenko, K. Visda: The visual domain adaptation challenge. CoRR, abs/1710.06924, 2017. URL http://arxiv.org/abs/1710.06924.
 Petzka et al. (2017) Petzka, H., Fischer, A., and Lukovnicov, D. On the regularization of Wasserstein GANs. arXiv preprint arXiv:1709.08894, 2017.
 Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Shen et al. (2018) Shen, J., Qu, Y., Zhang, W., and Yu, Y. Wasserstein distance guided representation learning for domain adaptation. In AAAI, pp. 4058–4065. AAAI Press, 2018.
 Tzeng et al. (2017) Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. Adversarial discriminative domain adaptation. In Computer Vision and Pattern Recognition (CVPR), volume 1, pp. 4, 2017.
 Villani (2008) Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
 Yan et al. (2017) Yan, H., Ding, Y., Li, P., Wang, Q., Xu, Y., and Zuo, W. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 945–954, 2017.
 Yu & Szepesvári (2012) Yu, Y. and Szepesvári, C. Analysis of kernel mean matching under covariate shift. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012.
 Yu & Zhou (2018) Yu, Y. and Zhou, W.-J. Mixture of GANs for clustering. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 3047–3053. International Joint Conferences on Artificial Intelligence Organization, 2018.
Appendix
Appendix A Architecture and hyperparameters
Implementation details, including model architectures and hyperparameters, are presented in this section.
a.1 Mixture models for Generative Adversarial Networks (GANs)
a.1.1 Mixture of Gaussians
As discussed in Section 3.1 of the main paper, the input dataset is a mixture of Gaussians with varying mode proportions. The normalized Wasserstein GAN was trained with a linear generator and a nonlinear discriminator using the architectures and hyperparameters presented in Table 8. The architecture used for training the vanilla WGAN is provided in Table 9. The same architecture is used for MGAN; however, we do not use the nonlinearities in the generator (making the generator affine so that the model is comparable to ours). For WGAN and MGAN, we use the hyperparameter settings provided in the respective papers (Gulrajani et al., 2017; Hoang et al., 2018).
Generator  Discriminator 

Linear()  Linear() 
Linear()  LeakyReLU(0.2) 
Linear()  Linear() 
Linear()  LeakyReLU(0.2) 
Linear()  
Hyperparameters  
Discriminator learning rate  
Generator learning rate  
learning rate  
Batch size  
Optimizer  RMSProp 
Number of critic iters  
Weight clip 
Generator  Discriminator 

Linear() + ReLU  Linear() + ReLU 
Linear() + ReLU  Linear() + ReLU 
Linear() + ReLU  Linear() + ReLU 
Linear()  Linear() 
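The normalized Wasserstein GAN above models the generated distribution as a mixture of generators with trainable mode proportions. As a rough, self-contained sketch of sampling from such a mixture (the helper name and the two toy component "generators" are ours for illustration, not the paper's code):

```python
import random

def sample_mixture(generators, pi):
    """Draw one sample: pick component i with probability pi[i], then sample G_i."""
    i = random.choices(range(len(generators)), weights=pi, k=1)[0]
    return generators[i]()

# Two toy "generators" standing in for the affine generator networks;
# in the actual model each would map a latent draw z through G_i.
g0 = lambda: 0.0
g1 = lambda: 1.0
sample = sample_mixture([g0, g1], pi=[0.75, 0.25])
```

In the GAN itself, the proportions pi are learned jointly with the generators rather than fixed as here.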
a.1.2 CIFAR10 + CelebA
To train models on the CIFAR10 + CelebA dataset (Section 3.2 of the main paper), we used the Resnet architectures of WGAN-GP (Gulrajani et al., 2017) with the same hyperparameter configuration for the generator and discriminator networks. In the normalized WGAN, the learning rate of the mode proportion was times the learning rate of the discriminator.
a.2 Domain adaptation for mixture distributions
a.2.1 Digit classification
For the MNIST → MNIST-M experiments (Section 4.1.1 of the main paper), following (Ganin & Lempitsky, 2015), a modified LeNet architecture was used for the feature network, and an MLP was used for the domain classifier. The architectures and hyperparameters used in our method are given in Table 10. The same architectures are used for the compared approaches: source only, DANN, and Wasserstein.
Feature network  

Conv(, kernel) + ReLU + MaxPool(2)  
Conv(, kernel) + ReLU + MaxPool(2)  
Domain discriminator  Classifier 
Linear() + ReLU  Linear() + ReLU 
Linear()  Linear() + ReLU 
Linear()  
Hyperparameters  
Feature network learning rate  
Discriminator learning rate  
Classifier learning rate  
learning rate  
Batch size  
Optimizer  Adam 
Number of critic iters  
Weight clipping value  
a.2.2 VISDA
For the experiments on VISDA dataset with three classes (Section 4.1.2 of the main paper), the architectures and hyperparameters used in our method are given in Table 11. The same architectures are used for the compared approaches: source only, Wasserstein and DANN.
Feature network  

Resnet18 model pretrained on ImageNet  
till the penultimate layer  
Domain discriminator  Classifier 
Linear() + LeakyReLU()  Linear() 
Linear() + LeakyReLU()  
Linear() + LeakyReLU()  
Linear()  
Hyperparameters  
Feature network learning rate  
Discriminator learning rate  
Classifier learning rate  
learning rate  
Batch size  
Optimizer  Adam 
Number of critic iters  
Weight clipping value  
a.2.3 Domain adaptation for image denoising
The architectures and hyperparameters used in our method for image denoising experiment (Section 4.2 of the main paper) are presented in Table 12. To perform adaptation using Normalized Wasserstein distance, we need to train the intermediate distributions and (as discussed in Section 2, 4.2 of the main paper). We denote the generator and discriminator models corresponding to and as Generator (RW) and Discriminator (RW) respectively. In practice, we noticed that the Generator (RW) and Discriminator (RW) models need to be trained for a certain number of iterations first (which we call initial iterations) before performing adaptation. So, for these initial iterations, we set the adaptation parameter as . Note that the encoder, decoder, generator (RW) and discriminator (RW) models are trained during this phase, but the adaptation is not performed. After these initial iterations, we turn the adaptation term on. The hyperparameters and model architectures are given in Table 12. The same architectures are used for Source only and Wasserstein.
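The warm-up described above can be expressed as a simple schedule on the adaptation weight. This is a minimal sketch assuming the weight is held at zero for the initial iterations and set to a fixed value afterwards; the names `adaptation_weight`, `initial_iters`, and `lam` are ours, not from the paper's code:

```python
def adaptation_weight(step, initial_iters, lam=1.0):
    """Adaptation term weight: 0 while the Generator (RW) and Discriminator (RW)
    models warm up, then the full weight lam once adaptation is switched on."""
    return 0.0 if step < initial_iters else lam

# At each training step the total objective would then look roughly like:
#   loss = reconstruction_loss + adaptation_weight(step, initial_iters) * nw_distance
```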
Encoder  Decoder 

Conv(, kernel)  Linear() 
+ReLU + MaxPool(2)  Conv(, kernel) 
Conv(, kernel)  + ReLU + Upsample(2) 
+ReLU + MaxPool(2)  Conv(, kernel) 
Conv(, kernel)  + ReLU + Upsample(4) 
+ReLU + MaxPool(2)  Conv(, kernel) 
Conv(, kernel)  
Linear()  
Domain discriminator  
Linear() + ReLU  
Linear() + ReLU  
Linear()  
Generator (RW)  Discriminator (RW) 
Linear()  Linear() + ReLU 
Linear()  Linear() + ReLU 
Linear()  Linear() 
Hyperparameters  
Encoder learning rate  
Decoder learning rate  
Domain Discriminator learning rate  
Generator (RW) learning rate  
Discriminator (RW) learning rate  
learning rate  
Batch size  
Optimizer  Adam 
Number of critic iters  
Initial iters  
Weight clipping value  
a.3 Adversarial clustering
For adversarial clustering on the imbalanced MNIST dataset (Section 5 of the main paper), the architectures and hyperparameters used are given in Table 13.
Generator  Discriminator 

ConvTranspose(, kernel, stride 1)  Spectralnorm(Conv(, kernel, stride 2)) 
Batchnorm + ReLU  LeakyReLU(0.2) 
ConvTranspose(, kernel, stride 2)  Spectralnorm(Conv(, kernel, stride 2) 
Batchnorm + ReLU  LeakyReLU(0.2) 
ConvTranspose(, kernel, stride 2)  Spectralnorm(Conv(, kernel, stride 2) 
Batchnorm + ReLU  LeakyReLU(0.2) 
ConvTranspose(, kernel, stride 2)  Spectralnorm(Conv(, kernel, stride 1) 
Tanh()  
Hyperparameters  
Discriminator learning rate  
Generator learning rate  
learning rate  
Batch size  
Optimizer  RMSProp 
Number of critic iters  
Weight clip  
a.4 Hypothesis testing
For the hypothesis testing experiment (Section 6 of the main paper), the same model architectures and hyperparameters as in the MoG experiment (Table 8) were used.
Appendix B Learning
Computing the normalized Wasserstein distance requires optimizing over G, π^(1), and π^(2). π^(1) and π^(2) are k-dimensional vectors in the probability simplex. Hence, the following constraints have to be satisfied while optimizing them:
π^(i)_j ≥ 0 for all j, and Σ_j π^(i)_j = 1, for i ∈ {1, 2}.
To enforce these constraints in an end-to-end optimization, we use the softmax function, π^(i)_j = exp(s^(i)_j) / Σ_l exp(s^(i)_l).
The new unconstrained variables s^(1) and s^(2) become the optimization variables. The softmax function ensures that the mixture probabilities π^(1) and π^(2) lie in the simplex.
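A minimal pure-Python sketch of this re-parametrization (variable names are ours): unconstrained scores are mapped through a softmax so the resulting proportions are nonnegative and sum to one, and can therefore be updated with ordinary gradient steps on the scores.

```python
import math

def softmax(scores):
    """Map unconstrained scores to mixture proportions in the probability simplex."""
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# The score vectors (one per distribution) are the actual optimization
# variables; the proportions are recovered via softmax.
s = [0.0, 1.0, -1.0]
pi = softmax(s)
```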
Appendix C Additional results
c.1 Cifar10
We present results of training the normalized Wasserstein GAN on the CIFAR10 dataset. We use WGAN-GP (Gulrajani et al., 2017) with Resnet-based generator and discriminator models as our baseline. The proposed NWGAN was trained with modes using the same network architectures as the baseline. Sample generations produced by each mode of the NWGAN are shown in Figure 5. We observe that each generator captures distinct variations of the dataset, thereby approximately disentangling different modes in the input images. For quantitative evaluation, we compute inception scores for the baseline and the proposed NWGAN. The inception score for the baseline model is , whereas our model achieved an improved score of .
Config  3 modes  5 modes  10 modes 

Classes  
Proportion of source samples  
Proportion of target samples 
c.2 Domain adaptation under uniform mode proportions
In all experiments performed in the main paper, the source and the target domains varied in mode proportions. Our experiments suggest that minimizing the NW distance helps resolve the negative transfer observed when minimizing the Wasserstein (or adversarial) distance on such data distributions. But what happens when the source and the target distributions have uniform mode proportions? In fact, this is the setting considered in most previous unsupervised domain adaptation papers (Ganin & Lempitsky, 2015; Tzeng et al., 2017; Long et al., 2016; Shen et al., 2018), the one where classical distance measures have been shown to be successful. We intend to study the behavior of the NW distance in this setting.
Two adaptation experiments are considered: (1) MNIST → MNIST-M, with all classes in uniform mode proportions, and (2) synthetic-to-real adaptation on VISDA, where the source and target domains contain three classes (aeroplane, horse, and truck) in uniform mode proportions. The results of performing adaptation using the NW distance, in comparison with classical distance measures, are reported in Tables 15 and 16. We observe that the NW distance performs on par with the compared methods on both datasets. This experiment demonstrates the effectiveness of the NW distance across a range of settings: when the source and target datasets are balanced in mode proportions, NW becomes equivalent to the Wasserstein distance, and minimizing it is no worse than minimizing the classical distance measures. On the other hand, when the mode proportions of the source and target domains differ, the NW distance renormalizes the mode proportions and effectively performs domain adaptation. This illustrates the usefulness of the NW distance in domain adaptation problems.
Method  Classification accuracy (in %) 

Source only  60.22 
DANN  85.24 
Wasserstein  83.47 
Normalized Wasserstein  84.16 
Method  Classification accuracy (in %) 

Source only  63.24 
DANN  84.71 
Wasserstein  90.08 
Normalized Wasserstein  90.72 
c.3 Adversarial clustering: Quantitative metrics

Cluster purity: Cluster purity measures the extent to which clusters are consistent, i.e., whether each cluster contains similar points. To compute the cluster purity, the cardinality of the majority class is computed for each cluster, summed over all clusters, and divided by the total number of samples.

ARI - Adjusted Rand Index: The Rand index computes a similarity measure between two clusterings by considering all pairs of samples and counting the pairs that receive the same assignment in both the ground-truth and the predicted clusterings. The Adjusted Rand Index corrects this score for chance, so that random assignments score close to 0 and a perfect match scores 1.

NMI  Normalized Mutual Information: NMI is the normalized version of the mutual information between the predicted and the ground truth cluster assignments.
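As an illustration of the purity computation described above, here is a minimal pure-Python sketch (the function name is ours); ARI and NMI are typically computed with a library such as scikit-learn rather than by hand.

```python
from collections import Counter, defaultdict

def cluster_purity(true_labels, cluster_ids):
    """Sum the majority-class count of each cluster, divide by the sample count."""
    clusters = defaultdict(list)
    for label, cid in zip(true_labels, cluster_ids):
        clusters[cid].append(label)
    majority = sum(Counter(members).most_common(1)[0][1]
                   for members in clusters.values())
    return majority / len(true_labels)

# Example: cluster 0 holds labels {0, 0, 1} (majority count 2),
# cluster 1 holds labels {1, 2} (majority count 1) -> purity (2 + 1) / 5 = 0.6
purity = cluster_purity([0, 0, 1, 1, 2], [0, 0, 0, 1, 1])
```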
c.4 Adversarial clustering of CIFAR+CelebA
In this section, we show the results of performing adversarial clustering on a mixture of the CIFAR10 and CelebA datasets. The same dataset presented in Section 3.2 of the main paper is used in this experiment, i.e., the dataset contains CIFAR10 and CelebA samples in mode proportion. NWGAN was trained with modes, each employing Resnet-based generator and discriminator architectures (the same architectures and hyperparameters used in Section 3.2 of the main paper). A quantitative evaluation of our approach in comparison with k-means is given in Table 17. We observe that our approach outperforms k-means clustering. However, the clustering quality is poorer than that obtained on the imbalanced MNIST dataset. This is because the samples generated on the MNIST dataset had much better quality than those produced on CIFAR10. Thus, as long as the underlying GAN model produces good generations, our adversarial clustering algorithm performs well.
Method  Cluster Purity  NMI  ARI 

k-means  0.667  0.038  0.049 
Normalized Wasserstein  0.870  0.505  0.547 