Algorithms implemented
######################
Most methods aim to learn a common representation space for the source and target domains, splitting the classical
end-to-end deep neural network into a feature extractor with parameters :math:`\Phi` and a task classifier with parameters :math:`\theta_y`.
Alignment between the source and target feature distributions is obtained by adding an alignment term :math:`L_d` to the usual task loss :math:`L_c`:

.. math::

   L = L_c + \lambda \cdot L_d
The weight of the alignment term is controlled by a parameter :math:`\lambda`, which grows from 0 to 1 during training. Some algorithms use a third network, with parameters :math:`\theta_d`,
to parameterize the alignment term :math:`L_d`.
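For reference, the DANN paper schedules :math:`\lambda` as :math:`\lambda_p = \frac{2}{1 + \exp(-\gamma p)} - 1` with :math:`\gamma = 10`, where :math:`p` is the fraction of training completed. A minimal sketch of such a schedule (the function name is illustrative, not ADA's API):

.. code-block:: python

    import numpy as np

    def lambda_schedule(progress, gamma=10.0):
        """Grow lambda smoothly from 0 to 1 as training progresses.

        This is the schedule from the DANN paper (gamma = 10 there);
        `progress` is the fraction of training completed, in [0, 1].
        """
        return 2.0 / (1.0 + np.exp(-gamma * progress)) - 1.0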
A typical algorithm may thus be represented with 2 or 3 blocks as in the figure below:
.. image:: images/ada_blocks.png
Three types of alignment terms are implemented in ADA, leading to 3 families of methods:

1. Adversarial methods, similar to DANN, use a so-called domain classifier with parameters :math:`\theta_d` as an adversary to align the features,
2. Optimal-transport-based methods, in which the domain classifier, called a critic, is trained to estimate the Wasserstein distance between the source and target feature distributions, which the feature extractor then learns to minimize,
3. Kernel-based methods, which minimize the maximum mean discrepancy (MMD) between the source and target features in a kernel space.
DANN-like methods
-----------------
The common part of these methods is that they all use a gradient-reversal layer (GRL), as described in the DANN paper; a minimal sketch of such a layer is given after the list below.
- DANN: Ganin, Yaroslav, et al. “Domain-adversarial training of neural networks.” The Journal of Machine Learning Research (2016). https://arxiv.org/abs/1505.07818
- CDAN: Long, Mingsheng, et al. “Conditional adversarial domain adaptation.” Advances in Neural Information Processing Systems. 2018. https://papers.nips.cc/paper/7436-conditional-adversarial-domain-adaptation.pdf
  Its variant CDAN-E adds entropy weighting.
- FSDANN: a naive adaptation of DANN to the few-shot setting, which uses the known target labels in the task loss.
- MME: Saito, Kuniaki, et al. “Semi-supervised domain adaptation via minimax entropy.” Proceedings of the IEEE International Conference on Computer Vision. 2019. https://arxiv.org/pdf/1904.06487.pdf
  This method applies the GRL to the entropy of the task classifier output for target samples.
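A minimal PyTorch sketch of a gradient-reversal layer follows (``GradReverse`` and ``grad_reverse`` are illustrative names, not ADA's API). The layer is the identity in the forward pass and multiplies gradients by :math:`-\lambda` in the backward pass, so the feature extractor learns to fool the domain classifier while the domain classifier learns to tell domains apart:

.. code-block:: python

    from torch.autograd import Function

    class GradReverse(Function):
        """Identity in the forward pass; scales gradients by -lambda going backward."""

        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            # Reverse (and scale) the gradient flowing back into the feature extractor.
            return grad_output.neg() * ctx.lambd, None

    def grad_reverse(x, lambd=1.0):
        return GradReverse.apply(x, lambd)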
Optimal transport methods
-------------------------
Currently WDGRL is implemented, as described in Shen, Jian, et al. “Wasserstein distance guided representation learning for domain adaptation.” Thirty-Second AAAI Conference on Artificial Intelligence. 2018. https://arxiv.org/pdf/1707.01217.pdf
Its variant WDGRLMod fits the pytorch-lightning patterns better: the critic is optimized on ``k_critic`` different batches instead of ``k_critic`` times on the same batch.
When the ``beta_ratio`` parameter is non-zero, both methods also implement the asymmetric (:math:`\beta`) variant described in:
Wu, Yifan, et al. “Domain adaptation with asymmetrically-relaxed distribution alignment.” ICML (2019). https://arxiv.org/pdf/1903.01689.pdf
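As a rough sketch of the critic step (the helper names below are illustrative, not ADA's API): the critic is trained to maximize the empirical Wasserstein dual objective, subject to a gradient penalty approximately enforcing its 1-Lipschitz constraint, while the feature extractor is trained to minimize the resulting distance estimate:

.. code-block:: python

    import torch

    def critic_objective(critic, h_source, h_target):
        """Empirical dual objective: maximized by the critic, minimized by the
        feature extractor (batches are assumed to have equal size)."""
        return critic(h_source).mean() - critic(h_target).mean()

    def gradient_penalty(critic, h_source, h_target):
        """WGAN-GP style penalty pushing the critic towards 1-Lipschitz,
        evaluated on random interpolates between source and target features."""
        alpha = torch.rand(h_source.size(0), 1, device=h_source.device)
        interpolates = (alpha * h_source + (1 - alpha) * h_target).requires_grad_(True)
        grads = torch.autograd.grad(
            critic(interpolates).sum(), interpolates, create_graph=True
        )[0]
        return ((grads.norm(2, dim=1) - 1) ** 2).mean()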
MMD-based methods
-----------------
- DAN: Long, Mingsheng, et al. “Learning Transferable Features with Deep Adaptation Networks.” International Conference on Machine Learning. 2015. http://proceedings.mlr.press/v37/long15.pdf
- JAN: Long, Mingsheng, et al. “Deep transfer learning with joint adaptation networks.” International Conference on Machine Learning. 2017. https://arxiv.org/pdf/1605.06636.pdf

Both these methods were implemented based on the authors' code at https://github.com/thuml/Xlearn.
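For illustration, here is a single-Gaussian-kernel estimate of the squared MMD between source and target features. This is a simplified sketch: DAN uses a multi-kernel variant and JAN a joint-distribution variant, and neither function below is ADA's API:

.. code-block:: python

    import torch

    def gaussian_kernel(x, y, sigma=1.0):
        """Gaussian (RBF) kernel matrix between two batches of feature vectors."""
        sq_dists = torch.cdist(x, y) ** 2
        return torch.exp(-sq_dists / (2 * sigma ** 2))

    def mmd2(h_source, h_target, sigma=1.0):
        """Biased empirical estimate of the squared maximum mean discrepancy."""
        k_ss = gaussian_kernel(h_source, h_source, sigma).mean()
        k_tt = gaussian_kernel(h_target, h_target, sigma).mean()
        k_st = gaussian_kernel(h_source, h_target, sigma).mean()
        return k_ss + k_tt - 2 * k_st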
Implementation
--------------
The classes for the unsupervised domain adaptation architectures are organised as follows, where each arrow denotes inheritance:
.. image:: images/ada_architecture_models.png
Most methods may be implemented by writing just the forward pass and the ``compute_loss`` method, which should return the
two components :math:`L_c` and :math:`L_d`, as well as the metrics to use for logging and evaluation, as in the sketch below.
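As a hypothetical sketch of how a new method could plug into this hierarchy (the base class name, attribute names, batch structure, and ``some_alignment_term`` helper are all assumptions for illustration, not ADA's actual API):

.. code-block:: python

    import torch.nn.functional as F

    class MyMethod(BaseAdaptTrainer):  # hypothetical base class name
        def forward(self, x):
            # feature extractor (parameters Phi) then task classifier (parameters theta_y)
            return self.classifier(self.feat(x))

        def compute_loss(self, batch, split_name="V"):
            (x_source, y_source), (x_target, _) = batch  # labelled source, unlabelled target
            task_loss = F.cross_entropy(self.forward(x_source), y_source)  # L_c
            align_loss = some_alignment_term(self.feat(x_source), self.feat(x_target))  # L_d
            log_metrics = {f"{split_name}_task_loss": task_loss}
            return task_loss, align_loss, log_metrics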