Algorithms implemented
######################
Most methods aim to learn a common representation space for the source and target domains, splitting the classical
end-to-end deep neural network into a feature extractor with parameters :math:`\Phi` and a task classifier with parameters :math:`\theta_y`.
Alignment between the source and target feature distributions is obtained by adding an alignment term :math:`L_d` to the usual task loss :math:`L_c`:

.. math::

   L = L_c + \lambda \cdot L_d
The weight of the alignment term is controlled by a parameter :math:`\lambda`, which grows from 0 to 1 during training. Some algorithms use a third network, with parameters :math:`\theta_d`,
to parameterize the alignment term :math:`L_d`.
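For reference, the DANN paper schedules :math:`\lambda` as :math:`\lambda_p = \frac{2}{1 + \exp(-\gamma p)} - 1` with :math:`\gamma = 10`, where :math:`p` is the fraction of training completed. A minimal sketch of such a schedule (the function name is illustrative, not ADA's API):

.. code-block:: python

    import numpy as np

    def lambda_schedule(progress, gamma=10.0):
        """Grow lambda smoothly from 0 to 1 as training progresses.

        This is the schedule from the DANN paper (gamma = 10 there);
        `progress` is the fraction of training completed, in [0, 1].
        """
        return 2.0 / (1.0 + np.exp(-gamma * progress)) - 1.0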
A typical algorithm may thus be represented with 2 or 3 blocks as in the figure below:
.. image:: images/ada_blocks.png
Three types of alignment terms are implemented in ADA, leading to 3 families of methods:

1. Adversarial methods, similar to DANN, use a so-called domain classifier with parameters :math:`\theta_d` as an adversary to align the features,
2. Optimal-transport-based methods, in which the domain classifier, called a critic, is trained to estimate the Wasserstein distance between the source and target feature distributions, which the feature extractor then learns to minimize,
3. Kernel-based methods, which minimize the maximum mean discrepancy (MMD) between the source and target features in a kernel space.
DANN-like methods
-----------------
The common part of these methods is that they all use a gradient-reversal layer (GRL), as described in the DANN paper; a minimal sketch of such a layer is given after the list below.
- DANN: Ganin, Yaroslav, et al. “Domain-adversarial training of neural networks.” The Journal of Machine Learning Research (2016). https://arxiv.org/abs/1505.07818
- CDAN: Long, Mingsheng, et al. “Conditional adversarial domain adaptation.” Advances in Neural Information Processing Systems. 2018. https://papers.nips.cc/paper/7436-conditional-adversarial-domain-adaptation.pdf
  Its variant CDAN-E adds entropy weighting.
- FSDANN: a naive adaptation of DANN to the few-shot setting, which uses the known target labels in the task loss.
- MME: Saito, Kuniaki, et al. “Semi-supervised domain adaptation via minimax entropy.” Proceedings of the IEEE International Conference on Computer Vision. 2019. https://arxiv.org/pdf/1904.06487.pdf
  This method applies the GRL to the entropy of the task classifier output for target samples.
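A minimal PyTorch sketch of a gradient-reversal layer follows (``GradReverse`` and ``grad_reverse`` are illustrative names, not ADA's API). The layer is the identity in the forward pass and multiplies gradients by :math:`-\lambda` in the backward pass, so the feature extractor learns to fool the domain classifier while the domain classifier learns to tell domains apart:

.. code-block:: python

    from torch.autograd import Function

    class GradReverse(Function):
        """Identity in the forward pass; scales gradients by -lambda going backward."""

        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            # Reverse (and scale) the gradient flowing back into the feature extractor.
            return grad_output.neg() * ctx.lambd, None

    def grad_reverse(x, lambd=1.0):
        return GradReverse.apply(x, lambd)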
Optimal transport methods
-------------------------
Currently WDGRL is implemented, as described in Shen, Jian, et al. “Wasserstein distance guided representation learning for domain adaptation.” Thirty-Second AAAI Conference on Artificial Intelligence. 2018. https://arxiv.org/pdf/1707.01217.pdf
Its variant WDGRLMod fits the pytorch-lightning patterns better: the critic is optimized on ``k_critic`` different batches instead of ``k_critic`` times on the same batch.
When the ``beta_ratio`` parameter is non-zero, both methods also implement the asymmetric (:math:`\beta`) variant described in:
Wu, Yifan, et al. “Domain adaptation with asymmetrically-relaxed distribution alignment.” ICML (2019). https://arxiv.org/pdf/1903.01689.pdf
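As a rough sketch of the critic step (the helper names below are illustrative, not ADA's API): the critic is trained to maximize the empirical Wasserstein dual objective, subject to a gradient penalty approximately enforcing its 1-Lipschitz constraint, while the feature extractor is trained to minimize the resulting distance estimate:

.. code-block:: python

    import torch

    def critic_objective(critic, h_source, h_target):
        """Empirical dual objective: maximized by the critic, minimized by the
        feature extractor (batches are assumed to have equal size)."""
        return critic(h_source).mean() - critic(h_target).mean()

    def gradient_penalty(critic, h_source, h_target):
        """WGAN-GP style penalty pushing the critic towards 1-Lipschitz,
        evaluated on random interpolates between source and target features."""
        alpha = torch.rand(h_source.size(0), 1, device=h_source.device)
        interpolates = (alpha * h_source + (1 - alpha) * h_target).requires_grad_(True)
        grads = torch.autograd.grad(
            critic(interpolates).sum(), interpolates, create_graph=True
        )[0]
        return ((grads.norm(2, dim=1) - 1) ** 2).mean()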
MMD-based methods
-----------------
- DAN: Long, Mingsheng, et al. “Learning Transferable Features with Deep Adaptation Networks.” International Conference on Machine Learning. 2015. http://proceedings.mlr.press/v37/long15.pdf
- JAN: Long, Mingsheng, et al. “Deep transfer learning with joint adaptation networks.” International Conference on Machine Learning. 2017. https://arxiv.org/pdf/1605.06636.pdf

Both these methods were implemented based on the authors' code at https://github.com/thuml/Xlearn.
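For illustration, here is a single-Gaussian-kernel estimate of the squared MMD between source and target features. This is a simplified sketch: DAN uses a multi-kernel variant and JAN a joint-distribution variant, and neither function below is ADA's API:

.. code-block:: python

    import torch

    def gaussian_kernel(x, y, sigma=1.0):
        """Gaussian (RBF) kernel matrix between two batches of feature vectors."""
        sq_dists = torch.cdist(x, y) ** 2
        return torch.exp(-sq_dists / (2 * sigma ** 2))

    def mmd2(h_source, h_target, sigma=1.0):
        """Biased empirical estimate of the squared maximum mean discrepancy."""
        k_ss = gaussian_kernel(h_source, h_source, sigma).mean()
        k_tt = gaussian_kernel(h_target, h_target, sigma).mean()
        k_st = gaussian_kernel(h_source, h_target, sigma).mean()
        return k_ss + k_tt - 2 * k_st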
Implementation
--------------
The classes for the unsupervised domain adaptation architectures are organised as follows, where each arrow denotes inheritance:
.. image:: images/ada_architecture_models.png
Most methods may be implemented by writing just the forward pass and the ``compute_loss`` method, which should return the
two components :math:`L_c` and :math:`L_d`, as well as the metrics to use for logging and evaluation, as in the sketch below.
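As a hypothetical sketch of how a new method could plug into this hierarchy (the base class name, attribute names, batch structure, and ``some_alignment_term`` helper are all assumptions for illustration, not ADA's actual API):

.. code-block:: python

    import torch.nn.functional as F

    class MyMethod(BaseAdaptTrainer):  # hypothetical base class name
        def forward(self, x):
            # feature extractor (parameters Phi) then task classifier (parameters theta_y)
            return self.classifier(self.feat(x))

        def compute_loss(self, batch, split_name="V"):
            (x_source, y_source), (x_target, _) = batch  # labelled source, unlabelled target
            task_loss = F.cross_entropy(self.forward(x_source), y_source)  # L_c
            align_loss = some_alignment_term(self.feat(x_source), self.feat(x_target))  # L_d
            log_metrics = {f"{split_name}_task_loss": task_loss}
            return task_loss, align_loss, log_metrics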