Loss Functions
Flux provides a large number of common loss functions used for training machine learning models.
Loss functions for supervised learning typically expect as inputs a target y and a prediction ŷ. In Flux's convention, the order of the arguments is the following:
loss(ŷ, y)
Most loss functions in Flux have an optional argument agg, denoting the type of aggregation performed over the batch:
loss(ŷ, y) # defaults to `mean`
loss(ŷ, y, agg=sum) # use `sum` for reduction
loss(ŷ, y, agg=x->sum(x, dims=2)) # partial reduction
loss(ŷ, y, agg=x->mean(w .* x)) # weighted mean
loss(ŷ, y, agg=identity) # no aggregation.
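For instance, a minimal sketch of these aggregation options using the mse loss documented below (the array values are arbitrary, and Statistics.mean is imported for the custom aggregation):
using Flux
using Statistics: mean

ŷ = [0.9 0.2; 0.1 0.8]                   # predictions, one column per sample
y = [1.0 0.0; 0.0 1.0]                   # targets

Flux.mse(ŷ, y)                           # scalar, averaged over all elements
Flux.mse(ŷ, y, agg=sum)                  # scalar, summed instead of averaged
Flux.mse(ŷ, y, agg=x->mean(x, dims=2))   # one value per row, partially reduced
Flux.mse(ŷ, y, agg=identity)             # elementwise losses, no reduction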
Losses Reference
Flux.mae — Function
mae(ŷ, y; agg=mean)
Return the loss corresponding to mean absolute error:
agg(abs.(ŷ .- y))
Flux.mse — Function
mse(ŷ, y; agg=mean)
Return the loss corresponding to mean square error:
agg((ŷ .- y).^2)
Flux.msle — Function
msle(ŷ, y; agg=mean, ϵ=eps(eltype(ŷ)))
The loss corresponding to mean squared logarithmic errors, calculated as
agg((log.(ŷ .+ ϵ) .- log.(y .+ ϵ)).^2)
The ϵ term provides numerical stability. This loss penalizes an under-predicted estimate more than an over-predicted one.
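A small numeric sketch of this asymmetry, assuming the msle signature above (the values are arbitrary):
using Flux

y = [1.0]
Flux.msle([0.5], y)   # under-prediction: (log(0.5) - log(1))^2 ≈ 0.48
Flux.msle([1.5], y)   # over-prediction:  (log(1.5) - log(1))^2 ≈ 0.16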
Flux.huber_loss — Function
huber_loss(ŷ, y; δ=1, agg=mean)
Return the Huber loss given the prediction ŷ and true values y, aggregated with agg over the batch:
             | 0.5 * |ŷ - y|^2,           for |ŷ - y| <= δ
Huber loss = |
             | δ * (|ŷ - y| - 0.5 * δ),   otherwise
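For instance, with the default δ = 1 (a minimal sketch; the residual values are arbitrary):
using Flux

y = [0.0]
Flux.huber_loss([0.5], y)   # |residual| <= δ: quadratic region, 0.5 * 0.5^2 = 0.125
Flux.huber_loss([3.0], y)   # |residual| > δ:  linear region, 1 * (3 - 0.5) = 2.5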
Flux.crossentropy — Function
crossentropy(ŷ, y; weight=nothing, dims=1, ϵ=eps(eltype(ŷ)), logits=false, agg=mean)
Return the cross entropy between the given probability distributions; calculated as
agg(.-sum(weight .* y .* log.(ŷ .+ ϵ); dims=dims))
weight can be nothing, a number or an array. weight=nothing acts like weight=1 but is faster.
If logits=true, the input ŷ is first fed to a softmax layer.
See also: Flux.logitcrossentropy, Flux.binarycrossentropy, Flux.logitbinarycrossentropy
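A usage sketch with one-hot targets and softmax probabilities (the random inputs are arbitrary; Flux.onehotbatch and softmax are assumed to be in scope as usual with Flux):
using Flux

y = Flux.onehotbatch([1, 2, 3], 1:3)   # true classes as one-hot columns
ŷ = softmax(randn(Float32, 3, 3))      # predicted probabilities, columns sum to 1

Flux.crossentropy(ŷ, y)                # mean loss over the batch
Flux.crossentropy(ŷ, y, agg=sum)       # total loss over the batch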
Flux.logitcrossentropy — Function
logitcrossentropy(ŷ, y; weight=nothing, agg=mean, dims=1)
Return the cross entropy computed after a Flux.logsoftmax operation; calculated as
agg(.-sum(weight .* y .* logsoftmax(ŷ; dims=dims); dims=dims))
logitcrossentropy(ŷ, y) is mathematically equivalent to Flux.crossentropy(softmax(ŷ), y), but it is more numerically stable.
See also: Flux.crossentropy, Flux.binarycrossentropy, Flux.logitbinarycrossentropy
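A sketch of this equivalence on raw, unnormalised scores (the values are arbitrary):
using Flux

y = Flux.onehotbatch([1, 2], 1:3)
ŷ = randn(Float32, 3, 2)               # raw, unnormalised scores ("logits")

Flux.logitcrossentropy(ŷ, y)           # computed via logsoftmax, numerically stable
Flux.crossentropy(softmax(ŷ), y)       # same value (up to ϵ), less stable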
Flux.binarycrossentropy — Function
binarycrossentropy(ŷ, y; agg=mean, ϵ=epseltype(ŷ), logits=false)
Return $-y*\log(ŷ + ϵ) - (1-y)*\log(1-ŷ + ϵ)$. The ϵ term provides numerical stability.
Typically, the prediction ŷ is given by the output of a sigmoid activation. If logits=true, the input ŷ is first fed to a sigmoid activation.
See also: Flux.crossentropy, Flux.logitcrossentropy, Flux.logitbinarycrossentropy
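For example, a minimal sketch with probabilities produced by the sigmoid σ exported by Flux (the values are arbitrary):
using Flux

y = [1, 0, 1]                   # binary targets
ŷ = σ.([2.0, -1.0, 0.5])        # probabilities from a sigmoid activation

Flux.binarycrossentropy(ŷ, y)   # mean over the three samples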
Flux.logitbinarycrossentropy — Function
logitbinarycrossentropy(ŷ, y; agg=mean)
logitbinarycrossentropy(ŷ, y) is mathematically equivalent to Flux.binarycrossentropy(σ(ŷ), y), but it is more numerically stable.
See also: Flux.crossentropy, Flux.logitcrossentropy, Flux.binarycrossentropy
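A sketch of this equivalence on raw scores (the values are arbitrary):
using Flux

y = [1, 0, 1]
ŷ = [2.0, -1.0, 0.5]                     # raw scores, no sigmoid applied

Flux.logitbinarycrossentropy(ŷ, y)       # numerically stable
Flux.binarycrossentropy(σ.(ŷ), y)        # same value (up to ϵ), less stable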
Flux.kldivergence — Function
kldivergence(ŷ, y; dims=1, agg=mean, ϵ=eps(eltype(ŷ)))
Return the Kullback-Leibler divergence between the given arrays interpreted as probability distributions.
KL divergence is a measure of how much one probability distribution differs from another. It is always non-negative, and zero only when both distributions are equal everywhere.
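A small sketch with two discrete distributions (each should sum to 1; the values are arbitrary):
using Flux

y = [0.1, 0.4, 0.5]          # target distribution
ŷ = [0.3, 0.3, 0.4]          # predicted distribution

Flux.kldivergence(ŷ, y)      # positive: the distributions differ
Flux.kldivergence(y, y)      # ≈ 0: identical distributions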
Flux.poisson_loss — Function
poisson_loss(ŷ, y; agg=mean, ϵ=eps(eltype(ŷ)))
Return the loss derived from the likelihood of a Poisson random variable with mean ŷ taking the value y. It is calculated as
agg(ŷ .- y .* log.(ŷ .+ ϵ))
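A minimal sketch with predicted Poisson means and observed counts (the values are arbitrary):
using Flux

y = [1.0, 0.0, 3.0]          # observed counts
ŷ = [0.8, 0.2, 2.5]          # predicted means

Flux.poisson_loss(ŷ, y)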
Flux.hinge — Function
hinge(ŷ, y; agg=mean)
Return the hinge loss given the prediction ŷ and true labels y (containing 1 or -1); calculated as
agg(max.(0, 1 .- ŷ .* y))
See also: squared_hinge
Flux.squared_hinge — Function
squared_hinge(ŷ, y; agg=mean)
Return the squared hinge loss given the prediction ŷ and true labels y (containing 1 or -1); calculated as
agg(max.(0, 1 .- ŷ .* y).^2)
See also: hinge
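A small sketch comparing the two hinge variants, with labels in {-1, 1} (the values are arbitrary):
using Flux

y = [1, -1, 1]               # true labels, +1 or -1
ŷ = [0.7, 0.4, -0.2]         # raw predictions

Flux.hinge(ŷ, y)             # mean of max(0, 1 - ŷ*y)
Flux.squared_hinge(ŷ, y)     # mean of max(0, 1 - ŷ*y)^2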
Flux.dice_coeff_loss — Function
dice_coeff_loss(ŷ, y; smooth=1, dims=size(ŷ)[1:end-1], agg=mean)
Return a loss based on the Dice coefficient, as used in the V-Net architecture for image segmentation. The current implementation only works for the binary segmentation case.
The arrays ŷ and y contain the predicted and true probabilities, respectively, for the foreground to be present in a certain pixel. The loss is computed as
1 - (2*sum(ŷ .* y; dims) .+ smooth) ./ (sum(ŷ.^2 .+ y.^2; dims) .+ smooth)
and then aggregated with agg over the batch.
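A minimal sketch for a single 4-pixel image with one sample in the batch (the foreground probabilities are arbitrary):
using Flux

ŷ = reshape([0.9, 0.1, 0.8, 0.2], 4, 1)   # predicted foreground probabilities
y = reshape([1.0, 0.0, 1.0, 0.0], 4, 1)   # ground-truth mask

Flux.dice_coeff_loss(ŷ, y)   # small for good overlap, close to 1 for none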
Flux.tversky_loss — Function
tversky_loss(ŷ, y; β=0.7, α=1-β, dims=size(ŷ)[1:end-1], agg=mean)
Return the Tversky loss for binary classification. The arrays ŷ and y contain the predicted and true probabilities, respectively. This loss is used with imbalanced data to give more weight to false negatives. A larger β weighs recall higher than precision (by placing more emphasis on false negatives). Calculated as:
num = sum(y .* ŷ, dims=dims)
den = sum(@.(ŷ*y + α*ŷ*(1-y) + β*(1-ŷ)*y), dims=dims)
tversky_loss = 1 - num/den
and then aggregated with agg over the batch.
When α+β=1, it is equal to 1-F_β, where F_β is an F-score.
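A sketch with the same kind of binary-mask inputs, varying β (the values are arbitrary):
using Flux

ŷ = reshape([0.9, 0.1, 0.8, 0.2], 4, 1)   # predicted foreground probabilities
y = reshape([1.0, 0.0, 1.0, 0.0], 4, 1)   # ground-truth mask

Flux.tversky_loss(ŷ, y)            # default β = 0.7 puts more weight on false negatives
Flux.tversky_loss(ŷ, y, β=0.3)     # smaller β puts more weight on false positives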