Loss Functions
Flux provides a large number of common loss functions used for training machine learning models.
Loss functions for supervised learning typically expect as inputs a target y and a prediction ŷ. In Flux's convention, the order of the arguments is the following:
loss(ŷ, y)
Most loss functions in Flux have an optional argument agg, denoting the type of aggregation performed over the batch:
loss(ŷ, y) # defaults to `mean`
loss(ŷ, y, agg=sum) # use `sum` for reduction
loss(ŷ, y, agg=x->sum(x, dims=2)) # partial reduction
loss(ŷ, y, agg=x->mean(w .* x)) # weighted mean
loss(ŷ, y, agg=identity) # no aggregation.
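For instance, a minimal sketch of these aggregation options using the mse loss documented below (the array values are arbitrary, and Statistics.mean is imported for the custom aggregation):
using Flux
using Statistics: mean

ŷ = [0.9 0.2; 0.1 0.8]                   # predictions, one column per sample
y = [1.0 0.0; 0.0 1.0]                   # targets

Flux.mse(ŷ, y)                           # scalar, averaged over all elements
Flux.mse(ŷ, y, agg=sum)                  # scalar, summed instead of averaged
Flux.mse(ŷ, y, agg=x->mean(x, dims=2))   # one value per row, partially reduced
Flux.mse(ŷ, y, agg=identity)             # elementwise losses, no reduction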
Losses Reference
Flux.mae — Function
mae(ŷ, y; agg=mean)
Return the loss corresponding to mean absolute error:
agg(abs.(ŷ .- y))
Flux.mse — Function
mse(ŷ, y; agg=mean)
Return the loss corresponding to mean square error:
agg((ŷ .- y).^2)
Flux.msle — Function
msle(ŷ, y; agg=mean, ϵ=eps(eltype(ŷ)))
The loss corresponding to mean squared logarithmic errors, calculated as
agg((log.(ŷ .+ ϵ) .- log.(y .+ ϵ)).^2)
The ϵ term provides numerical stability. This loss penalizes an under-predicted estimate more than an over-predicted one.
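A small numeric sketch of this asymmetry, assuming the msle signature above (the values are arbitrary):
using Flux

y = [1.0]
Flux.msle([0.5], y)   # under-prediction: (log(0.5) - log(1))^2 ≈ 0.48
Flux.msle([1.5], y)   # over-prediction:  (log(1.5) - log(1))^2 ≈ 0.16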
Flux.huber_loss — Function
huber_loss(ŷ, y; δ=1, agg=mean)
Return the Huber loss given the prediction ŷ and true values y, aggregated with agg over the batch:
             | 0.5 * |ŷ - y|^2,           for |ŷ - y| <= δ
Huber loss = |
             | δ * (|ŷ - y| - 0.5 * δ),   otherwise
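For instance, with the default δ = 1 (a minimal sketch; the residual values are arbitrary):
using Flux

y = [0.0]
Flux.huber_loss([0.5], y)   # |residual| <= δ: quadratic region, 0.5 * 0.5^2 = 0.125
Flux.huber_loss([3.0], y)   # |residual| > δ:  linear region, 1 * (3 - 0.5) = 2.5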
Flux.crossentropy — Function
crossentropy(ŷ, y; weight=nothing, dims=1, ϵ=eps(eltype(ŷ)), logits=false, agg=mean)
Return the cross entropy between the given probability distributions; calculated as
agg(.-sum(weight .* y .* log.(ŷ .+ ϵ); dims=dims))
weight can be nothing, a number or an array. weight=nothing acts like weight=1 but is faster.
If logits=true, the input ŷ is first fed to a softmax layer.
See also: Flux.logitcrossentropy, Flux.binarycrossentropy, Flux.logitbinarycrossentropy
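A usage sketch with one-hot targets and softmax probabilities (the random inputs are arbitrary; Flux.onehotbatch and softmax are assumed to be in scope as usual with Flux):
using Flux

y = Flux.onehotbatch([1, 2, 3], 1:3)   # true classes as one-hot columns
ŷ = softmax(randn(Float32, 3, 3))      # predicted probabilities, columns sum to 1

Flux.crossentropy(ŷ, y)                # mean loss over the batch
Flux.crossentropy(ŷ, y, agg=sum)       # total loss over the batch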
Flux.logitcrossentropy — Function
logitcrossentropy(ŷ, y; weight=nothing, agg=mean, dims=1)
Return the cross entropy computed after a Flux.logsoftmax operation; calculated as
agg(.-sum(weight .* y .* logsoftmax(ŷ; dims=dims); dims=dims))
logitcrossentropy(ŷ, y) is mathematically equivalent to Flux.crossentropy(softmax(ŷ), y), but it is more numerically stable.
See also: Flux.crossentropy, Flux.binarycrossentropy, Flux.logitbinarycrossentropy
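A sketch of this equivalence on raw, unnormalised scores (the values are arbitrary):
using Flux

y = Flux.onehotbatch([1, 2], 1:3)
ŷ = randn(Float32, 3, 2)               # raw, unnormalised scores ("logits")

Flux.logitcrossentropy(ŷ, y)           # computed via logsoftmax, numerically stable
Flux.crossentropy(softmax(ŷ), y)       # same value (up to ϵ), less stable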
Flux.binarycrossentropy — Function
binarycrossentropy(ŷ, y; agg=mean, ϵ=epseltype(ŷ), logits=false)
Return $-y*\log(ŷ + ϵ) - (1-y)*\log(1-ŷ + ϵ)$. The ϵ term provides numerical stability.
Typically, the prediction ŷ is given by the output of a sigmoid activation. If logits=true, the input ŷ is first fed to a sigmoid activation.
See also: Flux.crossentropy, Flux.logitcrossentropy, Flux.logitbinarycrossentropy
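For example, a minimal sketch with probabilities produced by the sigmoid σ exported by Flux (the values are arbitrary):
using Flux

y = [1, 0, 1]                   # binary targets
ŷ = σ.([2.0, -1.0, 0.5])        # probabilities from a sigmoid activation

Flux.binarycrossentropy(ŷ, y)   # mean over the three samples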
Flux.logitbinarycrossentropy — Function
logitbinarycrossentropy(ŷ, y; agg=mean)
logitbinarycrossentropy(ŷ, y) is mathematically equivalent to Flux.binarycrossentropy(σ(ŷ), y), but it is more numerically stable.
See also: Flux.crossentropy, Flux.logitcrossentropy, Flux.binarycrossentropy
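A sketch of this equivalence on raw scores (the values are arbitrary):
using Flux

y = [1, 0, 1]
ŷ = [2.0, -1.0, 0.5]                     # raw scores, no sigmoid applied

Flux.logitbinarycrossentropy(ŷ, y)       # numerically stable
Flux.binarycrossentropy(σ.(ŷ), y)        # same value (up to ϵ), less stable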
Flux.kldivergence — Function
kldivergence(ŷ, y; dims=1, agg=mean, ϵ=eps(eltype(ŷ)))
Return the Kullback-Leibler divergence between the given arrays interpreted as probability distributions.
KL divergence is a measure of how much one probability distribution differs from another. It is always non-negative, and zero only when both distributions are equal everywhere.
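A small sketch with two discrete distributions (each should sum to 1; the values are arbitrary):
using Flux

y = [0.1, 0.4, 0.5]          # target distribution
ŷ = [0.3, 0.3, 0.4]          # predicted distribution

Flux.kldivergence(ŷ, y)      # positive: the distributions differ
Flux.kldivergence(y, y)      # ≈ 0: identical distributions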
Flux.poisson_loss — Function
poisson_loss(ŷ, y; agg=mean, ϵ=eps(eltype(ŷ)))
Return the loss derived from the likelihood of a Poisson random variable with mean ŷ taking the value y. It is calculated as
agg(ŷ .- y .* log.(ŷ .+ ϵ))
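A minimal sketch with predicted Poisson means and observed counts (the values are arbitrary):
using Flux

y = [1.0, 0.0, 3.0]          # observed counts
ŷ = [0.8, 0.2, 2.5]          # predicted means

Flux.poisson_loss(ŷ, y)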
Flux.hinge — Function
hinge(ŷ, y; agg=mean)
Return the hinge loss given the prediction ŷ and true labels y (containing 1 or -1); calculated as
agg(max.(0, 1 .- ŷ .* y))
See also: squared_hinge
Flux.squared_hinge — Function
squared_hinge(ŷ, y; agg=mean)
Return the squared hinge loss given the prediction ŷ and true labels y (containing 1 or -1); calculated as
agg(max.(0, 1 .- ŷ .* y).^2)
See also: hinge
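A small sketch comparing the two hinge variants, with labels in {-1, 1} (the values are arbitrary):
using Flux

y = [1, -1, 1]               # true labels, +1 or -1
ŷ = [0.7, 0.4, -0.2]         # raw predictions

Flux.hinge(ŷ, y)             # mean of max(0, 1 - ŷ*y)
Flux.squared_hinge(ŷ, y)     # mean of max(0, 1 - ŷ*y)^2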
Flux.dice_coeff_loss — Function
dice_coeff_loss(ŷ, y; smooth=1, dims=size(ŷ)[1:end-1], agg=mean)
Return a loss based on the Dice coefficient, as used in the V-Net architecture for image segmentation. The current implementation only works for the binary segmentation case.
The arrays ŷ and y contain the predicted and true probabilities, respectively, for the foreground to be present in a certain pixel. The loss is computed as
1 - (2*sum(ŷ .* y; dims) .+ smooth) ./ (sum(ŷ.^2 .+ y.^2; dims) .+ smooth)
and then aggregated with agg over the batch.
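A minimal sketch for a single 4-pixel image with one sample in the batch (the foreground probabilities are arbitrary):
using Flux

ŷ = reshape([0.9, 0.1, 0.8, 0.2], 4, 1)   # predicted foreground probabilities
y = reshape([1.0, 0.0, 1.0, 0.0], 4, 1)   # ground-truth mask

Flux.dice_coeff_loss(ŷ, y)   # small for good overlap, close to 1 for none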
Flux.tversky_loss — Function
tversky_loss(ŷ, y; β=0.7, α=1-β, dims=size(ŷ)[1:end-1], agg=mean)
Return the Tversky loss for binary classification. The arrays ŷ and y contain the predicted and true probabilities, respectively. This loss is used with imbalanced data to give more weight to false negatives. A larger β weighs recall higher than precision (by placing more emphasis on false negatives). Calculated as:
num = sum(y .* ŷ, dims=dims)
den = sum(@.(ŷ*y + α*ŷ*(1-y) + β*(1-ŷ)*y), dims=dims)
tversky_loss = 1 - num/den
and then aggregated with agg over the batch.
When α+β=1, it is equal to 1-F_β, where F_β is an F-score.
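A sketch with the same kind of binary-mask inputs, varying β (the values are arbitrary):
using Flux

ŷ = reshape([0.9, 0.1, 0.8, 0.2], 4, 1)   # predicted foreground probabilities
y = reshape([1.0, 0.0, 1.0, 0.0], 4, 1)   # ground-truth mask

Flux.tversky_loss(ŷ, y)            # default β = 0.7 puts more weight on false negatives
Flux.tversky_loss(ŷ, y, β=0.3)     # smaller β puts more weight on false positives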