What Are Logits? From Log-Odds to Deep Learning
What are logits? Where did they come from? Why are they so important in the context of classification?
Introduction
If you've ever peeked under the hood of a machine learning model - especially in classification - you've probably heard the term logits. But what exactly is a logit? Why do models use them instead of raw probabilities or odds? And how do they connect to functions like sigmoid and softmax?
In this post, we’ll unpack the idea of logits from its statistical origins to its modern deep learning applications - all without assuming a PhD in math.
We will also be using the example below to illustrate these ideas and build our way up to what logits are and how they are used in neural networks.
You’re building a machine learning model for an e-commerce website and the goal is to predict whether a customer will buy a featured product after visiting the page.
The Starting Point: Probability
So, your model outputs that there is a 70% chance of the user buying that featured product. That’s a probability of p = 0.7.
However, this probability is not something a linear model outputs directly.
But why can’t a machine learning model directly predict a probability?
Most machine learning models compute a weighted combination of input features. This is called a linear output - a number that can range from negative to positive infinity.
For example, your model might compute a raw score of z = 0.85.
This score is known as the logit - it’s just a raw number that reflects the model’s internal confidence before it is converted to a probability.
The key point is:
A linear model has no built-in awareness of probabilities; we need to apply a transformation, and that’s where logits come in.
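To make that concrete, here’s a tiny sketch of the kind of linear score a model might compute for one visitor. The feature names, weights, and bias are made up purely for illustration - a real model would learn the weights from data.

```python
# A minimal sketch of the linear score a model might compute.
# The feature names, weights, and bias are invented for illustration.

features = {
    "pages_viewed": 3.0,      # hypothetical: pages the visitor opened this session
    "minutes_on_page": 1.0,   # hypothetical: time spent on the product page
    "is_returning": 1.0,      # hypothetical: 1 if a returning customer, else 0
}

weights = {
    "pages_viewed": 0.20,
    "minutes_on_page": 0.30,
    "is_returning": 0.25,
}
bias = -0.30

# Weighted sum of the features plus a bias term: this raw score is the logit.
z = sum(weights[name] * value for name, value in features.items()) + bias
print(z)  # 0.20*3.0 + 0.30*1.0 + 0.25*1.0 - 0.30 = 0.85
```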
Odds: A Ratio of Likelihoods
But before we get to that, let’s go over odds. Odds show us how likely something is to happen compared to not happening:
odds = p / (1 - p)
In our e-commerce example, a customer with a 70% chance of buying has odds of 0.7 / 0.3 ≈ 2.33, meaning they are 2.33 times more likely to buy the featured product than not.
Odds are easier to work with than raw probabilities, but they’re still not ideal for models because they are:
non-symmetric: the probabilities 0.7 and 0.3 are complements of each other, but their odds, 2.33 and 0.43, are not.
non-linear: Small changes in probability can lead to disproportionately large changes in odds near 0 or 1.
Note: Probabilities are hard to model directly with linear functions because they must stay between 0 and 1. Odds, while useful, don’t have clean linear behavior. That’s why we use logits - they are linear outputs that can be trained easily, and they map smoothly to probabilities using sigmoid or softmax.
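Here’s a quick sketch that shows both problems with our running numbers: complementary probabilities don’t give mirror-image odds, and the odds blow up as the probability creeps toward 1.

```python
def odds(p):
    """How likely an event is to happen versus not happen."""
    return p / (1 - p)

print(odds(0.7))    # ~2.33 -> buying is 2.33x more likely than not buying
print(odds(0.3))    # ~0.43 -> the complementary probability, but not a mirror-image value
print(odds(0.99))   # 99.0  -> odds explode as the probability approaches 1
print(odds(0.999))  # 999.0
```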
Logits: Log-Odds
Logits transform the probability into the log of the odds:
logit(p) = ln(p / (1 - p))
This is great because it is easy to transform the log-odds value back into a probability that we can use for classification.
In our case:
logit(0.7) = ln(0.7 / 0.3) ≈ 0.85
Instead of predicting a probability, we train the model to predict a real-valued logit. This allows the model to express its confidence freely in either the positive or negative direction. Logits are also symmetric around zero, meaning they treat probabilities below and above 0.5 in a balanced way.
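A small sketch of the logit (log-odds) function makes that symmetry easy to see - probabilities above and below 0.5 map to positive and negative values of the same size:

```python
import math

def logit(p):
    """Log-odds: the natural log of p / (1 - p)."""
    return math.log(p / (1 - p))

print(logit(0.7))   # ~0.85  -> positive: buying is more likely than not
print(logit(0.3))   # ~-0.85 -> same magnitude, opposite sign
print(logit(0.5))   # 0.0    -> complete uncertainty sits exactly at zero
```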
Linear Combinations = Log-Odds
In models like logistic regression or neural networks, the linear output z is calculated as:
z = w1x1 + w2x2 + ... + wnxn + b
where the w’s are the weights, the x’s are the input features, and b is the bias term.
This output is treated as a logit, so it is equal to:
z = logit(p) = ln(p / (1 - p))
However, we can only perform a classification if we have a probability between 0 and 1, which is where the sigmoid function comes in.
Sigmoid: Turning Logits into Probabilities
The sigmoid function is the inverse of the logit (log-odds) function, meaning it turns the logit output by the model into a probability we can classify with. The sigmoid function is shown below:
sigmoid(z) = 1 / (1 + e^(-z))
In summary, the sigmoid of logit(p) gives back the probability p, and the logit of sigmoid(z) gives back z:
sigmoid(logit(p)) = p and logit(sigmoid(z)) = z
You’re probably starting to notice that the sigmoid function has some really nice properties:
It will squash any number into the range (0, 1).
It’s smooth and differentiable at every point, so the gradient can always be calculated - which is critical for backpropagation to work effectively.
It maps z = 0 to p = 0.5, which is a super nice symmetry.
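Here’s a minimal sketch of the sigmoid and the logit undoing each other, using the 0.85 score from our running example:

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

def logit(p):
    """Log-odds: the inverse of the sigmoid."""
    return math.log(p / (1 - p))

z = 0.85                # the raw score (logit) from our running example
p = sigmoid(z)          # ~0.70 -> probability of buying
print(p)
print(logit(p))         # 0.85  -> round-trips back to the original score
print(sigmoid(0.0))     # 0.5   -> a score of zero means "no idea either way"
```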
Multiclass: Using Softmax Instead
If you’re predicting one of several classes (e.g. which of many products the user will buy), sigmoid isn’t enough. We use softmax, which generalizes sigmoid to the multiclass case. With softmax, the output z from the linear combination mentioned earlier is a vector. You’re probably wondering how that’s the case, since we’ve only seen scalar outputs from these linear combinations.
At its core, a linear combination is calculated by taking each input feature, multiplying it by a corresponding weight, and then adding a bias term.
However, when you’re dealing with multiple features, or predicting multiple outputs at once, this process is handled using matrix multiplication. In this case:
The weights are organized into a matrix, W
The inputs are packed into a vector, x
The model multiplies the weight matrix with the input vector and adds a bias term, b
The result of this calculation is the final output, our beloved z, now in vector form, where each value in the vector is the logit for one class.
Applying softmax to this logit vector gives a probability distribution over the C classes:
softmax(z)_i = e^(z_i) / (e^(z_1) + e^(z_2) + ... + e^(z_C))
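Here’s a small sketch of that whole multiclass step - the weight matrix, the input vector, and the softmax - using NumPy. The three products, their weights, and the visitor’s features are all invented for illustration.

```python
import numpy as np

def softmax(z):
    """Turn a vector of class logits into a probability distribution."""
    # Subtract the max logit before exponentiating for numerical stability.
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

# Hypothetical setup: 3 input features, 3 candidate products (classes).
# The weights, bias, and features below are invented for illustration.
W = np.array([[0.2, 0.5, 0.1],    # weights for product A
              [0.4, 0.1, 0.3],    # weights for product B
              [0.1, 0.2, 0.6]])   # weights for product C
b = np.array([0.05, -0.10, 0.00])
x = np.array([4.0, 2.5, 1.0])     # one visitor's feature vector

z = W @ x + b                     # vector of logits, one per class
probs = softmax(z)

print(z)       # [2.2  2.05  1.5]     -> raw class scores (logits)
print(probs)   # ~[0.42  0.37  0.21]  -> one probability per product, sums to 1
```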
Summary: Why Logits Work so Well
To wrap things up, let’s go over the key concepts we discussed and how they relate to each other:
Probability is what we ultimately want to know, as this tells us how likely something is to happen.
Odds represent how likely something is to happen versus not. They’re useful, but not great for modeling because they can’t go below 0 and don’t grow in a smooth, balanced way.
Logits are the model-friendly version of odds. They’re easier to work with because they can move freely in both the negative and positive directions, and they still represent how confident a model is.
Linear Output is what the model actually computes from the input. This score is what becomes the logit.
Sigmoid or Softmax functions are used at the very end to convert the model’s internal score into something humans can read: a clean, bounded probability.
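To tie it all together, here’s a tiny end-to-end sketch of the binary case, reusing the invented weights and features from earlier: the model computes a linear score (the logit), and sigmoid converts it into a probability we can threshold.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Putting it all together for the binary "will this customer buy?" case,
# using the same invented weights and features as before.
weights = [0.20, 0.30, 0.25]
bias = -0.30
x = [3.0, 1.0, 1.0]   # hypothetical visitor: pages viewed, minutes on page, returning

z = sum(w * xi for w, xi in zip(weights, x)) + bias   # linear output = logit
p = sigmoid(z)                                        # logit -> probability

print(z)   # 0.85  -> the model's raw confidence, in log-odds
print(p)   # ~0.70 -> the 70% chance of buying from our running example
print("predict: buy" if p >= 0.5 else "predict: no buy")
```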
Final Thoughts
A logit might sound abstract at first, but it really is just a way of expressing the odds of something happening that is friendly to models. It’s a powerful idea that unlocks the way modern classification models are built.
In the e-commerce example we’ve been using, the model doesn’t say “this customer has a 70% chance of buying the product” right away. Instead, it first produces a score based on the input features. This score (the logit) represents the model’s confidence in the form of log-odds. Only after that do we convert it to a probability for classification.
Because logits can take on any value, they allow models to learn more easily and generalize better. We also get the human-readable probabilities after a simple transformation.
Thanks for Reading
I hope this post helped you understand logits more deeply - not just as a technical term but as a powerful idea behind how modern machine learning models reason about uncertainty. Whether you build models yourself or just want to learn about them under the hood, I hope you came away with something useful.
Thanks for reading and until next time!