I Trained a Neural Network on XOR in Java. Here's How it Connects to Every LLM You've Ever Used.
That 4-line truth table is the same fundamental engine behind GPT, Gemini, and Claude, just scaled by a factor of billions.
A few days ago I dusted off a piece of Java code: a simple feedforward neural network trained to predict the output of an XOR gate. Nothing flashy. Two inputs, one hidden layer, one output. It is written in plain Java, with no machine learning library. The idea was to see how the math works during training.
Watching it converge after a few hundred epochs is quietly satisfying. Why? Well, why not? There ain’t nothing wrong with learning.
The accuracy after 100,000 training iterations was pretty good.
FINAL NETWORK TESTING
============================================================
Input: [0.0, 0.0] → Prediction: 0.0074, Expected: 0
Input: [0.0, 1.0] → Prediction: 0.9902, Expected: 1
Input: [1.0, 0.0] → Prediction: 0.9906, Expected: 1
Input: [1.0, 1.0] → Prediction: 0.0131, Expected: 0
Is this really any different from what’s inside a large language model? The answer is both yes and no, and the “no” part is more profound than most people realise.
The XOR problem and why it matters
XOR (exclusive OR) is the classic benchmark because it’s not linearly separable. A single neuron with a straight-line decision boundary can’t solve it. You need at least one hidden layer to create the non-linear mapping.
Solving XOR forces you to build a Multi-Layer Perceptron (MLP), the same architectural primitive at the heart of every modern AI model.
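To make the hidden-layer point concrete, here is a minimal sketch of a 2-2-1 network with hand-picked weights and a hard step activation. It is not the trained sigmoid network from this article: the class name XorByHand and the weights are illustrative choices. One hidden unit behaves like OR, the other like AND, and the output fires when OR is on and AND is off, which is something no single straight line through the input space can do.

```java
// Hand-wired XOR: illustrative weights chosen by hand, not learned ones.
final class XorByHand {
    // Step activation: fires when the weighted sum is positive.
    static int step(double z) {
        return z > 0 ? 1 : 0;
    }

    static int xor(int a, int b) {
        int or  = step(a + b - 0.5);   // hidden unit 1: a OR b
        int and = step(a + b - 1.5);   // hidden unit 2: a AND b
        return step(or - and - 0.5);   // output: OR but not AND
    }

    public static void main(String[] args) {
        int[][] inputs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        for (int[] in : inputs) {
            System.out.println(in[0] + " XOR " + in[1] + " = " + xor(in[0], in[1]));
        }
    }
}
```

Training replaces the hand-picked numbers with learned ones, and a smooth activation replaces the step so that gradients exist.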
What the Java implementation looks like
The core training loop in my implementation follows the standard forward-pass → loss → backprop cycle. Here’s the general pseudocode (biases omitted for brevity):
// Forward pass
double[] hidden = activate(multiply(weights1, input));
double output = activate(dot(weights2, hidden));
// Backprop: compute deltas and update weights.
// Assumes sigmoidDerivative(s) takes an already-activated value and returns s * (1 - s).
double outputError = target - output;
double outputDelta = outputError * sigmoidDerivative(output);
for (int i = 0; i < hidden.length; i++) {
    // Hidden delta uses weights2[i] before it is updated
    double hiddenError = outputDelta * weights2[i];
    double hiddenDelta = hiddenError * sigmoidDerivative(hidden[i]);
    for (int j = 0; j < input.length; j++) {
        weights1[i][j] += learningRate * hiddenDelta * input[j];
    }
    weights2[i] += learningRate * outputDelta * hidden[i];
}
Three ingredients: weighted sums, an activation function (sigmoid here), and gradient descent via backpropagation. That’s the whole recipe.
A minimal MLP: 2 inputs → 3 hidden neurons → 1 output
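Here is one way to turn that recipe into something you can compile and run. Treat it as a minimal sketch, not the code from my Gist: the class name XorSketch, the seed, the learning rate of 0.5, the 20,000 epochs, the uniform weight initialisation, and the bias terms are all assumptions picked for illustration.

```java
import java.util.Random;

// Minimal 2-3-1 sigmoid network trained on XOR, one example at a time.
// Hyperparameters and initialisation are illustrative assumptions.
final class XorSketch {
    static final int HIDDEN = 3;
    static final double[][] X = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
    static final double[] Y = {0, 1, 1, 0};

    final double[][] w1 = new double[HIDDEN][2]; // input -> hidden weights
    final double[] b1 = new double[HIDDEN];      // hidden biases
    final double[] w2 = new double[HIDDEN];      // hidden -> output weights
    double b2 = 0.0;                             // output bias

    XorSketch(long seed) {
        Random rng = new Random(seed);
        for (int i = 0; i < HIDDEN; i++) {
            w1[i][0] = rng.nextDouble() * 2 - 1; // uniform in [-1, 1)
            w1[i][1] = rng.nextDouble() * 2 - 1;
            w2[i] = rng.nextDouble() * 2 - 1;
        }
    }

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Hidden-layer activations for one input.
    double[] hidden(double[] x) {
        double[] h = new double[HIDDEN];
        for (int i = 0; i < HIDDEN; i++) {
            h[i] = sigmoid(w1[i][0] * x[0] + w1[i][1] * x[1] + b1[i]);
        }
        return h;
    }

    // Output activation given the hidden activations.
    double output(double[] h) {
        double z = b2;
        for (int i = 0; i < HIDDEN; i++) {
            z += w2[i] * h[i];
        }
        return sigmoid(z);
    }

    double predict(double[] x) {
        return output(hidden(x));
    }

    // One gradient-descent step on a single example (squared-error loss).
    void trainStep(double[] x, double target, double lr) {
        double[] h = hidden(x);
        double out = output(h);
        double outDelta = (target - out) * out * (1 - out);
        for (int i = 0; i < HIDDEN; i++) {
            // Read w2[i] before updating it, so the hidden delta uses the old weight.
            double hDelta = outDelta * w2[i] * h[i] * (1 - h[i]);
            w2[i] += lr * outDelta * h[i];
            w1[i][0] += lr * hDelta * x[0];
            w1[i][1] += lr * hDelta * x[1];
            b1[i] += lr * hDelta;
        }
        b2 += lr * outDelta;
    }

    void train(int epochs, double lr) {
        for (int e = 0; e < epochs; e++) {
            for (int n = 0; n < X.length; n++) {
                trainStep(X[n], Y[n], lr);
            }
        }
    }

    double meanSquaredError() {
        double sum = 0;
        for (int n = 0; n < X.length; n++) {
            double err = Y[n] - predict(X[n]);
            sum += err * err;
        }
        return sum / X.length;
    }

    public static void main(String[] args) {
        XorSketch net = new XorSketch(42L);
        System.out.printf("MSE before training: %.4f%n", net.meanSquaredError());
        net.train(20_000, 0.5);
        System.out.printf("MSE after training:  %.4f%n", net.meanSquaredError());
        for (double[] x : X) {
            System.out.printf("[%.1f, %.1f] -> %.4f%n", x[0], x[1], net.predict(x));
        }
    }
}
```

How quickly it converges, and occasionally whether it converges at all, depends on the seed and the learning rate, so treat whatever numbers it prints as illustrative.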
Now scale that up, welcome to WebScale GenAI!
Here’s where it gets interesting. Modern large language models like GPT-4, Llama 3, or Claude are built on the Transformer architecture. But inside every Transformer block there’s a component called the Feed-Forward Network (FFN), and it is, structurally, an MLP.
In a Transformer, each attention layer is followed by a two-layer MLP with a non-linear activation. This is where the model “stores” much of its knowledge and applies reasoning, not in the attention heads alone.
Well, when I say reasoning... it can’t reason. It’s just a numerical context of what reasoning is.
The attention mechanism decides what to focus on. The MLP then decides what to do with that information. They work in tandem across potentially hundreds of layers.
In a model like Llama 3 8B, the FFN alone accounts for roughly two-thirds of the total parameters. Your XOR network has ~10 weights. GPT-4 is estimated (unofficially) at ~1.8 trillion parameters. Same structure, vastly different scale.
The gap between XOR and GPT isn’t architecture; it’s data, depth, and compute. The mathematics is the same.
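The FFN share is easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes the dimensions from the published Llama 3 8B configuration (hidden width 4096, FFN inner width 14336, 32 layers, vocabulary 128256, grouped-query attention with 8 key/value heads of 128 dimensions, untied input and output embeddings). The class name FfnShare is made up. Under those assumptions the FFN comes to about 5.6 billion of roughly 8.0 billion parameters, around 70%.

```java
// Back-of-envelope parameter count for a Llama-3-8B-shaped Transformer.
// All dimensions are assumptions taken from the published model configuration.
final class FfnShare {
    static final long HIDDEN = 4096;        // model (embedding) width
    static final long INTERMEDIATE = 14336; // FFN inner width
    static final long LAYERS = 32;
    static final long VOCAB = 128256;
    static final long KV_DIM = 1024;        // 8 key/value heads x 128 dims

    // Gated FFN: gate, up, and down projection matrices, no biases.
    static long ffnParams() {
        return LAYERS * 3 * HIDDEN * INTERMEDIATE;
    }

    // Query and output projections are HIDDEN x HIDDEN;
    // key and value projections are HIDDEN x KV_DIM.
    static long attentionParams() {
        return LAYERS * (2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM);
    }

    // Input embedding plus an untied output projection.
    static long embeddingParams() {
        return 2 * VOCAB * HIDDEN;
    }

    // Two norm vectors per layer plus one final norm.
    static long normParams() {
        return LAYERS * 2 * HIDDEN + HIDDEN;
    }

    static long totalParams() {
        return ffnParams() + attentionParams() + embeddingParams() + normParams();
    }

    static double ffnShare() {
        return (double) ffnParams() / totalParams();
    }

    public static void main(String[] args) {
        System.out.println("FFN parameters:   " + ffnParams());
        System.out.println("Total parameters: " + totalParams());
        System.out.printf("FFN share:        %.1f%%%n", 100 * ffnShare());
    }
}
```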
Activation functions: from sigmoid to SwiGLU
My Java network uses a sigmoid activation: smooth, bounded between 0 and 1, and the classic choice for learning XOR.
Modern LLMs have moved on. The dominant activation in 2024–2025 FFN layers is SwiGLU, a gated variant of the Swish function. It empirically outperforms ReLU and GELU on large-scale language tasks. But the principle is identical: introduce non-linearity so the network can learn complex mappings.
Here’s the conceptual progression:
Sigmoid (XOR, 1980s) → ReLU (deep learning, 2010s) → GELU (BERT, GPT-2) → SwiGLU (LLaMA, Mistral) and its sibling GeGLU (Gemma)
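For reference, here is roughly what those functions look like in code. This is a sketch under a few assumptions: GELU uses the common tanh approximation, Swish uses beta = 1 (also known as SiLU), and the SwiGLU layer omits biases. The class and method names (Activations, matVec, swigluFfn) are made up for illustration.

```java
// Scalar versions of the activations in the progression, plus a SwiGLU feed-forward layer.
final class Activations {
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    static double relu(double x) {
        return Math.max(0.0, x);
    }

    // GELU, tanh approximation.
    static double gelu(double x) {
        double c = Math.sqrt(2.0 / Math.PI);
        return 0.5 * x * (1.0 + Math.tanh(c * (x + 0.044715 * x * x * x)));
    }

    // Swish with beta = 1, also called SiLU.
    static double swish(double x) {
        return x * sigmoid(x);
    }

    // Matrix-vector product: each row of w dotted with x.
    static double[] matVec(double[][] w, double[] x) {
        double[] out = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            for (int j = 0; j < x.length; j++) {
                out[i] += w[i][j] * x[j];
            }
        }
        return out;
    }

    // SwiGLU FFN: wDown * (swish(wGate * x) elementwise-times (wUp * x)), biases omitted.
    static double[] swigluFfn(double[][] wGate, double[][] wUp, double[][] wDown, double[] x) {
        double[] gate = matVec(wGate, x);
        double[] up = matVec(wUp, x);
        double[] h = new double[gate.length];
        for (int i = 0; i < h.length; i++) {
            h[i] = swish(gate[i]) * up[i];
        }
        return matVec(wDown, h);
    }

    public static void main(String[] args) {
        for (double x : new double[] {-2.0, 0.0, 2.0}) {
            System.out.printf("x=%5.1f sigmoid=%.4f relu=%.4f gelu=%.4f swish=%.4f%n",
                    x, sigmoid(x), relu(x), gelu(x), swish(x));
        }
    }
}
```

The gated form is why a SwiGLU FFN has three weight matrices (gate, up, down) where the classic two-layer MLP has two.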
Backpropagation hasn’t changed; the hardware has
The algorithm that trains my Java network, backpropagation with stochastic gradient descent, is the same algorithm used to train every LLM. Rumelhart, Hinton, and Williams described it in 1986. We haven’t replaced it; we’ve just parallelised it across thousands of GPUs with frameworks like PyTorch and JAX.
What has changed:
Batch size: My XOR network trains on 4 examples. LLMs train on trillions of tokens in massive parallel batches.
Optimiser: Vanilla SGD → AdamW with learning rate schedulers, gradient clipping, and warm-up phases.
Regularisation: Nothing in XOR. LLMs use weight decay, sometimes dropout, and normalisation layers (LayerNorm or RMSNorm) throughout to keep training stable.
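To show what the optimiser change amounts to, here is a minimal sketch of a single AdamW step with gradient-norm clipping. It assumes the conventional defaults (beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8) and a fixed learning rate with no schedule or warm-up. The class name AdamWSketch is hypothetical, and none of this is a framework's API. Unlike the pseudocode earlier, grad here is the gradient of the loss, so the update subtracts it.

```java
// One AdamW update over a flat parameter array, with optional gradient clipping.
// Defaults (beta1, beta2, eps) are the conventional ones and are assumptions here.
final class AdamWSketch {
    final double lr;
    final double weightDecay;
    final double beta1 = 0.9;
    final double beta2 = 0.999;
    final double eps = 1e-8;
    final double[] m; // running mean of gradients
    final double[] v; // running mean of squared gradients
    int t = 0;        // step counter for bias correction

    AdamWSketch(int size, double lr, double weightDecay) {
        this.lr = lr;
        this.weightDecay = weightDecay;
        this.m = new double[size];
        this.v = new double[size];
    }

    // Scales grad in place so that its L2 norm is at most maxNorm.
    static void clipByNorm(double[] grad, double maxNorm) {
        double sumSquares = 0;
        for (double g : grad) {
            sumSquares += g * g;
        }
        double norm = Math.sqrt(sumSquares);
        if (norm > maxNorm) {
            double scale = maxNorm / norm;
            for (int i = 0; i < grad.length; i++) {
                grad[i] *= scale;
            }
        }
    }

    // Updates w in place; grad is the gradient of the loss with respect to w.
    void step(double[] w, double[] grad) {
        t++;
        double correction1 = 1 - Math.pow(beta1, t);
        double correction2 = 1 - Math.pow(beta2, t);
        for (int i = 0; i < w.length; i++) {
            m[i] = beta1 * m[i] + (1 - beta1) * grad[i];
            v[i] = beta2 * v[i] + (1 - beta2) * grad[i] * grad[i];
            double mHat = m[i] / correction1;
            double vHat = v[i] / correction2;
            // Decoupled weight decay: applied to w directly, not folded into grad.
            w[i] -= lr * (mHat / (Math.sqrt(vHat) + eps) + weightDecay * w[i]);
        }
    }

    public static void main(String[] args) {
        AdamWSketch opt = new AdamWSketch(2, 0.01, 0.1);
        double[] w = {1.0, -1.0};
        double[] grad = {3.0, 4.0};
        clipByNorm(grad, 1.0);
        opt.step(w, grad);
        System.out.println(w[0] + ", " + w[1]);
    }
}
```

Compared with the vanilla update in the XOR loop, each weight effectively gets its own step size, scaled by the running gradient statistics.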
Why this matters if you’re building with AI
Understanding the MLP as a foundational primitive helps demystify a lot of what GenAI does and doesn’t do. When an LLM “hallucinates”, part of that is the FFN layers retrieving learned associations that don’t generalise correctly to your prompt. When you engineer prompts, you’re influencing which neurons activate as your tokens pass through the attention + FFN pipeline.
My thoughts on model hallucinations and the use of the term in LLMs are widely known on LinkedIn. They may have caused a bit of a stir.
It also reframes RAG (Retrieval-Augmented Generation). You’re essentially compensating for the fixed parametric memory of the MLP layers by injecting external context, giving the model facts its weights never learned.
It’s all numbers. That’s it.
If you’ve never trained a neural network from scratch
Do it! The language doesn’t matter; use whatever works best for you. Even a toy XOR network builds your intuition for what gradient descent, loss curves, and weight initialisation actually mean: the things that matter enormously when evaluating and fine-tuning foundation models.
The XOR gate may be a teaching toy, but the lesson is real: you’re one abstraction layer away from the engines running the most capable AI systems in the world.
That’s not a small thing, and you’ll sound really clever at meetups.
You can take a look at the Java code on my GitHub Gist.




