Making 1.5 Bits Work for Large Language Models
What you’ll learn:
- What is BitNet 1.58?
- Why is [−1,0,1] significant to large language models (LLMs)?
- What do 1-bit LLMs mean for machine learning and IoT?
Large language models (LLMs) are just one type of artificial intelligence/machine learning (AI/ML), but they, along with chatbots, have changed the way people use computers. Like most artificial neural networks (ANNs), LLMs use arrays of weights in the matrix operations performed by the multiple layers of the model. The weights can be integers or floating-point numbers, and they're often stored at a lower precision than the one used for calculations, although the two can match.
The better the precision, the better the results, at least in theory. This tends to be true for almost any numerical calculation, though issues like overflow and underflow must also be considered. Other numerical representations, such as variable-length integers and ratios, are useful in certain applications.
In general, higher precision requires more storage space and takes longer to move and compute with, so there's a tradeoff between accuracy and power/performance. Optimizations in AI/ML applications like DeepSeek were accomplished by tailoring precision and accuracy to the demands of the application while staying within the limits of the hardware.
What’s driving this diversity in design is that the tradeoff between weight precision and the output of the models isn’t linear. Significantly shrinking the precision can result in minimal degradation of the model’s results overall. Optimizing the model architecture can improve overall performance even with a reduction in weight precision.
What is BitNet 1.58?
BitNet 1.58 is just one of several models designed to minimize the size of the weights used. As noted, higher-precision floating point has advantages, but at a cost in space and performance that in turn affects system power requirements. Most floating-point hardware implementations follow the IEEE 754 standard, but AI has brought emerging de facto formats like BFloat16, FP8, and FP16 into common use.
Integer weights have also shrunk, with the 8-bit signed integer, INT8 (−128 to 127), becoming the benchmark for AI/ML hardware acceleration. Of course, the smallest integer is a single bit, but that encodes only 0 or 1. Encoding −1, 0, and 1 takes about 1.58 bits per value, which is where BitNet 1.58 and other ternary models come into play.
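The 1.58 figure comes straight from information theory: a symbol with three possible states carries log2(3) bits. A one-line check in Python:

```python
import math

# Each ternary weight can take one of three values (-1, 0, 1),
# so its information content is log2(3) bits.
print(round(math.log2(3), 2))   # 1.58
```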
Keep in mind that the encoding relates to storing and moving data rather than to the calculations themselves. For example, adding a 1-bit value to a 64-bit register actually adds two 64-bit values together, where one of them is the 1-bit value widened with 63 zero bits.
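A minimal Python sketch of that point: the narrow encoding only matters for storage and data movement, while the add itself happens at full register width.

```python
# The 1-bit value is widened to register width before the arithmetic;
# the ALU never sees a "1-bit add."
MASK64 = (1 << 64) - 1                    # 64-bit register width

bit = 1                                   # the stored 1-bit value
widened = bit & MASK64                    # zero-extended: 63 zeros plus the bit
total = (0x0123_4567_89AB_CDEF + widened) & MASK64   # an ordinary 64-bit add
print(hex(total))                         # 0x123456789abcdf0
```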
Storing a bunch of these 1.58-bit values is a matter of letting several values share bits. A single [−1, 0, 1] value needs only three of the eight states a 3-bit field can encode, so the waste shrinks when values are packed together; for example, five ternary values have 3^5 = 243 combinations, which fit in the 256 states of a single byte, for an average of 1.6 bits per value. Check out the BitNet website if you need to see the details.
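Here's a sketch of one possible packing scheme, five ternary weights per byte using base-3 encoding. It's illustrative only; actual BitNet implementations may pack weights differently.

```python
def pack5(trits):
    """Pack five values from {-1, 0, 1} into one byte (base-3 encoding)."""
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)   # map -1, 0, 1 -> 0, 1, 2
    return value

def unpack5(byte):
    """Recover the five ternary values from a packed byte."""
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)    # map 0, 1, 2 -> -1, 0, 1
        byte //= 3
    return trits

packed = pack5([-1, 0, 1, 1, -1])
print(packed, unpack5(packed))        # round-trips to the original weights
```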
Why is [−1,0,1] Significant to Large Language Models?
Storage is a big factor in LLMs simply because of their size. Gigabytes of data is a killer for smaller microcontrollers, which don't have that capacity even if the model is given the majority of the hardware's storage. Reducing the size of the weights shrinks the memory footprint, which in turn reduces the power required for storage.
Of course, storage is just part of the issue. Moving that data around takes power and instruction cycles. Applications that move lots of data often benefit from compression algorithms even with the overhead of compressing and decompressing. Something similar is happening here with smaller weights, although the size reduction doesn't use compression; the encode/decode process is significantly simpler than compression and is easily implemented on most CPUs.
The big gain in efficiency and power actually comes from the calculations involved, which translate an expensive arithmetic operation into a simpler one. Consider multiplying by −1, 0, or 1: multiplying by 1 leaves the value unchanged, multiplying by −1 merely flips its sign, and multiplying by 0 means the term can be skipped entirely, so no real multiplication is needed.
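The sketch below (plain Python, not an actual BitNet kernel) shows the effect: a dot product with ternary weights reduces to additions, subtractions, and skips.

```python
def ternary_dot(weights, activations):
    """Dot product where every weight is -1, 0, or +1: no multiplications."""
    total = 0
    for w, x in zip(weights, activations):
        if w == 1:
            total += x        # multiply by +1 -> plain add
        elif w == -1:
            total -= x        # multiply by -1 -> sign change (subtract)
        # w == 0 -> skip the term entirely
    return total

print(ternary_dot([1, -1, 0, 1], [2.0, 3.5, 7.0, -1.5]))  # 2.0 - 3.5 - 1.5 = -3.0
```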
While things are a little more complex for full LLMs, the optimization idea is the same: the operations become so simple that AI hardware acceleration isn't required to make the models usable on conventional CPUs.
What Do 1-Bit LLMs Mean for Machine Learning and IoT?
The significant reduction in storage requirements, and the similar reduction in computational requirements, means that suitably compact LLMs can run on microcontrollers, including those designed for the Internet of Things (IoT). Sensor data can be processed at the edge, and this AI edge computing can enable local control or reduce the amount and frequency of communication with the cloud.
Keep in mind that LLMs are still large, so very small microcontrollers may lack the space or computational performance to handle many models. However, even these smaller platforms may be sufficient for this type of reduced LLM or even other ANN models that might have similar computational and storage reductions.
As with most software solutions, designers and programmers need to consider what fits within the limits of the available resources. Many models won't fit these constraints, but more solutions can now work on a given set of hardware, and AI hardware acceleration may make still other applications practical. The amazing thing these days is the amount of research going into improving the software enough to make it practical on existing hardware.