Fusing modalities multiplies inference cost. Processing text, images, and audio in a single request isn't a simple sum of three API calls; it's a cascade of transformer computations across specialized encoders and a unified cross-attention layer. This architectural complexity explodes the FLOPs per token.














