Language and perception operate over distinct representational regimes: language (especially the textual form used to train models) tends to promote more discrete, symbolic processing, whereas perception and action are more continuous, gradient, and grounded in embodied experience. This distinction shapes how humans and machines learn, generate, and evaluate information. To illustrate this account, I present three studies examining how these signal properties shape computational models and human–AI interaction. First, using recurrent neural networks, I show that sequential models encode linguistic and gestural inputs differently, reflecting how discrete and continuous signals support distinct learning dynamics. Second, I examine human evaluations of AI-generated image captions, demonstrating that cross-modal perceptual cues mitigate the linguistic heuristics that arise under text-only evaluation. Third, I introduce Visual Narrative Freedom (VNF), showing that in text-to-image systems, under-specified textual inputs permit multiple plausible visual realizations. Varying the level of textual constraint systematically modulates this underdetermination, producing predictable changes in the diversity and flexibility of generated images and, in turn, in user preferences in high-stakes civic domains. Overall, these findings show that multimodal intelligence is not merely additive but is fundamentally shaped by the structure and constraints of the representations through which information is encoded.
Speaker
Joyce Jiang