While watching movies, the brain integrates the visual information and the musical soundtrack into a coherent percept. This multisensory integration can elicit emotions, and the valence of the soundtrack may modulate that elicitation. Here, dynamic kissing scenes from romantic comedies were presented to 22 participants (13 females) during functional magnetic resonance imaging. The kissing scenes were accompanied by happy music, sad music or no music. Evidence from cross-modal studies motivated a predefined three-region network for multisensory integration of emotion, consisting of fusiform gyrus (FG), amygdala (AMY) and anterior superior temporal gyrus (aSTG). The interactions in this network were investigated using dynamic causal models of effective connectivity. Bilinear models revealed modulation by happy and sad music, both of which suppressed the connectivity from FG and AMY to aSTG. Non-linear dynamic causal modeling showed a suppressive gating effect of aSTG on fusiform–amygdalar connectivity. In conclusion, fusiform-to-amygdala coupling strength is modulated via feedback through aSTG, a region for multisensory integration of emotional material. This mechanism was emotion-specific and more pronounced for sad music. Therefore, soundtrack valence may modulate emotion elicitation in movies by differentially changing the preprocessed visual information that reaches the amygdala.
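
For readers less familiar with the distinction between bilinear and non-linear dynamic causal models, the generic state equations can be sketched as follows. This is the standard textbook form, not the model specification of this study. Here x is taken to be the vector of neuronal states in FG, AMY and aSTG, and u the experimental inputs (for example visual scenes, happy music, sad music). The last line is an illustrative assumption about how a gating effect of aSTG on the FG-to-AMY connection would enter the equation; the subscript and parameter names are hypothetical.

```latex
\begin{align*}
% Bilinear DCM: input u_j changes the coupling between regions through B^{(j)}
\dot{x} &= \Bigl(A + \sum_{j} u_j\, B^{(j)}\Bigr)\, x + C\, u \\[4pt]
% Non-linear DCM: activity in region k gates the coupling through D^{(k)}
\dot{x} &= \Bigl(A + \sum_{j} u_j\, B^{(j)} + \sum_{k} x_k\, D^{(k)}\Bigr)\, x + C\, u \\[4pt]
% Illustrative FG -> AMY term gated by aSTG (assumed parameter names)
\dot{x}_{\mathrm{AMY}} &= \dots + \bigl(a_{\mathrm{AMY,FG}}
   + d^{(\mathrm{aSTG})}_{\mathrm{AMY,FG}}\, x_{\mathrm{aSTG}}\bigr)\, x_{\mathrm{FG}}
\end{align*}
```

In this notation A holds the fixed connections, B^{(j)} the change in connectivity induced by input j, C the driving inputs, and D^{(k)} the gating of a connection by activity in region k. A negative gating parameter d would correspond to the suppressive effect described above: the stronger the aSTG activity, the weaker the effective FG-to-AMY coupling.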