
[ICLR 2025] SIM: Surface-based fMRI Analysis for Inter-Subject Multimodal Decoding from Movie-Watching Experiments

Paper: SIM: Surface-based fMRI Analysis for Inter-Subject Multimodal Decoding from Movie-Watching Experiments

Code: https://github.com/metrics-lab/sim

The English here is typed entirely by hand, summarizing and paraphrasing the original paper. Unavoidable spelling and grammar mistakes may slip in; if you spot any, feel free to point them out in the comments! This post reads more like personal notes, so take it with caution.

Table of Contents

1. Takeaways

2. Section-by-section close reading

2.1. Abstract

2.2. Introduction

2.3. Related works

2.4. Methods

2.4.1. Base architectures

2.4.2. Decoding network

2.5. Experimental methods

2.5.1. Dataset

2.5.2. Training

2.5.3. Inference

2.5.4. Evaluation

2.6. Results

2.7. Discussion

1. Takeaways

(1) Decent overall; the model is fairly distinctive (although spherical projection is... hard to judge, since you can project onto almost anything, and there is no absolute theory saying this particular choice is the right one)

(2) The experiments were clearly done with great care, and the overall experimental design is solid (not sure whether it is original or follows prior work); the strength lies mainly in the details rather than the broad strokes

(3) The task design is also fairly complete, covering many aspects

(4) By the end of the paper you realize the authors packed a lot of detailed information into these short ten single-column pages, with hardly any filler. From start to finish they never hype their model's performance; they prefer to simply state the facts. That is a good thing

(5) With such limited space and author-year citations, there is not much room to write. Putting most of the experimental results (tables) in the appendix rather than letting them crowd out the methodology in the main text was a wise choice by the authors; this is also not a field where a single SOTA result sweeps the top venues

2. Section-by-section close reading

2.1. Abstract

        ①Limitation of existing approaches: models are trained and tested on the same dataset

        ②Task: predict which movie clips subjects are watching from audio, video, and fMRI

2.2. Introduction

        ①Why can fMRI and cortical activity be discussed together? Could someone explain? (ds: although fMRI usually covers the whole brain, 7T cortical fMRI focuses specifically on the cortex, especially activity in the superficial cortical layers.)

        ②Target: predict cortical activity by stimulus or predict stimulus by cortical activity

        ③⭐The model can decode unseen movie clips from unseen subjects

        ④Framework of SIM:

2.3. Related works

        ①⭐This study includes more subjects, rather than repeatedly running multiple experiments on the same subject

2.4. Methods

        ①Spatial and temporal cortical fMRI signals: S(v,t), where v\in V_6 and I_6=(V_6,F_6) is the 6th-order icosphere with |V_6|=40962 vertices and |F_6|=81920 faces
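The mesh sizes above follow the standard icosphere subdivision rule (each level quadruples the face count, and Euler's formula gives the vertex count), which a quick standalone check confirms:

```python
# Sanity check of the icosphere sizes quoted above:
# |F_k| = 20 * 4**k faces and |V_k| = 10 * 4**k + 2 vertices at subdivision level k.
for k in (3, 6):
    print(f"I_{k}: |V| = {10 * 4 ** k + 2}, |F| = {20 * 4 ** k}")
# I_3: |V| = 642,   |F| = 1280
# I_6: |V| = 40962, |F| = 81920
```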

2.4.1. Base architectures

(1)SiT

        ①fMRI mapping: volumetric fMRI is projected onto the cortical surface using ribbon-constrained weighted averaging (V=59292 vertices)

        ②Sphere resampling: resample to V=40962 vertices and patch the sphere with a low-resolution icosphere (I_3 with |F_3|=1280 faces)

        ③Triangular patches: P=\{p^1,p^2,\dots,p^{N}\}, where p^i\subset V_6 and |p^i|=45 (one patch per face of I_3, so N=1280)

        ④Feature embedding: X^0=\left[X_1^0,...,X_N^0\right]\in\mathbb{R}^{N\times D}

        ⑤Positional embedding: E_{pos}=\{E_{i}\}_{i=1}^{N}

        ⑥Final sequence: \mathcal{X}^{(0)}=\left[X_{1}^{0}+E_{1},...,X_{N}^{0}+E_{N}\right]

        ⑦fMRI feature extraction: L consecutive transformer encoder blocks, each consisting of Multi-Head Self-Attention (MHSA, with H heads) and Feed-Forward Network (FFN) layers with residual connections in between, yielding \mathcal{X}_{fMRI}\in\mathbb{R}^{N\times D}
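A minimal sketch of this patch-embed-plus-transformer pipeline (④-⑦), not the authors' code: it assumes PyTorch, uses DeiT-small-like dimensions (384-d, 6 heads, 12 layers; DeiT-small is named in §2.5.2), and treats each patch as T=3 fMRI frames over 45 vertices flattened into one token. The class name SurfaceViT and the flattening convention are my own assumptions.

```python
import torch
import torch.nn as nn

N, P_VERTS, T_FRAMES, D = 1280, 45, 3, 384   # patches, vertices per patch, fMRI frames, embed dim

class SurfaceViT(nn.Module):
    def __init__(self, n_patches=N, patch_dim=P_VERTS * T_FRAMES, dim=D, depth=12, heads=6):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)                    # ④ feature embedding X^0
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))   # ⑤ positional embedding E_pos
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, activation="gelu",
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)        # ⑦ L blocks of MHSA + FFN

    def forward(self, patches):                 # patches: (B, N, P_VERTS * T_FRAMES)
        x = self.embed(patches) + self.pos      # ⑥ X^(0) = [X_i^0 + E_i]
        return self.encoder(x)                  # X_fMRI: (B, N, D)

x_fmri = SurfaceViT()(torch.randn(2, N, P_VERTS * T_FRAMES))
print(x_fmri.shape)  # torch.Size([2, 1280, 384])
```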

(2)vsMAE

        ①Original fMRI frames: T

        ②Masking ratio: \rho

        ③Encoder: \Phi_{enc}^{fMRI}, consisting of linear layers and Transformer blocks

        ④Mask replacing: replace masked tokens by random embeddings

        ⑤Reshape sequence and add positional encoding

        ⑥Decoder: \Phi_{dec}^{fMRI} reconstructs each patch back to its T\times|p^{i}| fMRI signal

        ⑦Loss: MSE
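A rough sketch of this masked-autoencoding objective (①-⑦), under stated assumptions: ④ describes random embeddings for masked tokens, but a single learnable mask token stands in here for brevity, positional encodings (⑤) are omitted, and the MSE is computed on masked patches only (the usual MAE convention; the paper may differ).

```python
import torch
import torch.nn as nn

def mae_step(tokens, encoder, decoder, mask_token, rho=0.75):
    """tokens: (B, N, P) flattened fMRI patches (T frames x |p^i| vertices each)."""
    B, N, P = tokens.shape
    n_keep = int(N * (1 - rho))                          # ② keep a (1 - rho) fraction
    idx = torch.rand(B, N).argsort(dim=1)                # random permutation per sample
    keep, masked = idx[:, :n_keep], idx[:, n_keep:]

    visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, P))
    latent = encoder(visible)                            # ③ encode visible patches only

    # ④ fill masked positions with a mask embedding, then ⑥ decode the full sequence
    full = mask_token.expand(B, N, -1).clone()
    full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, latent.size(-1)), latent)
    recon = decoder(full)                                # (B, N, P) reconstruction

    target = torch.gather(tokens, 1, masked.unsqueeze(-1).expand(-1, -1, P))
    pred = torch.gather(recon, 1, masked.unsqueeze(-1).expand(-1, -1, P))
    return nn.functional.mse_loss(pred, target)          # ⑦ MSE on the masked patches

# toy usage with linear stand-ins for the encoder/decoder, just to exercise the shapes
enc, dec = nn.Linear(135, 384), nn.Linear(384, 135)
loss = mae_step(torch.randn(2, 1280, 135), enc, dec, nn.Parameter(torch.zeros(1, 1, 384)))
print(loss.item())
```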

2.4.2. Decoding network

        ①Video and audio input: \mathcal{X}_v\in\mathbb{R}^{N_v\times D_v} and \mathcal{X}_{A}\in\mathbb{R}^{N_{A}\times D_{A}}

        ②Embedding models: VideoMAE for video and wav2vec 2.0 for audio

        ③Multimodal mappers: f^{mod}_{\theta}, built from GeLU activations, dropout, and residual connections, align the dimensions of the modalities:

y_{fMRI}=f_{\theta}^{fMRI}(\mathcal{X}_{fMRI}),\quad y_{A}=f_{\theta}^{A}(\mathcal{X}_{A}),\quad y_{V}=f_{\theta}^{V}(\mathcal{X}_{V}),\quad y_{fMRI},y_{V},y_{A}\in\mathbb{R}^{D_{GLP}}

        ④Positive triplet: fMRI, audio, and video are from the same 3s movie clip

        ⑤Negative triplet: fMRI, audio, and video come from different 3s movie clips

        ⑥Cosine similarity between samples of two different modalities a and b:

z_{a,b}(i,j)=\langle y_a^i,y_b^j\rangle

then calculate the probability by:

P_{a,b}(i,j)=\frac{\exp(z_{a,b}(i,j)/\tau)}{\sum_{k=1}^{M}\exp(z_{a,b}(i,k)/\tau)}

where \tau denotes the temperature hyperparameter

        ⑦Contrastive (cross-entropy) loss for each direction, with positives on the diagonal (a code sketch follows after this list):

L_{a\rightarrow b}=-\frac{1}{M}\sum_{i=1}^{M}\log P_{a,b}(i,i)

and since there are three modalities, the total loss averages over all six directions:

L=(L_{fMRI\to V}+L_{V\to fMRI}+L_{fMRI\to A}+L_{A\to fMRI}+L_{A\to V}+L_{V\to A})/6

        ⑧Visual reconstruction:
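The sketch promised in ⑦: a toy version of the symmetric contrastive objective (⑥-⑦), my paraphrase rather than the authors' implementation, assuming M aligned (fMRI, video, audio) clips have already been mapped into the shared D_{GLP}-dimensional space.

```python
import torch
import torch.nn.functional as F

def directional_loss(y_a, y_b, tau=0.07):
    """L_{a->b}: cross-entropy over cosine similarities; the positive pair is (i, i)."""
    z = F.normalize(y_a, dim=-1) @ F.normalize(y_b, dim=-1).T   # z_{a,b}(i,j), shape (M, M)
    targets = torch.arange(y_a.size(0))
    return F.cross_entropy(z / tau, targets)                    # -1/M * sum_i log P_{a,b}(i,i)

def sim_loss(y_fmri, y_v, y_a, tau=0.07):
    pairs = [(y_fmri, y_v), (y_v, y_fmri), (y_fmri, y_a),
             (y_a, y_fmri), (y_a, y_v), (y_v, y_a)]
    return sum(directional_loss(a, b, tau) for a, b in pairs) / 6   # average of the 6 directions

# toy usage with M = 8 clips and a 256-d shared space (tau = 0.07 is a placeholder value)
M, D = 8, 256
print(sim_loss(torch.randn(M, D), torch.randn(M, D), torch.randn(M, D)).item())
```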

2.5. Experimental methods

2.5.1. Dataset

        ①Subjects: 174 (68 male, 106 female)

        ②Recording sessions: 4 sessions of 15 minutes each

        ③Task: watching movie clips (1-4.3 minutes long), each followed by a 20s interval

        ④Participants were told to fixate on a cross on a blank screen

        ⑤Audio equipment: earbuds

        ⑥Video format: 16:9 aspect ratio at 1024×720

        ⑦fMRI parameters: TR=1s, TE=22.2ms, spatial resolution=1.6mm^3 (the remaining parameters are omitted here)

        ⑧Preprocessing: HCP minimal processing surface pipelines

2.5.2. Training

        ①Data split: 124/25/25 for train/val/test

        ②The split accounts for sex, age, and left/right brain hemispheres

        ③Movie clips: 3s (corresponding to 16 frames of movie stimuli and 3 frames from the cortical fMRI)

        ④Temporal lag between stimulus and sampled fMRI: 6s (to account for the haemodynamic delay)

        ⑤SiT backbone: DeiT-small

        ⑥Optimization: AdamW, learning rate 3e-4 with cosine decay, batch size 64

        ⑦Training: the video and audio encoders are frozen and the multimodal mappers are trained (a configuration sketch follows below)
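A minimal sketch of that optimization setup (⑥-⑦), assuming PyTorch; total_steps is a placeholder I introduce, not a value from the paper, and whether the fMRI encoder is also fine-tuned is not spelled out in these notes.

```python
import torch

def build_optimizer(mappers, video_enc, audio_enc, lr=3e-4, total_steps=10_000):
    # ⑦ freeze the pretrained video/audio encoders; only the multimodal mappers train
    for enc in (video_enc, audio_enc):
        for p in enc.parameters():
            p.requires_grad_(False)
    opt = torch.optim.AdamW(mappers.parameters(), lr=lr)        # ⑥ AdamW, lr 3e-4 (batch size 64)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps)  # cosine decay
    return opt, sched
```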

2.5.3. Inference

        ①Soft negatives: candidate clips drawn from different movies only; hard negatives: candidate clips drawn from the same movie only (harder, since clips within one movie look and sound alike); see the sketch below
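A hypothetical sketch of how such retrieval could work at inference (my reading of ①, not the paper's evaluation code): rank a candidate pool of clip embeddings against one fMRI embedding by cosine similarity, where the pool is built from other movies (soft negatives) or from the same movie (hard negatives).

```python
import torch
import torch.nn.functional as F

def retrieve(y_fmri, candidate_clips, top_k=5):
    """y_fmri: (D,), candidate_clips: (K, D); returns indices of the top-k closest clips."""
    sims = F.normalize(candidate_clips, dim=-1) @ F.normalize(y_fmri, dim=0)
    return sims.topk(top_k).indices

# toy usage: the true clip sits at index 0 inside a pool of 100 candidates
pool = torch.randn(100, 256)
print(retrieve(pool[0] + 0.1 * torch.randn(256), pool))
```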

2.5.4. Evaluation

        ①Experimental settings: panel (b) of the previous figure; experiment (1) means the test set contains new subjects, experiment (2) means the test set contains new movie clips, and experiment (3) means the test set contains both new subjects and new clips

        ②Stimulus of movie to brain:

2.6. Results

        ①Performance of experiment 1:

        ②Soft negative performance:

2.7. Discussion

        ①Movie clips can be longer

        ②Types of movie

        ③Race bias
