
[ICLR 2025] SIM: Surface-based fMRI Analysis for Inter-Subject Multimodal Decoding from Movie-Watching Experiments

Paper: SIM: Surface-based fMRI Analysis for Inter-Subject Multimodal Decoding from Movie-Watching Experiments

Code: https://github.com/metrics-lab/sim

The English here is typed entirely by hand, summarizing and paraphrasing the original paper. Unavoidable spelling and grammar mistakes may slip in; if you spot any, feel free to point them out in the comments! This post reads more like personal notes, so take it with caution.

Table of Contents

1. Takeaways

2. Section-by-section close reading

2.1. Abstract

2.2. Introduction

2.3. Related works

2.4. Methods

2.4.1. Base architectures

2.4.2. Decoding network

2.5. Experimental methods

2.5.1. Dataset

2.5.2. Training

2.5.3. Inference

2.5.4. Evaluation

2.6. Results

2.7. Discussion

1. Takeaways

(1) Decent overall; the model is fairly distinctive (although spherical projection is... hard to judge, since you can project onto almost anything, and there is no absolute theory saying this particular choice is the right one)

(2) The experiments were clearly done with great care, and the overall experimental design is solid (not sure whether it is original or follows prior work); the strength lies mainly in the details rather than the broad strokes

(3) The task design is also fairly complete, covering many aspects

(4) By the end of the paper you realize the authors packed a lot of detailed information into these short ten single-column pages, with hardly any filler. From start to finish they never hype their model's performance; they prefer to simply state the facts. That is a good thing

(5) With such limited space and author-year citations, there is not much room to write. Putting most of the experimental results (tables) in the appendix rather than letting them crowd out the methodology in the main text was a wise choice by the authors; this is also not a field where a single SOTA result sweeps the top venues

2. Section-by-section close reading

2.1. Abstract

        ①Limitation of existing approaches: models are trained and tested on the same dataset

        ②Task: predict which movie clips subjects are watching from audio, video, and fMRI

2.2. Introduction

        ①Why can fMRI and cortical activity be discussed together? Could someone explain? (ds: although fMRI usually covers the whole brain, 7T cortical fMRI focuses specifically on the cortex, especially activity in the superficial cortical layers.)

        ②Target: predict cortical activity by stimulus or predict stimulus by cortical activity

        ③⭐The model can decode unseen movie clips from unseen subjects

        ④Framework of SIM:

2.3. Related works

        ①⭐This study includes more subjects, rather than repeatedly running multiple experiments on the same subject

2.4. Methods

        ①Spatial and temporal cortical fMRI signals: S(v,t), where v\in V_6 and I_6=(V_6,F_6) is the 6th-order icosphere with |V_6|=40962 vertices and |F_6|=81920 faces
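The mesh sizes above follow the standard icosphere subdivision rule (each level quadruples the face count, and Euler's formula gives the vertex count), which a quick standalone check confirms:

```python
# Sanity check of the icosphere sizes quoted above:
# |F_k| = 20 * 4**k faces and |V_k| = 10 * 4**k + 2 vertices at subdivision level k.
for k in (3, 6):
    print(f"I_{k}: |V| = {10 * 4 ** k + 2}, |F| = {20 * 4 ** k}")
# I_3: |V| = 642,   |F| = 1280
# I_6: |V| = 40962, |F| = 81920
```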

2.4.1. Base architectures

(1)SiT

        ①fMRI mapping: volumetric fMRI is projected onto the cortical surface using ribbon-constrained weighted averaging (V=59292 vertices)

        ②Sphere resampling: resample to V=40962 vertices and patch the sphere with a low-resolution icosphere (I_3 with |F_3|=1280 faces)

        ③Triangular patches: P=\{p^1,p^2,\dots,p^{N}\}, where p^i\subset V_6 and |p^i|=45 (one patch per face of I_3, so N=1280)

        ④Feature embedding: X^0=\left[X_1^0,...,X_N^0\right]\in\mathbb{R}^{N\times D}

        ⑤Positional embedding: E_{pos}=\{E_{i}\}_{i=1}^{N}

        ⑥Final sequence: \mathcal{X}^{(0)}=\left[X_{1}^{0}+E_{1},...,X_{N}^{0}+E_{N}\right]

        ⑦fMRI feature extraction: L consecutive transformer encoder blocks, each consisting of Multi-Head Self-Attention (MHSA, with H heads) and Feed-Forward Network (FFN) layers with residual connections in between, yielding \mathcal{X}_{fMRI}\in\mathbb{R}^{N\times D}
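A minimal sketch of this patch-embed-plus-transformer pipeline (④-⑦), not the authors' code: it assumes PyTorch, uses DeiT-small-like dimensions (384-d, 6 heads, 12 layers; DeiT-small is named in §2.5.2), and treats each patch as T=3 fMRI frames over 45 vertices flattened into one token. The class name SurfaceViT and the flattening convention are my own assumptions.

```python
import torch
import torch.nn as nn

N, P_VERTS, T_FRAMES, D = 1280, 45, 3, 384   # patches, vertices per patch, fMRI frames, embed dim

class SurfaceViT(nn.Module):
    def __init__(self, n_patches=N, patch_dim=P_VERTS * T_FRAMES, dim=D, depth=12, heads=6):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)                    # ④ feature embedding X^0
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))   # ⑤ positional embedding E_pos
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, activation="gelu",
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)        # ⑦ L blocks of MHSA + FFN

    def forward(self, patches):                 # patches: (B, N, P_VERTS * T_FRAMES)
        x = self.embed(patches) + self.pos      # ⑥ X^(0) = [X_i^0 + E_i]
        return self.encoder(x)                  # X_fMRI: (B, N, D)

x_fmri = SurfaceViT()(torch.randn(2, N, P_VERTS * T_FRAMES))
print(x_fmri.shape)  # torch.Size([2, 1280, 384])
```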

(2)vsMAE

        ①Original fMRI frames: T

        ②Masking ratio: \rho

        ③Encoder: \Phi_{enc}^{fMRI}, consisting of linear layers and Transformer blocks

        ④Mask replacing: replace masked tokens by random embeddings

        ⑤Reshape sequence and add positional encoding

        ⑥Decoder: \Phi_{dec}^{fMRI} reconstructs each patch back to its T\times|p^{i}| fMRI signal

        ⑦Loss: MSE
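A rough sketch of this masked-autoencoding objective (①-⑦), under stated assumptions: ④ describes random embeddings for masked tokens, but a single learnable mask token stands in here for brevity, positional encodings (⑤) are omitted, and the MSE is computed on masked patches only (the usual MAE convention; the paper may differ).

```python
import torch
import torch.nn as nn

def mae_step(tokens, encoder, decoder, mask_token, rho=0.75):
    """tokens: (B, N, P) flattened fMRI patches (T frames x |p^i| vertices each)."""
    B, N, P = tokens.shape
    n_keep = int(N * (1 - rho))                          # ② keep a (1 - rho) fraction
    idx = torch.rand(B, N).argsort(dim=1)                # random permutation per sample
    keep, masked = idx[:, :n_keep], idx[:, n_keep:]

    visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, P))
    latent = encoder(visible)                            # ③ encode visible patches only

    # ④ fill masked positions with a mask embedding, then ⑥ decode the full sequence
    full = mask_token.expand(B, N, -1).clone()
    full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, latent.size(-1)), latent)
    recon = decoder(full)                                # (B, N, P) reconstruction

    target = torch.gather(tokens, 1, masked.unsqueeze(-1).expand(-1, -1, P))
    pred = torch.gather(recon, 1, masked.unsqueeze(-1).expand(-1, -1, P))
    return nn.functional.mse_loss(pred, target)          # ⑦ MSE on the masked patches

# toy usage with linear stand-ins for the encoder/decoder, just to exercise the shapes
enc, dec = nn.Linear(135, 384), nn.Linear(384, 135)
loss = mae_step(torch.randn(2, 1280, 135), enc, dec, nn.Parameter(torch.zeros(1, 1, 384)))
print(loss.item())
```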

2.4.2. Decoding network

        ①Video and audio input: \mathcal{X}_v\in\mathbb{R}^{N_v\times D_v} and \mathcal{X}_{A}\in\mathbb{R}^{N_{A}\times D_{A}}

        ②Embedding models: VideoMAE for video and wav2vec 2.0 for audio

        ③Multimodal mappers: f^{mod}_{\theta}, built from GeLU activations, dropout, and residual connections, align the dimensions of the modalities:

y_{fMRI}=f_{\theta}^{fMRI}(\mathcal{X}_{fMRI}),\quad y_{A}=f_{\theta}^{A}(\mathcal{X}_{A}),\quad y_{V}=f_{\theta}^{V}(\mathcal{X}_{V}),\quad y_{fMRI},y_{V},y_{A}\in\mathbb{R}^{D_{GLP}}

        ④Positive triplet: fMRI, audio, and video are from the same 3s movie clip

        ⑤Negative triplet: fMRI, audio, and video come from different 3s movie clips

        ⑥Cosine similarity between samples of two different modalities a and b:

z_{a,b}(i,j)=\langle y_a^i,y_b^j\rangle

then calculate the probability by:

P_{a,b}(i,j)=\frac{\exp(z_{a,b}(i,j)/\tau)}{\sum_{k=1}^{M}\exp(z_{a,b}(i,k)/\tau)}

where \tau denotes the temperature hyperparameter

        ⑦Contrastive (cross-entropy) loss for each direction, with positives on the diagonal (a code sketch follows after this list):

L_{a\rightarrow b}=-\frac{1}{M}\sum_{i=1}^{M}\log P_{a,b}(i,i)

and since there are three modalities, the total loss averages over all six directions:

L=(L_{fMRI\to V}+L_{V\to fMRI}+L_{fMRI\to A}+L_{A\to fMRI}+L_{A\to V}+L_{V\to A})/6

        ⑧Visual reconstruction:
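The sketch promised in ⑦: a toy version of the symmetric contrastive objective (⑥-⑦), my paraphrase rather than the authors' implementation, assuming M aligned (fMRI, video, audio) clips have already been mapped into the shared D_{GLP}-dimensional space.

```python
import torch
import torch.nn.functional as F

def directional_loss(y_a, y_b, tau=0.07):
    """L_{a->b}: cross-entropy over cosine similarities; the positive pair is (i, i)."""
    z = F.normalize(y_a, dim=-1) @ F.normalize(y_b, dim=-1).T   # z_{a,b}(i,j), shape (M, M)
    targets = torch.arange(y_a.size(0))
    return F.cross_entropy(z / tau, targets)                    # -1/M * sum_i log P_{a,b}(i,i)

def sim_loss(y_fmri, y_v, y_a, tau=0.07):
    pairs = [(y_fmri, y_v), (y_v, y_fmri), (y_fmri, y_a),
             (y_a, y_fmri), (y_a, y_v), (y_v, y_a)]
    return sum(directional_loss(a, b, tau) for a, b in pairs) / 6   # average of the 6 directions

# toy usage with M = 8 clips and a 256-d shared space (tau = 0.07 is a placeholder value)
M, D = 8, 256
print(sim_loss(torch.randn(M, D), torch.randn(M, D), torch.randn(M, D)).item())
```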

2.5. Experimental methods

2.5.1. Dataset

        ①Subjects: 174 (68 male, 106 female)

        ②Recording sessions: 4 sessions of 15 minutes each

        ③Task: watching movie clips (1-4.3 minutes long), each followed by a 20s interval

        ④Participants were told to fixate on a cross on a blank screen

        ⑤Audio equipment: earbuds

        ⑥Video format: 16:9 aspect ratio at 1024×720

        ⑦fMRI parameters: TR=1s, TE=22.2ms, spatial resolution=1.6mm^3 (the remaining parameters are omitted here)

        ⑧Preprocessing: HCP minimal processing surface pipelines

2.5.2. Training

        ①Data split: 124/25/25 for train/val/test

        ②The split accounts for sex, age, and left/right brain hemispheres

        ③Movie clips: 3s (corresponding to 16 frames of movie stimuli and 3 frames from the cortical fMRI)

        ④Temporal lag between stimulus and sampled fMRI: 6s (to account for the haemodynamic delay)

        ⑤SiT backbone: DeiT-small

        ⑥Optimization: AdamW, learning rate 3e-4 with cosine decay, batch size 64

        ⑦Training: the video and audio encoders are frozen and the multimodal mappers are trained (a configuration sketch follows below)
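A minimal sketch of that optimization setup (⑥-⑦), assuming PyTorch; total_steps is a placeholder I introduce, not a value from the paper, and whether the fMRI encoder is also fine-tuned is not spelled out in these notes.

```python
import torch

def build_optimizer(mappers, video_enc, audio_enc, lr=3e-4, total_steps=10_000):
    # ⑦ freeze the pretrained video/audio encoders; only the multimodal mappers train
    for enc in (video_enc, audio_enc):
        for p in enc.parameters():
            p.requires_grad_(False)
    opt = torch.optim.AdamW(mappers.parameters(), lr=lr)        # ⑥ AdamW, lr 3e-4 (batch size 64)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps)  # cosine decay
    return opt, sched
```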

2.5.3. Inference

        ①Soft negatives: candidate clips drawn from different movies only; hard negatives: candidate clips drawn from the same movie only (harder, since clips within one movie look and sound alike); see the sketch below
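A hypothetical sketch of how such retrieval could work at inference (my reading of ①, not the paper's evaluation code): rank a candidate pool of clip embeddings against one fMRI embedding by cosine similarity, where the pool is built from other movies (soft negatives) or from the same movie (hard negatives).

```python
import torch
import torch.nn.functional as F

def retrieve(y_fmri, candidate_clips, top_k=5):
    """y_fmri: (D,), candidate_clips: (K, D); returns indices of the top-k closest clips."""
    sims = F.normalize(candidate_clips, dim=-1) @ F.normalize(y_fmri, dim=0)
    return sims.topk(top_k).indices

# toy usage: the true clip sits at index 0 inside a pool of 100 candidates
pool = torch.randn(100, 256)
print(retrieve(pool[0] + 0.1 * torch.randn(256), pool))
```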

2.5.4. Evaluation

        ①Experimental settings: panel (b) of the previous figure; experiment (1) means the test set contains new subjects, experiment (2) means the test set contains new movie clips, and experiment (3) means the test set contains both new subjects and new clips

        ②Stimulus of movie to brain:

2.6. Results

        ①Performance of experiment 1:

        ②Soft negative performance:

2.7. Discussion

        ①Movie clips can be longer

        ②Types of movie

        ③Race bias
