
Scraping Electronic Medical Records with Python: Parsing Electronic Medical Records with a BERT Model

Original project address

Project address

This project is adapted from this GitHub project; thanks to its author.

Problem description

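We want to predict, from the clinical notes written during a patient's hospital stay, whether that patient will be readmitted within the next 30 days. Such a prediction can help physicians choose a treatment plan and assess surgical risk. In clinical practice it is common for a treatment to be routine while its prognosis is hard to control and manage. Joint replacement surgery, for example, has been highly successful as the last-resort treatment for conditions such as osteoarthritis in the elderly, but surgery-related complications and the readmissions they cause are not rare. Patient factors such as heart disease, diabetes, and obesity also raise the risk of readmission after joint replacement. As the population undergoing joint replacement grows older and less healthy, more complications occur and the risk of readmission rises.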

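Electronic medical records show that, for certain diseases and procedures, patients who are readmitted within 30 days face markedly higher risk across the board. We therefore screened and labeled the cases in which a readmission had the same cause as the previous stay and began no more than 30 days after the previous discharge (such a readmission is treated as part of the same hospitalization), and trained a model to try to address this problem.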

Data selection and cleaning

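The data come from the Medical Information Mart for Intensive Care III dataset, also known as MIMIC-III, a multi-parameter intensive-care database developed and maintained jointly by MIT, the Beth Israel Deaconess Medical Center of Harvard Medical School, and Philips Healthcare, with funding from the NIH. The dataset is free for researchers, but access must be applied for. For our experiments we loaded the data into PostgreSQL.

First, all rows are taken from the admissions table. For each row, we compute the time interval until the same subject_id next appears; if it is less than 30 days the row gets Label=1, otherwise Label=0. Next, we compute the length of each stay (discharge date minus admission date) and keep only the samples whose length of stay is greater than 2. The HADM_IDs of all samples kept are randomly assigned to training, validation, and test sets in a 0.8:0.1:0.1 ratio. Finally, the text for each set is fetched from the noteevents table (its TEXT column) using the assigned HADM_IDs. The resulting training, validation, and test sets each have three columns: TEXT (the note text), ID (the HADM_ID), and Label (0 or 1).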
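The labeling, filtering, and splitting steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the original project's code: the function names (label_readmissions, filter_by_stay, split_ids), the dictionary keys, and the sample rows are hypothetical; the interval is measured from discharge to the next admission of the same subject_id; and the length-of-stay threshold is assumed to be in days, since the text gives no unit.

```python
import random
from datetime import datetime, timedelta


def label_readmissions(admissions, window_days=30):
    """Set Label=1 when the same subject_id is admitted again within window_days."""
    by_subject = {}
    for row in admissions:
        by_subject.setdefault(row["subject_id"], []).append(row)

    labeled = []
    for rows in by_subject.values():
        rows.sort(key=lambda r: r["admittime"])
        # Pair each stay with the next stay of the same patient (None for the last one).
        for current, following in zip(rows, rows[1:] + [None]):
            label = 0
            if following is not None:
                # Assumption: the gap runs from this discharge to the next admission.
                gap = following["admittime"] - current["dischtime"]
                label = int(gap < timedelta(days=window_days))
            labeled.append({**current, "Label": label})
    return labeled


def filter_by_stay(rows, min_days=2):
    """Keep stays longer than min_days (assumed unit: days)."""
    return [r for r in rows
            if r["dischtime"] - r["admittime"] > timedelta(days=min_days)]


def split_ids(hadm_ids, seed=0):
    """Shuffle HADM_IDs and split them 0.8 : 0.1 : 0.1."""
    ids = sorted(hadm_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * 0.8)
    n_val = int(len(ids) * 0.1)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def ts(text):
    return datetime.strptime(text, "%Y-%m-%d")


# Invented rows for illustration only; they are not MIMIC-III data.
admissions = [
    {"subject_id": 1, "hadm_id": 100, "admittime": ts("2101-01-01"), "dischtime": ts("2101-01-05")},
    {"subject_id": 1, "hadm_id": 101, "admittime": ts("2101-01-20"), "dischtime": ts("2101-01-21")},
    {"subject_id": 2, "hadm_id": 200, "admittime": ts("2101-03-01"), "dischtime": ts("2101-03-10")},
]

labeled = label_readmissions(admissions)
kept = filter_by_stay(labeled)
train_ids, val_ids, test_ids = split_ids(r["hadm_id"] for r in kept)
```

In practice the same logic would more likely run as a SQL query against PostgreSQL or as pandas operations. Splitting on HADM_ID rather than on individual notes keeps every note from one stay in the same set.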

Pre-trained model

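The pre-trained model used by the original project is based on BERT. BERT is a milestone in NLP (natural language processing). On October 11, 2018, Google published the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", which achieved state-of-the-art results on 11 NLP tasks and won wide acclaim in the NLP community. BERT performs well in many areas, including text classification and text prediction.

For more on the BERT model, see the link.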

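The BERT approach has two main parts. First, unsupervised pre-training on a large unlabeled corpus learns representations of the language.

Second, the pre-trained model is fine-tuned in a supervised way on a small amount of labeled training data to handle a range of supervised tasks.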

ClinicalBERT fine-tunes the BERT model on labeled clinical notes, yielding a model that can be used to analyze medical text. For details, see the original project link.

Environment setup

    !pip install -U pytorch-pretrained-bert -i https://pypi.tuna.tsinghua.edu.cn/simple

    from IPython.core.interactiveshell import InteractiveShell

    # Show the value of every expression in a notebook cell, not just the last one
    InteractiveShell.ast_node_interactivity = 'all'

Inspecting the data

Let's look at the format of the data to be predicted.

    import pandas as pd

    sample = pd.read_csv('/home/input/MIMIC_note3519/BERT/sample.csv')

    sample

       TEXT                                               ID      Label
    0  Nursing Progress Note 1900-0700 hours:\n** Ful...  176088  1
    1  Nursing Progress Note 1900-0700 hours:\n** Ful...  135568  1
    2  NPN:\n\nNeuro: Alert and oriented X2-3, Sleepi...  188180  0
    3  RESPIRATORY CARE:\n\n35 yo m adm from osh for ...  110655  0
    4  NEURO: A+OX3 pleasant, mae, following commands...  139362  0
    5  Nursing Note\nSee Flowsheet\n\nNeuro: Propofol...  176981  0
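A quick way to confirm that a file follows this three-column layout is to parse it and count the labels. The sketch below uses only the standard library; the helper name load_notes and the inline rows are invented for illustration and are not taken from MIMIC-III.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a file such as sample.csv; the rows are invented.
# The first note contains a line break inside a quoted field, as real notes do.
csv_text = (
    "TEXT,ID,Label\n"
    '"Nursing note\nline two",1001,1\n'
    '"Respiratory care note",1002,0\n'
    '"Neuro note",1003,0\n'
)


def load_notes(handle):
    """Read rows with TEXT, ID, and Label columns, converting ID and Label to int."""
    reader = csv.DictReader(handle)
    missing = {"TEXT", "ID", "Label"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [
        {"TEXT": row["TEXT"], "ID": int(row["ID"]), "Label": int(row["Label"])}
        for row in reader
    ]


notes = load_notes(io.StringIO(csv_text, newline=""))
label_counts = Counter(note["Label"] for note in notes)
```

For a file on disk, the same helper could be given a handle opened with open(path, newline=""), which lets the csv module handle line breaks inside quoted notes.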

As you can see, the TEXT column holds several unstructured text records. Let's pull one out and see what it says.

    text = sample['TEXT'][0]
    print(text)

Nursing Progress Note 1900-0700 hours:

** Full code

** allergy: nkda

** access: #18 piv to right FA, #18 piv to right FA.

** diagnosis: angioedema

In Brief: Pt is a 51yo F with pmh significant for: COPD, HTN, diabetes insipidus, hypothyroidism, OSA (on bipap at home), restrictive lung disease, pulm artery hypertension attributed to COPD/OSA, ASD with shunt, down syndrome, CHF with LVEF >60%. Also, 45pk-yr smoker (quit in [**2112**]).

Pt brought to [**Hospital1 2**] by EMS after family found with decreased LOC. Pt presented with facial swelling and mental status changes. In [**Name (NI) **], pt with enlarged lips and with sats 99% on 2-4l. Her pupils were pinpoint so given narcan. She c/o LLQ abd pain and also developed a severe HA. ABG with profound resp acidosis 7.18/108/71. Given benadryl, nebs, solumedrol. Difficult intubation-req'd being taken to OR to have fiberoptic used. Also found to have ARF. On admit to ICU-denied pain in abdomen, denied HA. Denied any pain. Pt understands basic english but also used [**Name (NI) **] interpretor to determine these findings. Head CT on [**Name6 (MD) **] [**Name8 (MD) 20**] md as pt was able to nod yes and no and follow commands.

NEURO: pt is sedate on fent at 50mcg/hr and versed at 0.5mg/hr-able to arouse on this level of sedation. PEARL 2mm/brisk. Able to move all ext's, nod yes and no to questions. Occasional cough.

CARDIAC: sb-nsr with hr high 50's to 70's. Ace inhibitors (pt takes at home) on hold right now as unclear as to what meds or other cause of angioedema. no ectopy. SBP >100 with MAPs > 60.

RESP: nasally intubated. #6.0 tube which is sutured in place. Confirmed by xray for proper placement (5cm above carina). ** some resp events overnight: on 3 occasions thus far, pt noted to have vent alarm 'apnea' though on AC mode and then alarms 'pressure limited/not constant'. At that time-pt appears comfortably sedate (not bucking vent) but dropping TV's into 100's (from 400's), MV to 3.0 and then desats to 60's and 70's with no chest rise and fall noted. Given 100% 02 first two times with immediate elevation of o2 sat to >92%. The third time RT ambubagged to see if it
