Bootstrap

Python和MATLAB示例临床因素分析

🌵Python片段

为了演示临床因素的分析,让我们模拟一个数据集并执行一些基本的统计和机器学习分析。我们将重点关注以下步骤:

  • 模拟数据集:创建具有年龄、性别、BMI、吸烟状况和疾病结果等特征的临床数据。
  • 描述性统计:使用平均值、标准差和分布总结数据。
  • 相关性分析:识别临床因素之间的关系。
  • 预测模型:建立逻辑回归模型,根据临床特征预测疾病结果。
  • 可视化:绘制重要关系和模型性能。
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 1: Simulate the dataset
np.random.seed(42)

# Simulating clinical factors
n_samples = 500
age = np.random.normal(50, 12, n_samples).clip(18, 90)  # Age between 18 and 90
gender = np.random.choice(['Male', 'Female'], n_samples)  # Binary gender
bmi = np.random.normal(25, 5, n_samples).clip(15, 50)    # BMI between 15 and 50
smoking_status = np.random.choice(['Smoker', 'Non-Smoker'], n_samples, p=[0.3, 0.7])
disease_outcome = np.random.choice([0, 1], n_samples, p=[0.7, 0.3])  # Disease prevalence of 30%

# Combine into a DataFrame
data = pd.DataFrame({
    'Age': age,
    'Gender': gender,
    'BMI': bmi,
    'Smoking_Status': smoking_status,
    'Disease_Outcome': disease_outcome
})

# Encode categorical variables
data['Gender'] = data['Gender'].map({'Male': 1, 'Female': 0})
data['Smoking_Status'] = data['Smoking_Status'].map({'Smoker': 1, 'Non-Smoker': 0})

data.head()
Result
         Age  Gender        BMI  Smoking_Status  Disease_Outcome
0  55.960570       0  21.478282               0                0
1  48.340828       0  17.957694               0                0
2  57.772262       0  17.216854               0                0
3  68.276358       1  28.030050               0                1
4  47.190160       0  18.597853               1                1

模拟数据集包含 500 个样本,包含以下列:

  • 年龄:表示患者年龄的连续变量。
  • 性别:二元变量(1 = 男性,0 = 女性)。
  • BMI:表示体重指数的连续变量。
  • 吸烟状况:二元变量(1 = 吸烟者,0 = 非吸烟者)。
  • 疾病结果:二元结果变量(1 = 有病,0 = 无病)。

下一步:

  • 描述性统计和可视化
  • 相关性分析
  • 预测模型

让我们计算描述性统计并可视化数据分布

# Descriptive statistics
desc_stats = data.describe()

# Visualizations
plt.figure(figsize=(14, 6))
plt.subplot(1, 3, 1)
sns.histplot(data['Age'], kde=True, bins=20, color='blue')
plt.title('Age Distribution')

plt.subplot(1, 3, 2)
sns.histplot(data['BMI'], kde=True, bins=20, color='green')
plt.title('BMI Distribution')

plt.subplot(1, 3, 3)
sns.countplot(x='Smoking_Status', data=data, palette='pastel')
plt.title('Smoking Status')

plt.tight_layout()
plt.show()

desc_stats
Result
              Age    Gender         BMI  Smoking_Status  Disease_Outcome
count  500.000000  500.0000  500.000000      500.000000       500.000000
mean    50.083381    0.5200   25.540743        0.306000         0.298000
std     11.687369    0.5001    4.864940        0.461291         0.457838
min     18.000000    0.0000   15.000000        0.000000         0.000000
25%     41.596311    0.0000   22.200960        0.000000         0.000000
50%     50.153566    1.0000   25.652685        0.000000         0.000000
75%     57.641399    1.0000   28.568301        1.000000         1.000000
max     90.000000    1.0000   38.161910        1.000000         1.000000

描述性统计:

  • 年龄:平均年龄约为 50 岁,标准差约为 11.7 岁。范围为 18-90 岁。
  • BMI:平均年龄约为 25.5,与正常 BMI 范围一致,标准差约为 4.9。
  • 吸烟状况:约 30.6% 的人口是吸烟者。
  • 疾病结果:约 29.8% 患有该疾病,与模拟患病率一致。

可视化观察结果:

  • 年龄分布:以平均年龄(~50)为中心的对称分布。
  • BMI 分布:以~25 为中心的正态分布。
  • 吸烟状况:非吸烟者占主导地位(~70% 的样本)。

接下来,我们分析这些因素与疾病结果之间的相关性。

# Correlation matrix (including only numerical variables)
correlation_matrix = data.corr()

# Plotting the correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

correlation_matrix
Result
                      Age    Gender       BMI  Smoking_Status  Disease_Outcome
Age              1.000000  0.062456 -0.005557       -0.064929        -0.065476
Gender           0.062456  1.000000 -0.071286        0.029883         0.074571
BMI             -0.005557 -0.071286  1.000000       -0.034530        -0.087904
Smoking_Status  -0.064929  0.029883 -0.034530        1.000000         0.013341
Disease_Outcome -0.065476  0.074571 -0.087904        0.013341         1.00000

相关性分析:

  • 疾病结果:

    • 与性别呈弱正相关性((~0.075)),表明男性患病率可能略高。
    • 与年龄((-0.065))和 BMI((-0.088))呈非常弱的负相关性,表明没有很强的线性关系。
  • 其他特征

    • 其他特征之间的相关性非常小,表明多重共线性较低

我们将建立一个逻辑回归模型,根据临床特征预测疾病结果。这将深入了解有助于疾病预测的因素。

# Splitting the dataset into training and testing sets
X = data[['Age', 'Gender', 'BMI', 'Smoking_Status']]
y = data['Disease_Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predictions and evaluation
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

accuracy, conf_matrix, class_report

🌵MATLAB片段

% Simulate Clinical Dataset
rng(42); % For reproducibility
numPatients = 500;

% Generate clinical factors
Age = randi([30, 80], numPatients, 1); % Age in years
BMI = 18 + 15*rand(numPatients, 1); % BMI
BP = randi([90, 160], numPatients, 1); % Blood Pressure
Cholesterol = randi([150, 300], numPatients, 1); % Cholesterol levels
Smoking = randi([0, 1], numPatients, 1); % Smoking status (0=No, 1=Yes)

% Generate binary outcome: Heart Disease (1) or No Heart Disease (0)
% Using a logistic model for simulation
coeff = [0.03, 0.04, 0.05, 0.02, 0.5]; % Coefficients for logistic function
X = [Age, BMI, BP, Cholesterol, Smoking];
logit_prob = 1 ./ (1 + exp(-(X * coeff' - 10))); % Logistic probability
HeartDisease = double(rand(numPatients, 1) < logit_prob); % Generate outcome

% Combine into a table
clinicalData = table(Age, BMI, BP, Cholesterol, Smoking, HeartDisease);

% Data Preprocessing: Normalize numeric variables
numericVars = {'Age', 'BMI', 'BP', 'Cholesterol'};
for i = 1:length(numericVars)
    clinicalData.(numericVars{i}) = (clinicalData.(numericVars{i}) - ...
        mean(clinicalData.(numericVars{i}))) / std(clinicalData.(numericVars{i}));
end

% Statistical Analysis
disp('Correlation Matrix:');
corrMatrix = corr(table2array(clinicalData(:, 1:5))); % Correlation of predictors
disp(corrMatrix);

% Logistic Regression Model
X = table2array(clinicalData(:, 1:5)); % Features
y = clinicalData.HeartDisease; % Outcome
mdl = fitglm(X, y, 'Distribution', 'binomial', 'Link', 'logit');

% Display Model Coefficients
disp('Model Coefficients:');
disp(mdl.Coefficients);

% Evaluate Model: Predict on the dataset
predictedProb = predict(mdl, X);
predictedOutcome = predictedProb > 0.5;

% Confusion Matrix and Accuracy
confMat = confusionmat(y, predictedOutcome);
disp('Confusion Matrix:');
disp(confMat);

accuracy = sum(diag(confMat)) / sum(confMat(:));
disp(['Accuracy: ', num2str(accuracy)]);

% Plot Receiver Operating Characteristic (ROC) Curve
[Xroc, Yroc, T, AUC] = perfcurve(y, predictedProb, 1);
disp(['AUC: ', num2str(AUC)]);

figure;
plot(Xroc, Yroc);
xlabel('False Positive Rate');
ylabel('True Positive Rate');
title('ROC Curve');
grid on;

👉更新:亚图跨际

;