Python和MATLAB示例临床因素分析

🌵Python片段

为了演示临床因素的分析，让我们模拟一个数据集并执行一些基本的统计和机器学习分析。我们将重点关注以下步骤：

模拟数据集：创建具有年龄、性别、BMI、吸烟状况和疾病结果等特征的临床数据。
描述性统计：使用平均值、标准差和分布总结数据。
相关性分析：识别临床因素之间的关系。
预测模型：建立逻辑回归模型，根据临床特征预测疾病结果。
可视化：绘制重要关系和模型性能。

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 1: Simulate the dataset
np.random.seed(42)

# Simulating clinical factors
n_samples = 500
age = np.random.normal(50, 12, n_samples).clip(18, 90)  # Age between 18 and 90
gender = np.random.choice(['Male', 'Female'], n_samples)  # Binary gender
bmi = np.random.normal(25, 5, n_samples).clip(15, 50)    # BMI between 15 and 50
smoking_status = np.random.choice(['Smoker', 'Non-Smoker'], n_samples, p=[0.3, 0.7])
disease_outcome = np.random.choice([0, 1], n_samples, p=[0.7, 0.3])  # Disease prevalence of 30%

# Combine into a DataFrame
data = pd.DataFrame({
    'Age': age,
    'Gender': gender,
    'BMI': bmi,
    'Smoking_Status': smoking_status,
    'Disease_Outcome': disease_outcome
})

# Encode categorical variables
data['Gender'] = data['Gender'].map({'Male': 1, 'Female': 0})
data['Smoking_Status'] = data['Smoking_Status'].map({'Smoker': 1, 'Non-Smoker': 0})

data.head()

Result
         Age  Gender        BMI  Smoking_Status  Disease_Outcome
0  55.960570       0  21.478282               0                0
1  48.340828       0  17.957694               0                0
2  57.772262       0  17.216854               0                0
3  68.276358       1  28.030050               0                1
4  47.190160       0  18.597853               1                1

模拟数据集包含 500 个样本，包含以下列：

年龄：表示患者年龄的连续变量。
性别：二元变量（1 = 男性，0 = 女性）。
BMI：表示体重指数的连续变量。
吸烟状况：二元变量（1 = 吸烟者，0 = 非吸烟者）。
疾病结果：二元结果变量（1 = 有病，0 = 无病）。

下一步：

描述性统计和可视化
相关性分析
预测模型

让我们计算描述性统计并可视化数据分布

# Descriptive statistics
desc_stats = data.describe()

# Visualizations
plt.figure(figsize=(14, 6))
plt.subplot(1, 3, 1)
sns.histplot(data['Age'], kde=True, bins=20, color='blue')
plt.title('Age Distribution')

plt.subplot(1, 3, 2)
sns.histplot(data['BMI'], kde=True, bins=20, color='green')
plt.title('BMI Distribution')

plt.subplot(1, 3, 3)
sns.countplot(x='Smoking_Status', data=data, palette='pastel')
plt.title('Smoking Status')

plt.tight_layout()
plt.show()

desc_stats

Result
              Age    Gender         BMI  Smoking_Status  Disease_Outcome
count  500.000000  500.0000  500.000000      500.000000       500.000000
mean    50.083381    0.5200   25.540743        0.306000         0.298000
std     11.687369    0.5001    4.864940        0.461291         0.457838
min     18.000000    0.0000   15.000000        0.000000         0.000000
25%     41.596311    0.0000   22.200960        0.000000         0.000000
50%     50.153566    1.0000   25.652685        0.000000         0.000000
75%     57.641399    1.0000   28.568301        1.000000         1.000000
max     90.000000    1.0000   38.161910        1.000000         1.000000

描述性统计：

年龄：平均年龄约为 50 岁，标准差约为 11.7 岁。范围为 18-90 岁。
BMI：平均年龄约为 25.5，与正常 BMI 范围一致，标准差约为 4.9。
吸烟状况：约 30.6% 的人口是吸烟者。
疾病结果：约 29.8% 患有该疾病，与模拟患病率一致。

可视化观察结果：

年龄分布：以平均年龄（~50）为中心的对称分布。
BMI 分布：以~25 为中心的正态分布。
吸烟状况：非吸烟者占主导地位（~70% 的样本）。

接下来，我们分析这些因素与疾病结果之间的相关性。

# Correlation matrix (including only numerical variables)
correlation_matrix = data.corr()

# Plotting the correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

correlation_matrix

Result
                      Age    Gender       BMI  Smoking_Status  Disease_Outcome
Age              1.000000  0.062456 -0.005557       -0.064929        -0.065476
Gender           0.062456  1.000000 -0.071286        0.029883         0.074571
BMI             -0.005557 -0.071286  1.000000       -0.034530        -0.087904
Smoking_Status  -0.064929  0.029883 -0.034530        1.000000         0.013341
Disease_Outcome -0.065476  0.074571 -0.087904        0.013341         1.00000

🌵MATLAB片段

% Simulate Clinical Dataset
rng(42); % For reproducibility
numPatients = 500;

% Generate clinical factors
Age = randi([30, 80], numPatients, 1); % Age in years
BMI = 18 + 15*rand(numPatients, 1); % BMI
BP = randi([90, 160], numPatients, 1); % Blood Pressure
Cholesterol = randi([150, 300], numPatients, 1); % Cholesterol levels
Smoking = randi([0, 1], numPatients, 1); % Smoking status (0=No, 1=Yes)

% Generate binary outcome: Heart Disease (1) or No Heart Disease (0)
% Using a logistic model for simulation
coeff = [0.03, 0.04, 0.05, 0.02, 0.5]; % Coefficients for logistic function
X = [Age, BMI, BP, Cholesterol, Smoking];
logit_prob = 1 ./ (1 + exp(-(X * coeff' - 10))); % Logistic probability
HeartDisease = double(rand(numPatients, 1) < logit_prob); % Generate outcome

% Combine into a table
clinicalData = table(Age, BMI, BP, Cholesterol, Smoking, HeartDisease);

% Data Preprocessing: Normalize numeric variables
numericVars = {'Age', 'BMI', 'BP', 'Cholesterol'};
for i = 1:length(numericVars)
    clinicalData.(numericVars{i}) = (clinicalData.(numericVars{i}) - ...
        mean(clinicalData.(numericVars{i}))) / std(clinicalData.(numericVars{i}));
end

% Statistical Analysis
disp('Correlation Matrix:');
corrMatrix = corr(table2array(clinicalData(:, 1:5))); % Correlation of predictors
disp(corrMatrix);

% Logistic Regression Model
X = table2array(clinicalData(:, 1:5)); % Features
y = clinicalData.HeartDisease; % Outcome
mdl = fitglm(X, y, 'Distribution', 'binomial', 'Link', 'logit');

% Display Model Coefficients
disp('Model Coefficients:');
disp(mdl.Coefficients);

% Evaluate Model: Predict on the dataset
predictedProb = predict(mdl, X);
predictedOutcome = predictedProb > 0.5;

% Confusion Matrix and Accuracy
confMat = confusionmat(y, predictedOutcome);
disp('Confusion Matrix:');
disp(confMat);

accuracy = sum(diag(confMat)) / sum(confMat(:));
disp(['Accuracy: ', num2str(accuracy)]);

% Plot Receiver Operating Characteristic (ROC) Curve
[Xroc, Yroc, T, AUC] = perfcurve(y, predictedProb, 1);
disp(['AUC: ', num2str(AUC)]);

figure;
plot(Xroc, Yroc);
xlabel('False Positive Rate');
ylabel('True Positive Rate');
title('ROC Curve');
grid on;

Python和MATLAB示例临床因素分析

🌵Python片段

🌵MATLAB片段

👉更新：亚图跨际

悦读