[BDA Data Analysis Modeling Class (ML 1), Session 9] Applied Linear Regression


SeonHo Yoo 2024. 6. 29. 16:34

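Using the California housing dataset, this post checks the assumptions of linear regression and then predicts house prices while removing outliers.

Checking the Linear Regression Assumptions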

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.datasets import fetch_california_housing
from scipy.stats import shapiro
from statsmodels.stats.stattools import durbin_watson

We run a linear regression on the California housing dataset.

# Load the California housing dataset
california = fetch_california_housing()
data = pd.DataFrame(california.data, columns=california.feature_names)
data['MedHouseVal'] = california.target

# Select the columns we need
x = data['MedInc']
y = data['MedHouseVal']

# Add the constant (intercept) term
x = sm.add_constant(x)

# Fit the linear regression model (x already contains the constant)
model = sm.OLS(y, x).fit()

# Residuals and fitted values for the diagnostic plots
residuals = model.resid
fitted = model.fittedvalues

Next we check the linear regression assumptions. (Each one is explained in the Session 8 post.)

# Check 1: inspect the residuals-vs-fitted scatter plot
plt.scatter(fitted, residuals) # residual plot
plt.xlabel('Fitted Values')
plt.ylabel('Residual')
plt.axhline(0, color='red', linestyle='--')
plt.show()

# Check 2: normality test (Shapiro-Wilk)
# Note: SciPy warns that the p-value may be inaccurate for N > 5000
stat, p = shapiro(residuals)
print('Shapiro-Wilk Test: Statistics=%.3f, p=%.3f' % (stat, p))

'''
Shapiro-Wilk Test: Statistics=0.922, p=0.000
'''

# Check 3: Durbin-Watson test for autocorrelation of the residuals
dw_stat = durbin_watson(residuals)
print('Durbin-Watson Statistics', dw_stat)

'''
Durbin-Watson Statistics 0.6545256909553094
'''

# Check 4: normality check with a Q-Q plot
fig = sm.qqplot(residuals, line='s')
plt.title('Q-Q plot of Residuals')
plt.show()

# Check 5: homoscedasticity check with a residual plot
sns.residplot(x=fitted, y=residuals, lowess=True, line_kws={'color':'red', 'lw':1})
plt.title('Residual plot with Lowess Line')
plt.show()
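
The Durbin-Watson statistic reported above is the sum of squared differences between consecutive residuals divided by the sum of squared residuals, so it depends on the order of the rows. The sketch below computes it by hand on synthetic residuals (an AR(1) series that I made up for illustration, not the housing data) and then again after shuffling. It may help explain why the full, unshuffled dataset gives a value well below 2 here while the shuffled training splits later in the post give values near 2, although that reading is an assumption rather than something verified on the real data.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)
n = 5000
noise = rng.normal(size=n)

# Synthetic, positively autocorrelated residuals: e_t = 0.7 * e_{t-1} + noise_t
ordered = np.empty(n)
ordered[0] = noise[0]
for t in range(1, n):
    ordered[t] = 0.7 * ordered[t - 1] + noise[t]

# Shuffling the same values destroys the serial dependence
shuffled = rng.permutation(ordered)

dw_ordered = durbin_watson_stat(ordered)
dw_shuffled = durbin_watson_stat(shuffled)
print(f"ordered:  {dw_ordered:.3f}")
print(f"shuffled: {dw_shuffled:.3f}")
```

Values well below 2 point to positive autocorrelation between neighbouring rows, and values near 2 to none.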

Removing Outliers and Predicting Prices

from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

# Load the California housing dataset
california = fetch_california_housing()
data = pd.DataFrame(california.data, columns=california.feature_names)
data['MedHouseVal'] = california.target

# Select the columns we need
x = data['MedInc'] # median income
y = data['MedHouseVal'] # median house value

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=111)

We remove outliers and compare the results.

# Isolation Forest
iso = IsolationForest(contamination = 0.3)
yhat = iso.fit_predict(np.array(X_train).reshape(-1,1))

# Keep only the rows that were not flagged as outliers (label -1)
mask = yhat != -1
X_train_filtered, y_train_filtered = X_train[mask], y_train[mask]

# Visualize the data before and after outlier removal
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
plt.scatter(X_train, y_train, c='blue', label='Original')
plt.title('Original Data')

plt.subplot(1,2,2)
plt.scatter(X_train_filtered, y_train_filtered, c='green', label='Filtered')
plt.title('Outlier Removal')
plt.show()

We train one model on the original data and one on the outlier-filtered data, then compare their test errors.

## Train on the original data
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression()
model.fit(np.array(X_train).reshape(-1,1), y_train)
y_pred = model.predict(np.array(X_test).reshape(-1,1))
mse_original = mean_squared_error(y_test, y_pred)

# Train on the outlier-filtered data (evaluated on the same test set)
model_filtered = LinearRegression()
model_filtered.fit(np.array(X_train_filtered).reshape(-1,1), y_train_filtered)
y_pred_filtered = model_filtered.predict(np.array(X_test).reshape(-1,1))
mse_filtered = mean_squared_error(y_test, y_pred_filtered)


print('Original MSE:', mse_original)
print('Filtered MSE:', mse_filtered)

'''
Original MSE: 0.7282634839897112
Filtered MSE: 0.7305928153635892
'''

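Contrary to what one might expect, the model trained on the outlier-filtered data has a slightly higher test MSE (0.7306) than the model trained on the original data (0.7283), so removing outliers did not improve accuracy in this run.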

plt.figure(figsize=(10,5))
plt.scatter(X_train, y_train, c='blue', label='Before Outlier Removal')
plt.scatter(X_train_filtered, y_train_filtered, c='red', label='After Outlier Removal')
plt.xlabel("Median Income")
plt.ylabel("Median House Value")
plt.legend()
plt.show()
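
The `contamination` argument of `IsolationForest` sets the share of training rows that get labelled as outliers, so `contamination=0.3` discards roughly 30% of the training set whether or not those rows are truly unusual. The sketch below illustrates this on synthetic, log-normally distributed values (a made-up stand-in for `MedInc`, not the real data); `random_state=0` is my addition so that the run is repeatable.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-in for a skewed income feature (assumption, not the real data)
income = rng.lognormal(mean=1.2, sigma=0.5, size=(2000, 1))

fractions = {}
for contamination in (0.05, 0.1, 0.3):
    iso = IsolationForest(contamination=contamination, random_state=0)
    labels = iso.fit_predict(income)          # -1 = outlier, 1 = inlier
    fractions[contamination] = float(np.mean(labels == -1))

for contamination, frac in fractions.items():
    print(f"contamination={contamination}: flagged {frac:.1%} of rows")
```

Because the flagged share tracks the setting so closely, a large value mostly trims ordinary points from the tails, which is one plausible reason the filtered model does not score better on the untouched test set.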

Next we use the IQR method, Isolation Forest, and Local Outlier Factor to flag outliers and visualize what each method removes.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Load the data
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=["Target"])

# Select the feature
X = X[['MedInc']]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Function that applies the IQR method; returns (kept rows, outlier rows)
def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    mask = ~((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)
    return data[mask], data[~mask]

# Detect outliers
X_train_filtered_iqr, outliers_iqr = detect_outliers_iqr(X_train)

# Isolation Forest
iso = IsolationForest(contamination=0.1)
yhat_iso = iso.fit_predict(X_train)
mask_iso = yhat_iso != -1
X_train_filtered_iso, outliers_iso = X_train[mask_iso], X_train[~mask_iso]

# Local Outlier Factor
lof = LocalOutlierFactor()
yhat_lof = lof.fit_predict(X_train)
mask_lof = yhat_lof != -1
X_train_filtered_lof, outliers_lof = X_train[mask_lof], X_train[~mask_lof]

# Visualization
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.scatter(X_train_filtered_iqr, y_train.loc[X_train_filtered_iqr.index], color='blue', label='Filtered Data')
plt.scatter(outliers_iqr, y_train.loc[outliers_iqr.index], color='red', label='Outliers')
plt.title('IQR Method')
plt.xlabel('Median Income')
plt.ylabel('Median House Value')
plt.legend()

plt.subplot(1, 3, 2)
plt.scatter(X_train_filtered_iso, y_train.loc[X_train_filtered_iso.index], color='blue', label='Filtered Data')
plt.scatter(outliers_iso, y_train.loc[outliers_iso.index], color='red', label='Outliers')
plt.title('Isolation Forest')

plt.subplot(1, 3, 3)
plt.scatter(X_train_filtered_lof, y_train.loc[X_train_filtered_lof.index], color='blue', label='Filtered Data')
plt.scatter(outliers_lof, y_train.loc[outliers_lof.index], color='red', label='Outliers')
plt.title('Local Outlier Factor')

plt.show()
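
To make the IQR rule in `detect_outliers_iqr` concrete, here is a small worked sketch on eight made-up income values (hypothetical numbers, not taken from the dataset). It applies the same quantile and mask logic and prints the resulting fences.

```python
import pandas as pd

# Hypothetical toy values; only the last one is far from the rest
incomes = pd.DataFrame({"MedInc": [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 15.0]})

q1 = incomes.quantile(0.25)
q3 = incomes.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # lower fence
upper = q3 + 1.5 * iqr   # upper fence

# Same mask as in the post: keep rows with no column outside the fences
mask = ~((incomes < lower) | (incomes > upper)).any(axis=1)
kept, outliers = incomes[mask], incomes[~mask]

print("fences:", float(lower["MedInc"]), float(upper["MedInc"]))
print("outliers:", outliers["MedInc"].tolist())
```

Unlike Isolation Forest with a fixed `contamination`, this rule removes nothing when no value falls outside the fences.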

We now evaluate the model on the data filtered by each method.

# Load and initialize the data
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=["Target"])
X = X[['MedInc']]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a linear regression and return its MSE on the test set
def evaluate_model(X_train, y_train, X_test, y_test):
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred_test = model.predict(X_test)
    mse_test = mean_squared_error(y_test, y_pred_test)
    return mse_test

# Evaluate on the original data
original_mse = evaluate_model(X_train, y_train, X_test, y_test)

# Outlier removal function
def detect_outliers(method, X, y, contamination=0.1):
    if method == 'IQR':
        Q1 = X.quantile(0.25)
        Q3 = X.quantile(0.75)
        IQR = Q3 - Q1
        mask = ~((X < (Q1 - 1.5 * IQR)) | (X > (Q3 + 1.5 * IQR))).any(axis=1)
    elif method == 'IsolationForest':
        iso = IsolationForest(contamination=contamination)
        mask = iso.fit_predict(X) != -1
    elif method == 'LOF':
        lof = LocalOutlierFactor()
        mask = lof.fit_predict(X) != -1
    return X[mask], y[mask]

# Remove outliers with each method and store the test MSE
methods = ['IQR', 'IsolationForest', 'LOF']
results = {}
for method in methods:
    X_filtered, y_filtered = detect_outliers(method, X_train, y_train)
    mse_test = evaluate_model(X_filtered, y_filtered, X_test, y_test)
    results[method] = mse_test

# Print the results
print(f"Original MSE: {original_mse:.4f}")
for method, mse in results.items():
    print(f"{method} MSE: {mse:.4f}")
    
'''
Original MSE: 0.7091
IQR MSE: 0.7133
IsolationForest MSE: 0.7130
LOF MSE: 0.7092
'''

# Visualization
plt.figure(figsize=(10, 6))
methods = ['Original'] + list(results.keys())
mse_scores = [original_mse] + list(results.values())
plt.bar(methods, mse_scores, color=['blue', 'green', 'orange', 'red'])
plt.xlabel('Method')
plt.ylabel('MSE on Test Set')
plt.title('Comparison of MSE After Outlier Removal')
plt.show()

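None of the methods beat the original data: every filtered model has a test MSE at least as high as the original (0.7091), and the IQR method has the highest MSE (0.7133). The differences are small in all cases.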

We now fit a linear regression on each outlier-filtered dataset and plot the resulting regression line.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Load and initialize the data
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="Target")
X = X[['MedInc']]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Remove outliers, fit the model, and plot the regression line
def detect_and_plot_outliers(method, X, y, ax):
    if method == 'Original':
        X_filtered = X
        y_filtered = y
    elif method == 'IQR':
        Q1 = X.quantile(0.25)
        Q3 = X.quantile(0.75)
        IQR = Q3 - Q1
        mask = ~((X < (Q1 - 1.5 * IQR)) | (X > (Q3 + 1.5 * IQR))).any(axis=1)
        X_filtered = X[mask]
        y_filtered = y[mask]
    elif method == 'IsolationForest':
        iso = IsolationForest(contamination=0.1)
        mask = iso.fit_predict(X) != -1
        X_filtered = X[mask]
        y_filtered = y[mask]
    elif method == 'LOF':
        lof = LocalOutlierFactor()
        mask = lof.fit_predict(X) != -1
        X_filtered = X[mask]
        y_filtered = y[mask]

    # Fit the linear regression model and predict
    model = LinearRegression()
    model.fit(X_filtered, y_filtered)
    y_pred = model.predict(X_filtered)

    # Visualization
    ax.scatter(X_filtered, y_filtered, color='blue', alpha=0.5, label='Filtered Data')
    ax.plot(X_filtered, y_pred, color='red', label='Regression Line - ' + method)
    ax.set_title(method)
    ax.set_xlabel('Median Income')
    ax.set_ylabel('Median House Value')
    ax.legend()

# Prepare the figure
fig, axes = plt.subplots(1, 4, figsize=(24, 6))

# Plot one panel per outlier removal method
methods = ['Original', 'IQR', 'IsolationForest', 'LOF']
for ax, method in zip(axes, methods):
    detect_and_plot_outliers(method, X_train, y_train, ax)

plt.tight_layout()
plt.show()

Cook's Distance

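Cook's distance measures how much a single observation influences the fitted regression model. It compares the fitted values of the full model with the fitted values obtained when observation $i$ is left out.

$$C_i = \frac{\sum_{j=1}^{n}(\hat{y}_j-\hat{y}_{j(i)})^2}{\hat{\sigma}^2(p+1)}, \quad i = 1,2,\cdots,n$$

Here $\hat{y}_{j(i)}$ is the fitted value for observation $j$ from the model refit without observation $i$, $\hat{\sigma}^2$ is the estimated residual variance, and $p$ is the number of independent variables.

A conventional threshold is 1, but the appropriate value depends on the data. Depending on the case, the threshold can be set to $\frac{4}{n}$ or to $\frac{4}{n-k-1}$, where $n$ is the number of observations and $k$ is the number of independent variables.

A low Cook's distance indicates that the observation has almost no effect on the model and is unlikely to be an outlier.

A high Cook's distance indicates that the observation has a large effect on the regression line. As a rule of thumb, an observation whose distance exceeds $\frac{4}{n}$ is treated as an influential outlier.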
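
The definition above can be checked numerically. The following sketch, on synthetic data with one deliberately shifted observation (all values are made up for illustration), computes Cook's distance by literally refitting the model without each observation, and compares it with the standard leverage-based closed form $C_i = \frac{e_i^2 h_{ii}}{(p+1)\hat{\sigma}^2(1-h_{ii})^2}$. The variable `n_params` is my name for $p+1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[0] += 8.0                                  # one deliberately shifted observation

X = np.column_stack([np.ones(n), x])         # design matrix with a constant column
n_params = X.shape[1]                        # p + 1 (slope plus intercept)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted
sigma2 = resid @ resid / (n - n_params)      # estimated residual variance

# Definition: refit without observation i and compare all fitted values
cooks_loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    cooks_loo[i] = np.sum((fitted - X @ beta_i) ** 2) / (sigma2 * n_params)

# Closed form using the leverages h_ii (diagonal of the hat matrix)
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
cooks_closed = resid**2 * h / (n_params * sigma2 * (1 - h) ** 2)

flagged = np.where(cooks_loo > 4 / n)[0]     # 4/n rule of thumb
print("max abs difference:", np.max(np.abs(cooks_loo - cooks_closed)))
print("flagged indices:", flagged)
```

The closed form avoids $n$ separate refits, which matters for a training set with more than 16,000 rows.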

## Load the data
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=['Target'])
X = X[['MedInc']]


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=111)

## Cook's Distance

X_train = sm.add_constant(X_train)

# Fit the linear regression model

model = sm.OLS(y_train, X_train)
results = model.fit()

# Compute Cook's distance

influence = results.get_influence()
cooks_d = influence.cooks_distance[0]

# Visualization (the use_line_collection argument was removed in recent Matplotlib releases)

plt.figure(figsize=(10,6))
plt.stem(cooks_d)
plt.show()

# Identify influential points (Cook's distance above 4/n)
influential_points = np.where(cooks_d > 4 / len(X_train))[0]
print('Influential points', influential_points)

'''
Influential points [    6    18    25    31    73    84   132   142   147   150   188   200
   233   259   267   292   296   325   348   354   357   364   384   418
   ...
 16217 16250 16262 16267 16284 16292 16307 16359 16368 16376 16378 16422
 16437 16438 16448 16457 16461 16498]
'''

Reviewing the OLS Summary

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Load and initialize the data
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=["Target"])
X = X[['MedInc']]  # use the 'MedInc' feature

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Remove outliers, fit OLS, and print the summary
def fit_ols_and_summary(method, X, y):
    if method == 'Original':
        X_filtered = X
        y_filtered = y
    elif method == 'IQR':
        Q1 = X.quantile(0.25)
        Q3 = X.quantile(0.75)
        IQR = Q3 - Q1
        mask = ~((X < (Q1 - 1.5 * IQR)) | (X > (Q3 + 1.5 * IQR))).any(axis=1)
        X_filtered = X[mask]
        y_filtered = y[mask]
    elif method == 'IsolationForest':
        iso = IsolationForest(contamination=0.1)
        mask = iso.fit_predict(X) != -1
        X_filtered = X[mask]
        y_filtered = y[mask]
    elif method == 'LOF':
        lof = LocalOutlierFactor()
        mask = lof.fit_predict(X) != -1
        X_filtered = X[mask]
        y_filtered = y[mask]

    # Fit the linear regression model and print the summary
    X_filtered = sm.add_constant(X_filtered)  # adding a constant
    model = sm.OLS(y_filtered, X_filtered)
    results = model.fit()
    print(method + " Summary")
    print(results.summary())

# Print the OLS summary for each outlier removal method
methods = ['Original', 'IQR', 'IsolationForest', 'LOF']
for method in methods:
    fit_ols_and_summary(method, X_train, y_train)

The OLS output is as follows.

Original Summary
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 Target   R-squared:                       0.477
Model:                            OLS   Adj. R-squared:                  0.477
Method:                 Least Squares   F-statistic:                 1.506e+04
Date:                Sun, 26 May 2024   Prob (F-statistic):               0.00
Time:                        19:02:31   Log-Likelihood:                -20475.
No. Observations:               16512   AIC:                         4.095e+04
Df Residuals:                   16510   BIC:                         4.097e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.4446      0.015     30.096      0.000       0.416       0.474
MedInc         0.4193      0.003    122.709      0.000       0.413       0.426
==============================================================================
Omnibus:                     3353.131   Durbin-Watson:                   1.982
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             7339.541
Skew:                           1.175   Prob(JB):                         0.00
Kurtosis:                       5.268   Cond. No.                         10.2
==============================================================================


IQR Summary
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 Target   R-squared:                       0.394
Model:                            OLS   Adj. R-squared:                  0.394
Method:                 Least Squares   F-statistic:                 1.041e+04
Date:                Sun, 26 May 2024   Prob (F-statistic):               0.00
Time:                        19:02:31   Log-Likelihood:                -19734.
No. Observations:               15983   AIC:                         3.947e+04
Df Residuals:                   15981   BIC:                         3.949e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.3449      0.017     19.832      0.000       0.311       0.379
MedInc         0.4473      0.004    102.007      0.000       0.439       0.456
==============================================================================
Omnibus:                     3520.708   Durbin-Watson:                   1.991
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             7660.197
Skew:                           1.274   Prob(JB):                         0.00
Kurtosis:                       5.237   Cond. No.                         11.1
==============================================================================


IsolationForest Summary
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 Target   R-squared:                       0.318
Model:                            OLS   Adj. R-squared:                  0.318
Method:                 Least Squares   F-statistic:                     6940.
Date:                Sun, 26 May 2024   Prob (F-statistic):               0.00
Time:                        19:02:31   Log-Likelihood:                -18301.
No. Observations:               14863   AIC:                         3.661e+04
Df Residuals:                   14861   BIC:                         3.662e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.3379      0.020     16.608      0.000       0.298       0.378
MedInc         0.4445      0.005     83.307      0.000       0.434       0.455
==============================================================================
Omnibus:                     3365.010   Durbin-Watson:                   1.990
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             7339.279
Skew:                           1.308   Prob(JB):                         0.00
Kurtosis:                       5.237   Cond. No.                         12.1
==============================================================================


LOF Summary
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 Target   R-squared:                       0.478
Model:                            OLS   Adj. R-squared:                  0.478
Method:                 Least Squares   F-statistic:                 1.493e+04
Date:                Sun, 26 May 2024   Prob (F-statistic):               0.00
Time:                        19:02:32   Log-Likelihood:                -20202.
No. Observations:               16287   AIC:                         4.041e+04
Df Residuals:                   16285   BIC:                         4.042e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.4386      0.015     29.425      0.000       0.409       0.468
MedInc         0.4212      0.003    122.209      0.000       0.414       0.428
==============================================================================
Omnibus:                     3301.919   Durbin-Watson:                   1.989
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             7244.625
Skew:                           1.172   Prob(JB):                         0.00
Kurtosis:                       5.276   Cond. No.                         10.3
==============================================================================

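Explanation of the OLS Output

Top summary information

Dep. Variable : name of the dependent variable

Model : type of model

Method : method used to fit the model

Date & Time : date and time the analysis was run

No. Observations : number of observations

Df Residuals : residual degrees of freedom (number of observations - number of independent variables - 1)

Df Model : model degrees of freedom (number of independent variables)

Covariance Type : type of covariance estimator (nonrobust in the usual case)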

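Goodness-of-fit metrics

R-squared : coefficient of determination, the share of the dependent variable's variability that the model explains. It lies between 0 and 1, and values closer to 1 mean the model explains the data better.
Adj. R-squared : adjusted coefficient of determination, which corrects a weakness of R-squared by accounting for the number of independent variables.
F-statistic : F statistic for testing the overall significance of the model; larger values give stronger evidence that the model is significant.
Prob (F-statistic) : p-value of the F-test; the model is usually considered significant when it is 0.05 or below.
Log-Likelihood : log-likelihood, a measure of how well the model fits the data; larger values mean the model explains the data better.
AIC : criterion that weighs model fit against model complexity; smaller values indicate a better model.
BIC : similar to AIC but penalizes model complexity more strictly; smaller values indicate a better model.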

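Regression coefficient table

coef : regression coefficient of each independent variable, showing the effect of that variable on the dependent variable.
std err : standard error of the coefficient, showing the uncertainty of the estimate.
t : t statistic, testing whether each coefficient differs significantly from 0.
P>|t| : p-value of the t-test; the coefficient is usually considered significant when it is 0.05 or below.
[0.025 0.975] : 95% confidence interval of the coefficient; if the interval does not contain 0, the coefficient is considered significant.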

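Diagnostic metrics

Omnibus : statistic of the Omnibus test for normality of the residuals; the smaller the value, the closer the residuals are to a normal distribution.
Prob(Omnibus) : p-value of the Omnibus test; when it is 0.05 or above, normality of the residuals is usually not rejected.
Skew : skewness of the residuals; the closer to 0, the more symmetric they are.
Kurtosis : kurtosis of the residuals; the closer to 3, the closer they are to a normal distribution.
Durbin-Watson : Durbin-Watson statistic for autocorrelation of the residuals; values close to 2 indicate no autocorrelation.
Jarque-Bera (JB) : statistic of the Jarque-Bera test for normality of the residuals; the smaller the value, the closer the residuals are to a normal distribution.
Prob(JB) : p-value of the Jarque-Bera test; when it is 0.05 or above, normality of the residuals is usually not rejected.
Cond. No. : condition number of the independent variables; high values suggest a possible multicollinearity problem.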
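
Several of these quantities are tied together by simple formulas. As a rough arithmetic check, the sketch below plugs the rounded values printed in the "Original Summary" table (R-squared 0.477, 16512 observations, log-likelihood -20475) into the textbook formulas for the F statistic, adjusted R-squared, AIC, and BIC. Because the inputs are rounded, the results only approximately reproduce the table; the variable names are mine.

```python
import math

# Rounded values read off the "Original Summary" table
n_obs = 16512
n_params = 2                 # constant + MedInc
r_squared = 0.477
log_likelihood = -20475.0

df_model = n_params - 1      # Df Model
df_resid = n_obs - n_params  # Df Residuals

# F statistic: explained variance per model df over unexplained variance per residual df
f_stat = (r_squared / df_model) / ((1 - r_squared) / df_resid)

# Adjusted R-squared penalizes additional independent variables
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Information criteria: both penalize the number of estimated parameters
aic = 2 * n_params - 2 * log_likelihood
bic = n_params * math.log(n_obs) - 2 * log_likelihood

print(f"F-statistic    ~ {f_stat:.4g}")
print(f"Adj. R-squared ~ {adj_r_squared:.3f}")
print(f"AIC            ~ {aic:.4g}")
print(f"BIC            ~ {bic:.4g}")
```

With a single independent variable and this many observations, the adjustment to R-squared is negligible, which is why the table shows the same value for both.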
