집 >백엔드 개발 >파이썬 튜토리얼 >AdaBoost - 앙상블 방법, 분류: 지도 머신 러닝

AdaBoost - 앙상블 방법, 분류: 지도 머신 러닝

王林원래의: 2024-07-18 21:00:01825검색

부스팅

정의 및 목적

부스팅은 머신러닝에서 모델의 정확도를 높이기 위해 사용하는 앙상블 학습 기법입니다. 여러 개의 약한 분류기(무작위 추측보다 성능이 약간 더 나은 모델)를 결합하여 강력한 분류기를 만듭니다. 부스팅의 주요 목적은 약한 분류기를 데이터에 순차적으로 적용하고 이전 분류기에서 발생한 오류를 수정하여 전반적인 성능을 향상시키는 것입니다.

주요 목표:

정확도 향상: 여러 약한 분류기의 출력을 결합하여 예측 정확도를 향상합니다.
편향 및 분산 감소: 편향 및 분산 문제를 해결하여 모델을 더 효과적으로 일반화합니다.
복잡한 데이터 처리: 데이터의 복잡한 관계를 효과적으로 모델링합니다.

AdaBoost(적응형 부스팅)

정의 및 목적

Adaptive Boosting의 약자인 AdaBoost는 널리 사용되는 부스팅 알고리즘입니다. 후속 분류자가 어려운 사례에 더 집중할 수 있도록 잘못 분류된 인스턴스의 가중치를 조정합니다. AdaBoost의 주요 목적은 각 반복에서 분류하기 어려운 예제를 강조하여 약한 분류기의 성능을 향상시키는 것입니다.

주요 목표:

가중치 조정: 다음 분류자가 해당 인스턴스에 집중할 수 있도록 잘못 분류된 인스턴스의 가중치를 늘립니다.
순차 학습: 각각의 새로운 분류자가 이전 분류자의 오류를 수정하는 분류자를 순차적으로 구축합니다.
향상된 성능: 약한 분류자를 결합하여 더 나은 예측력을 갖춘 강력한 분류자를 형성합니다.

AdaBoost 작동 방식

가중치 초기화:
- 모든 훈련 인스턴스에 동일한 가중치를 할당합니다. n개의 인스턴스가 있는 데이터 세트의 경우 각 인스턴스의 가중치는 1/n입니다.
약한 분류기 훈련:
- 가중치 데이터세트를 사용하여 약한 분류기를 훈련합니다.
분류자 오류 계산:
- 잘못 분류된 인스턴스의 가중치 합인 약한 분류기의 오류를 계산합니다.
계산 분류자 가중치:
- 오류를 기준으로 분류기의 가중치를 계산합니다. 무게는 다음과 같이 주어진다: 알파 = 0.5 * log((1 - 오류) / 오류)
- 오류가 낮을수록 분류기 가중치가 높아집니다.
인스턴스 가중치 업데이트:
- 인스턴스의 가중치를 조정합니다. 잘못 분류된 인스턴스의 가중치를 높이고 올바르게 분류된 인스턴스의 가중치를 줄입니다.
- 인스턴스 i의 업데이트된 가중치는 다음과 같습니다. 가중치[i] = 가중치[i] * exp(alpha * (잘못 분류됨 ? 1 : -1))
- 가중치의 합이 1이 되도록 정규화합니다.
약한 분류자 결합:
- 최종 강력한 분류자는 약한 분류자의 가중치 합입니다. 최종 분류기 = sign(sum(alpha * Weak_classifier))
- 부호 함수는 합계를 기준으로 클래스 레이블을 결정합니다.

AdaBoost(이진 분류) 예

Adaptive Boosting의 약자인 AdaBoost는 여러 개의 약한 분류기를 결합하여 강력한 분류기를 만드는 앙상블 기술입니다. 이 예에서는 합성 데이터를 사용하여 이진 분류를 위해 AdaBoost를 구현하고, 모델 성능을 평가하고, 결정 경계를 시각화하는 방법을 보여줍니다.

Python 코드 예

1. 라이브러리 가져오기

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

이 블록은 데이터 조작, 플로팅 및 기계 학습에 필요한 라이브러리를 가져옵니다.

2. 샘플 데이터 생성

np.random.seed(42)  # For reproducibility

# Generate synthetic data for 2 classes
n_samples = 1000
n_samples_per_class = n_samples // 2

# Class 0: Centered around (-1, -1)
X0 = np.random.randn(n_samples_per_class, 2) * 0.7 + [-1, -1]

# Class 1: Centered around (1, 1)
X1 = np.random.randn(n_samples_per_class, 2) * 0.7 + [1, 1]

# Combine the data
X = np.vstack([X0, X1])
y = np.hstack([np.zeros(n_samples_per_class), np.ones(n_samples_per_class)])

# Shuffle the dataset
shuffle_idx = np.random.permutation(n_samples)
X, y = X[shuffle_idx], y[shuffle_idx]

이 블록은 두 가지 특성을 가진 합성 데이터를 생성합니다. 여기서 대상 변수 y는 클래스 중심을 기반으로 정의되어 이진 분류 시나리오를 시뮬레이션합니다.

3. 데이터세트 분할

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

이 블록은 모델 평가를 위해 데이터세트를 훈련 세트와 테스트 세트로 분할합니다.

4. AdaBoost 분류기 생성 및 학습

base_estimator = DecisionTreeClassifier(max_depth=1)  # Decision stump
model = AdaBoostClassifier(estimator=base_estimator, n_estimators=3, random_state=42)
model.fit(X_train, y_train)

이 블록은 결정 그루터기를 기본 추정기로 사용하여 AdaBoost 모델을 초기화하고 훈련 데이터 세트를 사용하여 훈련합니다.

5. 예측

y_pred = model.predict(X_test)

이 블록은 훈련된 모델을 사용하여 테스트 세트에 대해 예측합니다.

6. 모델 평가

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)

출력:

Accuracy: 0.9400

Confusion Matrix:
[[96  8]
 [ 4 92]]

Classification Report:
              precision    recall  f1-score   support

         0.0       0.96      0.92      0.94       104
         1.0       0.92      0.96      0.94        96

    accuracy                           0.94       200
   macro avg       0.94      0.94      0.94       200
weighted avg       0.94      0.94      0.94       200

이 블록은 정확도, 혼동 행렬, 분류 보고서를 계산하고 인쇄하여 모델 성능에 대한 통찰력을 제공합니다.

7. 의사결정 경계 시각화

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 8))
plt.contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
scatter = plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolor='black')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("AdaBoost Binary Classification")
plt.colorbar(scatter)
plt.show()

This block visualizes the decision boundary created by the AdaBoost model, illustrating how the model separates the two classes in the feature space.

Output:

AdaBoost Binary Classification

This structured approach demonstrates how to implement and evaluate AdaBoost for binary classification tasks, providing a clear understanding of its capabilities. The visualization of the decision boundary aids in interpreting the model's predictions.

AdaBoost (Multiclass Classification) Example

AdaBoost is an ensemble learning technique that combines multiple weak classifiers to create a strong classifier. This example demonstrates how to implement AdaBoost for multiclass classification using synthetic data, evaluate the model's performance, and visualize the decision boundary for five classes.

Python Code Example

1. Import Libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

This block imports the necessary libraries for data manipulation, plotting, and machine learning.

2. Generate Sample Data with 5 Classes

np.random.seed(42)  # For reproducibility
n_samples = 2500  # Total number of samples
n_samples_per_class = n_samples // 5  # Ensure this is exactly n_samples // 5

# Class 0: Centered around (-2, -2)
X0 = np.random.randn(n_samples_per_class, 2) * 0.5 + [-2, -2]

# Class 1: Centered around (0, -2)
X1 = np.random.randn(n_samples_per_class, 2) * 0.5 + [0, -2]

# Class 2: Centered around (2, -2)
X2 = np.random.randn(n_samples_per_class, 2) * 0.5 + [2, -2]

# Class 3: Centered around (-1, 2)
X3 = np.random.randn(n_samples_per_class, 2) * 0.5 + [-1, 2]

# Class 4: Centered around (1, 2)
X4 = np.random.randn(n_samples_per_class, 2) * 0.5 + [1, 2]

# Combine the data
X = np.vstack([X0, X1, X2, X3, X4])
y = np.hstack([np.zeros(n_samples_per_class), 
               np.ones(n_samples_per_class),
               np.full(n_samples_per_class, 2),
               np.full(n_samples_per_class, 3),
               np.full(n_samples_per_class, 4)])

# Shuffle the dataset
shuffle_idx = np.random.permutation(n_samples)
X, y = X[shuffle_idx], y[shuffle_idx]

This block generates synthetic data for five classes located in different regions of the feature space.

3. Split the Dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This block splits the dataset into training and testing sets for model evaluation.

4. Create and Train the AdaBoost Classifier

base_estimator = DecisionTreeClassifier(max_depth=1)  # Decision stump
model = AdaBoostClassifier(estimator=base_estimator, n_estimators=10, random_state=42)
model.fit(X_train, y_train)

This block initializes the AdaBoost classifier with a weak learner (decision stump) and trains it using the training dataset.

5. Make Predictions

y_pred = model.predict(X_test)

This block uses the trained model to make predictions on the test set.

6. Evaluate the Model

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)

Output:

Accuracy: 0.9540

Confusion Matrix:
[[ 97   2   0   0   0]
 [  0  92   3   0   0]
 [  0   4  92   0   0]
 [  0   0   0  86  14]
 [  0   0   0   0 110]]

Classification Report:
              precision    recall  f1-score   support

         0.0       1.00      0.98      0.99        99
         1.0       0.94      0.97      0.95        95
         2.0       0.97      0.96      0.96        96
         3.0       1.00      0.86      0.92       100
         4.0       0.89      1.00      0.94       110

    accuracy                           0.95       500
   macro avg       0.96      0.95      0.95       500
weighted avg       0.96      0.95      0.95       500

This block calculates and prints the accuracy, confusion matrix, and classification report, providing insights into the model's performance.

7. Visualize the Decision Boundary

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(12, 10))
plt.contourf(xx, yy, Z, alpha=0.4, cmap='viridis')
scatter = plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolor='black')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("AdaBoost Multiclass Classification (5 Classes)")
plt.colorbar(scatter)
plt.show()

This block visualizes the decision boundaries created by the AdaBoost classifier, illustrating how the model separates the five classes in the feature space.

Output:

AdaBoost - Ensemble Method, Classification: Supervised Machine Learning

This structured approach demonstrates how to implement and evaluate AdaBoost for multiclass classification tasks, providing a clear understanding of its capabilities and the effectiveness of visualizing decision boundaries.

위 내용은 AdaBoost - 앙상블 방법, 분류: 지도 머신 러닝의 상세 내용입니다. 자세한 내용은 PHP 중국어 웹사이트의 기타 관련 기사를 참조하세요!

Python for Error using class function this boosting

성명：

이전 기사：Polars: Python에서 대규모 데이터 분석 강화다음 기사：Polars: Python에서 대규모 데이터 분석 강화