
Table of Contents

  1. Oversampling
    • SMOTE
    • Borderline-SMOTE
    • Random Over-Sampling
    • ADASYN
  2. Undersampling
    • Random Under-Sampling
    • Tomek Links
    • Condensed Nearest Neighbour
    • One Sided Selection
    • Edited Nearest Neighbours
    • Neighbourhood Cleaning Rule

Data Generation

from collections import Counter
from sklearn.datasets import make_classification

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=123)
print(Counter(y))  # e.g. Counter({1: 900, 0: 100}) -- class 0 is the minority

Oversampling

  • SMOTE

SMOTE (Synthetic Minority Oversampling Technique) creates new synthetic samples from the existing minority class, typically by interpolating between a minority sample and its k-nearest neighbors.

from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='auto', random_state=123)
X_smote, y_smote = smote.fit_resample(X, y)
print(Counter(y_smote))
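
Conceptually, each synthetic SMOTE point lies on the line segment between a minority sample and one of its k nearest minority neighbors. A minimal NumPy sketch of that interpolation step (illustrative only, not imblearn's internal implementation):

import numpy as np

rng = np.random.default_rng(123)

def smote_point(x_i, x_neighbor):
    # Synthetic sample = x_i + lambda * (neighbor - x_i), with lambda ~ U(0, 1)
    lam = rng.uniform(0, 1)
    return x_i + lam * (x_neighbor - x_i)

x_i = np.array([1.0, 2.0])
x_nn = np.array([3.0, 2.5])
print(smote_point(x_i, x_nn))  # a point on the segment from x_i to x_nn
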
  • Borderline-SMOTE

Whereas the original SMOTE generates synthetic samples from randomly chosen minority samples, Borderline-SMOTE oversamples only the minority samples that lie near the border with the other class, concentrating on the regions that are hardest to classify.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=123)

print('Original dataset shape %s' % Counter(y))

sm = BorderlineSMOTE(random_state=123)
X_res, y_res = sm.fit_resample(X, y)
print('Dataset shape after Borderline-SMOTE %s' % Counter(y_res))

Source: "Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning" (Han et al., 2005)
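
The paper describes two variants, which in imblearn map to the kind parameter of BorderlineSMOTE ('borderline-1', the default, interpolates only toward minority neighbors, while 'borderline-2' may also interpolate toward majority neighbors):

sm2 = BorderlineSMOTE(kind='borderline-2', random_state=123)
X_res2, y_res2 = sm2.fit_resample(X, y)
print(Counter(y_res2))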

  • Random Over-Sampling

Random over-sampling simply duplicates randomly chosen minority-class samples until the class distribution is balanced.

from imblearn.over_sampling import RandomOverSampler

random_oversampler = RandomOverSampler(random_state=123)
X_ros, y_ros = random_oversampler.fit_resample(X, y)  # fit_resample returns (X, y) only
print(Counter(y_ros))
  • ADASYN

ADASYN (Adaptive Synthetic Sampling) is similar to SMOTE, but it adapts the number of synthetic samples generated for each minority sample to how many of its K nearest neighbors belong to the majority class, so that harder-to-learn samples receive more synthetic neighbors.

from imblearn.over_sampling import ADASYN

ada = ADASYN(random_state=123)
X_res, y_res = ada.fit_resample(X, y)
print(Counter(y_res))

Undersampling

  • Random Under-Sampling

Random under-sampling randomly discards majority-class samples until the class distribution is balanced.

from imblearn.under_sampling import RandomUnderSampler

random_undersampler = RandomUnderSampler(random_state=123)
X_rus, y_rus = random_undersampler.fit_resample(X, y)
print(Counter(y_rus))
# Indices of the rows that were kept are stored in random_undersampler.sample_indices_
# (the return_indices constructor argument was removed from recent imbalanced-learn versions)
  • Tomek Links

    A Tomek link is a pair of samples from different classes that are each other's nearest neighbors. Removing such pairs (by default imblearn removes only the majority-class member of each link) clears the space between the classes.

from imblearn.under_sampling import TomekLinks

tomek = TomekLinks(sampling_strategy='auto')
X_tomek, y_tomek = tomek.fit_resample(X, y)

print(Counter(y_tomek))
  • Condensed Nearest Neighbour: keeps only the majority samples that a 1-nearest-neighbor classifier needs to label the data correctly
  • One Sided Selection: combines Tomek links removal with Condensed Nearest Neighbour
  • Edited Nearest Neighbours: removes samples whose class disagrees with the majority of their nearest neighbors
  • Neighbourhood Cleaning Rule: extends Edited Nearest Neighbours with additional cleaning of the majority class
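
All four samplers live in imblearn.under_sampling and share the fit_resample interface; a minimal sketch with default parameters (random_state is passed only to the samplers that accept one):

from imblearn.under_sampling import (CondensedNearestNeighbour, OneSidedSelection,
                                     EditedNearestNeighbours, NeighbourhoodCleaningRule)

for sampler in [CondensedNearestNeighbour(random_state=123),
                OneSidedSelection(random_state=123),
                EditedNearestNeighbours(),
                NeighbourhoodCleaningRule()]:
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))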

Others

  • SMOTE + ENN: oversample with SMOTE, then clean the result with Edited Nearest Neighbours
  • SMOTE + Tomek: oversample with SMOTE, then remove Tomek links
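
Both combinations are available in imblearn.combine; a minimal sketch:

from imblearn.combine import SMOTEENN, SMOTETomek

smote_enn = SMOTEENN(random_state=123)
X_se, y_se = smote_enn.fit_resample(X, y)
print('SMOTE + ENN  ', Counter(y_se))

smote_tomek = SMOTETomek(random_state=123)
X_st, y_st = smote_tomek.fit_resample(X, y)
print('SMOTE + Tomek', Counter(y_st))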
