Table of Contents
- Oversampling
- SMOTE
- Borderline-SMOTE
- Random Over-Sampling
- ADASYN
- Undersampling
- Random Under-Sampling
- Tomek Links
- Condensed Nearest Neighbour
- One Sided Selection
- Edited Nearest Neighbours
- Neighbourhood Cleaning Rule
Data Generation
from collections import Counter
from sklearn.datasets import make_classification

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=123)
print(Counter(y))  # class 0 is the minority (~10%), class 1 the majority
Oversampling
SMOTE (Synthetic Minority Over-sampling Technique) generates new synthetic samples from the existing minority class, typically by interpolating between a minority sample and one of its k-nearest minority-class neighbors.
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='auto')  # resample until the classes are balanced
X_smote, y_smote = smote.fit_resample(X, y)
print(Counter(y_smote))
While the original SMOTE generates samples at random across the whole minority class, Borderline-SMOTE oversamples only the minority samples lying near the boundary (borderline) with the other class, concentrating on the region that is hardest to classify.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=123)
print('Class distribution before resampling: %s' % Counter(y))

sm = BorderlineSMOTE(random_state=123)
X_res, y_res = sm.fit_resample(X, y)
print('Class distribution after Borderline-SMOTE: %s' % Counter(y_res))
Source: "Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning"
- Random Over-Sampling
Random Over-Sampling simply duplicates randomly chosen minority-class samples until the classes are balanced.
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=123)
X_ros, y_ros = ros.fit_resample(X, y)  # fit_resample returns exactly two arrays
print(Counter(y_ros))
- ADASYN
ADASYN (Adaptive Synthetic Sampling) is similar to SMOTE, but it generates synthetic samples adaptively: minority samples that are harder to learn (those with more majority-class points among their k nearest neighbors) receive proportionally more synthetic samples.
from imblearn.over_sampling import ADASYN

ada = ADASYN(random_state=123)
X_res, y_res = ada.fit_resample(X, y)
print(Counter(y_res))
Undersampling
- Random Under-Sampling
Random Under-Sampling randomly drops majority-class samples until the classes are balanced.
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=123)
X_rus, y_rus = rus.fit_resample(X, y)
# indices of the rows that were kept are available as rus.sample_indices_
print(Counter(y_rus))
- Tomek Links
A Tomek link is a pair of samples from different classes that are each other's nearest neighbors; removing such pairs clears out the space between the classes.
from imblearn.under_sampling import TomekLinks

tomek = TomekLinks(sampling_strategy='auto')  # remove majority-class members of Tomek links
X_tomek, y_tomek = tomek.fit_resample(X, y)
print(Counter(y_tomek))
- Condensed Nearest Neighbour
- One Sided Selection
- Edited Nearest Neighbours
- Neighbourhood Cleaning Rule
Others
- SMOTE + ENN
- SMOTE + Tomek