07. Machine Learning - Classification


1. Logistic Regression

1) What this covers

  1. Logistic regression
  2. Preprocessing: scaling, PCA
  3. Decision trees
  4. Classification performance metrics
  5. Encoding

 

2) Classification

  • X: feature vector
  • y: class (nominal); in regression, y is a numeric value
  • The simplest classifier is ZeroR: always predict the majority class (see the sketch below)

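ZeroR makes a useful baseline: any real model should beat it. A minimal sketch using scikit-learn's DummyClassifier (an assumption; the lecture itself does not use this class):

from sklearn.dummy import DummyClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# ZeroR: ignore the features and always predict the majority class
zero_r = DummyClassifier(strategy='most_frequent')
zero_r.fit(X_train, y_train)
print(zero_r.score(X_test, y_test))  # accuracy of always guessing the majority class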
 

2. Logistic Regression in Practice

1) Logistic regression

  • Binary classification uses the sigmoid function: as X varies, y crosses a threshold and is mapped to 0 or 1 (2 classes)

  • With more than two classes, softmax is used: the outputs sum to 1 and act as per-class probabilities; the class with the highest value is the prediction (see the sketch below)

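A minimal numpy sketch of the two functions (the function names here are illustrative, not from the lecture):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # squashes any real z into (0, 1)

def softmax(z):
    e = np.exp(z - np.max(z))     # subtract the max for numerical stability
    return e / e.sum()            # outputs are positive and sum to 1

print(sigmoid(np.array([-2.0, 0.0, 2.0])))   # ~[0.12 0.5 0.88]
print(softmax(np.array([2.0, 1.0, 0.1])))    # ~[0.66 0.24 0.10], sums to 1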
1) Acquiring the data

import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer
b_cancer = load_breast_cancer()

2) Exploring the data and creating a DataFrame

print(b_cancer.DESCR)
b_cancer_df = pd.DataFrame(b_cancer.data, columns=b_cancer.feature_names)  # build a DataFrame

3) Counting class frequencies

b_cancer_df['diagnosis'] = b_cancer.target
b_cancer_df['diagnosis'].value_counts()

4) Visualization

import matplotlib.pyplot as plt
import seaborn as sns

sns.kdeplot(b_cancer_df[b_cancer_df['diagnosis'] == 0]['mean radius'], label='malignant (0)')
sns.kdeplot(b_cancer_df[b_cancer_df['diagnosis'] == 1]['mean radius'], label='benign (1)')
plt.legend()
plt.show()

2-1) Preprocessing: scaling

  • Scaling brings every feature into a comparable range of values
  • Standard scaler: z = (x - u) / s, where u is the mean and s the standard deviation; values are centered on 0
  • Min-max scaler: z = (x - min) / (max - min); values fall in [0, 1]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

b_cancer_scaled = scaler.fit_transform(b_cancer.data)  # fit the scaler to the data, then transform it
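For comparison, a minimal sketch of the min-max variant (not used in the rest of this example):

from sklearn.preprocessing import MinMaxScaler

mm_scaler = MinMaxScaler()
b_cancer_minmax = mm_scaler.fit_transform(b_cancer.data)
print(b_cancer_minmax.min(axis=0)[:3], b_cancer_minmax.max(axis=0)[:3])  # each feature now spans [0, 1]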
  • Checking the scaling result
import matplotlib.pyplot as plt
import seaborn as sns

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(6, 5))

ax1.set_title('Before Scaling')
sns.kdeplot(b_cancer.data[:,0], ax=ax1)  # mean radius
sns.kdeplot(b_cancer.data[:,2], ax=ax1)  # mean perimeter
ax2.set_title('After Standard Scaler')
sns.kdeplot(b_cancer_scaled[:,0], ax=ax2)
sns.kdeplot(b_cancer_scaled[:,2], ax=ax2)
plt.show()

  • Training the model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# set up X and Y
Y = b_cancer_df['diagnosis']
X = b_cancer_scaled 

Y.value_counts()

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
lr_b_cancer = LogisticRegression()
lr_b_cancer.fit(X_train, Y_train)
  • Performance evaluation
# Logistic regression: (3) predict on the test data -> obtain Y_predict
Y_predict = lr_b_cancer.predict(X_test)

from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Confusion matrix: the \ diagonal holds correct predictions (60, 107); the / off-diagonal holds errors (3, 1)
print(confusion_matrix(Y_test, Y_predict))  # accuracy is very high here
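To make the four cells explicit, the matrix can be unpacked (a small sketch; ravel() flattens it row by row):

tn, fp, fn, tp = confusion_matrix(Y_test, Y_predict).ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)  # rows are true labels, columns are predictions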
  • F1-score combines recall and precision into a single metric
accuracy = accuracy_score(Y_test, Y_predict)
precision = precision_score(Y_test, Y_predict)
recall = recall_score(Y_test, Y_predict)
f1 = f1_score(Y_test, Y_predict)
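As a sanity check, F1 can be computed by hand from its definition, F1 = 2PR / (P + R):

f1_manual = 2 * precision * recall / (precision + recall)
print(f1, f1_manual)  # the two values should match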
  • Comparing performance against unscaled data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

b_cancer = load_breast_cancer()
b_cancer_df= pd.DataFrame(b_cancer.data, columns = b_cancer.feature_names)
b_cancer_df['diagnosis']= b_cancer.target

# Unscaled data
Y_raw = b_cancer_df['diagnosis']
X_raw = b_cancer.data
model = LogisticRegression(max_iter=5000)
X_trainraw, X_testraw, Y_train, Y_test = train_test_split(X_raw, Y_raw, test_size=0.3, random_state=0)
model.fit(X_trainraw, Y_train)
Y_predictraw = model.predict(X_testraw)
accuracy_raw = accuracy_score(Y_test, Y_predictraw)
precision_raw = precision_score(Y_test, Y_predictraw)
recall_raw = recall_score(Y_test, Y_predictraw)
f1_raw = f1_score(Y_test, Y_predictraw)
print('raw accuracy: {0:.3f}, precision: {1:.3f}, recall: {2:.3f}, F1: {3:.3f}'.format(accuracy_raw, precision_raw, recall_raw, f1_raw))

# Scaled data
b_cancer = load_breast_cancer()
b_cancer_df = pd.DataFrame(b_cancer.data, columns=b_cancer.feature_names)
b_cancer_df['diagnosis'] = b_cancer.target
scaler = StandardScaler()
b_cancer_scaled = scaler.fit_transform(b_cancer.data)
Y = b_cancer_df['diagnosis']
X = b_cancer_scaled
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
lr_b_cancer = LogisticRegression()
lr_b_cancer.fit(X_train, Y_train)
Y_predict = lr_b_cancer.predict(X_test)
accuracy = accuracy_score(Y_test, Y_predict)
precision = precision_score(Y_test, Y_predict)
recall = recall_score(Y_test, Y_predict)
f1 = f1_score(Y_test, Y_predict)
print('accuracy: {0:.3f}, precision: {1:.3f}, recall: {2:.3f}, F1: {3:.3f}'.format(accuracy, precision, recall, f1))

2-2) Preprocessing: PCA

  • A dimensionality-reduction algorithm
  • Benefits of reducing dimensionality:
    • Faster processing
    • Easier visualization
    • Noise removal
  • Example: 2D -> 1D
    • 30-dimensional data cannot be visualized directly
    • At the extreme, reducing 30 dimensions to 2 makes visualization possible (see the sketch below)
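A minimal sketch of that 30-to-2 reduction, reusing b_cancer_scaled and b_cancer_df from above:

from sklearn import decomposition
import matplotlib.pyplot as plt

pca2 = decomposition.PCA(n_components=2)
b_cancer_2d = pca2.fit_transform(b_cancer_scaled)

# each point is a patient, colored by diagnosis; the two classes separate visibly
plt.scatter(b_cancer_2d[:, 0], b_cancer_2d[:, 1], c=b_cancer_df['diagnosis'], alpha=0.5)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()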
from sklearn import decomposition
pca = decomposition.PCA(n_components=8)
pca.fit(b_cancer_scaled)
b_cancer_pca = pca.transform(b_cancer_scaled)
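# How much of the total variance do the 8 components keep?
print(pca.explained_variance_ratio_)        # per-component share
print(pca.explained_variance_ratio_.sum())  # cumulative share retained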
# set up X and Y
Y = b_cancer_df['diagnosis']
X = b_cancer_pca
# split into training and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

# Logistic regression: (1) create the model
lr_b_cancer = LogisticRegression()

# Logistic regression: (2) train the model
lr_b_cancer.fit(X_train, Y_train)

# Logistic regression: (3) predict on the test data -> obtain Y_predict
Y_predict = lr_b_cancer.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

accuracy = accuracy_score(Y_test, Y_predict)
precision = precision_score(Y_test, Y_predict)
recall = recall_score(Y_test, Y_predict)
f1 = f1_score(Y_test, Y_predict)
roc_auc = roc_auc_score(Y_test, Y_predict)

print('accuracy: {0:.3f}, precision: {1:.3f}, recall: {2:.3f}, F1: {3:.3f}'.format(accuracy, precision, recall, f1))
  • Exercise

Set the PCA dimension to 3 and measure the performance -> then switch the model to an SVM and measure again.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

b_cancer = load_breast_cancer()
scaler = StandardScaler()
b_cancer_scaled = scaler.fit_transform(b_cancer.data)

Y = b_cancer.target
X = b_cancer_scaled

pca = decomposition.PCA(n_components=3)  # 3 components, as the exercise specifies
pca.fit(X)
X = pca.transform(X)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

model = svm.SVC()
model.fit(X_train, Y_train)
Y_predict = model.predict(X_test)  # predict on the test data

accuracy = accuracy_score(Y_test, Y_predict)
precision = precision_score(Y_test, Y_predict)
recall = recall_score(Y_test, Y_predict)
f1 = f1_score(Y_test, Y_predict)
roc_auc = roc_auc_score(Y_test, Y_predict)

print('accuracy: {0:.3f}, precision: {1:.3f}, recall: {2:.3f}, F1: {3:.3f}'.format(accuracy, precision, recall, f1))

 

3) Decision trees

- Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/00240/

 


- Exploring the data

import numpy as np
import pandas as pd

# read the feature-name file
feature_name_df = pd.read_csv('UCI_HAR_Dataset/features.txt', sep='\s+', header=None, names=['index', 'feature_name'], engine='python')

# drop the index and keep feature_name as a list
feature_name = feature_name_df.iloc[:, 1].values.tolist()

# load each data file
X_train = pd.read_csv('UCI_HAR_Dataset/train/X_train.txt', sep='\s+', header=None, engine='python')
X_test = pd.read_csv('UCI_HAR_Dataset/test/X_test.txt', sep='\s+', header=None, engine='python')

Y_train = pd.read_csv('UCI_HAR_Dataset/train/y_train.txt', sep='\s+', header=None, engine='python')
Y_test = pd.read_csv('UCI_HAR_Dataset/test/y_test.txt', sep='\s+', header=None, engine='python')
label_name_df = pd.read_csv('UCI_HAR_Dataset/activity_labels.txt', sep='\s+', header=None, names=['index', 'label'], engine='python')

# drop the index and keep the class names as a list
label_name = label_name_df.iloc[:, 1].values.tolist()

label_name

 

- Building the model: a decision tree

from sklearn.tree import DecisionTreeClassifier

# Decision tree classification: 1) create the model
dt_HAR = DecisionTreeClassifier(random_state=156)

# Decision tree classification: 2) train the model
dt_HAR.fit(X_train, Y_train)

# Decision tree classification: 3) predict on the test data -> obtain Y_predict
Y_predict = dt_HAR.predict(X_test)

from sklearn.metrics import confusion_matrix
print (confusion_matrix(Y_test, Y_predict))

 

- Analyzing the results

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(Y_test, Y_predict)
print('Decision tree accuracy: {0:.4f}'.format(accuracy))
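For a multi-class problem like this one, a per-class breakdown is often more informative than a single accuracy number. A small sketch using scikit-learn's classification_report:

from sklearn.metrics import classification_report

# precision, recall, and F1 for each of the six activity classes
print(classification_report(Y_test, Y_predict, target_names=label_name))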

 

- Finding the best hyperparameters

from sklearn.model_selection import GridSearchCV

params = {
    'max_depth' : [ 8, 16, 20 ],
    'min_samples_split' : [ 8, 16, 24 ]
}

grid_cv = GridSearchCV(dt_HAR, param_grid=params, scoring='accuracy', 
                       cv=5, return_train_score=True)
grid_cv.fit(X_train , Y_train)

cv_results_df = pd.DataFrame(grid_cv.cv_results_)
cv_results_df[['param_max_depth','param_min_samples_split', 'mean_test_score', 'mean_train_score']]

print('Best mean accuracy: {0:.4f}, best hyperparameters: {1}'.format(grid_cv.best_score_, grid_cv.best_params_))

best_dt_HAR = grid_cv.best_estimator_
best_Y_predict = best_dt_HAR.predict(X_test)
best_accuracy = accuracy_score(Y_test, best_Y_predict)

print('Best decision tree accuracy: {0:.4f}'.format(best_accuracy))

 

- Checking the important features

import seaborn as sns
import matplotlib.pyplot as plt

feature_importance_values = dt_HAR.feature_importances_
feature_importance_values_s = pd.Series(feature_importance_values, index=X_train.columns)

feature_top10 = feature_importance_values_s.sort_values(ascending=False)[:10]
print(feature_top10.index)

# bar chart of the ten most important features, labeled with their names
sns.barplot(y=feature_top10, x=[feature_name[i] for i in feature_top10.index])
plt.xticks(rotation=60)
plt.show()
# retrain using only the top-10 features
dt_HAR2 = DecisionTreeClassifier(random_state=156)
dt_HAR2.fit(X_train.iloc[:, feature_top10.index], Y_train)

y_predict2 = dt_HAR2.predict(X_test.iloc[:, feature_top10.index])
accuracy = accuracy_score(Y_test, y_predict2)
print('Top-feature decision tree accuracy: {0:.4f}'.format(accuracy))
  • Exercise: compare against the performance when only the top-10 important features are used
# Performance using only the top-10 important features
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# define X and Y
b_cancer = load_breast_cancer()
Y = b_cancer.target
X_all = pd.DataFrame(b_cancer.data, columns=b_cancer.feature_names)

# split first, then find the top-10 features by decision-tree importance on the training portion
X_train_all, X_test_all, Y_train, Y_test = train_test_split(X_all, Y, test_size=0.3, random_state=0)
dt = DecisionTreeClassifier(random_state=156)
dt.fit(X_train_all, Y_train)
feature_importance_s = pd.Series(dt.feature_importances_, index=X_all.columns)
feature_top10 = feature_importance_s.sort_values(ascending=False)[:10]

# keep only the top-10 columns
X_trainraw = X_train_all[feature_top10.index]
X_testraw = X_test_all[feature_top10.index]

# build the model on the top-10 features only
model_logist = LogisticRegression(max_iter=5000)
model_logist.fit(X_trainraw, Y_train)
Y_predictraw = model_logist.predict(X_testraw)

accuracy_raw = accuracy_score(Y_test, Y_predictraw)
precision_raw = precision_score(Y_test, Y_predictraw)
recall_raw = recall_score(Y_test, Y_predictraw)
f1_raw = f1_score(Y_test, Y_predictraw)

print('accuracy: {0:.3f}, precision: {1:.3f}, recall: {2:.3f}, F1: {3:.3f}'.format(accuracy_raw, precision_raw, recall_raw, f1_raw))
# Baseline performance (all scaled features)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

scaler = StandardScaler()
b_cancer_scaled = scaler.fit_transform(b_cancer.data)  # fit the scaler to the data, then transform it
Y = b_cancer.target
X = b_cancer_scaled

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
lr_b_cancer = LogisticRegression()
lr_b_cancer.fit(X_train, Y_train)
Y_predict = lr_b_cancer.predict(X_test)

accuracy = accuracy_score(Y_test, Y_predict)
precision = precision_score(Y_test, Y_predict)
recall = recall_score(Y_test, Y_predict)
f1 = f1_score(Y_test, Y_predict)

print('accuracy: {0:.3f}, precision: {1:.3f}, recall: {2:.3f}, F1: {3:.3f}'.format(accuracy, precision, recall, f1))

 

4) Nominal feature coding

Data preparation

import seaborn as sns
import pandas as pd

titanic= sns.load_dataset("titanic")
titanic

# drop columns we will not use
titanic = titanic.drop(['pclass', 'deck', 'embark_town', 'alive', 'alone'], axis=1, inplace=False)
titanic.info()

# handle the remaining missing values
titanic.age = titanic.age.fillna(titanic.age.median())
titanic = titanic.dropna()  # drop any record that still has NA in any field
titanic.info()

 

Passing nominal features into the model as-is causes an error, so the nominal columns are dropped here:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
Y = titanic['survived']
X = titanic.drop(['survived', 'sex', 'embarked', 'who', 'adult_male', 'class'], axis=1, inplace=False)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=2)

 

Performance evaluation

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

model = LogisticRegression(max_iter=5000)  # raise the iteration limit, as in the earlier example
model.fit(X_train, Y_train)
Y_predict = model.predict(X_test)
print(confusion_matrix(Y_test, Y_predict))
f1 = f1_score(Y_test, Y_predict)
print("f1 score:", f1)

 

5) Nominal data encoding

  1. Label encoding: ordered (ordinal) categories are mapped to integers such as 1, 2, 3
  2. One-hot encoding: pure nominal categories become an n-dimensional feature with exactly one 1, e.g. 1000 or 0100

- Label encoding

  • from sklearn.preprocessing import LabelEncoder
  • le = LabelEncoder()
  • result = le.fit_transform(df['column'])
  • replace the original column

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()  # codes are assigned in sorted order; see below for controlling the order
result = le.fit_transform(titanic['class'])
print(result)
print(le.classes_)
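LabelEncoder assigns codes in alphabetical order of the class names. One way to impose an explicit order (a sketch using OrdinalEncoder, which is an assumption here, not part of the original lecture) is:

from sklearn.preprocessing import OrdinalEncoder

# assumed ordering for illustration: Third < Second < First
oe = OrdinalEncoder(categories=[['Third', 'Second', 'First']])
result_ordered = oe.fit_transform(titanic[['class']])
print(result_ordered[:5])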

 

- One-hot encoding

  • from sklearn.preprocessing import OneHotEncoder
  • ohe = OneHotEncoder(sparse_output=False)
  • fit_transform(df[['column']])
  • turn the result into DataFrame columns, add them, and drop the original column

# convert to one-hot vectors
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False)  # sparse_output replaces the older sparse argument in recent scikit-learn
sex_ohe = ohe.fit_transform(titanic[['sex']])
print(sex_ohe)

# turn the result into DataFrame columns; categories are sorted alphabetically, so female comes first
sex_df = pd.DataFrame(sex_ohe, columns=['sex_f', 'sex_m'])
sex_df

# swap the encoded columns in
print(titanic.shape, sex_df.shape)
titanic_ec = pd.concat([titanic.drop(columns=['sex']), sex_df], axis=1)
titanic_ec  # rows misalign here: dropna left gaps in titanic's index, while sex_df is indexed 0..n-1

# resetting the index before concatenating fixes the alignment
print(titanic.shape, sex_df.shape)
titanic_ec = pd.concat([titanic.reset_index().drop(columns=['index', 'sex']), sex_df], axis=1)
titanic_ec
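As an aside, pandas can do the same thing in one line and keeps the index aligned automatically (a sketch, not part of the original flow):

titanic_dummies = pd.get_dummies(titanic, columns=['sex'])
titanic_dummies.head()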

 

Handling several columns at once:

ohe = OneHotEncoder(sparse_output=False)
ec = ohe.fit_transform(titanic_ec[['embarked', 'who', 'adult_male']])
ec_df = pd.DataFrame(ec, columns=['e1', 'e2', 'e3', 'w1', 'w2', 'w3', 'a1', 'a2'])
ec_df
titanic_ec = pd.concat([titanic_ec.drop(columns=['embarked', 'who', 'adult_male']), ec_df], axis=1)
titanic_ec
from sklearn.model_selection import train_test_split
Y = titanic_ec['survived']
X = titanic_ec.drop(['survived'], axis=1, inplace=False)
print(X)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
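Instead of naming the columns e1..a2 by hand, recent scikit-learn versions can generate names from the fitted encoder itself (a sketch under that version assumption):

cols = ohe.get_feature_names_out(['embarked', 'who', 'adult_male'])
ec_df = pd.DataFrame(ec, columns=cols)  # e.g. embarked_C, who_man, adult_male_True, ...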

Analyzing the results

from sklearn.metrics import confusion_matrix, f1_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=5000)

# alternatively, an SVM; this assignment replaces the logistic model above
from sklearn import svm
model = svm.SVC(C=100)

model.fit(X_train, Y_train)
Y_predict = model.predict(X_test)
print(confusion_matrix(Y_test, Y_predict))
f1 = f1_score(Y_test, Y_predict)
print("f1 score:", f1)