07. Machine Learning - Classification
1. Logistic Regression
1) What this covers
- Logistic regression
- Preprocessing: scaling, PCA
- Decision trees
- Classification performance metrics
- Encoding
2) Classification
- X: feature vector
- y: class (nominal); in regression, y is a numeric value
- The simplest classifier, ZeroR: always predict the majority class (a minimal sketch follows this list)
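ZeroR does not ship in scikit-learn under that name, but DummyClassifier with strategy='most_frequent' behaves the same way; a minimal sketch with a made-up toy dataset:
from sklearn.dummy import DummyClassifier
import numpy as np
X_toy = np.zeros((5, 1))           # ZeroR ignores the features entirely
y_toy = np.array([1, 1, 1, 0, 0])  # the majority class is 1
zero_r = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
print(zero_r.predict(X_toy))       # [1 1 1 1 1]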
2. Logistic Regression in Practice
1) Logistic regression
- How it classifies: the sigmoid function; depending on the value of X, y flips from 0 to 1 past a certain point (2-class)
- For multi-class, softmax: the outputs sum to 1 and serve as per-class probabilities; the class with the highest value wins (a minimal sketch of both functions follows)
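A minimal numpy sketch of both functions (my own illustration, not from the lecture notes):
import numpy as np
def sigmoid(z):
    # squashes any real number into (0, 1); threshold at 0.5 for a 2-class decision
    return 1 / (1 + np.exp(-z))
def softmax(z):
    # subtracting the max keeps exp() from overflowing; the outputs sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()
print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # [0.119 0.5 0.881]
print(softmax(np.array([2.0, 1.0, 0.1])))   # ~[0.659 0.242 0.099], sums to 1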
1) Acquiring the data
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
b_cancer = load_breast_cancer()
2) Exploring the data and building a DataFrame
print(b_cancer.DESCR)
b_cancer_df = pd.DataFrame(b_cancer.data, columns=b_cancer.feature_names)  # convert to a DataFrame
3) Counting class frequencies
b_cancer_df['diagnosis'] = b_cancer.target
b_cancer_df['diagnosis'].value_counts()
4) Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.kdeplot(b_cancer_df[b_cancer_df['diagnosis'] == 0]['mean radius'])
sns.kdeplot(b_cancer_df[b_cancer_df['diagnosis'] == 1]['mean radius'])
2-1) Preprocessing: scaling
- Scaling: puts every feature on a comparable range of values
- Standard scaler: z = (x - u) / s, roughly -a to +a
- Min-max scaler: z = (x - min) / (max - min), 0 to 1 (a MinMaxScaler sketch follows the StandardScaler code below)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
b_cancer_scaled = scaler.fit_transform(b_cancer.data)  # fit the scaler to the data and transform it in one step
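For comparison, a minimal MinMaxScaler sketch (my own addition, not used in the rest of this post; it implements the min-max formula above):
from sklearn.preprocessing import MinMaxScaler
mm_scaler = MinMaxScaler()
b_cancer_minmax = mm_scaler.fit_transform(b_cancer.data)
# every feature now spans exactly 0~1
print(b_cancer_minmax.min(axis=0)[:3], b_cancer_minmax.max(axis=0)[:3])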
- Checking the scaling result
import matplotlib.pyplot as plt
import seaborn as sns
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(6, 5))
ax1.set_title('Before Scaling')
sns.kdeplot(b_cancer.data[:, 0], ax=ax1)  # mean radius
sns.kdeplot(b_cancer.data[:, 2], ax=ax1)  # mean perimeter
ax2.set_title('After Standard Scaler')
sns.kdeplot(b_cancer_scaled[:,0], ax=ax2)
sns.kdeplot(b_cancer_scaled[:,2], ax=ax2)
plt.show()
- Training the model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# set up X and Y
Y = b_cancer_df['diagnosis']
X = b_cancer_scaled
Y.value_counts()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
lr_b_cancer = LogisticRegression()
lr_b_cancer.fit(X_train, Y_train)
- Evaluating performance
# Logistic regression step (3): predict on the test data -> get Y_predict
Y_predict = lr_b_cancer.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
# Confusion matrix: the \ diagonal holds the correct predictions (60, 107), the / diagonal the errors (3, 1)
print(confusion_matrix(Y_test, Y_predict))  # accuracy is quite high here
- F1-score combines recall and precision into a single metric: F1 = 2 * precision * recall / (precision + recall)
accuracy = accuracy_score(Y_test, Y_predict)
precision = precision_score(Y_test, Y_predict)
recall = recall_score(Y_test, Y_predict)
f1 = f1_score(Y_test, Y_predict)
- Comparing performance against unscaled data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
b_cancer = load_breast_cancer()
b_cancer_df = pd.DataFrame(b_cancer.data, columns=b_cancer.feature_names)
b_cancer_df['diagnosis'] = b_cancer.target
# unscaled data
Y_raw = b_cancer_df['diagnosis']
X_raw = b_cancer.data
model = LogisticRegression(max_iter=5000)
X_trainraw, X_testraw, Y_train, Y_test = train_test_split(X_raw, Y_raw, test_size=0.3, random_state=0)
model.fit(X_trainraw, Y_train)
Y_predictraw = model.predict(X_testraw)
accuracy_raw = accuracy_score(Y_test, Y_predictraw)
precision_raw = precision_score(Y_test, Y_predictraw)
recall_raw = recall_score(Y_test, Y_predictraw)
f1_raw = f1_score(Y_test, Y_predictraw)
print('raw accuracy: {0:.3f}, precision: {1:.3f}, recall: {2:.3f}, F1: {3:.3f}'.format(accuracy_raw, precision_raw, recall_raw, f1_raw))
# scaled data
b_cancer = load_breast_cancer()
b_cancer_df = pd.DataFrame(b_cancer.data, columns=b_cancer.feature_names)
b_cancer_df['diagnosis'] = b_cancer.target
scaler = StandardScaler()
b_cancer_scaled = scaler.fit_transform(b_cancer.data)
Y = b_cancer_df['diagnosis']
X = b_cancer_scaled
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
lr_b_cancer = LogisticRegression()
lr_b_cancer.fit(X_train, Y_train)
Y_predict = lr_b_cancer.predict(X_test)
accuracy = accuracy_score(Y_test, Y_predict)
precision = precision_score(Y_test, Y_predict)
recall = recall_score(Y_test, Y_predict)
f1 = f1_score(Y_test, Y_predict)
print('accuracy: {0:.3f}, precision: {1:.3f}, recall: {2:.3f}, F1: {3:.3f}'.format(accuracy, precision, recall, f1))
2-2) Preprocessing: PCA
- A dimensionality-reduction algorithm
- Benefits of reducing dimensions:
- Faster processing
- Visualization
- Noise removal
- e.g. 2 dimensions -> 1 dimension
- 30-dimensional data cannot be visualized directly
- Reducing from 30 dimensions to an extreme 2 makes visualization possible (a quick sketch follows this list)
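As a quick illustration of the visualization point, a sketch (my own, assuming b_cancer and b_cancer_scaled from section 2-1 are still in scope) that projects the 30-dimensional data down to 2 components and plots it:
from sklearn import decomposition
import matplotlib.pyplot as plt
pca2 = decomposition.PCA(n_components=2)
b_cancer_2d = pca2.fit_transform(b_cancer_scaled)  # (569, 30) -> (569, 2)
plt.scatter(b_cancer_2d[:, 0], b_cancer_2d[:, 1], c=b_cancer.target, s=10)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
print(pca2.explained_variance_ratio_)  # share of the variance each component keeps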
from sklearn import decomposition
pca = decomposition.PCA(n_components=8)
pca.fit(b_cancer_scaled)
b_cancer_pca = pca.transform(b_cancer_scaled)
# set up X and Y
Y = b_cancer_df['diagnosis']
X = b_cancer_pca
# split into training and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
# Logistic regression step (1): create the model
lr_b_cancer = LogisticRegression()
# Logistic regression step (2): train the model
lr_b_cancer.fit(X_train, Y_train)
# Logistic regression step (3): predict on the test data -> get Y_predict
Y_predict = lr_b_cancer.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
accuracy = accuracy_score(Y_test, Y_predict)
precision = precision_score(Y_test, Y_predict)
recall = recall_score(Y_test, Y_predict)
f1 = f1_score(Y_test, Y_predict)
roc_auc = roc_auc_score(Y_test, Y_predict)
print('accuracy: {0:.3f}, precision: {1:.3f}, recall: {2:.3f}, F1: {3:.3f}'.format(accuracy, precision, recall, f1))
- Exercise
Set the PCA dimension to 3 and measure performance -> then switch the model to SVM and measure performance
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
b_cancer = load_breast_cancer()
scaler = StandardScaler()
b_cancer_scaled = scaler.fit_transform(b_cancer.data)
Y = b_cancer.target
X = b_cancer_scaled
pca = decomposition.PCA(n_components=8)  # the exercise asks for 3; swap in n_components=3 to compare
pca.fit(X)
X = pca.transform(X)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
model = svm.SVC()
model.fit(X_train, Y_train)
Y_predict = model.predict(X_test)  # this prediction step was missing
accuracy = accuracy_score(Y_test, Y_predict)
precision = precision_score(Y_test, Y_predict)
recall = recall_score(Y_test, Y_predict)
f1 = f1_score(Y_test, Y_predict)
roc_auc = roc_auc_score(Y_test, Y_predict)
print('accuracy: {0:.3f}, precision: {1:.3f}, recall: {2:.3f}, F1: {3:.3f}'.format(accuracy, precision, recall, f1))
3) Decision Trees
- Data collection (UCI HAR dataset): https://archive.ics.uci.edu/ml/machine-learning-databases/00240/
- Data exploration
import numpy as np
import pandas as pd
# read the feature-name file
feature_name_df = pd.read_csv('UCI_HAR_Dataset/features.txt', sep=r'\s+', header=None, names=['index', 'feature_name'], engine='python')
# drop the index and keep only feature_name as a list
feature_name = feature_name_df.iloc[:, 1].values.tolist()
# load each data file
X_train = pd.read_csv('UCI_HAR_Dataset/train/X_train.txt', sep=r'\s+', header=None, engine='python')
X_test = pd.read_csv('UCI_HAR_Dataset/test/X_test.txt', sep=r'\s+', header=None, engine='python')
Y_train = pd.read_csv('UCI_HAR_Dataset/train/y_train.txt', sep=r'\s+', header=None, engine='python')
Y_test = pd.read_csv('UCI_HAR_Dataset/test/y_test.txt', sep=r'\s+', header=None, engine='python')
label_name_df = pd.read_csv('UCI_HAR_Dataset/activity_labels.txt', sep=r'\s+', header=None, names=['index', 'label'], engine='python')
# drop the index and keep only the class names as a list
label_name = label_name_df.iloc[:, 1].values.tolist()
label_name
- Building the model: decision tree
from sklearn.tree import DecisionTreeClassifier
# Decision tree step (1): create the model
dt_HAR = DecisionTreeClassifier(random_state=156)
# Decision tree step (2): train the model
dt_HAR.fit(X_train, Y_train)
# Decision tree step (3): predict on the test data -> get Y_predict
Y_predict = dt_HAR.predict(X_test)
from sklearn.metrics import confusion_matrix
print (confusion_matrix(Y_test, Y_predict))
- Analyzing the results
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(Y_test, Y_predict)
print('Decision tree accuracy: {0:.4f}'.format(accuracy))
- Finding the best hyperparameters
from sklearn.model_selection import GridSearchCV
params = {
'max_depth' : [ 8, 16, 20 ],
'min_samples_split' : [ 8, 16, 24 ]
}
grid_cv = GridSearchCV(dt_HAR, param_grid=params, scoring='accuracy',
cv=5, return_train_score=True)
grid_cv.fit(X_train , Y_train)
cv_results_df = pd.DataFrame(grid_cv.cv_results_)
cv_results_df[['param_max_depth','param_min_samples_split', 'mean_test_score', 'mean_train_score']]
print('Best mean accuracy: {0:.4f}, best hyperparameters: {1}'.format(grid_cv.best_score_, grid_cv.best_params_))
best_dt_HAR = grid_cv.best_estimator_
best_Y_predict = best_dt_HAR.predict(X_test)
best_accuracy = accuracy_score(Y_test, best_Y_predict)
print('Best decision tree accuracy: {0:.4f}'.format(best_accuracy))
- Inspecting important features
import seaborn as sns
import matplotlib.pyplot as plt
X_train = pd.read_csv('UCI_HAR_Dataset/train/X_train.txt', sep=r'\s+', header=None, engine='python')  # reload X_train (same file as in the exploration step)
feature_importance_values = dt_HAR.feature_importances_
feature_importance_values_s = pd.Series(feature_importance_values, index=X_train.columns)
feature_top10 = feature_importance_values_s.sort_values(ascending=False)[:10]
print(feature_top10.index)
sns.barplot(x=[feature_name[i] for i in feature_top10.index], y=feature_top10.values)
plt.xticks(rotation=60)
dt_HAR2 = DecisionTreeClassifier(random_state=156)
dt_HAR2.fit(X_train.iloc[:, feature_top10.index], Y_train)
y_predict2 = dt_HAR2.predict(X_test.iloc[:, feature_top10.index])
accuracy = accuracy_score(Y_test, y_predict2)
print('Top-10-feature decision tree accuracy: {0:.4f}'.format(accuracy))
- Exercise: compare performance against using only the 10 important features
# performance using the 10 important features
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
# define X and Y
Y = b_cancer.target
Xraw = b_cancer.data
# find the top-10 features (carried over from the HAR decision tree above)
X_train = pd.read_csv('UCI_HAR_Dataset/train/X_train.txt', sep=r'\s+', header=None, engine='python')
feature_importance_values = dt_HAR.feature_importances_
feature_importance_values_s = pd.Series(feature_importance_values, index=X_train.columns)
feature_top10 = feature_importance_values_s.sort_values(ascending=False)[:10]
X_trainraw = X_train.iloc[:, feature_top10.index]
# build the model; note the split below runs over the raw breast-cancer data,
# so X_trainraw is immediately overwritten and the HAR top-10 selection is not applied here
X_trainraw, X_testraw, Y_train, Y_test = train_test_split(Xraw, Y, test_size=0.3)
model_logist = LogisticRegression(max_iter=5000)
model_logist.fit(X_trainraw, Y_train)
Y_predictraw = model_logist.predict(X_testraw)
accuracy_raw = accuracy_score(Y_test, Y_predictraw)
precision_raw = precision_score(Y_test, Y_predictraw)
recall_raw = recall_score(Y_test, Y_predictraw)
f1_raw = f1_score(Y_test, Y_predictraw)
print('accuracy: {0:.3f}, precision: {1:.3f}, recall: {2:.3f}, F1: {3:.3f}'.format(accuracy_raw, precision_raw, recall_raw, f1_raw))
# baseline performance (scaled, all features)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
scaler = StandardScaler()
b_cancer_scaled = scaler.fit_transform(b_cancer.data)  # fit the scaler to the data and transform it in one step
Y = b_cancer_df['diagnosis']
X = b_cancer_scaled
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
lr_b_cancer = LogisticRegression()
lr_b_cancer.fit(X_train, Y_train)
Y_predict = lr_b_cancer.predict(X_test)
accuracy = accuracy_score(Y_test, Y_predict)
precision = precision_score(Y_test, Y_predict)
recall = recall_score(Y_test, Y_predict)
f1 = f1_score(Y_test, Y_predict)
print('accuracy: {0:.3f}, precision: {1:.3f}, recall: {2:.3f}, F1: {3:.3f}'.format(accuracy, precision, recall, f1))
4) nominal feature coding
Data preparation
import seaborn as sns
import pandas as pd
titanic = sns.load_dataset("titanic")
titanic
# drop columns we won't use
titanic = titanic.drop(['pclass', 'deck', 'embark_town', 'alive', 'alone'], axis=1, inplace=False)
titanic.info()
# handle the remaining missing values
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
titanic = titanic.dropna()  # drop any record that still has an NA in any field
titanic.info()
Feeding the nominal columns in as-is makes the nominal data raise an error, so here they are simply dropped:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
Y = titanic['survived']
X = titanic.drop(['survived', 'sex', 'embarked', 'who', 'adult_male', 'class'], axis=1, inplace=False)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=2)
Performance evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
model = LogisticRegression(max_iter=5000)  # set the iteration limit, as in the earlier exercise
model.fit(X_train, Y_train)
Y_predict = model.predict(X_test)
print (confusion_matrix(Y_test, Y_predict))
f1 = f1_score(Y_test, Y_predict)
print ("f1 score:", f1)
5) nominal data encoding
- Label encoding: ordered nominal values get consecutive numbers such as 1, 2, 3
- One-hot encoding: pure nominal values become an n-dimensional feature with a single 1, such as 1000 or 0100
- Label encoding recipe
- from sklearn.preprocessing import LabelEncoder
- le = LabelEncoder()
- result = le.fit_transform(df['column'])
- replace the original column
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()  # labels come out in alphabetical order; see the sketch below for fixing the order yourself
result = le.fit_transform(titanic['class'])
print(result)
print(le.classes_)
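One way to control the order explicitly (my own sketch, not from the lecture) is OrdinalEncoder, which accepts the category order as a parameter:
from sklearn.preprocessing import OrdinalEncoder
# map Third -> 0, Second -> 1, First -> 2 instead of the alphabetical order
oe = OrdinalEncoder(categories=[['Third', 'Second', 'First']])
result_ordered = oe.fit_transform(titanic[['class']])
print(result_ordered[:5])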
- One-hot encoding recipe
- from sklearn.preprocessing import OneHotEncoder
- ohe = OneHotEncoder(sparse_output=False)
- fit_transform(df[['column']])
- turn the result into a DataFrame, append its columns, and drop the original column
# convert to one-hot vectors
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)  # 'sparse=' was renamed 'sparse_output=' in scikit-learn 1.2
sex_ohe = ohe.fit_transform(titanic[['sex']])
print (sex_ohe)
# turn it into DataFrame columns; categories are sorted alphabetically, so female comes first
sex_df = pd.DataFrame(sex_ohe, columns=['sex_f', 'sex_m'])
sex_df
# swap the encoded columns in
print(titanic.shape, sex_df.shape)
titanic_ec = pd.concat([titanic.drop(columns=['sex']), sex_df], axis=1)
titanic_ec
# concat aligns on the index: after dropna, titanic's index has gaps while sex_df
# is numbered 0..n-1, so the concat above misaligns; resetting the index fixes it
print(titanic.shape, sex_df.shape)
titanic_ec = pd.concat([titanic.reset_index().drop(columns=['index', 'sex']), sex_df], axis=1)
titanic_ec
Handling several columns at once
ohe = OneHotEncoder(sparse_output=False)
ec = ohe.fit_transform(titanic_ec[['embarked', 'who', 'adult_male']])
# embarked has 3 categories, who has 3, adult_male has 2: eight columns in total
ec_df = pd.DataFrame(ec, columns=['e1', 'e2', 'e3', 'w1', 'w2', 'w3', 'a1', 'a2'])
ec_df
titanic_ec = pd.concat([titanic_ec.drop(columns=['embarked', 'who', 'adult_male']), ec_df], axis=1)
titanic_ec
from sklearn.model_selection import train_test_split
Y = titanic_ec['survived']
X = titanic_ec.drop(['survived'], axis=1, inplace=False)
print (X)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
Analyzing the results
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
from sklearn import svm
model = svm.SVC(C=100)  # overrides the logistic model above; the SVM is what actually gets trained
model.fit(X_train, Y_train)
Y_predict = model.predict(X_test)
print(confusion_matrix(Y_test, Y_predict))
f1 = f1_score(Y_test, Y_predict)
print("f1 score:", f1)