[DACON] Novel Author Classification AI Competition Day 7!

솜씨좋은장씨 · 2020. 11. 5. 23:57

 

Novel Author Classification AI Competition
Source: DACON - Data Science Competition (dacon.io)

Day 7! Seven! I had a feeling some luck might come my way on this seventh day of the challenge.

 

As usual, today's work was done on the GPU server provided by aihub.

 

I had less time than expected today, so I set aside my original plan of producing a result with GloVe embeddings.

Instead, I experimented with lemmatization, tried reducing the validation split of the training data, and played with various other hyperparameters.

 

For the first attempt, I kept the same preprocessing as on day 6 and only changed the embedding dimension of the best model so far from 128 to 256.

 

Among these runs, I generated and submitted predictions from the model with the lowest validation loss.

# parameter settings
vocab_size = 20445
embedding_dim = 256
max_length = 204
padding_type='post'
from keras_preprocessing.text import Tokenizer
# fit the tokenizer on the training texts
tokenizer = Tokenizer(num_words = vocab_size)#, oov_token=oov_tok)
tokenizer.fit_on_texts(X_train)
word_index = tokenizer.word_index
from keras_preprocessing.sequence import pad_sequences
# convert the texts to integer sequences and pad them
train_sequences = tokenizer.texts_to_sequences(X_train)
train_padded = pad_sequences(train_sequences, padding=padding_type, maxlen=max_length)

test_sequences = tokenizer.texts_to_sequences(X_test)
test_padded = pad_sequences(test_sequences, padding=padding_type, maxlen=max_length)
import tensorflow as tf
# build a lightweight NLP model
model23 = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5, activation='softmax')
])

# compile model
model23.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# model summary
print(model23.summary())

# fit model
num_epochs = 15
history23 = model23.fit(train_padded, y_train, 
                    epochs=num_epochs, batch_size=256,
                    validation_split=0.2)
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 204, 256)          5233920   
_________________________________________________________________
global_average_pooling1d_1 ( (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 24)                6168      
_________________________________________________________________
dropout_1 (Dropout)          (None, 24)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 125       
=================================================================
Total params: 5,240,213
Trainable params: 5,240,213
Non-trainable params: 0
_________________________________________________________________
None
Train on 43903 samples, validate on 10976 samples
Epoch 1/15
43903/43903 [==============================] - 9s 207us/sample - loss: 1.5643 - accuracy: 0.2751 - val_loss: 1.5410 - val_accuracy: 0.2866
Epoch 2/15
43903/43903 [==============================] - 8s 180us/sample - loss: 1.4696 - accuracy: 0.3736 - val_loss: 1.3521 - val_accuracy: 0.4535
...
Epoch 14/15
43903/43903 [==============================] - 9s 204us/sample - loss: 0.5464 - accuracy: 0.8036 - val_loss: 0.7127 - val_accuracy: 0.7365
Epoch 15/15
43903/43903 [==============================] - 8s 187us/sample - loss: 0.5265 - accuracy: 0.8110 - val_loss: 0.7137 - val_accuracy: 0.7343
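Since I picked the model to submit by hand from the run with the lowest validation loss, here is a minimal sketch (my own addition, not part of the original notebook) of reading the best epoch off the History object that fit returns:

import numpy as np

# history23.history holds one entry per epoch for each tracked metric
val_losses = history23.history['val_loss']
best_epoch = int(np.argmin(val_losses)) + 1  # +1 because the log counts epochs from 1
print("best epoch:", best_epoch, "val_loss:", round(min(val_losses), 4))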

Generating the submission

import pandas as pd

# predict class probabilities and write the submission file
sample_submission = pd.read_csv("./sample_submission.csv")
pred = model23.predict_proba(test_padded)  # note: predict_proba was removed in newer TF versions; model23.predict(test_padded) returns the same softmax probabilities
sample_submission[['0','1','2','3','4']] = pred
sample_submission.to_csv('submission_19.csv', index = False, encoding = 'utf-8')

 

DACON submission result

Unfortunately, the score came out to 0.4618681292, which fell short of my best score from day 6.

 

This time I decided to preprocess with WordNetLemmatizer instead of PorterStemmer and see how the results turned out.

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
# nltk.download('punkt') and nltk.download('wordnet') may be needed on first run
lemmatizer = WordNetLemmatizer()
import re
from tqdm import tqdm

X_train = []

train_clear_text = list(train_dataset['clear_text'])

# tokenize, lemmatize, and drop single-character tokens
for i in tqdm(range(len(train_clear_text))):
    temp = word_tokenize(train_clear_text[i])
    temp = [lemmatizer.lemmatize(word) for word in temp]
    temp = [word for word in temp if len(word) > 1]

    X_train.append(temp)

X_test = []

test_clear_text = list(test_dataset['clear_text'])

for i in tqdm(range(len(test_clear_text))):
    temp = word_tokenize(test_clear_text[i])
    temp = [lemmatizer.lemmatize(word) for word in temp]
    temp = [word for word in temp if len(word) > 1]

    X_test.append(temp)
# collect every token to count the size of the vocabulary
word_list = []

for i in tqdm(range(len(X_train))):
    for j in range(len(X_train[i])):
        word_list.append(X_train[i][j])
len(list(set(word_list)))
29620

The vocabulary ended up much larger than with PorterStemmer or LancasterStemmer.
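As a rough illustration (my own example, not from the original post) of why lemmatization leaves a larger vocabulary than stemming: a stemmer aggressively strips suffixes, while the WordNet lemmatizer only maps words to dictionary lemmas, so more distinct surface forms survive.

from nltk.stem import PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for w in ["studies", "running", "happiness"]:
    # e.g. "studies" -> stem "studi" vs lemma "study"; "running" keeps its form under the lemmatizer
    print(w, porter.stem(w), lemmatizer.lemmatize(w))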

import numpy as np
y_train = np.array([x for x in train_dataset['author']])
import matplotlib.pyplot as plt
print("max length :", max(len(l) for l in X_train))
print("mean length :", sum(map(len, X_train)) / len(X_train))
plt.hist([len(s) for s in X_train], bins=50)
plt.xlabel('length of Data')
plt.ylabel('number of Data')
plt.show()

The longest training sample was 199 tokens and the mean length was 19.24 tokens.

I set the parameters based on these numbers.

# parameter settings
vocab_size = 29620
embedding_dim = 128
max_length = 200
padding_type='post'
from keras_preprocessing.text import Tokenizer
# fit the tokenizer on the training texts
tokenizer = Tokenizer(num_words = vocab_size)#, oov_token=oov_tok)
tokenizer.fit_on_texts(X_train)
word_index = tokenizer.word_index
from keras_preprocessing.sequence import pad_sequences
# convert the texts to integer sequences and pad them
train_sequences = tokenizer.texts_to_sequences(X_train)
train_padded = pad_sequences(train_sequences, padding=padding_type, maxlen=max_length)

test_sequences = tokenizer.texts_to_sequences(X_test)
test_padded = pad_sequences(test_sequences, padding=padding_type, maxlen=max_length)

 

This time I used a callback that saves a checkpoint every time the validation loss improves. After training, I reloaded the most promising checkpoint, generated the predictions, and submitted them.

import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint  # use the tf.keras callback to match the tf.keras model
import os

# save a checkpoint whenever the validation loss improves
MODEL_SAVE_FOLDER_PATH = './model07_30/'
if not os.path.exists(MODEL_SAVE_FOLDER_PATH):
    os.mkdir(MODEL_SAVE_FOLDER_PATH)

model_path = MODEL_SAVE_FOLDER_PATH + '{epoch:02d}-{val_loss:.4f}.hdf5'

cb_checkpoint = ModelCheckpoint(filepath=model_path, monitor='val_loss',
                                verbose=1, save_best_only=True)

# build a lightweight NLP model
model30 = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(5, activation='softmax')
])

# compile model
model30.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# model summary
print(model30.summary())

# fit model
num_epochs = 15
history30 = model30.fit(train_padded, y_train, 
                    epochs=num_epochs, batch_size=256,
                    validation_split=0.2, callbacks=[cb_checkpoint])
Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_6 (Embedding)      (None, 200, 128)          3791360   
_________________________________________________________________
global_average_pooling1d_6 ( (None, 128)               0         
_________________________________________________________________
dense_12 (Dense)             (None, 24)                3096      
_________________________________________________________________
dropout_6 (Dropout)          (None, 24)                0         
_________________________________________________________________
dense_13 (Dense)             (None, 5)                 125       
=================================================================
Total params: 3,794,581
Trainable params: 3,794,581
Non-trainable params: 0
_________________________________________________________________
None
Train on 43903 samples, validate on 10976 samples
Epoch 1/15
43520/43903 [============================>.] - ETA: 0s - loss: 1.5630 - accuracy: 0.2871
Epoch 00001: val_loss improved from inf to 1.53889, saving model to ./model07_30/01-1.5389.hdf5
43903/43903 [==============================] - 6s 133us/sample - loss: 1.5631 - accuracy: 0.2873 - val_loss: 1.5389 - val_accuracy: 0.4144
Epoch 2/15
43776/43903 [============================>.] - ETA: 0s - loss: 1.4604 - accuracy: 0.3932
...
Epoch 00014: val_loss improved from 0.71672 to 0.71297, saving model to ./model07_30/14-0.7130.hdf5
43903/43903 [==============================] - 5s 124us/sample - loss: 0.4765 - accuracy: 0.8323 - val_loss: 0.7130 - val_accuracy: 0.7347
Epoch 15/15
43520/43903 [============================>.] - ETA: 0s - loss: 0.4525 - accuracy: 0.8410
Epoch 00015: val_loss did not improve from 0.71297
43903/43903 [==============================] - 5s 119us/sample - loss: 0.4525 - accuracy: 0.8411 - val_loss: 0.7148 - val_accuracy: 0.7397

Here I loaded the checkpoint saved when the validation loss dropped to 0.7130 and generated the predictions from it.

 

Generating the submission

from tensorflow.keras.models import load_model
import pandas as pd

# load the best checkpoint and write the submission file
best_model_path = "./model07_30/14-0.7130.hdf5"
best_model = load_model(best_model_path)
sample_submission = pd.read_csv("./sample_submission.csv")
pred = best_model.predict_proba(test_padded)
sample_submission[['0','1','2','3','4']] = pred
sample_submission.to_csv('submission_20.csv', index = False, encoding = 'utf-8')

 

DACON submission result

 

Hmm... maybe because I was rushing to submit, this one came out even worse.

 

Next, I went back to the data preprocessed with PorterStemmer, reduced the validation split of the training data from 20% to 10%, saved checkpoints along the way just as before, and loaded the most promising checkpoint to generate the predictions.

# parameter settings
vocab_size = 20445
embedding_dim = 128
max_length = 204
padding_type='post'
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint
import os

# save a checkpoint whenever the validation loss improves
MODEL_SAVE_FOLDER_PATH = './model07_41/'
if not os.path.exists(MODEL_SAVE_FOLDER_PATH):
    os.mkdir(MODEL_SAVE_FOLDER_PATH)

model_path = MODEL_SAVE_FOLDER_PATH + '{epoch:02d}-{val_loss:.4f}.hdf5'

cb_checkpoint = ModelCheckpoint(filepath=model_path, monitor='val_loss',
                                verbose=1, save_best_only=True)

# build a lightweight NLP model
model41 = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(5, activation='softmax')
])

# compile model
model41.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# model summary
print(model41.summary())

# fit model
num_epochs = 15
history41 = model41.fit(train_padded, y_train, 
                    epochs=num_epochs, batch_size=256,
                    validation_split=0.1, callbacks=[cb_checkpoint])
Model: "sequential_14"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_14 (Embedding)     (None, 204, 128)          2616960   
_________________________________________________________________
global_average_pooling1d_13  (None, 128)               0         
_________________________________________________________________
dense_28 (Dense)             (None, 24)                3096      
_________________________________________________________________
dropout_14 (Dropout)         (None, 24)                0         
_________________________________________________________________
dense_29 (Dense)             (None, 5)                 125       
=================================================================
Total params: 2,620,181
Trainable params: 2,620,181
Non-trainable params: 0
_________________________________________________________________
None
Train on 49391 samples, validate on 5488 samples
Epoch 1/15
48640/49391 [============================>.] - ETA: 0s - loss: 1.5703 - accuracy: 0.2639
Epoch 00001: val_loss improved from inf to 1.56145, saving model to ./model07_41/01-1.5615.hdf5
49391/49391 [==============================] - 5s 96us/sample - loss: 1.5698 - accuracy: 0.2645 - val_loss: 1.5615 - val_accuracy: 0.2562
Epoch 2/15
48640/49391 [============================>.] - ETA: 0s - loss: 1.4843 - accuracy: 0.3704
...
Epoch 14/15
49152/49391 [============================>.] - ETA: 0s - loss: 0.5429 - accuracy: 0.8011
Epoch 00014: val_loss improved from 0.69746 to 0.69702, saving model to ./model07_41/14-0.6970.hdf5
49391/49391 [==============================] - 5s 96us/sample - loss: 0.5425 - accuracy: 0.8013 - val_loss: 0.6970 - val_accuracy: 0.7418
Epoch 15/15
48896/49391 [============================>.] - ETA: 0s - loss: 0.5230 - accuracy: 0.8095
Epoch 00015: val_loss improved from 0.69702 to 0.69151, saving model to ./model07_41/15-0.6915.hdf5
49391/49391 [==============================] - 5s 101us/sample - loss: 0.5236 - accuracy: 0.8093 - val_loss: 0.6915 - val_accuracy: 0.7422

Here the validation loss came out lowest at 0.6915, so I used that checkpoint to generate the predictions.

 

Generating the submission

best_model_path = "./model07_41/15-0.6915.hdf5"
best_model = load_model(best_model_path)
# predict values
sample_submission = pd.read_csv("./sample_submission.csv")
pred = best_model.predict_proba(test_padded)
sample_submission[['0','1','2','3','4']] = pred
sample_submission.to_csv('submission_21.csv', index = False, encoding = 'utf-8')

 

DACON submission result

 

The final result for day 7 was 0.4467410131, just barely missing a new personal best.

 

Even though today didn't go my way, I want to keep researching and applying different approaches, such as GloVe embeddings, to build the best model I can.
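As a note for next time, here is a rough sketch (my own, untested here) of how pre-trained GloVe vectors could be plugged into the Embedding layer. The file name glove.6B.100d.txt is an assumption (it would have to be downloaded separately), and vocab_size, word_index, and max_length are the values defined earlier:

import numpy as np
import tensorflow as tf

glove_dim = 100
embedding_index = {}
# each line of the GloVe file is: word v1 v2 ... v100
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embedding_index[values[0]] = np.asarray(values[1:], dtype='float32')

# build an embedding matrix aligned with the tokenizer's word_index
embedding_matrix = np.zeros((vocab_size, glove_dim))
for word, idx in word_index.items():
    if idx < vocab_size and word in embedding_index:
        embedding_matrix[idx] = embedding_index[word]

# pass the matrix as initial weights; trainable=False keeps the GloVe vectors fixed
embedding_layer = tf.keras.layers.Embedding(vocab_size, glove_dim,
                                            weights=[embedding_matrix],
                                            input_length=max_length,
                                            trainable=False)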

I'll keep at it until the day the competition ends!

Thank you for reading.
