[Kaggle DAY08]Real or Not? NLP with Disaster Tweets!

솜씨좋은장씨 · 2020. 3. 5. 18:43

Kaggle challenge, day 8!

Today I took the model from day 7, which had the best results so far, changed the data preprocessing, and submitted again.

 

For preprocessing, I used a regular expression to replace URLs of the form https://~~ with the token LINK.

from tqdm import tqdm
import re

text_list = list(train_data['text'])

clear_text_list = []

for text in tqdm(text_list):
  clear_text = text.lower()
  # Replace full URLs (scheme://host/path) with the token LINK.
  pattern = r'(http|ftp|https)://(?:[-\w.]|(?:%[\da-fA-F]{2}))+/(?:[-\w.]|(?:%[\da-fA-F]{2}))+'
  clear_text = re.sub(pattern=pattern, repl='LINK', string=clear_text)
  # Also catch leftover scheme fragments (e.g. truncated "http..." pieces);
  # the original assigned this pattern but never applied it.
  pattern = r'(http|ftp|https)[\w+]'
  clear_text = re.sub(pattern=pattern, repl='LINK', string=clear_text)
  clear_text_list.append(clear_text)

train_data['clear_text'] = clear_text_list
train_data
train_data
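As a quick sanity check of the URL pattern above, here is the same regular expression applied to a made-up sample tweet (the sample text and URL are hypothetical, not from the dataset):

```python
import re

# The same URL pattern used in the preprocessing loop above,
# applied to a made-up sample tweet for a quick sanity check.
pattern = r'(http|ftp|https)://(?:[-\w.]|(?:%[\da-fA-F]{2}))+/(?:[-\w.]|(?:%[\da-fA-F]{2}))+'
sample = "forest fire near la ronge sask. canada https://t.co/abc123 stay safe"

cleaned = re.sub(pattern, 'LINK', sample)
print(cleaned)  # forest fire near la ronge sask. canada LINK stay safe
```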

 

First submission

model = Sequential()
model.add(Embedding(max_words, 100, input_length=24)) # embedding vector dimension: 100
model.add(Dropout(0.2))
model.add(Conv1D(256,
                 3,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(2, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) 
history = model.fit(X_train_vec, y_train, epochs=3, batch_size=32, validation_split=0.1)
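The post does not show how predictions were turned into submission labels. With a two-unit sigmoid output layer, the predicted class is presumably the per-row argmax; a minimal pure-Python sketch with made-up scores (`pred_probs` is hypothetical):

```python
# Hypothetical model outputs: each row holds the two sigmoid scores
# [not-disaster, disaster] for one tweet (values are made up).
pred_probs = [
    [0.91, 0.12],
    [0.20, 0.85],
    [0.55, 0.40],
]

# Take the higher-scoring class per row, the pure-Python equivalent
# of model.predict(X_test_vec).argmax(axis=1).
pred_labels = [max(range(2), key=lambda i: row[i]) for row in pred_probs]
print(pred_labels)  # [0, 1, 0]
```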

Result

 

Second submission

model2 = Sequential()
model2.add(Embedding(max_words, 128, input_length=24))
model2.add(Dropout(0.2))
model2.add(Conv1D(256,
                 3,
                 padding='valid',
                 activation='relu',
                 strides=1))
model2.add(GlobalMaxPooling1D())
model2.add(Dense(128, activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(2, activation='sigmoid'))
model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) 
history2 = model2.fit(X_train_vec, y_train, epochs=1, batch_size=32, validation_split=0.1)

Result

 

Third submission

model2 = Sequential()
model2.add(Embedding(max_words, 128, input_length=24))
model2.add(Dropout(0.2))
model2.add(Conv1D(256,
                 3,
                 padding='valid',
                 activation='relu',
                 strides=1))
model2.add(GlobalMaxPooling1D())
model2.add(Dense(64, activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(2, activation='sigmoid'))
model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) 
history2 = model2.fit(X_train_vec, y_train, epochs=1, batch_size=32, validation_split=0.1)

Result

 

Fourth submission

model2 = Sequential()
model2.add(Embedding(max_words, 128, input_length=24))
model2.add(Dropout(0.2))
model2.add(Conv1D(256,
                 3,
                 padding='valid',
                 activation='relu',
                 strides=1))
model2.add(GlobalMaxPooling1D())
model2.add(Dense(32, activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(2, activation='sigmoid'))
model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) 
history2 = model2.fit(X_train_vec, y_train, epochs=1, batch_size=32, validation_split=0.1)

Result

 

Fifth submission

model2 = Sequential()
model2.add(Embedding(max_words, 128, input_length=24))
model2.add(Dropout(0.2))
model2.add(Conv1D(256,
                 3,
                 padding='valid',
                 activation='relu',
                 strides=1))
model2.add(GlobalMaxPooling1D())
model2.add(Dense(32, activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(2, activation='sigmoid'))
model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) 
history2 = model2.fit(X_train_vec, y_train, epochs=1, batch_size=16, validation_split=0.1)

Result
