[Kaggle DAY12]Real or Not? NLP with Disaster Tweets!

솜씨좋은장씨 2020. 3. 9. 06:15

Day 12 of my Kaggle challenge!

Today I fixed the mistakes I had made in data preprocessing on Day 11 and tried again.

import re

# Assumes `train` is the competition's train.csv, loaded as a pandas DataFrame earlier in the notebook.
alphabets = list('abcdefghijklmnopqrstuvwxyz')
god_list = ['buddha', 'allah', 'jesus']
train_text_list = list(train['text'])

# Join the tweets with a space so tokens at tweet boundaries stay separate, then lowercase once.
text_list_corpus = ' '.join(train_text_list).lower()

# Replace URLs, newlines, tabs, special characters, and digits with a single space.
pattern = r'(http|ftp|https)://(?:[-\w.]|(?:%[\da-fA-F]{2}))+/(?:[-\w.]|(?:%[\da-fA-F]{2}))+'
clear_text = re.sub(pattern, ' ', text_list_corpus)
clear_text = clear_text.replace('\n', ' ').replace('\t', ' ')
clear_text = re.sub(r'[-=+,#/\?:^$.@*\"※~&%ㆍ!』\\‘|\(\)\[\]\<\>`\'…》;]', ' ', clear_text)
clear_text = re.sub(r'[0-9]', ' ', clear_text)

# Collapse any letter repeated three or more times into a single letter.
for letter in alphabets:
  clear_text = re.sub(letter + '{3,}', letter, clear_text)
# Map the listed names to the single token 'god'.
for god in god_list:
  clear_text = clear_text.replace(god, 'god')
clear_text

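When using regular expressions to strip http / https / ftp URLs, special characters, digits, newlines, and so on, I replaced each match with a single space instead of simply deleting it.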
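As a small illustration of why the replacement string matters (a toy sketch, not taken from the notebook; the sample string and the character class are made up for the example), compare deleting matches with replacing them by a space:

```python
import re

# Made-up sample text: words separated only by punctuation or a newline.
text = "fire!!!near#downtown\nevacuate:now"
specials = r"[!#:\n]"

# Deleting the matches fuses neighbouring words into one token.
removed_tokens = re.sub(specials, "", text).split()

# Replacing the matches with a space keeps the word boundaries.
spaced_tokens = re.sub(specials, " ", text).split()

print(removed_tokens)
print(spaced_tokens)
```

With deletion, every word in the sample ends up in one fused token; with a space, each word survives as its own token.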

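# word_tokenize, stop_words, and stemmer are defined earlier in the notebook (not shown in this post).
word_list = word_tokenize(clear_text)
word_list = [word for word in word_list if len(word) > 2]  # drop tokens of one or two characters
word_list = [word for word in word_list if word not in stop_words]  # remove stop words before stemming
word_list = [stemmer.stem(word) for word in word_list]  # stem last
len(set(word_list))  # number of distinct stems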

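Stemming now runs last so that stop-word removal works properly. Stop-word lists contain unstemmed word forms, so stemming first can alter a stop word and let it slip past the filter.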
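The effect of the ordering can be seen in a toy sketch. Everything here is hypothetical: the tiny stop-word set and the crude `toy_stem` function only stand in for whatever stop-word list and stemmer the notebook actually uses.

```python
# Hypothetical stand-ins, for illustration only.
STOP_WORDS = {"this", "was", "the", "a"}

def toy_stem(word):
    """Strip one trailing 's' from words longer than two characters."""
    return word[:-1] if len(word) > 2 and word.endswith("s") else word

tokens = ["this", "was", "the", "fires", "burning", "homes"]

# Stem first, then filter: "this" -> "thi" and "was" -> "wa" no longer match the list.
stem_then_filter = [w for w in (toy_stem(t) for t in tokens) if w not in STOP_WORDS]

# Filter first, then stem: stop words are removed while still in their original form.
filter_then_stem = [toy_stem(t) for t in tokens if t not in STOP_WORDS]

print(stem_then_filter)
print(filter_then_stem)
```

In the first ordering the mangled stop words "thi" and "wa" stay in the vocabulary; in the second only the content words remain.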

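Redrawing the word cloud from this token list gives the following.

Word cloud of the full training data

Splitting the data into real and fake disaster tweets and drawing a word cloud for each gave the following.

Word cloud of real disaster tweets

Word cloud of fake disaster tweets

Replacing special characters and newlines with a single space paid off. In both the word clouds and the Counter dictionary output, the cases where three words had been fused together and could not be tokenized properly are now resolved.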
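A check of this kind with `collections.Counter` could look like the sketch below. The `word_list` here is a made-up sample rather than the notebook's data, and listing the longest tokens is just one assumed way to spot fused words.

```python
from collections import Counter

# Made-up tokens; in the notebook this would be the stemmed word_list.
word_list = ["fire", "flood", "fire", "storm", "flood", "fire", "firefloodstorm"]

counts = Counter(word_list)

# Most frequent tokens, as (token, count) pairs.
top_tokens = counts.most_common(2)

# Unusually long tokens are candidates for words that were fused during cleaning.
longest_tokens = sorted(counts, key=len, reverse=True)[:1]

print(top_tokens)
print(longest_tokens)
```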

I used the same model as on Day 11.

First submission

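# tf (TensorFlow 1.x) and K (the Keras backend) are imported earlier in the notebook (not shown in this post).
# Initialize only the variables that are still uninitialized, so variables that already hold values are left as they are.
sess = K.get_session()
uninitialized_variables = set([i.decode('ascii') for i in sess.run(tf.report_uninitialized_variables())])
init = tf.variables_initializer([v for v in tf.global_variables() if v.name.split(':')[0] in uninitialized_variables])
sess.run(init)

# get_bert_finetuning_model and model are the same as on Day 11.
bert_model = get_bert_finetuning_model(model)
history = bert_model.fit(train_x, train_y_new, epochs=2, batch_size=16, verbose=1, validation_split=0.05, shuffle=True)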

Result

Second submission

bert_model2 = get_bert_finetuning_model(model)
history2 = bert_model2.fit(train_x, train_y_new, epochs=2, batch_size=32, verbose = 1, validation_split=0.05, shuffle=True)

Result

Third submission

bert_model3 = get_bert_finetuning_model(model)
history3 = bert_model3.fit(train_x, train_y_new, epochs=3, batch_size=32, verbose=1, validation_split=0.05, shuffle=True)

Result

Fourth submission

bert_model4 = get_bert_finetuning_model(model)
history4 = bert_model4.fit(train_x, train_y_new, epochs=5, batch_size=32, verbose=1, validation_split=0.05, shuffle=True)

Result

Fifth submission

bert_model5 = get_bert_finetuning_model(model)
history5 = bert_model5.fit(train_x, train_y_new, epochs=5, batch_size=16, verbose=1, validation_split=0.05, shuffle=True)

Result
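The five fit() calls above differ only in epochs and batch_size. As a rough way to compare them, the sketch below estimates how many gradient updates each run performs, assuming Keras holds out the validation fraction first and then runs ceil(n_train / batch_size) steps per epoch. The `total_updates` helper and the `n_samples` value are hypothetical placeholders, not part of the notebook.

```python
import math

# (epochs, batch_size) for the five submissions, taken from the fit() calls.
runs = [(2, 16), (2, 32), (3, 32), (5, 32), (5, 16)]

def total_updates(n_samples, epochs, batch_size, validation_split=0.05):
    """Approximate number of gradient updates for one fit() call."""
    # Samples left for training after the validation fraction is held out.
    n_train = int(n_samples * (1.0 - validation_split))
    return epochs * math.ceil(n_train / batch_size)

# Placeholder size; substitute the actual number of training tweets.
n_samples = 1000
for i, (epochs, batch_size) in enumerate(runs, start=1):
    updates = total_updates(n_samples, epochs, batch_size)
    print(f"submission {i}: epochs={epochs}, batch_size={batch_size}, updates={updates}")
```

Halving the batch size roughly doubles the number of updates per epoch, so the first and fifth runs take about twice as many steps as runs with the same epoch count at batch size 32.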
