๊ด€๋ฆฌ ๋ฉ”๋‰ด

์†œ์”จ์ข‹์€์žฅ์”จ

[Python] Selenium๊ณผ BeautifulSoup์„ ํ™œ์šฉํ•˜์—ฌ ๋„ค์ด๋ฒ„ ๋‰ด์Šค ๊ธฐ์‚ฌ ํฌ๋กค๋งํ•˜๋Š” ๋ฐฉ๋ฒ•! ๋ณธ๋ฌธ

Programming/Python

[Python] How to crawl Naver news articles with Selenium and BeautifulSoup!

์†œ์”จ์ข‹์€์žฅ์”จ 2022. 1. 15. 19:08

๐Ÿ‘จ๐Ÿป‍๐Ÿ’ป ๋„ค์ด๋ฒ„ ๋‰ด์Šค ๊ธฐ์‚ฌ ์ˆ˜์ง‘์„ ๋ถ€ํƒํ•ด!

๋งˆ์ผ€ํŒ… / ํ™๋ณด ๋Œ€ํ–‰ ํšŒ์‚ฌ์—์„œ ์ธํ„ด์„ ํ•˜๋Š” ์นœ๊ตฌ๊ฐ€ ์—…๋ฌด๋ฅผ ๋ฐ›์•˜๋Š”๋ฐ ํŠน์ • ๊ธฐ์—…์— ๋Œ€ํ•œ O์›” O์ผ ~ O์›” O์ผ ๊นŒ์ง€์˜

๋„ค์ด๋ฒ„ ๋‰ด์Šค ๊ธฐ์‚ฌ๋ฅผ ์ˆ˜์ง‘ํ•˜๊ณ  ๊ฐ๊ฐ์˜ ๊ธฐ์‚ฌ๊ฐ€ ๊ธฐํš ๊ธฐ์‚ฌ์ธ์ง€, ๋ถ€์ • ๊ธฐ์‚ฌ์ธ์ง€ ๋ถ„๋ฅ˜๋ฅผ ํ•ด์•ผํ•˜๋Š”๋ฐ 

์ˆ˜์ง‘ํ•ด์•ผ ํ•  ๋‰ด์Šค๊ธฐ์‚ฌ๊ฐ€ ๋„ˆ๋ฌด ๋งŽ๋‹ค๋ฉฐ ํ˜น์‹œ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์œผ๋กœ ์ˆ˜์ง‘ ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์ด ์žˆ๋Š”์ง€! ๋ฌผ์–ด๋ณด์•˜์Šต๋‹ˆ๋‹ค.

๐Ÿคฉ ๊ธฐ์‚ฌ ์ˆ˜์ง‘์ด๋ผ๋ฉด ๋‹น๊ทผ!

ํฌ๋กค๋ง์ด๋ผ๋ฉด ๋˜ ์ œ ์ „๋ฌธ ๋ถ„์•ผ ์ด๊ธฐ์— ์‹œ๊ฐ„์ด ๋  ๋•Œ ๋„์™€ ์ฃผ๊ธฐ๋กœ ํ•˜์˜€๊ณ 

๊ฐ„๋‹จํ•˜๊ฒŒ ๊ธฐ์‚ฌ ์ œ๋ชฉ, ๊ธฐ์‚ฌ์˜ url, ์–ธ๋ก ์‚ฌ, ๊ธฐ์‚ฌ๊ฐ€ ์˜ฌ๋ผ์˜จ ๋‚ ์งœ ์ด๋ ‡๊ฒŒ 4๊ฐ€์ง€๋ฅผ ํฌ๋กค๋งํ•˜๋Š” ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•˜์—ฌ 

12์›” 1๋‹ฌ ๊ฐ„์˜ ๊ธฐ์‚ฌ๋ฅผ ํฌ๋กค๋งํ•ด์„œ ์ „๋‹ฌํ•ด ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค.

 

์ด๋ฒˆ ๊ธ€์—์„œ๋Š” ๊ทธ๋•Œ ์ž‘์„ฑํ–ˆ๋˜ ์ฝ”๋“œ์—์„œ ์กฐ๊ธˆ ๊ฐœ์„ ํ•˜์—ฌ ๊ณต์œ ํ•ด๋ณด๋ ค ํ•ฉ๋‹ˆ๋‹ค.

์š”๊ตฌ์‚ฌํ•ญ

ํŠน์ • ํšŒ์‚ฌ๋ฅผ ๋„ค์ด๋ฒ„ ๋‰ด์Šค์— ๊ฒ€์ƒ‰ํ–ˆ์„๋•Œ ๋‚˜์˜ค๋Š” O์›” O์ผ ~ O์›” O์ผ ์‚ฌ์ด์˜ ๋ชจ๋“  ๊ธฐ์‚ฌ๋ฅผ ์ˆ˜์ง‘ํ•ด๋‹ฌ๋ผ

์ˆ˜์ง‘๋‚ด์šฉ์€ ๊ธฐ์‚ฌ ์ œ๋ชฉ, ์–ธ๋ก ์‚ฌ, ๊ธฐ์‚ฌ ๋‚ ์งœ, ๊ธฐ์‚ฌ ์ œ๋ชฉ 

๐Ÿค” ์‚ฌ๋žŒ์ด ์ด๊ฑธ ์ง์ ‘ ํ•œ๋‹ค๋ฉด?

๋งŒ์•ฝ ์‚ฌ๋žŒ์ด ์ง์ ‘ ํ† ์Šค๋ผ๋Š” ๊ธฐ์—…์˜ 2022๋…„ 1์›” 1์ผ ~ 1์›” 4์ผ ์‚ฌ์ด์˜ ๋ชจ๋“  ๊ธฐ์‚ฌ๋ผ๊ณ  ํ•œ๋‹ค๋ฉด

๋„ค์ด๋ฒ„์—์„œ ํ† ์Šค๋ฅผ ๊ฒ€์ƒ‰ํ•˜๊ณ  ๋‰ด์Šค ํƒญ์œผ๋กœ ์ด๋™ํ•œ ๋‹ค์Œ

๊ฒ€์ƒ‰ ์˜ต์…˜์„ ํŽผ์ณ ๊ธฐ๊ฐ„์„ 2022๋…„ 1์›” 1์ผ ~ 1์›” 4์ผ๋กœ ์„ค์ •ํ•˜์—ฌ ๊ฒ€์ƒ‰์„ ํ•œ ๋’ค์— 

์ฒซ ๊ธฐ์‚ฌ๋ถ€ํ„ฐ ํ•˜๋‚˜์”ฉ ์ œ๋ชฉ ๋ณต์‚ฌํ•˜๊ณ , ์–ธ๋ก ์‚ฌ ๋ณด๊ณ  ์ ๊ณ  ๊ธฐ์‚ฌ url ๋ณต์‚ฌํ•ด์„œ ์ˆ˜์ง‘ํ• ๊ฒ๋‹ˆ๋‹ค.

๐Ÿค” ๊ทธ๋Ÿผ ๊ฐœ๋ฐœ์€ ์–ด๋–ค ๋ฐฉ์‹์œผ๋กœ ํ•˜์ง€?

ํฌ๋กค๋ง์„ ํ†ตํ•œ ๊ธฐ์‚ฌ์ˆ˜์ง‘๋„ ์‚ฌ๋žŒ์ด ํ•˜๋Š” ๊ฒƒ๊ณผ ๋™์ผํ•˜๊ฒŒ 2022๋…„ 1์›” 1์ผ ~ 1์›” 4์ผ๋กœ ๋‚ ์งœ๋ฅผ ์„ค์ •ํ•˜๊ณ 

์ฒซ ๊ธฐ์‚ฌ๋ถ€ํ„ฐ ํ•˜๋‚˜์”ฉ ์ œ๋ชฉ, ์–ธ๋ก ์‚ฌ, ๊ธฐ์‚ฌ url์„ ์ˆ˜์ง‘ ํ•˜๋ฉด๋ฉ๋‹ˆ๋‹ค.

 

์—ฌ๊ธฐ์„œ ํ•˜๋‚˜ ์ƒ๊ฐ์„ ํ•œ ๊ฒƒ์ด 1์›” 1์ผ ~ 1์›” 4์ผ์„ ํ•œ๋ฒˆ์— ์„ค์ •ํ•˜์—ฌ ๊ธฐ์‚ฌ๋ฅผ ํฌ๋กค๋งํ•˜๋Š”๋ฐ

1์›” 1์ผ ๋ถ€ํ„ฐ 3์ผ๊นŒ์ง€ ์ž˜ ์ˆ˜์ง‘ํ•ด์˜ค๋‹ค๊ฐ€ 4์ผ ์ค‘๋ฐ˜์— ๊ฐ‘์ž๊ธฐ ์ธํ„ฐ๋„ท์ด ๋Š๊ธด๋‹ค๊ฑฐ๋‚˜ ํ•˜๋Š” ์ด์œ ๋กœ ํฌ๋กค๋ง์ด ๋ฉˆ์ถ”๊ฒŒ ๋˜๋ฉด

๊ทธ์‚ฌ์ด ์ˆ˜์ง‘ํ•œ 1์›” 1์ผ ~ 1์›” 3์ผ ์‚ฌ์ด์˜ ๋ฐ์ดํ„ฐ๋Š” ๋ชจ๋‘ ๋‚ ์•„๊ฐ€๊ฒŒ ๋˜๋ฏ€๋กœ

๋งŒ์•ฝ ๊ธฐ์‚ฌ๋ฅผ ์ˆ˜์ง‘ํ•˜๊ณ  ์ €์žฅํ•ด์•ผํ•˜๋Š” ๋‚ ์งœ๊ฐ€ 1์›” 1์ผ ~ 1์›” 31์ผ ์ด๋ผ๋ฉด

1์›” 1์ผ ~ 1์›” 1์ผ ( 1์›” 1์ผ ํ•˜๋ฃจ ๊ธฐ์‚ฌ ) ํฌ๋กค๋ง -> ์—‘์…€ ํŒŒ์ผ๋กœ ์ €์žฅ

1์›” 2์ผ ~ 1์›” 2์ผ ( 1์›” 2์ผ ํ•˜๋ฃจ ๊ธฐ์‚ฌ ) ํฌ๋กค๋ง -> ์—‘์…€ ํŒŒ์ผ๋กœ ์ €์žฅ

... 

1์›” 31์ผ ~ 1์›” 31์ผ ( 1์›” 31์ผ ํ•˜๋ฃจ ๊ธฐ์‚ฌ ) ํฌ๋กค๋ง -> ์—‘์…€ ํŒŒ์ผ๋กœ ์ €์žฅ

=> 1์›” 1์ผ ~ 1์›” 31์ผ ํฌ๋กค๋ง ๋ฐ์ดํ„ฐ ํ•ฉ๋ณ‘

์œ„์™€ ๊ฐ™์ด ํ•˜๋ฃจ ๋‹จ์œ„๋กœ ์ˆ˜์ง‘ -> ์ €์žฅ ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ๊ฐœ๋ฐœ ํ•ด์•ผ๊ฒ ๋‹ค๊ณ  ์ƒ๊ฐํ–ˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ‘ ๊ฐœ๋ฐœ ํ™˜๊ฒฝ ์„ค์ •

๐Ÿ‘จ๐Ÿป‍๐Ÿ’ป ํ•„์ž ๊ฐœ๋ฐœ ํ™˜๊ฒฝ

- ๋งฅ๋ถ ํ”„๋กœ 2017 13์ธ์น˜ or ํŽœํ‹ฐ์—„ ๋ฐ์Šคํฌํƒ‘

- ์–ธ์–ด : Python 3.7.3 / ์‚ฌ์šฉ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ : Selenium / BeautifulSoup / Pandas

- ์ฝ”๋“œ ์ž‘์„ฑ : Jupyter Notebook

- ๋ธŒ๋ผ์šฐ์ € : Chrome ( ํฌ๋กฌ )

๐Ÿ‘จ๐Ÿป‍๐Ÿ’ป ๊ฐœ๋ฐœ ํ™˜๊ฒฝ ์„ค์ • - Python ๊ณผ ํ•„์š” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์„ค์น˜

๋ณธ๊ฒฉ์ ์ธ ๊ฐœ๋ฐœ์„ ์œ„ํ•ด์„œ๋Š” Python๊ณผ ๊ฐ์ข… ํ•„์š” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋“ฑ์„ ์„ค์น˜ํ•ด์ฃผ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์•„๋ž˜์˜ ๊ธ€์„ ์ฐธ๊ณ ํ•˜์…”๋„ ์ข‹๊ณ  ๋‹ค๋ฅธ ๊ฐœ๋ฐœ ๊ธ€์„ ์ฐธ๊ณ ํ•˜์…”๋„ ์ข‹์Šต๋‹ˆ๋‹ค.

์ €๋Š” ์ฝ”๋“œ ์ž‘์„ฑ์„ Jupyter notebook ์—์„œ ์ง„ํ–‰ํ•˜์˜€๋Š”๋ฐ ๋‹ค๋ฅธ

1. Python ์„ค์น˜  - ( Windows ์˜ ๊ฒฝ์šฐ ํ™˜๊ฒฝ ๋ณ€์ˆ˜ ์„ค์ • )

2019.09.07 - [Programming/Python] - [Python] Installing Python 3.7 on Ubuntu!

 

[Python]Ubuntu์— Python 3.7 ์„ค์น˜ํ•˜๊ธฐ!

1. Python ์„ค์น˜ ์ „ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์„ค์น˜ํ•˜๊ธฐ Ubuntu(๋˜๋Š” Putty)์—์„œ ํ„ฐ๋ฏธ๋„์„ ์—ด์–ด ์•„๋ž˜์˜ ์ฝ”๋“œ๋ฅผ ์ž…๋ ฅํ•ฉ๋‹ˆ๋‹ค. ์„ค์น˜ ์ค‘๊ฐ„ ์ค‘๊ฐ„์— [ y | n ] ์ค‘์— ๊ณ ๋ฅด๋ผ๊ณ  ๋‚˜์˜ค๋ฉด y๋ฅผ ํƒ€์ดํ•‘ํ•˜๊ณ  ์—”ํ„ฐ๋ฅผ ํ•ด์ฃผ์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค! $ s

somjang.tistory.com

2. Selenium๊ณผ BeautifulSoup ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์„ค์น˜ 

$ pip install selenium
$ pip install bs4
$ pip install lxml

3. Selenium ํฌ๋กฌ ๋“œ๋ผ์ด๋ฒ„ ๋‹ค์šด๋กœ๋“œ

2019.09.14 - [Useful Info/Windows] - [Windows] Installing Selenium on Windows 10 (updated 20.2.13)

 

[Windows]Windows10์— Selenium์„ค์น˜ํ•˜๊ธฐ(20.2.13 ์—…๋ฐ์ดํŠธ)

1. ๊ตฌ๊ธ€ ํฌ๋กฌ ์ตœ์‹ ์œผ๋กœ ์—…๋ฐ์ดํŠธํ•˜๊ธฐ ๋จผ์ € ํฌ๋กฌ์˜ ๋งจ ์šฐ์ธก ์ƒ๋‹จ์˜ ์„ธ ๊ฐœ์˜ ์ ์„ ํด๋ฆญํ•˜์—ฌ ํฌ๋กฌ์˜ ์„ค์ •ํŽ˜์ด์ง€๋กœ ๋“ค์–ด๊ฐ‘๋‹ˆ๋‹ค. ์™ผ์ชฝ ๋ฉ”๋‰ด์—์„œ Chrome ์ •๋ณด๋ฅผ ํด๋ฆญํ•˜์—ฌ ์—…๋ฐ์ดํŠธ๋ฅผ ์‹ค์‹œํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์‹œ์‹œ

somjang.tistory.com

๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ๋ชจ๋‘ ์„ค์น˜ํ•˜์˜€๋‹ค๋ฉด Selenium ํฌ๋กฌ ๋“œ๋ผ์ด๋ฒ„๋ฅผ ๋‹ค์šด๋กœ๋“œ ๋ฐ›์Šต๋‹ˆ๋‹ค.

๋‹ค์šด๋กœ๋“œ ๋ฐ›์€ ํ›„ ํ•ด๋‹น ํŒŒ์ผ์˜ ๊ฒฝ๋กœ๋ฅผ ์ž˜ ํ™•์ธํ•ด๋‘ก๋‹ˆ๋‹ค.

4. ๊ทธ ์™ธ ํ•„์š” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์„ค์น˜

Pandas : the library used to turn the crawl results into Excel files

$ pip install pandas

tqdm : the library used to display progress

$ pip install tqdm

๐Ÿ“ป๋„ค์ด๋ฒ„ ๋‰ด์Šค ํŽ˜์ด์ง€ ๋ถ„์„ํ•˜๊ธฐ

๐Ÿ‘จ๐Ÿป‍๐Ÿ’ป URL ๋ถ„์„

 

๋จผ์ € ํฌ๋กค๋ง์„ ํฌ๋งํ•˜๋Š” ํŽ˜์ด์ง€์˜ URL์„ ๋ถ„์„ํ–ˆ์Šต๋‹ˆ๋‹ค.

https://search.naver.com/search.naver?where=news&query=%ED%86%A0%EC%8A%A4&sm=tab_opt&sort=0&photo=0&field=0&pd=3&ds=2022.01.13&de=2022.01.13&docid=&related=0&mynews=0&office_type=0&office_section_code=0&news_office_checked=&nso=so%3Ar%2Cp%3Afrom20220113to20220113&is_sug_officeid=0

์œ„์™€ ๊ฐ™์ด ํ† ์Šค ํ‚ค์›Œ๋“œ์— ๋Œ€ํ•ด์„œ ๊ด€๋ จ๋„์ˆœ์œผ๋กœ ์ •๋ ฌํ•œ ํŽ˜์ด์ง€์˜ url์„ ํ•˜๋‚˜ํ•˜๋‚˜ ๋œฏ์–ด๋ณด๋ฉด

https://search.naver.com/search.naver?

where=news

&query=%ED%86%A0%EC%8A%A4 # search keyword: 토스

&sm=tab_opt

&sort=0 # 0: sort by relevance / 1: newest first / 2: oldest first

&photo=0

&field=0

&pd=3

&ds=2022.01.13 # search period start date

&de=2022.01.13 # search period end date

&docid=&related=0

&mynews=0

&office_type=0

&office_section_code=0

&news_office_checked=&nso=so%3Ar%2Cp%3Afrom20220113to20220113

&is_sug_officeid=0

query ์— ๋‚ด๊ฐ€ ๊ฒ€์ƒ‰์„ ํฌ๋งํ•˜๋Š” ๊ฒ€์ƒ‰์–ด - ( ์ธ์ฝ”๋”ฉ ๋œ ๊ฐ’์ด ํ•„์š”ํ•จ )

sort์— ๋‚ด๊ฐ€ ํฌ๋งํ•˜๋Š” ์ •๋ ฌ ๋ฐฉ์‹ - ( ๊ด€๋ จ๋„์ˆœ 0 / ์ตœ์‹ ์ˆœ 1 / ์˜ค๋žœ๋œ์ˆœ 2 )

ds๋Š” ๊ฒ€์ƒ‰ ํฌ๋ง ๊ธฐ๊ฐ„ ์‹œ์ž‘์ผ 

de๋Š” ๊ฒ€์ƒ‰ ํฌ๋ง ๊ธฐ๊ฐ„ ์ข…๋ฃŒ์ผ

์ด URL ์†์— ํฌํ•จ๋˜์–ด์žˆ๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ‘จ๐Ÿป‍๐Ÿ’ป ๋„ค์ด๋ฒ„ ๋‰ด์Šค ํŽ˜์ด์ง€ ๊ตฌ์„ฑ ์š”์†Œ ํŒŒ์•…ํ•˜๊ธฐ - ํฌ๋กฌ ๊ฐœ๋ฐœ์ž๋„๊ตฌ ํ™œ์šฉ

URL ๋งŒ ์•Œ๊ณ  ์žˆ์–ด์„œ๋Š” ์›ํ•˜๋Š” ๊ฐ’๋“ค๋งŒ ํฌ๋กค๋ง์œผ๋กœ ์ถ”์ถœํ•˜๊ธฐ ์–ด๋ ต์Šต๋‹ˆ๋‹ค.

์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•˜๋Š” ๊ฒƒ ๋ณด๋‹ค

๋‚ด๊ฐ€ ํ•„์š”๋กœํ•˜๋Š” ๋‚ด์šฉ์ด ๋“ค์–ด์žˆ๋Š” ํ•ญ๋ชฉ๋“ค์ด ์–ด๋–ค ๊ฐ’๋“ค๋กœ ์ด๋ฃจ์–ด์ ธ์žˆ๋Š”์ง€ ํŒŒ์•…ํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

์ด๋Š” ํฌ๋กฌ์˜ ๊ฐœ๋ฐœ์ž๋„๊ตฌ๋ฅผ ํ™œ์šฉํ•˜๋ฉด ์‰ฝ๊ฒŒ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋จผ์ € ์›ํ•˜๋Š” ํŽ˜์ด์ง€์—์„œ F12๋ฅผ ๋ˆ„๋ฅด๋ฉด ํฌ๋กฌ์˜ ๊ฐœ๋ฐœ์ž ๋„๊ตฌ๊ฐ€ ์—ด๋ฆฝ๋‹ˆ๋‹ค.

๋งŒ์•ฝ F12๋กœ ์—ด๋ฆฌ์ง€ ์•Š๋Š” ๋‹ค๋ฉด ์•„๋ž˜์˜ ๋ฐฉ๋ฒ•์œผ๋กœ๋„ ๊ฐœ๋ฐœ์ž ๋„๊ตฌ๋ฅผ ์—ด ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํฌ๋กฌ ๊ฐœ๋ฐœ์ž ๋„๊ตฌ์—์„œ ํ™”์‚ดํ‘œ ๋ชจ์–‘์˜ ๋ฒ„ํŠผ์„ ํด๋ฆญํ•˜์—ฌ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

ํ™”์‚ดํ‘œ ๋„๊ตฌ๋ฅผ ๋ˆŒ๋Ÿฌ ๊ธฐ๋Šฅ์„ ํ™œ์„ฑํ™”ํ•˜๋ฉด ์™ผ์ชฝ์˜ ๋‚ด๊ฐ€ ์›ํ•˜๋Š” ๋ถ€๋ถ„์— ๋งˆ์šฐ์Šค๋ฅผ ๊ฐ€์ ธ๋‹ค ๋Œ”์„ ๋•Œ 

ํ•ด๋‹น ํ•ญ๋ชฉ์ด ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ํด๋ž˜์Šค๋ช…์ด๋‚˜ id ๊ฐ’ ๋“ฑ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

์ €๋Š” ๋‚ด์šฉ ์ถ”์ถœ์„ ์œ„ํ•œ ์–ธ๋ก ์‚ฌ, ๊ธฐ์‚ฌ ์ œ๋ชฉ, ๊ธฐ์‚ฌ URL ์„ ๊ฐ€์ ธ์˜ค๋Š” ๋ถ€๋ถ„๊ณผ

๋‹ค์Œ ํŽ˜์ด์ง€๋กœ ์ด๋™ํ•˜๊ธฐ ์œ„ํ•œ ํ™”์‚ดํ‘œ ๋ถ€๋ถ„์„ ํ™•์ธํ•˜์˜€์Šต๋‹ˆ๋‹ค.

 

๐Ÿ‘จ๐Ÿป‍๐Ÿ’ป ์œ„์˜ ๋ฐฉ๋ฒ•์œผ๋กœ ํ™•์ธํ•œ ๊ฐ’

๊ธฐ์‚ฌ ์ •๋ณด ์˜์—ญ ๋ถ€๋ถ„ - div.news_area
                 ์ œ๋ชฉ ๋ถ€๋ถ„ - title
                 ๋งํฌ ๋ถ€๋ถ„ - href

์–ธ๋ก ์‚ฌ ๋ถ€๋ถ„ - div.info_group > a.info.press
     ๋˜๋Š” - div.info_group > span.info_press
     
     
๋‹ค์Œ ํŽ˜์ด์ง€ ์ด๋™ ๋ฒ„ํŠผ - a.btn_next
                 - area-disabled ๊ฐ€ true ์ธ ๊ฒฝ์šฐ ๋”์ด์ƒ ํด๋ฆญ ๋ถˆ๊ฐ€
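Those selectors can be exercised on a toy HTML fragment before touching the real page. The markup below is an invented, simplified stand-in for one search-result entry (the press name and article title are made up):

```python
from bs4 import BeautifulSoup

# Invented, simplified stand-in for one entry on the search results page.
sample_html = """
<div class="news_area">
  <div class="info_group">
    <a class="info press">솜씨일보언론사 선정</a>
  </div>
  <a class="news_tit" href="https://example.com/article/1" title="토스, 새 서비스 출시">토스, 새 서비스 출시</a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
article = soup.select_one("div.news_area")

# The anchor text sometimes carries a trailing "언론사 선정" label, stripped here.
press = article.select_one("div.info_group > a.info.press").text.replace("언론사 선정", "")
title = article.select_one("a.news_tit").get("title")
link = article.select_one("a.news_tit").get("href")
```

Once the selectors behave on the toy fragment, the same calls can run against the real page source fetched by Selenium.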

๐Ÿ˜ŽSelenium๊ณผ BeautifulSoup๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ฝ”๋“œ ์ž‘์„ฑํ•˜๊ธฐ

๊ฐœ๋ฐœํ™˜๊ฒฝ๋„ ๋ชจ๋‘ ์„ค์ •ํ–ˆ๊ณ  

ํฌ๋กค๋ง์„ ํ•˜๋ ค๋Š” ํŽ˜์ด์ง€์˜ ๊ตฌ์„ฑ์š”์†Œ ๋ถ„์„๋„ ๋๋‚ฌ๋‹ค๋ฉด ์ด์ œ๋Š” ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•˜๋Š” ๊ฒƒ๋งŒ ๋‚จ์•˜์Šต๋‹ˆ๋‹ค.

from selenium import webdriver as wd
from bs4 import BeautifulSoup
import pandas as pd
import time
import urllib

def get_article_info(driver, crawl_date, press_list, title_list, link_list, date_list, more_news_base_url=None, more_news=False):
    more_news_url_list = []
    while True:    
        page_html_source = driver.page_source
        url_soup = BeautifulSoup(page_html_source, 'lxml')
        
        more_news_infos = url_soup.select('a.news_more')
        
        if more_news:
            for more_news_info in more_news_infos:
                more_news_url = f"{more_news_base_url}{more_news_info.get('href')}"

                more_news_url_list.append(more_news_url)

        article_infos = url_soup.select("div.news_area")
        
        if not article_infos:
            break

        for article_info in article_infos:  
            press_info = article_info.select_one("div.info_group > a.info.press")
            
            if press_info is None:
                press_info = article_info.select_one("div.info_group > span.info.press")
            article = article_info.select_one("a.news_tit")
            
            press = press_info.text.replace("์–ธ๋ก ์‚ฌ ์„ ์ •", "")
            title = article.get('title')
            link = article.get('href')

#             print(f"press - {press} / title - {title} / link - {link}")
            press_list.append(press)
            title_list.append(title)
            link_list.append(link)
            date_list.append(crawl_date)

        time.sleep(2.0)

        next_button_status = url_soup.select_one("a.btn_next").get("aria-disabled")
        
        if next_button_status == 'true':
            break
        
        time.sleep(1.0)
        driver.find_element_by_css_selector("a.btn_next").click()
    
    return press_list, title_list, link_list, more_news_url_list
    
    

def get_naver_news_info_from_selenium(keyword, save_path, target_date, ds_de, sort=0, remove_duplicate=False):
    crawl_date = f"{target_date[:4]}.{target_date[4:6]}.{target_date[6:]}"
    driver = wd.Chrome("./chromedriver") # path to the chromedriver file

    encoded_keyword = urllib.parse.quote(keyword)
    url = f"https://search.naver.com/search.naver?where=news&query={encoded_keyword}&sm=tab_opt&sort={sort}&photo=0&field=0&pd=3&ds={ds_de}&de={ds_de}&docid=&related=0&mynews=0&office_type=0&office_section_code=0&news_office_checked=&nso=so%3Ar%2Cp%3Afrom{target_date}to{target_date}&is_sug_officeid=0"
    
    more_news_base_url = "https://search.naver.com/search.naver"

    driver.get(url)
    
    press_list, title_list, link_list, date_list, more_news_url_list = [], [], [], [], []
    
    press_list, title_list, link_list, more_news_url_list = get_article_info(driver=driver, 
                                                                             crawl_date=crawl_date, 
                                                                             press_list=press_list, 
                                                                             title_list=title_list, 
                                                                             link_list=link_list,
                                                                             date_list=date_list,
                                                                             more_news_base_url=more_news_base_url,
                                                                             more_news=True)
    driver.close()
    
    if len(more_news_url_list) > 0:
        print(len(more_news_url_list))
        more_news_url_list = list(set(more_news_url_list))
        print(f"->{len(more_news_url_list)}")
        for more_news_url in more_news_url_list:
            driver = wd.Chrome("./chromedriver")
            driver.get(more_news_url)
            
            # discard the returned more-news list so the list being iterated is not clobbered
            press_list, title_list, link_list, _ = get_article_info(driver=driver, 
                                                                             crawl_date=crawl_date, 
                                                                             press_list=press_list, 
                                                                             title_list=title_list, 
                                                                             link_list=link_list,
                                                                             date_list=date_list)
            driver.close()
    article_df = pd.DataFrame({"๋‚ ์งœ": date_list, "์–ธ๋ก ์‚ฌ": press_list, "์ œ๋ชฉ": title_list, "๋งํฌ": link_list})
    
    print(f"extract article num : {len(article_df)}")
    if remove_duplicate:
        article_df = article_df.drop_duplicates(['๋งํฌ'], keep='first')
        print(f"after remove duplicate -> {len(article_df)}")
    
    article_df.to_excel(save_path, index=False)

๋จผ์ € selenium์„ ํ™œ์šฉํ•˜์—ฌ ํŽ˜์ด์ง€์˜ html ์†Œ์Šค๋ฅผ ๊ฐ€์ ธ์˜จ ๋’ค

beautifulsoup์˜ select, select_one, find_element_by_css_selector๋ฅผ ํ™œ์šฉํ•ด์„œ ๊ฐ’์„ ๊ฐ€์ ธ์˜ค๊ณ 

selenium์„ ํ™œ์šฉํ•˜์—ฌ ๊ณ„์† ๋‹ค์Œ ํŽ˜์ด์ง€๋กœ ๋„˜์–ด๊ฐ€๋„๋ก ํ–ˆ์Šต๋‹ˆ๋‹ค.

from datetime import datetime
from tqdm import tqdm

def crawl_news_data(keyword, year, month, start_day, end_day, save_path):
    for day in tqdm(range(start_day, end_day+1)):
        date_time_obj = datetime(year=year, month=month, day=day)
        target_date = date_time_obj.strftime("%Y%m%d")
        ds_de = date_time_obj.strftime("%Y.%m.%d")

        get_naver_news_info_from_selenium(keyword=keyword, save_path=f"{save_path}/{keyword}/{target_date}_{keyword}_.xlsx", target_date=target_date, ds_de=ds_de, remove_duplicate=False)

๊ทธ๋ ‡๊ฒŒ ๋งŒ๋“  ์ฝ”๋“œ๋กœ ํ‚ค์›Œ๋“œ, ๋‚ ์งœ๋ฅผ ์ž…๋ ฅํ•˜๋ฉด ๊ทธ๋งŒํผ ํฌ๋กค๋ง์„ ํ•ด์ฃผ๋Š” ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค.

keywords = ['ํ‹ด๋”', 'ํ† ์Šค', '์•ผ๋†€์ž', '๋‹น๊ทผ๋งˆ์ผ“', '์•„ํ”„๋ฆฌ์นดtv', '์˜จํ”Œ๋ฒ•', '๋งค์น˜๊ทธ๋ฃน']
save_path = "./naver_news_article_2022

for keyword in keywords:
    os.makedirs(f"{save_path}/{keyword}")

๊ทธ๋ฆฌ๊ณ  ์›ํ•˜๋Š” ํ‚ค์›Œ๋“œ์™€ ๊ฒฐ๊ณผ๋ฅผ ์ €์žฅํ•  ๊ฒฝ๋กœ๋ฅผ ์„ค์ •ํ•œ ๋‹ค์Œ ๊ฒฝ๋กœ/ํ‚ค์›Œ๋“œ ๋กœ ๋””๋ ‰ํ† ๋ฆฌ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.

for keyword in keywords:
    print(f"start keyword - {keyword} crawling ...")
    crawl_news_data(keyword=keyword, year=2022, month=1, start_day=1, end_day=13, save_path=save_path)

๊ทธ ๋‹ค์Œ ์›ํ•˜๋Š” ๊ธฐ๊ฐ„๊ณผ ์ €์žฅ ๊ฒฝ๋กœ๋ฅผ ์ž…๋ ฅํ•˜์—ฌ ํฌ๋กค๋ง์„ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค.

์œ„์˜ ๊ฒฝ์šฐ์—๋Š” 2022๋…„ 1์›” 1์ผ ๋ถ€ํ„ฐ 13์ผ๊นŒ์ง€์˜ ๊ฐ’์„ ํฌ๋กค๋งํ•˜๋Š” ๊ฒฝ์šฐ์ž…๋‹ˆ๋‹ค.

ํฌ๋กค๋ง์„ ํ•˜๋ฉด ์œ„์™€ ๊ฐ™์ด ๋‚ ์งœ๋ณ„๋กœ ํฌ๋กค๋ง์ด ๋˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๋‚˜๋Š” ํ•œ๋ฒˆ์— ํ•ฉ์ณ์ง„ ๊ฐ’์„ ๋ณด๊ณ ์‹ถ๋‹ค! ํ•œ๋‹ค๋ฉด

import pandas as pd
import os

def merge_excel_files(file_path, file_format, save_path, save_format, columns=None):
    file_list = [f"{file_path}/{file}" for file in os.listdir(file_path) if file_format in file]
    
    frame_list = []
    for file in file_list:
        if file_format == ".xlsx":
            file_df = pd.read_excel(file)
        else:
            file_df = pd.read_csv(file)
        
        if columns is None:
            columns = file_df.columns
            
        frame_list.append(pd.DataFrame(file_df, columns=columns))
        
    # DataFrame.append was removed in pandas 2.0, so concatenate in one go instead
    merge_df = pd.concat(frame_list, ignore_index=True) if frame_list else pd.DataFrame()
        
    if save_format == ".xlsx":
        merge_df.to_excel(save_path, index=False)
    else:
        merge_df.to_csv(save_path, index=False)
        

if __name__ == "__main__":
    for keyword in keywords:
        merge_excel_files(file_path=f"/Users/donghyunjang/PythonHome/naver_news_article_2022/{keyword}", file_format=".xlsx", 
                          save_path=f"/Users/donghyunjang/PythonHome/naver_news_article_2022/{keyword}/20220101~20220113_{keyword}_๋„ค์ด๋ฒ„_๊ธฐ์‚ฌ.xlsx", save_format=".xlsx")

์œ„์˜ ์ฝ”๋“œ๋กœ ํ•ฉ๋ณ‘์„ ์‹œ์ผœ์ฃผ๋ฉด ๋ฉ๋‹ˆ๋‹ค.

๊ทธ๋Ÿผ ์œ„์™€ ๊ฐ™์ด ํ•ฉ๋ณ‘๋˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ™‚ ์ตœ์ข… ๊ฒฐ๊ณผ

์ฝ์–ด์ฃผ์…”์„œ ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค.
