https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python
Python Web Crawling Guide feat. Jupyter Notebook (0107)
from __future__ import print_function
import os.path
from collections import defaultdict
import string
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
1. Storing data with defaultdict
Source: https://docs.python.org/2/library/collections.html
s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
d = defaultdict(list)
for k, v in s:
    d[k].append(v)    # group each value under its key

sorted(d.items())
# [('blue', [2, 4]), ('red', [1]), ('yellow', [1, 3])]
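In the scraping code later on, the same pattern is what lets us pile scraped values into columns without creating each key first. A minimal, hypothetical sketch (the labels and numbers are made up, not taken from any real page), reusing the defaultdict import above:

# Hypothetical: accumulate scraped (label, value) pairs into columns
data = defaultdict(list)
scraped = [("Revenue", 1250.0), ("Revenue", 1310.5), ("Net Income", -42.0)]
for label, value in scraped:
    data[label].append(value)

print(dict(data))
# {'Revenue': [1250.0, 1310.5], 'Net Income': [-42.0]}  (key order may vary)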
2. The Python requests module
result = requests.get(url)                  # url: the page to scrape
c = result.content                          # raw response body (bytes)
soup = BeautifulSoup(c, "html.parser")      # name the parser explicitly
The difference between r.content and r.text: r.text is the content of the response decoded to unicode, while r.content is the raw content of the response in bytes.
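A quick way to see the difference for yourself (any reachable URL works; python.org here is only an example):

r = requests.get("https://www.python.org")
print(type(r.text))     # unicode text (str in Python 3)
print(type(r.content))  # raw bytes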
3. Identify the HTML structure of the data to scrape
A clean-up function to turn the scraped number strings into floats:

def convert_num(val):
    """
    Convert a string number value to a float:
    - remove all extra whitespace
    - remove commas
    - if wrapped in (), treat it as a negative number
    """
    val = val.strip().replace(",", "").replace("(", "-").replace(")", "")
    return float(val)
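For example, applied to the kind of strings a financial table usually holds:

print(convert_num("  1,234.5 "))   # 1234.5
print(convert_num("(2,000)"))      # -2000.0, since parentheses mean negative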
4. Parse the HTML
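Below is a minimal sketch of the parsing step, assuming a hypothetical page whose table rows hold (label, value) cell pairs; the URL and markup here are placeholders, not the exact structure used in the source article. It reuses requests, BeautifulSoup, defaultdict, and convert_num from above:

url = "http://example.com/financials.html"    # placeholder URL
result = requests.get(url)
soup = BeautifulSoup(result.content, "html.parser")

data = defaultdict(list)
table = soup.find("table")                    # first table on the page, if any
if table is not None:
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 2:                   # expect (label, value) rows
            label = cells[0].get_text(strip=True)
            data[label].append(convert_num(cells[1].get_text()))

print(dict(data))

From here the collected dict can be handed to pd.DataFrame for cleaning and plotting, which is where the pandas and matplotlib imports at the top come in.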
<Sites I referred to while practicing the Python web crawler>
http://creativeworks.tistory.com/entry/PYTHON-3-Tutorials-24-%EC%9B%B9-%ED%81%AC%EB%A1%A4%EB%9F%AClike-Google-%EB%A7%8C%EB%93%A4%EA%B8%B0-1-How-to-build-a-web-crawler
<Structuring a Python project>
http://python-guide-kr.readthedocs.io/ko/latest/writing/structure.html
<Python study URL>
http://blog.naver.com/dudwo567890/220914435973