Python Web Crawling Practice

Jr.Kelly 2017. 1. 7. 03:33

https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python

Python web crawling guide feat. Jupyter Notebook (0107)

from __future__ import print_function
import os.path
from collections import defaultdict
import string
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

1. Storing data with defaultdict

Source: https://docs.python.org/2/library/collections.html

8.3.3.1. `defaultdict` Examples

>>> s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
>>> d = defaultdict(list)
>>> for k, v in s:
...     d[k].append(v)
...
>>> d.items()
[('blue', [2, 4]), ('red', [1]), ('yellow', [1, 3])]
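
The transcript above comes from the Python 2 documentation, where d.items() returns a list. As a minimal sketch of the same grouping under Python 3 (an assumption, since the post itself targets Python 2), items() returns a view, so sorting it gives a plain list in key order:

```python
from collections import defaultdict

s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]

# A missing key is created on first access with an empty list as its value.
d = defaultdict(list)
for k, v in s:
    d[k].append(v)

# In Python 3, items() is a view; sort it to get a list in key order.
grouped = sorted(d.items())
print(grouped)
```

Because the default factory is list, there is no need to check whether a key already exists before appending to it.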

2. The Python requests module

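result = requests.get(url)  # url: address of the page to scrape
c = result.content  # raw response body as bytes
soup = BeautifulSoup(c, "html.parser")  # name the parser explicitly to avoid a warning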

The difference between r.content and r.text

r.text is the content of the response in unicode, and r.content is the content of the response in bytes.
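
The relationship between the two can be sketched without a network call. In requests, r.text is produced by decoding r.content with the encoding the library determines for the response (exposed as r.encoding). The snippet below only imitates that step: raw_bytes and text are hypothetical stand-ins for r.content and r.text, and UTF-8 is an assumed encoding.

```python
# Stand-in for r.content: the raw bytes a server might send (hypothetical data).
raw_bytes = "크롤링 <b>연습</b>".encode("utf-8")

# Stand-in for r.text: the same bytes decoded with the response's encoding.
text = raw_bytes.decode("utf-8")

# The byte count exceeds the character count because each Hangul
# character takes several bytes in UTF-8.
print(type(raw_bytes).__name__, len(raw_bytes))
print(type(text).__name__, len(text))
```

BeautifulSoup accepts either form; when it is given bytes it detects the encoding itself, which is one reason to pass r.content as the code above does.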

3. Understand the structure of the HTML that holds the data

Clean-up function

def convert_num(val):
    """
    Convert the string number value to a float
     - Remove all extra whitespace
     - Remove commas
     - If wrapped in (), then it is a negative number
    """
    # str.strip() works on Python 2 and 3; string.strip() exists only in Python 2
    val = val.strip().replace(",", "").replace("(", "-").replace(")", "")
    return float(val)
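
A few sample calls show what the clean-up does. The function is repeated here so the snippet runs on its own, written with the str.strip() method, which works on both Python 2 and 3. The sample strings are hypothetical cell values, not data from the original page.

```python
def convert_num(val):
    """Strip whitespace and commas; treat a value wrapped in () as negative."""
    val = val.strip().replace(",", "").replace("(", "-").replace(")", "")
    return float(val)


# Hypothetical table cell values of the kind found in financial tables.
samples = [" 1,234.50 ", "(2,000)", "42"]
converted = [convert_num(v) for v in samples]
print(converted)
```

Note that float() raises ValueError for an empty string or any non-numeric text, so real pages may need a guard around the call.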

4. Parse the HTML
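
The post leaves this step empty. As a rough sketch only, assuming the data sits in an HTML table, the rows can be pulled out with BeautifulSoup's find_all() and get_text(). The HTML string below is hypothetical and stands in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page (not from the original post).
html = """
<table>
  <tr><th>Item</th><th>Amount</th></tr>
  <tr><td>Roads</td><td>1,200</td></tr>
  <tr><td>Refund</td><td>(300)</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # Header rows hold <th> cells, so they yield no <td> and are skipped.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

print(rows)
```

The amount strings still carry commas and parentheses at this point, which is where a clean-up function such as convert_num comes in.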

<Sites referenced while practicing Python web crawlers>

http://creativeworks.tistory.com/entry/PYTHON-3-Tutorials-24-%EC%9B%B9-%ED%81%AC%EB%A1%A4%EB%9F%AClike-Google-%EB%A7%8C%EB%93%A4%EA%B8%B0-1-How-to-build-a-web-crawler

<Structuring a Python project>

http://python-guide-kr.readthedocs.io/ko/latest/writing/structure.html

<Python study URL>

http://blog.naver.com/dudwo567890/220914435973
