Python Web Crawling Practice

Jr.Kelly 2017. 1. 7. 03:33

https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python

Python web crawling guide feat. Jupyter Notebook (0107)

from __future__ import print_function
import os.path
from collections import defaultdict
import string
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

1. Storing data with defaultdict

Source: https://docs.python.org/2/library/collections.html

8.3.3.1. `defaultdict` Examples

>>> s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
>>> d = defaultdict(list)
>>> for k, v in s:
...     d[k].append(v)
...
>>> d.items()
[('blue', [2, 4]), ('red', [1]), ('yellow', [1, 3])]

2. The Python requests module

result = requests.get(url)
c = result.content
soup = BeautifulSoup(c, "html.parser")  # name the parser explicitly so bs4 does not have to guess one

Difference between r.content and r.text

Here r is the response object returned by requests.get (result in the code above). r.text is the response body decoded to Unicode, and r.content is the same body as raw bytes.
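The bytes versus Unicode distinction can be seen without making a network call. The sketch below is only a stand-in and assumes Python 3: raw_body plays the role of r.content and decoded_body the role of r.text. Both names, the sample string, and the UTF-8 encoding are assumptions chosen for illustration.

```python
# Stand-in for r.content: the raw bytes a server would send (sample data, assumed UTF-8).
raw_body = "<p>café 크롤링</p>".encode("utf-8")

# Stand-in for r.text: the same payload decoded into a Unicode string.
decoded_body = raw_body.decode("utf-8")

# Non-ASCII characters take more than one byte, so the two lengths differ.
print(type(raw_body).__name__, len(raw_body))
print(type(decoded_body).__name__, len(decoded_body))
```

Passing bytes to BeautifulSoup, as the snippet above does with result.content, leaves the encoding detection to the parser.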

3. Understand the HTML structure of the page to pull data from

Clean-up function

def convert_num(val):
    """
    Convert the string number value to a float
     - Remove all extra whitespace
     - Remove commas
     - If wrapped in (), then it is negative number
    """
    # str.strip() works on both Python 2 and 3; string.strip() exists only on Python 2
    val = val.strip().replace(",","").replace("(","-").replace(")","")
    return float(val)
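A short usage sketch may make the three clean-up rules concrete. The function is restated here in Python 3 form so that the block runs on its own. The sample strings are made up for illustration and do not come from the original post.

```python
def convert_num(val):
    """Python 3 restatement of the clean-up function above."""
    val = val.strip().replace(",", "").replace("(", "-").replace(")", "")
    return float(val)


# Made-up inputs: extra whitespace, thousands separators, and a parenthesized negative.
samples = [" 1,234.50 ", "(2,000)", "42"]
converted = [convert_num(s) for s in samples]
print(converted)  # [1234.5, -2000.0, 42.0]
```

Note that an empty string or a placeholder such as "-" would raise ValueError here, so real table cells may need an extra check before conversion.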

4. Parse the HTML
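The original notes leave this step empty, so the following is only a guess at what the parsing might look like, not the author's code. To keep it runnable without third-party packages, it swaps in the standard library's html.parser in place of BeautifulSoup. CellCollector, SAMPLE_HTML, and the table layout are all hypothetical. The clean-up line repeats the logic of convert_num, and the grouping reuses the defaultdict pattern from section 1.

```python
from collections import defaultdict
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped into rows by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False   # True while the parser is inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


# Hypothetical page fragment; a real page would come from requests.get(url).content.
SAMPLE_HTML = """
<table>
  <tr><td>yellow</td><td>1,000</td></tr>
  <tr><td>blue</td><td>(250)</td></tr>
  <tr><td>yellow</td><td> 3 </td></tr>
</table>
"""

parser = CellCollector()
parser.feed(SAMPLE_HTML)

# Group the numeric column by the label column, as in the defaultdict example.
totals = defaultdict(list)
for label, number in parser.rows:
    cleaned = number.strip().replace(",", "").replace("(", "-").replace(")", "")
    totals[label].append(float(cleaned))

print(dict(totals))
```

With BeautifulSoup, the same traversal is typically written with soup.find_all("tr") and row.find_all("td").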

<Site I referred to while practicing Python web crawlers>

http://creativeworks.tistory.com/entry/PYTHON-3-Tutorials-24-%EC%9B%B9-%ED%81%AC%EB%A1%A4%EB%9F%AClike-Google-%EB%A7%8C%EB%93%A4%EA%B8%B0-1-How-to-build-a-web-crawler

<Structuring a Python project>

http://python-guide-kr.readthedocs.io/ko/latest/writing/structure.html

<Python study URL>

http://blog.naver.com/dudwo567890/220914435973
