[python] Data Collection

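Required packages

urllib.parse.quote function: percent-encodes the search term so that Korean (non-ASCII) text can be placed in a URL

requests: a package for reading the text of a web page

beautifulsoup: a package for extracting the data you want from an HTML string

Installation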

pip install requests
pip install beautifulsoup4

Searching HTML

String encoding

from urllib.parse import quote

# Read the search term and encode it
string = input("Enter a search term: ")
keyword = quote(string)

print(keyword)

# Result
Enter a search term: 안녕
%EC%95%88%EB%85%95
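
To show where the encoded keyword ends up, here is a minimal sketch that builds a search URL from it. The endpoint https://example.com/search and the parameter names q and page are hypothetical placeholders, not part of any real site's API.

```python
from urllib.parse import quote, unquote, urlencode

keyword = "안녕"

# quote() percent-encodes the UTF-8 bytes of the string
encoded = quote(keyword)

# unquote() reverses the encoding
decoded = unquote(encoded)

# urlencode() encodes every value in a dict and joins them with '&'
# (hypothetical endpoint and parameter names, for illustration only)
base_url = "https://example.com/search"
query = urlencode({"q": keyword, "page": 1})
search_url = f"{base_url}?{query}"

print(encoded)
print(decoded)
print(search_url)
```

When a URL has several parameters, urlencode is usually more convenient than calling quote on each value by hand.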

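Fetching a URL

import requests

# URL of the site to load (here, the blog used in the exercises)
URL = 'https://growingegg.tistory.com/'
response = requests.get(URL)

# Get the HTML as a string
html = response.text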
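
The snippet above assumes the request always succeeds. As a rough sketch of a more defensive version, the helper below adds a timeout and checks the HTTP status. The name fetch_html and the 10-second default are my own choices for illustration, not something requests provides.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html is a hypothetical helper name used only in this sketch.
    """
    try:
        # Without a timeout, a stalled server can block the script indefinitely
        response = requests.get(url, timeout=timeout)
        # Raise an exception for 4xx / 5xx status codes
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, timeouts, bad URLs and HTTP errors
        print(f"request failed: {exc}")
        return None
    return response.text


# A URL without a scheme is rejected before any network traffic happens
result = fetch_html("not-a-valid-url")
print(result)
```

Returning None keeps the calling code simple; a stricter script might prefer to let the exception propagate.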

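Reading the data

from bs4 import BeautifulSoup

# Parse the HTML text into a tree held in memory
bs = BeautifulSoup(html, 'html.parser')

# A selector can match several elements, so select() returns a list
selector = 'strong.tit_post'  # replace with your own CSS selector
cnt = bs.select(selector)
# cnt now holds every element that matches the selector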
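
To try selectors without hitting a real site, you can parse a small HTML string directly. The markup below is made up for illustration; it only borrows the class and id names used in the exercises.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, written by hand for this sketch
html = """
<div id="dkBody">
  <span class="txt_title">Posts (2)</span>
  <strong class="tit_post">First post</strong>
  <strong class="tit_post">Second post</strong>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns every match as a list
titles = [tag.get_text(strip=True) for tag in soup.select("strong.tit_post")]

# select_one() returns only the first match, or None if nothing matches
header = soup.select_one("#dkBody > span.txt_title")

# A selector that matches nothing gives an empty list, not an error
missing = soup.select("p.no_such_class")

print(titles)
print(header.get_text())
print(len(missing))
```

Because select() returns an empty list when nothing matches, indexing the result with [0] will raise an IndexError if the page layout changes, so it is worth checking the length first.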

Exercise 1 (fetching the post titles on a page)

import requests
URL = 'https://growingegg.tistory.com/'
response = requests.get(URL)
html = response.text

from bs4 import BeautifulSoup
bs = BeautifulSoup(html, 'html.parser')

cnt = bs.select('strong.tit_post')

for i in cnt:
    print(i.getText())

# Result
[python] numpy(2)
[python] numpy(1)
[python] sklearn, pandas(2)
[python]pandas
[Taebleu] 대시보드

Exercise 2 (fetching the post titles from every page)

import math

import requests
from bs4 import BeautifulSoup

URL = 'https://growingegg.tistory.com/'
html = requests.get(URL).text
bs = BeautifulSoup(html, 'html.parser')

# The heading text contains the total post count; slice out the number
cnt = bs.select('#dkBody > span.txt_title')
cnt = cnt[0].getText()
cnt = int(cnt[6:-1])

# Each page lists 5 posts, so round up to get the number of pages
pages = math.ceil(cnt / 5)

for page in range(1, pages + 1):
    url = URL + '?page=' + str(page)
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    # Print the text of each title instead of the raw tag list
    for title in soup.select('strong.tit_post'):
        print(title.getText())
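
The page count in this exercise is the post count divided by the posts per page, rounded up. Here is a small sketch of that arithmetic on its own, using integer ceiling division. The helper names page_count and page_urls are hypothetical, and the page size of 5 is taken from the exercise.

```python
def page_count(total_posts, posts_per_page):
    """Number of pages needed, rounding up (integer arithmetic only)."""
    return (total_posts + posts_per_page - 1) // posts_per_page


def page_urls(base_url, total_posts, posts_per_page):
    """Build one URL per page, numbered from 1."""
    pages = page_count(total_posts, posts_per_page)
    return [f"{base_url}?page={page}" for page in range(1, pages + 1)]


print(page_count(23, 5))   # 23 posts, 5 per page
print(page_count(25, 5))   # an exact multiple needs no extra page
print(page_urls("https://growingegg.tistory.com/", 11, 5))

# Adding 0.99 before truncating happens to work for a page size of 5,
# but it undercounts once the fraction is smaller than 0.01
print(int(201 / 200 + 0.99), page_count(201, 200))
```

Integer division avoids floating-point rounding entirely, so it stays correct for any page size.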

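Using R data

Use the pyreadr package.

import pyreadr

# read_r returns a dict-like object; an .rds file holds a single object stored under the key None
df = pyreadr.read_r('path/to/file.rds')[None]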

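Reading data from statistics programs

The pyreadstat package can read data files produced by programs such as SPSS, Stata, and SAS.
In recent data analysis work, the PySpark DataFrame is often used instead of the pandas DataFrame, and one reason is that PySpark can load a wider variety of data sources.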