python Scrapy (crawling)

9/22/2013

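Crawling... it's called crawling, but really it just means scraping web pages, cough.. lol. I like Python. Great developers have built plenty of convenient modules and frameworks for it. Thank you ^^

There are many ways to crawl. Out of them I used Scrapy... "scrap-pie"? cough lol. It is very simple to use, and the Scrapy Tutorial is very friendly too.

Naturally... you need Python lol, version 2.6 or 2.7. I haven't tried 3.0, but it probably won't work. Install it the easy way with either pip or easy_install.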

easy_install Scrapy

pip install Scrapy

Installation is done. The required dependencies are handled for you, which is convenient.

Create a Scrapy project.

scrapy startproject tutorial

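scrapy.cfg: the project's configuration file
tutorial/: the project folder
tutorial/items.py: the items to use; the extracted data goes here
tutorial/pipelines.py: used when you want to do some processing on the extracted data
tutorial/settings.py: the project's settings file
tutorial/spiders/: the spiders that gather the data, i.e. the crawlers that actually run

Add the item you will use to items.py.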

from scrapy.item import Item, Field


class HealthItem(Item):
    name = Field()
    phone = Field()
    address = Field()
    home_page = Field()

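Add a spider to the spiders folder to gather the data. (The code imports from vwell, the name of my project; use your own project name in its place.)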

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from vwell.items import HealthItem


class HealthSpider(BaseSpider):
    # Spider name; it must be unique.
    name = "health"
    allowed_domains = ["cdc.go.kr"]
    # URLs to crawl.
    start_urls = [
        "https://nip.cdc.go.kr/nip/manage.do?service=getMedicalCenterList&ARTICLECNT=100&CURPAGE=1&SIDCOD=1100000000&SelFlag=HC"
    ]

    # Parse the response into items.
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        healths = hxs.select('//*[@id="contents"]/div[@class="conbox sch"]/div[@class="tableA"]/table/tbody/tr')
        items = []
        count = 0

        for health in healths:
            item = HealthItem()
            # '//td' is an absolute path, so it matches every cell in the
            # document; count*4 skips ahead to the four cells of this row.
            item['name'] = health.select('//td')[count*4].select('text()').extract()[0]
            # extract() returns a list, so the second phone number is appended to it.
            item['phone'] = health.select('//td')[count*4+1].select('ul/li')[0].select('text()').extract()
            item['phone'].append(health.select('//td')[count*4+1].select('ul/li')[1].select('text()').extract()[0])
            item['address'] = health.select('//td')[count*4+2].select('text()').extract()[0]
            item['home_page'] = health.select('//td')[count*4+3].select('a/@href').extract()[0]
            count += 1
            items.append(item)

        return items
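
The `count*4` arithmetic in `parse` is needed because an XPath that starts with `//` is evaluated against the whole document, even when it is called on a single row, so every row sees the same flat list of cells. The sketch below reproduces that indexing and compares it with selecting cells relative to each row. It uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selector (an assumption, so that it runs without Scrapy), and the table contents are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: two rows of four cells, like the table the spider reads.
TABLE = (
    "<table>"
    "<tr><td>Center A</td><td>02-000-0001</td><td>Seoul 1</td><td>http://a.example</td></tr>"
    "<tr><td>Center B</td><td>02-000-0002</td><td>Seoul 2</td><td>http://b.example</td></tr>"
    "</table>"
)
CELLS_PER_ROW = 4

table = ET.fromstring(TABLE)
rows = table.findall("tr")

# Document-wide search: one flat list holding the cells of every row.
all_cells = table.findall(".//td")

# Flat-list indexing, as in the spider: row i starts at offset i * 4.
names_by_offset = [all_cells[i * CELLS_PER_ROW].text for i in range(len(rows))]

# Row-relative search: each row only sees its own cells, so no offset is needed.
names_by_row = [row.findall("td")[0].text for row in rows]

print(names_by_offset)
print(names_by_row)
```

With Scrapy's selector, the row-relative form would be `health.select('td')` instead of `health.select('//td')`.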

HtmlXPathSelector's select method parses the HTML using XPath expressions. It is simple enough that you get used to it quickly. Tags are selected with tag names separated by /.

hxs.select('//ul/li')

Use @ to access a tag's attributes.

hxs.select('//*[@id="contents"]/div[@class="conbox sch"]/div[@class="tableA"]/table/tbody/tr')

text() fetches the text value. Note that extract() returns a list.

hxs.select('//ul/li/text()').extract()
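
To try these three ideas (tag paths, attribute filters with @, and text returned as a list) without running a crawl, here is a minimal sketch. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, as a stand-in for Scrapy's selector; that substitution and the simplified sample markup are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup that loosely mimics the target page.
SAMPLE = """
<html><body>
  <div id="contents">
    <div class="tableA">
      <table>
        <tr><td>Center A</td><td><ul><li>02-000-0001</li><li>02-000-0002</li></ul></td></tr>
        <tr><td>Center B</td><td><ul><li>02-000-0003</li><li>02-000-0004</li></ul></td></tr>
      </table>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Tag names joined by "/" walk down the tree; ".//" searches at any depth.
phone_items = root.findall(".//ul/li")

# "@" inside a predicate filters on an attribute value.
rows = root.findall(".//*[@id='contents']/div[@class='tableA']/table/tr")

# ElementTree has no text() step, so read .text from each element.
# The result is a list, just as extract() returns a list in Scrapy.
phones = [li.text for li in phone_items]

print(len(rows))
print(phones)
```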

Go to the folder that contains scrapy.cfg and run it.

scrapy crawl health -o items.json -t json

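The crawl results will be saved in the items.json file.

If you want to do further processing on the crawled data, use a pipeline. Add a class that does the processing to pipelines.py, register it in settings.py, and you're done. Simple. I didn't like seeing Unicode escapes in the output, so I added a pipeline that saves the values as plain strings.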
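
The json export writes the items as a single JSON array, so the file can be read back with the standard `json` module. A minimal sketch: `load_items` is a hypothetical helper name, and the sample record is made up so the code runs on its own; point `path` at your real items.json instead.

```python
import json
import os
import tempfile


def load_items(path):
    """Read a JSON array of items from a file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical stand-in for a real items.json, written to a temp folder.
sample = [{
    "name": "Center A",
    "phone": ["02-000-0001", "02-000-0002"],
    "address": "Seoul 1",
    "home_page": "http://a.example",
}]
path = os.path.join(tempfile.mkdtemp(), "items.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f)

items = load_items(path)
for item in items:
    print(item["name"], item["phone"])
```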

class JsonWriterPipeline(object):

    def __init__(self):
        # Output file, opened once when the pipeline is created.
        self.file = open('healthList.json', 'w')

    def process_item(self, item, spider):
        # Encode the unicode values as UTF-8 byte strings.
        newItem = {}
        newItem['name'] = item['name'].encode('utf-8')
        newItem['phone'] = item['phone']
        newItem['address'] = item['address'].encode('utf-8')
        newItem['home_page'] = item['home_page'].encode('utf-8')

        # Write one hand-formatted JSON object per line.
        line = '{ "name" : "%s", "phone" : ["%s", "%s"], "address" : "%s", "home_page" : "%s" },' % (newItem['name'], newItem['phone'][0].encode('utf-8'), newItem['phone'][1].encode('utf-8'), newItem['address'], newItem['home_page']) + '\n'
        self.file.write(line)
        return item
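
One caveat with the hand-built format string: a value that contains a double quote or a backslash produces invalid JSON. As an alternative sketch (not what the pipeline above does), the standard `json` module can do the escaping while still keeping Korean text readable via `ensure_ascii=False`. This is written for Python 3; `to_json_line` is a hypothetical helper name and the sample item is made up.

```python
import json


def to_json_line(item):
    """Serialize one item as a single line of JSON, keeping non-ASCII text readable."""
    record = {
        "name": item["name"],
        "phone": list(item["phone"]),
        "address": item["address"],
        "home_page": item["home_page"],
    }
    # ensure_ascii=False writes "서울" as-is instead of "\uc11c\uc6b8".
    return json.dumps(record, ensure_ascii=False) + "\n"


# Hypothetical item; the quotes inside the name would break a hand-built string.
item = {
    "name": '서울 "A" 보건소',
    "phone": ["02-000-0001", "02-000-0002"],
    "address": "서울특별시",
    "home_page": "http://a.example",
}
line = to_json_line(item)
print(line, end="")
```

Inside `process_item`, the returned line could be passed to `self.file.write` in place of the formatted string.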

Register it in settings.py.

ITEM_PIPELINES = [
    'vwell.pipelines.JsonWriterPipeline'
]
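
The list form is what worked for me. As far as I know, newer Scrapy releases (0.20 and later; treat the exact version as an assumption) expect a dict that maps each pipeline to an integer order instead. A sketch of that form, where 300 is an arbitrary value:

```python
# settings.py fragment for newer Scrapy releases (assumption: 0.20 or later).
# The integer sets the run order: lower values run first. 300 is an arbitrary choice.
ITEM_PIPELINES = {
    'vwell.pipelines.JsonWriterPipeline': 300,
}
```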

Run it again. Done~ Crawling web pages... easy, cough ^^
source : https://github.com/semicolok/python-scrapy
