How to do web scraping

Discussion in 'App Development' started by gmst, Feb 7, 2013.

  1. I've actually used AutoIt to do some pretty heavy lifting.
     
    #31     Jul 10, 2013
  2. Depends on what you're trying to scrape. If the content is inside something like Flash, then I'd say you're out of luck.

    Otherwise, I just use Python for any web scraping needs (I'm actually shocked no one has mentioned it in this thread). As far as libraries go, just use BeautifulSoup, plus Requests if you need to handle authentication. The time it takes to code these things up is pretty much nil.
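
    To make the BeautifulSoup plus Requests combination concrete, here is a minimal sketch. It assumes Python 3 with the `requests` and `beautifulsoup4` (`bs4`) packages installed; the helper names `fetch_html` and `table_to_dict`, the credentials, and the sample HTML are all made up for illustration, and the authentication shown is plain HTTP basic auth, which many sites do not use.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, username=None, password=None):
    """Download a page; credentials (hypothetical) are sent as HTTP basic auth."""
    session = requests.Session()
    if username is not None:
        session.auth = (username, password)
    response = session.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def table_to_dict(html):
    """Turn rows shaped like <tr><th>label</th><td>value</td></tr> into a dict."""
    soup = BeautifulSoup(html, "html.parser")
    rows = {}
    for tr in soup.find_all("tr"):
        # Skip rows that lack either a header cell or a data cell.
        if tr.th is not None and tr.td is not None:
            rows[tr.th.get_text(strip=True)] = tr.td.get_text(strip=True)
    return rows


# Made-up markup so the parsing step can be tried without any network access.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><td>Example Fund</td></tr>
  <tr><th>Price</th><td> 12.34 </td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_to_dict(SAMPLE_HTML))
```

    For a real page you would pass the result of `fetch_html(...)` into `table_to_dict(...)`; keeping download and parsing separate makes the parsing easy to test against saved HTML.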

    Here's a sample that downloads NAVs for ETFs from Bloomberg's website. I just use BeautifulSoup for the scraping and a regex for date parsing.

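    #!/usr/bin/env python
    # Written for Python 2 and BeautifulSoup 3 (the old `import BeautifulSoup` package).
    import BeautifulSoup
    import urllib
    import re

    from pandas import *

    def scrape_ETF_data(s):
        # Build the quote page URL for a US-listed symbol.
        link = "http://www.bloomberg.com/quote/"+s+":US"
        soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(link))
        # Every row of the "standard_stat" block holds one label/value pair.
        table = soup.find("div", {"class" : "standard_stat"}).findChildren("tr")
        # Pull a 20YY-MM-DD style date out of the first row.
        # NOTE: the next line was cut off in the original post (at the "$"); the rest is missing.
        nav_date = re.search(r"20\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])", table[0].span$
        # Strip non-word characters from each label and use it as the dict key.
        data = {re.sub(r'[^\w]', '', x.th.string) : x.td.contents[-1].strip() for x in table}
        data["Symbol"] = s
        data["NAV_asof"] = nav_date
        return data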
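
    The two regular expressions in the sample can be tried on their own, without touching the network. The sketch below uses only the standard library `re` module on Python 3; the function names `extract_date` and `clean_key` and the sample strings are invented for illustration and are not taken from the Bloomberg page.

```python
import re

# Same date pattern as in the sample: 20YY, a separator, month 01-12, a separator, day 01-31.
DATE_RE = re.compile(r"20\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])")


def extract_date(text):
    """Return the first matching date substring, or None if there is none."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None


def clean_key(label):
    """Drop every character that is not a letter, digit, or underscore."""
    return re.sub(r"[^\w]", "", label)


if __name__ == "__main__":
    print(extract_date("NAV as of 2013-08-06"))
    print(clean_key("52-Week High"))
```

    Note that the pattern only checks the shape of the date, so something like 2013-02-31 would still match.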
     
    #32     Aug 7, 2013
  3. Just an FYI: if you do this, you'll get a message asking you to verify that you are human, and after that your IP will get banned.
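
    One common way to lower the odds of running into that is to space your requests out. Below is a rough sketch of a throttle that uses only the Python standard library; the `Throttle` class and the 2-second interval are made up for illustration, and nothing here guarantees a site won't still block you, so check its terms of use first.

```python
import time


class Throttle:
    """Enforce a minimum gap (in seconds) between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(2.0)  # hypothetical interval: at most one request every 2 seconds
    for symbol in ["AAA", "BBB", "CCC"]:  # placeholder symbols
        throttle.wait()
        print("would fetch", symbol)
```

    Call `throttle.wait()` right before each download; the first call returns immediately and later calls sleep only for whatever part of the interval is left.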
     
    #33     Aug 7, 2013