Exhaustive Set of Listed Equities

gmst · Jul 3, 2013

Quote from blah12345678:

My solution to building a database from Yahoo Finance is to just scrape the pages:

1. Go to the sectors page. Scrape the 10 or so Sectors, and obtain the links to each individual Sector page.
2. Parse each Sector page, obtaining the name and URL of each Industry in that Sector.
3. Parse each Industry page, obtaining the symbol and name of every US stock in that Industry.

After these quick steps, I have a list of the 6,300 US stocks and 950ish ETFs recognized by Yahoo.

From there, I get each stock's name, exchange, whether they trade options, shares outstanding, and historical prices.

I use Perl with the Furl, HTML::TreeBuilder, and Parallel::ForkManager modules.

Thanks for the algorithm. I have never used Perl. Is there any other way to do it? I can write code in VBA. By the way, would it be possible for you to post your Perl code? It's OK if not. Maybe it will give me an idea of how to execute this in VBA. Thanks.

Also, can you please explain this line - couldn't understand it.
I use Perl with the Furl, HTML::TreeBuilder, and Parallel::ForkManager modules.

Bob111 · Jul 3, 2013

Quote from blah12345678:

My solution to building a database from Yahoo Finance is to just scrape the pages:

1. Go to the sectors page. Scrape the 10 or so Sectors, and obtain the links to each individual Sector page.
2. Parse each Sector page, obtaining the name and URL of each Industry in that Sector.
3. Parse each Industry page, obtaining the symbol and name of every US stock in that Industry.

After these quick steps, I have a list of the 6,300 US stocks and 950ish ETFs recognized by Yahoo.

From there, I get each stock's name, exchange, whether they trade options, shares outstanding, and historical prices.

I use Perl with the Furl, HTML::TreeBuilder, and Parallel::ForkManager modules.

+1. I have a similar list of procedures. And that, ladies and gentlemen, is how we are getting data which is supposed to be free and easy to find on the exchanges.

blah12345678 · Jul 3, 2013

Quote from gmst:

Thanks for the algorithm. I have never used Perl. Is there any other way to do it? I can write code in VBA. By the way, would it be possible for you to post your Perl code? It's OK if not. Maybe it will give me an idea of how to execute this in VBA. Thanks.

Also, can you please explain this line - couldn't understand it.
I use Perl with the Furl, HTML::TreeBuilder, and Parallel::ForkManager modules.

Perl is the core scripting language, similar to Ruby, Python, or Lua.

Other people have written add-ons that enhance/simplify certain tasks, such as retrieving data from the web, parsing web pages, manipulating databases, creating charts, calculating statistics or option Greeks, etc.

They donate their work to the public by submitting it to CPAN - the Comprehensive Perl Archive Network - cpan.perl.org.

Ruby, Python, and R have similar repositories.

The modules I mentioned and use:

Furl - a faster alternative to LWP, which is the standard (though not built-in) module for downloading web pages and other data from the web.

HTML::TreeBuilder - part of HTML::Tree. A module to help parse HTML pages. It parses the page into a tree of elements that you can search by HTML tag. There are similar modules that convert the HTML into XML or DOM trees for easier, more orderly parsing.
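To give an idea of what the parsing step looks like, here is a minimal sketch that pulls the name and URL of every link out of a page with HTML::TreeBuilder. The HTML snippet, the industry names, and the extract_links helper are made up for illustration; the real Yahoo pages are laid out differently, and in practice the content would come from a download (with Furl, for example) rather than a string.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up HTML standing in for a downloaded Sector page (an assumption,
# not the real Yahoo layout).
my $html = <<'HTML';
<html><body>
<table>
  <tr><td><a href="/industry/1">Chemicals</a></td></tr>
  <tr><td><a href="/industry/2">Steel</a></td></tr>
</table>
</body></html>
HTML

# Hypothetical helper: return a list of [name, url] pairs,
# one per <a> element that has an href attribute.
sub extract_links {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);
    my @links;
    for my $link ($tree->look_down(_tag => 'a')) {
        my $href = $link->attr('href');
        next unless defined $href;
        push @links, [ $link->as_text, $href ];
    }
    $tree->delete;    # release the parse tree
    return @links;
}

my @industries = extract_links($html);
printf "%s -> %s\n", @$_ for @industries;
```

The same helper could be pointed at the Sector pages and then at the Industry pages, with a narrower look_down filter once you know which table holds the links.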

Parallel::ForkManager is a simple module to help parallelize discrete procedures. So, instead of retrieving, parsing, and inserting the data into the database for the 7,200 stocks and ETFs one at a time, I fire up 50 instances to run at the same time. That reduces the total time from hours to about 30 minutes over a cable modem line.

I use it again when I update prices each day.
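A minimal sketch of that fork-per-symbol pattern, assuming a short made-up symbol list and a limit of 2 children instead of 50. The "work" done in each child is just a placeholder string; in real use it would be the retrieve, parse, and insert steps for one symbol.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

my @symbols = qw(AAPL IBM MSFT XOM);       # hypothetical sample list
my $pm = Parallel::ForkManager->new(2);    # at most 2 children at a time

my %result;
# Runs in the parent each time a child exits and collects
# whatever that child passed back through finish().
$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident, $signal, $core_dump, $data) = @_;
    $result{$ident} = $$data if defined $data;
});

for my $symbol (@symbols) {
    $pm->start($symbol) and next;          # parent: go on to the next symbol
    # Child: placeholder for "retrieve, parse, insert" for one symbol.
    my $summary = "processed $symbol";
    $pm->finish(0, \$summary);             # exit the child, send data back
}
$pm->wait_all_children;

print "$_: $result{$_}\n" for sort keys %result;
```

Each child is a separate process, so anything it needs to hand back to the parent has to go through finish() like this; a database handle should likewise be opened inside the child rather than shared with the parent.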

DBI is the top-level module for database interaction. DBI works in conjunction with a database driver module (in my case, DBD::Pg for PostgreSQL interaction).

There are others, but those are the most relevant.

I run FreeBSD. If you want to continue with Windows, you'll probably want to install Cygwin to run a Unix-like environment on your Windows box. Or, you can install VirtualBox, and install Linux or FreeBSD as a guest.

For the uninitiated, I highly recommend FreeBSD instead of Linux. The basic analogy I use to compare the two is this:

Linux is a car - but you receive it in parts. It's up to you to put it together, tune it, and maintain it. No matter what distro you use, you'll still work harder to get it to do what you want.

FreeBSD is a car, but it's delivered ready to run and use. You just have to change the oil and filters, rotate the tires, etc., every once in a while. You'll still need to build/configure your window manager if you're particular like me and don't want to use GNOME or KDE.

The code I've written is not ready for public consumption. I definitely need to run through and eliminate inconsistencies and inefficiencies.

But the gist of the design:

I've created my own package (Perl-speak for library or the above mentioned modules) called Finance::DataMining.

The stocks portion is in Finance::DataMining::Stocks
The futures portion is in Finance::DataMining::Futures
The options portion is in Finance::DataMining::Options
The analysis routines are in Finance::DataMining::Quant

In Perl, the double-colons are equivalent to directory slashes. So, in Unix, Finance::DataMining::Stocks is equivalent to:

$lib_dir/Finance/DataMining/Stocks.pm
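The mapping is mechanical enough to show in a few lines. This is just an illustration of the rule; package_to_path is a made-up helper name, and Perl does the equivalent lookup itself when it sees a use statement.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a package name into the relative
# file path Perl searches for in each library directory.
sub package_to_path {
    my ($package) = @_;
    (my $path = $package) =~ s{::}{/}g;    # double-colons become slashes
    return "$path.pm";
}

print package_to_path('Finance::DataMining::Stocks'), "\n";
# prints Finance/DataMining/Stocks.pm
```

Perl looks for that relative path under each directory in @INC, so a private library directory can be added with "use lib" before loading the module.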

I've put all of the routines into the Stocks.pm file and all the config stuff - default variables, SQL calls, URLs - into a config file called $etc/Stocks.conf.

I've written everything so I can merely do this to get all of the symbols and all of their closing prices:

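use strict;
use warnings;
use Finance::DataMining::Stocks;

my $stocks  = Finance::DataMining::Stocks->new;
my @symbols = $stocks->getSymbols();

# Each entry in @symbols is an array ref; element 0 is passed as the symbol.
foreach my $row (@symbols) {
    my %prices = $stocks->getStockPricesBySymbol($row->[0]);

    # %prices is keyed by symbol, then by date.
    foreach my $symbol (sort keys %prices) {
        foreach my $date (sort keys %{ $prices{$symbol} }) {
            print "$symbol - $date - $prices{$symbol}{$date}{'close'}\n";
        }
    }
}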

Granted, this isn't a very fast (it took 409 seconds when I timed it) or very practical example, since I'm pulling and printing data from a heavily indexed table with 20 million rows (I've downloaded all the price data from 1970 where applicable). But if you're not in any hurry, it works, especially for EOD data...

If one needs/wants to avoid survivorship bias issues, then they can buy the delisted data from premiumdata.net. I doubt they'll have other useful data like shares outstanding, sector, industry, fundamentals, etc.

This is just the data collection stuff. If I finally get around to building a more complete package - charting, backtesting, order generator, portfolio manager, maybe a web or app-based gui, then I'll release it.

But it would need to be more complete and user-friendly than GeniusTrader. Which it isn't ... yet.

And I may entertain the idea of rewriting it in Python. We'll see. No promises...

*** how do we turn off the smilies? Replace the smilies above with colon and D ***

gmst · Jul 3, 2013

blah12345678

That was a very thorough reply and a nice introduction. Highly appreciated!
