I am working on a project where I am using words, encoded by vectors, which are about 2000 floats long. Now when I use these with raw text I need to retrieve the vector for each word as it comes across and do some computations with it. Needless to say for a large vocabulary (~100k words) this has a large storage requirement (about 8 GB in a text file).
I initially had a system where I split the large text file into smaller ones and then for a particular word, I read its file, and retrieved its vector. This was too slow as you might imagine.
I next tried reading everything into RAM (takes about ~40GB RAM) figuring once everything was read in, it would be quite fast. However, it takes a long time to read in and a disadvantage is that I have to use only certain machines which have enough free RAM to do this. However, once the data is loaded, it is much faster than the other approach.
I was wondering how a database would compare with these approaches. Retrieval would be slower than the RAM approach, but there wouldn't be the overhead requirement. Also, any other ideas would be welcome and I have had others myself (i.e. caching, using a server that has everything loaded into RAM etc.). I might benchmark a database, but I thought I would post here to see what other had to say.
I used Tyler's suggestion. Although in my case I did not think a BTree was necessary. I just hashed the words and their offset. I then could look up a word and read in its vector at runtime. I cached the words as they occurred in text so at most each vector is read in only once, however this saves the overhead of reading in and storing unneeded words, making it superior to the RAM approach.
Just an FYI, I used Java's RamdomAccessFile class and made use of the readLine(), getFilePointer(), and seek() functions.
Thanks to all who contributed to this thread.
For more performance improvement check out buffered RandomAccessFile from:
Apparently the readLine from RandomAccessFile is very slow because it reads byte by byte. This gave me some nice improvement.
As a rule, anything custom coded should be much faster than a generic database, assuming you have coded it efficiently.
There are specific C-libraries to solve this problem using B-trees. In the old days there was a famous library called "B-trieve" that was very popular because it was fast. In this application a B-tree will be faster and easier than fooling around with a database.
If you want optimal performance you would use a data structure called a suffix tree. There are libraries which are designed to create and use suffix trees. This will give you the fastest word lookup possible.
In either case there is no reason to store the entire dataset in memory, just store the B-tree (or suffix tree) with an offset to the data in memory. This will require about 3 to 5 megabytes of memory. When you query the tree you get an offset back. Then open the file, seek forwards to the offset and read the vector off disk.
You could use a simple text based index file just mapping the words to indices, and another file just containing the raw vector data for each word. Initially you just read the index to a hashmap that maps each word to the datafile index and keep it in memory. If you need the data for a word, you calculate the offset in the data file (2000 * 32 * index) and read it as needed. You probably want to cache this data in RAM (if you are in java perhaps just use a weak map as a starting point).
This is basically implementing your own primitive database, but it may still be preferable because it avoidy database setup / deployment complexity.