Using PostgreSQL, I have a very large table of ~250 million rows composed of the following 4 attributes:
CREATE TABLE public.sim_values_english
id bigint NOT NULL DEFAULT nextval('sim_values_english_id_seq'::regclass),
word1 character varying(255),
word2 character varying(255),
sim_value double precision,
CONSTRAINT pk_sim_values_english PRIMARY KEY(id)
I am trying to select the
sim_value attribute based on the defined
word2 using the following SELECT query:
(Assume I have 2 words: X and Y in which X or Y can belong to either word1 or word2)
Where (word1='X' or word2='X') and (word1='Y' or word2='Y') and (word1!=word2)
This is taking on average 1~1.5 minutes to return the
sim_value which is really long! Storing the entire table in memory is very costly since it weights more than 10 GB).
How can I speed up this query? What are your suggestions?
word2 are never identical so if for 1 case:
word1 is X and
word2 is Y, Then there does not exist another row where
word1 is Y and
word2 is X!
NOTE: I have searched for similar topics but none of them addressed this exact issue. Thank you for understanding
Firstly, if you don't have it already, I would ensure that the following index exists:
CREATE INDEX ON sim_values_english(word1, word2);
Then I would try the following query:
SELECT sim_value FROM sim_values_english WHERE word1='X' AND word2='Y' UNION ALL SELECT sim_value FROM sim_values_english WHERE word1='Y' AND word2='X'
So you have to create index on these 2 fields
CREATE INDEX word1_word2_idx ON sim_values_english (word1, word2);
Then your query is very basic
SELECT sim_value FROM sim_values_english WHERE (word1='X' AND word2='Y') OR (word1='Y' AND word2='X')
Start by not storing the words, but only references to them, this will make the bridge-table leaner:
CREATE TABLE words_english ( word_id integer -- or: serial if you want NOT NULL PRIMARY KEY , word varchar UNIQUE ); CREATE TABLE sim_values_english ( word_id integer NOT NULL references words_english (word_id) , other_id integer NOT NULL references words_english (word_id) , sim_value DOUBLE PRECISION NOT NULL DEFAULT 0.0 , PRIMARY KEY (word_id, other_id) ); CREATE UNIQUE INDEX ON sim_values_english(other_id,word_id);
Your query can now be rewitten to:
SELECT sim_value FROM sim_values_english v JOIN words_english one ON v.word_id = one.word_id JOIN words_english two ON v.other_id = two.word_id WHERE one.word IN ('X' ,'Y') AND two.word IN ( 'X', 'Y' ) AND v.word_id <> v.other_id ;
(or use a view to emulate the old table)
Check this older answer for a recipe for squeezing out fat, repetitive columns out of a table.