Question Details

No question body available.

Tags

sql regex postgresql full-text-search

Answers (2)

March 3, 2026 Score: 1 Rep: 30,597 Quality: Medium Completeness: 80%

Combine FTS with pg_trgm and/or fuzzystrmatch.

From the PostgreSQL pg_trgm documentation, F.35.5. Text Search Integration:

Trigram matching is a very useful tool when used in conjunction with a full text index. In particular it can help to recognize misspelled input words that will not be matched directly by the full text search mechanism.

The first step is to generate an auxiliary table containing all the unique words in the documents:

CREATE TABLE words AS SELECT word FROM
       ts_stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');

where documents is a table that has a text field bodytext that we wish to search. The reason for using the simple configuration with the to_tsvector function, instead of using a language-specific configuration, is that we want a list of the original (unstemmed) words.

Next, create a trigram index on the word column:

CREATE INDEX words_idx ON words USING GIN (word gin_trgm_ops);

Now, a SELECT query similar to the previous example can be used to suggest spellings for misspelled words in user search terms. A useful extra test is to require that the selected words are also of similar length to the misspelled word.
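The suggestion query that the documentation refers to looks like this (from the pg_trgm docs, with 'caterpiler' standing in for a misspelled search term):

```sql
SELECT word, similarity(word, 'caterpiler') AS sml
  FROM words
  WHERE word % 'caterpiler'
  ORDER BY sml DESC, word;
```

The `%` operator returns words whose trigram similarity exceeds the threshold (0.3 by default), and the GIN index on `word` makes that filter fast.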

If you'd like to also check for synonyms and conceptual similarity (cold, chilly, freezing, sub-zero, antarctic can all mean just low temperature), then look into pgvector.
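A minimal sketch of the pgvector approach, assuming embeddings are computed outside the database by an application-side model (the table name, column names, and dimension 384 are all illustrative, not part of pgvector itself):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Hypothetical table: each word or phrase stored with an embedding
-- produced by an external model.
CREATE TABLE word_embeddings (
    word text PRIMARY KEY,
    embedding vector(384)
);

-- Nearest neighbours by cosine distance (<=> is pgvector's cosine
-- distance operator); :query_embedding is supplied by the application
-- after embedding the user's search term with the same model.
SELECT word
FROM word_embeddings
ORDER BY embedding <=> :query_embedding
LIMIT 5;
```

With this in place, 'chilly' and 'freezing' land near each other in vector space even though they share no trigrams.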

March 3, 2026 Score: 1 Rep: 1 Quality: Low Completeness: 60%

The Problem

PostgreSQL's text search parser tokenizes 16MB as one token (numword) but 16 MB as two. Tokenization happens in the parser, before any dictionaries are consulted, so no text search configuration can fix it.
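You can see the difference with ts_debug (columns other than alias and token omitted here):

```sql
-- '16MB' comes back as a single numword token
SELECT alias, token FROM ts_debug('simple', '16MB');

-- '16' (uint) and 'MB' (asciiword) come back as two separate tokens
SELECT alias, token FROM ts_debug('simple', '16 MB');
```

A query for "16 MB" can therefore never match a document indexed as "16MB", whichever configuration you pick.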

Without a custom C parser, normalizing text before indexing is the cleanest approach. Wrap it in a function so the regex stays in one place:

create or replace function ts_normalize(text)
returns text language sql immutable strict parallel safe as $$
    -- '16MB' -> '16 MB': insert a space between a number and what follows
    -- 'x-1'  -> 'x 1'  : drop a hyphen between a non-digit and a digit
    select regexp_replace(
               regexp_replace($1, '(\d+)\s*(\D)', '\1 \2', 'g'),
               '(\D)-(\d)', '\1 \2', 'g')
$$;

Then use a generated column so normalization happens at write time, not query time:

create table product (
    id serial primary key,
    description text not null,
    description_normalized text
        generated always as (ts_normalize(description)) stored
);

create index idx_product_fts on product
    using gin (to_tsvector('portugues_sem_acento', description_normalized));

Querying:

select * from product
where to_tsvector('portugues_sem_acento', description_normalized)
   @@ plainto_tsquery('portugues_sem_acento', ts_normalize('16gb mbx'));

The regex lives in one place, the GIN index works normally, and if the rules grow you only update one function. The 'g' flag replaces the 1, 0 position/occurrence parameters from your original call: shorter and equivalent.

Alternative: if the normalization rules keep multiplying, pg_trgm is fairly tolerant of spacing differences since it compares 3-character slices, at the cost of less linguistic precision.
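For example, the two spellings still share trigrams despite the missing space, so trigram similarity is non-zero where strict token matching fails entirely (the `%` match threshold defaults to 0.3 and can be tuned via the pg_trgm.similarity_threshold setting):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Both strings share trigrams such as ' 16' and 'mb ',
-- so this returns a similarity well above zero.
SELECT similarity('16MB', '16 MB');
```

Whether that trade-off is acceptable depends on how much you rely on stemming and stop-word handling from the full text search configuration.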