I have an application with a pretty large active dataset (let's say cars) of around 2 million rows. Each "car" has a multitude of attributes (columns) like price, mileage, year, brand, model, fuel type and so on.
Now, on the /show page for each car in my web application, I need to produce a list of the top 10 most "similar" cars. Since I never know whether a car is a very common or a very rare kind of car before actually running the DB query, I have designed a pattern where I do hardly any filtering (WHERE clauses) in the "similar cars" query. Instead I use a lot of ORDER BY clauses, combined with CASE WHEN expressions based on the data of the car currently in view. Say a user is looking at a Ford Focus, 2010, 30,000 km, Gasoline, 12,490 EUR, located around Düsseldorf. Then I would do something like:
SELECT "cars".*
FROM de."cars"
WHERE ("cars"."id" != 24352543)
AND "cars"."sales_state" = 'onsale'
AND (cars.is_disabled IS NOT TRUE)
ORDER BY
  CASE
    WHEN ABS(cars.price - 12490) < cars.price * 0.2 THEN 1
    WHEN ABS(cars.price - 12490) < cars.price * 0.4 THEN 2
    WHEN ABS(cars.price - 12490) < cars.price * 0.6 THEN 3
    ELSE 4
  END,
  CASE WHEN fuel_type = 'Gasoline' THEN 0 ELSE 1 END,
  ABS(cars.price - 12490),
  CASE
    WHEN ST_Distance(
           ST_GeographyFromText('SRID=4326;POINT(' || cars.longitude || ' ' || cars.latitude || ')'),
           ST_GeographyFromText('SRID=4326;POINT(12.172130 48.162990)')
         ) <= 30000 THEN 1
    WHEN ST_Distance(
           ST_GeographyFromText('SRID=4326;POINT(' || cars.longitude || ' ' || cars.latitude || ')'),
           ST_GeographyFromText('SRID=4326;POINT(12.172130 48.162990)')
         ) <= 100000 THEN 2
    ELSE 3
  END,
  ABS(cars.year - 2010),
  ABS(cars.km - 30000)
LIMIT 10
In reality there are even more ordering clauses.
Now this is handy because no matter how "easy" it is to find 10 relevant cars similar to the current one, the query will always return something. The trouble is that it's slow and, as far as I can tell, almost impossible to index. Running this on 2 million records, even on a well-tuned, fast, dedicated PostgreSQL 11 server (300 GB RAM, 10 SSDs in RAID 10, 32 cores), still takes around 2-4 seconds, time I don't have. I need it down to < 200 ms.
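To see where the time goes, I look at the plan with EXPLAIN (ANALYZE, BUFFERS); with no selective WHERE clause I'd expect a sequential scan feeding a top-N sort over nearly all 2 million rows, and since the ORDER BY expressions change with every viewed car, no static expression index applies:

EXPLAIN (ANALYZE, BUFFERS)
SELECT "cars".*
FROM de."cars"
WHERE ("cars"."id" != 24352543)
AND "cars"."sales_state" = 'onsale'
AND (cars.is_disabled IS NOT TRUE)
ORDER BY
  ABS(cars.price - 12490)  -- ...plus the full ORDER BY list from above
LIMIT 10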
I have been scratching my head over approaches to solving this, but as I'm inexperienced with problems like this at scale, I am unsure which approach would pay off best. Some of the ideas I have:
- Doing the query in an iterative fashion, where I filter (WHERE) on some columns in turns (e.g. starting by limiting the data to a narrow subset of prices) to reduce the dataset. If results are returned, great; else run another, slightly broader query, and so on (see the first sketch after this list).
- Using a completely different kind of algorithm, maybe pre-populating some kind of similarity-measure columns for the cars.
- Utilising some internal PostgreSQL features/extensions that could speed things up, whatever those may be (the second sketch after this list combines this with the pre-populated-columns idea).
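For the iterative idea, a minimal sketch of the first attempt; the band widths (±20% on price, ±2 years) are made-up starting points, and the application would re-run with wider bands whenever fewer than 10 rows come back:

-- Attempt 1: a tight, indexable candidate band around the current car,
-- so the sort only ranks a small set instead of all 2M rows.
SELECT "cars".*
FROM de."cars"
WHERE "cars"."id" != 24352543
  AND "cars"."sales_state" = 'onsale'
  AND cars.is_disabled IS NOT TRUE
  AND cars.price BETWEEN 12490 * 0.8 AND 12490 * 1.2  -- +/- 20% price band
  AND cars.year BETWEEN 2008 AND 2012                 -- +/- 2 years
ORDER BY
  ABS(cars.price - 12490),
  ABS(cars.year - 2010),
  ABS(cars.km - 30000)
LIMIT 10;
-- If this returns fewer than 10 rows, retry with +/- 40% and +/- 4 years,
-- and so on, falling back to the unfiltered query only as a last resort.
-- A btree index such as (sales_state, price) keeps each attempt cheap.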
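For the pre-populated-columns and extensions ideas together, one option I'm considering (untested on my data; the scaling divisors below are purely illustrative weights) is the cube extension, whose Euclidean distance operator <-> can be served by a GiST index for ORDER BY ... LIMIT k-nearest-neighbour queries on PostgreSQL 9.6+:

CREATE EXTENSION IF NOT EXISTS cube;

-- Pre-populate each car as a point in a scaled feature space; the divisors
-- act as per-attribute weights and would need tuning.
ALTER TABLE de.cars ADD COLUMN features cube;
UPDATE de.cars
SET features = cube(array[price / 1000.0, year, km / 10000.0]::float8[]);

CREATE INDEX cars_features_gist ON de.cars USING gist (features);

-- k-NN search: the GiST index can answer ORDER BY <distance> LIMIT n directly,
-- instead of sorting the whole table.
SELECT id
FROM de.cars
ORDER BY features <-> cube(array[12490 / 1000.0, 2010, 30000 / 10000.0]::float8[])
LIMIT 10;

-- Caveat: extra WHERE filters (sales_state, is_disabled) reduce how well the
-- index-ordered scan performs, and categorical preferences like fuel_type
-- would need to be encoded as dimensions or handled separately.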