I wanted to find potentially duplicate records in my data, based on having the exact same name and being geographically located within 500 meters of each other.
with potential_duplicates as ( select a.id as one, b.id as two, ST_Distance(a.point, b.point) as distance_m from location a, location b where a.name = b.name and a.id > b.id and ST_Distance(a.point, b.point) < 500 ) select * from potential_duplicates
I'm using a CTE here because it makes it easy to further customize the output with an additional query.
A few tricks in here:
bin order to join against itself to find duplicates
ST_Distance(a.point, b.point) < 500clause returns locations within 500m of each other
a.id > b.idclause solves a problem I had with the first version of this query where each pairing was returned twice, with
twoswapped. By requiring
ato have a higher
bI avoid this problem entirely - and also prevent rows from matching themselves (where
a.id = b.id).
Created 2021-05-19T20:52:04-07:00 · Edit