autocomplete: tune autocorrect algorithm.

Tune autocorrect to produce fewer false positives. Previously we ranked matches by trigram similarity. Now we rank them by Levenshtein edit distance and keep only matches whose distance falls below a dynamic typo threshold that grows with the length of the search string. Trigram similarity could correct large transpositions (e.g. `miku_hatsune` -> `hatsune_miku`), but it was bad at correcting small typos. Levenshtein is good at small typos, but can't correct large transpositions.
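
To illustrate the tradeoff, here is a minimal pure-Ruby sketch of the two measures and of the typo threshold. It is not the code that runs in production: the real comparison happens inside PostgreSQL, and `levenshtein`, `trigrams`, `trigram_similarity` and `max_typo_distance` below are hypothetical helpers written for this example. The trigram helper only approximates pg_trgm (lowercase, split on non-alphanumerics, pad each word with two leading spaces and one trailing space).

```ruby
require "set"

# Classic dynamic-programming Levenshtein distance (insert, delete, substitute).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Approximation of pg_trgm: trigrams are taken per word, so word order is ignored.
def trigrams(string)
  string.downcase.scan(/[[:alnum:]]+/).flat_map do |word|
    padded = "  #{word} "
    (0..padded.length - 3).map { |i| padded[i, 3] }
  end.to_set
end

# Shared trigrams divided by total distinct trigrams.
def trigram_similarity(a, b)
  ta = trigrams(a)
  tb = trigrams(b)
  return 0.0 if ta.empty? && tb.empty?
  (ta & tb).size.to_f / (ta | tb).size
end

# Same formula as the threshold in this commit: a quarter of the length, but at least 3.
def max_typo_distance(name)
  [name.size / 4, 3].max
end

# A swapped word order is a perfect trigram match but far outside the typo threshold.
puts trigram_similarity("miku_hatsune", "hatsune_miku") # 1.0
puts levenshtein("miku_hatsune", "hatsune_miku") < max_typo_distance("miku_hatsune") # false

# A single dropped letter is well inside the threshold.
puts levenshtein("hatsune_miu", "hatsune_miku") # 1
```

Because a match must have a distance strictly below the threshold, a search string of up to 15 characters tolerates at most two edits in this sketch.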
2020-12-13 00:45:22 -06:00
parent 119268e118
commit 6a46aeb55c
5 changed files with 31 additions and 7 deletions
--- a/app/logical/autocomplete_service.rb
+++ b/app/logical/autocomplete_service.rb
@@ -100,7 +100,7 @@ class AutocompleteService
  end

  def tag_autocorrect_matches(string)
-    tags = Tag.nonempty.fuzzy_name_matches(string).order_similarity(string).limit(limit)
+    tags = Tag.nonempty.autocorrect_matches(string).limit(limit)

    tags.map do |tag|
      { type: "tag", label: tag.pretty_name, value: tag.name, category: tag.category, post_count: tag.post_count }
--- a/app/models/tag.rb
+++ b/app/models/tag.rb
@@ -234,16 +234,19 @@ class Tag < ApplicationRecord
  end

  module SearchMethods
+    def autocorrect_matches(name)
+      fuzzy_name_matches(name).order_similarity(name)
+    end
+
    # ref: https://www.postgresql.org/docs/current/static/pgtrgm.html#idm46428634524336
    def order_similarity(name)
-      # trunc(3 * sim) reduces the similarity score from a range of 0.0 -> 1.0 to just 0, 1, or 2.
-      # This groups tags first by approximate similarity, then by largest tags within groups of similar tags.
-      order(Arel.sql("trunc(3 * similarity(name, #{connection.quote(name)})) DESC"), "post_count DESC", "name DESC")
+      order(Arel.sql("levenshtein(left(name, 255), #{connection.quote(name)}), tags.post_count DESC, tags.name ASC"))
    end

    # ref: https://www.postgresql.org/docs/current/static/pgtrgm.html#idm46428634524336
    def fuzzy_name_matches(name)
-      where("tags.name % ?", name)
+      max_distance = [name.size / 4, 3].max
+      where("tags.name % ?", name).where("levenshtein(left(name, 255), ?) < ?", name, max_distance)
    end

    def name_matches(name)