Fix not being able to use the full set of search operators on polymorphic `model_id` and
`model_type` attributes. Before, things like `search[model_type]=Post` worked, but
`search[model_type_not_eq]=Post` and other `model_type_*` operators didn't.
* Optimize status:modqueue and status:unmoderated searches. This brings them down
from 500ms-1000ms per search to ~5ms.
* Change status:unmoderated so that it only filters out the user's disapproved posts, not
the user's own uploads or past approvals. Now it's equivalent to `status:modqueue -disapproved:evazion`,
whereas before it was equivalent to `status:modqueue -disapproved:evazion -approver:evazion -user:evazion`.
Filtering out the user's own uploads and approvals was slow and usually unnecessary,
since for most users it's rare for their own uploads or approvals to reenter the modqueue.
Before, status:modqueue did this:

    SELECT * FROM posts
    WHERE is_pending = TRUE
       OR is_flagged = TRUE
       OR (is_deleted = TRUE AND id IN (SELECT post_id FROM post_appeals WHERE status = 0))

Now we do this:

    SELECT * FROM posts
    WHERE id IN (
      SELECT id FROM posts WHERE is_pending = TRUE
      UNION ALL
      SELECT id FROM posts WHERE is_flagged = TRUE
      UNION ALL
      SELECT id FROM posts WHERE id IN (SELECT post_id FROM post_appeals WHERE status = 0)
    )
Postgres had a bad time with the "pending or flagged or has a pending appeal" clause because
it didn't know that posts can only be in one state at a time, so it overestimated how many
posts would be returned and chose a seq scan. Replacing the OR with a UNION avoids this.
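For reference, a minimal sketch of how the UNION could be built from ActiveRecord relations (the scope and model names here are illustrative, not necessarily the actual code):

    pending  = Post.where(is_pending: true).select(:id)
    flagged  = Post.where(is_flagged: true).select(:id)
    appealed = Post.where(id: PostAppeal.where(status: 0).select(:post_id)).select(:id)

    Post.where("posts.id IN (#{pending.to_sql} UNION ALL #{flagged.to_sql} UNION ALL #{appealed.to_sql})")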
Support searches like the following:
* score:<0,>100 (equivalent to `score:<0 or score:>100`)
* score:5,10..15,>20 (equivalent to `score:5 or score:10..15 or score:>20`)
* id:5,10..15,20..25,30 (equivalent to `id:5 or id:10..15 or id:20..25 or id:30`)
This also works inside the `search[id]` URL parameter, and inside other numeric
URL search parameters.
Factor out the code that parses metatag values (e.g. `score:>5`) and
search URL params (e.g. `search[score]=>5`) into a RangeParser class.
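As an illustrative sketch (not the actual RangeParser API), the parsing splits the value on commas, turns each piece into a single value or a range, and ORs the resulting conditions together:

    # Illustrative only; assumes integer values and a `score` column.
    def parse_ranges(string)
      string.split(",").map do |piece|
        case piece
        when /\A(\d+)\.\.(\d+)\z/ then ($1.to_i..$2.to_i)
        when /\A>(\d+)\z/         then (($1.to_i + 1)..)
        when /\A<(\d+)\z/         then (..($1.to_i - 1))
        when /\A(\d+)\z/          then $1.to_i
        end
      end
    end

    parse_ranges("5,10..15,>20")  # => [5, 10..15, 21..]
    parse_ranges("5,10..15,>20").map { |r| Post.where(score: r) }.reduce(:or)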
Also fix a bug where using the `search[order]=custom` param without a
`search[id]` param raised an exception. Fix another bug where, if an
invalid `search[id]` was provided, the custom order was ignored and the
results were returned in the default order instead. Now if you use
`search[order]=custom` without a valid `search[id]` param, the search
will return no results.
Fix the approver:, parent:, and pixiv: metatags not working correctly when negated:
* Fix -approver:<name> not including posts that don't have an approver (the approver_id is NULL)
* Fix -parent:<id> not including posts that don't have a parent (the parent_id is NULL)
* Fix -pixiv:<id> not including posts that aren't from Pixiv (the pixiv_id is NULL)
The problem lies in how the equality operator is negated when the column contains NULL values:
`approver_id != 52664` doesn't match posts where `approver_id` is NULL.
The search `approver:evazion` boils down to:
# Post.where(approver_id: 52664).to_sql
SELECT * FROM posts WHERE approver_id = 52664;
When that is negated with `-approver:evazion`, it becomes:
# Post.where(approver_id: 52664).invert_where.to_sql
SELECT * FROM posts WHERE approver_id != 52664;
But in SQL, `approver_id != 52664` doesn't match when the approver_id IS NULL, so the search doesn't
include posts without an approver.
We could use `a IS NOT DISTINCT FROM b` instead of `a = b`:
# Post.where(Post.arel_table[:approver_id].is_not_distinct_from(52664)).to_sql
SELECT * FROM posts WHERE approver_id IS NOT DISTINCT FROM 52664;
This way when it's inverted it becomes `IS DISTINCT FROM`:
# Post.where(Post.arel_table[:approver_id].is_not_distinct_from(52664)).invert_where.to_sql
SELECT * FROM posts WHERE approver_id IS DISTINCT FROM 52664;
`approver_id IS DISTINCT FROM 52664` is like `approver_id != 52664`, except it matches when
approver_id is NULL [1].
This works correctly. The problem, however, is that `IS NOT DISTINCT FROM` can't use indexes, due to
a long-standing Postgres limitation [2]. This makes searches too slow. So instead we do this:
# Post.where(approver_id: 52664).where.not(approver_id: nil).to_sql
SELECT * FROM posts WHERE approver_id = 52664 AND approver_id IS NOT NULL;
That way when negated it becomes:
# Post.where(approver_id: 52664).where.not(approver_id: nil).invert_where.to_sql
SELECT * FROM posts WHERE approver_id != 52664 OR approver_id IS NULL;
Which is the correct behavior.
[1] https://modern-sql.com/feature/is-distinct-from
[2] https://www.postgresql.org/message-id/6FC83909-5DB1-420F-9191-DBE533A3CEDE@excoventures.com
Standardize it so that all fields of type `text` are searchable with
`search[<field>_matches]`.
Before, the `<field>_matches` param was handled manually and some fields
were left out or handled inconsistently. Now it applies to all columns
of type `text`.
This does a full-text search on the field, so for example, searching
`/artist_commentaries?search[translated_description_matches]=smiling`
will match translated commentaries containing any of the words "smiling",
"smiles", "smiled", or "smile".
Note that this only applies to columns defined as type `text`, not to
columns defined as `character varying`. The difference is that `text` is
used for fields containing free-form natural language, such as comments,
notes, forum posts, wiki pages, pool descriptions, etc., while `character
varying` is used for short strings not containing free-form language,
such as tag names, wiki page titles, URLs, status fields, etc.
API changes:
* Add the `search[original_title_matches]`, `search[original_description_matches]`,
`search[translated_title_matches]`, `search[translated_description_matches]` params
to /artist_commentaries and /artist_commentary_versions.
* Remove the `search[name_matches]` and `search[group_name_matches]` params from /artist_versions.
* Remove the `search[title_matches]` param from /wiki_page_versions.
* Change the `search[name_matches]` param on /pools, /favorite_groups, and /pool_versions
to do a full-text search instead of a substring match.
Add a `visible_for_search` method to ApplicationPolicy that lets us
define which fields a user is allowed to search for.
For example, when a normal user searches for post flags by flagger name,
they're only allowed to see their own flags, not flags by other users.
But when a mod searches for flags by flagger name, they're allowed to
see all flags, except for flags on their own uploads.
This framework lets us define these rules in the `visible_for_search`
method in the model's policy class, rather than as special cases in the
`search` method of each model.
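A rough sketch of what such a policy method might look like (the method signature, model, and helper names here are illustrative, not the actual code):

    class PostFlagPolicy < ApplicationPolicy
      # Restrict which flags are visible when searching by flagger.
      def visible_for_search(flags, attribute)
        return flags unless attribute == :creator_id

        if user.is_moderator?
          flags.where.not(post_id: user.posts)  # mods see everything except flags on their own uploads
        else
          flags.where(creator_id: user.id)      # normal users only see their own flags
        end
      end
    end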
Move the `search_attributes` helper methods inside the Searchable
concern into a SearchContext helper class. This is so we don't have to
pass `params` and `current_user` down through a bunch of different
helper methods, and so that these private helper methods don't pollute
the object's namespace.
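The rough shape of the refactor, as an illustrative sketch (names are not necessarily the real ones):

    # Bundles the relation, search params, and current user together so the
    # private helpers don't need them passed in explicitly.
    class SearchContext
      attr_reader :relation, :params, :current_user

      def initialize(relation, params, current_user)
        @relation = relation
        @params = params
        @current_user = current_user
      end

      def search_attributes(*names)
        names.reduce(relation) do |rel, name|
          params[name].present? ? rel.where(name => params[name]) : rel
        end
      end
    end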
Switch autocomplete to match individual words in the tag, instead of
only matching the start of the tag.
For example, "hair" matches any tag containing the word "hair", not just tags
starting with "hair". "long_hair" matches all tags containing the words "long"
and "hair", which includes "very_long_hair" and "absurdly_long_hair".
Words can be in any order and words can be left out. So "closed_eye" matches
"one_eye_closed". "asuka_langley_souryuu" matches "souryuu_asuka_langley".
This has several advantages:
* You can search characters by first name. For example, "miku" matches "hatsune_miku".
"zelda" matches both "princess_zelda" and "the_legend_of_zelda".
* You can find the right tag even if you get the word order wrong, or forget a word.
For example, "eyes_closed" matches "closed_eyes". "hair_over_eye" matches "hair_over_one_eye".
* You can find more related tags. For example, searching "skirt" shows all tags
containing the word "skirt", not just tags starting with "skirt".
The downside is this may break muscle memory by changing the autocomplete order of
some tags. This is an acceptable trade-off.
You can get the old behavior by writing a "*" at the end of the tag. For
example, searching "skirt*" gives the same results as before.
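Conceptually, the matching works like the sketch below (purely illustrative, not how the site actually implements it): every word in the query must prefix-match some word in the tag, in any order.

    def words(name)
      name.split("_")
    end

    def word_match?(query, tag)
      words(query).all? { |q| words(tag).any? { |w| w.start_with?(q) } }
    end

    word_match?("closed_eye", "one_eye_closed")                    # => true
    word_match?("zelda", "the_legend_of_zelda")                    # => true
    word_match?("asuka_langley_souryuu", "souryuu_asuka_langley")  # => true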
Add a database model for storing AI-predicted tags, and add a UI for browsing and searching these tags.
AI tags are generated by the Danbooru Autotagger (https://github.com/danbooru/autotagger). See that
repo for details about the model.
The database schema is `ai_tags (media_asset_id integer, tag_id integer, score smallint)`. This is
designed to be as space-efficient as possible, since in production we have over 300 million
AI-generated tags (6 million images and 50 tags per post). This amounts to over 10GB in size, plus
indexes.
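In migration form the table looks roughly like this (a sketch; anything beyond the columns stated above is an assumption):

    create_table :ai_tags, id: false do |t|
      t.integer :media_asset_id, null: false
      t.integer :tag_id,         null: false
      t.integer :score,          null: false, limit: 2  # smallint
    end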
You can search for AI tags using e.g. `ai:scenery`. You can do `ai:scenery -scenery` to find posts
where the scenery tag is potentially missing, or `scenery -ai:scenery` to find posts that are
potentially mistagged (or more likely where the AI missed the tag).
You can browse AI tags at https://danbooru.donmai.us/ai_tags. On this page you can filter by
confidence level. You can also search unposted media assets by AI tag.
To generate tags, use the `autotag` script from the Autotagger repo, something like this:
docker run --rm -v ~/danbooru/public/data/360x360:/images ghcr.io/danbooru/autotagger ./autotag -c -f /images | gzip > tags.csv.gz
To import tags, use the fix script in script/fixes/. Expect a Danbooru-size dataset to take
hours to days to generate tags, then 20-30 minutes to import. Currently this all has to be done by hand.
Switch the post search engine to the new PostQuery parser. The new
engine fully supports AND, OR, and NOT operators, as well as grouping
expressions with parentheses.
Highlights:
New OR operator:
* `skirt or dress` (same as `~skirt ~dress`)
Tags can be grouped with parentheses:
* `1girl (skirt or dress)`
* `(blonde_hair blue_eyes) or (red_hair green_eyes)`
* `~(blonde_hair blue_eyes) ~(red_hair green_eyes)` (same as above)
* `(pantyhose or thighhighs) (black_legwear or brown_legwear)`
* `(~pantyhose ~thighhighs) (~black_legwear ~brown_legwear)` (same as above)
Metatags can be OR'd together:
* `user:evazion or fav:evazion`
* `~user:evazion ~fav:evazion`
Wildcard tags can be combined with either AND or OR:
* `black_* white_*` (find posts with at least one black_* tag AND one white_* tag)
* `black_* or white_*` (find posts with at least one black_* tag OR one white_* tag)
* `~black_* ~white_*` (same as above)
See 4c7cfc73 for more syntax examples.
Fixes #4949: And+or search?
Fixes #5056: Wildcard searches return unexpected results when combined with OR searches
Add the ability to search jobs on the /jobs page by job type or by status.
Fixes #2577 (Search filters for delayed jobs). This wasn't possible
before with DelayedJob because it stored the job data in a YAML string,
which made it difficult to search jobs by type. GoodJob stores job data
in a JSON object, which is easier to search in Postgres.
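With the job data in JSON, a search by job type boils down to a simple JSON field lookup. A sketch (assuming GoodJob's `good_jobs.serialized_params` column; the job class name is hypothetical):

    SELECT * FROM good_jobs
    WHERE serialized_params->>'job_class' = 'ProcessUploadJob';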
Optimize searches using the `search[user_name]=...` URL parameter. If
we're not doing a wildcard search, then do a regular user lookup, which
generates better SQL.
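Roughly, the idea is (an illustrative sketch, not the actual code):

    if params[:user_name].include?("*")
      relation.where(user_id: User.where("name LIKE ?", params[:user_name].tr("*", "%")))
    else
      relation.where(user_id: User.find_by(name: params[:user_name]))
    end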
Refactor full-text search on several tables (comments, dmails,
forum_posts, forum_topics, notes, and wiki_pages) to use to_tsvector
expression indexes instead of dedicated tsvector columns. This way
full-text search works the same way across all tables.
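Such an expression index looks like this (example for comments; the index name is illustrative):

    CREATE INDEX index_comments_on_to_tsvector_body
    ON comments
    USING gin (to_tsvector('english', body));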
API changes:
* Changed /wiki_pages.json?search[body_matches] to match against only
the body. Before `body_matches` matched against both the title and the body.
* Added /wiki_pages.json?search[title_or_body_matches] to match against
both the title and the body.
* Fixed /dmails.json?search[message_matches] to match against both the
title and body when doing a wildcard search. Before a wildcard search
only matched against the body.
* Added /dmails.json?search[body_matches] to match against only the dmail body.
Change the wiki_pages tsvector_update_trigger to use
`pg_catalog.english` instead of `public.danbooru`. This changes how wiki
page text is parsed for full-text search to use the standard English
parser instead of test_parser. This is to prepare for dropping
test_parser. Using test_parser here was wrong anyway because it meant
that punctuation wasn't removed from words when indexing wiki pages for
full-text search.
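For reference, the trigger looks something like this (the trigger and column names here are illustrative):

    CREATE TRIGGER trigger_wiki_pages_on_update_for_index_update
    BEFORE INSERT OR UPDATE ON wiki_pages
    FOR EACH ROW EXECUTE PROCEDURE
      tsvector_update_trigger(body_index, 'pg_catalog.english', body);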
Use the `string_to_array(tag_string, ' ')` index instead of the
`tag_index` for tag searches. The string_to_array index lets us treat
the tag_string as an array for searching purposes. This lets us get rid
of the tag_index column and the test_parser dependency in the future.
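The index and a search against it look roughly like this (the index name is illustrative):

    CREATE INDEX index_posts_on_string_to_array_tag_string
    ON posts
    USING gin (string_to_array(tag_string, ' '));

    -- A search for the long_hair tag can then use the array containment operator:
    SELECT * FROM posts WHERE string_to_array(tag_string, ' ') @> '{long_hair}';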
Fix exception during https://danbooru.donmai.us/posts/random?tags=ordfav:nonamethanks
Before we were doing a query like this:
SELECT
"posts".*
FROM
"posts"
INNER JOIN
"favorites" ON "favorites"."post_id" = "posts"."id"
WHERE
(favorites.user_id % 100 = 64 AND favorites.user_id = 52664)
AND "posts"."id" = 343894
ORDER BY
favorites.id DESC,
posts.id DESC,
ID=343894 DESC
but `ID=? DESC` is ambiguous during an ordfav: search because of the
join on the favorites table. The fix is to qualify the reference as
`posts.id`.
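With the fix, the ORDER BY is unambiguous:

    ORDER BY
      favorites.id DESC,
      posts.id DESC,
      posts.id = 343894 DESC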
Changes to bans:
* Change the `expires_at` field to `duration`.
* Make moderators choose from a fixed set of standard ban lengths,
instead of allowing arbitrary ban lengths.
* List `duration` in seconds in the /bans.json API.
* Dump bans to BigQuery.
Note that some old bans have a negative duration. This is because their
expiration date was before their creation date, which is because in 2013
bans were migrated to Danbooru 2 and the original ban creation dates
were lost.
Optimize searches for non-English phrases in autocomplete. These
searches were pretty slow, and could sometimes cause sitewide lag spikes
when users typed long strings of non-English text into the search box,
causing an unintentional DoS.
The trick is to use an `array_to_tsvector(other_names) USING gin` index
on other_names. This supports fast string prefix matching against all
elements of the array. The downside is that it doesn't allow infix or
suffix matches, so we can't support wildcards in general. Wildcards
didn't quite work anyway, since artist and wiki other names can contain
literal '*' characters.
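The index and a prefix search against it look roughly like this (the index name and query form are illustrative):

    CREATE INDEX index_artists_on_array_to_tsvector_other_names
    ON artists
    USING gin (array_to_tsvector(other_names));

    -- Prefix-match the query against every element of other_names:
    SELECT * FROM artists
    WHERE array_to_tsvector(other_names) @@ to_tsquery('simple', 'tou:*');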
Add tracking of certain important user actions. These events include:
* Logins
* Logouts
* Failed login attempts
* Account creations
* Account deletions
* Password reset requests
* Password changes
* Email address changes
This is similar to the mod actions log, but for account activity
related to a single user.
The information tracked includes the user, the event type (login,
logout, etc), the timestamp, the user's IP address, IP geolocation
information, the user's browser user agent, and the user's session ID
from their session cookie. This information is visible to mods only.
This is done with three models. The UserEvent model tracks the event
type (login, logout, password change, etc) and the user. The UserEvent
is tied to a UserSession, which contains the user's IP address and
browser metadata. Finally, the IpGeolocation model contains the
geolocation information for IPs, including the city, country, ISP, and
whether the IP is a proxy.
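As a rough sketch, the models line up like this (illustrative; the actual column names and event categories may differ):

    class UserEvent < ApplicationRecord
      belongs_to :user
      belongs_to :user_session
      # category: login, logout, failed_login, password_change, ...
    end

    class UserSession < ApplicationRecord
      has_many :user_events
      # ip_addr, session_id, user_agent
    end

    class IpGeolocation < ApplicationRecord
      # ip_addr, city, country, ISP, is_proxy
    end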
This tracking will be used for a few purposes:
* Letting users view their account history, to detect things like logins
from unrecognized IPs, failed login attempts, password changes, etc.
* Rate limiting failed login attempts.
* Detecting sockpuppet accounts using their login history.
* Detecting unauthorized account sharing.
Expand the tag abbreviation system introduced in b0be8ae45 so that it
works in searches and when tagging posts, not just in autocomplete.
For example, you can tag a post with /evth and it will add the tag
eyebrows_visible_through_hair. You can search for /evth and it will
search for the tag eyebrows_visible_through_hair.
Some more examples:
* /ops is short for one-piece_swimsuit
* /hooe is short for hair_over_one_eye
* /saol is short for standing_on_one_leg
* /tlozbotw is short for the_legend_of_zelda:_breath_of_the_wild
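The abbreviation is just the first letter of each word in the tag name; roughly (an illustrative sketch, assuming words are split on underscores, hyphens, and colons):

    def abbreviation(tag_name)
      "/" + tag_name.split(/[_:-]+/).map { |word| word[0] }.join
    end

    abbreviation("eyebrows_visible_through_hair")            # => "/evth"
    abbreviation("one-piece_swimsuit")                       # => "/ops"
    abbreviation("the_legend_of_zelda:_breath_of_the_wild")  # => "/tlozbotw"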
If two tags have the same abbreviation, then the larger tag takes
precedence. For example, /be is short for blue_eyes, not brown_eyes,
because blue_eyes is the bigger tag.
If there is an existing shortcut alias that conflicts with the
abbreviation, then the alias takes precedence. For example, /sh is short
for suzumiya_haruhi, not short_hair, because there's an old alias for
/sh -> suzumiya_haruhi.