Add bar charts for non-timeseries data. For example, a bar chart of the
top 10 uploaders overall in the last month, rather than a timeseries
chart of the number of uploads per day for the last month.
Add ability to group reports by various columns. For example, you can see
the posts by the top 10 uploaders over time, or posts grouped by rating
over time.
Fix not being able to use the full set of search operators on the polymorphic `model_id` and
`model_type` attributes. Before, things like `search[model_type]=Post` worked, but
`search[model_type_not_eq]=Post` and other `model_type_*` operators didn't.
* Optimize status:modqueue and status:unmoderated searches. This brings them down from
taking 500ms-1000ms per search to ~5ms.
* Change status:unmoderated so that it only filters out the user's disapproved posts, not
the user's own uploads or past approvals. Now it's equivalent to `status:modqueue -disapproved:evazion`,
whereas before it was equivalent to `status:modqueue -disapproved:evazion -approver:evazion -user:evazion`.
Filtering out the user's own uploads and approvals was slow and usually unnecessary,
since for most users it's rare for their own uploads or approvals to reenter the modqueue.
Before, status:modqueue did this:
SELECT * FROM posts
WHERE is_pending = TRUE
   OR is_flagged = TRUE
   OR (is_deleted = TRUE AND id IN (SELECT post_id FROM post_appeals WHERE status = 0))
Now we do this:
SELECT * FROM posts
WHERE id IN (
  SELECT id FROM posts WHERE is_pending = TRUE
  UNION ALL
  SELECT id FROM posts WHERE is_flagged = TRUE
  UNION ALL
  SELECT id FROM posts WHERE id IN (SELECT post_id FROM post_appeals WHERE status = 0)
)
Postgres had a bad time with the "pending or flagged or has a pending appeal" clause because
it didn't know that posts can only be in one state at a time, so it overestimated how many
posts would be returned and chose a seq scan. Replacing the OR with a UNION avoids this.
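As a rough sketch, the same rewrite can be expressed in ActiveRecord like this (the scope and model
names follow the SQL above; the string interpolation is illustrative, not the actual Danbooru code):
# Each branch is a simple, indexable condition; UNION ALL spares Postgres from
# having to estimate the selectivity of the combined OR clause.
pending  = Post.where(is_pending: true).select(:id)
flagged  = Post.where(is_flagged: true).select(:id)
appealed = Post.where(id: PostAppeal.where(status: 0).select(:post_id)).select(:id)
union_sql = [pending, flagged, appealed].map(&:to_sql).join(" UNION ALL ")
Post.where("posts.id IN (#{union_sql})")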
Support searches like the following:
* score:<0,>100 (equivalent to `score:<0 or score:>100`)
* score:5,10..15,>20 (equivalent to `score:5 or score:10..15 or score:>20`)
* id:5,10..15,20..25,30 (equivalent to `id:5 or id:10..15 or id:20..25 or id:30`)
This also works inside the `search[id]` URL parameter, and inside other numeric
URL search parameters.
Factor out the code that parses metatag values (e.g. `score:>5`) and
search URL params (e.g. `search[score]=>5`) into a RangeParser class.
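As a rough illustration of the kind of parsing involved (the method name and return values here are
hypothetical, not the actual RangeParser API):
# Split a value like "5,10..15,>20" into an array of Ruby ranges.
def parse_ranges(string)
  string.split(",").map do |part|
    case part
    when /\A(\d+)\.\.(\d+)\z/ then $1.to_i..$2.to_i
    when /\A>(\d+)\z/ then ($1.to_i + 1..)
    when /\A<(\d+)\z/ then (..$1.to_i - 1)
    when /\A(\d+)\z/ then $1.to_i..$1.to_i
    else raise ArgumentError, "invalid range: #{part}"
    end
  end
end
parse_ranges("5,10..15,>20") # => [5..5, 10..15, 21..]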
Also fix a bug where using the `search[order]=custom` param without a
`search[id]` param raised an exception. Fix another bug where, if an
invalid `search[id]` was provided, the custom order was ignored and the
search was returned in the default order instead. Now, if you use
`search[order]=custom` without a valid `search[id]` param, the search
returns no results.
Fix the approver:, parent:, and pixiv: metatags not working correctly when negated:
* Fix -approver:<name> not including posts that don't have an approver (the approver_id is NULL)
* Fix -parent:<id> not including posts that don't have a parent (the parent_id is NULL)
* Fix -pixiv:<id> not including posts that aren't from Pixiv (the pixiv_id is NULL)
The problem lies in how the equality operator is negated when the column contains NULL values:
`approver_id != 52664` doesn't match posts where the `approver_id` is NULL.
The search `approver:evazion` boils down to:
# Post.where(approver_id: 52664).to_sql
SELECT * FROM posts WHERE approver_id = 52664;
When that is negated with `-approver:evazion`, it becomes:
# Post.where(approver_id: 52664).invert_where.to_sql
SELECT * FROM posts WHERE approver_id != 52664;
But in SQL, `approver_id != 52664` doesn't match when the approver_id IS NULL, so the search doesn't
include posts without an approver.
We could use `a IS NOT DISTINCT FROM b` instead of `a = b`:
# Post.where(Post.arel_table[:approver_id].is_not_distinct_from(52664)).to_sql
SELECT * FROM posts WHERE approver_id IS NOT DISTINCT FROM 52664;
This way when it's inverted it becomes `IS DISTINCT FROM`:
# Post.where(Post.arel_table[:approver_id].is_not_distinct_from(52664)).invert_where.to_sql
SELECT * FROM posts WHERE approver_id IS DISTINCT FROM 52664;
`approver_id IS DISTINCT FROM 52664` is like `approver_id != 52664`, except it matches when
approver_id is NULL [1].
This works correctly. The problem, however, is that `IS NOT DISTINCT FROM` can't use indexes, due to
a long-standing Postgres limitation [2], which makes searches too slow. So instead we do this:
# Post.where(approver_id: 52664).where.not(approver_id: nil).to_sql
SELECT * FROM posts WHERE approver_id = 52664 AND approver_id IS NOT NULL;
That way when negated it becomes:
# Post.where(approver_id: 52664).where.not(approver_id: nil).invert_where.to_sql
SELECT * FROM posts WHERE approver_id != 52664 OR approver_id IS NULL;
Which is the correct behavior.
[1] https://modern-sql.com/feature/is-distinct-from
[2] https://www.postgresql.org/message-id/6FC83909-5DB1-420F-9191-DBE533A3CEDE@excoventures.com
Standardize it so that all fields of type `text` are searchable with
`search[<field>_matches]`.
Before, the `<field>_matches` param was handled manually and some fields
were left out or handled inconsistently. Now it applies to all columns
of type `text`.
This does a full-text search on the field. For example, searching
`/artist_commentaries?search[translated_description_matches]=smiling`
will match translated commentaries containing any of the words "smiling",
"smiles", "smiled", or "smile".
Note that this only applies to columns defined as type `text`, not to
columns defined as `character varying`. The difference is that `text` is
used for fields containing free-form natural language, such as comments,
notes, forum posts, wiki pages, pool descriptions, etc, while `character
varying` is used for short strings not containing free-form language,
such as tag names, wiki page titles, urls, status fields, etc.
API changes:
* Add the `search[original_title_matches]`, `search[original_description_matches]`,
`search[translated_title_matches]`, `search[translated_description_matches]` params
to /artist_commentaries and /artist_commentary_versions.
* Remove the `search[name_matches]` and `search[group_name_matches]` params from /artist_versions.
* Remove the `search[title_matches]` param from /wiki_page_versions.
* Change the `search[name_matches]` param on /pools, /favorite_groups, and /pool_versions
to do a full-text search instead of a substring match.
Add a `visible_for_search` method to ApplicationPolicy that lets us
define which records a user is allowed to see when searching by certain fields.
For example, when a normal user searches for post flags by flagger name,
they're only allowed to see their own flags, not flags by other users.
But when a mod searches for flags by flagger name, they're allowed to
see all flags, except for flags on their own uploads.
This framework lets us define these rules in the `visible_for_search`
method in the model's policy class, rather than as special cases in the
`search` method of each model.
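A loose sketch of the idea, with a hypothetical method body rather than the real PostFlagPolicy code:
class PostFlagPolicy < ApplicationPolicy
  # Called when a search filters on the given attribute; returns the subset of
  # records the current user is allowed to see for that search.
  def visible_for_search(relation, attribute)
    case attribute
    when :creator_id, :creator_name
      if user.is_moderator?
        relation.where.not(post: user.posts) # mods see all flags except those on their own uploads
      else
        relation.where(creator_id: user.id)  # normal users only see their own flags
      end
    else
      relation
    end
  end
end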
Move the `search_attributes` helper methods inside the Searchable
concern into a SearchContext helper class. This is so we don't have to
pass `params` and `current_user` down through a bunch of different
helper methods, and so that these private helper methods don't pollute
the object's namespace.
Track the history of the tag `category` and `is_deprecated` fields in
the `tag_versions` table.
Add generic Versionable and VersionFor concerns that encapsulate most
of the history-tracking logic. These concerns are designed to make it
easy to add history tracking to any model.
There are a couple of notable differences between tag versions and other versions:
* There is no 1 hour edit merge window. Every change to the `category`
and `is_deprecated` fields produces a new version in the tag history.
* Versions aren't created when a tag is created. The tag's initial
version isn't created until *after* the tag is edited for the first time.
For example, if you change the category of a tag that was last updated
10 years ago, that will create an initial version of the tag backdated
to 10 years ago, plus a new version for your edit.
This is for a few reasons:
* So that we don't have to create new tag versions every time a new tag
is created. This would be wasteful because most tags never have their
category or deprecation status change.
* So that if you make a typo tag, your name isn't recorded in the tag's
history forever.
* So that we can create new tags in various places without having to know
who created the tag (which may be unknown if the current user isn't set).
* Because we don't know the full history of most tags, we have to
deal with incomplete histories anyway.
This has a few important consequences:
* Most tags won't have any tag versions. They only gain tag versions if
they're edited.
* You can't track /tag_versions to see newly created tags. It only
shows changes to already existing tags.
* Tag version IDs won't be in strict chronological order. Higher IDs may
have created_at timestamps before lower IDs. For example, if you
change the category of a tag that is 10 years old, that will create an
initial version with a high ID, but with a created_at timestamp dated
to 10 years ago.
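A loose sketch of the backdating logic (assuming a `versions` association; the real Versionable and
VersionFor concerns are more general and their API differs):
# Called after saving a tag whose category or is_deprecated changed.
def create_tag_versions!(tag)
  if tag.versions.none?
    # First edit ever: record the tag's previous state as the initial version,
    # backdated to the time the tag was last updated before this edit.
    tag.versions.create!(
      category: tag.category_before_last_save,
      is_deprecated: tag.is_deprecated_before_last_save,
      created_at: tag.updated_at_before_last_save
    )
  end
  # Then record a new version for the edit itself.
  tag.versions.create!(category: tag.category, is_deprecated: tag.is_deprecated)
end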
Fixes #4402: Track tag category changes
Switch autocomplete to match individual words in the tag, instead of
only matching the start of the tag.
For example, "hair" matches any tag containing the word "hair", not just tags
starting with "hair". "long_hair" matches all tags containing the words "long"
and "hair", which includes "very_long_hair" and "absurdly_long_hair".
Words can be in any order and words can be left out. So "closed_eye" matches
"one_eye_closed". "asuka_langley_souryuu" matches "souryuu_asuka_langley".
This has several advantages:
* You can search characters by first name. For example, "miku" matches "hatsune_miku".
"zelda" matches both "princess_zelda" and "the_legend_of_zelda".
* You can find the right tag even if you get the word order wrong, or forget a word.
For example, "eyes_closed" matches "closed_eyes". "hair_over_eye" matches "hair_over_one_eye".
* You can find more related tags. For example, searching "skirt" shows all tags
containing the word "skirt", not just tags starting with "skirt".
The downside is this may break muscle memory by changing the autocomplete order of
some tags. This is an acceptable trade-off.
You can get the old behavior by writing a "*" at the end of the tag. For
example, searching "skirt*" gives the same results as before.
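Conceptually, the word matching described above works along these lines (a simplified Ruby sketch;
the real matching runs as a database query and also handles ranking, aliases, etc.):
# Each word of the query must match the start of some word in the tag.
def word_match?(query, tag)
  query.split("_").all? do |query_word|
    tag.split("_").any? { |tag_word| tag_word.start_with?(query_word) }
  end
end
word_match?("closed_eye", "one_eye_closed")                   # => true
word_match?("asuka_langley_souryuu", "souryuu_asuka_langley") # => true
word_match?("hair", "very_long_hair")                         # => true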
Add a database model for storing AI-predicted tags, and add a UI for browsing and searching these tags.
AI tags are generated by the Danbooru Autotagger (https://github.com/danbooru/autotagger). See that
repo for details about the model.
The database schema is `ai_tags (media_asset_id integer, tag_id integer, score smallint)`. This is
designed to be as space-efficient as possible, since in production we have over 300 million
AI-generated tags (6 million images with ~50 tags each). This amounts to over 10GB of data, plus
indexes.
You can search for AI tags using e.g. `ai:scenery`. You can do `ai:scenery -scenery` to find posts
where the scenery tag is potentially missing, or `scenery -ai:scenery` to find posts that are
potentially mistagged (or more likely where the AI missed the tag).
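Conceptually, an `ai:scenery -scenery` search boils down to something like this (the join columns
here are assumptions; the real query is generated by the post search engine):
-- Posts the AI tagged as scenery that don't actually have the scenery tag.
SELECT posts.*
FROM posts
JOIN ai_tags ON ai_tags.media_asset_id = posts.media_asset_id
JOIN tags ON tags.id = ai_tags.tag_id
WHERE tags.name = 'scenery'
  AND NOT ('scenery' = ANY (string_to_array(posts.tag_string, ' ')));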
You can browse AI tags at https://danbooru.donmai.us/ai_tags. On this page you can filter by
confidence level. You can also search unposted media assets by AI tag.
To generate tags, use the `autotag` script from the Autotagger repo, something like this:
docker run --rm -v ~/danbooru/public/data/360x360:/images ghcr.io/danbooru/autotagger ./autotag -c -f /images | gzip > tags.csv.gz
To import tags, use the fix script in script/fixes/. Expect a Danbooru-size dataset to take
hours to days to generate tags, then 20-30 minutes to import. Currently this all has to be done by hand.
Switch the post search engine over to the new PostQuery parser. The new
engine fully supports AND, OR, and NOT operators, as well as grouping
expressions with parentheses.
Highlights:
New OR operator:
* `skirt or dress` (same as `~skirt ~dress`)
Tags can be grouped with parentheses:
* `1girl (skirt or dress)`
* `(blonde_hair blue_eyes) or (red_hair green_eyes)`
* `~(blonde_hair blue_eyes) ~(red_hair green_eyes)` (same as above)
* `(pantyhose or thighhighs) (black_legwear or brown_legwear)`
* `(~pantyhose ~thighhighs) (~black_legwear ~brown_legwear)` (same as above)
Metatags can be OR'd together:
* `user:evazion or fav:evazion`
* `~user:evazion ~fav:evazion`
Wildcard tags can be combined with either AND or OR:
* `black_* white_*` (find posts with at least one black_* tag AND one white_* tag)
* `black_* or white_*` (find posts with at least one black_* tag OR one white_* tag)
* `~black_* ~white_*` (same as above)
See 4c7cfc73 for more syntax examples.
Fixes #4949: And+or search?
Fixes #5056: Wildcard searches return unexpected results when combined with OR searches
Add the ability to search jobs on the /jobs page by job type or by status.
Fixes #2577 (Search filters for delayed jobs). This wasn't possible
before with DelayedJob because it stored the job data in a YAML string,
which made it difficult to search jobs by type. GoodJob stores job data
in a JSON object, which is easier to search in Postgres.
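For example, with GoodJob the job class lives inside a jsonb column, so filtering by type is a plain
JSON lookup (a sketch assuming GoodJob's `serialized_params` jsonb column; the job class name here is
made up):
SELECT * FROM good_jobs
WHERE serialized_params->>'job_class' = 'ProcessUploadJob';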
Optimize searches using the `search[user_name]=...` URL parameter. If
we're not doing a wildcard search, then do a regular user lookup, which
generates better SQL.
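A rough sketch of the idea (the method name here is illustrative, not the actual Danbooru code):
# Only fall back to the slower wildcard name match when the value contains a wildcard.
# NOTE: a real implementation would also escape LIKE metacharacters in the name.
def search_by_user_name(relation, name)
  if name.include?("*")
    relation.where(user_id: User.where("name ILIKE ?", name.tr("*", "%")).select(:id))
  else
    user = User.find_by(name: name)
    user ? relation.where(user_id: user.id) : relation.none
  end
end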
Make private favorites and upvotes a Gold-only account option.
Existing Members with private favorites enabled are allowed to keep it
enabled, as long as they don't disable it. If they disable it, they
can't re-enable it without upgrading to Gold first.
This is a Gold-only option to prevent uploaders from creating multiple
accounts to upvote their own posts. If private upvotes were allowed for
Members, then it would be too easy to use fake accounts and private
upvotes to upvote your own posts.
Refactor full-text search on several tables (comments, dmails,
forum_posts, forum_topics, notes, and wiki_pages) to use to_tsvector
expression indexes instead of dedicated tsvector columns. This way
full-text search works the same way across all tables.
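A sketch of the expression-index approach for one of these tables (the index name and the text search
configuration are illustrative):
-- The index is built on the same expression the searches use, so the planner can
-- use it without a dedicated tsvector column or an update trigger.
CREATE INDEX index_comments_on_to_tsvector_body
  ON comments USING gin (to_tsvector('english', body));
SELECT * FROM comments
WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'touhou');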
API changes:
* Changed /wiki_pages.json?search[body_matches] to match against only
the body. Before `body_matches` matched against both the title and the body.
* Added /wiki_pages.json?search[title_or_body_matches] to match against
both the title and the body.
* Fixed /dmails.json?search[message_matches] to match against both the
title and body when doing a wildcard search. Before, a wildcard search
only matched against the body.
* Added /dmails.json?search[body_matches] to match against only the dmail body.
Change the wiki_pages tsvector_update_trigger to use
`pg_catalog.english` instead of `public.danbooru`. This changes how wiki
page text is parsed for full-text search to use the standard English
parser instead of test_parser. This is to prepare for dropping
test_parser. Using test_parser here was wrong anyway because it meant
that punctuation wasn't removed from words when indexing wiki pages for
full-text search.
Use the `string_to_array(tag_string, ' ')` index instead of the
`tag_index` for tag searches. The string_to_array index lets us treat
the tag_string as an array for searching purposes. This lets us get rid
of the tag_index column and the test_parser dependency in the future.
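That is, tag membership tests become array containment tests that can use a GIN index on the
expression, roughly like this (the index name is illustrative):
CREATE INDEX index_posts_on_string_to_array_tag_string
  ON posts USING gin (string_to_array(tag_string, ' '));
-- A single-tag search then looks something like:
SELECT * FROM posts
WHERE string_to_array(tag_string, ' ') @> ARRAY['long_hair'];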
Fix exception during https://danbooru.donmai.us/posts/random?tags=ordfav:nonamethanks
Before, we were doing a query like this:
SELECT
"posts".*
FROM
"posts"
INNER JOIN
"favorites" ON "favorites"."post_id" = "posts"."id"
WHERE
(favorites.user_id % 100 = 64 AND favorites.user_id = 52664)
AND "posts"."id" = 343894
ORDER BY
favorites.id DESC,
posts.id DESC,
ID=343894 DESC
but `ID=? DESC` is ambiguous during an ordfav: search because of the
join on the favorites table. The fix is to qualify the reference as
`posts.id`.
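With the fix, the ORDER BY clause comes out along these lines (a sketch of the corrected query, not
verbatim output):
ORDER BY
  favorites.id DESC,
  posts.id DESC,
  posts.id = 343894 DESC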
Make it so that when a user removes their own vote, the vote is soft
deleted (the is_deleted flag is set) instead of hard deleted.
Changes:
* Add is_deleted flag to comment votes.
* Relax uniqueness constraint so you can have multiple deleted votes on
the same comment. You can still only have one active vote on the comment.
* Add `soft_delete` method to Deletable concern.
Changes to bans:
* Change the `expires_at` field to `duration`.
* Make moderators choose from a fixed set of standard ban lengths,
instead of allowing arbitrary ban lengths.
* List `duration` in seconds in the /bans.json API.
* Dump bans to BigQuery.
Note that some old bans have a negative duration. This is because their
expiration date was before their creation date, which is because in 2013
bans were migrated to Danbooru 2 and the original ban creation dates
were lost.