51 Commits

Author SHA1 Message Date
evazion
1aeb52186e Add AI tag model and UI.
Add a database model for storing AI-predicted tags, and add a UI for browsing and searching these tags.

AI tags are generated by the Danbooru Autotagger (https://github.com/danbooru/autotagger). See that
repo for details about the model.

The database schema is `ai_tags (media_asset_id integer, tag_id integer, score smallint)`. This is
designed to be as space-efficient as possible, since in production we have over 300 million
AI-generated tags (6 million images and 50 tags per post). This amounts to over 10GB in size, plus
indexes.
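
A rough size estimate for the figures above, as a sketch. The column widths follow from the schema (integer is 4 bytes, smallint is 2 bytes); the per-row overhead is an assumed round number used only for illustration, not a measured value.

```ruby
# Back-of-the-envelope size estimate for the ai_tags table.
AI_TAG_IMAGES         = 6_000_000
AI_TAGS_PER_IMAGE     = 50
# media_asset_id integer (4) + tag_id integer (4) + score smallint (2)
AI_TAG_PAYLOAD_BYTES  = 4 + 4 + 2
# ASSUMPTION: approximate per-row storage overhead (row header, item pointer,
# alignment padding). Illustrative only; not taken from the commit message.
AI_TAG_ASSUMED_OVERHEAD_BYTES = 28

def ai_tags_estimate(images: AI_TAG_IMAGES, tags_per_image: AI_TAGS_PER_IMAGE)
  rows = images * tags_per_image
  {
    rows: rows,
    payload_gb: rows * AI_TAG_PAYLOAD_BYTES / 1e9,
    with_overhead_gb: rows * (AI_TAG_PAYLOAD_BYTES + AI_TAG_ASSUMED_OVERHEAD_BYTES) / 1e9
  }
end
```

The column data alone is about 3GB for 300 million rows; under the assumed overhead the table lands above 10GB, which is why the schema keeps each row as narrow as possible.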

You can search for AI tags using e.g. `ai:scenery`. You can do `ai:scenery -scenery` to find posts
where the scenery tag is potentially missing, or `scenery -ai:scenery` to find posts that are
potentially mistagged (or more likely where the AI missed the tag).
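
The two searches above amount to set differences between human-applied tags and AI-predicted tags. A minimal in-memory sketch, using made-up posts and hypothetical helper names (this is not the real search code):

```ruby
# Hypothetical posts: `tags` are human-applied, `ai_tags` are AI-predicted.
AI_SEARCH_POSTS = [
  { id: 1, tags: %w[scenery sky],  ai_tags: %w[scenery cloud] },
  { id: 2, tags: %w[1girl],        ai_tags: %w[scenery 1girl] },
  { id: 3, tags: %w[scenery tree], ai_tags: %w[tree] }
].freeze

# `ai:TAG -TAG`: the AI predicts the tag, but the post doesn't have it.
def possibly_missing(posts, tag)
  posts.select { |post| post[:ai_tags].include?(tag) && !post[:tags].include?(tag) }
end

# `TAG -ai:TAG`: the post has the tag, but the AI didn't predict it.
def possibly_mistagged(posts, tag)
  posts.select { |post| post[:tags].include?(tag) && !post[:ai_tags].include?(tag) }
end
```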

You can browse AI tags at https://danbooru.donmai.us/ai_tags. On this page you can filter by
confidence level. You can also search unposted media assets by AI tag.

To generate tags, use the `autotag` script from the Autotagger repo, something like this:

  docker run --rm -v ~/danbooru/public/data/360x360:/images ghcr.io/danbooru/autotagger ./autotag -c -f /images | gzip > tags.csv.gz

To import tags, use the fix script in script/fixes/. Expect a Danbooru-size dataset to take
hours to days to generate tags, then 20-30 minutes to import. Currently this all has to be done by hand.
2022-06-24 04:54:26 -05:00
evazion
af183467b6 post queries: switch to new post search engine.
Switch to the post search engine using the new PostQuery parser. The new
engine fully supports AND, OR, and NOT operators and grouping expressions
with parentheses.

Highlights:

New OR operator:

* `skirt or dress` (same as `~skirt ~dress`)

Tags can be grouped with parentheses:

* `1girl (skirt or dress)`
* `(blonde_hair blue_eyes) or (red_hair green_eyes)`
* `~(blonde_hair blue_eyes) ~(red_hair green_eyes)` (same as above)
* `(pantyhose or thighhighs) (black_legwear or brown_legwear)`
* `(~pantyhose ~thighhighs) (~black_legwear ~brown_legwear)` (same as above)

Metatags can be OR'd together:

* `user:evazion or fav:evazion`
* `~user:evazion ~fav:evazion`

Wildcard tags can be combined with either AND or OR:

* `black_* white_*` (find posts with at least one black_* tag AND one white_* tag)
* `black_* or white_*` (find posts with at least one black_* tag OR one white_* tag)
* `~black_* ~white_*` (same as above)

See 4c7cfc73 for more syntax examples.
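
To illustrate what AND, OR, NOT, and grouping mean for a single post, here is a small evaluator over a hand-built syntax tree. The node shapes (`[:tag, name]`, `[:and, ...]`, etc.) are hypothetical and chosen for this sketch; they are not the PostQuery parser's actual representation.

```ruby
# Hypothetical AST: [:tag, name], [:not, node], [:and, *nodes], [:or, *nodes].
def match_node?(node, tags)
  type, *args = node
  case type
  when :tag then tags.include?(args.first)
  when :not then !match_node?(args.first, tags)
  when :and then args.all? { |child| match_node?(child, tags) }
  when :or  then args.any? { |child| match_node?(child, tags) }
  else raise ArgumentError, "unknown node type: #{type.inspect}"
  end
end

# `1girl (skirt or dress)`
SKIRT_OR_DRESS_QUERY = [:and, [:tag, "1girl"], [:or, [:tag, "skirt"], [:tag, "dress"]]].freeze
```

A post tagged `1girl dress` matches this tree; a post tagged only `skirt` does not, because the `1girl` branch of the AND fails.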

Fixes #4949: And+or search?
Fixes #5056: Wildcard searches return unexpected results when combined with OR searches
2022-04-17 23:20:22 -05:00
evazion
01a22930e7 posts: move attribute search methods from PostQueryBuilder to Post.
Move `status_matches` etc methods from PostQueryBuilder to Post. This is
to make refactoring to use the new query parser easier.
2022-04-06 20:25:09 -05:00
evazion
51ba56e8a3 Fix #5001: Media assets not searchable through upload records.
Fix this:

  https://danbooru.donmai.us/uploads.json?search[media_assets][md5]=b83daa7f1ae7e4127b1befd32f71ba10

failing with an ActiveRecord::StatementInvalid error.

The bug was that for a `has_many through: ...` association, like
`has_many :media_assets, through: :upload_media_assets`, we weren't
joining on the associated table properly so we ended up generating
invalid SQL.
2022-02-08 19:18:11 -06:00
evazion
72ea78e697 searchable: replace find_ordered with in_order_of.
Rails 7 added an `in_order_of` method that does what our `find_ordered`
method did before.
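
The idea of ordering records by an explicit list of ids can be sketched in plain Ruby. `order_like` is a hypothetical helper working on in-memory hashes, not the Rails method; this sketch assumes records whose id isn't in the list are dropped.

```ruby
# Return the records whose id appears in `ids`, in the same order as `ids`.
def order_like(records, ids)
  position = ids.each_with_index.to_h # id => index in the requested order
  records
    .select { |record| position.key?(record[:id]) }
    .sort_by { |record| position[record[:id]] }
end
```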
2022-01-07 14:24:57 -06:00
evazion
82211ba935 jobs: add ability to search jobs on /jobs page.
Add ability to search jobs on the /jobs page by job type or by status.

Fixes #2577 (Search filters for delayed jobs). This wasn't possible
before with DelayedJobs because it stored the job data in a YAML string,
which made it difficult to search jobs by type. GoodJobs stores job data
in a JSON object, which is easier to search in Postgres.
2022-01-04 17:18:36 -06:00
evazion
a7dc05ce63 Enable frozen string literals.
Make all string literals immutable by default.
2021-12-14 21:33:27 -06:00
evazion
6b9e1181e5 search: optimize ?search[user_name]=... searches.
Optimize searches using the `search[user_name]=...` URL parameter. If
we're not doing a wildcard search, then do a regular user lookup, which
generates better SQL.
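
The branching described above, sketched against an in-memory hash. The user names, ids, and the `find_user_ids` helper are all made up for illustration; the real code generates SQL instead.

```ruby
# Hypothetical users, keyed by lowercase name.
SKETCH_USERS = { "alice" => 1, "alicia" => 2, "bob" => 3 }.freeze

def find_user_ids(users, name)
  if name.include?("*")
    # Wildcard search: turn `*` into `.*` and test every name (slow path).
    body = name.split("*", -1).map { |part| Regexp.escape(part) }.join(".*")
    pattern = Regexp.new("\\A#{body}\\z", Regexp::IGNORECASE)
    users.select { |user_name, _id| user_name.match?(pattern) }.values
  else
    # No wildcard: a single lookup by exact name (fast path).
    id = users[name.downcase]
    id ? [id] : []
  end
end
```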
2021-11-20 03:19:04 -06:00
evazion
2845164872 search: support quoted phrases, OR, and NOT operators in full-text search.
Make all full-text search fields support quoted phrases and OR and NOT
operators.

This affects all text search fields (any search field that looks like `*_matches`).

Examples:

* hakurei reimu   - matches anything containing the words "hakurei" and "reimu", in any order.
* hakurei or reimu - matches either "hakurei" or "reimu".
* hakurei -reimu  - matches "hakurei" but not "reimu"
* "hakurei reimu" - matches the exact phrase "hakurei reimu"
* "reimu hakurei" - matches the exact phrase "reimu hakurei"

* https://danbooru.donmai.us/notes?search[body_matches]=reimu+hakurei
* https://danbooru.donmai.us/notes?search[body_matches]=reimu+or+hakurei
* https://danbooru.donmai.us/notes?search[body_matches]=reimu+-hakurei
* https://danbooru.donmai.us/notes?search[body_matches]="hakurei+reimu"
* https://danbooru.donmai.us/notes?search[body_matches]="reimu+hakurei"
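
The matching rules in the examples above can be imitated in memory. This is a simplified sketch only: it is not how Postgres evaluates the query (no stemming, stop words, or ranking), and the helper names are hypothetical.

```ruby
# Split a query into tokens; a quoted phrase (optionally negated) is one token.
def tokenize_search(query)
  query.scan(/-?"[^"]*"|\S+/)
end

# Parse into alternatives separated by `or`; terms within an alternative are ANDed.
def parse_search(query)
  alternatives = [[]]
  tokenize_search(query).each do |token|
    if token.casecmp?("or")
      alternatives << []
    else
      negated = token.start_with?("-")
      text = token.delete_prefix("-").delete('"')
      alternatives.last << { words: text.downcase.split, negated: negated }
    end
  end
  alternatives.reject(&:empty?)
end

# True if `words` occur consecutively, in order, within `doc_words`.
def phrase_in?(words, doc_words)
  return false if words.empty?
  doc_words.each_cons(words.size).any? { |window| window == words }
end

def search_matches?(query, document)
  doc_words = document.downcase.scan(/\w+/)
  parse_search(query).any? do |terms|
    terms.all? { |term| phrase_in?(term[:words], doc_words) != term[:negated] }
  end
end
```

For the text "Reimu Hakurei is a shrine maiden", `hakurei reimu` matches (both words present), while `"hakurei reimu"` does not, because the words appear in the other order.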

The phrase search ability partially fixes #4536 (Inconsistent behavior
of search function for comments/forums).

See `websearch_to_tsquery` [1] for full details of the search syntax.

[1]: https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES
2021-10-16 19:13:09 -05:00
evazion
e3b836b506 Refactor full-text search to get rid of tsvector columns.
Refactor full-text search on several tables (comments, dmails,
forum_posts, forum_topics, notes, and wiki_pages) to use to_tsvector
expression indexes instead of dedicated tsvector columns. This way
full-text search works the same way across all tables.

API changes:

* Changed /wiki_pages.json?search[body_matches] to match against only
  the body. Before `body_matches` matched against both the title and the body.

* Added /wiki_pages.json?search[title_or_body_matches] to match against
  both the title and the body.

* Fixed /dmails.json?search[message_matches] to match against both the
  title and body when doing a wildcard search. Before a wildcard search
  only matched against the body.

* Added /dmails.json?search[body_matches] to match against only the dmail body.
2021-10-16 07:44:27 -05:00
evazion
c0f744f84d Fix #4893: Add a FIELD_present parameter variation for text fields.
Usage:

* https://danbooru.donmai.us/wiki_pages.json?search[body_present]=true
* https://danbooru.donmai.us/wiki_pages.json?search[body_present]=false
2021-10-13 04:10:23 -05:00
evazion
7976323f7a wiki pages: change tsvector update trigger to not use test_parser.
Change the wiki_pages tsvector_update_trigger to use
`pg_catalog.english` instead of `public.danbooru`. This changes how wiki
page text is parsed for full-text search to use the standard English
parser instead of test_parser. This is to prepare for dropping
test_parser. Using test_parser here was wrong anyway because it meant
that punctuation wasn't removed from words when indexing wiki pages for
full-text search.
2021-10-11 03:34:47 -05:00
evazion
37a8dc5dbd posts: use string_to_array index for tag searches.
Use the `string_to_array(tag_string, ' ')` index instead of the
`tag_index` for tag searches. The string_to_array index lets us treat
the tag_string as an array for searching purposes. This lets us get rid
of the tag_index column and the test_parser dependency in the future.
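
Treating the tag string as an array means a tag matches only as a whole element, never as a substring. A plain-Ruby sketch of that containment check (`has_all_tags?` is a hypothetical helper, not the SQL the site runs):

```ruby
# True if every required tag is an element of the space-separated tag string.
def has_all_tags?(tag_string, required_tags)
  (required_tags - tag_string.split(" ")).empty?
end
```

For example, `has_all_tags?("1girl miniskirt solo", ["skirt"])` is false even though the raw string contains the substring "skirt".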
2021-10-10 22:00:10 -05:00
evazion
ea6e47125e metadata: add ability to search exif metadata.
Usage:

* https://danbooru.donmai.us/media_metadata?search[has_metadata]=true
* https://danbooru.donmai.us/media_metadata?search[has_metadata]=false
* https://danbooru.donmai.us/media_metadata?search[metadata_has_key]=GIF:GIFVersion
* https://danbooru.donmai.us/media_metadata?search[metadata][GIF:GIFVersion]=89a
* https://danbooru.donmai.us/media_metadata?search[metadata][GIF:GIFVersion]&search[metadata][GIF:BackgroundColor]=0
2021-09-16 00:25:21 -05:00
evazion
49d18e64e8 Fix #4869: "Random" button raises exception when viewing ordfav.
Fix exception during https://danbooru.donmai.us/posts/random?tags=ordfav:nonamethanks

Before we were doing a query like this:

    SELECT
      "posts".*
    FROM
      "posts"
    INNER JOIN
      "favorites" ON "favorites"."post_id" = "posts"."id"
    WHERE
      (favorites.user_id % 100 = 64 AND favorites.user_id = 52664)
      AND "posts"."id" = 343894
    ORDER BY
      favorites.id DESC,
      posts.id DESC,
      ID=343894 DESC

but `ID=? DESC` is ambiguous during an ordfav: search because of the
join on the favorites table. The fix is to qualify the reference as
`posts.id`.
2021-08-30 16:46:03 -05:00
evazion
81fe68d392 bans: change expires_at field to duration.
Changes:

* Change the `expires_at` field to `duration`.
* Make moderators choose from a fixed set of standard ban lengths,
  instead of allowing arbitrary ban lengths.
* List `duration` in seconds in the /bans.json API.
* Dump bans to BigQuery.

Note that some old bans have a negative duration. This is because their
expiration date was before their creation date, which is because in 2013
bans were migrated to Danbooru 2 and the original ban creation dates
were lost.
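
How a negative duration arises, as a sketch. The helper name and the dates used in the test are hypothetical; the point is only that duration is the expiration time minus the creation time.

```ruby
# Duration in seconds; negative when the ban "expired" before it was "created".
def ban_duration_seconds(created_at, expires_at)
  (expires_at - created_at).to_i
end
```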
2021-03-11 02:59:58 -06:00
evazion
92b8f24724 ip addresses: move more logic to Danbooru::IpAddress.
* Move `is_local?` from IpLookup to Danbooru::IpAddress.
* Refactor more things to use Danbooru::IpAddress instead of using
  IPAddress directly.
2021-03-01 20:13:14 -06:00
evazion
ef177a09cf searchable: fixup bugs in e7b454686.
2021-01-11 19:47:20 -06:00
evazion
c1b865b160 searchable: add more enum attribute search options.
Add `<enum>_not` and `<enum>_id_<op>` search options:

* https://danbooru.donmai.us/mod_actions?search[category_not]=post_regenerate,post_regenerate_iqdb
* https://danbooru.donmai.us/mod_actions?search[category_not]=48,49
* https://danbooru.donmai.us/mod_actions?search[category_id]=40..50
* https://danbooru.donmai.us/mod_actions?search[category_id_not]=40..50
* https://danbooru.donmai.us/mod_actions?search[category_id_gt]=40&search[category_id_lt]=50
2021-01-11 19:13:35 -06:00
evazion
e7b454686e searchable: refactor where_operator method.
Refactor the `where_operator` method so we can use it to avoid raw SQL
in more places.
2021-01-11 19:13:29 -06:00
evazion
6d2eeb6f28 searchable: fix being unable to use multiple operators on same attribute.
Fix searches like this not working:

* https://danbooru.donmai.us/tags?search[id]=1..100&search[id_not]=50

Before one of these params would override the other.
2021-01-11 14:59:04 -06:00
evazion
fc5db679e4 autocomplete: optimize searching by artist/wiki page other names.
Optimize searches for non-English phrases in autocomplete. These
searches were pretty slow, and could sometimes cause sitewide lag spikes
when users typed long strings of non-English text into the search box
and caused an unintentional DoS.

The trick is to use an `array_to_tsvector(other_names) USING gin` index
on other_names. This supports fast string prefix matching against all
elements of the array. The downside is that it doesn't allow infix or
suffix matches, so we can't support wildcards in general. Wildcards
didn't quite work anyway, since artist and wiki other names can contain
literal '*' characters.
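
Why an ordered index handles prefix matches but not infix or suffix matches can be shown with a sorted array and binary search. This is an analogy only, not a description of how the GIN index works internally; the names and the `prefix_matches` helper are made up.

```ruby
# All names starting with `prefix` form one contiguous run in sorted order,
# so a binary search finds the start of the run without scanning everything.
def prefix_matches(sorted_names, prefix)
  first = sorted_names.bsearch_index { |name| name >= prefix }
  return [] if first.nil?
  sorted_names.drop(first).take_while { |name| name.start_with?(prefix) }
end

SORTED_OTHER_NAMES = %w[reimu hakurei_reimu kirisame_marisa reimu_hakurei].sort.freeze
```

Names containing "reimu" in the middle or at the end (such as `hakurei_reimu`) are scattered through the sorted order, so finding them would require a full scan.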
2021-01-10 03:35:12 -06:00
evazion
9759701071 search: add way to search array attributes by regex.
Add a `where_any_in_array_matches_regex` method and expose it to the API:

 * https://danbooru.donmai.us/artists?search[any_other_name_matches_regex]=^blah
 * https://danbooru.donmai.us/wiki_pages?search[any_other_name_matches_regex]=^blah
 * https://danbooru.donmai.us/saved_searches?search[any_label_matches_regex]=^blah

In SQL, this does `WHERE '^blah' ~<< ANY(other_names)`, where `~<<` is a
custom operator based on the `~` regex match operator, but with the
arguments reversed. This allows it to be used with the ANY(array) operator.
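
The role of the reversed operator, sketched in Ruby. The function names are hypothetical stand-ins for the SQL operators, and Ruby's regex dialect differs from Postgres's in details.

```ruby
# Stand-in for `string ~ pattern`: does the string match the regex?
def regex_match(string, pattern)
  Regexp.new(pattern).match?(string)
end

# Stand-in for the custom `~<<` operator: same test, arguments swapped,
# so the pattern is on the left and the array element is on the right.
def reversed_regex_match(pattern, string)
  regex_match(string, pattern)
end

# Stand-in for `pattern ~<< ANY(array)`: true if any element matches.
def any_in_array_matches_regex?(pattern, array)
  array.any? { |element| reversed_regex_match(pattern, element) }
end
```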

See also:

* https://stackoverflow.com/a/22101172
* https://www.postgresql.org/docs/current/sql-createfunction.html
* https://www.postgresql.org/docs/current/sql-createoperator.html
* https://www.postgresql.org/docs/current/functions-comparisons.html
2021-01-10 02:03:02 -06:00
evazion
65adcd09c2 users: track logins, signups, and other user events.
Add tracking of certain important user actions. These events include:

* Logins
* Logouts
* Failed login attempts
* Account creations
* Account deletions
* Password reset requests
* Password changes
* Email address changes

This is similar to the mod actions log, except for account activity
related to a single user.

The information tracked includes the user, the event type (login,
logout, etc), the timestamp, the user's IP address, IP geolocation
information, the user's browser user agent, and the user's session ID
from their session cookie. This information is visible to mods only.

This is done with three models. The UserEvent model tracks the event
type (login, logout, password change, etc) and the user. The UserEvent
is tied to a UserSession, which contains the user's IP address and
browser metadata. Finally, the IpGeolocation model contains the
geolocation information for IPs, including the city, country, ISP, and
whether the IP is a proxy.
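
The relationship between the three models can be sketched with plain structs. The struct names, field names, and the `recent_failed_logins` helper are assumptions based on the description above, not the actual schema.

```ruby
# Hypothetical shapes; the real models are database-backed.
IpGeolocationSketch = Struct.new(:ip_addr, :city, :country, :isp, :is_proxy, keyword_init: true)
UserSessionSketch   = Struct.new(:session_id, :ip_addr, :user_agent, keyword_init: true)
UserEventSketch     = Struct.new(:user_id, :category, :user_session, :created_at, keyword_init: true)

# One of the intended uses: count a user's failed logins since a given time,
# e.g. as input to a rate limiter.
def recent_failed_logins(events, user_id, since)
  events.count do |event|
    event.user_id == user_id && event.category == :failed_login && event.created_at >= since
  end
end
```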

This tracking will be used for a few purposes:

* Letting users view their account history, to detect things like logins
  from unrecognized IPs, failed login attempts, password changes, etc.
* Rate limiting failed login attempts.
* Detecting sockpuppet accounts using their login history.
* Detecting unauthorized account sharing.
2021-01-08 22:34:37 -06:00
evazion
da3e8e4726 searchable: fix bug with searching multiple association attributes.
Fix a bug with searches like the following not working correctly:

* https://danbooru.donmai.us/comments.json?search[creator][level]=20&search[creator_id]=1234
* https://danbooru.donmai.us/comments.json?search[creator][level]=20&search[creator_name]=abcd
* https://danbooru.donmai.us/comments.json?search[post][rating]=s&search[post_tags_match]=touhou

It wasn't possible to search for both `creator` and `creator_id` at the
same time (or `post` and `post_tags_match`, etc). Only the `creator_id`
param would be recognized.

Also refactor some internals:

* `search_includes` was renamed to `search_associated_attribute`.
* `search_attribute` was split up into `search_basic_attribute` and
  `search_associated_attribute`.
2021-01-07 17:10:29 -06:00
BrokenEagle
db5f9ce243 Support multiple excludes for enum types
It's not possible to pass it off to search_numeric_attribute directly
since the column "category" does not match the prefix "category_id".
2021-01-06 20:21:56 +00:00
BrokenEagle
57de81686b Support using all numeric searches for includes
2021-01-06 20:21:56 +00:00
BrokenEagle
4a439d72d6 Support multiple exclusions
Since it negates numeric_attribute_matches, which uses the post query
builder, it now also supports reverse ranges and reverse greater/less
than.
2021-01-06 20:21:55 +00:00
evazion
7f1b798b05 searchable: refactor search_boolean_attribute.
2020-12-27 05:26:21 -06:00
evazion
2c1da660fd tags: allow tag abbreviations in searches and during tagging.
Expand the tag abbreviation system introduced in b0be8ae45 so that it
works in searches and when tagging posts, not just in autocomplete.

For example, you can tag a post with /evth and it will add the tag
eyebrows_visible_through_hair. You can search for /evth and it will
search for the tag eyebrows_visible_through_hair.

Some more examples:

* /ops is short for one-piece_swimsuit
* /hooe is short for hair_over_one_eye
* /saol is short for standing_on_one_leg
* /tlozbotw is short for the_legend_of_zelda:_breath_of_the_wild

If two tags have the same abbreviation, then the larger tag takes
precedence. For example, /be is short for blue_eyes, not brown_eyes,
because blue_eyes is the bigger tag.

If there is an existing shortcut alias that conflicts with the
abbreviation, then the alias takes precedence. For example, /sh is short
for suzumiya_haruhi, not short_hair, because there's an old alias for
/sh -> suzumiya_haruhi.
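
The precedence rules above, as a sketch. It assumes an abbreviation is the first letter of each word, with words split on `_` and `-`, which is consistent with the examples but not confirmed here; the post counts and helper names are made up.

```ruby
# ASSUMPTION: abbreviation = first character of each `_`/`-` separated word.
def tag_abbreviation(tag_name)
  tag_name.split(/[_-]/).reject(&:empty?).map { |word| word[0] }.join
end

# Aliases win over abbreviations; among abbreviations, the larger tag wins.
def expand_abbreviation(abbreviation, tags, aliases)
  key = abbreviation.delete_prefix("/")
  return aliases[key] if aliases.key?(key)

  candidates = tags.select { |tag| tag_abbreviation(tag[:name]) == key }
  candidates.max_by { |tag| tag[:post_count] }&.fetch(:name)
end

# Hypothetical post counts, for illustration only.
SKETCH_TAGS = [
  { name: "blue_eyes", post_count: 1000 },
  { name: "brown_eyes", post_count: 500 },
  { name: "short_hair", post_count: 2000 },
  { name: "one-piece_swimsuit", post_count: 100 }
].freeze
SKETCH_ALIASES = { "sh" => "suzumiya_haruhi" }.freeze
```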
2020-12-17 23:57:13 -06:00
evazion
ee4516f5fe searchable: refactor searchable_includes.
Pass searchable associations directly to search_attributes instead of
defining them separately in searchable_includes.
2020-12-16 23:57:07 -06:00
evazion
e771c0fca8 searchable: don't automatically include id, created_at, updated_at.
Don't make search methods on models call super in order to search
certain default attributes (id, created_at, updated_at). Simplifies some
magic.
2020-12-16 23:57:07 -06:00
evazion
2297bf5da5 Fix #4638: Add exclusions to the numeric attributes.
Add the following search operators:

* /tags?search[post_count_eq]=42
* /tags?search[post_count_not_eq]=42
* /tags?search[post_count_gt]=42
* /tags?search[post_count_gteq]=42
* /tags?search[post_count_lt]=42
* /tags?search[post_count_lteq]=42

Works for all numeric attributes on all index actions.
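
How the operator suffixes could map onto comparisons, sketched over in-memory records. The helper names are hypothetical and the real implementation builds SQL; this only illustrates the suffix-to-operator mapping and that several operators on one attribute combine with AND.

```ruby
NUMERIC_OPERATORS = {
  "not_eq" => ->(a, b) { a != b },
  "gteq"   => ->(a, b) { a >= b },
  "lteq"   => ->(a, b) { a <= b },
  "eq"     => ->(a, b) { a == b },
  "gt"     => ->(a, b) { a > b },
  "lt"     => ->(a, b) { a < b }
}.freeze

# Split "post_count_not_eq" into ["post_count", "not_eq"].
def split_search_key(key)
  # Try longer suffixes first so `_not_eq` isn't mistaken for `_eq`.
  suffixes = NUMERIC_OPERATORS.keys.sort_by { |name| -name.length }
  suffix = suffixes.find { |name| key.end_with?("_#{name}") }
  suffix ? [key.delete_suffix("_#{suffix}"), suffix] : [key, "eq"]
end

# Apply every search param in turn; each one narrows the result further.
def numeric_search(records, params)
  params.reduce(records) do |matching, (key, value)|
    attribute, operator = split_search_key(key)
    test = NUMERIC_OPERATORS.fetch(operator)
    matching.select { |record| test.call(record.fetch(attribute.to_sym), Integer(value)) }
  end
end
```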
2020-12-16 20:03:09 -06:00
evazion
35134abe8f post query builder: fix incompatibilities with Rails 6.1.
* Rename the `#negate` and `#and` methods that we monkey patch into
  ActiveRecord::Relation. These methods are now defined in Rails 6.1, but
  they shadow our methods and have slightly different behavior.
* Fix a call to `invert`. It no longer accepts an argument.
2020-12-13 04:10:48 -06:00
evazion
f0299a8945 aliases: refactor tag moving code.
* Factor out the code for moving tags from tag aliases to a separate
  TagMover class.

* When aliasing two tags that have conflicting wikis, merge the old wiki
  into the new one instead of failing with an error. Merge the other names
  fields, replace the old wiki body with a message linking to the new
  wiki, and mark the old wiki as deleted.

* When aliasing two tags that have conflicting artist entries, merge the
  old artist into the new one instead of silently ignoring the conflict.
  Merge the group name, other names, and urls fields, and mark the old
  artist as deleted.

* When two tags have conflicting wikis or artist entries, but the old
  wiki or artist entry is deleted, then just ignore the old wiki or
  artist and don't try to merge it.

* Fix it so that when saved searches are rewritten, we rewrite negated
  searches too.
2020-08-26 17:05:41 -05:00
evazion
70b82010a7 search: fix info leak when searching nested associations.
Fix an exploit in #4553. It was possible to use nested searches to infer
the contents of private forum posts.

For example:

* https://danbooru.donmai.us/users?search[forum_posts][id]=121683&search[forum_posts][body_matches]=h*
* https://danbooru.donmai.us/users?search[forum_posts][id]=121683&search[forum_posts][body_matches]=he*
* https://danbooru.donmai.us/users?search[forum_posts][id]=121683&search[forum_posts][body_matches]=hel*
* https://danbooru.donmai.us/users?search[forum_posts][id]=121683&search[forum_posts][body_matches]=hell*
* https://danbooru.donmai.us/users?search[forum_posts][id]=121683&search[forum_posts][body_matches]=hello*

The above searches returned the user 'albert', indicating that the
private forum post with id 121683 starts with the word 'hello'.

By guessing the id of a private forum post (which can be done by
searching for gaps in the id sequence), and by guessing text within the
post (which can be done by sequentially guessing characters with
wildcard searches), one could eventually infer the full text of a
private forum post.

The fix is to make nested searches only return records that are visible
to the current user.
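
The effect of the fix, sketched with in-memory records. The struct, the sample data, and the `creators_matching` helper are hypothetical; the real fix applies the visibility check when building the nested query.

```ruby
ForumPostSketch = Struct.new(:id, :creator, :body, :is_private, keyword_init: true)

# Return the creators of posts whose body starts with `prefix`, considering
# only posts the viewer is allowed to see.
def creators_matching(posts, prefix, viewer_can_see_private:)
  posts
    .select { |post| viewer_can_see_private || !post.is_private }
    .select { |post| post.body.start_with?(prefix) }
    .map(&:creator)
    .uniq
end
```

With the visibility filter applied first, a prefix guess against a private post returns nothing for an unprivileged viewer, so the result no longer reveals whether the guess was right.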
2020-08-18 15:21:39 -05:00
BrokenEagle
36fa8efcd5 Fix parameter hash detection
Hash-like objects will respond to each_value, whereas arrays do not.
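
A minimal illustration of the check (`hash_like?` is a hypothetical helper name):

```ruby
# Hashes respond to each_value; arrays and strings do not.
def hash_like?(value)
  value.respond_to?(:each_value)
end
```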
2020-08-18 05:34:14 +00:00
evazion
5db11a0b5f Merge branch 'master' into attribute-searching
2020-08-17 14:23:00 -05:00
evazion
2b0cd3c90b searchable: add support for searching enum fields.
Allow searching enum fields by string, by id, or by array of
comma-separated values. The category field in modactions is an example
of an enum field that can be searched this way.
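
Resolving a search value to enum ids could look roughly like this. The enum mapping below is made up for illustration (it is not the real mod action category list), and `resolve_enum_values` is a hypothetical helper.

```ruby
# HYPOTHETICAL enum mapping: name => id.
SKETCH_CATEGORIES = { "created" => 0, "updated" => 1, "deleted" => 2 }.freeze

# Accept a name, an id, or a comma-separated mix of both; drop unknown values.
def resolve_enum_values(mapping, raw)
  raw.to_s.split(",").map(&:strip).map do |value|
    if value.match?(/\A\d+\z/)
      id = value.to_i
      mapping.value?(id) ? id : nil
    else
      mapping[value]
    end
  end.compact.uniq
end
```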
2020-08-07 19:24:57 -05:00
BrokenEagle
c141a358bd Add support for chaining more search includes
- A generalized search includes function was added
-- The post and user includes functions were changed to use that
- A search function for polymorphic includes was added
- All models are given 3 class functions to control which includes
  are searchable, and extra restrictions for the "has_" params
2020-07-27 19:29:17 +00:00
evazion
b551e3634f Fix misc rubocop warnings.
2020-06-16 21:36:15 -05:00
evazion
f38c38f26e search: split tag_match into user_tag_match / system_tag_match.
When doing a tag search, we have to be careful about which user we're
running the search as because the results depend on the current user.
Specifically, things like private favorites, private favorite groups,
post votes, saved searches, and flagger names depend on the user's
permissions, and whether non-safe or deleted posts are filtered out
depends on whether the user has safe mode on or the hide deleted posts
setting enabled.

* Refactor internal searches to explicitly state whether they're
  running as the system user (DanbooruBot) or as the current user.
* Explicitly pass in the current user to PostQueryBuilder instead of
  implicitly relying on the CurrentUser global.
* Get rid of CurrentUser.admin_mode? (used to ignore the hide deleted
  post setting) and CurrentUser.without_safe_mode (used to ignore safe
  mode).
* Change the /counts/posts.json endpoint to ignore safe mode and the
  hide deleted posts settings when counting posts.
* Fix searches not correctly overriding the hide deleted posts setting
  when multiple status: metatags were used (e.g. `status:banned status:active`)
* Fix fast_count not respecting the hide deleted posts setting when the
  status:banned metatag was used.
2020-05-07 03:29:44 -05:00
evazion
8cbcec285d search: fix multiple metatag searches not working in some cases.
Bug: in some cases searching for multiple metatags would cause one
metatag to be ignored. For example, a search for {{user:1 pool:2}} would
be treated as a search for {{pool:2}}.

Cause: we used `ActiveRecord::Relation#merge` to combine two relations,
which was wrong because `merge` doesn't combine `column IN (?)` clauses
correctly. If there are two `column IN (?)` clauses on the same column,
then `#merge` takes only the second clause and ignores the first.

Fix: write our own half-baked `#and` method to work around Rails'
broken-by-design `#merge` method.
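
The difference between the two behaviors, as an analogy using plain hashes of `column => allowed values`. This is not ActiveRecord code; `merge_conditions` and `and_conditions` are hypothetical helpers that mimic "second clause replaces the first" versus "both clauses must hold".

```ruby
# Like the buggy behavior: a same-column clause from `second` replaces
# the one from `first`.
def merge_conditions(first, second)
  first.merge(second)
end

# Like the fix: when both sides constrain the same column, keep only the
# values allowed by both.
def and_conditions(first, second)
  first.merge(second) { |_column, a, b| a & b }
end
```

With `{ id: [1, 2, 3] }` and `{ id: [2, 3, 4] }`, the first helper yields `{ id: [2, 3, 4] }` (the first clause is lost), while the second yields `{ id: [2, 3] }`.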

ref: https://github.com/rails/rails/issues/33501.
2020-04-27 22:29:42 -05:00
evazion
18685ae5ae search: fixup broken class method references.
Fixup for 3dab648d0.
2020-04-23 13:38:19 -05:00
evazion
fef90b46ee search: clean up filetype: metatag.
* Fix not being able to use the filetype: metatag twice in the same search.
* Support comma-separated filetypes (filetype:png,jpg).
2020-04-20 04:14:24 -05:00
evazion
c92ac9ab89 search: clean up status: metatag.
* Fix not being able to use the status: metatag twice in the same search.
* Fix status:active excluding banned posts.
* Fix status:garbage returning all posts.
2020-04-20 04:14:24 -05:00
evazion
172095730c search: support repeated numeric-valued metatags.
Support using the same numeric-valued metatag twice in the same search.
Numeric-valued metatags are those taking an integer, float, filesize, or
date argument. Previously using the same metatag twice would cause the
second metatag to overwrite the first metatag.

Examples:

* "id:>5 id:<10"
* "width:>500 width:<1000"
* "date:>2019-01-01 date:<2020-01-01"
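
Combining repeated metatags instead of letting the last one win, sketched for the `id:` metatag only. The parser below is a simplified, hypothetical stand-in that handles `id:N`, `id:>N`, `id:<N`, `id:>=N`, and `id:<=N`, and nothing else.

```ruby
ID_COMPARATORS = { ">=" => :>=, "<=" => :<=, ">" => :>, "<" => :<, "" => :== }.freeze

# Turn "id:>5 id:<10" into [[:>, 5], [:<, 10]], keeping every occurrence.
def parse_id_metatags(query)
  query.split.map do |term|
    match = term.match(/\Aid:(>=|<=|>|<|)(\d+)\z/)
    raise ArgumentError, "unsupported term: #{term}" if match.nil?
    [ID_COMPARATORS.fetch(match[1]), match[2].to_i]
  end
end

# An id matches only if it satisfies all of the parsed conditions.
def search_ids(ids, query)
  conditions = parse_id_metatags(query)
  ids.select { |id| conditions.all? { |operator, value| id.public_send(operator, value) } }
end
```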
2020-04-20 02:44:09 -05:00
evazion
be27423afd search: fix invalid username searches returning wrong results.
Partial fix for #4389.

* Fix invalid username searches returning all posts instead of no posts.
* Fix "user:A user:B" returning results for user:B instead of no results.
* Fix "approver:A approver:B" returning results for approver:B instead of no results.
* Add support for negated -commenter, -noter, -noteupdater, -upvote, -downvote metatags.
* Add support for "any" and "none" values for all username metatags,
  including negated metatags that didn't support "any" or "none" before.
* Change noter:any and commenter:any to include posts with deleted notes
  or comments. Note that commenter:<username> already included deleted
  comments before. This is so that commenter:any has the same behavior
  as commenter:<username>
2020-04-15 01:18:41 -05:00
evazion
cb11d818b1 artist versions: fixup is_active reference.
2020-03-09 14:40:22 -05:00
evazion
967d398c8e search: move query parsing code from tag model to post query builder.
2020-03-06 23:23:38 -06:00