Refactor full-text search on several tables (comments, dmails,
forum_posts, forum_topics, notes, and wiki_pages) to use to_tsvector
expression indexes instead of dedicated tsvector columns, so that
full-text search works consistently across all tables.
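As a rough illustration, the expression-index approach looks something
like this (Rails migration syntax; the index name and query shape are
assumptions, not the exact code):

    # Hypothetical sketch, inside a migration: replace the dedicated
    # tsvector column with a GIN expression index over the text itself.
    add_index :comments, "to_tsvector('english', body)",
              using: :gin, name: "index_comments_on_to_tsvector_body"

    # Searches then match against the same expression, which Postgres
    # satisfies with the index:
    Comment.where("to_tsvector('english', body) @@ websearch_to_tsquery('english', ?)", "touhou")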
API changes:
* Changed /wiki_pages.json?search[body_matches] to match against only
the body. Previously, `body_matches` matched against both the title and
the body.
* Added /wiki_pages.json?search[title_or_body_matches] to match against
both the title and the body.
* Fixed /dmails.json?search[message_matches] to match against both the
title and body when doing a wildcard search. Previously, a wildcard
search matched only the body.
* Added /dmails.json?search[body_matches] to match against only the dmail body.
Make it so that when a database call inside a `with_timeout` block times
out, the error logged to New Relic is marked as expected. This is so
that expected timeouts, such as those from calculating search counts or
generating related tags for the sidebar, don't count against the error
rate.
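A rough sketch of the idea (the real `with_timeout` and its error
handling may differ; `expected: true` is the actual New Relic option):

    # Hypothetical sketch: run a block under a Postgres statement_timeout
    # and report timeouts to New Relic as expected errors.
    def with_timeout(milliseconds)
      ActiveRecord::Base.connection.execute("SET statement_timeout = #{milliseconds.to_i}")
      yield
    rescue ActiveRecord::QueryCanceled => e
      # Marking the error as expected keeps it out of the error rate.
      NewRelic::Agent.notice_error(e, expected: true)
      nil
    ensure
      ActiveRecord::Base.connection.execute("SET statement_timeout = 0")
    end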
Drop the final dependency on the Postgres test_parser extension.
We also have to remove references to test_parser from the migration
where it was first defined, otherwise replaying all migrations from the
beginning would fail. This normally isn't done outside of testing.
After this, it should be possible to use a vanilla install of Postgres
with Danbooru. It's still recommended to use Danbooru's Docker image for
Postgres (https://ghcr.io/danbooru/postgres), as other Postgres extensions
may be necessary in the future.
Restructure the Dockerfile and the CSS/JS files so that we only rebuild
the CSS and JS when they change, not on every commit.
Previously, rebuilding the Docker image took several minutes after
every commit, even when the JS/CSS files didn't change. This also made
pulling images slower.
This requires refactoring the CSS and JS to not use embedded Ruby (ERB)
templates, since this made the CSS and JS dependent on the Ruby
codebase, which is why we had to rebuild the assets after every Ruby
change.
Move all the code for defining tag categories from the config file to
TagCategory. It didn't belong in the config because it's not possible to
add new tag categories purely in the config without editing other things
like the CSS.
Also change it so that tag colors are hardcoded in the CSS instead of
generated using ERB. Generating the CSS with ERB meant that the Docker
build had to recompile it on every commit, even when nothing had
changed, because the CSS depended on Ruby code that we couldn't
guarantee was unchanged between commits.
Try to optimize certain types of common slow searches:
* Searches for mutually-exclusive tags (e.g. `1girl multiple_girls`,
`touhou solo -1girl -1boy`)
* Relatively large tags that are heavily skewed towards old posts
(e.g. lucky_star, haruhi_suzumiya_no_yuuutsu, inazuma_eleven_(series),
imageboard_desourced).
* Mid-sized tags in the <30k post range that Postgres thinks are big
enough to warrant a post id index scan, but for which a tag index scan
is faster.
The general pattern is Postgres not using the tag index because it
thinks scanning down the post id index would be faster, but it's
actually much slower because it degrades to a full table scan. This
usually happens when Postgres thinks a tag is larger or more common than
it really is. Here we try to force Postgres into using the tag index
when we know the search is small.
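One way to pin the planner down is to materialize the tag lookup before
ordering, sketched below (an illustration of the technique, not
necessarily the query Danbooru generates):

    # Hypothetical sketch: resolve the tag condition through the tag
    # (GIN) index inside a materialized CTE, so the planner can't switch
    # to walking down the post id index, then order the small result set.
    tags = ["1girl", "multiple_girls"]
    Post.find_by_sql([<<~SQL, tags])
      WITH candidates AS MATERIALIZED (
        SELECT id FROM posts WHERE string_to_array(tag_string, ' ') @> ARRAY[?]
      )
      SELECT posts.* FROM posts
      JOIN candidates ON candidates.id = posts.id
      ORDER BY posts.id DESC LIMIT 20
    SQL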
One case that is still slow is `2girls -multiple_girls`. This returns no
results, but we can't know that without searching all of `2girls`. The
general case is searching for `A -B` where A is a subset of B and A and B
are both large tags.
Hopefully fixes #581, #654, #743, #1020, #1039, #1421, #2207, #4070,
#4337, #4896, and various other issues raised over the years regarding
slow searches.
When a search is performed, we cache the post count so we don't have to
calculate it again every time the user switches pages. However, if the
count timed out, it wasn't cached, so we redid the slow count on every
page load. This usually happens on multi-tag searches that return a lot
of results, `1girl solo` for example.
This changes it so that the count is cached even when it times out. This
will speed up large multi-tag searches.
This also changes it so that the count is cached for a fixed 5 minutes.
Previously the expiry varied with the size of the count, but this
probably made little difference.
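A minimal sketch of the caching behavior, assuming a Rails.cache-backed
helper (the key and method names are illustrative):

    # Hypothetical sketch: cache the count for a fixed 5 minutes, and
    # cache nil when the count times out so the slow count isn't
    # repeated on every page load.
    def cached_post_count(tags)
      Rails.cache.fetch("post-count:#{tags}", expires_in: 5.minutes) do
        Post.tag_match(tags).count
      rescue ActiveRecord::QueryCanceled
        nil # Rails caches nil block results by default, so the timeout is cached too.
      end
    end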
Change the wiki_pages tsvector_update_trigger to use
`pg_catalog.english` instead of `public.danbooru`. This changes how wiki
page text is parsed for full-text search to use the standard English
parser instead of test_parser. This is to prepare for dropping
test_parser. Using test_parser here was wrong anyway because it meant
that punctuation wasn't removed from words when indexing wiki pages for
full-text search.
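Sketched as raw SQL inside a migration (the trigger and column names
here are assumptions):

    # Hypothetical sketch: point the trigger at the standard English
    # configuration instead of the test_parser-based one.
    execute <<~SQL
      DROP TRIGGER trigger_wiki_pages_on_update ON wiki_pages;
      CREATE TRIGGER trigger_wiki_pages_on_update
        BEFORE INSERT OR UPDATE ON wiki_pages FOR EACH ROW
        EXECUTE PROCEDURE tsvector_update_trigger(body_index, 'pg_catalog.english', body, title);
    SQL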
Use the `string_to_array(tag_string, ' ')` index instead of the
`tag_index` for tag searches. The string_to_array index lets us treat
the tag_string as an array for searching purposes. This lets us get rid
of the tag_index column and the test_parser dependency in the future.
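For illustration, the index and the containment query it serves might
look like this (names assumed):

    # Hypothetical sketch: a GIN index over the tag string split into an
    # array, plus an array-containment query that uses it.
    add_index :posts, "string_to_array(tag_string, ' ')",
              using: :gin, name: "index_posts_on_string_to_array_tag_string"

    # "Does this post have all of these tags?" becomes array containment:
    Post.where("string_to_array(tag_string, ' ') @> ARRAY[?]", ["touhou", "solo"])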
Stop updating the fav_string attribute on posts. The column still exists
on the table, but is no longer used or updated.
Like the pool_string in 7d503f08, the fav_string was used in the past to
facilitate `fav:X` searches. Posts had a hidden fav_string column that
contained a list of every user who favorited the post. These were
treated like fake hidden tags on the post so that a search for `fav:X`
was treated like a tag search.
The fav_string attribute has been unused for search purposes for a while
now. It was only kept because of technicalities that required
departitioning the favorites table first (340e1008e) before it could be
removed. Basically, removing favorites with `@favorite.destroy` was
slow because Rails always deletes objects by ID, but we didn't have an
index on favorites.id, and we couldn't easily add one until the
favorites table was departitioned.
Fixes #4652. See https://github.com/danbooru/danbooru/issues/4652#issuecomment-754993802
for more discussion of issues caused by the fav_string (in short: write
amplification, post table bloat, and favorite inconsistency problems).
Optimize counting the number of posts returned by fav:<name> and
pool:<name> searches. Use cached counts to avoid slow count(*) queries
for users with lots of favorites.
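A rough sketch of the shortcut, assuming cached counter attributes on
users and pools (the attribute and method names are illustrative):

    # Hypothetical sketch: answer fav:<name> and pool:<name> counts from
    # cached counters instead of running count(*) over the whole table.
    def fast_count(query)
      case query
      when /\Afav:(\w+)\z/  then User.find_by(name: $1)&.favorite_count
      when /\Apool:(\w+)\z/ then Pool.find_by(name: $1)&.post_count
      else Post.tag_match(query).count
      end
    end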
Include the favorites table in the nightly database dumps in BigQuery.
Previously we couldn't do this because we didn't have an index on
the favorite ID, which we needed to iterate across the table efficiently.
Note that this doesn't include private favorites. Note also that if a
user switches their favorites from private to public, then their
favorites will begin to appear in these dumps.
Order https://danbooru.donmai.us/favorites.json by favorite ID instead
of by post ID. This way you can get a feed of recently added favorites.
Previously this wasn't possible because we didn't have an index on
favorite ID.
Merge the 100 favorite subtables into a single table.
Previously the favorites table was partitioned by user id into 100
subtables to try to make searching by user id faster. This wasn't
really necessary, and was probably slower than simply creating an index
on (favorites.user_id, favorites.id) to satisfy ordfav searches. B-tree
indexes are logarithmic, so splitting an index into 100 pieces doesn't
make it 100 times faster to search; it just removes a layer or two from
the tree.
This also adds a uniqueness index on (user_id, post_id) to prevent
duplicate favorites. Previously we had to check for duplicates at the
application layer, which required careful locking to do it correctly.
Finally, this adds an index on favorites.id, which was surprisingly
missing before. This made ordering and deleting favorites by id really
slow because it degraded to a sequential scan.
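In migration form, the new indexes look roughly like this (index names
and exact options are assumptions):

    # Hypothetical sketch of the indexes on the merged favorites table:
    add_index :favorites, [:user_id, :post_id], unique: true  # no duplicate favorites
    add_index :favorites, [:user_id, :id]                     # ordfav:<name> searches
    add_index :favorites, :id                                 # order/delete by id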
Add fix script to remove duplicate favorites. When a user has duplicate
favorites on the same post, the earliest favorite will be kept and the
rest will be removed.
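A sketch of what such a fix script might do (model and column names
assumed; the real script may differ):

    # Hypothetical sketch: for each (user_id, post_id) pair with more
    # than one favorite, keep the earliest row and destroy the rest.
    Favorite.group(:user_id, :post_id).having("count(*) > 1")
            .pluck(:user_id, :post_id).each do |user_id, post_id|
      Favorite.where(user_id: user_id, post_id: post_id)
              .order(:id).offset(1).each(&:destroy)
    end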
Let all users have unlimited favorites. Formerly the limit was 10k
favorites for regular members, 20k for Gold, and unlimited for Platinum.
Limiting favorites doesn't make sense since upvotes are unlimited.
Stop using the pool_string attribute on posts:
* Stop updating it when adding or removing posts from pools.
* Stop returning pool_string in the /posts.json API.
* Stop including the `data-pools` attribute on thumbnails.
The pool_string attribute was used in the past to facilitate pool:X
searches. Posts had a hidden pool_string attribute that contained a list
of every pool the post belonged to. These pools were treated like fake
hidden tags on the post and a search for `pool:X` was treated like a tag
search.
The pool_string hasn't been used for this purpose in a long time, and
was only maintained for API compatibility. Getting rid of it eliminates
a bunch of legacy cruft around adding and removing posts from pools.
If you need to see which pools a post belongs to, do this:
* https://danbooru.donmai.us/pools.json?search[post_ids_include_any]=318550
The `data-pools` attribute on thumbnails was used by some people to add
custom borders to pooled posts with custom CSS. This will no longer
work. This was already broken because it included things like collection
pools and deleted pools, which you probably didn't want. Use a
userscript to add this attribute back to thumbnails if you need it.
Start storing the duration of animations and videos in the `duration`
field on the media_assets table. This had to wait until 3d30bfd69 was
deployed, which had to wait until Postgres was upgraded in order to add
the duration column to the media_assets table without downtime.
Also add a fix script to backfill the duration on existing posts. Usage:
TAGS=animated ./script/fixes/079_fix_duration.rb
Fix certain animated PNGs returning NaN as the duration because the
frame rate was reported as "0/0" by FFmpeg. This happens when the
animation has zero delay between frames, which nominally means an
infinitely fast frame rate, but in practice browsers cap it at around
10 FPS. The exact frame rate browsers use is unspecified and
implementation-defined.
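A sketch of the guard (assuming the frame rate arrives as a
"numerator/denominator" string from ffprobe; names illustrative):

    # Hypothetical sketch: parse an FFmpeg frame rate like "30000/1001",
    # treating a "0/0" rate (zero-delay APNG frames) as unknown instead
    # of letting 0.0 / 0.0 produce NaN.
    def parse_frame_rate(raw)
      num, den = raw.split("/").map(&:to_f)
      den = 1.0 if den.nil? # plain "30" style rates
      return nil if den.zero? || num.zero?
      num / den
    end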
Update the Postgres client binaries (psql et al.) to version 14.0. This
is so they match the server version, and so that pg_amcheck is
available, which was introduced in 14.0.
This requires updating the base image to Ubuntu 21.04 at the same time
because the Postgres repo doesn't support version 14.0 on Ubuntu 20.10.
* Skip Nijie tests because they fail a lot due to Nijie rate limiting us.
* Skip ArtStation downloads tests because they sometimes return different file sizes.
* Fix random duplicate favgroup errors because favgroup names weren't random enough.