Add media_asset_id and old_media_asset_id columns for associating replacements with media assets.
This way we can easily tell which replacements don't have a media asset (with the md5 alone we can't
tell whether the media asset actually exists).
Fix StatementInvalid exception when uploading https://files.catbox.moe/vxoe2p.mp4.
This was a result of multiple bugs:
* First, generating thumbnails for the video failed. This was because
the video uses the AV1 codec, which FFmpeg failed to decode. It failed
because our version of FFmpeg was built without the `--enable-libdav1d`
flag, so it uses the builtin AV1 decoder, which apparently can't
handle this particular video (it spews a bunch of errors about "Failed
to get pixel format" and "missing sequence header" and "failed to get
reference frame").
* Because generating the thumbnails failed, an exception was raised. We
tried to save the error message in the upload_media_assets.error
field. However, this also failed because the error message was 77kb
long (it contained the entire output of the ffmpeg command), but the
`upload_media_assets` table had a btree index on the `error` column,
which meant the maximum length of the error column was limited to
~2.7kb. This led to a StatementInvalid exception being raised.
* Because the StatementInvalid exception was raised while we were trying
to set the upload media asset's status to `failed`, the upload was
left stuck in the `processing` state rather than being set to the
`failed` state.
* Because the upload was stuck in the `processing` state, the upload
page would hang forever waiting for the upload to complete.
The fixes are to:
* Build FFmpeg with `--enable-libdav1d` to use libdav1d for decoding AV1
videos instead of the builtin AV1 decoder.
* Remove the index on the `upload_media_assets.error` column so that
setting overly long error messages won't fail.
* Catch unexpected exceptions in ProcessUploadMediaAssetJob so we can
mark uploads as failed, even if `process_upload!` itself fails because
it raises an unexpected exception inside its own exception handler.
* Check that the video is playable with `MediaFile::Video#is_corrupt?` before
allowing it to be uploaded. This way we can return a better error
message if we can't generate thumbnails because the video isn't
playable. This requires decoding the entire video, so it means uploads
may take several seconds longer for long videos. It's also a security
risk in case ffmpeg has any bugs.
* Define `MediaAsset#preview!` as raising an exception on error, so
it's clear that generating thumbnails can fail. Define `MediaAsset#preview`
as returning nil on error for when we don't care about the cause of
the error.
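The last three fixes can be sketched roughly as follows. This is a minimal illustration, not the actual Danbooru code: `FakeAsset`, `PreviewError`, and `run_job` are hypothetical names, and the `playable` flag stands in for the real corruption check.

```ruby
# Hypothetical stand-in for a media asset; not the real MediaAsset class.
class FakeAsset
  class PreviewError < StandardError; end

  attr_accessor :status, :error

  def initialize(playable:)
    @playable = playable
    @status = "processing"
  end

  # Bang version: raises, so callers can see why thumbnail generation failed.
  def preview!
    raise PreviewError, "video is not playable" unless @playable
    "thumbnail"
  end

  # Non-bang version: returns nil when the cause of the failure doesn't matter.
  def preview
    preview!
  rescue PreviewError
    nil
  end
end

# Outer safety net, analogous to catching unexpected exceptions in the job:
# whatever goes wrong, the asset must not be left in the "processing" state.
def run_job(asset)
  asset.preview!
  asset.status = "active"
rescue StandardError => e
  asset.status = "failed"
  asset.error = e.message
end
```

With this shape, a failure anywhere inside the job still ends with the status set to `failed`, which is what lets the upload page stop waiting.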
Show the following actions on the post events page:
* Post bans and unbans
* Post deletions and undeletions
* Thumbnail regenerations and IQDB regenerations
* Favorites moves
* Rating locks and unlocks
* Note locks and unlocks
Fixes #3825: Events/Moderation page for each post should show eventual ban actions
* Add a global /post_events page that shows the history of all approvals,
disapprovals, flags, appeals, and replacements on a single page.
* Redesign the /posts/:id/events page to show all approval, disapproval,
flag, appeal, and replacement events for a single post (before it only
showed approvals, flags, and appeals).
* Remove the replacement history link from the post show page. Replacements
are now included in the post events page (closes #4948: Highlighed replacements).
* Add /post_approvals/:id and /post_replacements/:id routes (these are
used by the "Details" link on the post events page).
Fix various columns to be either `character varying` or `text`,
depending on what kind of text is stored in the column. `text` is used
for columns that contain free-form natural language, like pool and forum
topic titles, while `character varying` is used for short strings that
don't contain free-form text, like URLs and status fields.
Both types are treated the same by Postgres; the only difference is how
we treat them in Rails. In edit forms, `text` fields use multi-line
textboxes, while `character varying` fields use single-line inputs. And
during search, we allow `text` fields to be searched using full-text
search, but not `character varying` fields.
Remove the `CurrentUser.ip_addr` global variable and replace it with
`request.remote_ip`. Before we had to track the current user's IP in a
global variable so that when we edited a post for example, we could pass
down the user's IP to the model and save it in the post_versions table.
Now that we no longer save IPs in version tables, we don't need a global
variable to get access to the current user's IP outside of controllers.
Set various IP address columns to nullable in preparation for dropping them.
In production, some of these tables already contained null values even
though it violated the constraint.
Remove the /ip_addresses page. This page allowed moderators to search
users by IP, and to see recent activity tied to an IP. However, it was
limited to IPs tied to uploads, comments, dmails, artist edits, note
edits, and wiki edits.
Remove this page because it was limited in scope and because there are
better ways of doing what it did. The /user_events page is better at
catching sockpuppets because it tracks IPs for every login, not just for
certain types of edits. And the /user_actions page is better at
monitoring user activity because it shows all activity associated with
an account, not just for certain types of edits.
Removing this allows us to drop IP addresses from all tables besides the
user_events table. This is good because these IPs are no longer necessary
for any purpose, and because storing them forever is a liability.
Add a user_actions view. This view unions together a bunch of tables to
produce an event log of every action taken by a user.
Also add a bunch of indexes to make queries on this table efficient.
Even though the view is an enormous query combining together about 30
different tables, queries are very efficient as long as every table has
`created_at` and `(user_id, created)` indexes.
Change ID columns from `bigint` (64 bits) to `integer` (32 bits) on various tables.
Rails 5.1 switched the default from integer to bigint for IDs on new
tables, so now we have a mix of tables with integer IDs and bigint IDs.
Switch back to integer IDs on certain tables because we're going to
build a view that unions a bunch of tables together to build a user
activity timeline, and for this purpose all the tables need to have IDs
of the same type in order for Postgres to optimize the query effectively.
Add a tag_versions table for tracking the history of tags.
A couple notable differences from other version tables:
* There is a previous_version_id column that points to the previous
version. This allows finding the first, last, previous, or next
version efficiently in SQL.
* There is a `version` column that tracks the revision number (1, 2, 3, etc).
Post versions and note versions have this, but other version tables don't.
* The `updater_id` column is optional. This is because we don't know who
the last updater was before we started tracking the history of tags,
so the initial updater will be NULL in the first version of the tag.
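The three points above can be modeled in memory as follows. This is only a sketch: `TagVersionRow` and `TagHistory` are hypothetical names, and the `category` attribute is an assumed example of a versioned field.

```ruby
# Hypothetical in-memory stand-in for a tag_versions row.
TagVersionRow = Struct.new(:id, :tag_name, :category, :version,
                           :previous_version_id, :updater_id,
                           keyword_init: true)

class TagHistory
  def initialize
    @rows = {}
    @next_id = 1
  end

  # Latest version of a tag, or nil if the tag has no history yet.
  def latest(tag_name)
    @rows.values.select { |row| row.tag_name == tag_name }.max_by(&:version)
  end

  # Record a new version linked to the previous one. updater_id may be nil,
  # as for the first version of a tag whose original updater is unknown.
  def record(tag_name, category, updater_id: nil)
    prev = latest(tag_name)
    row = TagVersionRow.new(
      id: @next_id, tag_name: tag_name, category: category,
      version: prev ? prev.version + 1 : 1,
      previous_version_id: prev&.id, updater_id: updater_id
    )
    @rows[row.id] = row
    @next_id += 1
    row
  end

  # Following previous_version_id is a single lookup by primary key.
  def previous(row)
    @rows[row.previous_version_id]
  end
end
```

In SQL the `previous` lookup corresponds to a join on `previous_version_id`, which avoids sorting a tag's whole history just to find the neighboring version.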
Add a `words` column to the tags table. This will be used for parsing
tags into words for word-based matching in autocomplete.
For example, "very_long_hair" can be parsed into ["very", "long", "hair"].
The `array_to_tsvector(words)` index is for performing wildcard
searches. It lets us do e.g.
    SELECT * FROM tags WHERE array_to_tsvector(words) @@ 'hand:* & hold:*'
to find tags containing the words "hand*" and "hold*".
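A minimal sketch of the two string transformations involved, assuming words are separated only by underscores (the real parser may handle more cases). `tag_words` and `prefix_tsquery` are hypothetical helper names.

```ruby
# Split a tag into words. Assumption: underscores are the only separator.
def tag_words(tag)
  tag.split("_").reject(&:empty?)
end

# Build a prefix-matching tsquery string such as 'hand:* & hold:*'.
# Note: this does no escaping, so it is only safe for trusted input.
def prefix_tsquery(prefixes)
  prefixes.map { |prefix| "#{prefix}:*" }.join(" & ")
end

p tag_words("very_long_hair")        # ["very", "long", "hair"]
puts prefix_tsquery(%w[hand hold])   # hand:* & hold:*
```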
Index on (tag_id, score) instead of (tag_id) to allow tags to be
filtered and sorted by confidence more efficiently. This index takes up
about the same amount of space as an index on tag_id alone, so including
the score in the index is essentially free.
Add a database model for storing AI-predicted tags, and add a UI for browsing and searching these tags.
AI tags are generated by the Danbooru Autotagger (https://github.com/danbooru/autotagger). See that
repo for details about the model.
The database schema is `ai_tags (media_asset_id integer, tag_id integer, score smallint)`. This is
designed to be as space-efficient as possible, since in production we have over 300 million
AI-generated tags (6 million images and 50 tags per post). This amounts to over 10GB in size, plus
indexes.
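The size figure can be sanity-checked with some back-of-the-envelope arithmetic. The column widths below are the standard Postgres sizes for `integer` and `smallint`; the per-row overhead is an approximation (tuple header plus line pointer) that ignores alignment padding and page headers, so treat the total as a rough lower bound rather than a measurement.

```ruby
# Rough size estimate for the ai_tags table.
def ai_tags_size_estimate(images:, tags_per_image:)
  rows = images * tags_per_image
  payload = 4 + 4 + 2   # integer + integer + smallint, in bytes
  overhead = 24 + 4     # approximate tuple header plus line pointer, in bytes
  {
    rows: rows,
    payload_bytes: rows * payload,
    total_bytes: rows * (payload + overhead),
  }
end

estimate = ai_tags_size_estimate(images: 6_000_000, tags_per_image: 50)
puts estimate[:rows]                # 300000000
puts estimate[:total_bytes] / 1e9   # 11.4
```

Under these assumptions most of the space is per-row overhead rather than column data, which is why keeping the row to three narrow columns matters.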
You can search for AI tags using e.g. `ai:scenery`. You can do `ai:scenery -scenery` to find posts
where the scenery tag is potentially missing, or `scenery -ai:scenery` to find posts that are
potentially mistagged (or more likely where the AI missed the tag).
You can browse AI tags at https://danbooru.donmai.us/ai_tags. On this page you can filter by
confidence level. You can also search unposted media assets by AI tag.
To generate tags, use the `autotag` script from the Autotagger repo, something like this:
    docker run --rm -v ~/danbooru/public/data/360x360:/images ghcr.io/danbooru/autotagger ./autotag -c -f /images | gzip > tags.csv.gz
To import tags, use the fix script in script/fixes/. Expect a Danbooru-size dataset to take
hours to days to generate tags, then 20-30 minutes to import. Currently this all has to be done by hand.
Add a system for upgrading accounts using upgrade codes. Users purchase
an upgrade code off-site then redeem it on-site to upgrade their account
to Gold. Upgrade codes are randomly pre-generated and are one-time use
only. Codes have enough randomness that guessing a code is infeasible.
* Rename the stripe_id column to transaction_id.
* Add a new payment_processor column to identify the processor used for
this transaction (and hence, which backend system the transaction_id is for).
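The upgrade code scheme can be sketched as follows. The code length, the alphabet, and the `UpgradeCodePool` class are assumptions for illustration; the source does not state the real code format or how codes are stored.

```ruby
require "securerandom"
require "set"

# Hypothetical length. 16 alphanumeric characters give roughly 95 bits of
# randomness, which makes guessing a valid code infeasible.
UPGRADE_CODE_LENGTH = 16

def generate_upgrade_code
  SecureRandom.alphanumeric(UPGRADE_CODE_LENGTH)
end

# Hypothetical in-memory pool of pre-generated, unredeemed codes.
class UpgradeCodePool
  def initialize(codes)
    @unused = Set.new(codes)
  end

  # Returns true and consumes the code if it is valid and unused.
  # Returns false for unknown or already redeemed codes.
  def redeem(code)
    !@unused.delete?(code).nil?
  end
end
```

Removing the code from the pool at redemption time is what makes each code one-time use; a real implementation would do the equivalent inside a database transaction.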
Rewrite db/populate.rb:
* Fix broken code.
* Pull random posts from Danbooru for more realistic data.
* Generate more data (wiki pages, artist commentaries, artist urls).
* Make the amount of data generated configurable with environment variables.
* Use FFaker to generate better random text and usernames.
Usage:
* docker-compose exec danbooru bin/rails runner db/populate.rb # with Docker
* bin/rails runner db/populate.rb # without Docker
`file_key` is a random 9-character base-62 string that will be used as
the image filename in the future.
`is_public` is whether the image can be viewed without authentication or not.
Users running downstream boorus must run `bin/rails db:migrate` and
`script/fixes/109_generate_media_asset_file_keys.rb` after this commit.
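For illustration, a random 9-character base-62 key can be generated like this. `generate_file_key` and `file_key_space` are hypothetical helpers, not the code used by the fix script.

```ruby
require "securerandom"

FILE_KEY_LENGTH = 9

# SecureRandom.alphanumeric draws from A-Z, a-z, and 0-9, which is exactly
# the 62-symbol alphabet of a base-62 string.
def generate_file_key
  SecureRandom.alphanumeric(FILE_KEY_LENGTH)
end

# Number of distinct keys of this length: 62**9, about 1.35 * 10**16.
def file_key_space
  62**FILE_KEY_LENGTH
end

puts generate_file_key
puts file_key_space
```

A key space of this size means filenames cannot be enumerated the way sequential IDs can, though a real implementation would still want a uniqueness constraint to guard against collisions.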
* Deprecated tags can't be added to posts, but existing deprecated tags
in a post won't be removed
* Only empty tags can be marked as deprecated manually
* No tags can be manually undeprecated
** These limits don't apply to admins
* Deprecating or undeprecating a tag will create a new mod action to
prevent people from going rogue
* Added deprecate/undeprecate commands for BURs
* Deprecating a tag via BUR removes all implications to and from it as well
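The manual deprecation rules can be summarized in a small sketch. `DemoTag`, `DemoUser`, and the three helper methods are hypothetical, and the sketch assumes the admin exemption covers only the two manual limits (deprecating non-empty tags and undeprecating).

```ruby
DemoTag = Struct.new(:name, :post_count, keyword_init: true)
DemoUser = Struct.new(:name, :admin, keyword_init: true)

# Non-admins may only deprecate empty tags.
def can_deprecate?(tag, user)
  user.admin || tag.post_count.zero?
end

# Non-admins may never undeprecate a tag manually.
def can_undeprecate?(_tag, user)
  user.admin
end

# Deprecated tags can't be newly added to a post, but deprecated tags that
# were already on the post are left in place.
def apply_tag_edit(old_tags, new_tags, deprecated_tags)
  newly_added = new_tags - old_tags
  new_tags - (newly_added & deprecated_tags)
end
```

For example, editing a post from `["a", "old_dep"]` to `["a", "old_dep", "new_dep"]` keeps `old_dep` but silently drops `new_dep` when both are deprecated.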
Make the artist finder search for artists using the `url` field instead
of the `normalized_url` field. This lets us get rid of `normalized_url`
in the future.
As described in 10dac3ee5, artist URLs have both a `url` column and a
`normalized_url` column. The `normalized_url` column was the one used
for artist finding. The `url` was secretly normalized behind the scenes
so that artist finding would work no matter how the URL was written in
the artist entry. This is no longer necessary now that URLs are directly
normalized in artist entries.
This fixes various cases where artist finding didn't work for non-obvious
reasons, usually because the URL wasn't written in the right format so
it wasn't properly normalized behind the scenes.
This also makes it so that artist finding is case-insensitive, which
fixes #4821. Hopefully no sites are perverse enough to allow two
different usernames that differ only in case.
Users running their own Danbooru instance may have to fix the URLs in
their artist entries for artist finding to work again. There are a few
fix scripts to help with this:
* script/fixes/104_normalize_weibo_artist_urls.rb
* script/fixes/105_normalize_pixiv_artist_urls.rb
* script/fixes/106_normalize_artist_urls.rb
This is needed for multi-file uploads. We need to know both the image
url and the page url to set the post's source correctly when converting
an upload media asset into a post.
Make upload_media_assets.media_asset_id nullable in order to support
multi-file uploads. The media asset will be null while the image is
still being downloaded from the source.
* uploads.media_asset_count - the number of media assets attached to this upload.
* upload_media_assets.status - the status of each media asset attached to this upload (processing, active, failed)
* upload_media_assets.source_url - the source of each media asset attached to this upload
* upload_media_assets.error - the error message if uploading the media asset failed
Mark old columns as ignored in preparation for dropping them. Make the
rating and tag_string nullable so they don't have to be set when
creating uploads and can be ignored too.
Add a join table that allows multiple media assets (images or videos) to
be attached to uploads. This is for a future ability to upload multiple
files at once.
* Add ability to mark moderation reports as 'handled' or 'rejected'.
* Automatically mark reports as handled when the comment or forum post
is deleted.
* Send a dmail to the reporter when their report is handled.
* Don't show the report notice on comments or forum posts when all
reports against it have been handled or rejected.
* Add a fix script to mark all existing reports for deleted comments,
forum posts, or dmails as handled.
Add foreign key constraints on all foreign keys on all tables.
These constraints are deferrable so that they're checked at the end of
the transaction, rather than at the end of the statement. This is to reduce
lock duration and to allow for cyclic relationships.
Constraints are added in one migration then validated in another so that
the entire table isn't locked against reads and writes while the foreign
key constraints are being validated.
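One way to get this two-step behavior in Postgres is to add the constraint as `NOT VALID` and then run `VALIDATE CONSTRAINT` separately. The sketch below only builds the SQL strings; the helper name and the constraint naming scheme are hypothetical, and the actual migrations may be written differently.

```ruby
# Build the two statements for one foreign key: one that adds a deferrable
# constraint without checking existing rows, and one that validates it later.
def foreign_key_statements(table, column, ref_table)
  name = "fk_#{table}_#{column}" # hypothetical naming scheme
  add = "ALTER TABLE #{table} ADD CONSTRAINT #{name} " \
        "FOREIGN KEY (#{column}) REFERENCES #{ref_table} (id) " \
        "DEFERRABLE INITIALLY DEFERRED NOT VALID"
  validate = "ALTER TABLE #{table} VALIDATE CONSTRAINT #{name}"
  [add, validate]
end

add_sql, validate_sql = foreign_key_statements("comments", "creator_id", "users")
puts add_sql
puts validate_sql
```

`INITIALLY DEFERRED` is what moves the check to the end of the transaction, and splitting the add and validate steps keeps the expensive scan of existing rows out of the statement that takes the stronger lock.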
A few tables had invalid foreign keys. Add a fix script to fix these tables:
* A couple artist versions belonged to deleted artists.
* One dmail belonged to a deleted user.
* One forum topic visit belonged to that same deleted user.
* A few dozen note versions belonged to nonexistent posts. This came
from RaisingK moving notes to different posts years ago, back when it
was possible for users to set a note's post ID in the API.
* Some uploads had their parent ID set to 0.