Add a fix script that imports the md5 for old post replacements from the corresponding DanbooruBot
replacement comment, then deletes all replacement comments.
About 250 replacements still have a null md5 because they don't have a matching comment:
a replacement that didn't change the file didn't leave a comment.
Use ExifTool to get the dimensions of Flash files instead of calculating
them ourselves. Avoids copying third-party code.
Fixes a bug where Flash files with fractional dimensions (e.g. 607.6 x 756.6)
had their dimensions rounded down instead of rounded up.
Fixes another bug where Flash files could return negative dimensions.
This happened for two files:
* https://danbooru.donmai.us/media_assets/228662 (-179.2 x -339.2)
* https://danbooru.donmai.us/media_assets/228664 (-179.2 x -339.2)
Now we round these up to 1x1. This is still wrong, but it's less wrong than before.
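The rounding rule above can be sketched roughly as follows. This is only an illustration, not the code in this commit; `normalize_dimension` is a hypothetical helper name, and clamping zero to 1 is an assumption (the commit only mentions negative values).

```ruby
# Hypothetical helper: round a fractional Flash dimension up, then clamp
# anything below 1 to 1 so a stored dimension is never negative.
def normalize_dimension(value)
  [value.ceil, 1].max
end

normalize_dimension(607.6)   # => 608
normalize_dimension(-179.2)  # => 1
```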
Don't wrap the metadata refresh script in a transaction because it could
be a very long running operation and it's not good to leave a transaction
open that long.
Add a script to go through every media asset and check the metadata
(width, height, duration, filesize, md5, EXIF metadata) and update it
if it's changed. This is necessary after upgrading ExifTool because the
metadata it returns may have changed.
Update email validation rules to disallow the percent character (e.g.
`foo%bar@gmail.com`) and names ending with a period (e.g. `foo.@gmail.com`).
Names ending with a period are invalid according to the RFCs and cause
`Mail::Address.new` to raise an exception.
The percent character is technically legal, but only one email used it
and it was probably a typo.
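A minimal sketch of just these two rules, assuming they apply to the part of the address before the last `@`. `passes_new_email_rules?` is a hypothetical name, and the real validator checks much more than this.

```ruby
# Hypothetical check covering only the two rules described above.
def passes_new_email_rules?(address)
  local, _at, domain = address.rpartition("@")
  return false if local.empty? || domain.empty?
  return false if local.include?("%")   # technically legal, but rejected
  return false if local.end_with?(".")  # invalid according to the RFCs
  true
end

passes_new_email_rules?("foo%bar@gmail.com")  # => false
passes_new_email_rules?("foo.@gmail.com")     # => false
passes_new_email_rules?("foo.bar@gmail.com")  # => true
```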
Store Ugoira frame delays in the MediaMetadata model as a fake EXIF
field instead of in the PixivUgoiraFrameData model. This way we can get
rid of the PixivUgoiraFrameData model completely. This is a step towards
fixing #5264.
Whenever the email address normalization procedure changes, the
`normalized_address` column of the email address table must be updated.
This normally happens when the list of canonical domain mappings changes.
Renormalizing addresses may also require deleting duplicates.
In the past it was possible for users to create multiple accounts with
the same email address. We had about 9000 such accounts. This removes
the email address from these accounts.
When multiple accounts have the same email address, the account that
visited the site last gets to keep the address.
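The tie-breaking rule can be sketched like this, using plain structs instead of the real models. The `Account` struct and `accounts_to_clear` are hypothetical names for illustration.

```ruby
# Hypothetical stand-in for a user account with an email address.
Account = Struct.new(:id, :email, :last_visited_at)

# Returns the accounts that should lose their email address: for each
# address shared by several accounts, everyone but the most recent visitor.
def accounts_to_clear(accounts)
  accounts
    .reject { |account| account.email.nil? }
    .group_by(&:email)
    .values
    .flat_map { |group| group.sort_by(&:last_visited_at)[0...-1] }
end
```

Accounts with a unique address form a group of one, so nothing is removed from them.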
Add a fix script that repairs invalid email addresses when they can be
repaired and deletes them otherwise.
For a long time we didn't have any email validation, so we ended up with
a lot of invalid email addresses containing typos or other random garbage.
This tries to fix the most common typos when possible, otherwise the
email address is deleted.
In many cases the user created two accounts, one with a typo in the
email and one with the correct email. In these cases we can't fix the
invalid email, so we just delete it.
Add a fix script to populate the mod_actions subject field by parsing
mod action descriptions. Most mod actions contain an ID, so finding the
subject is easy, but some don't. And some mod actions refer to deleted
objects, such as deleted posts or comments. In these cases the subject
will be null.
For IP bans, the mod action description only contains the IP, but it's
possible to have multiple bans for the same IP. So we look for IP bans
created by the same user, for the same IP, within the same time range.
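A rough sketch of that lookup, assuming a 60 second window around the mod action's timestamp. The window size, the `IpBan` struct, and `find_ip_ban` are all hypothetical; the commit doesn't say how wide the time range is.

```ruby
# Hypothetical stand-in for an IP ban record.
IpBan = Struct.new(:id, :creator_id, :ip_addr, :created_at)

# Find the ban made by the same user, for the same IP, close in time to
# when the mod action was logged. `window` is in seconds (an assumption).
def find_ip_ban(bans, creator_id:, ip_addr:, logged_at:, window: 60)
  bans.find do |ban|
    ban.creator_id == creator_id &&
      ban.ip_addr == ip_addr &&
      (ban.created_at - logged_at).abs <= window
  end
end
```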
For user bans, the mod action only contains the banned user's name and
the ban reason. This makes it difficult to find the banned user's ID in
some cases: the user may have changed their name without the change
being recorded, the banner may have edited the ban reason, or the ban
may have been deleted. So we try multiple things until we find the
closest match.
Add a fix script to delete all accounts with invalid usernames. Also
change it so the owner-level user can delete accounts belonging to other
users.
Users who have logged in in the last year and who have a valid email
address will be given a one week warning. After that all accounts with
invalid names will be deleted. Anyone who has visited the site in the
last 6 months will have already seen a warning page that their name must
be changed to keep using the site.
Fix mod actions to use the same message format everywhere.
Before, mod actions were formatted in various inconsistent ways:
* "deleted post #1234"
* "comment #1234 updated by <user>"
* "<user> updated forum #1234"
* "<user> level changed Member -> Builder"
Now all mod actions consistently use this format:
* "deleted post #1234"
* "updated comment #1234"
* "updated forum #1234"
* "promoted <user> from Member to Builder"
This way mod actions are formatted consistently with other actions on
the /user_actions page, where everything is written as "<user> did X".
Also add a fix script to fix existing mod actions.
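To illustrate what rewriting an old description involves, here is a small sketch built only from the examples above. `normalize_description` and its patterns are hypothetical. Level changes are left out because choosing between "promoted" and "demoted" needs the ordering of user levels, which isn't shown here.

```ruby
# Hypothetical rewrite rules derived from the example messages above.
def normalize_description(description)
  case description
  when /\A(\w+) #(\d+) updated by \S+\z/
    # "comment #1234 updated by <user>" style
    format("updated %s #%s", $1, $2)
  when /\A\S+ updated (\w+) #(\d+)\z/
    # "<user> updated forum #1234" style
    format("updated %s #%s", $1, $2)
  else
    description  # already in the new format, or not recognized
  end
end

normalize_description("comment #1234 updated by albert")  # => "updated comment #1234"
normalize_description("albert updated forum #1234")       # => "updated forum #1234"
normalize_description("deleted post #1234")               # => "deleted post #1234"
```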
Add a database model for storing AI-predicted tags, and add a UI for browsing and searching these tags.
AI tags are generated by the Danbooru Autotagger (https://github.com/danbooru/autotagger). See that
repo for details about the model.
The database schema is `ai_tags (media_asset_id integer, tag_id integer, score smallint)`. This is
designed to be as space-efficient as possible, since in production we have over 300 million
AI-generated tags (6 million images and 50 tags per post). This amounts to over 10GB in size, plus
indexes.
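As a back-of-the-envelope check on those numbers (column widths are the standard 4 bytes for `integer` and 2 bytes for `smallint`; the rest is arithmetic on the figures quoted above):

```ruby
rows = 6_000_000 * 50               # 300_000_000 rows
payload_bytes = 4 + 4 + 2           # media_asset_id + tag_id + score
raw_gb = rows * payload_bytes / 1e9 # 3.0 GB of column data alone

# A 10 GB table implies roughly 33 bytes per row once per-row storage
# overhead is included, so most of the size is overhead, not column data.
implied_bytes_per_row = 10e9 / rows
```

This is why the schema keeps each row down to three narrow columns.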
You can search for AI tags using e.g. `ai:scenery`. You can do `ai:scenery -scenery` to find posts
where the scenery tag is potentially missing, or `scenery -ai:scenery` to find posts that are
potentially mistagged (or more likely where the AI missed the tag).
You can browse AI tags at https://danbooru.donmai.us/ai_tags. On this page you can filter by
confidence level. You can also search unposted media assets by AI tag.
To generate tags, use the `autotag` script from the Autotagger repo, something like this:
    docker run --rm -v ~/danbooru/public/data/360x360:/images ghcr.io/danbooru/autotagger ./autotag -c -f /images | gzip > tags.csv.gz
To import tags, use the fix script in script/fixes/. Expect a Danbooru-size dataset to take
hours to days to generate tags, then 20-30 minutes to import. Currently this all has to be done by hand.
Add a system for upgrading accounts using upgrade codes. Users purchase
an upgrade code off-site then redeem it on-site to upgrade their account
to Gold. Upgrade codes are randomly pre-generated and are one time use
only. Codes have enough randomness that guessing a code is infeasible.
`file_key` is a random 9-character base-62 string that will be used as
the image filename in the future.
`is_public` is whether the image can be viewed without authentication or not.
Users running downstream boorus must run `bin/rails db:migrate` and
`script/fixes/109_generate_media_asset_file_keys.rb` after this commit.
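For a sense of what a `file_key` looks like, a random 9-character base-62 string can be generated as below. This is a sketch using Ruby's standard library and not necessarily how the fix script generates keys; `generate_file_key` is a hypothetical name.

```ruby
require "securerandom"

# SecureRandom.alphanumeric draws from A-Z, a-z and 0-9 (62 symbols).
def generate_file_key
  SecureRandom.alphanumeric(9)
end

key = generate_file_key
keyspace = 62**9  # 13_537_086_546_263_552, about 1.35e16 possible keys
```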
Fix the Tags field in the BUR search form not finding all BURs
mentioning that tag. Specifically, tags that were part of a mass update,
and that were prefixed with `~` or `-` (OR tags and NOT tags), weren't
indexed as tags affected by the BUR.
This requires re-running script/fixes/064_initialize_bulk_update_request_tags.rb
to fix old BURs.
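The idea behind the fix can be sketched as stripping the `~` and `-` prefixes before indexing the tags. `tags_in_query` is a hypothetical helper; real search queries also contain metatags and other syntax that this ignores.

```ruby
# Hypothetical helper: list the tag names mentioned in a mass update
# query, ignoring the `~` (OR) and `-` (NOT) prefixes.
def tags_in_query(query)
  query.split.map { |term| term.sub(/\A[~-]/, "") }.reject(&:empty?).uniq
end

tags_in_query("~foo -bar baz")  # => ["foo", "bar", "baz"]
```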
Rationale:
* The spoilers tag is the most frequently removed tag from the default blacklist.
* It's frustrating for regular users to have posts randomly hidden because of trivial
spoilers from a series they don't care about.
* The spoilers tag is used way too liberally for things that aren't considered
spoilers on other sites.
* If you're looking up fanart on the internet, you should expect to see a certain
level of spoilers.
* The tag is used very inconsistently, with some characters like Nia_(blade)_(xenoblade)
getting the spoilers tag half the time and the rest of the time not.
* Save the filename for files uploaded from disk. This could be used in
the future to extract source data if the filename is from a known site.
* Save both the image URL and the page URL for files uploaded from
source. This is needed for multi-file uploads. The image URL is the
URL of the file actually downloaded from the source. This can be
different from the URL given by the user, if the user tried to upload
a sample URL and we automatically changed it to the original URL. The
page URL is the URL of the page containing the image. We don't always
know this, for example if someone uploads a Twitter image without the
bookmarklet, then we can't find the page URL.
* Add a fix script to backfill URLs for existing uploads. For file
uploads, the filename will be set to "unknown.jpg". For source
uploads, we fetch the source data again to get the image and page
URLs. This may fail for uploads that have been deleted from the
source since uploading.
Delete all old upload records from before the upload rework in abdab7a0a
/ f11c46b4f. Uploads from before the rework don't have any attached
media assets, so they're not valid under the new system because we can't
find which files they were for.
Before the rework, completed uploads were only saved for 1 hour, and
failed uploads were only saved for 3 days, so deleting this data
doesn't really lose anything that wouldn't have been deleted before.
Fix strings like "pokémon" (NFD form) and "pokémon" (NFC form) being
considered different strings in sources.
Also add a fix script to fix existing sources. There were only 15 posts
with unnormalized sources.
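The difference between the two forms can be seen with Ruby's built-in `String#unicode_normalize`. Normalizing to NFC here is an assumption made for the sake of the example.

```ruby
nfd = "poke\u0301mon"  # "e" followed by U+0301 COMBINING ACUTE ACCENT
nfc = "pok\u00e9mon"   # precomposed U+00E9

nfd == nfc                          # => false, different code points
nfd.length                          # => 8
nfc.length                          # => 7
nfd.unicode_normalize(:nfc) == nfc  # => true
```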
Fix it so that when a forum topic is deleted, all posts in the topic are
deleted too. Also make it so that when a forum topic is undeleted, all
posts in it are undeleted too.
Before, when a topic was deleted, only the topic itself was marked as
deleted, not the posts inside the topic. This meant that when a spam
topic was deleted, the OP wouldn't be marked as deleted, so any
modreports against it wouldn't be marked as handled.
Also change it so that it's not possible to undelete a post in a deleted
topic, or to delete the OP of a topic without deleting the topic itself.
Finally, add a fix script to delete all active posts in deleted topics,
and to undelete all deleted OPs in active topics.
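The rules above can be modeled with plain structs as follows. This is a simplified sketch, not the actual models; the struct and method names are hypothetical.

```ruby
ForumPost = Struct.new(:id, :is_deleted)

ForumTopic = Struct.new(:posts, :is_deleted) do
  # Deleting or undeleting a topic cascades to every post in it.
  def delete!
    self.is_deleted = true
    posts.each { |post| post.is_deleted = true }
  end

  def undelete!
    self.is_deleted = false
    posts.each { |post| post.is_deleted = false }
  end

  # A post in a deleted topic can't be undeleted on its own.
  def undelete_post!(post)
    raise ArgumentError, "topic is deleted" if is_deleted
    post.is_deleted = false
  end

  # The OP (first post) can't be deleted without deleting the topic.
  def delete_post!(post)
    raise ArgumentError, "delete the topic instead" if post.equal?(posts.first)
    post.is_deleted = true
  end
end
```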
* Add ability to mark moderation reports as 'handled' or 'rejected'.
* Automatically mark reports as handled when the comment or forum post
is deleted.
* Send a dmail to the reporter when their report is handled.
* Don't show the report notice on comments or forum posts when all
reports against it have been handled or rejected.
* Add a fix script to mark all existing reports for deleted comments,
forum posts, or dmails as handled.
There are a lot of old artist entries with Japanese names. These names
are now invalid and these artist entries can't be edited because they
fail validation checks.
Add a fix script to delete all artist entries with non-ASCII names, and
rename them to `artist_1234`.
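A sketch of the check involved, assuming the number in `artist_1234` stands for the artist's ID (an assumption, as are the `Artist` struct and the `retire_non_ascii_artist` name):

```ruby
# Hypothetical stand-in for an artist entry.
Artist = Struct.new(:id, :name, :is_deleted)

# Entries with ASCII-only names are left alone; the rest are renamed
# to a placeholder and marked as deleted.
def retire_non_ascii_artist(artist)
  return artist if artist.name.ascii_only?

  artist.name = "artist_#{artist.id}"
  artist.is_deleted = true
  artist
end
```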
Normalize artist group names following the same rules as artist other names.
This means artist group names now use underscores instead of spaces.
It also means extra space characters at the beginning and end of names
are stripped, and Unicode characters are normalized.
Fixes #4647, which was caused by users accidentally replacing group
names with a single space character when trying to remove a group.
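A minimal sketch of this kind of normalization, assuming NFC as the Unicode form and treating any run of whitespace as one underscore. `normalize_group_name` is a hypothetical name, and the exact rules used for artist other names may differ.

```ruby
# Hypothetical normalizer: Unicode-normalize, strip surrounding
# whitespace, then replace inner whitespace with underscores.
def normalize_group_name(name)
  name.unicode_normalize(:nfc).strip.gsub(/\s+/, "_")
end

normalize_group_name("  foo bar ")  # => "foo_bar"
normalize_group_name(" ")           # => ""
```

With this, a group name consisting of a single space collapses to an empty string instead of being stored as a name.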
Autotag non-web_source on posts whose source is not an http:// or https:// URL.
Add a fix script to backfill old posts.
Syntactically invalid URLs are still considered web sources. For
example, `https://google,com` technically isn't a valid URL, but it's
not considered a non-web source.
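The distinction can be sketched as a simple prefix check. The method names are hypothetical, and treating a blank source as neither web nor non-web is an assumption.

```ruby
# Only the scheme prefix is checked; the rest of the URL is not validated.
def web_source?(source)
  source.match?(%r{\Ahttps?://}i)
end

# Hypothetical rule: tag non-web_source when a source is present but
# doesn't start with http:// or https://.
def non_web_source?(source)
  !source.strip.empty? && !web_source?(source)
end

non_web_source?("https://google,com")  # => false, still a web source
non_web_source?("Artist's own book")   # => true
```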