Fix temp files generated during the upload process not being cleaned up quickly enough. This included
downloaded files, generated preview images, and Ugoira video conversions.
Previously, we relied on `Tempfile` to clean up files automatically, but this only happened
when the `Tempfile` object was garbage collected, which could take a long time. In the
meantime we could have hundreds of megabytes of temp files hanging around.
The fix is to explicitly close temp files when we're done with them. But the standard
`Tempfile` class doesn't immediately delete the file when it's closed, so we also introduce
a `Danbooru::Tempfile` wrapper that deletes the tempfile as soon as it's closed.
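A minimal sketch of the idea behind the wrapper (the real `Danbooru::Tempfile` may differ):
Ruby's `Tempfile#close` already accepts an `unlink_now` flag, so the wrapper just defaults
it to true.

    require "tempfile"

    module Danbooru
      # A Tempfile that deletes the underlying file as soon as it's closed,
      # instead of waiting for the object to be garbage collected.
      class Tempfile < ::Tempfile
        def close(unlink_now = true)
          super(unlink_now)
        end
      end
    end

    file = Danbooru::Tempfile.new("download")
    file.write("data")
    file.close # the file is removed from disk immediately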
Automatically add the `sound` tag if the post has sound. Remove the tag if the post doesn't have sound.
A video is considered to have sound if its peak loudness is greater than -70 dB. The current quietest post
on Danbooru has a peak loudness of -62 dB (post #3470668), but sound can still be audible at -80 dB
or even lower. It's hard to draw a clear line between "silent" and "barely audible".
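A sketch of how such a check might work (not necessarily Danbooru's exact command):
FFmpeg's volumedetect filter reports the peak volume on stderr.

    require "open3"

    # Returns the peak loudness in dB, or nil if the file has no audio stream.
    def peak_loudness(path)
      _out, err, _status = Open3.capture3("ffmpeg", "-i", path, "-vn", "-af", "volumedetect", "-f", "null", "-")
      err[/max_volume: (-?[\d.]+) dB/, 1]&.to_f
    end

    loudness = peak_loudness("video.mp4")
    has_sound = !loudness.nil? && loudness > -70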
If a media asset is corrupt, include the error message from libvips or
ffmpeg in the "Vips:Error" or "FFmpeg:Error" fields in the media
metadata table.
Corrupt files can't be uploaded nowadays, but they could be uploaded in the past, so we
have some old corrupted files that we can't generate thumbnails for. This lets us mark
these files in the metadata so they're findable with the tag search `exif:Vips:Error`.
Known bug: Vips has a single global error buffer that is shared between
threads and that isn't cleared between operations. So we can't reliably
get the actual error message because it may pick up errors from other
threads, or from previous operations in the same thread.
Add a `MediaAsset#regenerate!` method that regenerates everything about
the asset, including the metadata, thumbnails, IQDB, cached Cloudflare
URLs, and AI tags.
Fix it so that a) it's possible to regenerate media assets that aren't attached to posts,
and b) regenerating a post regenerates everything. Before, it didn't regenerate the
metadata, AI tags, or all of the cached URLs.
Fix it so that trying to regenerate AI tags for a Flash file doesn't
fail because Flash files have no image preview.
Also let `MediaFile.open` take a block argument.
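Presumably the block form works like Ruby's `File.open`, closing the file when the block
returns. A sketch (the `width`/`height` accessors are assumed):

    MediaFile.open("image.jpg") do |file|
      puts "#{file.width}x#{file.height}"
    end # the file is closed automatically here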
Fix certain corrupt GIFs returning dimensions of 0x0. This happened
when the GIF was too corrupt for libvips to read. Fixed by using
ExifTool to read the dimensions instead.
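A sketch of the fallback (the helper name is hypothetical); ExifTool can often recover
dimensions from files too corrupt for libvips to decode:

    require "json"
    require "shellwords"

    def dimensions_via_exiftool(path)
      json = `exiftool -json -ImageWidth -ImageHeight #{Shellwords.escape(path)}`
      data = JSON.parse(json).first
      [data["ImageWidth"].to_i, data["ImageHeight"].to_i]
    end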
Also add validations to ensure that it's not possible to have media
assets with a width or height of 0.
Add a script to go through every media asset and check the metadata
(width, height, duration, filesize, md5, EXIF metadata) and update it
if it's changed. This is necessary after upgrading ExifTool because the
metadata it returns may have changed.
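The core loop of such a script might look like this (a sketch; the file accessor and
column names are assumptions):

    MediaAsset.find_each do |asset|
      MediaFile.open(asset.original_file_path) do |file| # hypothetical accessor
        asset.update!(
          image_width: file.width,
          image_height: file.height,
          duration: file.duration,
          file_size: file.file_size
        )
      end
    end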
Fix StatementInvalid exception when uploading https://files.catbox.moe/vxoe2p.mp4.
This was a result of multiple bugs:
* First, generating thumbnails for the video failed. This was because
the video uses the AV1 codec, which FFmpeg failed to decode. It failed
because our version of FFmpeg was built without the `--enable-libdav1d`
flag, so it uses the builtin AV1 decoder, which apparently can't
handle this particular video (it spews a bunch of errors about "Failed
to get pixel format" and "missing sequence header" and "failed to get
reference frame").
* Because generating the thumbnails failed, an exception was raised. We
tried to save the error message in the upload_media_assets.error
field. However, this also failed because the error message was 77kb
long (it contained the entire output of the ffmpeg command), but the
`upload_media_assets` table had a btree index on the `error` column,
which meant the maximum length of the error column was limited to
~2.7kb. This led to a StatementInvalid exception being raised.
* Because the StatementInvalid exception was raised while we were trying
to set the upload media asset's status to `failed`, the upload was
left stuck in the `processing` state rather than being set to the
`failed` state.
* Because the upload was stuck in the `processing` state, the upload
page would hang forever waiting for the upload to complete.
The fixes are to:
* Build FFmpeg with `--enable-libdav1d` to use libdav1d for decoding AV1
videos instead of the builtin AV1 decoder.
* Remove the index on the `upload_media_assets.error` column so that
setting overly long error messages won't fail.
* Catch unexpected exceptions in ProcessUploadMediaAssetJob so we can
mark uploads as failed, even if `process_upload!` raises an unexpected
exception inside its own exception handler (see the sketch after this list).
* Check that the video is playable with `MediaFile::Video#is_corrupt?` before
allowing it to be uploaded. This way we can return a better error
message if we can't generate thumbnails because the video isn't
playable. This requires decoding the entire video, so it means uploads
may take several seconds longer for long videos. It's also a security
risk in case ffmpeg has any bugs.
* Define `MediaAsset#preview!` as raising an exception on error, so
it's clear that generating thumbnails can fail. Define `MediaAsset#preview`
as returning nil on error for when we don't care about the cause of
the error.
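A minimal sketch of the job-level safety net from the third fix above (attribute names
are assumptions, not Danbooru's actual code):

    class ProcessUploadMediaAssetJob < ApplicationJob
      def perform(upload_media_asset)
        upload_media_asset.process_upload!
      rescue Exception => error
        # Even if process_upload!'s own error handling blew up, make sure
        # the asset doesn't stay stuck in the `processing` state.
        upload_media_asset.update_columns(status: "failed", error: error.message)
        raise
      end
    end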
Add a JPEG conversion for .avif and .webp files. The `full` variant is
the .avif or .webp file converted to JPEG format, with the same
resolution as the original file (full resolution).
Known bug: When converting an HDR .avif file to .jpeg, the resulting
image is too bright compared to the original image as rendered by
Firefox or Chrome.
Add ability to upload .webp images.
Animated WebP images aren't supported. This is because they aren't
supported by FFmpeg yet[1], so generating thumbnails and samples for
them would be more complicated than for other formats.
[1]: https://trac.ffmpeg.org/ticket/4907
Add ability to upload .avif images.
Features of AVIF include:
* Lossless and lossy compression.
* High dynamic range (HDR) images.
* Wide color gamut images (i.e. 10- and 12-bit color depths).
* Transparency (through alpha planes).
* Animations (with an optional cover image).
* Auxiliary image sequences, where the file contains a single primary
image and a short secondary video, like Apple's Live Photos.
* Metadata rotation, mirroring, and cropping.
The AVIF format is still relatively new and some of these features aren't well
supported by browsers or other software:
* Animated AVIFs aren't supported by Firefox or by libvips.
* HDR images aren't supported by Firefox.
* Rotated, mirrored, and cropped AVIFs aren't supported by Firefox or Chrome.
* Image grids, where the file contains multiple images that are tiled
together into one big image, aren't supported by Firefox.
* AVIF as a whole has only been supported for a year or two by Chrome
and Firefox, and less than a year by Safari.
For these reasons, only basic AVIFs that don't use animation, rotation,
cropping, or image grids can be uploaded.
Add config options to customize where uploads are stored, and how image URLs are generated.
* Add `media_asset_file_path` option to customize where uploads are stored.
* Add `media_asset_file_url` option to customize how image URLs are generated.
* Remove the `enable_seo_post_urls` config option. The `media_asset_file_url` option
should be used instead to include the tags in the image URL.
* Replace the "Download" placeholder thumbnail for Flash files with a
new placeholder that specifically says it's a Flash file.
* Fix a bug where the Flash placeholder thumbnail was too small when
using larger thumbnail sizes.
* Fix it so that media assets don't falsely consider Flash files to have
thumbnails. This could potentially cause errors if someone tried to
expunge, replace, or regenerate a Flash post.
Remove the last remaining uses of the PixivUgoiraFrameData model. As of
32bfb8407, Ugoira frame data is now stored in the MediaMetadata model,
under the `Ugoira:FrameDelays` EXIF field.
The pixiv_ugoira_frame_data table still exists, but it can be removed
after this commit is deployed.
Fixes #5264: Error when replacing with ugoira.
Automatically add the AI-generated tag to posts that have the
`PNG:Software=NovelAI` EXIF attribute.
This is not foolproof because this metadata may get removed if an
AI-generated post is resaved or uploaded to a site that strips EXIF
metadata. It also only works for NovelAI. Currently it detects 29 out of
177 AI-generated uploads on Danbooru.
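A sketch of the check (the metadata accessor is an assumption):

    class MediaAsset
      def ai_generated?
        media_metadata.metadata["PNG:Software"] == "NovelAI"
      end
    end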
Add ability to search the /media_assets index by AI tags. Multi-tag
searches are supported, including AND/OR/NOT operators, but metatags
aren't supported. Multi-tag searches will probably be slow.
The default AI tag confidence threshold is 50%. There's a hidden
search[min_score] URL param that lets you change this.
Add a database model for storing AI-predicted tags, and add a UI for browsing and searching these tags.
AI tags are generated by the Danbooru Autotagger (https://github.com/danbooru/autotagger). See that
repo for details about the model.
The database schema is `ai_tags (media_asset_id integer, tag_id integer, score smallint)`. This is
designed to be as space-efficient as possible, since in production we have over 300 million
AI-generated tags (6 million images × 50 tags per image). This amounts to over 10GB in size, plus
indexes.
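A sketch of the migration implied by this schema (the index layout is an assumption;
`limit: 2` maps to a PostgreSQL smallint):

    class CreateAiTags < ActiveRecord::Migration[7.0]
      def change
        create_table :ai_tags, id: false do |t|
          t.integer :media_asset_id, null: false
          t.integer :tag_id, null: false
          t.integer :score, limit: 2, null: false
        end

        add_index :ai_tags, [:tag_id, :media_asset_id]
      end
    end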
You can search for AI tags using e.g. `ai:scenery`. You can do `ai:scenery -scenery` to find posts
where the scenery tag is potentially missing, or `scenery -ai:scenery` to find posts that are
potentially mistagged (or more likely where the AI missed the tag).
You can browse AI tags at https://danbooru.donmai.us/ai_tags. On this page you can filter by
confidence level. You can also search unposted media assets by AI tag.
To generate tags, use the `autotag` script from the Autotagger repo, something like this:
docker run --rm -v ~/danbooru/public/data/360x360:/images ghcr.io/danbooru/autotagger ./autotag -c -f /images | gzip > tags.csv.gz
To import tags, use the fix script in script/fixes/. Expect a Danbooru-size dataset to take
hours to days to generate tags, then 20-30 minutes to import. Currently this all has to be done by hand.
Add `file_key` and `is_public` columns to media assets.
`file_key` is a random 9-character base-62 string that will be used as the image filename
in the future.
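A sketch of how such a key might be generated (the actual generator may differ):

    require "securerandom"

    BASE62 = [*"0".."9", *"A".."Z", *"a".."z"].freeze

    def generate_file_key
      Array.new(9) { BASE62[SecureRandom.random_number(62)] }.join
    end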
`is_public` indicates whether the image can be viewed without authentication.
Users running downstream boorus must run `bin/rails db:migrate` and
`script/fixes/109_generate_media_asset_file_keys.rb` after this commit.
Followup to 093a808a3. Using a NOT EXISTS clause is much faster than the
`LEFT OUTER JOIN posts WHERE posts.id IS NULL` clause generated by
`.where.missing(:post)`.
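Roughly (the join condition on md5 is an assumption about how posts reference media
assets):

    # Fast: NOT EXISTS.
    MediaAsset.where("NOT EXISTS (SELECT 1 FROM posts WHERE posts.md5 = media_assets.md5)")

    # Slow: what `.where.missing(:post)` generates.
    #   SELECT media_assets.* FROM media_assets
    #   LEFT OUTER JOIN posts ON posts.md5 = media_assets.md5
    #   WHERE posts.id IS NULL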
Fix an error when trying to upload a file larger than the file size limit. In this case we
tried to dump the whole HTTP response into the error message, including the binary file
itself, which raised an exception because the message contained null bytes.
Make the upload page automatically detect when a source URL has multiple images
and let the user choose which images to post.
For example, when uploading a Twitter or Pixiv post with more than one image, we
direct the user to a page showing a thumbnail for each image and letting
them choose which ones to post.
This is similar to the batch upload page, except we actually download each image
in the background, instead of just hotlinking or proxying the thumbnails through
our servers. This avoids various problems with proxying and makes new features
possible, like showing which images in the batch have already been posted.
Make uploads faster by generating and saving thumbnails in parallel.
We generate each thumbnail in parallel, then send each thumbnail to the
backend image servers in parallel.
Most images have 5 variants: 'preview' (150x150), 180x180, 360x360,
720x720, and 'sample' (850px width). Plus the original file, that's 6
files we have to save. In production we have 2 image servers, so each file has to be
saved to both remote servers. Doing all this in parallel
should make uploads significantly faster.
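A minimal sketch of the parallelism using threads (the `generate` and `store` methods
are hypothetical stand-ins for the real variant and server APIs):

    def store_variants_in_parallel(variants, image_servers, original_file)
      # Generate all the thumbnail variants concurrently.
      thumbnails = variants.map { |variant| Thread.new { variant.generate } }.map(&:value)

      # Then save every file to every image server concurrently.
      (thumbnails + [original_file]).product(image_servers).map do |file, server|
        Thread.new { server.store(file) }
      end.each(&:join)
    end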
Add a thumbnail view to the /media_assets page. This page lets you see
all images uploaded to Danbooru by all users (although you can't see who
the uploader is). Also add a link to this page in the subnav bar on the
upload page.
Automatically merge tags when uploading a duplicate.
There are two cases:
* You try to upload an image, but it's already on Danbooru. In this case
you'll be immediately redirected to the original post, before you
can start tagging the upload.
* You're uploading an image, it wasn't a dupe when you first opened the
upload page, but you got sniped while tagging it. In this case your tags
will be merged with the original post, and you will be redirected to the
original post.
There are a few corner cases:
* If you don't have permission to edit the original post, for example
because it's banned or has a censored tag, then your tags won't be
merged and will be silently ignored.
* Only the tags, rating, and parent ID will be merged. The source and
artist commentary won't be merged. This is so that if an artist uploads
the exact same file to multiple sites, the new source won't override
the original source.
* Some tags might be contradictory. For example, the new post might
be tagged translation_request, but the original post might already be
translated. It's up to the user to fix these things afterwards.
* Fix broken upload tests.
* Fix uploads to return an error if both a file and a source are given
at the same time, or if neither is given. Also fix the error message
in this case so that it doesn't include "base" at the start of the string.
* Fix uploads to percent-encode any Unicode characters in the source URL (see the sketch after this list).
* Add a max filesize validation to media assets.
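For the URL encoding fix, normalization along these lines does the job (whether Danbooru
uses the Addressable gem for this exact step is an assumption):

    require "addressable/uri"

    Addressable::URI.parse("https://example.com/画像.jpg").normalize.to_s
    # => "https://example.com/%E7%94%BB%E5%83%8F.jpg"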
Rework the upload process so that files are saved to Danbooru first
before the user starts tagging the upload.
The main user-visible change is that you have to select the file first
before you can start tagging it. Saving the file first lets us fix a
number of problems:
* We can check for dupes before the user tags the upload.
* We can perform dupe checks and show preview images for users not using the bookmarklet.
* We can show preview images without having to proxy images through Danbooru.
* We can show previews of videos and ugoira files.
* We can reliably show the filesize and resolution of the image.
* We can let the user save files to upload later.
* We can get rid of a lot of spaghetti code related to preprocessing
uploads. This was the cause of most weird "md5 confirmation doesn't
match md5" errors.
(Not all of these are implemented yet.)
Internally, uploading is now a two-step process: first we create an upload
object, then we create a post from the upload. This is how it works:
* The user goes to /uploads/new and chooses a file or pastes a URL into
the file upload component.
* The file upload component calls `POST /uploads` to create an upload.
* `POST /uploads` immediately returns a new upload object in the `pending` state.
* Danbooru starts processing the upload in a background job (downloading,
resizing, and transferring the image to the image servers).
* The file upload component polls `/uploads/$id.json`, checking the
upload `status` until it returns `completed` or `error`.
* When the upload status is `completed`, the user is redirected to /uploads/$id.
* On the /uploads/$id page, the user can tag the upload and submit it.
* The upload form calls `POST /posts` to create a new post from the upload.
* The user is redirected to the new post.
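From an API client's point of view, the flow looks roughly like this (endpoints as
described above; authentication is omitted and the JSON field names are assumptions):

    require "net/http"
    require "json"
    require "uri"

    base = "https://danbooru.donmai.us"

    # Step 1: create the upload.
    response = Net::HTTP.post_form(URI("#{base}/uploads.json"), "upload[source]" => "https://example.com/image.jpg")
    upload = JSON.parse(response.body)

    # Step 2: poll until processing finishes.
    loop do
      upload = JSON.parse(Net::HTTP.get(URI("#{base}/uploads/#{upload["id"]}.json")))
      break if %w[completed error].include?(upload["status"])
      sleep 1
    end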
This is the data model:
* An upload represents a set of files uploaded to Danbooru by a user.
Uploaded files don't have to belong to a post. An upload has an
uploader, a status (pending, processing, completed, or error), a
source (unless uploading from a file), and a list of media assets
(image or video files).
* There is a has-and-belongs-to-many relationship between uploads and
media assets. An upload can have many media assets, and a media asset
can belong to multiple uploads. Uploads are joined to media assets
through an `upload_media_assets` table.
An upload could potentially have multiple media assets if it's a Pixiv
or Twitter gallery. This is not yet implemented (at the moment all
uploads have one media asset).
A media asset can belong to multiple uploads if multiple people try
to upload the same file, or if the same user tries to upload the same
file more than once.
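In Rails terms, the data model corresponds to associations like these (a sketch
following the table names above):

    class Upload < ApplicationRecord
      has_many :upload_media_assets
      has_many :media_assets, through: :upload_media_assets
    end

    class MediaAsset < ApplicationRecord
      has_many :upload_media_assets
      has_many :uploads, through: :upload_media_assets
    end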
New features:
* On the upload page, you can press Ctrl+V to paste a URL and immediately upload it.
* You can save files for upload later. Your saved files are at /uploads.
Fixes:
* Improved error messages when uploading invalid files, bad URLs, and
when forgetting the rating.
Do a few micro-optimizations to reduce the number of memory allocations
during thumbnail generation.
This commit, combined with freezing string literals in a7dc05 and
67b961, reduces the number of allocations on the front page from 180,000
to 150,000, and the number of retained objects from 8,000 to 4,000.
Fix an error being raised when trying to generate thumbnails for corrupt
files. If the original image is corrupt, then ignore any errors and let
libvips try to generate a thumbnail as best it can. This will usually
result in an incomplete thumbnail.
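With ruby-vips, tolerant decoding looks roughly like this (`fail: false` is the loader
option that tells libvips to keep going on decode errors; whether this is exactly how
Danbooru does it is an assumption):

    require "vips"

    # Decode as much of the corrupt image as possible instead of raising.
    image = Vips::Image.new_from_file("corrupt.gif", fail: false)
    thumbnail = image.thumbnail_image(360)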
Add a 720x720 WebP thumbnail variant. This is used to provide higher resolution
thumbnails for high pixel density displays, such as phones or laptops. If your screen
has a 2x pixel density ratio, then 360x360 thumbnails will be rendered at 720x720
resolution.
We use WebP here because it's about 15% smaller than the equivalent
JPEG, and because if a device has a high enough pixel density to use
this, then it probably supports WebP.
720x720 thumbnails average about 36kb in size, compared to 20.35kb for
360x360 thumbnails and 7.55kb for 180x180 thumbnails.
Calculate the dimensions of thumbnails ourselves instead of letting
libvips calculate them for us. This way we know the exact size of
thumbnails, so we can set the right width and height for <img> tags. If
we let libvips calculate thumbnail sizes for us, then we can't predict
the exact size of thumbnails, because sometimes libvips rounds numbers
differently than us.
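A sketch of the calculation (the exact rounding rule is an assumption): scale to fit the
target box, preserving the aspect ratio and never upscaling.

    def thumbnail_dimensions(width, height, max_width, max_height)
      scale = [max_width.to_f / width, max_height.to_f / height, 1.0].min
      [(width * scale).round, (height * scale).round]
    end

    thumbnail_dimensions(1920, 1080, 360, 360) # => [360, 203]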
Bug: If a media asset got stuck in the 'processing' state during upload,
then it would stay stuck forever and the file couldn't be uploaded again
later.
Fix: Mark stuck assets as failed before raising the "Upload failed"
error. Once the asset is marked as failed, it can be uploaded again
later. Also, only wait for assets to finish processing if they were
uploaded less than 5 minutes ago. If a processing asset is more than 5
minutes old, consider it stuck and mark it as failed immediately.
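A sketch of the staleness rule (the status values and timestamp column are assumptions):

    def fail_if_stuck!(media_asset)
      if media_asset.status == "processing" && media_asset.created_at < 5.minutes.ago
        # The original upload presumably died without running its error
        # handler; mark the asset failed so the file can be uploaded again.
        media_asset.update!(status: "failed")
      end
    end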
Assets getting stuck in the processing state is a 'this should never
happen' error. Normally if any kind of exception is raised while
uploading the asset, the asset will be set to the 'failed' state. The
only way an asset can get stuck is if it fails and the exception handler
doesn't run, or the exception handler itself fails. This might happen if
the process is unexpectedly killed, or possibly if the HTTP request
times out and a TimeoutError is raised at an inopportune time. See the links below
for discussion of the problems with Ruby's Timeout.
[1]: https://vaneyckt.io/posts/the_disaster_that_is_rubys_timeout_method/
[2]: https://jvns.ca/blog/2015/11/27/why-rubys-timeout-is-dangerous-and-thread-dot-raise-is-terrifying/
[3]: https://adamhooper.medium.com/in-ruby-dont-use-timeout-77d9d4e5a001
[4]: https://ruby-doc.org/core-3.0.2/Thread.html#method-c-handle_interrupt-label-Guarding+from+Timeout-3A-3AError