Commit Graph

48 Commits

Author SHA1 Message Date
evazion
e3af738371 tests: fix broken tests. 2022-08-24 02:03:37 -05:00
evazion
d7e08d1313 media assets: add ability to search by AI tags.
Add ability to search the /media_assets index by AI tags. Multi-tag
searches are supported, including AND/OR/NOT operators, but metatags
aren't supported. Multi-tag searches will probably be slow.

The default AI tag confidence threshold is 50%. There's a hidden
search[min_score] URL param that lets you change this.
2022-07-06 01:38:41 -05:00
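The threshold behaviour described in d7e08d1313 can be sketched roughly as follows. This is an in-memory illustration, not the actual Danbooru query: the `AITag` struct, the `filter_by_confidence` helper, and the 0-100 score scale are assumptions. Only the 50% default and the `search[min_score]` parameter name come from the commit message.

```ruby
# Hypothetical in-memory stand-in for one AI tag prediction.
AITag = Struct.new(:media_asset_id, :tag, :score)

# The commit message states a default confidence threshold of 50%.
DEFAULT_MIN_SCORE = 50

# Keep only the predictions whose score meets the threshold
# (scores are assumed to be on a 0-100 scale).
def filter_by_confidence(ai_tags, min_score: DEFAULT_MIN_SCORE)
  ai_tags.select { |ai_tag| ai_tag.score >= min_score }
end

predictions = [
  AITag.new(1, "scenery", 92),
  AITag.new(1, "cloud", 49),
  AITag.new(2, "scenery", 50),
]

puts filter_by_confidence(predictions).length
puts filter_by_confidence(predictions, min_score: 40).length
```

Passing a lower `min_score` plays the role of the hidden `search[min_score]` URL param: it widens the result set at the cost of more false positives.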
evazion
67798c9ece Fix #5221: Trying to upload an unsupported url shows ai tags error. 2022-07-01 18:13:36 -05:00
evazion
a9fe73a483 ai tags: save ai tags on upload.
Save the AI tags when a media asset is uploaded.
2022-06-28 03:12:46 -05:00
evazion
6f24db92e5 ai tags: make ai tags accessible in api via includes.
Make these things work:

* https://danbooru.donmai.us/posts.json?only=ai_tags
* https://danbooru.donmai.us/media_assets.json?only=ai_tags
* https://danbooru.donmai.us/ai_tags.json?only=media_asset,post,tag
2022-06-26 20:37:35 -05:00
evazion
1aeb52186e Add AI tag model and UI.
Add a database model for storing AI-predicted tags, and add a UI for browsing and searching these tags.

AI tags are generated by the Danbooru Autotagger (https://github.com/danbooru/autotagger). See that
repo for details about the model.

The database schema is `ai_tags (media_asset_id integer, tag_id integer, score smallint)`. This is
designed to be as space-efficient as possible, since in production we have over 300 million
AI-generated tags (6 million images and 50 tags per post). This amounts to over 10GB in size, plus
indexes.

You can search for AI tags using e.g. `ai:scenery`. You can do `ai:scenery -scenery` to find posts
where the scenery tag is potentially missing, or `scenery -ai:scenery` to find posts that are
potentially mistagged (or more likely where the AI missed the tag).

You can browse AI tags at https://danbooru.donmai.us/ai_tags. On this page you can filter by
confidence level. You can also search unposted media assets by AI tag.

To generate tags, use the `autotag` script from the Autotagger repo, something like this:

  docker run --rm -v ~/danbooru/public/data/360x360:/images ghcr.io/danbooru/autotagger ./autotag -c -f /images | gzip > tags.csv.gz

To import tags, use the fix script in script/fixes/. Expect a Danbooru-size dataset to take
hours to days to generate tags, then 20-30 minutes to import. Currently this all has to be done by hand.
2022-06-24 04:54:26 -05:00
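The two searches mentioned in 1aeb52186e (`ai:scenery -scenery` and `scenery -ai:scenery`) amount to set differences between human-applied tags and AI-predicted tags. A minimal sketch, assuming a hypothetical `Post` struct holding both tag sets in memory (the real search runs against the database):

```ruby
require "set"

# Hypothetical post record: human-applied tags plus AI-predicted tags.
Post = Struct.new(:id, :tags, :ai_tags)

# `ai:TAG -TAG`: the AI predicts the tag, but the post lacks it.
def possibly_missing(posts, tag)
  posts.select { |post| post.ai_tags.include?(tag) && !post.tags.include?(tag) }
end

# `TAG -ai:TAG`: the post has the tag, but the AI did not predict it.
def possibly_mistagged(posts, tag)
  posts.select { |post| post.tags.include?(tag) && !post.ai_tags.include?(tag) }
end

posts = [
  Post.new(1, Set["scenery", "cloud"], Set["scenery"]),
  Post.new(2, Set["1girl"], Set["scenery", "1girl"]),
  Post.new(3, Set["scenery"], Set["1girl"]),
]

puts possibly_missing(posts, "scenery").map(&:id).inspect
puts possibly_mistagged(posts, "scenery").map(&:id).inspect
```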
evazion
181639368c posts: add is: and has: metatags.
Add the following metatags:

* is:parent
* is:child
* is:safe
* is:questionable
* is:explicit
* is:sfw (same as -rating:q,e)
* is:nsfw (same as rating:q,e)
* is:active
* is:deleted
* is:pending
* is:flagged
* is:appealed
* is:banned
* is:modqueue
* is:unmoderated
* is:jpg
* is:png
* is:gif
* is:mp4
* is:webm
* is:swf
* is:zip
* has:parent
* has:children
* has:source
* has:appeals
* has:flags
* has:replacements
* has:comments
* has:commentary
* has:notes
* has:pools

All of these searches were already possible with other metatags, but these might be more convenient.
2022-05-18 13:04:15 -05:00
evazion
4ba993319a media assets: add file_key, is_public columns.
`file_key` is a random 9-character base-62 string that will be used as
the image filename in the future.

`is_public` is whether the image can be viewed without authentication or not.

Users running downstream boorus must run `bin/rails db:migrate` and
`script/fixes/109_generate_media_asset_file_keys.rb` after this commit.
2022-05-04 23:19:53 -05:00
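A random 9-character base-62 string like the `file_key` from 4ba993319a could be produced along these lines. This is only a sketch: `generate_file_key` is a hypothetical helper name, and the commit message does not say which generator Danbooru uses. `SecureRandom.alphanumeric` draws from `A-Z`, `a-z`, and `0-9`, which is 62 symbols.

```ruby
require "securerandom"

# Generate a random string drawn from [A-Za-z0-9] (62 possible symbols).
# The default length of 9 matches the commit message.
def generate_file_key(length = 9)
  SecureRandom.alphanumeric(length)
end

puts generate_file_key
```

With 62 symbols and 9 positions there are 62^9 (about 1.35 × 10^16) possible keys, so accidental collisions are unlikely, though a uniqueness check would still be needed to rule them out.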
evazion
ac98c142a4 posts: move expunged image to trash folder.
When a post is expunged, move the image to a trash folder so it can be
recovered if needed.
2022-05-03 05:51:09 -05:00
Michał Frąckiewicz
93635a20d9 Configurable max video duration 2022-03-21 19:22:34 +01:00
evazion
fc5aec7de0 media assets: optimize /media_assets?search[is_posted] query.
Followup to 093a808a3. Using a NOT EXISTS clause is much faster than the
`LEFT OUTER JOIN posts WHERE posts.id IS NULL` clause generated by
`.where.missing(:post)`.
2022-02-18 04:24:33 -06:00
evazion
093a808a36 Fix #4986: Add ability to filter images in /media_assets and /uploads depending on if they have become posts 2022-02-18 03:39:08 -06:00
evazion
e4d7453180 uploads: improve error messages.
Improve upload error messages when downloading a URL fails, or when the
URL isn't an image or video file.
2022-02-15 18:54:55 -06:00
evazion
87a00a1182 uploads: fix "ArgumentError: string contains null byte" error
Fix an error when trying to upload a file larger than the file size
limit. In this case we tried to dump the whole HTTP response into the
error message, which included the binary file itself, which caused this
exception because it contained null bytes.
2022-02-15 18:16:47 -06:00
evazion
02edb52569 uploads: enable multi-file uploads when uploading from source.
Make the upload page automatically detect when a source URL has multiple images
and let the user choose which images to post.

For example, when uploading a Twitter or Pixiv post with more than one image, we
direct the user to a page showing a thumbnail for each image and letting
them choose which ones to post.

This is similar to the batch upload page, except we actually download each image
in the background, instead of just hotlinking or proxying the thumbnails through
our servers. This avoids various problems with proxying and makes new features
possible, like showing which images in the batch have already been posted.
2022-02-14 16:13:55 -06:00
evazion
e7744cb6e3 uploads: generate thumbnails in parallel.
Make uploads faster by generating and saving thumbnails in parallel.

We generate each thumbnail in parallel, then send each thumbnail to the
backend image servers in parallel.

Most images have 5 variants: 'preview' (150x150), 180x180, 360x360,
720x720, and 'sample' (850px width). Plus the original file, that's 6
files we have to save. In production we have 2 image servers, so we have
to save each file twice, to 2 remote servers. Doing all this in parallel
should make uploads significantly faster.
2022-02-04 16:20:50 -06:00
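The two parallel stages described in e7744cb6e3 can be sketched with plain Ruby threads. Everything below is illustrative: `generate_variant`, `store`, and the server names are hypothetical stand-ins, and the commit message does not say which concurrency primitive Danbooru actually uses.

```ruby
# Thumbnail names follow the commit message; the original file is saved too.
THUMBNAILS = %w[preview 180x180 360x360 720x720 sample].freeze
SERVERS = %w[server-a server-b].freeze # placeholder names for the 2 image servers

# Stand-in for resizing; a real implementation would call an image library.
def generate_variant(name, original_data)
  "#{name}:#{original_data}"
end

# Stand-in for a network transfer to one image server.
def store(server, name, data)
  [server, name, data.bytesize]
end

def process_upload(original_data)
  # Stage 1: generate every thumbnail in its own thread.
  threads = THUMBNAILS.map do |name|
    Thread.new { [name, generate_variant(name, original_data)] }
  end
  files = threads.map(&:value) + [["original", original_data]]

  # Stage 2: send every file to every server, one thread per transfer.
  transfers = files.product(SERVERS).map do |(name, data), server|
    Thread.new { store(server, name, data) }
  end
  transfers.map(&:value)
end

puts process_upload("raw-image-bytes").length
```

Six files times two servers gives twelve transfers. In CRuby, threads mostly help when the work is I/O-bound or done in native code that releases the global lock, which is plausibly the case for network transfers.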
evazion
92a4d045e2 media assets: add thumbnail view to /media_assets page.
Add a thumbnail view to the /media_assets page. This page lets you see
all images uploaded to Danbooru by all users (although you can't see who
the uploader is). Also add a link to this page in the subnav bar on the
upload page.
2022-02-02 01:12:56 -06:00
evazion
43c4158d36 uploads: merge tags when a duplicate is uploaded (fix #3130).
Automatically merge tags when uploading a duplicate.

There are two cases:

* You try to upload an image, but it's already on Danbooru. In this case
  you'll be immediately redirected to the original post, before you
  can start tagging the upload.

* You're uploading an image, it wasn't a dupe when you first opened the
  upload page, but you got sniped while tagging it. In this case your tags
  will be merged with the original post, and you will be redirected to the
  original post.

There are a few corner cases:

* If you don't have permission to edit the original post, for example
  because it's banned or has a censored tag, then your tags won't be
  merged and will be silently ignored.

* Only the tags, rating, and parent ID will be merged. The source and
  artist commentary won't be merged. This is so that if an artist uploads
  the exact same file to multiple sites, the new source won't override
  the original source.

* Some tags might be contradictory. For example, the new post might
  be tagged translation_request, but the original post might already be
  translated. It's up to the user to fix these things afterwards.
2022-01-30 03:14:22 -06:00
evazion
11b7bcac91 uploads: fix broken tests.
* Fix broken upload tests.
* Fix uploads to return an error if both a file and a source are given
  at the same time, or if neither is given. Also fix the error message
  in this case so that it doesn't include "base" at the start of the string.
* Fix uploads to percent-encode any Unicode characters in the source URL.
* Add a max filesize validation to media assets.
2022-01-29 05:14:49 -06:00
evazion
abdab7a0a8 uploads: rework upload process.
Rework the upload process so that files are saved to Danbooru first
before the user starts tagging the upload.

The main user-visible change is that you have to select the file first
before you can start tagging it. Saving the file first lets us fix a
number of problems:

* We can check for dupes before the user tags the upload.
* We can perform dupe checks and show preview images for users not using the bookmarklet.
* We can show preview images without having to proxy images through Danbooru.
* We can show previews of videos and ugoira files.
* We can reliably show the filesize and resolution of the image.
* We can let the user save files to upload later.
* We can get rid of a lot of spaghetti code related to preprocessing
  uploads. This was the cause of most weird "md5 confirmation doesn't
  match md5" errors.

(Not all of these are implemented yet.)

Internally, uploading is now a two-step process: first we create an upload
object, then we create a post from the upload. This is how it works:

* The user goes to /uploads/new and chooses a file or pastes a URL into
  the file upload component.
* The file upload component calls `POST /uploads` to create an upload.
* `POST /uploads` immediately returns a new upload object in the `pending` state.
* Danbooru starts processing the upload in a background job (downloading,
  resizing, and transferring the image to the image servers).
* The file upload component polls `/uploads/$id.json`, checking the
  upload `status` until it returns `completed` or `error`.
* When the upload status is `completed`, the user is redirected to /uploads/$id.
* On the /uploads/$id page, the user can tag the upload and submit it.
* The upload form calls `POST /posts` to create a new post from the upload.
* The user is redirected to the new post.

This is the data model:

* An upload represents a set of files uploaded to Danbooru by a user.
  Uploaded files don't have to belong to a post. An upload has an
  uploader, a status (pending, processing, completed, or error), a
  source (unless uploading from a file), and a list of media assets
  (image or video files).

* There is a has-and-belongs-to-many relationship between uploads and
  media assets. An upload can have many media assets, and a media asset
  can belong to multiple uploads. Uploads are joined to media assets
  through an upload_media_assets table.

  An upload could potentially have multiple media assets if it's a Pixiv
  or Twitter gallery. This is not yet implemented (at the moment all
  uploads have one media asset).

  A media asset can belong to multiple uploads if multiple people try
  to upload the same file, or if the same user tries to upload the same
  file more than once.

New features:

* On the upload page, you can press Ctrl+V to paste a URL and immediately upload it.
* You can save files for upload later. Your saved files are at /uploads.

Fixes:

* Improved error messages when uploading invalid files, bad URLs, and
  when forgetting the rating.
2022-01-28 04:13:22 -06:00
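The polling step of the flow in abdab7a0a8 can be sketched as a loop that stops on a terminal status. This is an assumption-laden illustration: `wait_for_upload` and `fetch_status` are hypothetical names, and the lambda below simulates the server instead of issuing a real request to `/uploads/$id.json`.

```ruby
# Statuses after which polling stops, per the commit message.
TERMINAL_STATES = %w[completed error].freeze

# Call `fetch_status` repeatedly until it reports a terminal state.
# `fetch_status` stands in for a GET request to /uploads/$id.json.
def wait_for_upload(fetch_status, interval: 0.0, max_attempts: 50)
  max_attempts.times do
    status = fetch_status.call
    return status if TERMINAL_STATES.include?(status)

    sleep(interval)
  end
  raise "upload did not finish after #{max_attempts} polls"
end

# Simulated server: pending -> processing -> completed.
responses = %w[pending processing completed].each
puts wait_for_upload(-> { responses.next })
```

A real client would use a non-zero `interval`; the `max_attempts` cap is an added safeguard so a stuck upload cannot make the caller poll forever.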
evazion
1c5786d20f posts: remove cropped thumbnails. 2021-12-16 15:58:29 -06:00
evazion
163ba8e7da posts: micro-optimize allocations during thumbnail generation.
Do a few micro-optimizations to reduce the number of memory allocations
during thumbnail generation.

This commit, combined with freezing string literals in a7dc05 and
67b961, reduces the number of allocations on the front page from 180,000
to 150,000, and the number of retained objects from 8,000 to 4,000.
2021-12-16 00:53:48 -06:00
evazion
a7dc05ce63 Enable frozen string literals.
Make all string literals immutable by default.
2021-12-14 21:33:27 -06:00
evazion
c22f7b799b media assets: fix error when generating thumbnails for corrupt files.
Fix an error being raised when trying to generate thumbnails for corrupt
files. If the original image is corrupt, then ignore any errors and let
libvips try to generate a thumbnail as best it can. This will usually
result in an incomplete thumbnail.
2021-12-05 21:46:14 -06:00
evazion
ad49a10147 media assets: fix bug in thumbnail generation.
Fix thumbnail generation throwing a NoMatchingPatternError.
2021-12-05 19:04:17 -06:00
evazion
9cb70fa632 posts: add 720x720 thumbnail size.
This is used to provide higher resolution thumbnails for high pixel
density displays, such as phones or laptops. If your screen has a 2x
pixel density ratio, then 360x360 thumbnails will be rendered at 720x720
resolution.

We use WebP here because it's about 15% smaller than the equivalent
JPEG, and because if a device has a high enough pixel density to use
this, then it probably supports WebP.

720x720 thumbnails average about 36 KB in size, compared to 20.35 KB for
360x360 thumbnails and 7.55 KB for 180x180 thumbnails.
2021-12-05 09:19:29 -06:00
evazion
17537084fe posts: generate 180x180px and 360x360px thumbnails (#4932).
Add two new thumbnail sizes. These new thumbnail sizes are generated on
upload, but not used yet.
2021-12-02 23:42:44 -06:00
evazion
e5ba6d4afc MediaFile: fix thumbnail dimension calculation.
Calculate the dimensions of thumbnails ourselves instead of letting
libvips calculate them for us. This way we know the exact size of
thumbnails, so we can set the right width and height for <img> tags. If
we let libvips calculate thumbnail sizes for us, then we can't predict
the exact size of thumbnails, because sometimes libvips rounds numbers
differently than us.
2021-12-01 04:45:26 -06:00
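Computing thumbnail dimensions up front, as e5ba6d4afc describes, might look like the sketch below. The helper name, the round-to-nearest rule, and the choice not to upscale small images are all assumptions; the commit message does not specify the exact formula Danbooru uses.

```ruby
# Fit (width, height) inside a max_width x max_height box, keeping the
# aspect ratio. Rounding is done here, explicitly, so the caller knows the
# exact output size. Images already smaller than the box are left as-is
# (an assumption).
def thumbnail_dimensions(width, height, max_width, max_height)
  scale = [max_width.to_f / width, max_height.to_f / height, 1.0].min
  [
    (width * scale).round.clamp(1, max_width),
    (height * scale).round.clamp(1, max_height),
  ]
end

puts thumbnail_dimensions(1000, 500, 360, 360).inspect
puts thumbnail_dimensions(100, 50, 360, 360).inspect
```

Because the same function can feed both the resize call and the `<img>` width and height attributes, the two can no longer disagree by a pixel.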
evazion
8f36ebe2b8 Fix #4914: RuntimeError corrupting uploads
Bug: If a media asset got stuck in the 'processing' state during upload,
then it would stay stuck forever and the file couldn't be uploaded again
later.

Fix: Mark stuck assets as failed before raising the "Upload failed"
error. Once the asset is marked as failed, it can be uploaded again
later. Also, only wait for assets to finish processing if they were
uploaded less than 5 minutes ago. If a processing asset is more than 5
minutes old, consider it stuck and mark it as failed immediately.

Assets getting stuck in the processing state is a 'this should never
happen' error. Normally if any kind of exception is raised while
uploading the asset, the asset will be set to the 'failed' state. The
only way an asset can get stuck is if it fails and the exception handler
doesn't run, or the exception handler itself fails. This might happen if
the process is unexpectedly killed, or possibly if the HTTP request
times out and a TimeoutError is raised at an inopportune time. See the
links below for discussion of the issues with Ruby's Timeout.

[1]: https://vaneyckt.io/posts/the_disaster_that_is_rubys_timeout_method/
[2]: https://jvns.ca/blog/2015/11/27/why-rubys-timeout-is-dangerous-and-thread-dot-raise-is-terrifying/
[3]: https://adamhooper.medium.com/in-ruby-dont-use-timeout-77d9d4e5a001
[4]: https://ruby-doc.org/core-3.0.2/Thread.html#method-c-handle_interrupt-label-Guarding+from+Timeout-3A-3AError
2021-11-08 18:22:04 -06:00
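The five-minute rule from 8f36ebe2b8 can be sketched as a small predicate. The `Asset` struct and the `stuck?` and `fail_if_stuck` helpers are hypothetical; only the `processing` and `failed` states and the 5-minute cutoff come from the commit message.

```ruby
# Hypothetical stand-in for a media asset row.
Asset = Struct.new(:status, :created_at)

STUCK_AFTER = 5 * 60 # seconds, per the commit message

# An asset is considered stuck if it is still processing more than
# five minutes after it was uploaded.
def stuck?(asset, now: Time.now)
  asset.status == "processing" && (now - asset.created_at) > STUCK_AFTER
end

# Mark a stuck asset as failed so the file can be uploaded again later.
def fail_if_stuck(asset, now: Time.now)
  asset.status = "failed" if stuck?(asset, now: now)
  asset
end

now = Time.now
puts fail_if_stuck(Asset.new("processing", now - 600), now: now).status
puts fail_if_stuck(Asset.new("processing", now - 60), now: now).status
```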
evazion
4095d14f2a media assets: fix tagged filenames option.
Fix the `enable_seo_post_urls` config option not being respected. This
option controls whether filenames in image URLs contain the tags. This
option requires URL rewrites in Nginx to work so it's disabled by
default.
2021-10-29 07:14:21 -05:00
evazion
082544ab03 StorageManager: remove Post-specific code.
Refactor StorageManager to remove all image URL generation code. Instead
the image URL generation code lives in MediaAsset.

Now StorageManager is only concerned with how to read and write files to
remote storage backends like S3 or SFTP, not with how image URLs should
be generated. This way the file storage code isn't tightly coupled to
posts, so it can be used to store any kind of file, not just images
belonging to posts.
2021-10-27 00:05:30 -05:00
evazion
afe5095ee6 posts: mark media asset as expunged when post is expunged.
Fix it so that when a post is expunged, the media asset is also marked
as expunged. This way the files will be deleted, but the media asset
will still remain as a record of what was expunged. The media asset will
have the md5, width, height, file ext, and file size of the deleted file.
2021-10-26 02:53:32 -05:00
evazion
f5e7d50dbb media assets: don't destroy ugoira data on destroy.
Don't destroy Pixiv Ugoira frame data when the media asset is destroyed.
Destroying it was wrong because when uploads were pruned, it could delete
the frame data of an active post.
2021-10-24 04:35:13 -05:00
evazion
5c7a0f225c media assets: prevent duplicate media assets.
Add a md5 uniqueness constraint on media assets to prevent duplicate
assets from being created. This way we can guarantee that there is one
active media asset per uploaded file.

Also make it so that if two people are uploading the same file at the
same time, the file is processed only once.
2021-10-24 04:35:06 -05:00
evazion
bc506ed1b8 uploads: refactor to simplify ugoira-handling and replacements:
* Make it so replacing a post doesn't generate a dummy upload as a side effect.
* Make it so you can't replace a post with itself (the post should be regenerated instead).
* Refactor uploads and replacements to save the ugoira frame data when
  the MediaAsset is created, not when the post is created. This way it's
  possible to view the ugoira before the post is created.
* Make `download_file!` in the Pixiv source strategy return a MediaFile
  with the ugoira frame data already attached to it, instead of returning it
  in the `data` field then passing it around separately in the `context`
  field of the upload.
2021-10-18 05:18:46 -05:00
evazion
1d034a3223 media assets: move more file-handling logic into MediaAsset.
Move more of the file-handling logic from UploadService and
StorageManager into MediaAsset. This is part of refactoring posts and
uploads to allow multiple images per post.
2021-10-18 00:10:29 -05:00
evazion
0731b07d27 posts: store duration of animations and videos.
Start storing the duration of animations and videos in the `duration`
field on the media_assets table. This had to wait until 3d30bfd69 was
deployed, which had to wait until Postgres was upgraded in order to add
the duration column to the media_assets table without downtime.

Also add a fix script to backfill the duration on existing posts. Usage:

    TAGS=animated ./script/fixes/079_fix_duration.rb
2021-10-07 03:21:08 -05:00
evazion
c99d0523bb /media_assets: add basic index and show pages.
* Add a basic index page at https://danbooru.donmai.us/media_assets.
* Add a basic show page at https://danbooru.donmai.us/media_assets/1.
* Add ability to search /media_assets.json by metadata. Example:
** https://danbooru.donmai.us/media_assets.json?search[metadata][File:ColorComponents]=3
* Add a "»" link next to the filesize on posts linking to the metadata page.

Known issues:

* Sometimes the MD5 links on the /media_assets page return "That record
  was not found" errors. These are unfinished uploads that haven't been
  made into posts yet.
* No good way to search for custom metadata fields in the search form.
* Design is ugly.
2021-09-29 07:46:11 -05:00
evazion
79fdfa86ae Fix various rubocop warnings. 2021-09-27 00:46:13 -05:00
evazion
01cdc7da7f media assets: add status column. 2021-09-26 08:06:13 -05:00
evazion
ab3f35580f metadata: move metadata parsing into ExifTool::Metadata.
Move the metadata parsing code from MediaAsset to ExifTool::Metadata so
we can use it outside the context of a MediaAsset, in particular when
dealing with a MediaFile that hasn't been saved to disk yet.
2021-09-26 07:19:36 -05:00
evazion
74b03a7bd0 posts: fix incorrect exif rotation for PNGs.
Fix a bug where PNG images could be incorrectly detected as
exif-rotated. This would happen when a PNG contained the
IFD0:Orientation flag. It's technically possible for a PNG to contain
this flag, but it's ignored by libvips and by browsers.

post #3762340 (nsfw) is an example of a PNG like this.

The fix is to use `autorot` to let libvips apply the rotation instead of
trying to interpret the exif data ourselves. Note that libvips-8.9 has a
bug where it doesn't strip the orientation flag after applying
`autorot`, which leads to the image being incorrectly rotated a second
time when generating the thumbnail. Use libvips-8.11 instead.
2021-09-23 00:10:00 -05:00
evazion
6740ef17ab posts: fix detection of exif_rotation tag.
`IFD0:Orientation` is the orientation of the main image.
`IFD1:Orientation` is the orientation of the embedded thumbnail, if it
has one. Using `IFD1:Orientation` was incorrect here because some images
have a non-rotated main image but a rotated thumbnail. Post #1023563 is
an example.
2021-09-22 11:17:28 -05:00
evazion
c69ba54b5a Fix #4442: Autotag image metadata.
Autotag `greyscale`, `non-repeating_animation`, and `exif_rotation`.

Note that this does not detect all (or even most) greyscale images.
Artists often save greyscale images as RGB instead of as greyscale.
2021-09-21 11:18:06 -05:00
evazion
d5981754c4 posts: automatically tag animated_gif & animated_png on tag edit.
Automatically tag animated_gif and animated_png when a post is edited.
Add them back if the user tries to remove them from an animated post,
or remove them if the user tries to add them to a non-animated post.

Previously, we added these tags at upload time, but it was possible for users
to remove them after upload, or to incorrectly add them to non-animated
posts. They were added at upload time because we couldn't afford to open
the file and parse the metadata on every tag edit. Now that we save the
metadata in the database, we can do this.

This also makes it so you can't tag ugoira on non-ugoira files.

Known bug: it's possible to have an animated GIF where every frame is
identical. Post #3770975 is an example. This will be detected as an
animated GIF even though visually it doesn't appear to be animated.

Fixes #4041: Animated_gif tag not added to preprocessed uploads
2021-09-21 08:26:02 -05:00
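The tag-edit rule from d5981754c4 can be sketched as a pure function over the tag list. The helper name and its keyword arguments are hypothetical; in Danbooru the animation flag would come from the stored metadata, and the ugoira handling mentioned above is omitted here.

```ruby
ANIMATION_TAGS = %w[animated_gif animated_png].freeze

# Drop whatever animation tags the user supplied, then re-add the one
# that matches the file. This covers both cases from the commit message:
# tags removed from an animated post come back, and tags added to a
# non-animated post are stripped.
def apply_animation_tags(tags, file_ext:, animated:)
  result = tags - ANIMATION_TAGS
  result << "animated_gif" if animated && file_ext == "gif"
  result << "animated_png" if animated && file_ext == "png"
  result
end

puts apply_animation_tags(%w[1girl], file_ext: "gif", animated: true).inspect
puts apply_animation_tags(%w[1girl animated_gif], file_ext: "jpg", animated: false).inspect
```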
evazion
ea6e47125e metadata: add ability to search exif metadata.
Usage:

* https://danbooru.donmai.us/media_metadata?search[has_metadata]=true
* https://danbooru.donmai.us/media_metadata?search[has_metadata]=false
* https://danbooru.donmai.us/media_metadata?search[metadata_has_key]=GIF:GIFVersion
* https://danbooru.donmai.us/media_metadata?search[metadata][GIF:GIFVersion]=89a
* https://danbooru.donmai.us/media_metadata?search[metadata][GIF:GIFVersion]&search[metadata][GIF:BackgroundColor]=0
2021-09-16 00:25:21 -05:00
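The search parameters listed in ea6e47125e behave like key and value filters over a metadata hash. A minimal in-memory sketch, assuming hypothetical records and a hypothetical `search_metadata` helper (the real search runs against the JSON column in the database):

```ruby
# Filter records by metadata. `has_key` mirrors search[metadata_has_key],
# and `equals` mirrors search[metadata][KEY]=VALUE; every pair must match.
def search_metadata(records, has_key: nil, equals: {})
  records.select do |record|
    metadata = record[:metadata]
    (has_key.nil? || metadata.key?(has_key)) &&
      equals.all? { |key, value| metadata[key] == value }
  end
end

records = [
  { id: 1, metadata: { "GIF:GIFVersion" => "89a", "GIF:BackgroundColor" => "0" } },
  { id: 2, metadata: { "GIF:GIFVersion" => "87a" } },
  { id: 3, metadata: {} },
]

puts search_metadata(records, has_key: "GIF:GIFVersion").map { |r| r[:id] }.inspect
puts search_metadata(records, equals: { "GIF:GIFVersion" => "89a" }).map { |r| r[:id] }.inspect
```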
evazion
3d660953d4 Add MediaMetadata model.
Add a model for storing image and video metadata for uploaded files.

Metadata is extracted using ExifTool. You will need to install ExifTool
after this commit. ExifTool 12.22 is the minimum required version
because we use the `--binary` option, which was added in this release.

The MediaMetadata model is separate from the MediaAsset model because
some files contain tons of metadata, and most of it is non-essential.
The MediaAsset model represents an uploaded file and contains essential
metadata, like the file's size and type, while the MediaMetadata model
represents all the other non-essential metadata associated with a file.

Metadata is stored as a JSON column in the database.

ExifTool returns all the file's metadata, not just the EXIF metadata.
EXIF is one of several types of image metadata, hence why we call
it MediaMetadata instead of EXIFMetadata.
2021-09-08 05:00:54 -05:00
evazion
b068c113a8 Add MediaAsset model.
A MediaAsset represents an image or video file uploaded to Danbooru. It
stores the metadata associated with the image or video. This is to work
on decoupling files from posts so that images can be uploaded separately
from posts.
2021-09-02 06:07:52 -05:00