Commit Graph

568 Commits

Author SHA1 Message Date
evazion
644dfaf74c tests: fix broken tests. 2022-03-15 04:45:30 -05:00
evazion
cf8b8207e2 artists: change how artist urls are normalized.
Change how artist URLs are normalized in artist entries. Don't try to secretly
convert image URLs to profile URLs in artist entries. For example, if someone puts a
Pixiv image URL in an artist entry, don't secretly try to fetch the source and
convert it into a profile URL in the `normalized_url` field.

We did this because years ago, it was standard practice to put image URLs in artist
entries. Pixiv image URLs used to contain the artist's username, so we used to put
image URLs in artist entries for artist finding purposes. But Pixiv changed it so
that image URLs no longer contained the username, so we dealt with it by adding a
`normalized_url` column to artist_urls and secretly converting image URLs to profile
URLs in this field. But this is no longer necessary because now we don't normally put
image URLs in artist entries in the first place.

Now the `profile_url` method in `Source::URL` is used to normalize URLs in artist
entries. This lets us parse various profile URL formats and normalize them into a
single canonical form.

This also removes the `normalize_for_artist_finder` method from source strategies.
Instead the `profile_url` method is used for artist finding purposes. So the profile
URL returned by the source strategy needs to be the same as the URL in the artist
entry in order for artist finding to work.
2022-03-13 03:54:17 -05:00
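The normalization described in cf8b8207e2 can be sketched roughly as below. This is a minimal illustration, not the actual `Source::URL#profile_url` code: the helper name `pixiv_profile_url`, the set of URL formats handled, and the choice of canonical form are all assumptions made for the example.

```ruby
require "uri"

# Hypothetical helper (not Danbooru's implementation): map a few Pixiv
# profile URL spellings onto one canonical form, and return nil for
# anything that is not recognized as a profile URL (e.g. an image URL).
def pixiv_profile_url(url)
  uri = URI.parse(url.to_s)
  return nil unless uri.is_a?(URI::HTTP)

  host = uri.host.to_s.downcase
  return nil unless host == "pixiv.net" || host.end_with?(".pixiv.net")

  user_id =
    case uri.path
    when %r{\A/(?:en/)?users/(\d+)}
      $1
    when "/member.php"
      URI.decode_www_form(uri.query.to_s).to_h["id"]
    end

  user_id ? "https://www.pixiv.net/users/#{user_id}" : nil
rescue URI::InvalidURIError
  nil
end
```

Because both spellings of a profile URL collapse to the same string, a plain equality check between the source's profile URL and the artist entry's URL is enough for artist finding in this sketch.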
evazion
9343f7c912 Source::URL: add profile_url method.
Add a method for converting a source URL into a profile URL. This will
be used for normalizing profile URLs in artist entries.

Also add the ability to parse a few more profile URL formats.
2022-03-13 03:54:17 -05:00
evazion
787b5c8e27 sources: merge Sta.sh strategy into DeviantArt strategy.
This turns out to be a little simpler than keeping them separate. The
only thing special we have to do for Sta.sh is use the Sta.sh page when
we have a DeviantArt image with a Sta.sh referer.
2022-03-12 00:57:43 -06:00
evazion
f2028c14fb Fix #5045: Exception on uploads when SauceNAO is the referrer URL.
Bug: We assumed the referer URL was from the same site as the target
URL. We tried to call methods on the referer only supported by the
target URL.

Fix: Ignore the referer URL when it's from a different site than the
target URL.
2022-03-12 00:04:39 -06:00
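A rough sketch of the fix in f2028c14fb, under stated assumptions: the `SITE_DOMAINS` table and the `site_name` and `effective_referer` helpers below are hypothetical and exist only to illustrate the idea of dropping a referer that belongs to a different site than the target URL.

```ruby
require "uri"

# Hypothetical domain table for illustration only; the real list of sites
# and domains lives in the source strategies.
SITE_DOMAINS = {
  "Pixiv"   => %w[pixiv.net pximg.net],
  "Twitter" => %w[twitter.com twimg.com]
}.freeze

# Return the site a URL belongs to, or nil if it is unknown or unparseable.
def site_name(url)
  host = URI.parse(url.to_s).host.to_s.downcase
  SITE_DOMAINS.each do |site, domains|
    return site if domains.any? { |d| host == d || host.end_with?(".#{d}") }
  end
  nil
rescue URI::InvalidURIError
  nil
end

# Keep the referer only when it is from the same site as the target URL.
def effective_referer(target_url, referer_url)
  target_site = site_name(target_url)
  return nil if target_site.nil?

  site_name(referer_url) == target_site ? referer_url : nil
end
```

With this shape, a SauceNAO referer for a Pixiv image is simply ignored, so no site-specific methods are ever called on it.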
evazion
28971fe103 sources: factor out site_name method. 2022-03-11 23:20:53 -06:00
evazion
b4aea72d04 sources: remove preview_urls method from base strategy.
Remove the `preview_urls` method from strategies. The only place this was used was
when doing IQDB searches, to download the thumbnail image from the source instead of
the full image.

This wasn't worth it for a few reasons:

* Thumbnails on other sites are sometimes not the size we want, which could affect
  IQDB results.
* Grabbing thumbnails is complex for some sites. You can't always just rewrite the
  image URL. Sometimes it requires extra API calls, which can be slower than just
  grabbing the full image.
* For videos and animations, thumbnails from other sites don't always match our
  thumbnails. We do smart thumbnail generation to try to avoid blank thumbnails, which
  means we don't always pick the first frame, which could affect IQDB results.

API changes:

* /iqdb_queries?search[file_url] now downloads the URL as is without any modification.
  Before it tried to change thumbnail and sample size image URLs to the full version.

* /iqdb_queries?search[url] now returns an error if the URL is for an HTML page that
  contains multiple images. Before it would grab only the first image and silently
  ignore the rest.
2022-03-11 03:22:23 -06:00
evazion
2f61486ac6 sources: remove image_url method from base strategy.
Remove the `image_url` method from source strategies. This method would
return only the first image if a source had multiple images. The
`image_urls` method should be used instead. Tests were the main place
that still used `image_url` instead of `image_urls`.

Also make post replacements return an error if replacing with a source
that contains multiple images, instead of just blindly replacing the
post with the first image in the source.
2022-03-11 01:59:21 -06:00
evazion
4701027f45 sources: remove unused methods from base strategy.
Remove unused `urls`, `parsed_urls`, and `domains` methods.
2022-03-10 23:11:00 -06:00
nonamethanks
a6549bc6fe Add Fantia support
Also fixes a regression in 74fdeef10c that stopped Mastodon URLs from
being given the right priority.
2022-03-10 17:43:32 +01:00
evazion
43a665a66d sources: factor out Source::URL::NicoSeiga. 2022-03-10 04:53:51 -06:00
evazion
34854185be sources: factor out Source::URL::DeviantArt and Source::URL::Stash. 2022-03-10 00:29:49 -06:00
evazion
8a50148823 pixiv: fixup bug with fetching image_urls for bad_id posts.
Fix `image_urls` returning `[nil]` when fetching data for an image URL
that was bad_id. In that case `original_urls` is empty, so we fall back
to using the deleted image URL as-is.
2022-03-09 01:14:09 -06:00
evazion
77c88fd867 Merge pull request #5038 from nonamethanks/remove-redundant-comments
sources: remove redundant comments
2022-03-08 23:28:29 -06:00
evazion
6afb2f8e3c Merge pull request #5037 from nonamethanks/tumblr-refactor
sources: factor out Source::URL::Tumblr
2022-03-08 23:26:30 -06:00
evazion
cf4b9a6114 Merge pull request #5039 from nonamethanks/simplify-lofter-tag-parsing
Lofter: simplify tag extraction logic
2022-03-08 23:21:57 -06:00
evazion
987f2985d3 Merge pull request #5040 from nonamethanks/fix-weibo-404
Weibo: fix exception for deleted url
2022-03-08 23:08:37 -06:00
evazion
52a2d3418c pixiv: fixup bugs in 1c620f805.
* Fix error when uploading non-ugoira files.
* Fix sample image URLs not being rewritten to full images correctly. We
  have to get the full image URL from the API because given an
  /img-master/ URL, we don't know what the original file extension is.
2022-03-08 23:07:24 -06:00
nonamethanks
c9be77d1f8 Weibo: fix exception for deleted url 2022-03-09 05:31:38 +01:00
evazion
1c620f8055 sources: factor out Source::URL::Pixiv.
* Drop support for preview_urls. This means that IQDB lookups may be
  slower, especially for ugoiras, since we have to download the full
  ugoira now. However, ugoira lookups should produce better results,
  since the ugoira thumbnail chosen by Pixiv wasn't necessarily the same
  as the thumbnail chosen by Danbooru.

* Drop support for uploading single manga pages:

    http://www.pixiv.net/member_illust.php?mode=manga_big&illust_id=18557054&page=2

  Previously uploading a URL like this would only upload a single image
  out of a multi-image work. Now it will upload all images in the work.
  Pixiv no longer supports URLs like this, so we don't either.

* Add support for parsing URLs like this:

    https://i.pximg.net/c/360x360_70/custom-thumb/img/2022/03/08/00/00/56/96755248_p0_custom1200.jpg

  Apparently artists can choose a custom thumbnail now (not like anyone
  will try to upload one though).
2022-03-08 22:17:38 -06:00
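As an illustration of the custom-thumbnail URL format mentioned in 1c620f8055, the work ID and page number can be read from the filename. This is a sketch based only on the example URL above; the helper name `parse_pixiv_image_filename` and the returned hash keys are assumptions, not the real `Source::URL::Pixiv` API.

```ruby
require "uri"

# Hypothetical helper: extract the work ID and page index from a Pixiv image
# filename such as "96755248_p0_custom1200.jpg". Returns nil when the
# filename does not follow the <id>_p<page>[_suffix].<ext> pattern.
def parse_pixiv_image_filename(url)
  filename = File.basename(URI.parse(url.to_s).path.to_s)
  match = filename.match(/\A(?<id>\d+)_p(?<page>\d+)(?:_\w+)?\.\w+\z/)
  return nil unless match

  { work_id: match[:id].to_i, page: match[:page].to_i }
rescue URI::InvalidURIError
  nil
end
```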
evazion
df0bb70486 sources: factor out Source::URL::PixivSketch.
Add upload support for Pixiv Sketch. Fetch tags, commentary, and artist,
and rewrite sample images to full images.

Authentication isn't required. R18 images are hidden in the browser but
visible in the API.
2022-03-08 18:24:12 -06:00
nonamethanks
ff6bfff311 Lofter: simplify tag extraction logic
Now that we have a separate parsing class we can just use it to properly
parse tag urls as well.
2022-03-08 17:01:50 +01:00
nonamethanks
ebd3670076 sources: remove redundant comments
These comments are already present under the parse blocks, so the huge
walls of text before the code are not needed anymore.
2022-03-08 16:56:00 +01:00
nonamethanks
b9c7e467e5 sources: factor out Source::URL::Tumblr
Also adds support for fetching source data from direct image urls when
possible.
2022-03-08 15:06:06 +01:00
nonamethanks
d8e2f2ee33 sources: factor out Source::URL::Weibo
Additionally, fixed some broken tests and changed normalization for urls
of album type to point to the mobile version instead, because they're
only visible to logged-in users.
2022-03-07 16:52:43 +01:00
evazion
1609059bf4 sources: factor out Source::URL::Fanbox.
Also fix it so that we grab the full image for cover URLs like this:

* Sample: https://pixiv.pximg.net/c/1620x580_90_a2_g5/fanbox/public/images/creator/1566167/cover/QqxYtuWdy4XWQx1ZLIqr4wvA.jpeg
* Full: https://pixiv.pximg.net/fanbox/public/images/creator/1566167/cover/QqxYtuWdy4XWQx1ZLIqr4wvA.jpeg
2022-02-28 06:25:06 -06:00
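The sample-to-full rewrite for Fanbox cover URLs in 1609059bf4 can be sketched as follows. The helper name `fanbox_full_image_url` is hypothetical, and the pattern is inferred only from the two example URLs above, so other Fanbox URL shapes may need different handling.

```ruby
# Hypothetical helper: drop the "/c/<size spec>" segment that marks a
# resized sample, leaving the full-size URL. URLs without that segment
# are returned unchanged.
def fanbox_full_image_url(url)
  url.sub(%r{\A(https://pixiv\.pximg\.net)/c/[^/]+(/fanbox/)}, '\1\2')
end
```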
evazion
317ec886bc sources: factor out Source::URL::Nijie.
Also fixes the uploader uploading all images when trying to upload only a
single image in a multi-image work. Caused by `image_urls` incorrectly
returning all images when the source strategy was given a url for a
single image.
2022-02-27 02:27:35 -06:00
evazion
fcf517834d sources: factor out Source::URL::ArtStation. 2022-02-26 21:03:49 -06:00
evazion
9169f00e80 sources: factor out Source::URL::Moebooru. 2022-02-26 17:46:44 -06:00
evazion
74fdeef10c sources: factor out Source::URL::Mastodon. 2022-02-26 15:08:27 -06:00
evazion
86d8e2d13d sources: factor out Source::URL::Lofter. 2022-02-25 23:43:10 -06:00
evazion
f062f2d145 sources: factor out Source::URL::Newgrounds.
Also fix it so that the image URL is set as the source for Newgrounds
posts, not the page URL. It's possible to generate the page URL from the
image URL (except for images after the first in multi-image posts).

* Page: https://www.newgrounds.com/art/view/natthelich/weaver
* Image: https://art.ngfiles.com/images/1520000/1520217_natthelich_weaver.jpg?f1606365031
2022-02-25 23:04:03 -06:00
evazion
64472a7b7e sources: factor out Source::URL::HentaiFoundry.
Add support for these URL types:

* http://pictures.hentai-foundry.com//s/soranamae/363663.jpg
* http://www.hentai-foundry.com/piccies/d/dmitrys/1183.jpg
* http://www.hentai-foundry.com/pic-149160.php
* http://www.hentai-foundry.com/user-RockCandy.php
* http://www.hentai-foundry.com/profile-sawao.php

These URL types are obsolete, but still present in some old posts.
2022-02-25 22:01:17 -06:00
evazion
e6ded89f85 sources: factor out Source::URL::Plurk.
Also fix it so that for adult works, we get the images posted by the
artist in the replies. Example: https://www.plurk.com/p/omc64y (nsfw).
2022-02-25 02:06:57 -06:00
evazion
26f4cf1ebd sources: factor out Source::URL::Skeb. 2022-02-25 02:06:57 -06:00
evazion
ffe52f5ead sources: factor out Source::URL::Foundation.
Add support for a couple more URL types:

* https://foundation.app/@asuka111art/dinner-with-cats-82426
* https://f8n-production-collection-assets.imgix.net/0x3B3ee1931Dc30C1957379FAc9aba94D1C48a5405/128711/QmcBfbeCMSxqYB3L1owPAxFencFx3jLzCPFx6xUBxgSCkH/nft.png

Also include these URLs in the list of profile URLs:

* https://foundation.app/0x7E2ef75C0C09b2fc6BCd1C68B6D409720CcD58d2 (for https://foundation.app/@mochiiimo)

These URLs should be stable even if the user changes their name.
2022-02-23 23:49:31 -06:00
evazion
043c08eb05 sources: factor out Source::URL::TwitPic. 2022-02-23 23:49:31 -06:00
evazion
7ed8f95a8e sources: add Source::URL class; factor out Source::URL::Twitter.
Introduce a Source::URL class for parsing URLs from source sites. Refactor the Twitter
source strategy to use it.

This is the first step towards factoring all the URL parsing logic out of source
strategies and moving it to subclasses of Source::URL. Each site will have a subclass
of Source::URL dedicated to parsing URLs from that site. Source strategies will use
these classes to extract information from URLs.

This is to simplify source strategies. Most sites have many different URL formats we have
to parse or rewrite, and handling all these different cases tends to make source
strategies very complex. Isolating the URL parsing logic from the site scraping logic
should make source strategies easier to maintain.
2022-02-23 23:46:04 -06:00
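The design described in 7ed8f95a8e, one URL-parsing subclass per site with strategies asking it for information, might look roughly like this. Everything here is a hypothetical sketch: the `SketchSource` module, the `HANDLERS` list, and the `match?`, `status_id`, and `username` methods are illustrative names, not the real `Source::URL` interface.

```ruby
require "uri"

module SketchSource
  # Base class: picks the site-specific subclass that recognizes a URL.
  class URL
    attr_reader :uri

    def initialize(uri)
      @uri = uri
    end

    # Returns an instance of the matching subclass, or nil if the URL is
    # invalid, not http/https, or not from a known site.
    def self.parse(string)
      uri = URI.parse(string.to_s)
      return nil unless uri.is_a?(URI::HTTP) && uri.host

      handler = HANDLERS.find { |klass| klass.match?(uri) }
      handler&.new(uri)
    rescue URI::InvalidURIError
      nil
    end
  end

  # Site-specific parsing lives here instead of in the scraping code.
  class Twitter < URL
    def self.match?(uri)
      %w[twitter.com www.twitter.com mobile.twitter.com].include?(uri.host.downcase)
    end

    # Status ID from URLs shaped like https://twitter.com/<user>/status/<id>
    def status_id
      uri.path[%r{\A/\w+/status/(\d+)}, 1]
    end

    def username
      uri.path[%r{\A/(\w+)}, 1]
    end
  end

  HANDLERS = [Twitter].freeze
end
```

A source strategy built on this would call `SketchSource::URL.parse(url)` once and then read `status_id` or `username` from the result, keeping regex handling out of the scraping logic.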
evazion
112b323f01 foundation: fix exception when uploading new Foundation url format.
Fix 'null value in column "source_url"' exception when uploading urls like this:

* https://foundation.app/@KILLERGF/kgfgen/4
* https://foundation.app/@mochiiimo/foundation/97376
2022-02-22 13:29:28 -06:00
evazion
7b009cc893 nicoseiga: fix inability to login to nicoseiga.
NicoSeiga changed it so that on every login, you must enter a 2FA code
sent by email. This broke the NicoSeiga strategy. The fix is to just use
a static session cookie instead (and hope it doesn't expire, and isn't
tied to an IP).

The `nico_seiga_login` and `nico_seiga_password` config settings have
been removed from config/danbooru_default_config.rb and replaced by
`nico_seiga_user_session`. If you run your own Danbooru instance, you
will have to update your config file manually.
2022-02-22 12:23:01 -06:00
evazion
7d49ab6130 Add Danbooru::URL class.
Introduce a Danbooru::URL class for dealing with URLs. This is a wrapper
around Addressable::URI that adds some additional helper methods. Most
significantly, the `parse` method only allows valid http/https URLs, and
it returns nil instead of raising an exception when the URL is invalid.
2022-02-22 00:17:53 -06:00
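The nil-returning `parse` behaviour described in 7d49ab6130 can be sketched as below. This is not the real `Danbooru::URL`: the class name `HttpURL` is hypothetical, and Ruby's standard `URI` is used as a stand-in for `Addressable::URI` so the example has no gem dependencies.

```ruby
require "uri"

# Hypothetical stand-in for the wrapper described above.
class HttpURL
  attr_reader :uri

  # Returns an HttpURL for valid http/https URLs and nil for everything
  # else, instead of raising an exception.
  def self.parse(string)
    uri = URI.parse(string.to_s.strip)
    return nil unless uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?

    new(uri)
  rescue URI::InvalidURIError
    nil
  end

  def initialize(uri)
    @uri = uri
  end

  def to_s
    uri.to_s
  end
end
```

Returning nil lets callers write `HttpURL.parse(input)&.to_s` without wrapping every call in a rescue block.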
evazion
68ba447494 uploads: remove batch upload page.
* Make /uploads/batch redirect to /uploads/new.
* Remove /uploads/image_proxy.
2022-02-21 00:03:43 -06:00
evazion
9a5a04d74e nijie: fix uploads not working for new image URL format.
Fix uploads not working for image URLs like this:

    https://pic.nijie.net/07/nijie/17/95/728995/illust/0_0_403fdd541191110c_c25585.jpg
2022-02-15 20:45:28 -06:00
evazion
7cfbd891ae pixiv: avoid unnecessary API call when uploading Pixiv posts.
Do one less API call when fetching the image URLs for a Pixiv post. The
`is_ugoira?` check in `image_urls` caused us to do an extra API call
when fetching the image URLs for a non-ugoira post.

API calls to Pixiv take around 800ms, so this reduces minimum upload
time for Pixiv posts from ~1.6 seconds (two calls) to ~0.8 seconds.
2022-02-15 18:55:12 -06:00
evazion
e4d7453180 uploads: improve error messages.
Improve upload error messages when downloading a URL fails, or it isn't
an image or video file.
2022-02-15 18:54:55 -06:00
evazion
b6538fde38 uploads: fix NicoSeiga sources not working.
Fix uploads for NicoSeiga sources not working because the strategy
returned URLs like the one below in the list of image_urls, which
require a login to download:

    https://seiga.nicovideo.jp/image/source/10315315

Also fix certain URLs like https://dic.nicovideo.jp/oekaki/52833.png not
working, because they didn't contain an image ID and the image_urls
method returned an empty list in this case.
2022-02-15 17:12:02 -06:00
evazion
37075988ce uploads: fix page_url for null strategy.
Fix the null source strategy setting the page URL. The page URL is
expected to be nil when we can't determine the page containing the image URL.

Fixes the upload_media_assets.page_url field being filled for uploads
from unknown sites.
2022-02-15 00:59:22 -06:00
evazion
27d71f2727 uploads: raise download timeout.
Raise the timeout for downloading files from the source to 60 seconds globally.

Previously we had a lower timeout because uploads were processed in the
foreground when not using the bookmarklet, and we didn't want to tie up
Puma worker processes with slow downloads. Now that all uploads are
processed in the background, we can have a higher timeout.
2022-02-15 00:56:51 -06:00
evazion
26da728a07 deviant art: fix new image URLs not being recognized.
Partial fix for #5008. DeviantArt now returns https://wixmp-ed30a86b8c4ca887773594c2.wixmp.com
URLs instead of https://images-wixmp-ed30a86b8c4ca887773594c2.wixmp.com for images in the
API. Fix these URLs not being recognized by the DeviantArt strategy.
2022-02-14 00:33:50 -06:00
evazion
d6f7725a1e nijie: fix exception in login process.
Fix an exception when we can't find the 'url' field in the login form
because we're rate limited by Nijie and couldn't scrape the login page.
2022-02-12 17:26:25 -06:00