Add methods to Source::URL for determining whether a URL is an image
URL, a page URL, or a profile URL.
Also add more source URL tests and fix various URL parsing bugs.
Refactor source strategies to remove the `canonical_url` method.
`canonical_url` returned the URL that should be used as the source of
the post after upload. Now we simply use `Source::URL#page_url` to
determine the source after upload. If the source is an image URL that is
convertible to a page URL, then the image URL is used as the source. If
the source is an image URL that is not convertible to a page URL, then
the page URL is used as the source.
This simplifies source strategies so that all they have to care about is
implementing the `Source::URL#page_url` and `Sources::Strategies#page_url`
methods, and the preferred source will be chosen for posts automatically.
Add partial support for fetching videos from ArtStation posts that
contain videos. Most of this code is disabled for now because actually
downloading these videos requires bypassing a Cloudflare captcha.
Remove the `preview_urls` method from strategies. The only place this was used was
when doing IQDB searches, to download the thumbnail image from the source instead of
the full image.
This wasn't worth it for a few reasons:
* Thumbnails on other sites are sometimes not the size we want, which could affect
IQDB results.
* Grabbing thumbnails is complex for some sites. You can't always just rewrite the
image URL. Sometimes it requires extra API calls, which can be slower than just
grabbing the full image.
* For videos and animations, thumbnails from other sites don't always match our
thumbnails. We do smart thumbnail generation to try to avoid blank thumbnails, which
means we don't always pick the first frame, which could affect IQDB results.
API changes:
* /iqdb_queries?search[file_url] now downloads the URL as is without any modification.
Before it tried to change thumbnail and sample size image URLs to the full version.
* /iqdb_queries?search[url] now returns an error if the URL is for a HTML page that
contains multiple images. Before it would grab only the first image and silently
ignore the rest.
Remove the `image_url` method from source strategies. This method would
return only the first image if a source had multiple images. The
`image_urls` method should be used instead. Tests were the main place
that still used `image_url` instead of `image_urls`.
Also make post replacements return an error if replacing with a source
that contains multiple images, instead of just blindly replacing the
post with the first image in the source.
* Move the source normalization logic out of the post model
and into individual sources' strategies.
* Rewrite normalization tests to be handled into each source's test,
and expand them significantly. Previously we were only testing
a very small subset of domains and variants.
* Fix up normalization for several sites.
* Normalize fav.me urls into normal deviantart urls.
* Normalize spaces to underscores when saving other names. Preserve case
since case can be significant.
* Fix WikiPage#other_names_include to search case-insensitively (note:
this prevents using the index).
* Fix sources to return the raw tags in `#tags` and the normalized tags
in `#normalized_tags`. The normalized tags are the tags that will be
matched against other names.
* Don't fail on urls that don't contain the project id (direct image urls).
* Don't fail when the work is deleted.
* Parse artist name from url when possible. This way the artist finder works on bad_artstation_id posts.
* Set canonical source url to `https://artist.artstation.com/projects/12345` instead of
`https://www.artstation.com/artwork/1235` (this way we preserve the artist name).
* Cache api call.
* Include api call results in /source.json.