Commit Graph

87 Commits

Author SHA1 Message Date
evazion
03d2098d6d artists: fix artist finder returning wrong results when given nil url.
Fix the artist finder returning incorrect results when given a nil URL.
This only happened when an artist with a URL like this existed:

    http:///blog.naver.com/dan_rak

Note the triple `///`; the extra `/` messed up the artist finder.

The artist finder may be given a nil URL when a source strategy returns
a nil profile URL, usually because the source is bad_id.
2022-03-18 06:01:36 -05:00
evazion
cf8b8207e2 artists: change how artist urls are normalized.
Change how artist URLs are normalized in artist entries. Don't try to secretly
convert image URLs to profile URLs in artist entries. For example, if someone puts a
Pixiv image URL in an artist entry, don't secretly try to fetch the source and
convert it into a profile URL in the `normalized_url` field.

We did this because years ago, it was standard practice to put image URLs in artist
entries. Pixiv image URLs used to contain the artist's username, so we used to put
image URLs in artist entries for artist finding purposes. But Pixiv changed it so
that image URLs no longer contained the username, so we dealt with it by adding a
`normalized_url` column to artist_urls and secretly converting image URLs to profile
URLs in this field. But this is no longer necessary because now we don't normally put
image URLs in artist entries in the first place.

Now the `profile_url` method in `Source::URL` is used to normalize URLs in artist
entries. This lets us parse various profile URL formats and normalize them into a
single canonical form.

This also removes the `normalize_for_artist_finder` method from source strategies.
Instead the `profile_url` method is used for artist finding purposes. So the profile
URL returned by the source strategy needs to be the same as the URL in the artist
entry in order for artist finding to work.
2022-03-13 03:54:17 -05:00
evazion
f2028c14fb Fix #5045: Exception on uploads when SauceNAO is the referrer URL.
Bug: We assumed the referer URL was from the same site as the target
URL. We tried to call methods on the referer only supported by the
target URL.

Fix: Ignore the referer URL when it's from a different site than the
target URL.
2022-03-12 00:04:39 -06:00
evazion
28971fe103 sources: factor out site_name method. 2022-03-11 23:20:53 -06:00
evazion
b4aea72d04 sources: remove preview_urls method from base strategy.
Remove the `preview_urls` method from strategies. The only place this was used was
when doing IQDB searches, to download the thumbnail image from the source instead of
the full image.

This wasn't worth it for a few reasons:

* Thumbnails on other sites are sometimes not the size we want, which could affect
  IQDB results.
* Grabbing thumbnails is complex for some sites. You can't always just rewrite the
  image URL. Sometimes it requires extra API calls, which can be slower than just
  grabbing the full image.
* For videos and animations, thumbnails from other sites don't always match our
  thumbnails. We do smart thumbnail generation to try to avoid blank thumbnails, which
  means we don't always pick the first frame, which could affect IQDB results.

API changes:

* /iqdb_queries?search[file_url] now downloads the URL as is without any modification.
  Before it tried to change thumbnail and sample size image URLs to the full version.

* /iqdb_queries?search[url] now returns an error if the URL is for a HTML page that
  contains multiple images. Before it would grab only the first image and silently
  ignore the rest.
2022-03-11 03:22:23 -06:00
evazion
2f61486ac6 sources: remove image_url method from base strategy.
Remove the `image_url` method from source strategies. This method would
return only the first image if a source had multiple images. The
`image_urls` method should be used instead. Tests were the main place
that still used `image_url` instead of `image_urls`.

Also make post replacements return an error if replacing with a source
that contains multiple images, instead of just blindly replacing the
post with the first image in the source.
2022-03-11 01:59:21 -06:00
evazion
4701027f45 sources: remove unused methods from base strategy.
Remove unused `urls`, `parsed_urls`, and `domains` methods.
2022-03-10 23:11:00 -06:00
evazion
7ed8f95a8e sources: add Source::URL class; factor out Source::URL::Twitter.
Introduce a Source::URL class for parsing URLs from source sites. Refactor the Twitter
source strategy to use it.

This is the first step towards factoring all the URL parsing logic out of source
strategies and moving it to subclasses of Source::URL. Each site will have a subclass
of Source::URL dedicated to parsing URLs from that site. Source strategies will use
these classes to extract information from URLs.

This is to simplify source strategies. Most sites have many different URL formats we have
to parse or rewrite, and handling all these different cases tends to make source
strategies very complex. Isolating the URL parsing logic from the site scraping logic
should make source strategies easier to maintain.
2022-02-23 23:46:04 -06:00
evazion
7d49ab6130 Add Danbooru::URL class.
Introduce a Danbooru::URL class for dealing with URLs. This is a wrapper
around Addressable::URI that adds some additional helper methods. Most
significantly, the `parse` method only allows valid http/https URLs, and
it returns nil instead of raising an exception when the URL is invalid.
2022-02-22 00:17:53 -06:00
evazion
68ba447494 uploads: remove batch upload page.
* Make /uploads/batch redirect to /uploads/new.
* Remove /uploads/image_proxy.
2022-02-21 00:03:43 -06:00
evazion
e4d7453180 uploads: improve error messages.
Improve upload error messages when downloading an URL fails, or it isn't
an image or video file.
2022-02-15 18:54:55 -06:00
evazion
27d71f2727 uploads: raise download timeout.
Raise the timeout for downloading files from the source to 60 seconds globally.

Previously had a lower timeout because uploads were processed in the
foreground when not using the bookmarklet, and we didn't want to tie up
Puma worker processes with slow downloads. Now that all uploads are
processed in the background, we can have a higher timeout.
2022-02-15 00:56:51 -06:00
evazion
abdab7a0a8 uploads: rework upload process.
Rework the upload process so that files are saved to Danbooru first
before the user starts tagging the upload.

The main user-visible change is that you have to select the file first
before you can start tagging it. Saving the file first lets us fix a
number of problems:

* We can check for dupes before the user tags the upload.
* We can perform dupe checks and show preview images for users not using the bookmarklet.
* We can show preview images without having to proxy images through Danbooru.
* We can show previews of videos and ugoira files.
* We can reliably show the filesize and resolution of the image.
* We can let the user save files to upload later.
* We can get rid of a lot of spaghetti code related to preprocessing
  uploads. This was the cause of most weird "md5 confirmation doesn't
  match md5" errors.

(Not all of these are implemented yet.)

Internally, uploading is now a two-step process: first we create an upload
object, then we create a post from the upload. This is how it works:

* The user goes to /uploads/new and chooses a file or pastes an URL into
  the file upload component.
* The file upload component calls `POST /uploads` to create an upload.
* `POST /uploads` immediately returns a new upload object in the `pending` state.
* Danbooru starts processing the upload in a background job (downloading,
  resizing, and transferring the image to the image servers).
* The file upload component polls `/uploads/$id.json`, checking the
  upload `status` until it returns `completed` or `error`.
* When the upload status is `completed`, the user is redirected to /uploads/$id.
* On the /uploads/$id page, the user can tag the upload and submit it.
* The upload form calls `POST /posts` to create a new post from the upload.
* The user is redirected to the new post.

This is the data model:

* An upload represents a set of files uploaded to Danbooru by a user.
  Uploaded files don't have to belong to a post. An upload has an
  uploader, a status (pending, processing, completed, or error), a
  source (unless uploading from a file), and a list of media assets
  (image or video files).

* There is a has-and-belongs-to-many relationship between uploads and
  media assets. An upload can have many media assets, and a media asset
  can belong to multiple uploads. Uploads are joined to media assets
  through a upload_media_assets table.

  An upload could potentially have multiple media assets if it's a Pixiv
  or Twitter gallery. This is not yet implemented (at the moment all
  uploads have one media asset).

  A media asset can belong to multiple uploads if multiple people try
  to upload the same file, or if the same user tries to upload the same
  file more than once.

New features:

* On the upload page, you can press Ctrl+V to paste an URL and immediately upload it.
* You can save files for upload later. Your saved files are at /uploads.

Fixes:

* Improved error messages when uploading invalid files, bad URLs, and
  when forgetting the rating.
2022-01-28 04:13:22 -06:00
evazion
a7dc05ce63 Enable frozen string literals.
Make all string literals immutable by default.
2021-12-14 21:33:27 -06:00
evazion
bc506ed1b8 uploads: refactor to simplify ugoira-handling and replacements:
* Make it so replacing a post doesn't generate a dummy upload as a side effect.
* Make it so you can't replace a post with itself (the post should be regenerated instead).
* Refactor uploads and replacements to save the ugoira frame data when
  the MediaAsset is created, not when the post is created. This way it's
  possible to view the ugoira before the post is created.
* Make `download_file!` in the Pixiv source strategy return a MediaFile
  with the ugoira frame data already attached to it, instead of returning it
  in the `data` field then passing it around separately in the `context`
  field of the upload.
2021-10-18 05:18:46 -05:00
nonamethanks
9a6a6e52ea Lofter: raise timeout for file download 2021-09-10 13:10:29 +02:00
evazion
bb7f24d279 Add HTTP proxy support.
Add support for using a proxy for HTTP requests. Only used for external
requests, such as downloading files or talking to source sites such as
Pixiv or Twitter, not for internal requests, such as talking to IQDB or
Reportbooru.
2021-08-28 04:53:33 -05:00
evazion
0249c290fd skeb: remove skeb from site_name in base strategy.
Fixup a mistake with the way the merge conflict was resolved in 9dd903d21.
2021-03-08 03:56:44 -06:00
evazion
1716cc5bf9 artists: add more artist url icons. 2021-03-08 01:30:02 -06:00
evazion
7b60a476e5 sources: add artist profile links to fetch source data box.
Add site icons linking to all the artist's sites in the fetch source
data box.

Some artist entries have a large number of URLs. Various heuristics are
applied to try to present the most useful URLs first. Dead URLs and
redundant URLs (Pixiv stacc and Twitter intent URLs) are filtered out.
Remaining URLs are sorted first by site (to put sites like Pixiv and
Twitter first), then by URL (to break ties when an artist has multiple
accounts on the same site).

Some sites have shitty hard-to-read icons. It can't be helped. The icons
are the official favicons of each site.
2021-02-26 01:24:30 -06:00
evazion
5af50b7fcd danbooru::http: factor out Cloudflare Polish bypassing.
* Factor out the Cloudflare Polish bypass code to a standalone feature.

* Add `http_downloader` method to the base source strategy. This is a
  HTTP client that should be used for downloading images or making
  requests to images. This client ensures that referrer spoofing and
  Cloudflare bypassing are performed.

This fixes a bug with the upload page reporting the polished filesize
instead of the original filesize when uploading ArtStation images.
2020-06-24 22:54:04 -05:00
evazion
4074cc99f9 uploads: fix incorrect remote sizes on pixiv uploads.
Bug: the uploads page showed a remote size of 146 bytes for Pixiv uploads.

Cause: we didn't spoof the Referer header when making the HEAD request
for the image, causing Pixiv to return a 403 error.

Also fix the case where the Content-Length header is absent.
2020-06-24 03:02:45 -05:00
evazion
db3407caa3 uploads: fix uploading from source not working.
ref: 26ad844bbe (r40077579).
2020-06-22 15:32:48 -05:00
evazion
a4efeb2260 gems: drop Mechanize, HTTParty, and Sinatra gems. 2020-06-21 15:13:42 -05:00
evazion
7e471fe223 sources: replace HTTParty with Danbooru::Http in http_exists?. 2020-06-21 15:11:56 -05:00
evazion
26ad844bbe downloads: refactor Downloads::File into Danbooru::Http.
Remove the Downloads::File class. Move download methods to
Danbooru::Http instead. This means that:

* HTTParty has been replaced with http.rb for downloading files.

* Downloading is no longer tightly coupled to source strategies. Before
  Downloads::File tried to automatically look up the source and download
  the full size image instead if we gave it a sample url. Now we can
  do plain downloads without source strategies altering the url.

* The Cloudflare Polish check has been changed from checking for a
  Cloudflare IP to checking for the CF-Polished header. Looking up the
  list of Cloudflare IPs was slow and flaky during testing.

* The SSRF protection code has been factored out so it can be used for
  normal http requests, not just for downloads.

* The Webmock gem can be removed, since it was only used for stubbing
  out certain HTTParty requests in the download tests. The Webmock gem
  is buggy and caused certain tests to fail during CI.

* The retriable gem can be removed, since we no longer autoretry failed
  downloads. We assume that if a download fails once then retrying
  probably won't help.
2020-06-20 00:20:39 -05:00
evazion
1aa0f65187 sources: fix rubocop warnings. 2020-06-16 00:10:37 -05:00
evazion
88d9fc4e5e sources: simplify artist finder url normalization.
Get rid of `normalized_for_artist_finder?` and `normalizable_for_artist_finder?`.
This was legacy bullshit that was originally designed to avoid API calls
when saving artist entries containing old Pixiv direct image urls that
had already been normalized, or that couldn't be normalized because they
were bad id.

Nowadays we store profile urls in artist entries instead of direct image
urls, so we don't normally need to do any API calls to normalize the
profile url. Strategies should take care to avoid triggering API calls
inside `profile_url` when possible.
2020-05-29 15:35:15 -05:00
nonamethanks
307df3b3e4 Refactor source normalization
* Move the source normalization logic out of the post model
  and into individual sources' strategies.
* Rewrite normalization tests to be handled into each source's test,
  and expand them significantly. Previously we were only testing
  a very small subset of domains and variants.
* Fix up normalization for several sites.
* Normalize fav.me urls into normal deviantart urls.
2020-05-21 22:46:51 +02:00
evazion
e3187e0bd0 tags: add general?, character?, copyright?, artist?, meta?, empty? helper methods. 2020-05-10 23:56:50 -05:00
evazion
f38c38f26e search: split tag_match into user_tag_match / system_tag_match.
When doing a tag search, we have to be careful about which user we're
running the search as because the results depend on the current user.
Specifically, things like private favorites, private favorite groups,
post votes, saved searches, and flagger names depend on the user's
permissions, and whether non-safe or deleted posts are filtered out
depend on whether the user has safe mode on or the hide deleted posts
setting enabled.

* Refactor internal searches to explicitly state whether they're
  running as the system user (DanbooruBot) or as the current user.
* Explicitly pass in the current user to PostQueryBuilder instead of
  implicitly relying on the CurrentUser global.
* Get rid of CurrentUser.admin_mode? (used to ignore the hide deleted
  post setting) and CurrentUser.without_safe_mode (used to ignore safe
  mode).
* Change the /counts/posts.json endpoint to ignore safe mode and the
  hide deleted posts settings when counting posts.
* Fix searches not correctly overriding the hide deleted posts setting
  when multiple status: metatags were used (e.g. `status:banned status:active`)
* Fix fast_count not respecting the hide deleted posts setting when the
  status:banned metatag was used.
2020-05-07 03:29:44 -05:00
evazion
ddffffb413 artists: factor out artist finder to separate module. 2020-03-06 23:23:38 -06:00
evazion
60bf21ff80 twitter: fix preview_urls when source url is a direct image.
Fix preview_urls returning an empty array when the source url is a
direct image from Twitter.

Also return preview_urls in /source.json.
2020-01-21 16:34:03 -06:00
evazion
aff3d3b18f Fix various rubocop issues. 2020-01-11 19:01:40 -06:00
evazion
309821bf73 rubocop: fix various style issues. 2019-12-22 21:23:37 -06:00
evazion
eba6440b8b Fix #4144: Deviantart Eclipse update broke strategy. 2019-08-28 23:40:29 -05:00
evazion
1f73e60514 sources: add methods for customizing new artist entries.
* Rename `unique_id` to `tag_name`.

* Add `other_names` and `profile_urls` methods that sources can override
  to provide extra names or urls when creating new artist entries.
2018-12-27 15:03:11 -06:00
evazion
2170961f47 artists: improve prefilling of new artist form (#4028)
* When creating an artist by clicking the '?' next to the artist tag in
  the tag list, prefill the new artist form by finding the artist's last
  upload and fetching its source data.

  Previously we filled the urls with the source of the artist's last
  upload, which was wrong because it was usually a direct image URL (#3078).

* Fix the other names field not escaping spaces within names to underscores.

* Fix the other names field being potentially prefilled with duplicate names.
2018-12-27 15:03:11 -06:00
evazion
c700ea4b5f Fix #4016: Translated tags failing to find some tags.
* Normalize spaces to underscores when saving other names. Preserve case
  since case can be significant.

* Fix WikiPage#other_names_include to search case-insensitively (note:
  this prevents using the index).

* Fix sources to return the raw tags in `#tags` and the normalized tags
  in `#normalized_tags`. The normalized tags are the tags that will be
  matched against other names.
2018-12-16 11:37:57 -06:00
evazion
811bad5a86 /source.json: include raw api responses in output (#3940). 2018-11-30 00:19:00 -06:00
evazion
308a5021b4 wiki pages: convert other_names to array (#3987). 2018-11-13 19:18:11 -06:00
evazion
628341f7f0 Fix #3969: Translated tags should ignore artist tags. 2018-11-04 17:07:35 -06:00
evazion
5cf6a43918 sources: fix sources sometimes choosing wrong strategy (fix #3968)
Fix sources choosing the wrong strategy when the referer belongs to a
different site (for example, when uploading a twitter post with a pixiv
referer).

* Fix `match?` to only consider the main url, not the referer.

* Change `match?` to match against a list of domains given by the `domains` method.

* Change `match?` to an instance method.
2018-11-04 13:00:17 -06:00
evazion
b0d7d90103 tumblr: extract info from url when api data is unavailable.
Derive the artist name / profile url / page url from the source URLs when
the API response is unavailable because the Tumblr post was deleted.

This fixes the artist finder to work on bad_tumblr_id posts.
2018-10-09 12:44:59 -05:00
evazion
09a8198979 /artists: add wildcard, regex search to url field (#3900)
Allow searching the URL field by regex or by wildcard.

If the query looks like `/twitter/` do a regex search, otherwise if it
looks like `http://www.twitter.com/*` do a wildcard search, otherwise if
it looks like an url do an artist finder search, lastly if it looks like
`twitter` do a `*twitter*` search.
2018-09-21 21:19:01 -05:00
evazion
d9ce953752 Fix #3906: Moebooru strategy raises NotImplementedError. 2018-09-16 21:00:11 -05:00
evazion
bd47641601 twitter: don't fail when api key isn't configured. 2018-09-16 15:03:47 -05:00
evazion
325120ee51 twitter: fix parsing of the artist name from the url.
Fixes URLs like https://twitter.com/intent/user?user_id=123 being
incorrectly normalized to http://twitter.com/intent/ in artist entries.

Also fixes the artist name to be taken from the url when it can't be
obtained from the api (when the tweet is deleted).
2018-09-16 15:03:23 -05:00
evazion
583f8457f0 artists: clean up artist finding logic.
Rename Artist#find_all_by_url to url_matches and drop previous
url_matches method, along with find_artists and search_for_profile.

Previously find_artists tried to lookup the url, referer url, and profile
url in turn until an artist match was found. This was wasteful, because
the source strategy already knows which url to lookup (usually the profile
url). If that url doesn't find a match, then the artist doesn't exist.
2018-09-11 20:14:46 -05:00
Albert Yi
4972c998f8 rely on preview urls if available for gallery 2018-09-11 15:06:12 -07:00