sources: add Source::URL class; factor out Source::URL::Twitter.

Introduce a Source::URL class for parsing URLs from source sites. Refactor the Twitter
source strategy to use it.

This is the first step towards factoring all the URL parsing logic out of source
strategies and moving it to subclasses of Source::URL. Each site will have a subclass
of Source::URL dedicated to parsing URLs from that site. Source strategies will use
these classes to extract information from URLs.

This is to simplify source strategies. Most sites have many different URL formats we have
to parse or rewrite, and handling all these different cases tends to make source
strategies very complex. Isolating the URL parsing logic from the site scraping logic
should make source strategies easier to maintain.
evazion
2022-02-23 17:35:39 -06:00
parent 6f5aef1cef
commit 7ed8f95a8e
6 changed files with 221 additions and 63 deletions

@@ -4,29 +4,42 @@ module Danbooru
class URL
class Error < StandardError; end
# @return [String] The original URL as a string.
attr_reader :original_url, :url
delegate_missing_to :url
# Parse a string into an URL, or raise an exception if the string is not a valid HTTP or HTTPS URL.
# @return [Addressable::URI] The parsed and normalized URL.
attr_reader :url
delegate :domain, :host, :site, :path, to: :url
# Parse a string into a URL, or raise an exception if the string is not a valid HTTP or HTTPS URL.
#
# @param string [String]
# @return [Danbooru::URL]
def initialize(string)
@original_url = string
@url = Addressable::URI.heuristic_parse(string).display_uri
# @param url [String, Danbooru::URL]
def initialize(url)
@original_url = url.to_s
@url = Addressable::URI.heuristic_parse(original_url).display_uri
@url.path = nil if @url.path == "/"
raise Error, "#{string} is not an http:// URL" if !@url.normalized_scheme.in?(["http", "https"])
raise Error, "#{original_url} is not an http:// URL" if !@url.normalized_scheme.in?(["http", "https"])
rescue Addressable::URI::InvalidURIError => e
raise Error, e
end
# Parse a string into an URL, or return nil if the string is not a valid HTTP or HTTPS URL.
# Parse a string into a URL, or raise an exception if the string is not a valid HTTP or HTTPS URL.
#
# @param string [String]
# @param url [String, Danbooru::URL]
# @return [Danbooru::URL]
def self.parse(string)
new(string)
rescue StandardError => e
def self.parse!(url)
new(url)
end
# Parse a string into a URL, or return nil if the string is not a valid HTTP or HTTPS URL.
#
# @param url [String, Danbooru::URL]
# @return [Danbooru::URL, nil]
def self.parse(url)
parse!(url)
rescue Error
nil
end
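
The hunk above splits parsing into a raising `parse!` and a nil-returning `parse`, and narrows the rescue from `StandardError` to the class's own `Error`. The following is a minimal sketch of that pattern, not the code from this commit: `SketchURL` is a hypothetical stand-in, and it assumes Ruby's stdlib `URI` in place of Addressable so that it runs without extra gems.

```ruby
require "uri"

# Hypothetical stand-in for Danbooru::URL. Uses stdlib URI rather than
# Addressable (an assumption made so the sketch is self-contained).
class SketchURL
  class Error < StandardError; end

  attr_reader :original_url, :url

  # Accepts a String or another SketchURL; anything else is stringified.
  def initialize(url)
    @original_url = url.to_s
    @url = URI.parse(original_url)
    scheme = @url.scheme.to_s.downcase
    raise Error, "#{original_url} is not an HTTP or HTTPS URL" unless %w[http https].include?(scheme)
  rescue URI::InvalidURIError => e
    # Wrap the library's error so callers only need to rescue one class.
    raise Error, e.message
  end

  def to_s
    url.to_s
  end

  # Raising variant: invalid input propagates as SketchURL::Error.
  def self.parse!(url)
    new(url)
  end

  # Safe variant: only SketchURL::Error is swallowed, so unrelated
  # failures are no longer hidden behind a nil return value.
  def self.parse(url)
    parse!(url)
  rescue Error
    nil
  end
end
```

Rescuing only `Error` in `parse` means that a programming mistake inside the constructor (a `NoMethodError`, say) surfaces as an exception instead of silently turning into `nil`.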
@@ -42,7 +55,7 @@ module Danbooru
# @return [Hash] the URL's query parameters
def params
url.query_values.with_indifferent_access
url.query_values.to_h.with_indifferent_access
end
end
end
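
The `.to_h` added in the last hunk presumably guards against URLs with no query string, where `query_values` returns `nil` and a method call on the result would fail. The sketch below illustrates the same `nil.to_h` idiom under stated assumptions: `query_params` is a hypothetical helper, and stdlib `URI` stands in for Addressable.

```ruby
require "uri"

# Hypothetical helper: returns the URL's query parameters as a Hash,
# or an empty Hash when the URL has no query string.
def query_params(url)
  query = URI.parse(url).query                 # nil when there is no "?..." part
  pairs = query && URI.decode_www_form(query)  # nil, or an array of [key, value] pairs
  pairs.to_h                                   # nil.to_h is {}, so no nil check is needed
end
```

For instance, `query_params("https://example.com/posts?page=2")` gives a Hash with the key `"page"`, while `query_params("https://example.com/posts")` gives an empty Hash rather than raising.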