Wishlist: date support #34

Open
atz opened this issue Apr 29, 2015 · 7 comments

@atz
Member

atz commented Apr 29, 2015

Solr date range faceting (start/end/gap) is more performant than trying to manually build facet query ranges and allows for the simplest possible way to drill down. It seems like a great match for the interface you have here.

Example query:

https://example.edu/solr/my_core/select?q=accessioned_dt%3A*&fl=id%2Caccessioned_dt&wt=json&indent=true
&facet=true
&facet.date=accessioned_dt
&f.accessioned_dt.facet.date.start=2012-01-01T00:00:01Z
&f.accessioned_dt.facet.date.end=NOW
&f.accessioned_dt.facet.date.gap=%2B1MONTH
&facet.query=accessioned_dt%3A%5B%222011-01-01T00%3A00%3A00Z%22+TO+%222012-01-01T00%3A00%3A00Z%22%5D

Partial response:

  "facet_counts":{
    "facet_queries":{
      "accessioned_dt:[\"2011-01-01T00:00:00Z\" TO \"2012-01-01T00:00:00Z\"]":1476},
    "facet_fields":{},
    "facet_dates":{
      "accessioned_dt":{
        "2012-01-01T00:00:01Z":0,
        "2012-02-01T00:00:01Z":0,
        "2012-03-01T00:00:01Z":0,
        "2012-04-01T00:00:01Z":0,
        "2012-05-01T00:00:01Z":0,
        "2012-06-01T00:00:01Z":0,
        "2012-07-01T00:00:01Z":1,
        "2012-08-01T00:00:01Z":2,
        "2012-09-01T00:00:01Z":0,
        "2012-10-01T00:00:01Z":0,
        "2012-11-01T00:00:01Z":1,
        "2012-12-01T00:00:01Z":1,
        "2013-01-01T00:00:01Z":4,
        "2013-02-01T00:00:01Z":4,
        "2013-03-01T00:00:01Z":5,
        "2013-04-01T00:00:01Z":2,
        "2013-05-01T00:00:01Z":7,
        "2013-06-01T00:00:01Z":0,
        "2013-07-01T00:00:01Z":144,
        "2013-08-01T00:00:01Z":54,
        "2013-09-01T00:00:01Z":0,
        "2013-10-01T00:00:01Z":0,
        "2013-11-01T00:00:01Z":13,
        "2013-12-01T00:00:01Z":0,
        "2014-01-01T00:00:01Z":55,
        "2014-02-01T00:00:01Z":40,
        "2014-03-01T00:00:01Z":74,
        "2014-04-01T00:00:01Z":1313,
        "2014-05-01T00:00:01Z":7,
        "2014-06-01T00:00:01Z":10,
        "2014-07-01T00:00:01Z":1,
        "2014-08-01T00:00:01Z":15815,
        "2014-09-01T00:00:01Z":10,
        "2014-10-01T00:00:01Z":9,
        "2014-11-01T00:00:01Z":4,
        "2014-12-01T00:00:01Z":3,
        "2015-01-01T00:00:01Z":6,
        "2015-02-01T00:00:01Z":3,
        "2015-03-01T00:00:01Z":33,
        "2015-04-01T00:00:01Z":5,
        "gap":"+1MONTH",
        "start":"2012-01-01T00:00:01Z",
        "end":"2015-05-01T00:00:01Z"}},
    "facet_ranges":{},
    "facet_intervals":{}}}

Example shows both styles. You can see why the facet.query enumeration would get tedious. Perhaps we can use this issue to identify what existing impediments to implementation are?

Note: this is all without getting into the new (5.x) Solr DateRangeField.
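As a rough illustration of the start/end/gap style, the parameters in the example query can be assembled programmatically. This is only a sketch: the helper name `date_facet_params` is made up for illustration, and the field and values are copied from the example above.

```python
from urllib.parse import urlencode


def date_facet_params(field, start, end, gap):
    """Return facet.date parameters for one field as (key, value) pairs.

    Hypothetical helper; the parameter names mirror the example query above.
    """
    return [
        ("facet", "true"),
        ("facet.date", field),
        (f"f.{field}.facet.date.start", start),
        (f"f.{field}.facet.date.end", end),
        (f"f.{field}.facet.date.gap", gap),
    ]


params = [("q", "accessioned_dt:*"), ("wt", "json")]
params += date_facet_params(
    "accessioned_dt", "2012-01-01T00:00:01Z", "NOW", "+1MONTH"
)

# urlencode percent-encodes the leading "+" of the gap, which would
# otherwise be decoded as a space on the server side.
query_string = urlencode(params)
print(query_string)
```

One start/end/gap triple replaces what would otherwise be a separate facet.query parameter per month.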

@cbeer
Member

cbeer commented Apr 29, 2015

I'll have to refresh my memory on this, but I think there's some reason why range limits end up being a little tricky (maybe something about how to seed the initial start / end dates?), but 👍 if you can make it work.

@atz
Member Author

atz commented Apr 29, 2015

Yeah, an ideal implementation would cache the min and max values for all sortable fields to provide concrete start/end values (also caching all top-level facets, while we're at it). This would occur at server spin-up (a de facto index-warming query). Barring that, configurable endpoints would suffice (e.g., for most public libraries, 1600 TO NOW/YEAR+1YEAR on a pubdate is good enough). Because we would be targeting a Solr (trie) date field, we can make use of the syntax support for date math as needed.

@jrochkind
Member

I also can't remember the details of why Solr range faceting (the start/end/gap thing isn't limited to 'dates', right?) wasn't going to work right, but I do remember there was some issue.

Some things:

  • The current implementation will work within a given query. If you've already searched for an arbitrary keyword 'baltimore', and within that result set the min/max is 1850/2015, then THAT is the range shown. However, that does already require a double solr query (getting the min/max from the first query, then making another Solr query to get the range segments within that min/max), so this is really an orthogonal concern. If I remember right, the current implementation already lets you set a fixed min/max instead of using the min/max within the current result set and requiring a double query -- it would be trivial to make it dynamically look up and cache the full index min/max on boot. If you wanted to add that feature, I think it would really be an orthogonal concern, and could be used with either the current client-calculated segments, or the Solr range auto-segment feature. But I would still make it an option, as I think it is now -- I think some people still want the dynamic "within the current result set" ranges.
  • Ah, I remember the other issue! The current code calculates ranges with nice human-friendly values. It tries to use powers of 10, 5, and 2, and adds short partial segments at the ends to make this possible. For instance, if the total range of values were 1903 to 1961, the current code might give you ranges: 1903-1904; 1905-1919; 1920-1939; 1940-1959; 1960-1961. But the Solr feature, last time I looked, simply divides the total range equally, so you might get segments that look like 1903-1914; 1915-1926; 1927-1938; etc. I think the human-friendly segment feature is pretty neat, and useful enough that if you wanted to PR a feature to use Solr auto-calculated ranges, it should probably be an option, so people could choose to use the existing human-friendly ranges instead.

(Of course, another option would be submitting a patch to solr itself to use the human-friendly ranges. It seems like something Solr ought to do. The code here that creates human-friendly ranges was originally ported from Flot, which has an MIT-style license).

Personally, I would want the existing functionality -- both ranges calculated within the current result set, not just globally; and human-friendly ranges. Which is why the plugin is currently written how it is.

Also, yet another issue -- you talk about Solr date fields specifically -- the current code actually works on int-like values, not dates. We use it with years, but put the years in a Solr int field. It could be enhanced to work with actual date fields, but it's a little bit tricky to plot them properly with Flot (the JavaScript graph-drawing library the plugin uses), and then translate back in the other direction when applying limits from the Flot graph. Again, I think this is a third orthogonal concern, which is actually independent of the other two. The Solr auto-segmented range queries work on dates as well as ints, so you could use them on an int-like field too.

@jrochkind
Member

Come to think of it, the first bullet point could also be a patch to solr -- use the min/max within the given result set, instead of making the client give you an explicit min/max for auto-segmented ranges. It would still require two lucene queries under the hood, but with a patch to Solr could be one Solr query.

I think patches to Solr would be the best way to handle both bullet points, but I'm not up to the Java myself. Alternatively, I think a PR to this code would be possible, if you really want the behavior you mention, but it should probably be a configurable alternative to the existing behavior rather than a replacement for it; I prefer the existing behavior.

@atz
Member Author

atz commented Apr 29, 2015

@jrochkind, I don't think you get how much of what you are talking about is already in Solr. The contrast is between "range faceting" as implemented here by separate facet queries (facet.query) vs. the newer approach using start/end/gap. To handle your second bullet point, we would just floor the start value and ceil the end value at the granularity of gap. So, for a breakout by "decade" between 1903-1961, we would supply:

&f.accessioned_dt.facet.date.start=1900-01-01T00:00:00Z
&f.accessioned_dt.facet.date.end=1970-01-01T00:00:00Z
&f.accessioned_dt.facet.date.gap=%2B10YEAR
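The floor/ceil step can be sketched in a few lines. This assumes whole-year gaps and treats the end as an exclusive upper bound (hence the `+ 1` before rounding up); `snap_range` is a made-up name, not something in this plugin or in Solr.

```python
def snap_range(lo_year, hi_year, gap_years):
    """Snap an observed [lo_year, hi_year] range outward to gap boundaries.

    Returns (start, end, gap) strings in the form used by the
    facet.date.start / .end / .gap parameters. Illustrative only.
    """
    # Floor the start to the nearest multiple of the gap.
    start = (lo_year // gap_years) * gap_years
    # Ceil the end; hi_year + 1 keeps a hi_year that already sits on a
    # boundary (e.g. 1960) inside the final bucket.
    end = -(-(hi_year + 1) // gap_years) * gap_years
    return (
        f"{start:04d}-01-01T00:00:00Z",
        f"{end:04d}-01-01T00:00:00Z",
        f"+{gap_years}YEAR",
    )


start, end, gap = snap_range(1903, 1961, 10)
print(start, end, gap)
```

For 1903-1961 by decade this yields the same 1900 and 1970 endpoints as the parameters above.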

To your first point, nothing about this stops working when you are "within" a query. Facets that are not query-specific sorta miss the point.

At this point, I only care about dates. Assuming we can suppress leading and trailing 0-valued facet counts, code-level default range of 0000-01-01 to NOW+1YEAR is fine. With basic config-level per-field defaults, we can achieve better efficiency. For many fields, you will always know a good default start value (date_created, date_updated, anything that audits an operation performed on a record) based on when the system was spun up (i.e., the age of the oldest record). But overall the best approach would be to cache top-level corpus-wide start/end values for the target fields, using them to seed the first query. After that for drilling down, we feed back the start/end values from the response (actually the min non-zero-valued container boundary and the top non-zero container boundary + GAP). I don't see a reason you would need to use two solr queries per user query for anything.
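The feedback step (trim leading and trailing empty buckets, then reuse the surviving boundaries for the next query) might look roughly like this. It is a sketch that assumes the `facet_dates` response shape shown in the first comment; `next_bounds` is a hypothetical helper. It sidesteps date arithmetic by using the boundary that follows the last non-empty bucket as "top boundary + GAP".

```python
# Non-bucket keys that can appear alongside the counts in a facet_dates entry.
META_KEYS = {"gap", "start", "end", "before", "after", "between"}


def next_bounds(facet):
    """Return (start, end) covering only the non-empty buckets, or None.

    `facet` is one field's entry under "facet_dates" in the JSON response.
    """
    # Same-format ISO-8601 UTC timestamps sort chronologically as strings.
    buckets = sorted((k, v) for k, v in facet.items() if k not in META_KEYS)
    nonzero = [i for i, (_, count) in enumerate(buckets) if count > 0]
    if not nonzero:
        return None
    first, last = nonzero[0], nonzero[-1]
    new_start = buckets[first][0]
    # Upper bound = lower boundary of the bucket after the last non-empty
    # one, or the response's own "end" if that bucket was the final one.
    new_end = buckets[last + 1][0] if last + 1 < len(buckets) else facet["end"]
    return new_start, new_end


sample = {
    "2012-06-01T00:00:01Z": 0,
    "2012-07-01T00:00:01Z": 1,
    "2012-08-01T00:00:01Z": 2,
    "2012-09-01T00:00:01Z": 0,
    "gap": "+1MONTH",
    "start": "2012-06-01T00:00:01Z",
    "end": "2012-10-01T00:00:01Z",
}
print(next_bounds(sample))
```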

Solr's (int) range faceting now descends from the early implementation of date faceting. For details, check this solr.pl post from 2010.

@jrochkind
Member

Thanks Joe.

I don't understand how you'd avoid two Solr queries, if you want the segments and display to be within the min/max of the current search results -- you have to get the min/max from the first query, then make another one with that min/max set for facet segments one way or another. Unless there's a feature I don't know about?
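To make the two-query flow concrete, here is a minimal sketch. It assumes the min/max come from Solr's StatsComponent (`stats=true`, `stats.field`) and that the segments are then requested with `facet.range`; whether the plugin does it exactly this way is not stated in this thread. `fetch` is a hypothetical callable that sends the parameters to Solr and returns the parsed JSON, and the field name in the usage example is invented.

```python
def stats_params(field):
    """First pass: ask only for min/max of the field within the result set."""
    return {"rows": "0", "stats": "true", "stats.field": field}


def range_params(field, lo, hi, gap):
    """Second pass: start/end/gap range facet over the observed bounds."""
    return {
        "facet": "true",
        "facet.range": field,
        f"f.{field}.facet.range.start": str(lo),
        f"f.{field}.facet.range.end": str(hi),
        f"f.{field}.facet.range.gap": str(gap),
    }


def two_pass_range_facet(field, base_params, gap, fetch):
    """Run the stats query, then the range-facet query seeded from it."""
    first = fetch({**base_params, **stats_params(field)})
    stats = first["stats"]["stats_fields"].get(field)
    if not stats or stats.get("min") is None:
        return first  # nothing matched, so there is nothing to segment
    lo, hi = int(stats["min"]), int(stats["max"])
    # hi + 1 on the assumption that the upper bound is exclusive.
    return fetch({**base_params, **range_params(field, lo, hi + 1, gap)})
```

Called as `two_pass_range_facet("pub_year_i", {"q": "baltimore"}, 10, fetch)`, this issues exactly two requests: one for the bounds, one for the segments.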

I think the current human-friendly segments implemented client-side go past what Solr can auto-segment. The current code tries to divide into roughly N segments whose widths are 2, 5, or 10 (or multiples of those factors), and which end on even 0 or 5 or multiple-of-2 boundaries. If the range of the current search results is 12 years, the current code might divide into segments of 2 years, or for wider ranges 5 years, or 20 years, etc. Which I think is really nice -- but maybe what Solr can do is good enough, and it's not necessary?
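The general idea of human-friendly widths can be sketched as below. This is not the plugin's actual code (which was ported from Flot) and will not reproduce its exact output; it only illustrates picking a step of 1, 2, or 5 times a power of ten and aligning segment boundaries to multiples of that step. The function names and the `target_segments` parameter are assumptions.

```python
import math


def nice_step(lo, hi, target_segments=10):
    """Pick the smallest step of the form {1, 2, 5} * 10^k that yields
    at most roughly `target_segments` segments across [lo, hi]."""
    raw = max((hi - lo) / target_segments, 1)
    magnitude = 10 ** math.floor(math.log10(raw))
    for factor in (1, 2, 5, 10):
        step = factor * magnitude
        if step >= raw:
            return int(step)
    return int(10 * magnitude)


def nice_segments(lo, hi, target_segments=10):
    """Split [lo, hi] into inclusive integer segments whose interior
    boundaries fall on multiples of the chosen step."""
    step = nice_step(lo, hi, target_segments)
    segments = []
    seg_start = lo
    while seg_start <= hi:
        # End just before the next multiple of step, clipped to hi.
        seg_end = min((seg_start // step + 1) * step - 1, hi)
        segments.append((seg_start, seg_end))
        seg_start = seg_end + 1
    return segments


print(nice_segments(1903, 1961, target_segments=5))
```

With start/end/gap, the same step could be sent as the gap after snapping the start down to a multiple of it, at the cost of the first and last buckets extending past the observed min/max.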

At any rate, PR is of course welcome. My interest is just in maintaining the current behavior of 1) dividing into segments within the min/max of the actual search results not the entire index, and 2) making those segments have human-friendly widths/boundaries, not just mathematical range/N.

If you can do that all Solr-side, that would be sweet. If not, but there may be times when someone might prefer to trade-off 'ideal' UI for Solr efficiency, then I'd suggest the current behavior remain as a configurable option, probably on a per-field basis.

@atz
Member Author

atz commented Apr 30, 2015

Yes, I think start/end/gap delivers what you want, just in a different way than this implementation currently does. The container width logic need not be materially different. If you want boundaries of 2, 5 or 20 years, you can specify that per query.

The min/max can come from four places:

  • In subquery, the previous query response
  • In original query, cached from the global top-level facets (this is how you get around 2 solr queries per user query -- have an implied "previous query" that covers the corpus)
  • Configurable per-field defaults
  • Code-level defaults
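The four sources above amount to a fallback chain, which could be expressed along these lines. Everything here is hypothetical naming (`resolve_bounds`, the dict-per-source layout); the code-level default reuses the 0000-01-01 to NOW+1YEAR range mentioned earlier in the thread.

```python
CODE_DEFAULT = ("0000-01-01T00:00:00Z", "NOW+1YEAR")


def resolve_bounds(field, previous=None, corpus_cache=None,
                   field_defaults=None, code_default=CODE_DEFAULT):
    """Return (start, end) for `field` from the most specific source.

    Each source is an optional dict mapping field name -> (start, end):
      previous       -- bounds fed back from the previous query response
      corpus_cache   -- corpus-wide min/max cached at startup
      field_defaults -- per-field configuration
    """
    for source in (previous, corpus_cache, field_defaults):
        if source and field in source:
            return source[field]
    return code_default


# A first (non-drill-down) query has no previous response, so the
# cached corpus-wide bounds seed it without an extra Solr round trip.
cache = {"accessioned_dt": ("2012-07-01T00:00:01Z", "2015-05-01T00:00:01Z")}
print(resolve_bounds("accessioned_dt", corpus_cache=cache))
```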

3 participants