ES Aggregation Pagination



Aggregation pagination

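ES can return query results and aggregation results in the same response. In the earlier posts on aggregation queries, the two were wrapped in separate clauses. Sometimes, though, we want the aggregated result itself to be presented as the top N documents of each group. The most common case is e-commerce search: a search for "Apple iPhone 6S" should show just one listing for the 6S model, no matter how many colors that model comes in. In addition, once aggregation results and query results are wrapped together, the results also have to be paginated, and the aggregation queries covered in the earlier posts cannot handle that.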

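ES offers two features that meet these needs: the Top hits aggregation and Collapse (field collapsing, referred to in this post as the Collapse aggregation). The two take different approaches to pagination.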

1.1 Top hits aggregation

A Top hits aggregation selects, within each bucket, the top N documents according to some rule and returns them for display.

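For example, when searching for "金都", to group the hits by city and show up to 3 documents per group in descending order of match score, the DSL is: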

# Top hits aggregation
GET /hotel_poly/_search
{
  "query": {
    "match": {
      "title": "金都"
    }
  },
  "aggs": {
    "group_city": {
      "terms": {
        "field": "city"
      },
      "aggs": {
        "my_top": {
          "top_hits": {
            "size": 3
          }
        }
      }
    }
  }
}

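Running this DSL shows that 3 documents in the index match the match query. The aggregation groups them by city into two buckets, "北京" (Beijing) and "天津" (Tianjin): the "北京" bucket has two matching documents, listed in descending order of score, and the "天津" bucket has only one.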
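To make the grouping semantics concrete, here is a minimal plain-Java sketch (no ES involved) of what "top N documents per group" means. The class and method names are hypothetical, the scores are made up for illustration, and "Beijing" and "Tianjin" stand in for the two city values. Buckets appear in first-seen order here, which is not necessarily the order a terms aggregation would use.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TopHitsSketch {
    /** A matched document: id, grouping key, and relevance score. */
    static final class Doc {
        final String id;
        final String city;
        final double score;

        Doc(String id, String city, double score) {
            this.id = id;
            this.city = city;
            this.score = score;
        }
    }

    /** Groups docs by city and keeps the ids of the n highest-scoring docs per city. */
    static Map<String, List<String>> topHitsByCity(List<Doc> docs, int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        // Step 1: bucket the documents by city (what the terms aggregation does).
        Map<String, List<Doc>> groups = new LinkedHashMap<>();
        for (Doc d : docs) {
            groups.computeIfAbsent(d.city, k -> new ArrayList<>()).add(d);
        }
        // Step 2: inside each bucket, sort by score descending and keep the first n.
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Doc>> e : groups.entrySet()) {
            List<Doc> sorted = new ArrayList<>(e.getValue());
            sorted.sort(Comparator.comparingDouble((Doc d) -> d.score).reversed());
            List<String> ids = new ArrayList<>();
            for (Doc d : sorted.subList(0, Math.min(n, sorted.size()))) {
                ids.add(d.id);
            }
            result.put(e.getKey(), ids);
        }
        return result;
    }
}
```

Calling topHitsByCity with documents 002 and 003 in one city and 004 in the other, and n set to 3, yields two entries: one holding two ids ordered by score and one holding a single id, which mirrors the shape of the aggregation result described above.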

In Java, the Top hits aggregation can be used as follows:

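// Imports assume the Elasticsearch 7.x High Level REST Client.
// `client` (assumed to be a RestHighLevelClient) and `log` (an SLF4J-style logger) are fields of the enclosing class.
import java.io.IOException;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.Aggregations;
import org.elasticsearch.search.aggregations.BucketOrder;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregationBuilder;
import org.elasticsearch.search.aggregations.metrics.TopHits;
import org.elasticsearch.search.aggregations.metrics.TopHitsAggregationBuilder;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public void getAggTopHitsSearch() throws IOException {
    // Create the search request
    SearchRequest searchRequest = new SearchRequest("hotel_poly");
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();

    // Terms aggregation: one bucket per city, buckets ordered by key ascending
    String termsAggName = "my_terms";
    TermsAggregationBuilder termsAggregationBuilder = AggregationBuilders.terms(termsAggName).field("city");
    BucketOrder bucketOrder = BucketOrder.key(true);
    termsAggregationBuilder.order(bucketOrder);

    // Top hits sub-aggregation: up to 3 documents per bucket
    String topHitsAggName = "my_top";
    TopHitsAggregationBuilder topHitsAgg = AggregationBuilders.topHits(topHitsAggName);
    topHitsAgg.size(3);
    // Define the parent-child relationship between the two aggregations
    termsAggregationBuilder.subAggregation(topHitsAgg);

    // Attach the aggregation and the match query, then run the search
    searchSourceBuilder.aggregation(termsAggregationBuilder);
    searchSourceBuilder.query(QueryBuilders.matchQuery("title", "金都"));
    searchRequest.source(searchSourceBuilder);
    SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);

    // Read the terms buckets from the aggregation results
    Aggregations aggregations = searchResponse.getAggregations();
    Terms terms = aggregations.get(termsAggName);
    for (Terms.Bucket bucket : terms.getBuckets()) {
        String bucketKey = bucket.getKey().toString();
        log.info("termsKey={}", bucketKey);
        // Read the top hits inside this bucket
        TopHits topHits = bucket.getAggregations().get(topHitsAggName);
        SearchHit[] searchHits = topHits.getHits().getHits();
        for (SearchHit searchHit : searchHits) {
            log.info(searchHit.getSourceAsString());
        }
    }
}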

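The Top hits aggregation does meet the requirement of presenting aggregated results as the top N documents of each group, but unfortunately it cannot paginate on its own. If you use Top hits and still want paging, the aggregation result must not be too large: the client has to do the paging itself, so the whole result has to fit in whatever store backs the paging. One option is to fetch the aggregation result once, keep it in memory or in Redis, and implement the page-turning logic yourself.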
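The client-side paging described here can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the aggregation result has already been fetched once and flattened into a list (for example one entry per bucket) held in memory, and ClientSidePager and page are hypothetical names, not part of any ES client.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ClientSidePager {
    private ClientSidePager() {
    }

    /**
     * Returns the slice [from, from + size) of the cached items,
     * or an empty list when from is past the end.
     */
    static <T> List<T> page(List<T> items, int from, int size) {
        if (from < 0 || size < 0) {
            throw new IllegalArgumentException("from and size must be non-negative");
        }
        if (from >= items.size()) {
            return Collections.emptyList();
        }
        // Use long arithmetic so from + size cannot overflow an int.
        int to = (int) Math.min((long) items.size(), (long) from + size);
        // Copy the slice so the caller does not hold a view onto the cached list.
        return new ArrayList<>(items.subList(from, to));
    }
}
```

Because every page is cut from the same cached list, the cost of this approach is dominated by the size of that list, which is why it only suits aggregations with a modest number of buckets.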

1.2 Collapse aggregation

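As noted above, when a large number of documents in the index match, the Top hits aggregation becomes inefficient and leaves ordering and paging to the client. To address this, ES provides Collapse: you name the grouping field in the collapse clause, the documents matching the query are grouped by that field, and each group is represented in the results by its top document, which by default is the highest-scoring one. When from and size are specified outside the query clause, they are applied after collapsing, so pagination operates on the groups rather than on individual documents.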

The following DSL shows how Collapse is used:

# Collapse aggregation
# from: offset of the first result on the page; size: number of results per page
# query: the match logic; collapse: group the hits by city
GET /hotel_poly/_search
{
  "from": 0,
  "size": 5,
  "query": {
    "match": {
      "title": "金都"
    }
  },
  "collapse": {
    "field": "city"
  }
}

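In the response to this DSL, unlike with Top hits, the collapsed results are returned inside hits rather than in an aggregations section. 3 documents in the index match the match query, and they are collapsed by city into two groups, "北京" (Beijing) and "天津" (Tianjin). Two documents match under "北京", of which 003 has the highest score and represents the group; only one document matches under "天津". The result is ordered by score and can also be paged.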
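The order of operations (collapse first, then apply from and size to the groups) can be illustrated with a small plain-Java sketch. It does not call ES; it only mimics the default behavior described above. CollapseSketch and collapsedPage are hypothetical names, the scores are made up, and "Beijing" and "Tianjin" stand in for the two city values.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CollapseSketch {
    /** A matched document: id, collapse key, and relevance score. */
    static final class Doc {
        final String id;
        final String city;
        final double score;

        Doc(String id, String city, double score) {
            this.id = id;
            this.city = city;
            this.score = score;
        }
    }

    /** Collapses docs by city, orders the groups by their best score, then pages the groups. */
    static List<String> collapsedPage(List<Doc> docs, int from, int size) {
        if (from < 0 || size < 0) {
            throw new IllegalArgumentException("from and size must be non-negative");
        }
        // Step 1: keep only the highest-scoring document of each city.
        Map<String, Doc> best = new LinkedHashMap<>();
        for (Doc d : docs) {
            Doc current = best.get(d.city);
            if (current == null || d.score > current.score) {
                best.put(d.city, d);
            }
        }
        // Step 2: order the group representatives by score, highest first.
        List<Doc> heads = new ArrayList<>(best.values());
        heads.sort(Comparator.comparingDouble((Doc d) -> d.score).reversed());
        // Step 3: apply from/size to the collapsed list, not to the raw matches.
        long end = Math.min((long) heads.size(), (long) from + size);
        List<String> ids = new ArrayList<>();
        for (int i = from; i < end; i++) {
            ids.add(heads.get(i).id);
        }
        return ids;
    }
}
```

With three matching documents in two cities, a page of size 5 starting at 0 contains two entries, one per city, and a page starting at 1 contains only the second city's representative. The offset counts groups, not documents.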

In Java, Collapse can be used as follows:

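// Imports assume the Elasticsearch 7.x High Level REST Client.
// `client` (assumed to be a RestHighLevelClient) and `log` (an SLF4J-style logger) are fields of the enclosing class.
import java.io.IOException;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.SearchHits;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.collapse.CollapseBuilder;

public void getCollapseAggSearch() throws IOException {
    // Collapse (group) the hits by city
    CollapseBuilder collapseBuilder = new CollapseBuilder("city");
    SearchRequest searchRequest = new SearchRequest("hotel_poly"); // create the search request
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    // Create the match query
    searchSourceBuilder.query(QueryBuilders.matchQuery("title", "金都"));
    searchSourceBuilder.collapse(collapseBuilder); // set the collapse clause
    // Paging is applied after collapsing; these values mirror the from/size of the DSL
    searchSourceBuilder.from(0);
    searchSourceBuilder.size(5);
    searchRequest.source(searchSourceBuilder); // attach the search source
    SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT); // run the search
    SearchHits searchHits = searchResponse.getHits(); // collapsed result set
    for (SearchHit searchHit : searchHits) {
        String index = searchHit.getIndex();           // index name
        String id = searchHit.getId();                 // document _id
        float score = searchHit.getScore();            // score
        String source = searchHit.getSourceAsString(); // document content
        log.info("index={},id={},score={},source={}", index, id, score, source);
    }
}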
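When the page comes from a user-facing page number, the from offset has to be derived from it. The helper below is a hedged sketch of that arithmetic. PageOffsets and fromOffset are hypothetical names, and the limit check assumes the index keeps the default index.max_result_window of 10000, which caps from + size for a search.

```java
final class PageOffsets {
    /** Default value of the index.max_result_window index setting (assumed unchanged). */
    static final int DEFAULT_MAX_RESULT_WINDOW = 10_000;

    private PageOffsets() {
    }

    /** Converts a 1-based page number into the zero-based "from" offset. */
    static int fromOffset(int pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNumber and pageSize must be at least 1");
        }
        // Long arithmetic avoids int overflow for very large page numbers.
        long from = (long) (pageNumber - 1) * pageSize;
        if (from + pageSize > DEFAULT_MAX_RESULT_WINDOW) {
            throw new IllegalArgumentException("from + size exceeds the assumed result window");
        }
        return (int) from;
    }
}
```

The returned value can be passed to searchSourceBuilder.from(...), with the page size passed to searchSourceBuilder.size(...).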

Data source

Index structure

PUT /hotel_poly
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "title":{
        "type": "text"
      },
      "city":{
        "type": "keyword"
      },
      "price":{
        "type": "double"
      },
      "create_time":{
        "type": "date",
        "format": "yyyyMMddHHmmss"
      },
      "full_room":{
        "type": "boolean"
      },
      "location":{
        "type": "geo_point"
      },
      "tags":{
        "type": "keyword"
      },
      "comment_info":{
        "properties": {
          "favourable_comment":{
            "type":"integer"
          },
          "negative_comment":{
            "type":"integer"
          }
        }
      }
    }
  }
}

Hotel data

POST /_bulk
{"index":{"_index":"hotel_poly","_id":"001"}}
{"title":"文雅假日酒店","city":"北京","price":556.00,"create_time":"20200418120000","full_room":true,"location":{"lat":39.938838,"lon":106.449112},"tags":["wifi","小型电影院"],"comment_info":{"favourable_comment":20,"negative_comment":10}}
{"index":{"_index":"hotel_poly","_id":"002"}}
{"title":"金都嘉怡假日酒店","city":"北京","create_time":"20210315200000","full_room":false,"location":{"lat":39.915153,"lon":116.4030},"tags":["wifi","免费早餐"],"comment_info":{"favourable_comment":20,"negative_comment":10}}
{"index":{"_index":"hotel_poly","_id":"003"}}
{"title":"金都假日酒店","city":"北京","price":200.00,"create_time":"20210509160000","full_room":true,"location":{"lat":40.002096,"lon":116.386673},"comment_info":{"favourable_comment":20,"negative_comment":10}}
{"index":{"_index":"hotel_poly","_id":"004"}}
{"title":"金都假日酒店","city":"天津","price":500.00,"create_time":"20210218080000","full_room":false,"location":{"lat":39.155004,"lon":117.203976},"tags":["wifi","免费车位"]}
{"index":{"_index":"hotel_poly","_id":"005"}}
{"title":"文雅精选酒店","city":"天津","price":800.00,"create_time":"20210101080000","full_room":true,"location":{"lat":39.178447,"lon":117.219999},"tags":["wifi","充电车位"],"comment_info":{"favourable_comment":20,"negative_comment":10}}

This discussion topic was split from the original post at https://juejin.cn/post/7369052013152501812