lxwt909

Learning Lucene 5: PhraseQuery (Phrase Queries)


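    PhraseQuery (phrase query) checks whether a document contains the specified Term or sequence of Terms. When there are several Terms, the slop parameter sets how far apart they are allowed to be. The official API description is shown in the figure: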


    Example usage:

 

package com.yida.framework.lucene5.query;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class PhraseQueryTest {
    public static void main(String[] args) throws IOException {
        // Build a small in-memory index containing three documents.
        Directory dir = new RAMDirectory();
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        iwc.setOpenMode(OpenMode.CREATE);
        IndexWriter writer = new IndexWriter(dir, iwc);

        Document doc = new Document();
        doc.add(new TextField("text", "quick brown fox", Field.Store.YES));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new TextField("text", "jumps over lazy broun dog", Field.Store.YES));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new TextField("text", "jumps over extremely very lazy broxn dog", Field.Store.YES));
        writer.addDocument(doc);

        writer.close();

        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);

        // Query phrase "dog jumps": each add() appends the Term at the next position.
        String term1 = "dog";
        String term2 = "jumps";
        PhraseQuery phraseQuery = new PhraseQuery();
        phraseQuery.add(new Term("text", term1));
        phraseQuery.add(new Term("text", term2));
        // Allow the matched terms to be up to 5 position moves away from the query phrase.
        phraseQuery.setSlop(5);

        TopDocs results = searcher.search(phraseQuery, null, 100);
        ScoreDoc[] scoreDocs = results.scoreDocs;

        for (int i = 0; i < scoreDocs.length; ++i) {
            //System.out.println(searcher.explain(phraseQuery, scoreDocs[i].doc));
            int docID = scoreDocs[i].doc;
            Document document = searcher.doc(docID);
            String text = document.get("text");
            System.out.println("text:" + text);
        }

        reader.close();
    }
}

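   phraseQuery.add(term) always appends the Term to the end of the phrase; you can also call add(term, position) to state its position explicitly. The example adds two Terms, so the query phrase is "dog jumps" with no gap between the two words, and the slop is set to 5.

In the 2nd indexed document ("jumps over lazy broun dog"), moving the word jumps 5 positions to the right yields exactly the query phrase dog jumps, so the document matches and is returned. The 1st document does not contain the word dog at all, so it cannot match. The 3rd document would need 7 moves to produce dog jumps, which exceeds the slop of 5. In the end only the 2nd document is returned.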
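
The move counts above can be checked with a small sketch that does not use Lucene at all. It assumes a simplified model: every query term occurs exactly once in the document, and the required slop is the spread of (document position minus query position) across the query terms. The class SlopSketch and the helper requiredSlop are hypothetical names made up for this illustration; they are not part of the Lucene API, and the token lists are written out by hand rather than produced by an analyzer.

```java
import java.util.Arrays;
import java.util.List;

public class SlopSketch {
    /**
     * Hypothetical helper (not Lucene code): smallest slop at which the phrase
     * could match, assuming each query term occurs exactly once in the document.
     * Returns -1 when a query term is missing from the document.
     */
    static int requiredSlop(List<String> docTokens, String[] queryTerms, int[] queryPositions) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < queryTerms.length; i++) {
            int docPos = docTokens.indexOf(queryTerms[i]);
            if (docPos < 0) {
                return -1; // term absent: no slop value can make the phrase match
            }
            // How far the term sits from the position the query asks for.
            int shift = docPos - queryPositions[i];
            min = Math.min(min, shift);
            max = Math.max(max, shift);
        }
        return max - min;
    }

    public static void main(String[] args) {
        String[] terms = {"dog", "jumps"};
        int[] adjacent = {0, 1}; // query phrase "dog jumps"

        List<String> doc1 = Arrays.asList("quick", "brown", "fox");
        List<String> doc2 = Arrays.asList("jumps", "over", "lazy", "broun", "dog");
        List<String> doc3 = Arrays.asList("jumps", "over", "extremely", "very", "lazy", "broxn", "dog");

        System.out.println(requiredSlop(doc1, terms, adjacent)); // -1 (no "dog")
        System.out.println(requiredSlop(doc2, terms, adjacent)); // 5
        System.out.println(requiredSlop(doc3, terms, adjacent)); // 7
    }
}
```

With a slop of 5, only the document whose required slop is at most 5 can match, which is the 2nd one.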

   If I change the code slightly, like this:

   

        String term1 = "dog";
        String term2 = "jumps";
        PhraseQuery phraseQuery = new PhraseQuery();
        phraseQuery.add(new Term("text",term1),0);
        phraseQuery.add(new Term("text",term2),2);
        phraseQuery.setSlop(6);
        
        TopDocs results = searcher.search(phraseQuery, null, 100);

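   Now the query phrase is "dog xxx jumps": we are looking for documents that contain the terms dog and jumps with one term position between them (stop words not included). The slop therefore has to grow by 1, because one more move is needed, so this time the slop has to be 6.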
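
The effect of the explicit positions can be sketched the same way, again without Lucene. The class GapSlopSketch and the method matches below are hypothetical names for illustration only, and the model is a simplification that assumes each query term occurs exactly once per document: the phrase is treated as matching when the spread of (document position minus query position) is no larger than the slop.

```java
import java.util.Arrays;
import java.util.List;

public class GapSlopSketch {
    // Hypothetical model, not Lucene code: true when the terms can be lined up
    // with the requested query positions using at most `slop` moves.
    static boolean matches(List<String> docTokens, String[] terms, int[] positions, int slop) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < terms.length; i++) {
            int docPos = docTokens.indexOf(terms[i]);
            if (docPos < 0) {
                return false; // a missing term can never match
            }
            int shift = docPos - positions[i];
            min = Math.min(min, shift);
            max = Math.max(max, shift);
        }
        return max - min <= slop;
    }

    public static void main(String[] args) {
        String[] terms = {"dog", "jumps"};
        int[] withGap = {0, 2}; // query phrase "dog xxx jumps"

        List<String> doc2 = Arrays.asList("jumps", "over", "lazy", "broun", "dog");
        List<String> doc3 = Arrays.asList("jumps", "over", "extremely", "very", "lazy", "broxn", "dog");

        System.out.println(matches(doc2, terms, withGap, 5)); // false: one move short
        System.out.println(matches(doc2, terms, withGap, 6)); // true
        System.out.println(matches(doc3, terms, withGap, 6)); // false: would need 8
    }
}
```

Under this model the extra position in the query adds one move to every document, so the 2nd document goes from 5 to 6 and the 3rd from 7 to 8.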

 

    PhraseQuery also has a subclass, NGramPhraseQuery, which builds on the N-gram model; I will skip the algorithmic details here.

   

