I see that words like “the” can't be searched for in the ExternalIndex (Examine issue when text includes stop words).
However, I need to be able to search on those words in specific fields. Is that possible?
I think this has to do with the analyzer being used. The analyzer tokenizes text and removes stop words. I'm not sure you can fix this on the query side, to be honest.
I'm no expert though; I find indexing, tokenizers, analyzers, etc. very complex. I need to index literal text and find it as-is soon, so I might find an answer then.
search for and boost a specific stop word in examine
Maybe you can still create and extend your own analyser?
Repeating the code here in case the linked post disappears.
using System.Collections;
using System.Linq;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

namespace My.Namespace
{
    // StandardAnalyzer variant (Lucene.Net 2.9 API) that no longer treats "will" as a stop word.
    public class MyStandardAnalyser : StandardAnalyzer
    {
        public MyStandardAnalyser()
            : base(Lucene.Net.Util.Version.LUCENE_29, MyStandardAnalyser.ENGLISH_STOP_WORDS_SET)
        {
        }

        private static Hashtable ENGLISH_STOP_WORDS_SET
        {
            get
            {
                // Start from the default English stop words and drop the one we want searchable.
                var stopWords = Lucene.Net.Analysis.Standard.StandardAnalyzer.STOP_WORDS;
                var set = stopWords.Where(w => w.ToLower() != "will").ToArray();
                var charSet = new CharArraySet(set, true); // true = ignore case
                return CharArraySet.UnmodifiableSet(charSet);
            }
        }
    }
}
Then I amended the ExamineSettings.config file so that the ExternalIndexer and ExternalSearcher use the new analyzer.
And presumably this will require a full reindex.
Custom indexing | Umbraco CMS
public void Configure(string? name, LuceneDirectoryIndexOptions options)
{
    switch (name)
    {
        // NB: you need to rebuild the Examine index for these changes to take effect
        case Constants.UmbracoIndexes.ExternalIndexName:
            options.Analyzer = new MyStandardAnalyser();
            break;
    }
}
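For anyone landing here later, this is a rough sketch of how that fragment might be wired up end to end on Umbraco 9+ (Examine on Lucene.NET 4.8). As far as I know `StandardAnalyzer` is sealed in Lucene.NET 4.8, so the older `MyStandardAnalyser` subclass would not compile there; this sketch builds a plain `StandardAnalyzer` with a trimmed stop word set instead. The names `ConfigureExternalIndexOptions`, `ExamineComposer` and `BuildAnalyzer` are made up for illustration, and I have not run this against a real site.

```csharp
using System;
using System.Linq;
using Examine.Lucene;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Options;
using Umbraco.Cms.Core;
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.DependencyInjection;

namespace My.Namespace
{
    // Hypothetical options class: swaps the analyzer of the external index.
    public class ConfigureExternalIndexOptions : IConfigureNamedOptions<LuceneDirectoryIndexOptions>
    {
        public void Configure(string? name, LuceneDirectoryIndexOptions options)
        {
            switch (name)
            {
                // The index must be rebuilt before this has any effect.
                case Constants.UmbracoIndexes.ExternalIndexName:
                    options.Analyzer = BuildAnalyzer("will");
                    break;
            }
        }

        // Part of the interface only; named options go through the overload above.
        public void Configure(LuceneDirectoryIndexOptions options)
            => throw new NotImplementedException("Only named options are configured here.");

        // Default English stop words minus the words that should stay searchable.
        private static Analyzer BuildAnalyzer(params string[] searchableWords)
        {
            var stopWords = StandardAnalyzer.STOP_WORDS_SET
                .Where(w => !searchableWords.Contains(w, StringComparer.OrdinalIgnoreCase))
                .ToList();

            return new StandardAnalyzer(
                LuceneVersion.LUCENE_48,
                new CharArraySet(LuceneVersion.LUCENE_48, stopWords, true)); // true = ignore case
        }
    }

    // Hypothetical composer that registers the options class at startup.
    public class ExamineComposer : IComposer
    {
        public void Compose(IUmbracoBuilder builder)
            => builder.Services.ConfigureOptions<ConfigureExternalIndexOptions>();
    }
}
```

Like the original, this changes the stop words for the whole index, not for individual fields.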
I think this still doesn’t quite solve the issue, as it changes the stop words for all fields instead of only a subset of them. But I’ll take a look at the StandardAnalyzer to see if I can create my own and add the functionality that I need. Thanks!
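On the "only a subset of fields" part: at the plain Lucene level there is `PerFieldAnalyzerWrapper`, which picks an analyzer per field name. Below is a minimal sketch assuming Lucene.NET 4.8; the field name `exactTitle` and the class `PerFieldAnalyzers` are hypothetical. I don't know whether Examine honors a wrapper assigned to `options.Analyzer`, since it also manages per-field analyzers through its own field value types, so treat this as an idea to test, not a confirmed fix.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Util;
using Lucene.Net.Util;

public static class PerFieldAnalyzers
{
    public static Analyzer Build()
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;

        // StandardAnalyzer with an empty stop word set: tokenizes and lowercases,
        // but keeps words like "the".
        var keepStopWords = new StandardAnalyzer(version, new CharArraySet(version, 0, true));

        // Only the fields listed here get the special analyzer ("exactTitle" is made up).
        var fieldAnalyzers = new Dictionary<string, Analyzer>
        {
            ["exactTitle"] = keepStopWords
        };

        // Every other field falls back to the default StandardAnalyzer with its stop words.
        return new PerFieldAnalyzerWrapper(new StandardAnalyzer(version), fieldAnalyzers);
    }
}
```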
Maybe a different approach: set the field you want to search as raw for indexing, and then phrase match the term on that field?
// When configuring the index:
case Constants.UmbracoIndexes.ExternalIndexName:
    options.FieldDefinitions.TryAdd(new FieldDefinition("rawField", FieldDefinitionTypes.Raw));
    break;
// When searching:
var query = searcher.CreateQuery(IndexTypes.Content).Field("rawField", searchTerm.Escape());
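To put the query line in context, here is a rough sketch of a search helper for Umbraco 9+. The class `RawFieldSearchService` and its method are hypothetical, and it assumes `rawField` has actually been populated for each content item during indexing, which the snippets here do not cover. My understanding is that a raw field is indexed as a single untokenized value, so the term may need to equal the whole field value to match; I have not verified that.

```csharp
using System.Collections.Generic;
using System.Linq;
using Examine;
using Examine.Search;
using Umbraco.Cms.Core;
using Umbraco.Cms.Infrastructure.Examine;

// Hypothetical helper that looks up content by an exact value in "rawField".
public class RawFieldSearchService
{
    private readonly IExamineManager _examineManager;

    public RawFieldSearchService(IExamineManager examineManager)
        => _examineManager = examineManager;

    public IEnumerable<string> FindContentIds(string searchTerm)
    {
        // Bail out quietly if the external index is not registered.
        if (!_examineManager.TryGetIndex(Constants.UmbracoIndexes.ExternalIndexName, out var index))
        {
            return Enumerable.Empty<string>();
        }

        // Escape() stops the query parser from interpreting the term.
        var results = index.Searcher
            .CreateQuery(IndexTypes.Content)
            .Field("rawField", searchTerm.Escape())
            .Execute();

        return results.Select(r => r.Id).ToList();
    }
}
```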
Oh that’s a good idea
Does the raw mean that it doesn’t use the stop words?
I think raw means that whatever you put in is stored in the index as-is. Usually, analyzers tokenize the input for easier searching and can also be language specific, so that for instance you can find the German ‘ß’ with ‘ss’ or handle singular and plural forms of a word. They also skip the stop words for a language. It’s really quite cool, but complex.
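To make the difference concrete, here is a toy illustration in plain C#. It is not Lucene or Examine code: `ToyAnalysis`, the tiny stop word list and the ‘ß’ folding are all made up to mimic what an analyzer does compared with a raw field.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ToyAnalysis
{
    // Tiny made-up stop word list; real analyzers ship a longer one per language.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "the", "a", "and", "of" };

    // "Analyzed": lowercase, fold ß to ss, split into tokens, drop stop words.
    public static List<string> Analyze(string text) =>
        text.ToLowerInvariant()
            .Replace("ß", "ss")
            .Split(new[] { ' ', ',', '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(token => !StopWords.Contains(token))
            .ToList();

    // "Raw": the whole value is kept as one token, exactly as supplied.
    public static List<string> Raw(string text) => new List<string> { text };
}
```

With this sketch, `ToyAnalysis.Analyze("The Straße of the city")` yields the tokens `strasse` and `city`, so a search for “the” has nothing to match, while `ToyAnalysis.Raw(...)` keeps the full original string including “The”.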
Interesting. I’ll take a look at that. Thanks for the explanation!