The goal
I would like to look into creating a clean provider model for switching the search and indexing engine used by Umbraco.
Lucene/Examine is a great default option, but for some solutions you want to use a centralized index such as Elasticsearch or Azure Search.
I would like the provider model to take care of:
- Common configuration for which fields to index
- An interface describing the indexing capabilities Umbraco Core needs
- An interface describing the searching capabilities Umbraco Core needs
- An interface describing some basic search capabilities that would be exposed to site builders through Umbraco Core
Advanced search for site builders should just go directly to whatever search engine they have chosen to use.
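To make the list above more concrete, here is a rough sketch of what the indexing and searching interfaces could look like. It is written in Python for brevity rather than C#, and every name in it (IndexDocument, IndexProvider, SearchProvider, InMemoryProvider) is hypothetical and not part of Umbraco or Examine. The in-memory class only stands in for a real engine.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class IndexDocument:
    """Hypothetical structured payload handed to a provider."""
    id: str
    content_type: str
    fields: dict[str, str] = field(default_factory=dict)


class IndexProvider(Protocol):
    """Indexing capabilities Core would need (assumed minimal set)."""
    def index(self, document: IndexDocument) -> None: ...
    def delete(self, document_id: str) -> None: ...


class SearchProvider(Protocol):
    """Searching capabilities Core would need (assumed minimal set)."""
    def search(self, text: str) -> list[str]: ...


class InMemoryProvider:
    """Toy engine standing in for Lucene, Elasticsearch or Azure Search."""

    def __init__(self) -> None:
        self._documents: dict[str, IndexDocument] = {}

    def index(self, document: IndexDocument) -> None:
        self._documents[document.id] = document

    def delete(self, document_id: str) -> None:
        self._documents.pop(document_id, None)

    def search(self, text: str) -> list[str]:
        # Naive case-insensitive substring match over all field values.
        needle = text.lower()
        return sorted(
            doc.id
            for doc in self._documents.values()
            if any(needle in value.lower() for value in doc.fields.values())
        )
```

The basic search exposed to site builders could be a thin wrapper around the same search method, while anything advanced goes straight to the chosen engine.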
The current options
I have taken a look at the “Moriyama Azure Search” package. It is implemented by creating “dummy providers” and subscribing to a lot of events in order to gather data for the index. It then intercepts Angular requests to the search API and sends them to a different controller instead. This does not feel like a clean interface for implementing new providers.
Examine also has a concept of providers. However, it seems that the UmbracoIndexers inherit from the Examine ones, which means that in order to implement an indexer for a different engine, you would also need to rewrite or override the existing indexers in Umbraco. The Examine provider also requires you to understand what data might be in the XElement that is passed to it. This option does not feel like a clean interface either.
Proposal
I would like to propose a set of interfaces and classes that describe, in a structured way, which data structures the provider is expected to index and which query operations it should support.
The operations should be kept relatively simple, so that most engines can implement them; maybe just boosting and fuzziness.
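As an illustration of how small such a query model could be, the sketch below describes a search as plain data (text, fields with an optional boost, a fuzzy flag) and lets each provider translate it. The class and function names are hypothetical. The translation shown targets the classic Lucene query-parser syntax (`~` for fuzzy, `^` for boost) and assumes the text is a single term without special characters; an Elasticsearch or Azure Search provider would map the same object to its own request format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FieldClause:
    """A field to search in, with an optional relevance boost."""
    name: str
    boost: float = 1.0


@dataclass(frozen=True)
class SearchQuery:
    """Engine-neutral query description (hypothetical)."""
    text: str
    fields: tuple[FieldClause, ...]
    fuzzy: bool = False
    max_results: int = 10


def to_lucene_syntax(query: SearchQuery) -> str:
    """Render the query as a classic Lucene query string.

    Assumes query.text is a single term with no characters that
    would need escaping.
    """
    term = query.text + "~" if query.fuzzy else query.text
    parts = []
    for clause in query.fields:
        part = f"{clause.name}:{term}"
        if clause.boost != 1.0:
            part += f"^{clause.boost:g}"
        parts.append(part)
    return " OR ".join(parts)
```

Keeping the query as data means Core never builds engine-specific strings itself; only the provider does.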
Umbraco Core would supply the configuration to the provider if needed. Core could also handle filtering out properties that should not be indexed before the data is sent to the provider.
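The filtering step could be as simple as the following sketch. The IndexConfiguration shape (an include list and an exclude list) is an assumption made for illustration, not an existing Umbraco setting.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class IndexConfiguration:
    """Hypothetical shared configuration owned by Core."""
    # An empty include set is treated as "index every field".
    include_fields: frozenset[str] = frozenset()
    exclude_fields: frozenset[str] = frozenset()


def filter_fields(values: dict[str, str], config: IndexConfiguration) -> dict[str, str]:
    """Return only the fields the provider is allowed to see."""
    return {
        name: value
        for name, value in values.items()
        if name not in config.exclude_fields
        and (not config.include_fields or name in config.include_fields)
    }
```

Because Core applies this before calling the provider, every engine gets the same field set without reimplementing the rules.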
The advantage of this approach is that Core would handle all the logic around when and what to index, and the providers would only need to handle persistence and querying. This helps avoid changes to packages when, for example, event models change in Core. It also allows other packages to keep subscribing to Core indexing events, regardless of which engine the data is eventually stored in.
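A minimal sketch of that division of labour, with hypothetical names throughout: a Core-side service raises the indexing event, lets subscribers adjust the data, and only then hands it to whichever provider is configured. The provider never sees the event model.

```python
from __future__ import annotations

from typing import Callable, Protocol


class IndexProvider(Protocol):
    """The only thing a provider has to implement in this sketch."""
    def index(self, document_id: str, fields: dict[str, str]) -> None: ...


class IndexingService:
    """Hypothetical Core service: owns the events, delegates storage."""

    def __init__(self, provider: IndexProvider) -> None:
        self._provider = provider
        self._handlers: list[Callable[[str, dict[str, str]], None]] = []

    def on_indexing(self, handler: Callable[[str, dict[str, str]], None]) -> None:
        """Let a package subscribe to the indexing event."""
        self._handlers.append(handler)

    def content_published(self, document_id: str, fields: dict[str, str]) -> None:
        data = dict(fields)  # copy so handlers cannot mutate the caller's dict
        for handler in self._handlers:
            handler(document_id, data)  # packages may add or change fields
        self._provider.index(document_id, data)


class RecordingProvider:
    """Stand-in provider that just remembers what it was asked to store."""

    def __init__(self) -> None:
        self.stored: dict[str, dict[str, str]] = {}

    def index(self, document_id: str, fields: dict[str, str]) -> None:
        self.stored[document_id] = fields
```

Swapping RecordingProvider for another implementation would not affect the subscribed handlers, which is the point of keeping the events in Core.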
What do you think?
Would it be possible for a PR like this to get merged into V8?
Would it be accepted in general?
This is a companion discussion topic for the original entry at https://our.umbraco.com/forum/94836-provider-model-for-search-and-indexing