
How do I manage the Dictionary?

Introduction

If you arrived at this page, you have most likely been contacted by us regarding the deployment of dictionary-based features to your Cxense Search cluster. This page explains these features and guides you through the steps required to update the dictionaries whenever needed.

This is an Advanced Cxense Search feature.

 

If you do not have such dictionaries enabled in your cluster yet, please contact support@cxense.com to get the initial dictionaries enabled for your cluster.

Currently, any custom dictionaries will be applied for the whole cluster (i.e. used by all indexes). If you need custom dictionaries per index or support for different languages within one cluster, this is something that needs to be discussed with your Onboarding point of contact.

Working with Dictionaries

There are just a few steps that must be followed before you can start working with your custom dictionaries, as outlined below:

  1. Understand the dictionary format that must be used for each dictionary you want to update
  2. Understand the particularities of each distinct type of dictionary (e.g. suggest-whitelist vs. synonyms)
  3. Get your development environment ready to push updated dictionaries using our API

Dictionary Format

All customer-provided dictionaries must be Excel workbooks (.xlsx). Each worksheet in the workbook must be given a specific name that identifies its function. Below are the currently valid worksheet names, one per dictionary type (the @@ must be included as part of the worksheet name):

  • @@suggest-whitelist
  • @@suggest-blacklist
  • @@synonyms
  • @@spell-whitelist

Each worksheet also needs to have a corresponding property worksheet with the name in the format <sheet name>-properties (e.g. @@synonyms-properties), that will contain configurations such as language, normalization and type of matching to be performed. To make things easier, the examples included at the end of this page already contain those property worksheets and can be used as a starting point.
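The naming rules above can be checked before a workbook is uploaded. The following is a minimal sketch, not part of any Cxense tooling: the function name check_sheet_names is hypothetical, and it assumes you have already read the worksheet names out of the workbook with a spreadsheet library of your choice. It flags names outside the list given above, and dictionary worksheets that have no matching -properties worksheet.

```python
# Worksheet names listed in this article (the @@ prefix is part of the name).
VALID_SHEETS = {
    "@@suggest-whitelist",
    "@@suggest-blacklist",
    "@@synonyms",
    "@@spell-whitelist",
}
SUFFIX = "-properties"


def check_sheet_names(sheet_names):
    """Return a list of problems found; an empty list means the names look fine.

    Hypothetical helper: it only checks names, not worksheet contents.
    """
    names = set(sheet_names)
    problems = []
    for name in sheet_names:
        # Strip the suffix so property worksheets are checked against the same list.
        base = name[:-len(SUFFIX)] if name.endswith(SUFFIX) else name
        if base not in VALID_SHEETS:
            problems.append(f"unknown worksheet name: {name}")
        elif name == base and base + SUFFIX not in names:
            problems.append(f"missing property worksheet: {base}{SUFFIX}")
    return problems
```

For example, check_sheet_names(["@@synonyms"]) reports the missing @@synonyms-properties worksheet, while a workbook containing both worksheets produces no problems.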

Define Dictionary Language

The examples on this page already have the proper settings for each dictionary type, so the only value you must update in the property worksheet is the dictionary's language code. Use a valid two-letter ISO 639-1 code; linguistic features such as tokenization require it to work properly. The default language code configured in the sample workbooks is en (English).

You can find more information about all the possible configuration options in our documentation. Some of those configurations can severely impact how your dictionaries work, so if you need help and/or have questions, please contact support@cxense.com before modifying those.

Dictionary Types

Query Completion Dictionaries

A daily process will already populate query completion dictionaries for your cluster based on your most popular queries. The two additional dictionaries below allow you to have more control over which terms should be suggested:

  • @@suggest-whitelist - This list can be used to define terms that should always be available for query completion (i.e. potential suggestions), even if they are never queried for.
  • @@suggest-blacklist - This list can be used to prevent certain terms from being suggested for completion, even if they are frequently searched for by users (e.g. offensive terms).

If you do not wish to automatically populate query completion dictionaries based on user queries, contact support@cxense.com to have this feature disabled for your cluster.

Spellchecking Dictionaries

A daily process will already populate spellchecking dictionaries for your cluster based on your most popular queries. The two additional dictionaries below allow you to have more control over which terms should be considered for spelling suggestions:

  • @@spell-whitelist - This list can be used to define terms that should always be available as spelling suggestions, even if they are never queried for.
  • @@suggest-blacklist - The same list defined for blacklisting query suggestions is also used to prevent terms from being suggested as spelling corrections, even if those terms are frequently searched for by users.

If you do not wish to automatically populate spellchecking dictionaries based on user queries, contact support@cxense.com to have this feature disabled for your cluster.

Synonym Dictionaries

@@synonyms

Use this list to define the synonyms that should be used to expand user queries. For every row in the Excel worksheet, the synonym expansion must be configured following these rules:

  • First column is key (source), which can only appear once as a key (i.e. it can be used as a synonym for another term, but can't be duplicated as a key).
  • Following columns are synonyms for the key
  • It's one-way expansion (e.g. apple => orange). If you need two-way expansion, add another row with the key/value inverted (e.g. orange => apple).
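The rule that a key may appear only once can be checked before the worksheet is uploaded. The sketch below is illustrative only: find_duplicate_keys is a hypothetical helper, each row is assumed to be a list of cell strings (key first, synonyms after), and keys are compared exactly after trimming whitespace, which may differ from the normalization configured in your property worksheet.

```python
def find_duplicate_keys(rows):
    """Return keys (first-column values) that appear in more than one row.

    Assumption: comparison is exact after stripping surrounding whitespace.
    """
    seen = set()
    duplicates = []
    for row in rows:
        if not row:
            continue  # skip empty rows
        key = row[0].strip()
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates


rows = [
    ["apple", "orange"],
    ["orange", "apple"],  # allowed: two-way expansion uses a second row
    ["apple", "pear"],    # not allowed: "apple" is already a key
]
duplicate_keys = find_duplicate_keys(rows)
```

Here "orange" is not reported, because appearing as a synonym in one row and as a key in another is permitted; only the repeated key "apple" is.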

Below is an example of a synonyms dictionary and the resulting transformed queries once this synonyms dictionary is in place.

Dictionary

Key | Value
apple | orange
i pod | ipod
apple juice | martinelli's apple juice
 

Result

Query | Result (equivalent to) | Note
query("apple") | query("apple") or query("orange") |
query("i pod") | query("i pod") or query("ipod") |
query("apple discount i pod") | query("apple discount i pod") or query("orange discount i pod") or query("apple discount ipod") or query("orange discount ipod") |
query("apple juice") | query("apple juice") or query("martinelli's apple juice") | longest match (apple => orange doesn't take place)
query("ipod") | query("ipod") | only one-way expansion
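To make the expansion behaviour concrete, here is a minimal sketch that reproduces the results above. It is not the Cxense implementation: expand_query is a hypothetical function, and it assumes whitespace tokenization and a greedy left-to-right longest match, which is consistent with the examples but not confirmed beyond them.

```python
from itertools import product

# The example dictionary from this page: key -> list of synonyms.
SYNONYMS = {
    "apple": ["orange"],
    "i pod": ["ipod"],
    "apple juice": ["martinelli's apple juice"],
}


def expand_query(query, synonyms):
    """Return all query variants under one-way, longest-match expansion."""
    tokens = query.split()
    # Split keys into token tuples so multi-word keys can be matched.
    keys = {tuple(key.split()): values for key, values in synonyms.items()}
    max_len = max((len(k) for k in keys), default=0)

    slots = []  # one list of alternatives per matched span or plain token
    i = 0
    while i < len(tokens):
        # Try the longest possible key first, then shorter ones.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + size])
            if span in keys:
                slots.append([" ".join(span), *keys[span]])
                i += size
                break
        else:
            slots.append([tokens[i]])  # no key starts here; keep the token
            i += 1
    # Every combination of alternatives is one variant of the query.
    return [" ".join(parts) for parts in product(*slots)]
```

Calling expand_query("apple juice", SYNONYMS) yields only the original query and the martinelli's variant, because the longer key wins over apple, and expand_query("ipod", SYNONYMS) returns the query unchanged, because ipod is a value and not a key. The order of variants may differ from the table.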

 
