Search workflow
Immediately after you install or upgrade Contenta and configure AppData, all object data in the database is indexed in Solr. (On a new installation, there are no Contenta objects to index yet.)
Upon subsequent object management activities, object data is dynamically indexed in Solr, keeping Contenta and the Solr collection in sync for reliable searching.
The Crawler connects to the Contenta Portal using a special token. The Contenta Portal queues the changes that users make to objects in Contenta so that those changes can be indexed.
Object management activities include:
- Object creation (for example, during upload or import)
- Property sheet changes (edits made to property sheet metadata)
- Object deletion
- Object duplication (copy/paste, copy/paste reference)
- Object move
- Object copy (reuse/paste, reuse/paste reference)
- Object rename
- Object content modification (Edit check out, check in)
- Project posting (changes in projects released into stable data)
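The queueing of object changes can be pictured with a minimal sketch. The names ChangeEvent and IndexQueue are hypothetical and are not part of Contenta. Coalescing repeated changes to the same object is a choice made for this sketch only; the text above does not say whether the Contenta Portal does this.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    """One object management activity (hypothetical structure)."""
    object_id: str  # hypothetical identifier for the changed object
    activity: str   # for example "create", "rename", "move", "delete"


class IndexQueue:
    """Holds pending changes until an indexer collects them."""

    def __init__(self):
        self._pending = OrderedDict()

    def enqueue(self, event):
        # Assumption of this sketch: a newer change to the same object
        # supersedes the older one and moves to the back of the queue.
        self._pending.pop(event.object_id, None)
        self._pending[event.object_id] = event

    def drain(self):
        # Hand over everything queued so far, oldest first, and reset.
        events = list(self._pending.values())
        self._pending.clear()
        return events
```

In this sketch, an indexer would call drain() periodically and process each returned event in order.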
AppData is re-read every N minutes (set in the registry) to refresh the AppData settings held in Crawler memory, so that the Crawler responds to any changes. When the Crawler starts indexing one of the specified configurations, it sets the AppData configuration_name/QUEUED_FOR_INDEXING setting to TRUE.
If configuration_name/STATUS_REPORTING is set to TRUE, the Crawler writes status to configuration_name/INDEXING_STATUS. The status consists of a count of the objects indexed and a count of the objects that failed to index (see IDPATH in Figure 1: Global AppData Settings in Adding a New AppData Key and REREAD_APPDATA_IN_MINUTES in Registry Keys used to Configure the Crawler for all Databases).
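The periodic AppData re-read can be sketched as a small time-based cache. This is an illustration only, not the Crawler's implementation: the class AppDataCache and the load_settings callable are hypothetical, and reread_minutes stands in for the registry value.

```python
import time


class AppDataCache:
    """Keeps settings in memory and reloads them once the interval has passed."""

    def __init__(self, load_settings, reread_minutes, clock=time.monotonic):
        self._load = load_settings             # hypothetical: returns a settings dict
        self._interval = reread_minutes * 60   # interval in seconds
        self._clock = clock
        self._settings = load_settings()
        self._loaded_at = clock()

    def get(self):
        # Reload only when at least reread_minutes have elapsed since the last load.
        if self._clock() - self._loaded_at >= self._interval:
            self._settings = self._load()
            self._loaded_at = self._clock()
        return self._settings
```

Passing the clock as a parameter keeps the sketch easy to exercise without waiting for real time to pass.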
When the Crawler completes indexing, the INDEXING_STATUS value will be similar to the following: completed - 81 objects indexed 0 objects failed. Thereafter, as the Crawler exports modified object data and indexes it in Solr, status is recorded in the crawler.log file but is not written to AppData.
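A script that monitors indexing could read the counts out of the status value. The following is a minimal sketch that assumes the value matches the sample shown above exactly; the actual wording may differ, in which case the function returns None. The function name parse_indexing_status is hypothetical.

```python
import re

# Pattern modeled on the sample value only; the real format may vary.
STATUS_PATTERN = re.compile(
    r"^(?P<state>\w+)\s*-\s*"
    r"(?P<indexed>\d+) objects indexed\s+"
    r"(?P<failed>\d+) objects failed$"
)


def parse_indexing_status(value):
    """Return the state and counts as a dict, or None if the text does not match."""
    match = STATUS_PATTERN.match(value.strip())
    if match is None:
        return None
    return {
        "state": match["state"],
        "indexed": int(match["indexed"]),
        "failed": int(match["failed"]),
    }
```

Returning None for unrecognized text, rather than raising an error, lets a caller fall back to the crawler.log file.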
The Crawler exports the Contenta object data, and Tika is run on the data to extract text content for indexing in Solr. For example, a PDF file contains markup, so Tika extracts the text and leaves the markup behind. Only certain file formats contain text content that can be harvested for indexing (see Supported File Formats).
Each language is indexed in a Solr index field of a type appropriate for indexing and querying that language. If the Crawler cannot find a language set in the object’s property data field named “Language ISO Code” or in the AppData value of LANGUAGE_CODE found under {Global} Collections/collection_name, the language is determined by using Tika to evaluate the extracted text content (see Supported Languages).
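The language lookup described above can be sketched as a fallback chain. The function resolve_language and its parameters are hypothetical. The sketch also assumes that the object's property is checked before the AppData value; the text does not state which of the two takes precedence.

```python
def resolve_language(properties, collection_appdata, detect_from_text):
    """Pick a language code for an object.

    properties         -- the object's property data (dict)
    collection_appdata -- AppData values for the collection (dict)
    detect_from_text   -- callable that guesses the language from the
                          extracted text (stands in for Tika detection)
    """
    # 1. Language set explicitly on the object (assumed to win).
    language = properties.get("Language ISO Code")
    if language:
        return language

    # 2. Default language configured for the collection.
    language = collection_appdata.get("LANGUAGE_CODE")
    if language:
        return language

    # 3. Last resort: detect from content, called only when needed.
    return detect_from_text()
```

Passing detection in as a callable means that the comparatively expensive content analysis runs only when neither setting supplies a language.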
The object’s property data is also exported and indexed in Solr, as is.
The Crawler indexes the AppData-specified configurations and all of their descendant objects (see INDEXING_STATUS and QUEUED_FOR_INDEXING in Figure 1: Global AppData Settings in Adding a New AppData Key).
In summary, the Solr Indexer catalogs each object’s Contenta pathname, Contenta ID path, Contenta object type, content, and property sheet metadata.
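As a rough picture of what is cataloged for each object, the sketch below assembles one record from the five pieces of information listed above. All field names, including the prop_ prefix, are hypothetical; the actual Solr schema used by Contenta is not described here.

```python
import json


def build_index_document(pathname, id_path, object_type, content, properties):
    """Assemble one record holding the five kinds of data that are indexed."""
    document = {
        "pathname": pathname,        # Contenta pathname
        "id_path": id_path,          # Contenta ID path
        "object_type": object_type,  # Contenta object type
        "content": content,          # text extracted from the object
    }
    # Property values are copied unchanged; only the field name is
    # normalized (a naming choice made for this sketch).
    for name, value in properties.items():
        field = "prop_" + name.lower().replace(" ", "_")
        document[field] = value
    return document


def to_json(document):
    """Serialize the record, for example to inspect it or log it."""
    return json.dumps(document, ensure_ascii=False, sort_keys=True)
```

Keeping the property values untouched mirrors the statement that property data is indexed as is.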