Supported file formats
Textual content is extracted from non-structured files using Apache Tika, except for text in CGM graphics, which is extracted using a proprietary Contenta process.
Extraction takes place during the process that prepares the data for indexing. When that process finishes, the prepared data is posted to Solr for indexing. Only file formats that support textual content are indexed; these formats are listed in the left column of the following table.
Table 4: Supported File Formats
| Textual content will be indexed | Binary content will not be indexed |
|---|---|
| doc, docx | eps |
| ppt, pptx | sde |
| iso | |
| ram | esu |
| js | mpeg |
| 3ko | wav |
| css | bmp |
| sgml | tiff |
| txt | rh |
| xml | mp4 |
| xsl | xsl-fo |
| mp3 | avi |
| html | gif |
| svg | jpg |
| cgm (UTF-8 encoded content only) | mov |
| mp4 | |
| mpeg | |
| png | |
| rm | |
| swf | |
| wma | |
| zip | |
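As an illustration of how the table might be applied before files are prepared for indexing, the following minimal Python sketch checks a file name against the left column of Table 4. The names `TEXT_INDEXED_EXTENSIONS` and `is_text_indexed` are hypothetical and are not part of Contenta, Tika, or Solr. The sketch assumes that the file extension alone identifies the format and that any extension missing from the left column is not indexed.

```python
from pathlib import Path

# Left column of Table 4, transcribed as listed: formats whose textual
# content is indexed. (Hypothetical constant, for illustration only.)
TEXT_INDEXED_EXTENSIONS = frozenset({
    "doc", "docx", "ppt", "pptx", "iso", "ram", "js", "3ko", "css",
    "sgml", "txt", "xml", "xsl", "mp3", "html", "svg", "cgm", "mp4",
    "mpeg", "png", "rm", "swf", "wma", "zip",
})


def is_text_indexed(filename: str) -> bool:
    """Return True if the file's extension is in the left column of Table 4.

    Assumption: the extension identifies the format. This check cannot
    verify that a cgm file holds UTF-8 encoded content, which the table
    requires for cgm text to be indexed.
    """
    # Path.suffix yields the last extension including the dot, e.g. ".DOCX".
    extension = Path(filename).suffix.lower().lstrip(".")
    return extension in TEXT_INDEXED_EXTENSIONS


if __name__ == "__main__":
    for name in ("manual.DOCX", "logo.bmp", "README"):
        print(name, is_text_indexed(name))
```

The comparison is case-insensitive, so `manual.DOCX` and `manual.docx` are treated alike, and a file with no extension is reported as not indexed.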