Supported file formats

Textual content is extracted from non-structured files using Apache Tika, except for text in CGM graphics, which is extracted using a proprietary Contenta process.

Extraction takes place while the data is being prepared for indexing. When preparation finishes, the prepared data is posted to Solr for indexing. Only file formats that support textual content are indexed; they are listed in the left column of the following table.

Table 4: Supported File Formats

Textual content will be indexed     Binary content will not be indexed
-------------------------------     ----------------------------------
doc, docx                           eps
ppt, pptx                           sde
pdf                                 iso
ram                                 esu
js                                  mpeg
3ko                                 wav
css                                 bmp
sgml                                tiff
txt                                 rh
xml                                 mp4
xsl                                 xsl-fo
mp3                                 avi
html                                gif
svg                                 jpg
cgm (UTF-8 encoded content only)    mov
                                    mp4
                                    mpeg
                                    png
                                    rm
                                    swf
                                    wma
                                    zip
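
The rule in Table 4 can be sketched as a simple extension check. This is a minimal illustration and not part of Contenta itself. The function name `is_text_indexed` and the constant `TEXTUAL_EXTENSIONS` are hypothetical, and the extension list assumes the left column of Table 4 reads as doc, docx, ppt, pptx, pdf, ram, js, 3ko, css, sgml, txt, xml, xsl, mp3, html, svg, and cgm.

```python
from pathlib import Path

# Assumed reading of the left column of Table 4 (textual content is indexed).
# cgm carries a caveat from the table: only UTF-8 encoded content is extracted.
TEXTUAL_EXTENSIONS = frozenset({
    "doc", "docx", "ppt", "pptx", "pdf", "ram", "js", "3ko", "css",
    "sgml", "txt", "xml", "xsl", "mp3", "html", "svg", "cgm",
})


def is_text_indexed(filename: str) -> bool:
    """Return True if the file extension is in the textual column (hypothetical helper)."""
    # Take the final extension, ignore case, and drop the leading dot.
    suffix = Path(filename).suffix.lower().lstrip(".")
    return suffix in TEXTUAL_EXTENSIONS
```

For example, `is_text_indexed("manual.PDF")` returns `True`, while `is_text_indexed("archive.zip")` and a name without an extension both return `False`. The check looks only at the file name; it does not inspect the file to confirm that a cgm graphic is actually UTF-8 encoded.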