Supported File Formats

Textual content will be extracted from unstructured files using Apache Tika.

Extraction will take place during the process that prepares the data for indexing. When that process finishes, the prepared data will be posted to Solr for indexing. Only file formats that contain textual content will be indexed; these are listed in the left column of the following table.
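The prepare-then-post flow described above could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: it assumes Tika is reached through a locally running Tika Server, and the URLs, the Solr core name "documents", and the field names "id" and "content" are all hypothetical.

```python
import json
import urllib.request

# Assumed endpoints: a local Tika Server and a Solr core named "documents".
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/documents/update?commit=true"


def build_extraction_request(file_bytes: bytes) -> urllib.request.Request:
    """Build a PUT request that sends the raw file to Tika Server for plain text."""
    return urllib.request.Request(
        TIKA_URL,
        data=file_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def build_index_request(documents: list) -> urllib.request.Request:
    """Build a POST request that sends all prepared documents to Solr as JSON."""
    payload = json.dumps(documents).encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def prepare_and_index(files: dict) -> None:
    """Extract text from every file first, then post the whole batch to Solr."""
    prepared = []
    for doc_id, file_bytes in files.items():
        with urllib.request.urlopen(build_extraction_request(file_bytes)) as response:
            text = response.read().decode("utf-8")
        # "id" and "content" are hypothetical field names.
        prepared.append({"id": doc_id, "content": text})
    # Posting happens only after the preparation loop has finished.
    with urllib.request.urlopen(build_index_request(prepared)) as response:
        response.read()
```

Keeping request construction separate from sending makes the two stages easy to check without a running Tika or Solr instance.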

Table 4: Supported File Formats

Textual content will be indexed | Binary content will not be indexed
doc, docx                       | eps
ppt, pptx                       | sde
pdf                             | iso
ram                             | esu
js                              | mpeg
3ko                             | wav
css                             | bmp
sgml                            | tiff
txt                             | rh
xml                             | mp4
xsl                             | xsl-fo
mp3                             | avi
html                            | cgm
cgm                             | gif
svg                             | jpg
                                | mov
                                | mp4
                                | mpeg
                                | png
                                | rm
                                | swf
                                | wma
                                | zip
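
As an illustration of how the table might be applied when selecting files for extraction, the sketch below checks a file's extension against the left column. The function names are hypothetical. Note that "cgm" appears in both columns of the table; following the left column here is an assumption.

```python
from pathlib import Path

# Extensions from the left column of Table 4.
# "cgm" is listed in both columns; treating it as textual is an assumption.
TEXTUAL_EXTENSIONS = frozenset({
    "doc", "docx", "ppt", "pptx", "pdf", "ram", "js", "3ko", "css",
    "sgml", "txt", "xml", "xsl", "mp3", "html", "cgm", "svg",
})


def has_indexable_text(filename: str) -> bool:
    """Return True if the file extension is listed in the textual column."""
    extension = Path(filename).suffix.lower().lstrip(".")
    return extension in TEXTUAL_EXTENSIONS


def select_for_extraction(filenames: list) -> list:
    """Keep only the files whose textual content would be indexed."""
    return [name for name in filenames if has_indexable_text(name)]
```

The comparison is case-insensitive, so "Report.DOCX" is treated the same as "report.docx"; files without an extension are skipped.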