Supported File Formats
Textual content will be extracted from unstructured files using Apache Tika.
Extraction will run as part of the process that prepares the data for indexing. When that process finishes, the prepared data will be posted to Solr. Only file formats that carry textual content will be indexed; these are listed in the left column of the following table. Formats in the right column are treated as binary and will not be indexed.
Table 4: Supported File Formats
| Textual content will be indexed | Binary content will not be indexed |
|---|---|
| doc, docx | eps |
| ppt, pptx | sde |
| iso | |
| ram | esu |
| js | mpeg |
| 3ko | wav |
| css | bmp |
| sgml | tiff |
| txt | rh |
| xml | mp4 |
| xsl | xsl-fo |
| mp3 | avi |
| html | cgm |
| cgm | gif |
| svg | jpg |
| mov | |
| mp4 | |
| mpeg | |
| png | |
| rm | |
| swf | |
| wma | |
| zip | |
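
The flow described above (filter by format, extract text, then hand the prepared data to Solr) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the extension set is only a subset of the left column of Table 4, the `extract_text` callable stands in for whatever wraps Apache Tika, and the field names `id` and `content` are hypothetical. The sketch only builds the JSON payload; sending it to Solr's update endpoint is left out.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Illustrative subset of the "textual" column of Table 4 (assumption: not the full list).
TEXTUAL_EXTENSIONS = frozenset(
    {"doc", "docx", "ppt", "pptx", "txt", "xml", "xsl", "html", "sgml", "css", "js", "svg"}
)


def is_indexable(path: str) -> bool:
    """Return True if the file extension is in the textual set (case-insensitive)."""
    ext = Path(path).suffix.lower().lstrip(".")
    return ext in TEXTUAL_EXTENSIONS


def prepare_documents(
    paths: Iterable[str],
    extract_text: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Build one document per indexable file.

    `extract_text` is a hypothetical stand-in for the Tika-based extractor;
    files with non-textual extensions are skipped without being read.
    """
    docs = []
    for path in paths:
        if not is_indexable(path):
            continue
        # "id" and "content" are assumed field names, not taken from a real schema.
        docs.append({"id": path, "content": extract_text(path)})
    return docs


def to_solr_update_payload(docs: List[Dict[str, str]]) -> str:
    """Serialize the prepared documents as a JSON array of documents."""
    return json.dumps(docs)
```

For example, calling `prepare_documents(["a.txt", "b.gif", "c.html"], extractor)` would invoke the extractor for `a.txt` and `c.html` only, and the resulting list could then be serialized with `to_solr_update_payload` before being posted. Passing the extractor in as a parameter keeps the filtering logic testable without a running Tika or Solr instance.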