Database Size Estimation
SDL designed WorldServer to handle extremely large amounts of data. However, it is essential to estimate the amount of content a WorldServer installation will be required to handle and to provide adequate hardware including storage capacity. Estimation of storage requirements depends on the type of data to be used, the number of target languages supported, typical user activity patterns, total system load in terms of how many projects are turned around within a particular time frame, and many other factors.
Areas in WorldServer that can require a large amount of storage space include:
- Translation memory (TM)
- Terminology database (TD)
- Segmented documents
Translation Memory
Translation memory size can be measured in either words or entries, where an entry usually contains a single sentence. The average English sentence contains 15 to 20 words. An analysis by SDL of customer translation memories confirmed this range with the average word count being 17 words per sentence. If a translation memory size is measured in the number of entries, the number of words is calculated using the following formula:
TMwords = 17 * TMentries
To estimate the disk space required for a translation memory, multiply the total translation memory size, in thousands of entries, by a space factor. The space factor depends on the database and must be determined experimentally.
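The multiplication described above can be sketched as follows. This is a minimal illustration, not an SDL-provided tool: the function names are hypothetical, and the sample figures (a 50,000-entry test import that used 400 MB) are made up to show the arithmetic. One possible way to obtain the space factor, assumed here, is to import a sample translation memory into your own database and measure the space it uses.

```python
def measured_space_factor(sample_entries: int, sample_space: float) -> float:
    """Space used per thousand TM entries, derived from a test import.

    sample_space can be in any unit (for example MB); later estimates
    come out in the same unit.
    """
    if sample_entries <= 0:
        raise ValueError("sample_entries must be positive")
    if sample_space < 0:
        raise ValueError("sample_space must not be negative")
    return sample_space / (sample_entries / 1000)


def tm_disk_space(tm_entries: int, space_factor: float) -> float:
    """Estimated TM disk space: size in thousands of entries times the space factor."""
    if tm_entries < 0 or space_factor < 0:
        raise ValueError("inputs must not be negative")
    return (tm_entries / 1000) * space_factor


if __name__ == "__main__":
    # Hypothetical measurement: 50,000 entries occupied 400 MB in a test database.
    factor = measured_space_factor(50_000, 400.0)
    # Projected space for a 2,000,000-entry translation memory, in MB.
    print(factor, tm_disk_space(2_000_000, factor))
```

Because the space factor is database dependent, repeat the measurement whenever the database product or its configuration changes.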
If you do not have any existing translation memories, estimate the future translation memory size from the number of words that you expect to translate. Dividing the expected word count by 17, the inverse of the TMwords formula above, gives the approximate number of entries.
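The conversion between words and entries can be sketched in both directions. The function names are hypothetical, and the constant 17 is the average words per entry quoted in this section; substitute your own average if your content differs.

```python
WORDS_PER_ENTRY = 17  # average words per sentence from the analysis cited in this section


def tm_words(tm_entries: int) -> int:
    """TMwords = 17 * TMentries."""
    if tm_entries < 0:
        raise ValueError("tm_entries must not be negative")
    return WORDS_PER_ENTRY * tm_entries


def tm_entries_from_words(expected_words: int) -> int:
    """Inverse of the formula: approximate entries for an expected word count.

    Rounds up so that a partial sentence still counts as one entry.
    """
    if expected_words < 0:
        raise ValueError("expected_words must not be negative")
    return (expected_words + WORDS_PER_ENTRY - 1) // WORDS_PER_ENTRY


if __name__ == "__main__":
    print(tm_words(100_000))                 # 1700000
    print(tm_entries_from_words(1_700_000))  # 100000
```

The sizing inputs in this document are expressed in thousands of entries, so divide the resulting entry count by 1,000 before using it in an estimate.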
Customers using the Studio file types introduced in WorldServer, and processed by the FTS server, should expect larger translation memories than those created with legacy WorldServer filters. This increase is accounted for by the CONTENT adjustment described under Estimate Guidance.
Terminology Database
Terminology database (TD or termbase) size is measured in number of entries. Entries usually contain one source term and one or more target terms. In most cases there is a one-to-one correspondence, but a terminology database entry might contain several target terms in each of several languages.
Segmented Documents
WorldServer segments all content into textual and markup elements. Usually, segmentation is done at the sentence level, so each sentence becomes a separate segment. WorldServer stores this data in the application database and uses it as the internal document representation. Each content asset (file) becomes a separate segmented document structure. The size of this structure directly corresponds to the size of the document and the amount of textual content it contains. The estimating factor in the final formula is based on markup language formats (HTML, SGML, XML, and so on). For binary data formats (DOC, PPT) the space factor is much smaller because most of the binary data deals with formatting and is not loaded into WorldServer.
Overhead
In addition to the three biggest storage contributors described above, there is also the overhead of keeping all other WorldServer data, including users, projects, workflows, and so on. This overhead is difficult to estimate because it is very dependent on the particular usage pattern. However, SDL finds that 1 GB of space is more than enough to cover the overhead for most WorldServer installations.
Estimate Guidance
Customers have used the following formula for estimating SQL Server database sizes for WorldServer 9.x and earlier versions. Take your own environment and data into account when assessing your database needs.
Where:
- TOTAL SPACE: Total space required for the database, in gigabytes (GB)
- LANGS: Number of languages supported by the installation
- TM: Expected translation memory size, in thousands of entries
- TD: Expected terminology database size, in thousands of entries
- CONTENT: Total translatable content size, in megabytes (MB)
Customers using the Studio file types introduced in WorldServer and processed by the FTS server should expect to need more database space. Studio file types may require up to three times the CONTENT space required for assets processed by legacy WorldServer filters.
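As an illustration of the Studio file type adjustment, the following sketch computes a worst-case CONTENT value for a mix of legacy and Studio-processed assets. The function name and the split into two inputs are assumptions made for this example. The default multiplier of 3 is the upper bound stated above, so the result is a ceiling rather than an expected value.

```python
def content_upper_bound_mb(
    legacy_mb: float,
    studio_mb: float,
    studio_multiplier: float = 3.0,
) -> float:
    """Worst-case CONTENT value, in MB, for a mixed set of assets.

    legacy_mb: translatable content processed by legacy WorldServer filters.
    studio_mb: translatable content processed as Studio file types.
    studio_multiplier: between 1 and 3; 3 is the stated upper bound.
    """
    if legacy_mb < 0 or studio_mb < 0:
        raise ValueError("content sizes must not be negative")
    if not 1.0 <= studio_multiplier <= 3.0:
        raise ValueError("studio_multiplier must be between 1 and 3")
    return legacy_mb + studio_multiplier * studio_mb


if __name__ == "__main__":
    # Hypothetical mix: 100 MB of legacy content and 50 MB of Studio content.
    print(content_upper_bound_mb(100, 50))  # 250.0
```

Use the adjusted figure in place of CONTENT when sizing an installation that relies on Studio file types.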