Given the preservation and archiving considerations identified above, any project using electronic text must consider the purposes and limitations of the available digital formats, to ensure both that they meet current aims and needs and that they do not restrict future opportunities (NINCH 2002).
The National Archives advises that the choice of file format for the preservation and management of electronic records 'should always be determined by the functional requirements of the record-creating process', as well as longer-term requirements for preservation and access (Brown 2003a). It is also stressed that the costs of ensuring sustainability of a resource are kept to a minimum when these issues are considered at the very beginning of a project (Brown 2003a). Guidance specific to archaeological digital archives is provided by Richards and Robinson (2000).
'File formats encode information into a form which can only be processed and rendered comprehensible by very specific combinations of hardware and software'. Similarly, the physical storage medium chosen is also dependent upon very specific combinations of hardware and software for access (Brown 2003a; 2003b). The accessibility of electronic information, therefore, is vulnerable as technology evolves, even over relatively short timescales. As Brown (2003b) notes, irrespective of physical longevity, technical obsolescence is inevitable, and for the foreseeable future, electronic records will need to be periodically refreshed onto new media. The National Archives suggests that the following criteria be considered when selecting file formats for data creation: open standards, ubiquity, stability, metadata support, feature set, interoperability and viability (Brown 2003a). The choice of format also depends upon what the creator intends the end user to be able to do with the data.
Some formats are best suited to the creation of printed output, and may also allow for Web publication. However, if data need to be moved from these to another software platform, formatting and other information may be lost (NINCH 2002). Some formats are dependent upon proprietary software: they are produced and sold by a specific company for profit, and neither their licensing terms nor their continued existence can be assured over the long term. Examples include the formats of the majority of word-processing packages and Adobe PDF. Other formats, such as the plain-text formats ASCII and HTML, are open standards, independent of the particular software used. Proprietary formats are unsuited for archival purposes or for the creation of durable heritage resources; open-standard formats are generally preferred for digital preservation (NINCH 2002; Richards and Robinson 2000, 32).
A summary of the main digital formats in which textual material and reports may be created, or into which they may be converted, is presented below.
Word processing has played a major part in the increasing popularity of personal computers in archaeology and it is for report writing that computers are perhaps most widely used. Westcott (2003) discusses some of the difficulties for digital archiving presented by the creation of electronic texts using word processors and how these might be avoided. Word-processed documents are saved in formats dependent upon the proprietary software used in their creation, and this may present problems for users of different software packages. Word-processed documents are primarily designed to be read in printed form. They can be edited without having to retype a document in full, giving an author greater control over content and appearance. A variety of fonts, sizes and colours can be used, text may be formatted and justified and tables, graphs and images can easily be included.
The Strategies for Digital Data survey found that digital datasets are created using a wide range of software; although Microsoft products dominated, the survey identified 162 other programmes in use (Condron et al. 1999). Whilst it is presently feasible to share most word-processed files between different software users, this is not always possible, nor can it be guaranteed into the future. Westcott (2003) notes that files created by the latest releases of word-processing packages are invariably inaccessible to users of older software. Those archiving word-processed documents are advised either to convert files into a neutral format, or to embed the formatting in the document as easily recognisable tags, with the text held as plain text which can be extracted if necessary (Westcott 2002; 2003). Whilst Microsoft Word files are not favoured as a long-term preservation format, they are an accepted format for dissemination and for delivery over the Web (Richards and Robinson 2000, 33).
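The second option described above, formatting held as recognisable tags around otherwise plain text, can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the tags are angle-bracketed (as in HTML or XML); the sample string and the function name `extract_plain_text` are hypothetical and are not taken from Westcott.

```python
import re

# Hypothetical sample: formatting embedded as angle-bracket tags (an assumption).
TAGGED_TEXT = "<h1>Trench 3</h1><p>Two <b>ditches</b> were recorded.</p>"


def extract_plain_text(tagged: str) -> str:
    """Strip angle-bracket tags and collapse whitespace, leaving plain text."""
    # Replace each tag with a space so that adjacent words do not run together.
    without_tags = re.sub(r"<[^>]+>", " ", tagged)
    # Collapse the runs of whitespace left behind by the removed tags.
    return " ".join(without_tags.split())


print(extract_plain_text(TAGGED_TEXT))  # Trench 3 Two ditches were recorded.
```

A regular expression of this kind is adequate only for simple, well-formed tagging; a real archiving workflow would more likely rely on a proper parser for the tagging scheme in use.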
Almost any type of document that can be presented on paper can be converted to Portable Document Format, including files from word-processing programmes, as discussed above. Although the PDF format is proprietary to Adobe, the specification is publicly available. Upgraded versions of the Adobe PDF software are released on a regular basis. Alternatives for the creation of PDF documents are increasingly appearing on the market, such as Macromedia's FlashPaper 2. Whilst there is a cost for the software to create PDF files, the ability to retrieve, view and print a PDF file is open to all, as users can freely download the Adobe Reader from the Adobe website if this is not already available through the Web browser. At the time of writing, Adobe Reader 7.0 is the current version.
PDF files have a wide range of advantages. They retain the look and feel of a word-processed document as all of the formatting contained in the original source document is preserved. Creators can embed hypertext links and images; all graphics, special characters, and colours display as they were intended in the original, regardless of the user's software and operating systems. An Adobe PDF file, when printed, looks the same as it displays, and almost any printing device can be used. It can be published and distributed anywhere, including in print, as an email attachment, on websites, or on floppy discs or CD-ROM. PDF files are a popular way to make content available from a website. However, as PDF was designed to specify printable pages in the format originally envisaged by the creator of the document, the content is optimised for A4 standard-sized sheets of paper, not for display in a browser window. Users can, however, zoom in and out, and view more than one page at a time. Because Adobe PDF files can be very big, there is often a significant download time, and difficulties with navigation (Nielsen 2001).
Opinion is divided – whilst many see the advantages of PDF, others such as Taylor (2003) feel that users should not be obliged to view these files on their desktop, and urge webpage creators not to replace HTML content with PDF files. PDF is often directly compared with HTML and its advantages and disadvantages assessed accordingly. PDF files tend to be larger than the same data presented in HTML, and they are not designed to be modified. If changes are required, the original word-processed documents must be amended and a new Adobe PDF file created. The advantage of HTML format is that the master HTML documents can be modified directly. Whilst an Adobe PDF file looks identical on all machines, the Web browser adjusts the HTML to the resolutions and fonts that it has available. The person viewing an HTML document can override the stylistic choices where a stylesheet has been used; with PDF the user will always view the document exactly as it was designed (Green 2004). The main disadvantage of the Adobe PDF format is that it is a proprietary file format, the specification for which is defined by Adobe. Whilst Adobe PDF files are not, therefore, favoured as a long-term preservation format, they are one of the most widely used formats for dissemination and for delivery over the Internet (Richards and Robinson 2000, 33).
The ASCII format is the most basic, plain text format, in which the text of a document is captured as character representations. A document may be searched for particular words or phrases, but does not convey any information about the original appearance or structure of the document (NINCH 2002). ASCII code has several advantages as an archive format – it can be interchanged between different computer systems, and the small file size of text files means that they can be distributed quickly and easily. It also has its disadvantages – the number of codes is limited and thus only standard alphabetical characters, numbers and symbols are supported. If a word-processed document is converted to ASCII, formatting to convey document layout and structure such as bold or underlining will be lost and, potentially, the meaning will be lost with it. ASCII text is one of the long-term preservation formats recommended by the ADS, and is also advocated as a dissemination format (Richards and Robinson 2000, 33).
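The limited character repertoire noted above can be checked before a document is converted. The following Python sketch is illustrative only: the function names and the sample sentence are hypothetical, and the approach (flagging any character above code point 127, then substituting `?`) is one simple way to see what a conversion to ASCII would lose.

```python
def non_ascii_characters(text: str) -> list[str]:
    """Return the distinct characters in text that fall outside 7-bit ASCII."""
    return sorted({ch for ch in text if ord(ch) > 127})


def to_ascii(text: str) -> str:
    """Convert to ASCII, replacing each unsupported character with '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


# Hypothetical sample; escapes are used so the accented letters are unambiguous:
# \u00c7 is C with cedilla, \u00f6 is o with diaeresis, \u00fc is u with diaeresis.
sample = "Excavation at \u00c7atalh\u00f6y\u00fck"

print(non_ascii_characters(sample))  # the three accented letters
print(to_ascii(sample))              # Excavation at ?atalh?y?k
```

Running a check of this kind first makes it possible to decide whether plain ASCII is acceptable for a given report, or whether the lost characters carry meaning that needs to be recorded elsewhere.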
Rich Text Format is a form of tagged text, and its specification provides a format for text and graphics interchange that can be used with different output devices, operating environments, and operating systems. Rich Text Format uses the American National Standards Institute (ANSI), PC-8, Macintosh, or IBM PC character set to control the representation and formatting of a document, both on the screen and in print. With the RTF Specification, documents created under different operating systems and with different software applications can be transferred between those operating systems and applications. Like ASCII files, RTF is not a format commonly used for Web dissemination. Westcott (2003) advises that, for archival purposes, it is extremely difficult to extract the text from an RTF file, as software that can read RTF format is required in order to access content. Unlike conversion to plain text, some aspects of formatted layout can be preserved by saving a word-processed file in RTF format.
An HTML document is a plain-text file that can be freely created, read and amended using any text editor and may be viewed on any computer that has a Web browser. The fundamental components of an HTML document are its elements and attributes, expressed through tags, which are read by Web browsers and define the formatting of the webpage, such as headings, paragraphs and lists. However, not all tags are supported by all Web browsers and different browsers may render HTML elements differently. Certain standard HTML tags are required in every document, which must comprise head and body text. Other than these, however, HTML tags describe only how something should look when rendered; they carry no information about what the data are. There are a limited number of tags, and so formatting can only be used within the accepted parameters. As a markup tool for anything other than presentation, HTML is relatively weak. An HTML document is less concerned with structure and content than it is with appearance.
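As a rough illustration of the points above, the Python sketch below holds a minimal HTML document as plain text and lists the tags it contains, using the `HTMLParser` class from Python's standard library. The page content and the class name `TagCollector` are hypothetical examples.

```python
from html.parser import HTMLParser

# Hypothetical minimal page: plain text with a head, a body and a few tags.
MINIMAL_PAGE = """<html>
<head><title>Site report</title></head>
<body>
<h1>Summary</h1>
<p>Two ditches were <b>recorded</b>.</p>
</body>
</html>"""


class TagCollector(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(MINIMAL_PAGE)
print(collector.tags)  # ['html', 'head', 'title', 'body', 'h1', 'p', 'b']
```

The tags recovered here say that one phrase is a heading and another is bold, but nothing about what the text describes, which is the limitation identified above.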
A strong advantage of HTML is its ability to link text and images to another document, or section of a document through the use of hypertext links. HTML files can be as simple or dynamic as the creator wishes, including a range of multimedia applications. They can contain hypertext links, graphics, forms, JavaScript, embedded video and more. HTML documents maintain a relatively small file size, which means that download time may be far less than for an equivalent PDF file. HTML files can be distributed via the Internet, via e-mail or on CD-ROM or floppy disk, depending upon the file size. Unlike PDF files, however, an HTML page may not print out precisely on an A4 page, or with the same layout as on the screen. HTML is a recommended long-term preservation format, and is also advocated as a dissemination format (Richards and Robinson 2000, 33).
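Because hypertext links are held as ordinary tagged text, they can be read back out of a document with very little code. The sketch below, again using Python's standard `HTMLParser`, collects the targets of the links in a fragment; the fragment, the `example.org` address and the class name `LinkCollector` are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical fragment with one internal and one external link.
PAGE = (
    '<p>See the <a href="report.html#finds">finds section</a> and the '
    '<a href="http://example.org/plan.gif">site plan</a>.</p>'
)


class LinkCollector(HTMLParser):
    """Collect the href target of every anchor (<a>) element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs for the tag.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


link_collector = LinkCollector()
link_collector.feed(PAGE)
print(link_collector.links)  # ['report.html#finds', 'http://example.org/plan.gif']
```

A list of this kind could, for instance, be used to check that linked figures and plates are deposited alongside the text.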
Examples of a piece of text formatted in various ways are discussed in detail by Westcott (2003).
The discussion of file formats above assumes that documents are already electronically available. However, as Richards and Robinson (2000) identify, distinctions can be made between data that are created digitally, captured from a non-digital original, or acquired in digital form from another source. Digitisation is 'any process by which information is captured in digital form, whether as an image, as textual data, as a sound file, or any other format' (NINCH 2002). Digitisation of documents 'may refer either to the capture of page images, merely a picture of the document, or to the capture of a full-text version, in which the document is stored as textual characters' (NINCH 2002). If, for example, a grey literature report exists only in paper format, because the word-processed original no longer survives or cannot be found, there are a number of options available for a digital version to be created. This may also apply to reports where the text may be available electronically, but the illustrative material, such as figures and plates, is not.
For any project, the purpose of digitisation must be clear at the outset in order to justify the resources required. One of the principles of archiving digital data from fieldwork is that data already held safely in paper archives do not need to be digitised unless it is to provide a digital security copy or Internet access (Richards and Robinson 2000). Whilst some form of Internet access is desired for all reports literature, it is questionable whether project results always justify full digitisation; summary or abstract information may sometimes be enough for the needs of most users.
Scanning may be used to create images of pages of text. Alternatively, optical character recognition (OCR) software can read such an image and create a full-text version of the document by identifying individual character shapes and translating them into actual letters. Text may also be keyed directly from a source, whether an original or photocopy (NINCH 2002). These methods of digitisation are discussed in detail in Morrison et al. (2002, 10-20).
© Internet Archaeology
URL: http://intarch.ac.uk/journal/issue17/5/gf2-7to2-7-2.html
Last updated: Wed Apr 6 2005