IM@T Online July 2003

Archiving: the full-text solution

An innovative approach to electronic archiving

Alan Turner, Managing Director, Somerset Computing Ltd and Pippa Steele, Independent researcher

Copyright 2003 Somerset Computing Ltd

Somerset Computing has completed an archiving solution, with full-text search capability, covering 40 years of the Proceedings of the British Academy. This archive will act as a pilot scheme and may eventually be expanded to include all of the Proceedings, which were first published in 1905. The British Academy is the national academy for the humanities and social sciences, and last year celebrated its Centenary.

The Proceedings of the British Academy is the flagship of the Academy’s publishing programme, printing conference proceedings, the texts of scholarly lectures, and extended obituaries of the Fellows of the British Academy. Because the Proceedings covers such a wide range of subjects, users have needed help in unlocking its rich store of much-cited articles. The British Academy became interested in producing an electronic archive of the annual volumes as a way of enabling users to find what they want and then giving them easy access to it. What was needed was a simple and affordable way of capturing the articles in electronic form, making them word-searchable, and then delivering them electronically to users in a widely accepted format.

Somerset Computing offers a distinctive electronic archiving solution: the whole lecture or article can be searched, rather than just the headers and abstracts, without the expensive process of re-coding into XML or SGML and re-typesetting the entire document. The service is a very cost-effective way of publishing on-line those articles that are not already structured in an electronic format. Somerset Computing is also able to take data in a variety of forms – hard copies and electronic files – and convert them so that they can be stored together in a single database.

New volumes of the Proceedings, now produced in a structured way as a spin-off of the conventional production process, can then easily be added to the archive to keep it up to date.

The British Academy provided Somerset Computing with the volumes of the Proceedings covering the years 1964 to 1993. These hardback volumes were scanned without being taken apart. The scanned pages were saved as PDF (Portable Document Format) files and also put through the latest optical character recognition (OCR) software. The Proceedings for the years 1994 onwards were provided as files from the professional typesetting programs QuarkXPress and 3B2. Each article is delivered to the user as a PDF, ensuring that the user obtains standardized and easily accessible files. The PDF is similar to a photocopy in that the user obtains a copy of the article as it appears in its original published form.

[Diagram: SOMCNV and SOMFTS]

The in-house program named “SOMCNV” merges the data from both the scanned and electronic sources, producing an Internet-based archive database with a structure specific to the client’s end application. Having been developed in-house and being script-based, “SOMCNV” can be readily altered, making a bespoke solution possible. The full text search of the database is enabled by the unique in-house software program, “SOMFTS”. This program has been written to give the end user the greatest possible control and speed, even in high-level search requests.
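
The merging step described above can be pictured with a short Python sketch. It is not the “SOMCNV” program, whose internals are not published; the Article record, its field names and the rule for choosing between duplicates are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Article:
    # Hypothetical record layout; the real archive structure is not published.
    volume: int
    year: int
    title: str
    author: str
    section: str      # e.g. "lecture" or "memoir"
    source: str       # "scanned" (OCR text) or "typeset" (text from typesetting files)
    pages: List[str]  # plain text of each page, in reading order


def merge_sources(scanned, typeset):
    """Combine both sources into one list ordered by year, volume and title."""
    merged = {}
    for article in list(scanned) + list(typeset):
        key = (article.volume, article.title.strip().lower())
        # Assumption: if the same article arrives from both sources, keep the
        # typeset text because it has not passed through OCR.
        if key not in merged or article.source == "typeset":
            merged[key] = article
    return sorted(merged.values(), key=lambda a: (a.year, a.volume, a.title))
```

Keeping one record type for both sources is the point of the sketch: once scanned and typeset articles share a structure, later steps such as indexing need not know where the text came from.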

The archive database and the in-house programs are currently hosted on the Somerset Computing dedicated server based in London. Alternatively, the in-house programs can be licensed, allowing the electronic archive to be hosted on any third party server.

Together, the “SOMCNV” and “SOMFTS” programs allow complete automation of the structuring and indexing of the archive database, keeping the manual work required to a minimum and the accuracy of the finished product to a maximum. These programs have been designed to thrive on large multi-volume publications and can handle anything from a few thousand to millions of pages.

From the British Academy web site the user can access the Full Search Screen and is then able to key in up to fifty words. The program will search the entire archive for matches in seconds. The way in which the data is structured automatically by “SOMCNV” enables the user to select whether to search the entire archive or just the memoirs or lectures. The data has also been indexed so that it is possible to view all articles written by a certain author or belonging to a certain lecture series.
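
A full text search of this kind is commonly built on an inverted index, which maps each word to the articles containing it. The sketch below is a minimal illustration in Python, not the “SOMFTS” program itself: the function names, the dictionary layout and the rule that every query word must match are assumptions. Only the fifty-word limit and the section filter come from the description above.

```python
import re
from collections import defaultdict

MAX_QUERY_WORDS = 50  # the search screen accepts up to fifty words


def tokenize(text):
    """Split text into lower-case words made of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(articles):
    """Map each word to the set of article ids whose text contains it.

    `articles` is a hypothetical dict: id -> {"section": str, "text": str}.
    """
    index = defaultdict(set)
    for article_id, article in articles.items():
        for word in tokenize(article["text"]):
            index[word].add(article_id)
    return index


def search(index, articles, query, section=None):
    """Return sorted ids of articles containing every query word."""
    words = tokenize(query)[:MAX_QUERY_WORDS]
    if not words:
        return []
    # Assumption: all query words must appear in the article (AND semantics).
    hits = set.intersection(*(index.get(word, set()) for word in words))
    if section is not None:
        # Restrict the search to one part of the archive, e.g. "memoir".
        hits = {a for a in hits if articles[a]["section"] == section}
    return sorted(hits)
```

Because the index is built once in advance, each search is a handful of set lookups rather than a scan of every page, which is one way a large archive can be searched in seconds.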

Once the program has completed the search, the Search Results Screen is displayed. The Search Results Screen lists all the articles that match the search, along with the author, the volume of the Proceedings and statistics regarding the occurrence of the word(s) being searched for.
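
The occurrence statistics shown on the results screen could be derived from simple per-page counts, as in the hedged Python sketch below. The function names are hypothetical, and treating the page with the most occurrences as the best-matching page is an assumption, since the actual scoring used by the software is not described.

```python
import re


def page_counts(pages, query_words):
    """Count occurrences of the query words on each page of one article."""
    wanted = {word.lower() for word in query_words}
    counts = []
    for text in pages:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        counts.append(sum(1 for token in tokens if token in wanted))
    return counts


def summarize(pages, query_words):
    """Return the total number of occurrences and the best-matching page."""
    counts = page_counts(pages, query_words)
    total = sum(counts)
    # Page numbers are 1-based; ties go to the earliest page.
    # Assumption: "best" simply means the page with the most occurrences.
    best_page = counts.index(max(counts)) + 1 if total else None
    return {"total": total, "best_page": best_page}
```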

The user is able to view the PDF of the first page of each article listed and the page that has the optimum match. This allows the user to peruse the article and decide whether it is the correct article to download. If the user finds the article they require, the PDF of the whole article can be downloaded via an e-commerce or password-restricted area of the web site. Selected users (e.g. Fellows of the British Academy) listed on another database can access the password-restricted area. The database will contain information specifying the user’s rights to free access to certain articles. For example, a user may have free access to articles from certain years or in certain lecture series.
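
The access rules described above amount to a lookup against the user’s stored rights. The following Python sketch shows one possible shape for that check; the AccessRights record and its field names are assumptions for illustration and do not describe the actual user database.

```python
from dataclasses import dataclass, field


@dataclass
class AccessRights:
    # Hypothetical record for one user listed in the access database.
    free_years: set = field(default_factory=set)   # publication years open to the user
    free_series: set = field(default_factory=set)  # lecture series open to the user


def has_free_access(rights, article_year, article_series):
    """Return True if the user may download the article without paying."""
    if rights is None:
        # Users not listed in the database have no free access.
        return False
    return article_year in rights.free_years or article_series in rights.free_series
```

A user who fails this check would be routed to the e-commerce area instead of the password-restricted one.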

Each time a file is to be downloaded, it is sent to a unique temporary directory which the user has access to for a limited time as dictated by the client. This ensures that users are unable to obtain PDF files that have not been authorized for downloading. If further security is required the PDF files can also be encrypted.
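
The time-limited download directory can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the production mechanism: the function names, the staging folder and the clean-up rule based on the directory’s modification time are all hypothetical.

```python
import shutil
import tempfile
import time
from pathlib import Path


def stage_download(pdf_path, staging_root, lifetime_seconds):
    """Copy a PDF into a fresh, uniquely named directory.

    Returns the path of the copy and the time at which it expires.
    """
    staging_root = Path(staging_root)
    staging_root.mkdir(parents=True, exist_ok=True)
    # mkdtemp creates a directory with an unpredictable, unique name.
    unique_dir = Path(tempfile.mkdtemp(dir=staging_root))
    copy_path = unique_dir / Path(pdf_path).name
    shutil.copy2(pdf_path, copy_path)
    return copy_path, time.time() + lifetime_seconds


def purge_expired(staging_root, lifetime_seconds, now=None):
    """Delete staging directories older than the allowed lifetime."""
    now = time.time() if now is None else now
    removed = 0
    for entry in Path(staging_root).iterdir():
        # Assumption: a directory's modification time marks when it was staged.
        if entry.is_dir() and now - entry.stat().st_mtime > lifetime_seconds:
            shutil.rmtree(entry)
            removed += 1
    return removed
```

Running the purge on a schedule, with the lifetime set by the client, would remove each copy once its access window has passed, so an old download link stops working.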

According to James Rivington, the Publications Officer of the British Academy, ‘the whole process has been very smooth and painless. Somerset Computing took the material in whatever form we were able to supply it, and solved any technical problems – where necessary talking direct to the original typesetters. In preparing the web interface they have proved flexible and responsive to our particular requirements. The most important thing about what Somerset Computing has done for us so far is that it works!’

The electronic archive is not limited to Internet access alone. Somerset Computing is able to produce a CD-ROM version of the archive which retains the full text search facility and looks and responds just like the client’s web site. The CD-ROM will include software so that it can link back to the Internet-based archive for updating. This provides an organization such as the British Academy with the option of generating revenue by offering the CD-ROM to libraries and other customers.

Computer-based information systems are a rapidly growing area of management science, allowing archiving, information storage and retrieval to be carried out with greater ease and speed.

Somerset Computing can provide the complete solution in a single package, incorporating aspects of e-commerce, web design and password restriction into the electronic archive. Somerset Computing has worked on numerous prestigious publications, providing each with a unique, customer-specific electronic data manipulation solution.

This article illustrates just one solution that Somerset Computing has developed for its clients. Testimonials and descriptions of the other services offered by Somerset Computing are available on its website: www.somcom.co.uk

References
• www.proc.britac.ac.uk
• Anderson, Kent. The useful archive. Learned Publishing 2002: 15 (2), pp85–89
• University of Leicester (2002) MBA Implementing Strategies – pp25

 


