Encoded Archival Description (EAD) Project
The Utah State Archives supports the ongoing efforts of the archival community in the creation and implementation of descriptive standards for finding aids. To that end, we began our own project in April 2000 to convert all of our existing record series inventories and agency histories to the format specified by the Library of Congress, in association with the Society of American Archivists' EAD Roundtable. This project was completed in August 2000, with a total of 614 series inventories and 125 agency histories converted to XML. New finding aids are being added to that number as records are processed.
Methodology
About eight members of the Archives staff took the SAA EAD training (using XMetaL) in mid-April 2000. Until then, no one was familiar with XML or stylesheets. Two staff members had experience with the web and HTML, one of whom also had limited programming experience. At that time the decision was made that the webmaster would create the stylesheets and convert all of the existing finding aids to EAD/XML; another member of the web staff would create the template of our rendition of EAD to be used in XMetaL, and validate all legacy documents after they were marked up; and the remaining staff would concentrate on creating new finding aids in EAD.
Separating Each Series Inventory
All of our finding aids have been available on the web since 1996, contained in a Folio infobase, which essentially is one very large SGML document that is converted to HTML on the fly when a portion of the infobase is called by a browser. The first step in moving this information to XML was to move the data out of Folio and into separate documents, one for each finding aid, named by series number. The Folio software has a utility that automatically separates the data into discrete files, based upon the structure (similar to a table of contents) that infobase creators embed into it. This worked well for us, as the data were saved in Rich Text Format. All of the new documents did need to be renamed by hand. Some minimal data cleanup also needed to be done in each.
If our finding aids had not been in Folio, we would have started the process with whatever electronic copies existed in word processing, and scanned any paper copies not otherwise available, using OCR to make the data editable. Once the data is editable, the procedures below can be applied.
About two-thirds of the series inventories included container lists, most very short, maybe a table of ten rows, but some very long, with the table extending 100 or more pages. For purposes of this project, the container lists were removed from the descriptive text and placed in a separate document temporarily.
The documents with the descriptive data were then copied to a separate folder and renamed to have a .txt extension, though they were still in Rich Text Format. When opened in Notepad, all of the formatting coding was then visible. The coding made the documents look very messy, but it provided a way to run search-and-replace commands that would replace the coding with XML markup. The great part was that the RTF coding was distinctive enough that the search and replace function could distinguish a <p> from a </p>.
Search and Replace
To run the search and replace commands, the software chosen was Allaire's (now Adobe's) HomeSite. This is an inexpensive HTML text editor that happened to come bundled with Macromedia's (now Adobe's) Dreamweaver, the software we use to develop our web pages. HomeSite was flexible enough to easily work with text files and XML files, and the search and replace commands could apply to every document contained in a folder, which was a real time saver.
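The folder-wide search and replace described above can be sketched in Python. This is only an illustration, not the commands that were run in HomeSite: the mapping of RTF control words to EAD tags, the function names convert_text and convert_folder, and the choice to write a new .xml file beside each .txt file are all assumptions made for the example.

```python
import re
from pathlib import Path

# Hypothetical mapping from RTF control words to EAD markup. The patterns
# actually used in the project are not recorded in this article.
REPLACEMENTS = [
    (re.compile(r"\\pard\b\s?"), "<p>"),                  # \pard opens a paragraph
    (re.compile(r"\\par\b\s?"), "</p>"),                  # \par closes a paragraph
    (re.compile(r"\\b0\b\s?"), "</emph>"),                # \b0 turns bold off
    (re.compile(r"\\b\b\s?"), '<emph render="bold">'),    # \b turns bold on
]


def convert_text(rtf_text):
    """Apply every search-and-replace pair, in order, to one document."""
    for pattern, replacement in REPLACEMENTS:
        rtf_text = pattern.sub(replacement, rtf_text)
    return rtf_text


def convert_folder(folder):
    """Convert every .txt file in a folder; return how many were processed."""
    count = 0
    for path in sorted(Path(folder).glob("*.txt")):
        original = path.read_text(encoding="latin-1")
        # Write the result beside the source file, leaving the original intact.
        path.with_suffix(".xml").write_text(convert_text(original), encoding="utf-8")
        count += 1
    return count
```

Because \pard and \par (and \b and \b0) are distinct control words, an opening tag and a closing tag can be told apart, which is the property the paragraph above relies on. The order of the pairs matters less than the word-boundary checks, which keep \par from matching inside \pard.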
HomeSite was also customizable with regard to tags it recognizes. The "snippet" feature was used to create the various EAD XML tags that could be wrapped around text with the click of the mouse.
Running the search and replace commands, and adding the extra necessary XML tags by hand to the descriptive text of the 614 series inventories, took three weeks. After those were completed, they were individually validated against the EAD DTD with XMetaL, which took about a day. At the time of the project, the Archives had only one copy of XMetaL, and it was not installed on the PC of the staff member doing the conversion (indeed, it was in a different building across town). That was partly why HomeSite was chosen: it was already available.
Stylesheets and Parsing to HTML
As each series inventory was completed, an XSL stylesheet was used to transform it to HTML with James Clark's XT XSLT processor. To create the stylesheet, instructions were followed from Elliotte Rusty Harold's book XML Bible, published by IDG Books in 2000. The stylesheet examples from the SAA's EAD Roundtable were also useful. The stylesheet itself was fairly simple, but met our needs. We had one in place in about three to four days. Since the initial project, XSLT transformations have been done using Altova's XMLSpy. The current stylesheet we use follows the EAD 2002 schema.
The stylesheet was written so that the HTML included all of the formatting and comment fields necessary for our Dreamweaver software to think the HTML was created using a Dreamweaver template. This was done so that when we update the design of our web page, all we need to do is update the template, and all associated documents formed with that template will be changed accordingly.
Container Lists
The container lists, which up to then had been patiently waiting for attention in separate documents, were still a problem. They were formatted in tabbed columns, a quasi-table. The style they had been written in specified that the container numbers (box, folder, etc.) be listed once and implied dittoes fill up the rows and cells until the numbers changed. To some extent, this style was also used in the column that described the container contents. With EAD, we realized that all of the implied dittoes needed to be filled in with real data. That was done as each was converted to XML. This part of the project required considerable tweaking of data to get the columns to line up properly. Some people had separated columns by using a long stretch of spaces, and others had used tabs. Sometimes the tabular pattern changed from row to row. The search and replace functions of the word processor were invaluable in getting rid of "extra stuff." The object of the data tweaking was to get all container content into well-structured table rows and columns. Once in a table, other procedures could be applied to wrap the XML around the field contents.
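Filling in the implied dittoes can be illustrated with a short Python sketch. It is not the procedure used in the project, which relied on the word processor and hand editing. The function name fill_dittoes, the sample rows, and the assumption that each row has already been split into cells (with an empty cell standing for an implied ditto) are all hypothetical.

```python
def fill_dittoes(rows, ditto_columns):
    """Copy the last value seen in each ditto column into any blank cell below it.

    rows is a list of lists of strings, one inner list per table row.
    ditto_columns lists the column indexes where a blank means "same as above".
    """
    last_seen = {}
    filled = []
    for row in rows:
        new_row = list(row)  # leave the caller's data untouched
        for col in ditto_columns:
            value = new_row[col].strip()
            if value:
                last_seen[col] = value  # a real value: remember it
            else:
                new_row[col] = last_seen.get(col, "")  # blank: repeat the last one
        filled.append(new_row)
    return filled


# Sample container list: box, folder, description (made-up data).
sample = [
    ["1", "1", "Minutes, 1950"],
    ["", "2", "Minutes, 1951"],
    ["", "3", "Minutes, 1952"],
    ["2", "1", "Correspondence"],
]
filled_sample = fill_dittoes(sample, ditto_columns=[0, 1])
```

This sketch simply carries each column forward on its own. A list in which a blank folder cell under a new box number should not inherit the previous folder number would need an extra rule.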
We discovered that the easiest way to convert our container lists to XML was to use Corel's WordPerfect. While other word processors work, there are some advantages with WordPerfect, especially during search and replace operations. When you highlight tabbed columns and then click on "insert table", the contents of those columns are placed into a WordPerfect table. We were careful to have only one tab between columns and for the columns to all line up together from one row to the next. Any variation to this rule caused problems with the next step.
Then, with the data in a table, we told WordPerfect to select the table (use the Edit drop-down menu for this), then hit the delete key. The software then pops up a menu challenging you about what you actually want to delete. At the very bottom is a feature to convert the contents of the table to a merge document, using the table header as field names. Select that choice. This creates a data file that corresponds with a merge form. We ended up needing about 20 merge forms, each a variation on a theme of <container>, <unittitle>, and <unitdate> tags within <c01> and <did> tags. When you run the merge, the data from the table have the EAD tags perfectly wrapped around them. The resulting data were then cut/pasted inside the <dsc> tags in the XML documents with the series descriptions. Alternatively, we could have used a spreadsheet such as Microsoft Excel and used the Word merge options (see demo video).
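For readers without WordPerfect, the effect of the merge can be approximated in Python with the standard library's xml.etree.ElementTree module. This is a sketch of the idea only, not the merge forms themselves: the function names row_to_c01 and rows_to_dsc, the four-field row layout, and the type attribute values on <container> are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def row_to_c01(box, folder, title, date=""):
    """Wrap one table row in <c01><did>...</did></c01>, like one merge record."""
    c01 = ET.Element("c01")
    did = ET.SubElement(c01, "did")
    ET.SubElement(did, "container", type="box").text = box
    ET.SubElement(did, "container", type="folder").text = folder
    ET.SubElement(did, "unittitle").text = title
    if date:  # only emit <unitdate> when the row has a date
        ET.SubElement(did, "unitdate").text = date
    return c01


def rows_to_dsc(rows):
    """Build a <dsc> element from all rows and return it as an XML string."""
    dsc = ET.Element("dsc")
    for row in rows:
        dsc.append(row_to_c01(*row))
    # ElementTree escapes characters such as & in the cell text automatically.
    return ET.tostring(dsc, encoding="unicode")


# Made-up sample rows: box, folder, title, date.
dsc_xml = rows_to_dsc([
    ("1", "1", "Minutes & agendas", "1950"),
    ("1", "2", "Correspondence", ""),
])
```

Each variation of the merge form would correspond to a different version of row_to_c01, for example one with no folder column or one that nests a <c02> inside the <c01>.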
Although there were fewer container lists than series descriptions, this part of the project took longer, about four weeks. In contrast, the agency histories took about four days to complete all 125 of them. The EAD coding for those was rather minimal since EAD does not yet support all of the note fields we currently use.
Perfect Printed Copies
Having all of our inventories and agency histories available to the public (and search engines) in HTML only solved one part of the problem that finding aids create for us. The other problem was the creation of a printed copy that looks clean and professional, with table headers, page numbers, and series numbers printed on each page. This is the copy used in our Research Room by both staff and researchers. Previously, all of the finding aids had always had a master copy in WordPerfect, and another copy used for electronic distribution (without headers, footers, etc.), and several duplicate copies in people's personal directories. The hope with EAD was to make the XML copy the master copy and eliminate any other "master" or duplicate copy which existed. Everything was stored in a central place on the LAN to reduce confusion. When a finding aid was updated, it was run through the stylesheet and posted on the website.
The question remained: how could a good Research Room copy be made? After some investigation, a new stylesheet was created using XSL formatting objects. (A formatting object stylesheet, incidentally, is a completely different animal from the HTML stylesheet; it took four solid days to figure out how to make it work, and it has continued to be tweaked ever since.) The formatting objects file was then converted to a PDF file using Apache's FOP software. One of the advantages of using the PDF file is that the very large series inventories we have (those over 1 megabyte in size) can be updated in XML and made available on the web without necessitating the creation of multiple HTML documents that divide up the contents into a reasonable size for download (50 KB per page being our state standard). The resulting PDF file is compressed, so downloading is much easier.
Advent of the Database
In early 2008, the way we produce finding aids was changed from the creation of individual XML files to one big database. Much of the data had been in the database all these years, except for the container lists and a few lengthy descriptions. The application (homegrown using APPX Software, with Oracle as the database) was altered to allow export of the data using the EAD 2002 schema standard. This took the Archives' staff about four weeks total, including testing. Users click a button, the XML file is created, and they can view it in a browser, all in one single step. The resulting document can then be saved to the website. Further development of the application will include a way for the public to access the finding aids directly from the database, without needing to produce intermediate XML files, as well as making them OAI-harvestable.
The most difficult part of this migration was to import all of the container list data from XML into the database. Sometimes we weren't as careful as we should have been in how we did our XML coding. While the data was structured, it was far from normalized. Using Altova's MapForce, all of the container list data from every series was moved to one long tab-delimited file. This was then imported to the database, and scripts were run to copy any date information found within the <unittitle> field to the database's date fields. Because the date patterns found within <unittitle> were wide and varied (even unexpected), we hired a database consultant to do the work for us. That project took two weeks. At the same time, the consultant created a new feature in the application to allow users to import data to the database directly from an Excel file, which makes data entry so much easier. One nice advantage to the database option is that since dates are contained within real date fields, we can wrap the date information any way we want within <unittitle> or <unitdate> upon export, while the data is physically stored separately. So if it makes more sense for a container list to have a separate column for <unitdate>, we do that. If it looks better for the date to show at the beginning or ending of the <unittitle>, we do that.
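To give a sense of what the date-extraction scripts had to do, here is a minimal Python sketch that handles only the simplest pattern: a four-digit year, or a range of years, at the very end of a title. It is not the consultant's code, which had to cope with many more patterns. The function name split_title_and_date and the regular expression are assumptions made for illustration.

```python
import re

# Matches a trailing year or year range such as ", 1950" or " 1950-1952".
# Only years from 1600 through 2099 are recognized (an arbitrary choice).
YEAR = r"(?:1[6-9]|20)\d{2}"
TRAILING_YEARS = re.compile(r"[,;]?\s*(" + YEAR + r"(?:\s*-\s*" + YEAR + r")?)\s*$")


def split_title_and_date(unittitle):
    """Return (title, date) where date is None if no trailing year is found."""
    match = TRAILING_YEARS.search(unittitle)
    if match is None:
        return unittitle.strip(), None
    title = unittitle[:match.start()].strip()
    return title, match.group(1)
```

A title such as "Minutes, 1950-1952" is split into the title "Minutes" and the date "1950-1952", while a title with a year in the middle, such as "Building 1920 plans", is left alone. Dates written as months and days, approximate dates, and other variations would each need their own rule, which is why the real work was much larger than this sketch suggests.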
The database also keeps track of other record series and the finding aids that have been produced for them, so if there is a <relatedmaterial> notation, the database will automatically create a link to that finding aid if it exists. Previously, we would create the link if we knew the finding aid would eventually be created, which caused a lot of broken links for series groups that were published individually before everything was complete. That was easier than going back and editing every related series after a new finding aid was produced, so the database saves us a lot of work and users a lot of frustration.
Conclusions
The process we chose for initially converting our finding aids worked well for us. Perhaps a better/faster way existed that we could have used, but this process was within the skills of existing staff. The largest of our finding aids (which continue to grow) are difficult to manage using XML, unless in a database. Still, creating an XML document once and then using a stylesheet to transform it into something else is very useful and has made the EAD project worthwhile.
Elizabeth Perkes, August 2008
Page Last Updated August 21, 2008.