By Tim Kindseth, MSIS candidate (May 2016), School of Information, The University of Texas at Austin
Yesterday, TARO Steering Committee Co-chair Amy Bowman e-mailed members of the consortium a link to all of the EAD <controlaccess> datasets, broken down by repository, that we extracted and wrangled this spring. I spoke briefly about our work during last month’s TARO brown bag presentation at the Society of Southwest Archivists’ annual meeting in Oklahoma City. Both Amy and I also thought it would be a good idea to publish here on TARO Today the more relevant sections of the report on consortial authority control that I wrote and submitted to the TARO standards committee as part of my final master’s degree Capstone project at UT-Austin’s School of Information. Each (work) day for the next week or so I’ll be posting, sequentially, another section of the document, beginning with the overview below. For a copy of the full report, please get in touch with Amy or e-mail me at tim [dot] kindseth [at] utexas [dot] edu. —Tim Kindseth
OVERVIEW
Control access, or index, terms are a well-established bibliographic convention. Within archival practice, however, the selection, use, search, and browsing of such terms are not so straightforward. Whereas books and other published items typically have well-defined scopes (and thereby topics), making the choice of control access terms rather intuitive or self-evident, it is much more difficult to choose just a handful of subjects or other authorities (persons, corporate bodies, genres, geographic place names) for, say, a collection of twenty-five boxes of unpublished manuscript material generated over four decades in the course of entirely unrelated activities and life events. Yet since the adoption of Encoded Archival Description in the late 1990s, archivists across the United States, Texas included, have been trying to do just that: select three to five (occasionally ten or more) representative index terms that will somehow do justice to, and encompass, the startling breadth and depth of topics that a single archival collection can cover.
The hope is that these representative control access terms might function as arterials into archival finding aids, a genre that is still the source of much researcher confusion. Before EAD, the reference archivist was, for most researchers, among the main sources of information about any particular repository’s collections. Online EAD finding aids, one could argue, have come to play a similar role, transmitting to researchers, many of whom cannot easily travel to this or that collecting institution, not just information about individual collections but, in the case of an EAD consortium like the Online Archive of California (OAC) or Texas Archival Resources Online (TARO), information about how those collections relate to one another as well.
Relational collection mapping in theory makes material easier to find, more accessible and retrievable, and is the basis and goal of larger movements within information science like Linked Open Data and the Semantic Web. To get collections to talk to collections, though, is no easy task. Metadata from one finding aid must be able to converse with that of another, which requires an unforgiving level of shared data structure. For index terms to link up and self-aggregate across the repositories that comprise any consortium, control access terms must be crafted in exactly the same way across potentially dozens of institutions with varied familiarity with EAD and generally differing levels of archival expertise. Enter controlled vocabularies and best practices guidelines, both gentle nudges toward synchronicity in the ways in which archivists, many with dissimilar levels of experience or institutional support, encode their repository’s finding aids.
Rules are one thing; following them, however, is another. Katherine M. Wisser and Jackie Dean’s analysis of EAD tag usage across 1,136 finding aids from 108 anonymized repositories, published in The American Archivist in 2013, found that “little uniformity exists in encoding practices.” They concluded, “Variability in implementation of encoding standards has the potential to diminish the ability to aggregate records and effectively leverage structures for management and retrievability.” In 2014, Dr. Ciaran Trace and three others at UT-Austin looked at a set of 8,729 TARO finding aids and reached conclusions about EAD data quality similar to Wisser and Dean’s. “With humans in the mix,” they realized, “issues with the quality of the encoding can be expected.” This human hurdle must first be recognized before the issue of inconsistency can be surmounted. “Finding and documenting such problems with EAD encoding,” they argued, “is a key first step in instituting more rigorous control over descriptive and encoding practices that facilitate the aggregation, visualization and analysis of archival data.” Such aggregations and visualizations, which make possible the subject browsing and searching (faceted or otherwise) features that TARO is considering during its redesign, require clean data, and in order to clean it, you first have to locate the mess.
From January through May 2016, for my master’s Capstone project at UT-Austin’s School of Information, that was precisely my task: find where and in which ways TARO <controlaccess> values were dirty and come up with ways to clean, or normalize, that data. Index terms are not currently searchable through TARO’s online interface; with a revamping of that interface, normalized terms might in the future be harnessed to provide subject searching and/or browsing, thereby increasing discoverability of the archival material described by TARO’s online finding aids. Amy Bowman of the Briscoe Center for American History, who supervised the project, and I performed BaseX queries on the more than 14,000 EAD documents from 46 repositories currently stored on TARO’s server. Over 153,000 <controlaccess> terms were extracted, converted into spreadsheets (grouped both by institution and by EAD element), and analyzed for common encoding errors or inconsistencies using OpenRefine’s clustering algorithms. All the while, a literature review on authority control and subject searching in archival settings was conducted. Several underlying, interrelated, unresolved sets of questions emerged during the project:
- If the 153,000-plus <controlaccess> terms encoded in TARO finding aids are to be normalized, against which controlled vocabularies should they be reconciled, and should the reconciliation occur federally (by TARO) or individually by each contributing repository?
- What are the online information-seeking behaviors of archives researchers? In the age of Google and keyword searching, is topic/name browsing a thing of the past? If so, is consortial authority control a hobgoblin, an unnecessary expenditure of time and other resources? Have subject browsing features been effective for the consortiums, like Archives West, that have implemented them?
- How will eventual implementation of EAD3, which was released last year, change the way contributing institutions must encode <controlaccess> terms, and what will be the benefits for search and discovery? To avoid repeating the same (rather complicated and onerous) process twice, should TARO wait until consortial adoption of EAD3 to normalize those terms in accordance with new encoding requirements?
- How can the future selection and encoding of index terms (whether per EAD2002 or EAD3) be standardized (and remain so) across 46 contributing repositories? What best practices should be in place, and how strictly should they be enforced?
That final set of questions is perhaps the most crucial. My own belief is that for authority control to work, control must be part of the equation. Even if TARO is able to normalize all of its current <controlaccess> terms, without consortial enforcement of some kind there will be no guarantee, given the heterogeneous ways that institutions encode finding aids (manually keying the EAD in a text editor vs. generating it automatically with archival management software tools like ArchivesSpace), that future <controlaccess> metadata will be crafted uniformly across all repositories. To date, as our extraction and analysis of TARO’s 153,000 index terms have revealed, there has been very little consistency in the encoding of such terms. Tables breaking down the extracted data in various broad categories, by element, by controlled vocabulary, and by individual repository, can be found near the end of this document. What follows in the next section details some of the more frequent encoding errors and inconsistencies both across and within TARO’s contributing members. It is not at all unusual, for instance, for a subject, person, corporation, place, or other <controlaccess> element to be encoded in divergent ways by the same repository.
The section following that is more speculative, outlining general issues to bear in mind as TARO redesigns its interface. How well that interface functions hinges on the quality of the metadata beneath it, which the title of a 2009 OCLC report written by Jennifer Schaffner makes clear: “The Metadata is the Interface.” Schaffner emphasizes what’s at stake in any effort (like TARO’s) to improve the quality of descriptive metadata: “It would be heartbreaking,” she writes, “if special collections and archives remained invisible because they might not have the kinds of metadata that can easily be discovered by users on the open Web.”
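For readers who want a concrete picture of the extract-and-cluster workflow described in this overview, here is a minimal Python sketch. It is not the code used in the project, which relied on BaseX queries and OpenRefine. It swaps in Python's standard library for the extraction step and a simplified key-collision "fingerprint" of the kind OpenRefine offers for the clustering step. The EAD fragment, the function names, and the choice to replace punctuation with spaces are all assumptions made for illustration.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical EAD 2002 fragment: element names follow EAD, values are
# invented examples of the same heading encoded in slightly different ways.
SAMPLE_EAD = """
<ead>
  <archdesc>
    <controlaccess>
      <persname source="lcnaf">Houston, Sam, 1793-1863.</persname>
      <persname>Houston, Sam, 1793-1863</persname>
      <subject source="lcsh">Ranching--Texas.</subject>
      <subject>ranching -- Texas</subject>
      <geogname>Austin (Tex.)</geogname>
    </controlaccess>
  </archdesc>
</ead>
"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{urn:...}subject'."""
    return tag.split("}")[-1]


def extract_terms(ead_xml):
    """Return (element_name, text) pairs for each child of <controlaccess>."""
    root = ET.fromstring(ead_xml)
    terms = []
    for elem in root.iter():
        if local_name(elem.tag) != "controlaccess":
            continue
        for child in elem:
            name = local_name(child.tag)
            # Collapse internal whitespace left over from hand-keyed XML.
            text = " ".join("".join(child.itertext()).split())
            if name not in ("controlaccess", "head") and text:
                terms.append((name, text))
    return terms


def fingerprint(value):
    """Simplified key: strip accents, lowercase, drop punctuation, sort tokens."""
    value = unicodedata.normalize("NFKD", value)
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    # Assumption: punctuation becomes a space so 'A--B' and 'A -- B' collide.
    value = re.sub(r"[^\w\s]", " ", value.lower())
    return " ".join(sorted(set(value.split())))


def cluster(terms):
    """Group variant spellings that share an element name and fingerprint."""
    groups = defaultdict(set)
    for name, text in terms:
        groups[(name, fingerprint(text))].add(text)
    # Keep only groups with more than one distinct encoding.
    return {key: sorted(v) for key, v in groups.items() if len(v) > 1}


if __name__ == "__main__":
    for (name, key), variants in cluster(extract_terms(SAMPLE_EAD)).items():
        print(name, key, variants)
```

A key-collision approach like this only catches variants that differ in case, punctuation, spacing, or word order. Deciding which variant is the authorized form still requires reconciliation against a controlled vocabulary, which is the question the first bullet above raises.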