In February of 2008 a 'Joint International Conference on Digital Buddhist Studies: EBTI after 15 and CBETA at 10 Years' was held at Dharma Drum Buddhist College, which assembled more than a hundred participants from many of the most important institutions in the field.
At this conference, a roundtable on the future of Digital Buddhist Studies was held, which brought together both scholars in the field of Buddhist Studies and practitioners involved in the creation of digital resources related to Buddhism. At that roundtable many opinions regarding desirable future developments in this field were expressed, a great many of which concerned the desirability of further integration of existing resources.
There are, for example, a number of sites that contain (more or less complete) digital versions of canonical collections, but they all exist separately. What is needed is a way to interact with them as one (virtually) integrated digital resource, so that the collections can be searched together and similar texts (or translations) can be found, read and compared.
To foster this development, which involves technical questions but is not limited to them, an initiative was started, initially called the "Integrated Buddhist Archives Network" (IBA-net). The 'I' in IBA-net was later changed to stand for 'International', so it is now called the 'International Buddhist Archives Network'. More information on IBA-net can be found on the website iba-net.org.
In this presentation, I will report on the work done so far within IBA-net and outline some of the challenges and opportunities that I see lying ahead.
The way research is done, not just in Buddhist Studies but in nearly every field of study, has changed profoundly in the last decade. For Buddhist Studies, no small part of that change was brought about by the increasing availability of digital resources, combined with increasingly diverse forms of networked communication. A few years ago, I made an attempt to evaluate these changes (Wittern 2000). The main impact of the digital, as recounted there, was the availability of resources for primary research and new ways to communicate. While that still holds, the details are now embarrassingly out of date, so it is time to look, at least briefly, at what has developed since then. I will then proceed to focus
The web as we have it today is still based on TCP/IP networking developed in the late 1960s and early 1970s, on the HTTP protocol developed in the early 1990s and (mostly) on HTML from the mid-1990s. Some of these basic standards have of course been revised and further developed, and there have been some changes, for example in character encoding, which is drifting more and more towards Unicode, and the adoption of the stricter XML as the base for web pages, rather than the more permissive HTML of the earlier Web. These are, however, evolutionary developments, not fundamental changes to the protocol stack that drives the Web. Yet, as we are witnessing, the possibilities that open up as we find more imaginative uses of these technologies are nearly without limit, and the changes we see owe more to a growing number of people participating in the communication with more diverse types of devices (and ideas). The question that needs to be asked is thus first and foremost: what do we want to be there, and what can we do to make that possible?
These are just some of the new things that showed up in the eight years since the above article was published. At that time, scholarly communication was still firmly based on journals, with the move to online publication just starting to catch on and become a bit more organized. I do not have the time to delve into this as deeply as would be appropriate, so the following is more of a casual, anecdotal reminder of what many of us have witnessed over the last years than a thorough analysis of the developments.
It was always easy to publish on the Web, but doing so required access to a web server somewhere, which was only available to academics or those willing to pay for it. The barrier to publication was lowered substantially with the advent of wikis, that is, webpages that can be created, edited and interlinked directly within the web browser, in most cases by all web users without any restrictions, sometimes by communities or groups of users. Jimmy Wales famously took advantage of this in 2001, when he created the free online encyclopedia Wikipedia (http://wikipedia.org). Wikipedia did have problems with the quality of its content at the beginning and still has to fight vandalism, but it is now one of the most successful web sites, is recognized for both the quality and breadth of its content, and is frequently the first stop when researching a topic, even for academics.
Another development, which started in the late 1990s, was the advent of 'blogs' or 'web-logs': diary-like web pages that are typically hosted for free at some service provider and can be updated immediately from any web browser on the internet. Gartner estimated that as of 2007 there were 100 million bloggers regularly updating their pages, while the population of former bloggers was more than 200 million (according to Wikipedia). It would be quite impossible to keep track of all these blogs by visiting the webpages where they are posted. To overcome this problem, blogs are also published as so-called 'feeds', that is, index pages in a specific format that contain information about which entries have been updated. Somebody interested in a blog subscribes to such a feed and can then easily discover and read new or updated posts. This in fact reversed the direction of publication on the web, which up to then had been a 'pull' type of publication, where users would actively go to the content they want to read, rather than being presented, as on TV, with a pre-manufactured program of content that is 'pushed' at them. Using these feeds, however, internet users could assemble their own reading program by subscribing to feeds of interest to them.
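The way a subscriber discovers new or updated posts from a feed can be illustrated with a minimal sketch. It assumes an Atom-style feed; the sample feed, its titles, links and dates, and the function name updated_since are all invented for illustration and do not refer to any real blog.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"  # XML namespace used by Atom feeds

# Made-up sample feed; titles, links and dates are placeholders.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <entry>
    <title>Older post</title>
    <link href="http://example.org/posts/1"/>
    <updated>2008-01-10T09:00:00+00:00</updated>
  </entry>
  <entry>
    <title>Newer post</title>
    <link href="http://example.org/posts/2"/>
    <updated>2008-11-05T12:30:00+00:00</updated>
  </entry>
</feed>"""


def updated_since(feed_xml, last_visit):
    """Return (title, link) pairs for entries updated after last_visit."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.findall(ATOM + "entry"):
        updated = datetime.fromisoformat(entry.findtext(ATOM + "updated"))
        if updated > last_visit:
            link = entry.find(ATOM + "link").get("href")
            fresh.append((entry.findtext(ATOM + "title"), link))
    return fresh


# A subscriber who last checked on 1 June 2008 is shown only the newer post.
last_visit = datetime(2008, 6, 1, tzinfo=timezone.utc)
new_posts = updated_since(SAMPLE_FEED, last_visit)
```

A real feed reader would fetch the feed over HTTP at regular intervals and remember the time of the last check; both are left out here to keep the sketch self-contained.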
Another part of what came to be known by the buzzword 'Web 2.0', which really centers around content created by ordinary, mainstream users of the web rather than publishing experts, is the whole system of social networking services. Of relevance to the topic here are sites such as del.icio.us (now delicious.com), which allow users to bookmark websites, attach arbitrary identifiers (tags) to them and share them with other users of the site. Many other similar sites exist, some specialized in certain fields, others concerned with other types of resources, for example the sharing of bibliographic information, like CiteULike (citeulike.org) with its focus on scholarly literature.
More recent developments include instant messaging, internet relay chat, or internet-based voice phone services (skype.com), which even allow free telephone conferencing. All of these are to my knowledge heavily used by scholars as well, but they do not offer any services specifically directed towards scholars.
One last phenomenon that should be mentioned in this short inventory of relevant developments of the last years is the 'mashup' of websites. The idea behind a mashup is to use generic websites, for example Google Maps, and overlay them with other content to generate a new, more specialized site, for example a map of archaeological sites in China, or a map showing the locations of photographs stored in online photo albums like Flickr. The possibility of creating mashups is an important expansion of how websites can be built. A mashup makes use of data that web servers provide not for direct consumption by the user's browser, but for use by other servers; the interface through which such data is offered is commonly called an API (application programming interface). Many sites that provide information about books, for example, use the API of the Amazon website to display cover images of books.
Technologies and efforts broadly labelled as part of the 'Semantic Web', that is, attempts to create a Web that has more knowledge of its data, have been deliberately excluded from the above list; at least up to this moment they do not seem to flow with the fabric of the internet, but rather try to impose new rules of their own on how content should be created. Without an immediate, tangible benefit for users and publishers of web content, this does not seem likely to work. But there are other, grassroots-like efforts that might provide a better answer. They center around the idea of embedding small bits of additional information (so-called 'microformats') into web sites, information that is not visible to the user but can be interpreted and acted upon by the browser. A microformat is a web-based approach to semantic markup that seeks to re-use existing XHTML and HTML tags unobtrusively to convey metadata. This approach allows information intended for end-users (such as contact information, geographic coordinates, calendar events, and the like) to also be automatically recognized and processed by software; more information can be found in Khare and Çelik 2006 and Allsop 2007.
A short look beyond scholarly communication might be permitted. The recently concluded campaign of Barack Obama for the presidency of the US has shown how many of these tools can be used for communication about political issues and other topics of relevance to a huge number of people. What was essentially a toy for technically minded nerds rapidly became a mainstream tool of communication skillfully used by the Obama campaign, as the net encourages openness and dialogue. As for the election day itself, a special channel on the chat-platform Twitter (#votereport) allowed all voters to report immediately on the conditions at their polling stations, including the waiting time or irregularities, thus providing an unprecedented way for grassroots watching and reporting, which surely helped to prevent irregularities of earlier voting days from happening again.
Of the developments discussed so far, it was in fact only voice messaging that required new protocols; all the other services are built on simple (but ingenious) conventions for using the standard infrastructure of the Web and did not require any new technology to be deployed.
As can be seen in the examples discussed above, new developments are only to a rather small degree due to new technical possibilities. A much larger role is played by the vision and imagination that puts the existing technical means to new use and thus creates something new. So I think a question that should be asked is, what kind of infrastructure would we like to see for Digital Buddhist Studies and IBA?
At the moment, there are two main groups of players involved in Digital Buddhist Studies: the producers or publishers of content (Digital Archives, DA for short) on the one hand, and researchers or other users (USERS for short), who are using these resources for their research or other purposes, on the other. Researchers in turn are publishers of monographs and articles, but increasingly also contributors to mailing lists, online journals, blogs, and a host of other web formats.
While a certain number of academic publications explain, comment on, emend, compare or in other ways refer to texts provided by the DA, there is at the moment no way to close this circle by providing feedback that points from the USERS' publications back to the DA, other than perhaps mailing the DA, who might then provide a version that reflects the feedback (by adding a note or correcting a mistake) with the next update of the resource.
It seems quite clear that this process is too cumbersome and not adequate to the requirements of internet communication. But how can this process be improved? As I see it, this is one of the central questions IBA needs to address.
I would like to imagine a new infrastructure for this, which is based on some fundamental assumptions that have to be fulfilled for this to work:
All participants take and keep ownership of the content they publish. For the DA, this is not much of an issue, but the same should be true for the users, who want to publish the results of their work as they see fit and cannot live with content locked away in the user storage areas of 20 different servers.
The terms of usage (licensing terms) have to be stated clearly and acted upon.
In addition to publishing their respective contents in the usual way (DA on their website(s), USERS in articles, blog entries etc.), the content is also exposed using a yet-to-be-defined protocol, like a feed for a blog. In order to make this work, it should be possible to do this with as little extra effort as possible.
The items mentioned here are further developed in later sections.
Since there are a number of Buddhist Digital Archives, providing access to different types of resources, any attempt at integration should allow a distributed architecture with no need for centralized servers. However, a list of which kinds of resources are available where has to be maintained, and the different archives will need to know about the other archives, or at least have a protocol to find out. One way to solve this problem would be a new catalog of Buddhist scriptures that would provide a wealth of metadata, but also serve as an inventory of digital resources.
IBA should define, or encourage work towards the definition of, a reference scheme to uniquely identify the entities that are of interest in Buddhist Studies. Such an identifier can then, for example, be used in a microformat, yet to be defined, that allows writers to mark up references so that they can be acted upon automatically. If we had a microformat that identifies a text passage as being a translation of the first three lines of the Heart-Sutra into English, this could be published in a feed pointing to this resource. DA, mashup-sites, or other researchers could then subscribe to such a feed. This would enable a DA to produce a link to the passage (or to refer to or embed it in other ways). If the microformat allows it, one could even mark a passage as tentative or disputed and leave it to the consumers of such a feed to decide whether and how to use such content.
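How software might pick up such a marked passage can be shown with a small sketch. No such microformat exists yet, so everything specific here is an assumption made up for illustration: the class names iba-ref, translation and tentative, the use of the title attribute to carry the identifier, and the identifier string itself.

```python
from html.parser import HTMLParser


class IbaRefParser(HTMLParser):
    """Collect passages marked with the (hypothetical) class 'iba-ref'."""

    def __init__(self):
        super().__init__()
        self.refs = []
        self._current = None  # the reference currently being read, if any
        self._tag = None      # tag name of the element that opened it

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._current is None and "iba-ref" in classes:
            self._current = {
                "target": attrs.get("title"),  # hypothetical identifier
                "lang": attrs.get("lang"),
                "kind": "translation" if "translation" in classes else "reference",
                "tentative": "tentative" in classes,
                "text": "",
            }
            self._tag = tag

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Simplification: nested elements with the same tag name are not handled.
        if self._current is not None and tag == self._tag:
            self._current["text"] = " ".join(self._current["text"].split())
            self.refs.append(self._current)
            self._current = None


# Made-up page fragment; the identifier in 'title' is purely illustrative.
PAGE = """
<p>Some commentary on the opening of the text.</p>
<blockquote class="iba-ref translation tentative"
            title="iba:example:heart-sutra:lines-1-3" lang="en">
  (an English rendering of the first three lines would go here)
</blockquote>
"""

parser = IbaRefParser()
parser.feed(PAGE)
refs = parser.refs
```

Each collected record could then become one entry in a feed, and the tentative flag leaves it to the subscriber to decide whether to use the passage.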
In addition to that, it seems desirable or maybe even necessary to define a web service API that allows programmatic access to the resources published by other DAs, for example to aggregate search results from different sites.
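Aggregating search results from several archives could look roughly like the following sketch. It is not based on any existing API: the archives are in-memory stand-ins, and the archive names, record fields, URLs and function names are all hypothetical. The one design point it tries to show is that a distributed setup has to tolerate individual archives being unavailable.

```python
def make_stub_archive(records):
    """Build a stand-in for one archive's search endpoint (no network access)."""
    def search(query):
        needle = query.lower()
        return [r for r in records if needle in r["title"].lower()]
    return search


def unreachable_archive(query):
    """Stand-in for an archive that is currently offline."""
    raise ConnectionError("archive did not respond")


def aggregate_search(query, archives):
    """Query every archive, merge the hits, and record failures separately."""
    hits, errors = [], {}
    for name, search in archives.items():
        try:
            results = search(query)
        except Exception as exc:
            errors[name] = str(exc)  # one failing archive must not block the rest
            continue
        hits.extend({**hit, "archive": name} for hit in results)
    hits.sort(key=lambda h: (h["title"], h["archive"]))
    return hits, errors


# Made-up archives and records, for illustration only.
ARCHIVES = {
    "archive-a": make_stub_archive([
        {"title": "Heart Sutra", "url": "http://archive-a.example/heart"},
        {"title": "Diamond Sutra", "url": "http://archive-a.example/diamond"},
    ]),
    "archive-b": make_stub_archive([
        {"title": "Heart Sutra (English translation)",
         "url": "http://archive-b.example/t/1"},
    ]),
    "archive-c": unreachable_archive,
}

hits, errors = aggregate_search("heart", ARCHIVES)
```

In practice each stub would be replaced by an HTTP request following whatever conventions the archives agree on.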
While the catalog would take care of the linking to texts, which is the most immediate need, there is a similar need to provide a way to link information about people, places and concepts. Some important parts of the infrastructure to be defined are already in place, and efforts should be made to integrate them. These include the collaborative Digital Dictionary of Buddhism (DDB), founded and coordinated by Charles Muller, the Indica et Buddhica site maintained by Richard Mahoney, and the Buddhist article databases maintained in Taiwan and Japan, to name just a few. For information about people, cues might be taken from the Knowledgebase of Tang Persons developed at the Institute for Research in Humanities.
Electronic Buddhist texts and the problem of cataloging them have been at the center of my research interests for most of the last 20 years. One of the things that came out of this interest is the WWW Database of Chinese Buddhist texts, which was first published on the Web in 1996. Some more details about that project have been published in Wittern 2007.
This database had the rather modest aim of providing easy cross-references to several printed versions of the Chinese Buddhist Canon, so that a text could be found, even if a source referenced a different canonical edition.
Now, within the context of IBA, I think the time has come for a much more ambitious catalog of Buddhist scriptures that reflects the recent advances both in the possibilities of electronically mediated information and in scholarly understanding of the textual history of Buddhist scriptures. This catalog should allow us to build a comprehensive inventory of all Buddhist scriptures, whether existing today or only known through citations or historical scriptural inventories, including the complete transmission history, known editions, sources, and provenance of items. It goes without saying, especially in the present context, that editions would also include electronic editions of the text, with links to the versions at the DAs.
A major challenge in creating such a catalog is to adequately model the complexity of Buddhist scriptures with their long tradition, many canonical languages, complex intertextual relationships and wealth of related research and findings, in a way that allows efficient handling of the technical aspects, structural architectural requirements and at the same time manages to be future-proof in respect to newly arising possibilities of use and re-use of the information.
While some of these problems are specific to Buddhist scriptures, most of the issues are of a more generic nature. It therefore might be useful to look at how other communities are trying to solve similar issues.
One such attempt to solve a very similar problem is discussed in the report of the Cataloguing Section of the International Federation of Library Associations and Institutions (IFLA) on the "Functional Requirements for Bibliographic Records" (FRBR), which was first issued in 1997 (in print 1998, online at http://www.ifla.org/VII/s13/frbr), with a final report published in February of 2008 (cited as IFLA 2008 in the following).
FRBR views the bibliographic entities it attempts to catalog in three groups:
Group 1 entities are Work, Expression, Manifestation, and Item, and represent the products of intellectual or artistic endeavour.
Group 2 entities are person and corporate body, responsible for the custodianship of Group 1’s intellectual or artistic endeavour.
Group 3 entities are subjects of Group 1 or Group 2’s intellectual endeavour, and include concepts, objects, events, places.
Fig. 1: Entity relationships in FRBR
In the present context, the main focus lies on Group 1 entities; a complete model, however, will likely have to include Group 2 and Group 3 entities as well. A bibliographic entity as seen by FRBR can be analyzed as follows (see also Fig. 1):
A work, which is a “distinct intellectual or artistic creation.” (IFLA 2008)
An expression, which is “the specific intellectual or artistic form that a work takes each time it is ‘realized.’” (IFLA 2008)
A manifestation, which is “the physical embodiment of an expression of a work. As an entity, manifestation represents all the physical objects that bear the same characteristics, in respect to both intellectual content and physical form.” (IFLA 2008)
An item is “a single exemplar of a manifestation. The entity defined as item is a concrete entity.” (IFLA 2008)
These definitions are somewhat fuzzy and not easy to grasp; the progression is from the rather abstract notion of a work to individual items such as, say, a printed copy of that work. I will try to illustrate them with the example of the Avataṁsaka sūtra, the same that has been used as an example in Wittern 2007.
Work Avataṁsaka sūtra [w1]
Expression
translation by Buddhabhadra (ca. 418-420 or 398~) [e1]
translation by Śikṣananda (ca. 695-699) [e2]
(partial) translation by Prajñā (ca. 795-798) [e3]
Manifestation
the text as contained in the Taishō edition, reprinted by Xinwenfeng, Volume 9, No. 278, p. 395 to 788 (60 juan) [m1]
the text as contained in the Tripitaka Koreana, reprinted by Xinwenfeng, Volume 8, No 80, p. 425 to 944 (60 juan) [m2]
the text as contained in the Taishō edition, reprinted by Xinwenfeng, Volume 10, No. 279, p. 1 to 444 (80 juan) [m3]
the text as contained in the Tripitaka Koreana, reprinted by Xinwenfeng, Volume 8, No 79, p. 1 to 424 (80 juan) [m4]
the text as contained in the Taishō edition, reprinted by Xinwenfeng, Volume 10, No. 293, p. 661 to 851 (40 juan) [m5]
the text as contained in the Tripitaka Koreana, reprinted by Xinwenfeng, Volume 36, No 1262, p. 1 to 229 (40 juan) [m6]
Item
My copy of volumes 9 [i1] and 10 [i2] of the Taishō edition on the bookshelves of my office in Kyoto.
What is not yet sufficiently expressed here are the relationships between these entities, which include part -> whole relationships in the case of [e3], which is a translation of only a part of the work. FRBR does indeed have a sophisticated view of how to express these relationships, including the part/whole or container/contained relationship (IFLA 2008, p. 60 ff).
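To make the four levels and their relationships more concrete, here is a minimal sketch of the example as a data structure. It is not part of FRBR itself; the class and field names are chosen for illustration only. Only the Taishō manifestations are included, and their assignment to the three translations is inferred from the juan counts (60, 80 and 40). Listing the same physical volume under two manifestations is a simplification of the container/contained relationship.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    ident: str
    note: str


@dataclass
class Manifestation:
    ident: str
    note: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    ident: str
    note: str
    partial: bool = False  # True if only part of the work is realized
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    ident: str
    title: str
    expressions: List[Expression] = field(default_factory=list)


def chains(work):
    """List every work -> expression -> manifestation -> item path by identifier."""
    return [
        (work.ident, e.ident, m.ident, i.ident)
        for e in work.expressions
        for m in e.manifestations
        for i in m.items
    ]


i1 = Item("i1", "my copy of Taishō volume 9")
i2 = Item("i2", "my copy of Taishō volume 10")

w1 = Work("w1", "Avataṁsaka sūtra", [
    Expression("e1", "translation by Buddhabhadra", manifestations=[
        Manifestation("m1", "Taishō vol. 9, No. 278 (60 juan)", [i1]),
    ]),
    Expression("e2", "translation by Śikṣananda", manifestations=[
        Manifestation("m3", "Taishō vol. 10, No. 279 (80 juan)", [i2]),
    ]),
    Expression("e3", "translation by Prajñā", partial=True, manifestations=[
        Manifestation("m5", "Taishō vol. 10, No. 293 (40 juan)", [i2]),
    ]),
])

paths = chains(w1)
partial_expressions = [e.ident for e in w1.expressions if e.partial]
```

The partial flag is only a crude stand-in for the part/whole relationship; a fuller model would record which part of the work an expression covers.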
The first key to the answer, as I see it, lies in realizing that whatever steps are to be taken, they will have to be designed in such a way that they can easily blend into the fabric of the Web. The Web is built around decentralized servers that communicate using protocols based on open standards, which can easily be adopted, expanded and built upon or glued together to form new, more complex standards. The following is a list of activities I would like to see, listed in the order of dependency, with the most generic mentioned first. This does not mean, however, that they would have to be addressed in this sequence, or in sequence at all, since some of the problems are orthogonal.
One necessity to make this possible might be the creation of a unique identifier that can be used to identify the objects of study. For a start, one may attempt to define such identifiers for Buddhist scriptures, which I will tentatively call an IBA-ID here.
This is quite an important tool, so it needs some careful consideration. It will be useful to also look at how other fields have solved similar problems.
A mechanism for identifying books is in fact already in place in the international book trade and publishing industry: the ISBN (see http://isbn.org), which is used in online library catalogs and, sometimes with extensions, by online bookstores such as Amazon.com. However, this identifies only the book as a physical object and is not sufficient for the purpose at hand.
Other identifiers, such as the Digital Object Identifier (DOI, see http://doi.org), give a permanent identifier to electronic documents. Similar to a Uniform Resource Name (URN), but in contrast to a Uniform Resource Locator (URL), a DOI is not dependent upon the electronic document's location. The International DOI Foundation (IDF) defines a DOI name as "a digital identifier for any object of intellectual property"; it explains that the DOI is used for "persistently identifying a piece of intellectual property on a digital network and associating it with related current data in a structured extensible way." A DOI is typically used to give a scholarly article a unique identifying number that anyone can use to obtain information about the publication's location on a digital network. A DOI is issued by a designated registration agency. However, what we need to identify are not only digital objects, but also abstract entities, like the Avataṁsaka as a work in the FRBR sense, Śikṣananda's translation of it as an expression in the FRBR sense and so on.
A third identifying system to look at is the Life Science Identifier (LSID), which has been used in the life sciences and related fields since its introduction in late 2004. It has a hierarchical structure, constructed according to the rules for a URN as follows:
<LSID> ::= ‘urn:’ ‘lsid:’ <AuthorityID> ‘:’ <AuthorityNamespaceID> ‘:’ <ObjectID> [‘:’ <RevisionID>]
This means that different authorities can handle their own namespace as they wish and can thus re-use existing identifiers where necessary. LSID identifiers are, for example, also used by PubMed or in the Catalogue of Life project; the latter records the identifier for Homo sapiens as follows:
Homo sapiens urn:lsid:catalogueoflife.org:taxon:d84baba6-29c1-102b-9a4a-00304854f820:ac2008
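The hierarchical structure can be taken apart mechanically, as the following sketch shows. The regular expression is a simplified reading of the grammar given above, not a full implementation of the LSID specification: it merely requires that the parts contain no colons or whitespace, and it matches the urn:lsid: prefix case-insensitively, which is an assumption following general URN conventions. The names LSID and parse_lsid are chosen for illustration.

```python
import re
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]  # None when no revision part is present


# Simplified pattern: four colon-separated parts, the last one optional.
_LSID_RE = re.compile(
    r"^urn:lsid:([^:\s]+):([^:\s]+):([^:\s]+)(?::([^:\s]+))?$",
    re.IGNORECASE,
)


def parse_lsid(text):
    """Split an LSID string into its parts or raise ValueError."""
    match = _LSID_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid LSID: {text!r}")
    return LSID(*match.groups())


homo = parse_lsid(
    "urn:lsid:catalogueoflife.org:taxon:"
    "d84baba6-29c1-102b-9a4a-00304854f820:ac2008"
)
```

For the Catalogue of Life example this yields the authority catalogueoflife.org, the namespace taxon, the long object identifier and the revision ac2008.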
LSID authorities can provide an LSID resolution service (LSRS) for their LSIDs. The LSRS specification defines a standard interface for retrieval of data and metadata regarding identified objects or named concepts via multiple protocols. Metadata can be used to further describe every named data object or concept directly and also to convey semantically meaningful relationships among the objects and concepts referred to by LSIDs.
LSID resolution consists of four steps: a client acquires the LSID to an object of interest; the client locates an LSRS for the LSID through a priori knowledge or use of the LSID Resolution Discovery Service (LRDS); the client sends a getAvailableServices() request to the LSRS providing the LSID as a parameter and the LSRS returns locations and protocols for services capable of retrieving data or metadata about this LSID; the client selects one of the services and sends it a getData() or getMetaData() request.
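The four steps can be traced in a small sketch that replaces all network services with in-memory stand-ins. The class names, the locate method and the stored record are invented for illustration; only the operation names getAvailableServices, getData and getMetaData come from the description above, written here in Python naming style.

```python
class StubDataService:
    """Stand-in for a service that returns data or metadata for an LSID."""

    def __init__(self, records):
        self._records = records  # maps LSID -> (data, metadata)

    def knows(self, lsid):
        return lsid in self._records

    def get_data(self, lsid):
        return self._records[lsid][0]

    def get_metadata(self, lsid):
        return self._records[lsid][1]


class StubResolutionService:
    """Plays the role of an LSRS: tells the client which services can help."""

    def __init__(self, services):
        self._services = services

    def get_available_services(self, lsid):
        return [s for s in self._services if s.knows(lsid)]


class StubDiscoveryService:
    """Plays the role of the LRDS: maps an authority to its LSRS."""

    def __init__(self, by_authority):
        self._by_authority = by_authority

    def locate(self, lsid):
        authority = lsid.split(":")[2]  # third part of urn:lsid:<authority>:...
        return self._by_authority[authority]


def resolve(lsid, discovery, want="metadata"):
    """Steps 2 to 4; step 1 is having acquired the LSID in the first place."""
    lsrs = discovery.locate(lsid)                  # step 2: find the LSRS
    services = lsrs.get_available_services(lsid)   # step 3: ask what is offered
    if not services:
        raise LookupError(f"no service can resolve {lsid}")
    service = services[0]                          # step 4: pick one and query it
    if want == "metadata":
        return service.get_metadata(lsid)
    return service.get_data(lsid)


HOMO = ("urn:lsid:catalogueoflife.org:taxon:"
        "d84baba6-29c1-102b-9a4a-00304854f820:ac2008")

# Placeholder record; the data and metadata values are made up.
data_service = StubDataService(
    {HOMO: (b"<placeholder record>", {"name": "Homo sapiens"})}
)
discovery = StubDiscoveryService(
    {"catalogueoflife.org": StubResolutionService([data_service])}
)

metadata = resolve(HOMO, discovery)
```

The indirection through the resolution service is what lets an authority move or add services without changing the identifiers themselves.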
A review article (Martin, Hohman and Liefeld 2005) evaluates its impact and looks into some problems with the specification. The problems mentioned have to do with the specifics of how an LSID is defined (it refers to a specific sequence of data, so a new LSID has to be generated if even one byte changes) and with the protocols used; another complaint is the under-specification of metadata.
An IBA-ID should take these experiences into account when developing a specification.
It should be clear from the above that there is a lot of work to do for IBA in coming up with a good numbering scheme, but some hints can certainly be taken from the LSID and the experience gained with it over the last years.
In addition to the numbering scheme, as outlined above, the development of microformats is desirable. These would make use of the IBA-ID and specify how it can be embedded, for example in web pages, blogs, online articles and other online publications, to make dynamic resource discovery possible.
Based upon the model, work has to start on actually editing the catalog. Many components that could go into the catalog are already available somewhere, but they need to be (at least virtually) unified under a common model and referenced using the IBA-ID. The catalog would thus also function as a resolution service for the IBA-ID, enabling the discovery of related resources.
Since there will be no single central web site that provides access to the IBA digital archives, it is essential that the associated digital archives provide access to their collections not only through the human-readable main entrance, but also through a side entrance that can be used by other web sites and general programs to access the content of the collections. This will likely include requests such as listings of the collections, but also search requests. Some protocols for this exist and are already used in other communities, but the IBA will have to evaluate them and agree on specific conventions for implementing them.
IFLA Cataloguing Section. 2008. Functional Requirements for Bibliographic Records (FRBR). http://www.ifla.org/VII/s13/frbr/.
Allsop, John. 2007. Microformats: Empowering Your Markup for Web 2.0. Apress.
Anon. Catalogue of Life: Indexing the world's known species. http://www.catalogueoflife.org/search.php.
Anon. Delicious. http://delicious.com/.
Cameron, Richard. CiteULike: Everyone's library. http://www.citeulike.org/.
Hull, Duncan, Steve Pettifer, and Douglas Kell. 2008. Defrosting the Digital Library: Bibliographic Tools for the Next Generation Web. PLoS Comput Biol 4, no. 10 (October 31): e1000204.
Illich, Ivan. 1973. Tools for Conviviality. Harper & Row. http://www.davidtinapple.com/illich/1973_tools_for_convivality.html.
Institute for Research in Humanities. Knowledgebase of Tang Persons. http://tkb.mydns.jp/pers-db/.
Khare, Rohit, and Tantek Çelik. 2006. Microformats: a pragmatic path to the semantic web. In Proceedings of the 15th international conference on World Wide Web, 865-866. Edinburgh, Scotland: ACM.
Mahoney, Richard. Indica et Buddhica - Portal. http://www.indica-et-buddhica.org/.
Martin, Sean, Moses Hohman, and Ted Liefeld. 2005. The impact of Life Science Identifier on informatics data. Drug Discovery Today 10, no. 22 (November 15): 1566-1572.
Muller, A. Charles. 1995. Digital Dictionary of Buddhism. http://www.buddhism-dict.net/ddb/.
Murugesan, S. 2007. Understanding Web 2.0. IT Professional 9, no. 4: 34-41.
Object Management Group. Life Science Identifier specification. http://www.omg.org/cgi-bin/doc?dtc/04-10-08.
Rheingold, Howard. 2000. The Virtual Community: Homesteading on the Electronic Frontier, revised edition. The MIT Press.
The International DOI Foundation. The Digital Object Identifier System. http://www.doi.org/index.html.
Wittern, Christian. WWW Database of Chinese Buddhist texts. http://www.kanji.zinbun.kyoto-u.ac.jp/~wittern/can/.
---. 2000. Buddhist Studies in the Digital Age. Chung-Hwa Buddhist Journal 13,2: 461-501.
---. 2005. The Text in the Age of Digital Reproduction. In The Role of Buddhism in the 21st Century. Proceedings of the Fourth Chung-Hwa International Conference on Buddhism, 389-414. Taipei, Taiwan.
---. 2007. Digital Text, Meaning and the World: Preliminary considerations for a Knowledgebase of Oriental Studies. In 東アジアにおける儀礼と刑罰, 41-58. Seoul, Korea: Institute for Research in the Humanities.