Competing in a World Scoped by Google

Dr Joanna Richardson [HREF1], Digital Repository Administrator, Division of Information Services [HREF2], Griffith University [HREF3], Queensland, 4111. J.Richardson@griffith.edu.au

Abstract

This paper explores the challenges facing organisations--particularly libraries and educational institutions--in getting their content in front of the end-user with the same degree of apparent ease that search engines such as Google have achieved. The author examines a range of initiatives which make material traditionally viewed as part of the deep Web more accessible and concludes that the current generation--as it grows up with the Internet--will help to create the resource discovery tools and content management systems of the future.

Introduction

In today's university, librarians and academic staff are acknowledging and dealing with the impact of the Internet on students. Richardson and Hopkins [HREF4] advise that libraries need to know their customers and to make use of market research and surveys such as the Pew Internet & American Life project to underpin planning. Balas (2002) recommends these kinds of surveys in the quest to attract potential library patrons: "We need to design and develop library services that use the Internet effectively to serve the needs of the new technically competent generation of potential library patrons." The Pew Internet Data Memo "College Students and the Web", for example, contains information about the kinds of web sites college students use and their e-commerce habits [HREF5].

Significant numbers of library users, especially in the academic environment, have grown up with the Internet. Jones [HREF6] reports that "For most college students the Internet is a functional tool, one that has greatly changed the way they interact with others and with information as they go about their studies." They use it for communication via email, instant messaging and online chat, for both social and educational purposes, as well as for entertainment, file sharing and shopping.

Jones affirms that the "Internet has changed the way students use the library" [HREF6]. Computer use within libraries has been predominantly for commercial search engines rather than university and library web sites. This trend has been coupled with reported decreases in traditional scholarly citations in student research papers. Among school students who go online every day "the ease and speed of online research [are] their main reasons for relying on the Web instead of the Library" [HREF7].

In questioning the future role of the library, Sommers suggests that libraries are and will continue to be in competition with the Internet. "It's a battle not for the person our age but for those people who are five and six years old. They're going to grow up with a whole different perspective. The future of libraries 20 years from now is being set today" [HREF8]. Roha (2000) provides a vision of the Internet future, which is characterised by pervasive computing--with the PC taking a backseat to gadgets, wireless connection round-the-clock and highly personalised web services. She says that this may take time to be realised, but the technology already exists.

The Impact of Google

If at AusWeb95 we were talking about Mosaic (whose developers and licensed code later underpinned Netscape and Internet Explorer), then 10 years later we are talking about Google [HREF9], a search engine which has experienced a meteoric rise in a somewhat fickle environment. The neophyte PC user immediately learns to cope with email and Google.

It is this ubiquitous use of web technology--including Google, of course--which has helped influence students as already outlined by Jones. In a paper describing the ARROW project [HREF10], Treloar [HREF11] explains that access would include SRW (Search/Retrieve Webservice [HREF12], an XML-oriented protocol designed to be a low "barrier to entry" solution to performing searches and other information retrieval operations across the Internet) and Google, because "otherwise students won't find it!"
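To give a concrete sense of how low that "barrier to entry" is meant to be, the sketch below issues a searchRetrieve request in the style of SRU, the URL-based companion to SRW (SRW proper wraps the same operation in SOAP). The endpoint URL, index name and query are assumptions for illustration only, not a real service.

```python
# A minimal sketch of an SRU-style searchRetrieve request.
# The base URL below is hypothetical; substitute a real SRW/SRU server.
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://repository.example.edu/sru"   # hypothetical endpoint

params = {
    "operation": "searchRetrieve",
    "version": "1.1",
    "query": 'dc.title = "institutional repositories"',  # CQL query; index names depend on the server
    "maximumRecords": "10",
    "recordSchema": "dc",   # ask for Dublin Core records
}

with urlopen(f"{BASE_URL}?{urlencode(params)}") as response:
    xml = response.read().decode("utf-8")   # XML response listing matching records

print(xml[:500])
```

The attraction for projects such as ARROW is that a single HTTP request of this kind can be issued by any web application, without the overhead of a full library protocol stack.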

As a corollary, in educational programs there are concerns that the growing availability of material on the web has increased the temptation--and ease--for students to plagiarise. Educators for their part have used the Web to validate their suspicions, eg entering keywords from students' papers into Google. The obvious disadvantage is that this process is very laborious because of the necessity to pick out keywords for each assignment topic. In addition URLs cited in papers tend to disappear quite quickly from the Web, which can make it challenging to follow up a citation--let alone for a student to actually cite the original source. This "link rot" has been one impetus for the Internet Archive [HREF13], while the plagiarism problem itself has spawned a number of electronic detection products, including the very successful "Turnitin" software [HREF14].
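Purely to illustrate why the manual approach is so laborious, the following deliberately naive sketch pulls a handful of word sequences from a (hypothetical) essay file and turns them into exact-phrase web search URLs; every resulting page of hits still has to be inspected by a human.

```python
# A naive sketch of the manual plagiarism-checking workflow described above.
# The file name and phrase-selection heuristic are assumptions for illustration.
from urllib.parse import quote_plus

def candidate_phrases(text, length=8, step=40):
    """Slice the essay into short word sequences to use as exact-phrase queries."""
    words = text.split()
    return [" ".join(words[i:i + length]) for i in range(0, len(words), step)]

essay = open("assignment.txt", encoding="utf-8").read()   # hypothetical essay file

for phrase in candidate_phrases(essay)[:5]:
    # Each URL must still be checked by hand, which is what makes the approach laborious.
    print("https://www.google.com/search?q=" + quote_plus(f'"{phrase}"'))
```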

MacKenzie Smith [HREF15] has quite logically questioned the role of the library in an age of digital publishing and Web indexes like Google. New formats present new preservation challenges in which the responsibilities have not yet been worked out. Therefore it is not surprising that libraries are becoming more aggressive about having a role in the life-cycle of the scholarly communication process. Institutional repositories are a first step towards addressing these challenges. Such repositories are institution-based, collect scholarly material in digital formats, and are cumulative, perpetual, open and interoperable (but not necessarily free).

Developed jointly by MIT Libraries and Hewlett-Packard, DSpace [HREF16] is a digital institutional repository system to capture, store, index, preserve, and redistribute the intellectual output of a university's research faculty in digital formats. It is now freely available to research institutions worldwide as an open source system that can be customised and extended. This is in keeping with MIT's "gift economy" philosophy: you give something away and you get something back, because the community at large contributes to enhancing the software. MIT is not alone of course in espousing this concept; Pinchot [HREF17] and Barbrook [HREF18] among others have debated its relative merit.

The project started with preprints and articles, but academics indicated that they were really interested in "grey" literature. They also wanted to organise their own teaching materials. Therefore, along with serving its original purpose, DSpace can also be used as a simple digital asset management system. What is of specific interest to this paper is that DSpace is working with Google to enhance access to content, which echoes Treloar's comment about the Monash experience.

Lorcan Dempsey [HREF19] from OCLC also echoed this theme when he advised that OCLC had released records to Google to ensure that they were indexed. Google's executives themselves consider that "cataloguing the Web is only the beginning" [HREF20]. "My guess is about 300 years until computers are as good as, say, your local reference library in doing [a] search," says Craig Silverstein, Google's Director of Technology [HREF20].

Despite the advances in Web search engines, the problems associated with performing a keyword search have been well documented in the literature. According to Tudhope and Koch [HREF21], knowledge organisation systems / services (KOS) are emerging which can enhance resource discovery and retrieval. A recent special issue of the Journal of Digital Information has been devoted to just this topic [HREF22].

What Google does not [yet] access

In contrast with the apparent power of Google we need to balance the following concern: how do you access material that is hidden from search robots in OPACs or behind a database or another search engine? This large volume of material, which is inaccessible to most users, is known as the deep Web [HREF23] or dark matter [HREF24].

In this environment institutions, including universities, are trying to collect and preserve their teaching, learning and research output, hence the emergence of digital repositories. The oft-cited OCLC Collections Grid diagram [HREF25] graphically illustrates the major types of content: published (eg books, journals), special collections (eg rare books, theses), open web (eg open source software), and institutional (eg learning objects, research data). It is a useful diagram for showing the complexity of collections that institutions are trying to archive.

Increasingly libraries are taking responsibility for born-digital collections, eg geospatial or numeric data sets, faculty or class websites [HREF26]. As Waters [HREF27] has reported, seven institutions were funded by the Mellon Foundation to design projects to explore the utilisation of the then newly developed OAI Metadata Harvesting Protocol. Particular emphasis was placed on delivering scholarly information from the deep / hidden Web, eg internet-accessible databases, including library catalogues, that are not normally accessible to Internet search engines.
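For readers unfamiliar with the OAI Metadata Harvesting Protocol, the sketch below shows roughly what harvesting looks like: a simple HTTP request with a `verb` parameter, returning Dublin Core records as XML. The repository base URL is hypothetical, and a full harvester would also follow `resumptionToken` paging, which is omitted here for brevity.

```python
# A minimal sketch of harvesting Dublin Core records via OAI-PMH.
# The base URL below is a hypothetical OAI-PMH endpoint.
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

BASE_URL = "http://repository.example.edu/oai"
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def list_records(metadata_prefix="oai_dc"):
    """Issue a ListRecords request and yield (identifier, title) pairs."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    with urlopen(f"{BASE_URL}?{urlencode(params)}") as response:
        tree = ET.parse(response)
    for record in tree.iter(f"{OAI}record"):
        identifier = record.findtext(f".//{OAI}identifier")
        title = record.findtext(f".//{DC}title")
        yield identifier, title

for identifier, title in list_records():
    print(identifier, "->", title)
```

It is precisely this kind of machine-to-machine harvesting that lets services aggregate metadata from deep Web sources which ordinary search engine crawlers never reach.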

Earlier this year one of the Mellon Foundation participants, the University of Michigan [HREF28], announced a new agreement between the University and Yahoo! Inc, which makes available a repository developed through Michigan's University Library OAIster Project. OAIster offers information that links to hidden digital resources such as the complete contents of books and articles, white papers, preprints, technical reports, movies, images and audio files. In addition the service provides a direct link to actual digital objects, not just the relevant descriptive information.

In looking at trends in libraries, Dempsey highlights increased disclosure and licensing of special collections along with increased archiving. As for the Google impact, he asks: "Will people who live in a world scoped out by Google ever access the other 3 parts of the [Collections] Grid?" [HREF29]

Cole's ongoing research into the effects of the Internet [HREF30] offers hope: "[The Internet's] impact on how we learn, both formally and informally, has been minimal and limited to the periphery of education, in areas like Web sites for courses and small amounts of distance learning. Over the next 10 years, as children who grew up with the Internet become teachers and administrators, they will begin to apply the Internet to the foundations of learning."

This can be extended to how these same children, once grown up, will publish and share software, research and creative output. They will probably do all of this via the Web (or its replacement) because there will be a strong culture to do so.

We are sitting on the cusp of a major re-thinking of how intellectual property (IP), and especially research, is managed and shared. Researchers are being encouraged to deposit their research results in open access archives. Professor Tony Hey, head of the UK's e-Science Core Programme, is responsible for setting up the computing infrastructure in the UK for "e-science", ie global collaborative science [HREF31]. He is anxious for academic libraries to take on a role as curators of universities' digitised IP, which includes managing databases of research data and maintaining complex metadata.

If we examine the role that the World Wide Web Consortium (W3C) has taken in developing the Semantic Web (formerly known as the Metadata Activity) [HREF32], there is clearly a vision for the sharing of scientific, commercial, organisational and cultural data. The goal of the Semantic Web initiative is to "create a universal medium for the exchange of data". What is heartening about the W3C's view of current trends is that "facilities to put machine-understandable data on the Web are quickly becoming a high priority for many organizations, individuals and communities. The Web can reach its full potential only if it becomes a place where data can be shared and processed by automated tools as well as by people."
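As a small illustration of what "machine-understandable data on the Web" can look like in practice, the sketch below uses the rdflib Python library to describe a hypothetical repository item with Dublin Core properties and serialise it as RDF, so that automated tools as well as people can process the statement.

```python
# A minimal sketch of publishing machine-understandable metadata as RDF,
# using the rdflib library. The item URI is hypothetical.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

g = Graph()
item = URIRef("http://repository.example.edu/item/123")  # hypothetical identifier

# Describe the item with Dublin Core properties.
g.add((item, DC.title, Literal("Competing in a World Scoped by Google")))
g.add((item, DC.creator, Literal("Joanna Richardson")))
g.add((item, DC.date, Literal("2004")))

# Serialise as Turtle; other RDF syntaxes (RDF/XML, N-Triples) work the same way.
print(g.serialize(format="turtle"))
```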

Of course it remains to be seen how widespread this standard will become--if at all. In an effort to move this promising initiative along, a Working Group has initiated a 'challenge' to encourage developers to "build an online application that integrates, combines, and deduces information needed to assist users in performing tasks". Entries for 2003 [HREF33] give a preliminary indication of the range of possible applications. However, regardless of the ultimate success of the Semantic Web, the important point is that there is clear recognition of the importance of and difficulties in sharing and reusing data across applications as well as organisations.

The Impact of Copyright and Digital Rights Management

There is considerable tension right now regarding digital rights management (DRM), intellectual property (IP) and copyright, all of which have a major impact on what we get to see, especially via the Internet. Barry and Richardson [HREF34] have explored the tensions between those in government and business who wish to control the Internet and those who wish to use its capabilities to introduce new types of services, and between those who lessen the power of the individual and those who create synergisms between individuals and groups. The authors concluded that a general pattern was emerging: on the one hand, the actions of large institutions attempting to retain their pre-Internet business and governmental models; on the other, the actions of individuals and groups trying to preserve the original spirit of the Internet.

Despite the fact that new technologies should be making materials more readily available, many valuable resources are in fact behind database "walls" or available only on a user-pays basis [HREF35]. In a joint Council of Australian University Librarians-National Library of Australia issues paper [HREF36], some of the key issues in information access, use and delivery affecting higher education were identified as: cost of purchasing or licensing information, duplication of resources by institutions, establishment of benchmarking and quality assurance measures, digital rights management and the cost of using copyright material.

Van de Sompel [HREF37] identifies "roadblocks" in the digital library arena that may seriously impede efforts to move towards an integrated scholarly knowledge environment as envisioned in the Cyber Infrastructure Report [HREF38]. In that report Atkins argues that a new age has dawned in scientific and engineering research, driven by continuing progress in computing, information and communication technology. The capacity of this technology has now made possible a comprehensive "cyberinfrastructure" on which to build new types of knowledge environments. Van de Sompel for his part is concerned about the inability to freely process scientific assets, the complexities that we are forced to build into our systems in order to accommodate the existing rights framework, and the danger that datasets, simulations, etc will go down the same path as research papers.

In looking at the whole scholarly communication system, Van de Sompel describes it as "merely a scanned copy of the paper-based system". Within it, assets that are inherently linked are not technically linked, so these relationships have to be communicated after the fact. "The problem is that relationships, which are known at the moment a scholarly asset goes through a step in a value chain, are lost the immediate moment thereafter, and in many cases are forever lost." We do not yet have the underlying infrastructure that can capture these relationships so they can be analysed later. This is undoubtedly the type of material which Dempsey had in mind when he wondered whether and how users would access Web content that Google cannot reach.

In their paper Barry and Richardson [HREF34] highlighted the Creative Commons as an important initiative to watch in the resistance to the privatisation of public knowledge. According to its founder, Lawrence Lessig, "[It] will make available flexible, customizable intellectual-property licenses that artists, writers, programmers and others can obtain free of charge to legally define what constitutes acceptable uses of their work. The new forms of licenses will provide an alternative to traditional copyrights by establishing a useful middle ground between full copyright control and the unprotected public domain" [HREF39].

Since 2002 the Creative Commons Project [HREF40] has evolved into a worldwide initiative, with the Queensland University of Technology becoming the first Australian institutional affiliate in February 2004. At the public launch--via video link from Silicon Valley--Lessig [HREF41] discussed the need to rebalance the law to fit new technologies; otherwise restrictive laws will inevitably stunt the growth of digital materials. An important role of the International Creative Commons (iCommons) is to translate and port the Creative Commons licences into each national jurisdiction.

Lessig's view of the future of the Internet as a contest between opposing 'forces' underpins much of the discussion in the Barry and Richardson paper: either technology backed by law, which controls every transaction, or free content co-existing with digital rights management.

Conclusion

There is light on the horizon. A culture is currently evolving which not only values all content within the Collections Grid but also actively attempts to 'web enable' that content. Key initiatives include, on the one hand, projects such as the Creative Commons, which strive to achieve a balance between intellectual property and creativity, and, on the other, the emergence of institutional digital repositories to enhance resource discovery. Coupled with this is the fact that the children of today, who are growing up with the Internet, are the very people who will create the Web of the future.

References

Balas, J. (2002). "Meeting Expectations" in Computers in Libraries v.22 n.1 p.46-48.

Roha, R.R. (2000). "The Dot-Me Decade" in Kiplinger's Personal Finance v.54 n.1 p.124-126.

Hypertext References

HREF1
http://jprglobal.tripod.com/
HREF2
http://www.griffith.edu.au/ins/org/flas/access/repository/
HREF3
http://www.griffith.edu.au/
HREF4
http://www.vala.org.au/vala2004/2004pdfs/47RicHop.PDF
HREF5
http://www.pewinternet.org/reports/pdfs/PIP_College_Memo.pdf
HREF6
http://www.pewinternet.org/reports/pdfs/PIP_College_Report.pdf
HREF7
http://www.pewinternet.org/reports/pdfs/PIP_Schools_Report.pdf
HREF8
http://www.libraryjournal.com/article/CA302408?display=searchResults&stt=001&text=%22integrated+library+systems%22
HREF9
http://www.google.com
HREF10
http://eprint.monash.edu.au/archive/00000046/
HREF11
http://www.vala.org.au/vala2004/2004pdfs/21HrSaTr.pdf
HREF12
http://www.oclc.org/research/projects/webservices/default.htm
HREF13
http://www.archive.org/
HREF14
http://www.turnitin.com
HREF15
http://www.vala.org.au/vala2004/2004pdfs/69Smith.PDF
HREF16
http://www.dspace.org/
HREF17
http://www.context.org/ICLIB/IC41/PinchotG.htm
HREF18
http://www.firstmonday.dk/issues/issue3_12/barbrook/
HREF19
http://www.oclc.org/research/presentations/dempsey/vala_20040205.ppt
HREF20
http://www.cbsnews.com/stories/2004/03/25/sunday/main608672.shtml
HREF21
http://jodi.ecs.soton.ac.uk/Articles/v04/i04/editorial/
HREF22
http://jodi.ecs.soton.ac.uk/?vol=4&iss=4
HREF23
http://searchwebservices.techtarget.com/sDefinition/0,,sid26_gci558034,00.html
HREF24
http://www9.org/final-posters/poster30.html
HREF25
http://www.oclc.org/membership/escan/appendices/collectiongrid.htm
HREF26
http://www.arl.org/newsltr/225/main.html
HREF27
http://www.arl.org/newsltr/217/waters.html
HREF28
http://www.umich.edu/news/index.html?Releases/2004/Mar04/r031004
HREF29
http://www.oclc.org/research/staff/dempsey/recombinant_library/dempsey_recombinant_library.htm
HREF30
http://chronicle.com/prm/weekly/v50/i30/30b01801.htm
HREF31
http://www.cilip.org.uk/update/issues/mar04/article4march.html
HREF32
http://www.w3.org/2001/sw/
HREF33
http://challenge.semanticweb.org
HREF34
http://ausweb.scu.edu.au/aw02/papers/refereed/barry/paper.html
HREF35
http://www.vala.org.au/vala2002/2002pdf/15McCEva.pdf
HREF36
http://www.caul.edu.au/caul-doc/infrastructure-caul-nla.doc
HREF37
http://www.sis.pitt.edu/%7Edlwkshop/paper_sompel.html
HREF38
http://www.communitytechnology.org/nsf_ci_report/
HREF39
http://www.sfgate.com/cgi-bin/article.cgi?file=/gate/archive/2002/02/11/creatcom.DTL
HREF40
http://creativecommons.org/
HREF41
http://www.law.qut.edu.au/files/Creative_Commons_Launch_Brisbane.pdf

Copyright

Joanna Richardson, © 2004. The author assigns to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The author also grants a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.