Wonders of Web 2.0… my “copy and blog” of the web

6 May 2009

ELAG 2009 (day #2) | BibLibre

Filed under: Web 2.0. Posted by Rémi SOUBEYRAND @ 14:16

Sebastian Hammer. A tool-based approach to library application development

S. Hammer is the founder of the company IndexData. We work with their tools, in particular the Zebra engine and the YAZ Z39.50 client/server, as well as pazpar2, their federated search tool. All of these tools are released under free software licenses.

I really like IndexData, a small company (11 people) that knows exactly what it wants to do: build “foundational” tools that have a long lifespan and tackle problems that are hard to solve.

Sebastian presented a tool that will be available this summer (the license is still to be determined). It is essentially a Firefox plugin that lets you build, graphically and very (very) easily, a federated search connector for just about any site. He gave a short demo on www.npr.org, and it is indeed very easy. The plugin saves the connector as a simple XML file. This file contains two kinds of information: the information needed to make the connector itself work (field names, etc.) and test data (e.g. a value to search for) that make it possible to run a script at regular intervals to check that the connector still works.
Roughly speaking, a connector that previously could take up to 4 days to produce can now be built in 30 minutes.

Blogged with the Flock Browser

21 April 2009

Helpful Searching Tips : MyResearch Help : vufind

Filed under: Web 2.0. Posted by Rémi SOUBEYRAND @ 13:46

Helpful Searching Tips

Wildcard Searches

To perform a single character wildcard search use the ? symbol.

For example, to search for “text” or “test” you can use the search:

te?t

To perform a multiple character, 0 or more, wildcard search use the * symbol.

For example, to search for test, tests or tester, you can use the search:

test*

You can also use the wildcard searches in the middle of a term.

te*t

Note: You cannot use a * or ? symbol as the first character of a search.
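The ? and * rules above happen to coincide with shell-style globbing, so they can be tried out with Python's standard fnmatch module. This is only a sketch of the wildcard semantics, not of how VuFind evaluates a search, and the list of terms is invented for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical index terms, used only for illustration.
terms = ["test", "text", "tests", "tester", "tent", "toast", "tet"]

def wildcard_matches(pattern, candidates):
    """Return the candidates that match a ?/* wildcard pattern."""
    return [t for t in candidates if fnmatchcase(t, pattern)]

single = wildcard_matches("te?t", terms)     # ? stands for exactly one character
trailing = wildcard_matches("test*", terms)  # * stands for zero or more characters
middle = wildcard_matches("te*t", terms)     # wildcards also work mid-term
```

Note that te*t also accepts “tet”, because * may match zero characters, while te?t does not.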

Fuzzy Searches

Use the tilde ~ symbol at the end of a Single word Term. For example to search for a term similar in spelling to “roam” use the fuzzy search:

roam~

This search will find terms like foam and roams.

An additional parameter can specify the required similarity. The value is between 0 and 1; the closer it is to 1, the more similar a term must be in order to match. For example:

roam~0.8

The default that is used if the parameter is not given is 0.5.
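To get a feel for what the similarity value means, here is a minimal sketch based on edit distance. The scoring formula (1 minus the edit distance divided by the length of the shorter word) and the strict "greater than" comparison are assumptions made for illustration; the search engine behind VuFind may compute similarity differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Assumed scoring: 1 - distance / length of the shorter word."""
    return 1 - edit_distance(a, b) / min(len(a), len(b))

def fuzzy_matches(term, candidates, minimum=0.5):
    """Keep the candidates whose similarity to term exceeds the minimum."""
    return [c for c in candidates if similarity(term, c) > minimum]

default_hits = fuzzy_matches("roam", ["foam", "roams", "table"])
strict_hits = fuzzy_matches("roam", ["foam", "roams", "table"], minimum=0.8)
```

Under these assumptions “foam” and “roams” each score 0.75 against “roam”, so they pass the default threshold of 0.5 but not a threshold of 0.8.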

Proximity Searches

Use the tilde ~ symbol at the end of a Multiple word Term. For example, to search for economics and keynes within 10 words of each other:

"economics Keynes"~10

Range Searches

To perform a range search you can use the { } characters, which exclude the endpoints themselves, or the [ ] characters, which include them. For example, to search for a term that starts with either A, B, or C:

{A TO C}

The same can be done with numeric fields such as the Year:

[2002 TO 2003]

Boosting a Term

To apply more value to a term, you can use the ^ character. For example, you can try the following search:

economics Keynes^5

This gives more value to the term “Keynes”.

Boolean Operators

Boolean operators allow terms to be combined with logic operators. The following operators are allowed: AND, +, OR, NOT and -.

Note: Boolean operators must be ALL CAPS
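One way to picture these operators is as set operations over the records that contain each term. The sketch below uses a tiny, made-up collection of records; it illustrates the logic only and ignores relevance ranking, which is where an optional term next to a + term would make a difference.

```python
# Hypothetical records: an ID mapped to the set of terms the record contains.
records = {
    1: {"economics", "keynes"},
    2: {"economics", "hayek"},
    3: {"keynes", "biography"},
    4: {"history"},
}

def having(term):
    """IDs of the records that contain the given term."""
    return {rid for rid, terms in records.items() if term in terms}

either = having("economics") | having("keynes")   # economics OR Keynes
both = having("economics") & having("keynes")     # economics AND Keynes
without = having("economics") - having("keynes")  # economics NOT Keynes, or economics -Keynes
required = having("economics")                    # +economics Keynes (Keynes is optional)
```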

OR

The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, the OR operator is used. The OR operator links two terms and finds a matching record if either of the terms exist in a record.

To search for documents that contain either “economics Keynes” or just “Keynes” use the query:

"economics Keynes" Keynes

or

"economics Keynes" OR Keynes

AND

The AND operator matches records where both terms exist anywhere in the field of a record.

To search for records that contain “economics” and “Keynes” use the query:

"economics" AND "Keynes"

+

The “+” or required operator requires that the term after the “+” symbol exist somewhere in the field of a record.

To search for records that must contain “economics” and may contain “Keynes” use the query:

+economics Keynes

NOT

The NOT operator excludes records that contain the term after NOT.

To search for documents that contain “economics” but not “Keynes” use the query:

"economics" NOT "Keynes"

Note: The NOT operator cannot be used with just one term. For example, the following search will return no results:

NOT "economics"

-

The - or prohibit operator excludes documents that contain the term after the “-” symbol.

To search for documents that contain “economics” but not “Keynes” use the query:

"economics" -"Keynes"

VuFind FAQ

Filed under: Web 2.0. Posted by Rémi SOUBEYRAND @ 13:30

What is VuFind?
Where is VuFind?
What are the current VuFind development priorities?
What’s new with VuFind?
Is the classic catalog (WebVoyage) going away?
How do I choose between VuFind and the classic catalog (WebVoyage)?
Why are search results different between VuFind and the classic catalog?
How do I place a request in VuFind?
What about a “Request first available copy” feature for VuFind?
Why do I see the same title multiple times on the VuFind Results (hit list) page?
How can I search VuFind? What is indexed?
What can I search for in VuFind? What is included in the database?
What are VuFind “favorites”?
What are VuFind “tags”?
What if someone adds an inappropriate tag or comment to VuFind?
Why do I sometimes see the cover image and sometimes not?
Who uses VuFind?
What is the longer-term future for VuFind at CARLI?
Is VuFind the same as the eXtensible Catalog?

What is VuFind?

VuFind is an exciting new alternative interface to the I-Share catalog, offering users what may be a better way to search and discover library resources. In the fall of 2007, CARLI staff began experimenting with VuFind as an optional, accessible alternative to WebVoyage for interested I‑Share libraries. VuFind searches the Voyager database, the same data as the classic WebVoyage search option. In VuFind, just as in WebVoyage, users may check item availability and place requests.

Features of VuFind include:

  • A single simple search box
  • Facets to refine search by subject, title, topic, language, format, and more
  • The ability to request and to renew items from any I-Share library
  • Focus on home library catalog or expand to search all I-Share libraries at once
  • Up-to-the-minute item status and location information
  • User-created username and password, for easier access to MyAccount information
  • Links to electronic full text when available
  • One-click links to reviews for some titles
  • One-click links to book previews in Google Book Search for some titles

VuFind was developed initially at Villanova University for use in libraries. CARLI is working with the open source community to continue to develop and improve VuFind so that it will serve the unique needs of the consortial community, and serve libraries of all sizes within the CARLI I‑Share community and beyond.

Where is VuFind?

VuFind for all I-Share libraries: http://vufind.carli.illinois.edu/vf/
VuFind for a particular I-Share library: http://vufind.carli.illinois.edu/vf-xxx , where xxx is that library’s three-letter code. See the CARLI URL Builder for details about linking to I-Share in VuFind as well as to I-Share “Classic” in WebVoyage.

What are the current VuFind development priorities?

November 2008: Major local development efforts on VuFind are on hold, while we wait for the release of VuFind version 1.0.

Once VuFind 1.0 is released, re-integrating our local changes (primarily changes in support of Voyager Universal Borrowing) will be CARLI’s primary VuFind development activity for some time.

What’s new with VuFind?

Selected changes since February 2009:

  • Bibliographic data are now current as of 9PM yesterday
  • “Unstemmed” (exact match) search results now float to the top of author, title, and “all fields” Results pages
  • Date sort improved

Is the classic catalog (WebVoyage) going away?

No. Both the software vendor Ex Libris and the CARLI Office anticipate supporting WebVoyage (the web-based public catalog software for the Voyager library system) for years to come. Currently there are many features in the classic catalog that are not replicated in VuFind. VuFind has never been intended to replace the classic catalog; it is an alternative for library users.

How do I choose between VuFind and the classic catalog (WebVoyage)?

VuFind | Classic / WebVoyage
VuFind is rapidly evolving. | WebVoyage is a mature product.
VuFind is well suited for “discovery” & narrowing large search sets. | WebVoyage is better suited for known-item searching.
VuFind has a more modern look-and-feel. | WebVoyage allows more sophisticated, library-savvy searching.
VuFind makes titles searchable if they were in the catalog as of 9PM yesterday. The very newest records cannot be searched in VuFind. | WebVoyage makes titles searchable as soon as the library adds them to the catalog. The very newest records can only be searched in WebVoyage.
In VuFind, for now, the user selects the specific copy to request. This can be an advantage when a title is owned by many libraries and the classic catalog times out before it can finish the work of selecting an available copy for the user to request. | In WebVoyage, the user selects the title to request, then the system selects an available copy at random.
VuFind is more compliant with accessibility standards. Users who rely on screen readers and other adaptive technologies will have an easier time with VuFind. | WebVoyage is currently less compliant with accessibility standards. (WebVoyage version 7.0 will be better in this regard.)
VuFind already links to Google Book Search for “previews” of books’ content. | WebVoyage does not yet link to Google Book Search. (This will change.)
VuFind displays the user’s library’s SFX button (which might say “Find it,” “Get it,” “More,” etc.) for any title where electronic full text would be available to the user, even if the title is in another library’s catalog. | WebVoyage displays a library’s SFX button (“Find it,” “Get it,” “More,” etc.) on that library’s records only.
VuFind offers social software features like user-added comments and tagging. | WebVoyage does not offer social software features.

Why are search results different between VuFind and the classic catalog?

  • VuFind bibliographic data are about one day older than data searched via the classic catalog: VuFind is current as of 9PM yesterday. The very newest titles (those added today) will not be findable in VuFind today.
  • VuFind searches the local databases’ records, not the deduplicated I-Share union catalog.
    • Sometimes the same title is cataloged differently in different libraries. Even though the records are similar enough to be merged into the same record in the classic I-Share union catalog their differences can affect their “findability” in I‑Share via VuFind.
    • VuFind searches records that are in the local library’s database but, because of a library decision, are not in the classic I‑Share union catalog (the “Universal Catalog”) at all.
  • There is no exact correlation between VuFind indexes and classic catalog indexes, so it’s impossible to do “exactly the same search,” although in some circumstances you can get the same search result.

How do I place a request in VuFind?

At the moment, and unlike in the classic catalog (WebVoyage), requests in VuFind are for the particular copy, not for “any available copy” of a title.

To request an item via VuFind:

  1. Do a search and identify the copy of the title you want to request.
  2. To request it, you may either click on the word “Available” on the Results (hit list) page, or click on the Request tab on the Record (single-title display) page.
  3. If you have not already logged in to VuFind, you’ll be prompted to do so at this point.
  4. Log in with your VuFind username and password. If you do not already have a VuFind username, click on Create New Account. Your VuFind account needs to be “profiled” with your library borrower ID. Once you’ve set this up, you never have to worry about it again (unless your borrower ID changes). If you’ve never told VuFind what your library borrower ID is, then go to VuFind’s Your Account page and click on the User Account tab at the bottom of the list on the right. Enter your borrower ID, last name and library affiliation.
  5. Select the location at which you would like to pick up the item and submit the request.
    In the future, all you’ll need to do is sign on with your VuFind username and password. You won’t need to remember your borrower ID in order to request materials via VuFind.

What about a “Request first available copy” feature for VuFind?

We’re thinking of requesting in VuFind in three phases. Phase 1 is what’s in place now: “request this copy.” Phase 2 will be more like requesting in the classic catalog (WebVoyage): “request any available copy.” Phase 3 is more theoretical. If we get there, it’ll be “request any edition” or “request one of these very similar titles” or something like that.

Why do I see the same title multiple times on the VuFind Results (hit list) page?

VuFind searches across each of the I-Share libraries’ local catalogs. It does not search the deduplicated I-Share union catalog. If multiple libraries own exactly the same title, you will see that title repeated on the Results page for each owning library. You can use the facet options on the right side of the Results page to narrow your result set to a particular library quickly. The “Library” facet is always at the top of the list of facet options.

How can I search VuFind? What is indexed?

CARLI has not yet made many local changes to which fields are indexed in VuFind. This is something that will change over time. We anticipate indexing more fields in the future. Currently, the following fields are indexed:

VuFind search qualifiers: All Fields, Title, Author, Subject, ISBN/ISSN

SOLR field | MARC field [repeatable fields] | Description
allfields | 010-999 | All bibliographic fields
author | 100a | Author’s name
author2 | 110a, [700a] | Corporate author’s name or personal name of “other author”
confProc | 111ac, 711ab, 811ac | Conference name
contents | 505a | Table of contents
contentsAuthor | [505r] | Author names from the table of contents
contentsTitle | [505t] | Titles from the table of contents
dateSpan | 362a | Dates a serial was published (not necessarily the dates the library owns)
era | [650y], [651y] | Time period described by the work
genre | [655a] | Genres like “science fiction”
geographic | [651a] | Geographic subject headings
isbn | 020a | International Standard Book Number
issn | 022a | International Standard Serial Number
lccn | 010a | Library of Congress Control Number
newTitle | 785t | Succeeding title for a serial
oldTitle | 780t | Former title for a serial
otherStdNbr | 024a | Technical report number or other standard number
physical | 300b | Physical description (pagination, etc.)
publishDate | 260c | Year of publication
publisher | 260b | Publisher name
series | 440a, 830a | Series title
subject | [600a], [610a], [630a] | Names as subject headings
systemNbr | 035a | OCLC number or other system control number
title | 245ab | Title
title2 | 240a, 130a | Uniform title
topic | [650a] | Topical subject headings
url | [856u] | Address of electronic resource

In addition, VuFind allows searching by user-added tags.
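As a rough illustration of how the field list above can be read, the sketch below copies a handful of its rows into a lookup table and uses them to collect values from a simplified record. The record layout (a dictionary keyed by MARC tag plus subfield code), the function name and the sample values are all hypothetical; this is not how VuFind's own import works.

```python
# A few rows from the field list above: Solr field -> MARC tag/subfield sources.
FIELD_SOURCES = {
    "author": ["100a"],
    "author2": ["110a", "700a"],
    "isbn": ["020a"],
    "publisher": ["260b"],
    "topic": ["650a"],
}

def to_solr_fields(record):
    """Collect values for each Solr field from a simplified record.

    `record` is a hypothetical stand-in for a parsed MARC record: a dict
    mapping "tag + subfield code" (for example "650a") to a list of values.
    """
    doc = {}
    for solr_field, sources in FIELD_SOURCES.items():
        values = [v for source in sources for v in record.get(source, [])]
        if values:
            doc[solr_field] = values
    return doc

# Invented sample data, for illustration only.
sample = {
    "100a": ["Keynes, John Maynard,"],
    "650a": ["Economics.", "Money."],
    "260b": ["Macmillan,"],
}
solr_doc = to_solr_fields(sample)
```

Repeatable MARC fields such as 650a naturally yield several values for one Solr field, which is why each entry is kept as a list.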

What can I search for in VuFind? What is included in the database?

VuFind searches each I-Share library’s local Voyager database: the same data that are searched by the classic (WebVoyage) catalog for each library. Note that an aggregation of local library catalogs is different from the I-Share union catalog. See “Why are search results different between VuFind and the classic catalog?” VuFind does not index for searching any data other than I-Share library local Voyager catalog data and user-added tags. Although book reviews, author notes, etc. are accessible from VuFind, they are not searchable in VuFind.

What are VuFind “favorites”?

The “Favorites” list is comparable to the Bookbag feature that some libraries have enabled in the classic catalog (WebVoyage). Users may add titles to their favorites at any time, as long as they have signed in to their account. Adding a title to a favorites list is not like adding it to a shopping cart: it is not a prerequisite—or indeed any part of the process—for requesting a title or checking it out. The same title may appear on many users’ favorites lists.

When you mark an item as a favorite you have the opportunity to add tags. See “What are VuFind ‘tags’?” Favorites can be deleted, but tags cannot be deleted, except by CARLI staff.

What are VuFind “tags”?

The tag feature allows you to add tags (descriptors, subject headings, whatever you want to call them) of your own choosing to a record. The idea is that it will make it easier for you to find the records again if you want to. Tags are not indexed for searching via the general search input box, and right now CARLI has turned off the ability for you to begin a search by browsing for a tag you have (or anyone else has) entered, but after we upgrade to VuFind 1.0 it’s possible we’ll turn that feature back on. Right now, if you pull up a record that has a tag on it you can click on the hyperlinked tag to execute a search for anything else in the database to which someone has assigned the same tag.

When you mark an item as a “favorite,” you have the opportunity to add tags, but it is not necessary for an item to be a favorite in order for you to add a tag. You must be logged in to assign tags; tags are associated (in the background, not publicly) with the user who entered them.

On the single-title Record page, anyone can see all tags that have been added to a record, and from the Record page anyone can click to execute a search on any of the tags that were added in the process of making a record someone’s favorite. On the Favorites page, you see only your own tags for a title.

What if someone adds an inappropriate tag or comment to VuFind?

We do not anticipate a problem with offensive tags or comments. If someone discovers an offensive tag or comment in the database, CARLI staff can remove it. If offensive tagging or commenting becomes a problem we will require that taggers’ and commenters’ VuFind accounts be associated with a Voyager borrower ID, for greater accountability.

Why do I sometimes see the cover image and sometimes not?

The cover images that display in VuFind come from Syndetics Solutions. Not all I-Share libraries subscribe to this service. In VuFind for all I-Share libraries, cover images will display if they are available from Syndetics Solutions. In VuFind for a particular I-Share library, cover images will display if they are available from Syndetics Solutions and if that library has purchased a subscription to the cover image service.

Who uses VuFind?

VuFind is an alternative catalog interface available to all I-Share libraries. Some I-Share libraries have chosen to make VuFind their primary catalog interface; others still consider the classic (WebVoyage) catalog their primary catalog interface, but offer VuFind as an experimental alternative. Beyond Illinois, VuFind is in use at a number of academic and research libraries.

What is the longer-term future for VuFind at CARLI?

VuFind has been an experiment for CARLI: in rapid, open source development and in offering alternative catalog interfaces to existing catalog data. CARLI will continue to develop and support VuFind for as long as it makes sense to do so.

Is VuFind the same as the eXtensible Catalog?

No. The eXtensible Catalog project, based at the University of Rochester, is separate. However, CARLI’s experience with VuFind will inform CARLI’s participation in the eXtensible Catalog project. Many of the challenges we have faced in adapting VuFind to work in a large, resource-sharing consortial environment we will face again with the eXtensible Catalog. The work we have done to optimize the export and reindexing of bibliographic data from Voyager for VuFind each night will be directly applicable to the eXtensible Catalog project, where data will be harvested and reindexed from a variety of sources on a variety of schedules.

The Code4Lib Journal – Respect My Authority

Filed under: Web 2.0. Posted by Rémi SOUBEYRAND @ 12:57

Respect My Authority

Some simple modifications to VuFind, an open source library resource portal, improve the retrieval of both lists of works and information about authors from Wikipedia. These modifications begin to address ways that current “next-generation” catalogs fail to fully harness all of the bibliographic tools available for indexing and presenting author information. Simple methods such as those described in this article, which make use of full headings for authors, can offer marked improvements to these systems.

by Jonathan Gorman

Introduction

As current “next-generation” catalogs attempt to overcome the inadequacies of the previous generation, they have abandoned useful techniques that have evolved in the practice of cataloging. A good example is the display of search results for works by an individual author. Many of the last generation of catalogs offered large browse lists of unique authors with little guidance for choosing between authors. The current generation lets you find all the books written by people with the same name but still offers little to those who want the works of just one author. Relatively simple changes allow authority practice to be used more effectively in the next generation. This article will show a quick enhancement to the VuFind system that improves use of its heading information to group books by particular authors. A similar technique could be applied to many of the next-generation catalogs.

Generations of Catalogs and Authority Control

Many of us can still remember working with the card catalog. Given a large enough collection, you would find yourself flipping through “Johnson, James, 1705-”, “Johnson, James, 1777-”, “Johnson, James, 1835-”, and so on. Each card would have the author’s name along with the information about a particular work. Fanning the names gave a sense of what the person wrote or created. If the titles did not seem to match the search, you could just skip ahead until the pattern changed. One could learn to quickly scan the catalog in this manner, flipping the cards in a blur.

Then came the first generation of web interfaces. This generation typically used the concept of an author browse list. Instead of seeing sequences like:

Johnson, James, 1705-
Articles of visitation and enquiry

Johnson, James, 1705-
Sermon preached before the Right Honourable ...

Johnson, James, 1777-
Economy of health

Johnson, James, 1777-
Tour in Ireland: with meditations and reflections

You saw:

Johnson, James, 1705-
Johnson, James, 1777-

Examples of such lists can be seen in many currently used systems:

LoC Author Browse: Library of Congress, Voyager catalog.
Urbana Author Browse: Urbana Free Library, Horizon Information Portal.

In theory, instead of having to look through a hundred cards to find an author, the searcher need only view a screen or two. In reality, the old card based system’s speed was limited by manual dexterity and mental agility. In newer systems, though, users must wade through the tedium of multiple distinct actions, and their speed is limited by the time spent waiting for pages to load.

Compounding the problem was a maze of see, see also, and scope notes, presented in a poorly explained and unfriendly interface. Clicking an author link in an individual record took you to a list of works by that author in some systems, such as Voyager, while others, such as Horizon, took you to the appropriate place in the author browse.

Many next-generation catalogs take the opposite approach to the author browse list. Works by authors who share a name get lumped together, because the grouping is usually derived only from subfield a of the main entry or added entries in the MARC record. Records for “Carter, John, 1921-” will be interspersed with records for “Carter, John, 1912-”. This can be seen in implementations of VuFind, Evergreen, Worldcat.org, and Koha.

For smaller collections or less common names, this is not as big a hurdle. There is one famous Ray Bradbury and getting all the books by him is relatively easy using just his name. However, as collections grow, more common names start colliding. In the case of large collections with works spanning centuries it becomes much more difficult to find what works a library may have by a single individual. Your search can return hundreds of volumes with no obvious way to narrow the results down to the books of just one author.

VuFind’s Author Page

Can we do better in the next generation? I believe so.

Modifying VuFind

Why use VuFind?

  • The interface pages use PHP, the indexing and searches use Solr, and transforming the records for both indexing and display is done using XSLT. I am already familiar with all of these technologies.
  • VuFind is under the GPL (version 2), so I can share my modifications.
  • It’s easy to use as a stand-alone system. One just needs bibliographic records and minor modifications to the codebase to make it run without an ILS, allowing for easy experimentation.
  • There seems to be a surge of local interest in central Illinois (where I work) about VuFind.

Getting VuFind

Currently, VuFind is hosted at SourceForge. VuFind 0.7, running on Kubuntu 7.10, was used while writing this article, although release 0.8 is due soon with some potential changes (VuFind is a young and active project, so details here may differ from the most current version). You can either download the files or check it out of svn (svn co https://vufind.svn.sourceforge.net/svnroot/vufind/releases/VuFind-0.7 vufind).

Setting Up VuFind

You will want to follow the README file that comes with the VuFind distribution. See the wiki and check out the mailing lists if you have any issues.

Getting some records

For testing this system a small sample of around 2,000 records was used (catalog.xml). They were combined from two different searches using the YAZ Z39.50 client [1] and querying the Z39.50 server at CARLI [2], an academic consortium in Illinois (see Appendix I for details of this process). Feel free to use the created catalog.xml file by placing it into the import directory. The file isn’t meant to be an exhaustive scientific test, just an exercise in seeing how we can treat authorities differently.

Now, if you’re following along, you should see how VuFind normally uses the headings. Take the catalog.xml file, put it in the import directory, follow the VuFind import steps (see the README), and experiment with it. This will make sure you have the setup correct. However, the changes we are about to make will require re-indexing. The easiest way to do this is to simply remove the existing indexes and records in Solr. To do so, run the following curl commands while vufind.sh is running:

curl http://127.0.0.1:8080/solr/update \
--data-binary '<delete><query>[* TO *]</query></delete>' \
-H 'Content-type: text/xml; charset=utf-8'

curl http://127.0.0.1:8080/solr/update \
--data-binary '<commit />' \
-H 'Content-type: text/xml; charset=utf-8'

Later, after modifying the index you will need to run “./vufind.sh restart” and re-run the import steps described in the README.
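For readers who prefer a script to curl, the same two requests can be prepared with Python's standard library. This is a sketch that reuses the URL, header and XML payloads from the curl commands above; the helper name update_request is made up for the example. The requests are only built here, and sending them is left as a commented-out step because it needs a running Solr instance.

```python
from urllib.request import Request

# URL, header and payloads are copied from the curl commands above.
SOLR_UPDATE_URL = "http://127.0.0.1:8080/solr/update"
HEADERS = {"Content-type": "text/xml; charset=utf-8"}

def update_request(xml_body):
    """Build (but do not send) a POST request for Solr's update handler."""
    return Request(SOLR_UPDATE_URL, data=xml_body.encode("utf-8"),
                   headers=HEADERS, method="POST")

delete_all = update_request("<delete><query>[* TO *]</query></delete>")
commit = update_request("<commit />")

# To actually send them while Solr is running:
#   from urllib.request import urlopen
#   for request in (delete_all, commit):
#       urlopen(request).close()
```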

Modifying the index

To collocate all the records by a particular author, we’re going to take advantage of the system specified by AACR2 to allow people to distinguish headings.

A heading in a catalog record, when created according to the rules of AACR2 chapter 22 [3], should provide a unique string for a given author to be used in all the records for works by that author or about that author. Many libraries keep track of these headings by using authority files, one of the most frequently referred to being the Library of Congress Authorities.

The heading is derived mostly from the most well-known or commonly published form of an author’s name. Information is also added to make sure the heading of a new author does not duplicate an existing heading. In a MARC record these headings will be found in the main entry (100), used for the main author of a work; added entries (700), used to indicate people who contributed to a work but are not the main author; and subject added entries (600), used to indicate people the book is about. The MARC record splits the heading into several subfields, such as the personal name (subfield a), titles (c), and date (d) [4].

VuFind does index author information, but only a normalized version of the subfield a. It does not include the information used to distinguish between authors. When searching for authors we can search for all authors named Jackson, Michael, or by just part of the author name, but we cannot single out one particular author.

This is fine for user searches where people are likely to be searching using names. However, for internal purposes something closer to a unique identifier is needed so we can do things like describe just a particular author or show works by just one person. Why not use the heading in the record, which is already functioning as an identifier?

To change VuFind to use the full heading, it is useful to understand Solr, the index engine underlying VuFind. Queries are done by passing an HTTP GET request to the index engine. That returns information in a variety of ways about the search and the search results. Similarly, the index is constructed by uploading XML files consisting of name-value pairs via HTTP POST requests. An example might look like:

<add>
   <doc>
     <field name="id">1</field>
     <field name="author">Johnson, James</field>
   </doc>
</add>
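A message of this shape can also be generated programmatically. The following sketch uses Python's standard xml.etree.ElementTree to serialize name-value pairs into the same structure; the function name build_add_document is made up for the example and is not part of VuFind or Solr.

```python
import xml.etree.ElementTree as ET

def build_add_document(fields):
    """Serialize name/value pairs as a Solr <add><doc>...</doc></add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")

message = build_add_document([("id", "1"), ("author", "Johnson, James")])
```

Building the XML through a library rather than string concatenation means characters such as & or < in a heading are escaped automatically.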

The various fields Solr will index in these uploaded documents are specified in a file called schema.xml (located in VuFind at solr/conf/schema.xml). In order to allow the indexing engine to keep track of the full heading, we’ll add a line to the schema.xml file. Find the fields element and add the line:

<field name="authornaf" type="string" indexed="true" stored="true" />

Now if the uploaded record contains the field name “authornaf” it will be added to the index. The type “string” is used because a “string” field requires a search to be an exact match in order to return a record. So a search for an authornaf value of aJACKSON,_MICHAEL would not return a record that had aJACKSON,_MICHAEL,d1942-. Currently, the “author” field uses “text”, which allows for partial matches in search. So searching for Jackson would return the record “Jackson, Michael,”.
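The difference between the two field types can be mimicked in a few lines. The sketch below is a deliberate simplification: real Solr “text” fields go through a configurable analysis chain, whereas here a text match just means that every word of the query appears among the lower-cased words of the stored value. Both function names are hypothetical.

```python
import re

def string_field_matches(stored, query):
    """Simplified "string" behaviour: the query must equal the whole value."""
    return stored == query

def text_field_matches(stored, query):
    """Simplified "text" behaviour: each word of the query must be a token."""
    def tokenize(value):
        return set(re.findall(r"\w+", value.lower()))
    return tokenize(query) <= tokenize(stored)

exact_miss = string_field_matches("aJACKSON,_MICHAEL,d1942-", "aJACKSON,_MICHAEL")
exact_hit = string_field_matches("aJACKSON,_MICHAEL,d1942-", "aJACKSON,_MICHAEL,d1942-")
partial_hit = text_field_matches("Jackson, Michael,", "Jackson")
```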

Next we will modify the import process that reads in the catalog.xml file and creates the upload files for Solr. This process is driven by the import-solr.php script in import/. This script breaks each record element in the catalog.xml file into a string. Each of these strings has an XSLT transformation, marcxml2solr.xsl, applied to create Solr upload files. These files are then posted to Solr.

To do this, I’ve added a section to marcxml2solr.xsl:

original code

<xsl:if test="//marc:datafield[@tag=100]/marc:subfield[@code='a']">
  <field name="author">
    <xsl:value-of select="//marc:datafield[@tag=100]/marc:subfield[@code='a']"/>
  </field>
  <field name="author-letter">
    <xsl:value-of select="substring(//marc:datafield[@tag=100]/marc:subfield[@code='a'], 1, 1)"/>
  </field>
</xsl:if>

<xsl:if test="//marc:datafield[@tag=110]/marc:subfield[@code='a']">
  <field name="author2">
    <xsl:value-of select="//marc:datafield[@tag=110]/marc:subfield[@code='a']"/>
  </field>
</xsl:if>

modified code (added at about line 78)

<xsl:if test="//marc:datafield[@tag=100]/marc:subfield[@code='a']">
  <field name="author">
    <xsl:value-of select="//marc:datafield[@tag=100]/marc:subfield[@code='a']"/>
  </field>
  <field name="author-letter">
    <xsl:value-of select="substring(//marc:datafield[@tag=100]/marc:subfield[@code='a'], 1, 1)"/>
  </field>
</xsl:if>
<xsl:if test="//marc:datafield[@tag='100']">
  <field name="authornaf">
    <xsl:for-each select="//marc:datafield[@tag=100]/marc:subfield">
        <xsl:value-of select="@code"/>
        <xsl:value-of select="translate(normalize-space(.),'abcdefghijklmnopqrstuvwxyz ','ABCDEFGHIJKLMNOPQRSTUVWXYZ_')" />
    </xsl:for-each>
  </field>
</xsl:if>
<xsl:if test="//marc:datafield[@tag=110]/marc:subfield[@code='a']">
  <field name="author2">
    <xsl:value-of select="//marc:datafield[@tag=110]/marc:subfield[@code='a']"/>
  </field>
</xsl:if>

The added code joins all the subfields of the 100 field into one string. Each subfield contributes its code followed by its text, after leading and trailing whitespace is removed, internal runs of whitespace are collapsed, the text is converted to upper case, and the remaining spaces are replaced with underscores. For example, if $aJackson, Michael, $d1942- appeared in the 100 field of a record, there would be a field in Solr that would be searchable with the value aJACKSON,_MICHAEL,d1942-
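
The same normalization can be sketched outside of XSLT. The Python below mirrors the translate(normalize-space(.), ...) call shown above; the function name and the (code, value) input shape are assumptions for this example, and Python's split() treats a few more characters as whitespace than XPath does.

```python
import string

# Same mapping as the XSLT translate() call: ASCII lower-case letters become
# upper case and a space becomes an underscore. Other characters are untouched.
_TRANSLATE = str.maketrans(
    string.ascii_lowercase + " ", string.ascii_uppercase + "_"
)

def normalize_heading(subfields):
    """subfields: iterable of (code, value) pairs in record order."""
    parts = []
    for code, value in subfields:
        collapsed = " ".join(value.split())  # approximates normalize-space()
        parts.append(code + collapsed.translate(_TRANSLATE))
    return "".join(parts)

print(normalize_heading([("a", "Jackson, Michael,"), ("d", "1942-")]))
# prints aJACKSON,_MICHAEL,d1942-
```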

If you’re following along, you’ll need to restart vufind.sh (./vufind.sh restart) and re-import (php import-solr.php) at this point.

The Author Home Page in VuFind

Now, in order to demonstrate how the index can be used to improve an interface, I’ll focus on the Author page of VuFind, which provides information from Wikipedia about the author and a list of works by the author. Currently, a record or search results page will link to the Author page with a URL like http://example.com/Author/Home?author=Newman,%20Paul. The PHP page resides in web/services/Author/Home.php, and takes in just one GET parameter, author={processed subfield a}. For this information VuFind uses subfield a of the 100 field, with some slight normalizations. This author information is used in two ways within the page.

  1. To create a URL used to screen scrape a Wikipedia page providing information about the author.
    $author = $_GET['author'];
    if (substr($author, strlen($author) - 1, 1) == ",") {
        $author = substr($author, 0, strlen($author) - 1);
    }
    $author = explode(',', $author);

    // some unrelated code ...

    $url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=php&titles=' . urlencode("$author[1] $author[0]");

    You will notice it only uses the name, assuming it will find an entry in Wikipedia for this particular author at “first name last name”. So, for any heading that has the personal name “Jackson, Michael,” you will get the information for the pop star since his Wikipedia identifier is “Michael Jackson”.

    [Figure: VuFind’s display of author information from Wikipedia]
  2. To make a Solr call to search for all books written by the author.
    // Get records by this author
    $this->db = new SOLR($configArray['SOLR']['url']);
    $result = $this->db->query('author:"' . $_GET['author'] . '"', null, 0, 20);
    // does some things with the resulting arrays to display the results

    This will collocate all books in the system written by authors with that name, even if they are different authors.

    [Figure: Results of an author search in VuFind]
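
The collision is easy to reproduce. This Python sketch follows the same clean-up steps as the PHP above (drop a trailing comma, split on commas, swap the two parts); the function name is made up, and unlike the PHP it also trims stray spaces around each part.

```python
def wikipedia_title(author: str) -> str:
    """Turn "Last, First," into "First Last", as the Author page does."""
    author = author.strip()
    if author.endswith(","):
        author = author[:-1]
    parts = [part.strip() for part in author.split(",")]
    last = parts[0]
    first = parts[1] if len(parts) > 1 else ""
    return f"{first} {last}".strip()

# Subfield a is identical for every heading that starts "Jackson, Michael,",
# so every such author is sent to the same Wikipedia title.
print(wikipedia_title("Jackson, Michael,"))
print(wikipedia_title("Newman, Paul"))
```

Because only subfield a reaches the page, nothing in this step can tell two people with the same name apart.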

To improve both the Wikipedia search and the list of author works, we need Author/Home.php to use our indexed full heading, authornaf, so we can refer to a unique bibliographic identity. Later we will modify the Author page to use the authornaf field, but first we have to pass it in. In places where the old URL existed, it should now be formed to look like http://example.com/Author/Home?author=Newman,%20Paul&authornaf=aNEWMAN,_PAUL,d1921-.
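
A small sketch of how such a link could be assembled, written in Python for illustration only (VuFind builds its links in XSLT and PHP, and the function name here is an assumption). Letting a library encode the query string means that commas and spaces in the heading cannot break the URL.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def author_page_url(base: str, author: str, authornaf: str) -> str:
    """Build the Author page link carrying both the name and the full heading."""
    query = urlencode({"author": author, "authornaf": authornaf})
    return f"{base}/Author/Home?{query}"

url = author_page_url("http://example.com", "Newman, Paul", "aNEWMAN,_PAUL,d1921-")
print(url)

# Decoding the query string gives back the original values.
print(parse_qs(urlsplit(url).query))
```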

For an example of how existing URLs in VuFind should be changed, we will look at the display of an individual record. VuFind uses the raw MARCXML file and transforms it for display, similar to how it transforms it for the Solr upload file. So in web/services/Record/xsl/record-html.xsl, I modified a section that creates the URL for the author in the record to also include the authornaf.

Original code (line 79)

<xsl:template name="citation">
  <table cellpadding="2" cellspacing="0" border="0" class="citation">
    <xsl:if test="//datafield[@tag=100]">
      <tr valign="top">
        <th><xsl:value-of select="php:function('translate', 'Main Author')"/>: </th>
        <td>
          <a>
            <xsl:attribute name="href"><xsl:value-of select="$path"/>/Author/Home?author=<xsl:value-of select="//datafield[@tag=100]/subfield[@code='a']"/></xsl:attribute>
            <xsl:value-of select="//datafield[@tag=100]/subfield[@code='a']"/>
          </a>
        </td>
      </tr>
    </xsl:if>

Modified

<xsl:template name="citation">
  <table cellpadding="2" cellspacing="0" border="0" class="citation">
    <xsl:if test="//datafield[@tag=100]">
      <tr valign="top">
        <th><xsl:value-of select="php:function('translate', 'Main Author')"/>: </th>
        <td>
          <a>
            <xsl:attribute name="href">
              <xsl:value-of select="$path"/>
              <xsl:text>/Author/Home?authornaf=</xsl:text>
              <xsl:for-each select="//datafield[@tag=100]/subfield">
                <xsl:value-of select="@code"/>
                <xsl:value-of select="translate(normalize-space(.),'abcdefghijklmnopqrstuvwxyz ','ABCDEFGHIJKLMNOPQRSTUVWXYZ_')" />
              </xsl:for-each>
              <xsl:text>&amp;author=</xsl:text>
              <xsl:value-of select="//datafield[@tag=100]/subfield[@code='a']"/>
            </xsl:attribute>
            <xsl:value-of select="//datafield[@tag=100]/subfield[@code='a']"/>
          </a>
        </td>
      </tr>
    </xsl:if>

Modifying Wikipedia Search

We can now use the authornaf field to enhance the Wikipedia search. We’ll do this using the following algorithm:

  1. Use the authornaf to retrieve the title field stored in Solr for all records that contain that authornaf.
  2. Find the two most common words in these titles after a simple stop list is used to eliminate common words like “the”, “of”, etc.
  3. Create a Wikipedia search URL and use it to query Wikipedia.
  4. Extract the search results from the HTML page returned by Wikipedia.
  5. Iterate through the results in relevancy order until a URL is found that has all the parts of the name.
  6. Use this URL to retrieve the page and extract the first section. We’re assuming this page is about the author.
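
Steps 2, 3 and 5 can be sketched in a few lines of Python. This is an illustration of the idea rather than the PHP in Appendix II: the function names and the sample titles are assumptions, punctuation is only stripped from the ends of words, and the network requests of steps 1, 4 and 6 are left out.

```python
from collections import Counter
from urllib.parse import quote_plus

# Stop list copied from the countWords() function in Appendix II
STOP_WORDS = {"the", "a", "s", "of", "on", "in", "an", "if", "to", "and"}

def top_title_words(titles, how_many=2):
    """Step 2: the most frequent title words that are not stop words."""
    counts = Counter()
    for title in titles:
        for raw in title.lower().split():
            word = raw.strip(".,:;!?\"'()/")
            if word and word not in STOP_WORDS:
                counts[word] += 1
    return [word for word, _ in counts.most_common(how_many)]

def wikipedia_search_url(first, last, titles):
    """Step 3: search for the name plus the common title words."""
    terms = " ".join([first, last] + top_title_words(titles))
    return ("http://en.wikipedia.org/w/index.php?title=Special:Search&search="
            + quote_plus(terms))

def first_matching_link(link_texts, first, last):
    """Step 5: first result whose text contains every part of the name."""
    for text in link_texts:
        lowered = text.lower()
        if first.lower() in lowered and last.lower() in lowered:
            return text
    return None

# Illustrative titles only, standing in for what Solr would return
TITLES = ["The World Guide to Beer", "Pocket Guide to Beer", "Beer Companion"]
print(top_title_words(TITLES))
print(wikipedia_search_url("Michael", "Jackson", TITLES))
```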

This algorithm seems to work for many cases, but some further experimentation could involve:

  • Looking for a disambiguation page
  • Not counting parts of the first name as heavily
  • Using subject headings as search words
  • Using a local index of Wikipedia or a local DBpedia for more complex searches

Doing this means we no longer get information about the King of Pop every time we search Wikipedia; rather, we get information more suited to the individual or no information at all.

[Figure: Improved results from Wikipedia for Jackson, Michael, 1942-]

Note: The relevant code for this can be found in Appendix II.

Modify the list of works by an author

Now, let’s focus on the second part of the Author page, the list of all the author’s works. In this case, instead of doing a Solr query that will get us records that have the same 100 subfield a, we’ll do a query that finds all records with the same normalized 100. So instead of looking for all works that have Jackson, Michael we will find all the bibliographic records that have a normalized 100 that looks like aJACKSON,_MICHAEL,d1942-. The result will be a list of just the works by the well-known beer and whiskey critic Michael Jackson.

[Figure: Results of an authornaf search for works by Jackson, Michael, 1942-]

To accomplish this, we modify line 152 in web/services/Author/Home.php from

$result = $this->db->query('author:"' . $_GET['author'] . '"', null, 0, 20);

to

$result = $this->db->query('authornaf:"' . $_GET['authornaf'] . '"', null, 0, 20);
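
One detail worth considering in a production system: the GET parameter is placed inside a quoted phrase, so a double quote or backslash in the value would change the query. The Python sketch below shows one way to guard against that. The function name is an assumption and this is not part of the patch described in this article.

```python
def solr_phrase_query(field: str, value: str) -> str:
    """Build field:"value" with the value safely quoted."""
    # Escape backslashes first, then double quotes, so the value cannot
    # end the quoted phrase early.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'

print(solr_phrase_query("authornaf", "aJACKSON,_MICHAEL,d1942-"))
print(solr_phrase_query("authornaf", 'aSOME_"QUOTED"_NAME'))
```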

Note: All of the changes from this article can be implemented with a patch file [5].

Issues

Authority Complexity

A flaw in the use of the heading to collocate author works is the assumption that a person’s entire body of work will be represented by a single heading. In certain cases there will be multiple headings for one person, as when an author writes under a pseudonym or a changed name, or writes as a government officer. Since we do not examine the authority records, each of these headings will be treated in our modified VuFind as if it is for a different author.

Take, for example, Bill Clinton, the 42nd President of the United States. He has one authorized heading as “Clinton, Bill, 1946-”, but he also has writings as:

  • Arkansas. Governor (1979-1981 : Clinton)
  • Arkansas. Governor (1983-1992 : Clinton)
  • United States. President (1993-2001 : Clinton)

Ideally, there would be some way to find all the writings of an individual but still be able to distinguish between writings done by separate “bibliographic identities”. This would require a much higher level of processing, including use of the authority records.

The lack of descriptions regarding the nature of a connection between two authority headings is also problematic. US practice does not give enough information to identify the relationship between two authority records. If this information existed in a machine-readable form it would be possible to display, for example:

Material by Stephen King writing under pseudonym Richard Bachman

or

Works created as President by George W. Bush

Lack of Authority Control

One of the largest stumbling blocks to implementing this system is the simple fact that not all libraries practice authority control. It appears that a large majority of academic libraries do use some name authority control. According to a survey of libraries with a Carnegie Classification of either Doctoral/Research Extensive or Intensive level, 95% of those who responded (the response rate was 75%) do some form of authority control. For in-house work, 88% of new and maintenance cataloging had the personal name headings verified. For vendor work, 95% of new cataloging involved verifying personal name headings and 90% of maintenance work verified personal name headings [6].

However, that leaves many libraries that do not practice any authority control. In fact, some may even practice a mixture by keeping the headings on records they import from external systems such as OCLC but not adding any to their internal records. This would prevent the association of some books with their authors. In these cases, grouping all the records by authors with the same name may be the only solution.

Possible future directions:

Don’t use all the subfields

The modification given in this paper is a simple algorithm that uses all the subfields of the 100 field of the bibliographic MARC record. In the future, it would make sense to restrict the fields indexed to subfields a, b, c, d, q, and u.
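
A hedged sketch of what that restriction could look like, in Python rather than the XSLT used by VuFind. The function name and sample data are assumptions; the point is that a subfield outside the list, such as a relator term in subfield e, no longer changes the resulting heading.

```python
import string

NAME_SUBFIELDS = ("a", "b", "c", "d", "q", "u")
_UPPER = str.maketrans(
    string.ascii_lowercase + " ", string.ascii_uppercase + "_"
)

def restricted_heading(subfields, keep=NAME_SUBFIELDS):
    """Normalize only the subfields whose code is listed in keep."""
    parts = []
    for code, value in subfields:
        if code in keep:
            parts.append(code + " ".join(value.split()).translate(_UPPER))
    return "".join(parts)

# Made-up field with an extra subfield e that should not affect the result
with_relator = [("a", "Jackson, Michael,"), ("d", "1942-"), ("e", "editor.")]
print(restricted_heading(with_relator))
```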

Use added entry as well as main entry

In the future, treat added entries and headings used as subjects in a similar way. This might allow for some more sophisticated interfaces.

User Testing

Far more user testing is needed, looking at when people look for items by particular authors and what can help them get appropriate results. With this technique we can take some different approaches than those currently offered by many next-generation catalogs.

System Testing

What impact does having more unique headings have on the performance of large production systems? Perhaps previous next-generation catalog developers have purposely avoided using the full heading for this reason.

Facets

There is a need to try to improve the faceting of results. Right now the interface will “group” all the authors with the same normalized subfield a together. However, this makes this particular facet useless when people are searching for books by an author’s name. If one clicks on a link in a record, they’ll just see the exact same search they performed.

The identifier information could be used to populate the facet display with unique authors, even if they have the same subfield a. Distinguishing information could be added, such as titles of works created by the individual or common subject headings assigned to the person.
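
One possible shape for such a facet, sketched in Python under stated assumptions: the record dictionaries, their keys, and the choice of the first title as the distinguishing label are all hypothetical, and a real implementation would get its counts from Solr faceting rather than from a loop.

```python
from collections import defaultdict

def author_facets(records):
    """Group records by full heading and label each group with a title."""
    groups = defaultdict(list)
    for record in records:
        groups[record["authornaf"]].append(record)
    facets = []
    for authornaf, group in groups.items():
        facets.append({
            "authornaf": authornaf,
            "label": f'{group[0]["author"]} ({group[0]["title"]})',
            "count": len(group),
        })
    facets.sort(key=lambda facet: facet["count"], reverse=True)
    return facets

# Made-up records: same subfield a, two different full headings
RECORDS = [
    {"author": "Example, Author,", "authornaf": "aEXAMPLE,_AUTHOR,d1900-", "title": "First book"},
    {"author": "Example, Author,", "authornaf": "aEXAMPLE,_AUTHOR,d1950-", "title": "Other book"},
    {"author": "Example, Author,", "authornaf": "aEXAMPLE,_AUTHOR,d1900-", "title": "Second book"},
]

for facet in author_facets(RECORDS):
    print(facet["label"], facet["count"])
```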

Authority Files

Briefly mentioned earlier, incorporating the authority files into a separate index could allow for interesting possibilities, such as multiple levels of searching by author (i.e., all works or just works written under a particular pseudonym).

Better Linking to External Projects

If we can identify a person and the works they created, we can use that information to try to connect with other sources of information. We already link to Wikipedia; we could also link to similar projects.

Static Dumps of External Projects

Wikipedia can be downloaded as a static file and re-indexed to focus more on information boxes, persons, and works. DBpedia and similar projects also offer RDF tags and a chance to search Wikipedia information.

Using static dumps when possible allows for more intensive automated searching. For example, there are several works by Michael Jackson, an anthropologist, in my current collection. The algorithm of using the two most common words fails in this case. Few of the words in his titles overlap and the highest ranking words do not appear in the Wikipedia article. Having a local source would reduce concerns of over-use and abuse of Wikipedia bandwidth and other resources, allowing experimentation with different algorithms to find the most appropriate article.

Conclusion

We have room for improvement in the next generation of catalogs. It may be that there are some cataloging practices that could be changed to make automation easier, but there are also existing techniques that developers are not using to their full potential. The new generation of open systems brings new opportunities for experimentation. This does not have to be complex or intimidating; by changing just a few lines of code, we can create lists of the works of individual authors and improve the retrieval of author information from Wikipedia. Go experiment, and make our catalog interfaces better than they have ever been.

Notes

  1. This was run with the YAZ-client version 2.1.18, from the Ubuntu package. YAZ-client is a client that is included with the YAZ library.
  2. Consortium of Academic and Research Libraries in Illinois. For information on connecting to the Z39.50 server hosted by CARLI, read “I-Share via Z39.50”. I-Share is a catalog that contains all the records of most of the CARLI members.
  3. Anglo-American Cataloguing Rules. Second Edition. Revision 2002. American Library Association, Chicago; 2002.
  4. For more information about the details of these MARC fields, see the Library of Congress Bibliographic Data pages, in particular Main Entry — Personal Name, Added Entry — Personal Name, and Subject Added Entry — Personal Name. Or, see the OCLC Bibliographic Formats and Standards.
  5. Apply the patch file to recreate the modifications to VuFind described in this article.
  6. Wolverton, Robert E., Jr. Authority Control in Academic Libraries in the United States: A Survey. Cataloging & Classification Quarterly. 41 (1), 2005. p 111-131.

Appendix I – Getting Records Using YAZ-client

First, I issue the command to log into the I-Share Z39.50 server and start creating an output file ($ indicates the command line prompt). Any records that I “show” during the session will also be recorded into the output file.

$ yaz-client -u I-SHARE auth.carli.illinois.edu:210/voyager -m somefile.marc

If I wanted to narrow the search to a particular institution, I could use a different user name. For example, to search just Urbana-Champaign:

$ yaz-client -u UIU auth.carli.illinois.edu:210/voyager -m somefile.marc

I actually searched I-Share for records with Michael and Jackson in the author fields, but only searched Urbana-Champaign for the graphic novels. It should be noted that results from running YAZ-client are always appended to the filename given with the -m flag (somefile.marc in the example). It is not overwritten. This means you can build up a file from several different institutions and sessions.

You will get a response that contains some information about the server and then another command prompt (Z>). Now you can do some searches and retrieve records. The search should return the number of hits, which you can then retrieve with “show 1+number of hits”. One word of warning: for larger result sets you may wish to retrieve them in batches by doing “show 1+500”, then “show 501+500”, and so on. Otherwise you risk timing out with busy servers.
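
For large result sets the list of batched commands can be generated rather than typed. The short Python helper below is a convenience sketch, not part of YAZ; the function name is made up, and it only produces the “show start+count” lines to paste at the Z> prompt.

```python
def show_commands(hits: int, batch: int = 500):
    """Return the yaz-client "show" commands needed to fetch every hit."""
    return [
        f"show {start}+{min(batch, hits - start + 1)}"
        for start in range(1, hits + 1, batch)
    ]

for command in show_commands(5605):
    print(command)
```

The final command asks only for the records that remain, so the last batch of a 5605-hit search is “show 5501+105”.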

Getting records by people named Michael Jackson

Z> find @and @attr 1=1003 Jackson @attr 1=1003 Michael
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 483
records returned: 0
Elapsed: 3.119026
Z> show 1+483

Getting records that might be comic books/graphic novels/comic strip collections

Z> find @and @attr 1=21 "Comic Books" @attr 1=21 strips
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 5605
records returned: 0
Elapsed: 4.772098
Z> show 1+500
... lots of files go zipping by
Z> show 501+500
... lots of files go zipping by
Z> show 1001+500
... lots of files go zipping by
Z> close
Z> exit

Finally, you will need the MARC records to be in MARCXML. Another YAZ tool can help here.

yaz-marcdump -X somefile.marc > catalog.xml

Appendix II – Modified Wikipedia Code for Retrieving Author Information

The instructions below apply to web/services/Author/Home.php (see the original code):
Add the following function at line 29 (before the launch function):

/*
 Return: array consisting of word => number of times the word appears
 Input: $string - the string to analyze
 $words  - an array consisting of word => number of times the word appears;
 if there are existing values, they will be added to.
 This means you can pass in a series of strings and
 get the overall totals
*/
function countWords($string, $words) {
  /* Punctuation removed from every word. Hyphens are kept because
     they might be important; probably could just focus on the
     beginning and end of strings. */
  $punctuation = array(':', ';', ',', "'", '"', '(', ')', '|', '/', '?',
                       '!', '@', '#', '$', '%', '^', '&', '*', "\\", '.',
                       '+', '=', '_', '~', '`');
  /* simple stop list */
  $stopWords = array('the', 'a', 's', 'of', 'on', 'in', 'an', 'if', 'to', 'and');

  foreach (explode(' ', $string) as $word) {
    //print("$word <br />");
    $word = str_replace($punctuation, '', strtolower($word));
    if ($word != '' && !in_array($word, $stopWords, true)) {
      // start the count at 1 the first time a word is seen
      $words[$word] = isset($words[$word]) ? $words[$word] + 1 : 1;
    }
  }
  return $words;
}

Then replace lines 113-209 (after adding the above function):

Original code

// Clean up author string
$author = $_GET['author'];
if (substr($author, strlen($author) - 1, 1) == ",") {
  $author = substr($author, 0, strlen($author) - 1);
}
$author = explode(',', $author);
$interface->assign('author', $author);

// Connect To Wikipedia
if (!isset($_GET['page']) || ($_GET['page'] == 1)) {
  $url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=php&titles=' . urlencode("$author[1] $author[0]");
  $client = new HTTP_Request();
  $client->setMethod(HTTP_REQUEST_METHOD_GET);
  $client->setURL($url);
  $result = $client->sendRequest();
  if (!PEAR::isError($result)) {
    $body = unserialize($client->getResponseBody());

    //Check if data exists or not
    if(!$body['query']['pages']['-1']) {
      $body = array_shift($body['query']['pages']);
      $info['name'] = $body['title'];

      $body = array_shift($body['revisions']);
      $body = explode("\n", $body['*']);

      $done = 0;
      while(!$done) {
        if($body[0] == '') {
          array_shift($body);
          continue;
        }
        switch(substr($body[0], 0, 2)){
        case "[[" :
        case "{{" :
        case "}}" :
        case "]]" :
        case "| " :
          //echo " sub : '" . substr($body[0], 0, 2) . "' ";
          $stpos = stripos($body[0], "image:");
          if(!$stpos)
            $stpos = stripos($body[0], "image");
          if($stpos) {
            $len = 4;
            $endpos = stripos($body[0], ".jpg");
            if(!$endpos) {
              $len = 4;
              $endpos = stripos($body[0], ".gif");
            }
            if($endpos) {
              $image = substr($body[0], $stpos,
                              $endpos + $len - $stpos);
            }
          }
          array_shift($body);
          break;
        default :
          $done = 1;
          break;
        }

      }

      $desc = "";
      $done = 0;
      while(!$done) {
        if(substr($body[0], 0, 2) == "==")
          $done = 1;
        else {
          $desc .= $body[0];
          array_shift($body);
        }
      }

      //Create links to wikipedia

      $pattern = array();
      $replacement = array();
      $pattern[] = '/(\x5b\x5b)([^\x5d|]*)(\x5d\x5d)/';
      $replacement[] = '<a href="http://en.wikipedia.org/wiki/$2">$2</a>';
      $pattern[] = '/(\x5b\x5b)([^\x5d]*)\x7c([^\x5d]*)(\x5d\x5d)/';
      $replacement[] = '<a href="http://en.wikipedia.org/wiki/$2">$3</a>';
      // Removes citation
      $pattern[] = '/({{)[^}]*(}})/';
      $replacement[] = "";

      $desc = preg_replace($pattern, $replacement, $desc);

      $info['image'] = $image;
      $info['description'] = $desc;
      $interface->assign('info', $info);

    }
  }
}
}

Modified code

// Clean up author string
$author = $_GET['author'];
if (substr($author, strlen($author) - 1, 1) == ",") {
  $author = substr($author, 0, strlen($author) - 1);
}
$author = explode(',', $author);
$interface->assign('author', $author);

$authornaf = $_GET['authornaf'];

//We'll now search to see if we can find
//a wikipedia article that seems associated with the
//author by using common title words

// Connect To Wikipedia
if (!isset($_GET['page']) || ($_GET['page'] == 1)) {

  // Get records by this author
  $this->db = new SOLR($configArray['SOLR']['url']);
  $result = $this->db->query('authornaf:"' . $_GET['authornaf'] . '"', null, 0, 20);

  /* The result will have some information about
   the SOLR query and also information about
   each record.  Issue is this is an array of arrays,
   unless there's only one result, then it's just
   an array with values */

  $records = array();
  if (is_array($result['record'][0])) {
    $records = $result['record'];
  }
  else if (is_array($result['record'])){
    $records = array($result['record']);
  }

  $titles = array();
  $words = array();

  for($i = 0;$i < count($records);$i++) {
    $words = $this->countWords($records[$i]['title'],$words);
  }

  asort($words);

  /* asort() orders the words from least to most frequent, so
   the array_pop() calls below take the most common words
   off the end of the list */
  $words = array_keys($words);

  /* now we search for the author words (from
   earlier processing) and the two most common
   words.  Why?  Some rough testing seemed to
   indicate this was a good number. */
  $url = "http://en.wikipedia.org/w/index.php?title=Special:Search&search=" . urlencode("$author[1] $author[0] " . array_pop($words) . " " . array_pop($words));

  //Now we examine the results.

  $client = new HTTP_Request();
  $client->setMethod(HTTP_REQUEST_METHOD_GET);
  $client->setURL($url);
  $result = $client->sendRequest();
  if (!PEAR::isError($result)) {
    $xmlstring = $client->getResponseBody();
  }
  else {
    print("<html><head><title>Error</title></head><body>error</body></html>");
  }

  //need to suppress warnings
  //errors about id
  $xmldoc = new DOMDocument();

  //see http://www.mutinydesign.co.uk/scripts/problems-encountered-with-php-dom-functions---3/ on suppressing warnings -> bad html
  @$xmldoc->loadHTML($xmlstring);

  $docXpath = new DOMXPath($xmldoc);

  //for some reason I haven't quite yet figured out,
  //registering the namespace isn't working,
  //the dom class seems to ignore it in the source
  //document
  $query = '/html/body/div[@id="globalWrapper"]/div[@id="column-content"]/div[@id="content"]/div[@id="bodyContent"]/ul[1]/li/a';

  $links = $docXpath->query($query);
  $goodlink = '';

  //Now, I'll iterate through the results
  //I'm looking for the first result that
  //has all the parts of the author name in it
  //
  //This could definitely be improved
  foreach($links as $link) {

    $firstname = $author[1];
    $firstname = str_replace(array('.',','),'',$firstname);
    $firstname = trim($firstname);

    $lastname = $author[0];
    $lastname = str_replace(array('.',','),'',$lastname);
    $lastname = trim($lastname);

    // stripos() returns false when there is no match
    if (stripos($link->nodeValue,$firstname) !== false &&
        stripos($link->nodeValue,$lastname) !== false)
      {

        //print("good link <br />");
        $goodlink = $link->attributes->getNamedItem('href')->nodeValue;
        break;

      }
  }

  // skip the first 6 characters of the href (the "/wiki/" prefix of an article link)
  $title = substr($goodlink,6);

  $interface->assign('info', $info);

  $url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=php&titles=' . $title;

  //if we found something, display the wikipedia info
  //(in final version we'd want to have something displayed
  // if there wasn't a match or a more strict
  if ($goodlink != '') {
    $client = new HTTP_Request();
    $client->setMethod(HTTP_REQUEST_METHOD_GET);
    $client->setURL($url);
    $result = $client->sendRequest();
    if (!PEAR::isError($result)) {
      $body = unserialize($client->getResponseBody());

      //Check if data exists or not
      if(!$body['query']['pages']['-1']) {
        $body = array_shift($body['query']['pages']);
        $info['name'] = $body['title'];

        $body = array_shift($body['revisions']);
        $body = explode("\n", $body['*']);

        $done = 0;
        while(!$done) {
          if($body[0] == '') {
            array_shift($body);
            continue;
          }
          switch(substr($body[0], 0, 2)){
          case "[[" :
          case "{{" :
          case "}}" :
          case "]]" :
          case "| " :
            //echo " sub : '" . substr($body[0], 0, 2) . "' ";
            $stpos = stripos($body[0], "image:");
            if(!$stpos)
              $stpos = stripos($body[0], "image");
            if($stpos) {
              $len = 4;
              $endpos = stripos($body[0], ".jpg");
              if(!$endpos) {
                $len = 4;
                $endpos = stripos($body[0], ".gif");
              }
              if($endpos) {
                $image = substr($body[0], $stpos,
                                $endpos + $len - $stpos);
              }
            }
            array_shift($body);
            break;
          default :
            $done = 1;
            break;
          }

        }

        $desc = "";
        $done = 0;
        while(!$done) {
          if(substr($body[0], 0, 2) == "==")
            $done = 1;
          else {
            $desc .= $body[0];
            array_shift($body);
          }
        }

        //Create links to wikipedia

        $pattern = array();
        $replacement = array();
        $pattern[] = '/(\x5b\x5b)([^\x5d|]*)(\x5d\x5d)/';
        $replacement[] = '<a href="http://en.wikipedia.org/wiki/$2">$2</a>';
        $pattern[] = '/(\x5b\x5b)([^\x5d]*)\x7c([^\x5d]*)(\x5d\x5d)/';
        $replacement[] = '<a href="http://en.wikipedia.org/wiki/$2">$3</a>';
        // Removes citation
        $pattern[] = '/({{)[^}]*(}})/';
        $replacement[] = "";

        $desc = preg_replace($pattern, $replacement, $desc);

        $info['image'] = $image;
        $info['description'] = $desc;
        $interface->assign('info', $info);

      }
    }
  }
}
}

(Final modified code for web/services/Author/Home.php)

About the Author

Jonathan Gorman spends his days shifting bits and bytes through the Voyager system, among several other duties, as a Research Information Specialist for the University of Illinois. His nights are spent playing around with library technologies in hopes of constructing tools he would actually enjoy using. Jonathan is grateful for the love and support of his wife, Colleen; this article would not have been possible without her. He can be contacted at jonathan.gorman at gmail dot com.
