Merlot session notes: Federated Search

Notes from the Federated Search session.

Merlot Federated Search
- Martin Koning Bastiaan
- Sam Shamseldin
- Alyssa Lalanne
- http://fedsearch.merlot.org
- Why?
  - original problem: hard to find/evaluate learning materials
  - emergent problem: number of collections/repositories/communities
  - various ways of addressing the emergent problem - they chose federated search over harvesting
    - 2 issues with harvesting
      - lots of authors - how to get info together?
      - if lots of collections, we could create one “union catalog” with all collections harvested in it, BUT that removes the value added by the individual collections
        - Harvesting would “take away the life” of the communities and collections that are harvested
  - 2 parts
    - services
      - expose partner resources
    - clients
      - connect to partner resources
  - Federated search = cross collection client
  - Simultaneous search of all partners, collecting results into integrated hitlist
  - Limit number of results, to prevent harvesting (can’t get more than 25 results at a time)
  - use Long Response Page to show progress bar during search (like WOLongResponse)
  - Built in JSP
  - Ranking weighs title over description, etc…
  - How are controlled vocabularies managed?
    - not at all. vocabulary agnostic
- Demo
  - Merlot
  - EdNA
  - SMETE
  - Relevancy ranking applied at the fed. search client level (not in sources)
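A minimal sketch of client-level ranking that weighs title over description, as described above. The actual weights and scoring formula were not given in the session; the 3:1 weighting, the field names, and the functions `score` and `rank` are assumptions for illustration only.

```python
# Hypothetical weights: the notes only say that title counts for more than description.
FIELD_WEIGHTS = {"title": 3.0, "description": 1.0}


def score(record, query):
    """Sum the weighted number of query-term matches in each field."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = record.get(field, "").lower().split()
        total += weight * sum(words.count(term) for term in terms)
    return total


def rank(records, query):
    """Order records from all sources by score, highest first."""
    return sorted(records, key=lambda r: score(r, query), reverse=True)


# Made-up records standing in for hits returned by two partners.
hits = [
    {"source": "EdNA", "title": "Cell biology", "description": "Intro to physics of cells"},
    {"source": "MERLOT", "title": "Physics simulations", "description": "Interactive applets"},
]
ranked = rank(hits, "physics")
```

Because the scoring runs in the client, the sources do not need to agree on a ranking scheme; they only need to return the fields being scored.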
Can you run a federated search against Merlot? What API?
- based on Google WebService API
- A tweaked version used by Merlot and its partners (DN: CAREO should probably support this)
- search is open to partners only (both ways) - not open to The World
No RSS feed or bookmarkable URL for searches
Federated Search Collections
- Current partners: MERLOT, EdNA, SMETE
- additional partners needed
  - general collections
  - discipline-specific collections
Fed. Search Architecture
- proxies
- service dispatch mechanism
- result handlers
- user interface customization
- future requirements
Discussing putting their implementation into Open Source, or Shared Source with their partners
Federated search community
- can’t solve these problems individually:
  - search syntax - what is the query?
  - results requirements - what info is returned?
  - sharing knowledge and solutions
- Community charter: develop simple standards for searching multiple collections and a federated search framework as an implementation of those standards
  - RE-USE EXISTING SIMPLE STANDARDS
    - eg. used Google as model, not Lucene.
What about network latencies?
- different services respond at different speeds
- use timeout - if no result after so long, disregard source.
- use intermediary page before results to show status of search (progress bar)
- EdNA is in Australia, and is one of the faster responders - latency not really an issue.
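The timeout approach above could look roughly like this. It is a sketch only, not the MERLOT implementation (which was built in JSP); `search_all`, the stand-in source functions, and the timeout values are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def search_all(sources, query, timeout_s):
    """Query every source in parallel; keep only those that answer in time."""
    pool = ThreadPoolExecutor(max_workers=len(sources))
    futures = {pool.submit(fn, query): name for name, fn in sources.items()}
    done, not_done = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)  # do not block on sources that are still running
    results = {}
    for fut in done:
        if fut.exception() is None:  # a failed source is simply left out
            results[futures[fut]] = fut.result()
    skipped = sorted(futures[fut] for fut in not_done)
    return results, skipped


def fast_source(query):
    return ["fast hit for " + query]


def slow_source(query):
    time.sleep(1.0)  # simulates a partner that misses the deadline
    return ["slow hit for " + query]


results, skipped = search_all(
    {"fast": fast_source, "slow": slow_source}, "physics", timeout_s=0.2
)
```

Here `results` holds only the answer from the fast source and `skipped` names the source that was disregarded, which is the information a status page would need.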
Caching?
How to handle scalability?
- searches run in parallel, so all sources are queried at the same time
- no real cost for increased sources - the entire search is only as slow as the single slowest source
- have a resultlistener that gets callbacks from each source query, aggregates and ranks all results together.
- assume that the individual sources are giving their results with the “best” first, since we use only the first X records…
- Aggregated results from all sources are then sorted together for overall relevancy at the fed. search client level
- If there are missing fields, they just aren’t displayed (if there is no author returned, it’s not put as part of the result display item)
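The listener pattern described above might be sketched as follows. All names here (`ResultListener`, `on_results`, `federated_search`, `format_hit`) are invented for illustration, and the relevancy key in the usage example is a placeholder, not the ranking used by the real client. The cap of 25 comes from the per-request limit mentioned earlier in these notes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_PER_SOURCE = 25  # the notes mention a cap of 25 results at a time


class ResultListener:
    """Collects callbacks from each source query into one list."""

    def __init__(self):
        self._lock = threading.Lock()
        self.records = []

    def on_results(self, source_name, records):
        # Keep only the first records, trusting each source to return its best first.
        with self._lock:
            for rec in records[:MAX_PER_SOURCE]:
                self.records.append({**rec, "source": source_name})

    def ranked(self, key):
        """Sort the aggregated records from all sources together."""
        return sorted(self.records, key=key, reverse=True)


def federated_search(sources, query):
    """Run all source queries in parallel and return the filled listener."""
    listener = ResultListener()
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        for name, fn in sources.items():
            pool.submit(lambda n=name, f=fn: listener.on_results(n, f(query)))
    return listener  # leaving the with-block waits for every source


def format_hit(record):
    """Build a display line, leaving out fields the source did not return."""
    parts = [record["title"]]
    if record.get("author"):
        parts.append("by " + record["author"])
    parts.append("[" + record["source"] + "]")
    return " ".join(parts)


# Stand-in sources returning made-up records.
sources = {
    "MERLOT": lambda q: [{"title": "Physics applets", "author": "A. Author"}],
    "EdNA": lambda q: [{"title": "Physics of physics teaching"}, {"title": "Chemistry"}],
}
listener = federated_search(sources, "physics")
lines = [
    format_hit(r)
    for r in listener.ranked(lambda r: r["title"].lower().count("physics"))
]
```

Since every query runs in its own worker, total wall-clock time tracks the slowest source rather than the number of sources, which matches the scalability claim above.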
Built it to grow easily
- just add 2 classes to the server to manage fed. queries on new source
  - source and listener?