Notes from the Federated Search session.
- Merlot Federated Search
- Martin Koning Bastiaan
- Sam Shamseldin
- Alyssa Lalanne
- http://fedsearch.merlot.org
- Why?
- original problem: hard to find/evaluate learning materials
- emergent problem: number of collections/repositories/communities
- various ways of addressing the emergent problem - they chose federated search over harvesting
- 2 issues with harvesting
- lots of authors - how to get info together?
- if lots of collections, we could create one “union catalog” with all collections harvested in it, BUT that removes the value added by the individual collections
- Harvesting would “take away the life” of the communities and collections that are harvested
- 2 parts
- services
- expose partner resources
- clients
- connect to partner resources
- Federated search = cross collection client
- Simultaneous search of all partners, collecting results into integrated hitlist
- Limit number of results, to prevent harvesting (can’t get more than 25 results at a time)
- use Long Response Page to show progress bar during search (like WOLongResponse)
- Built in JSP
- Ranking weighs title over description, etc…
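The result cap and title-over-description ranking noted above could be sketched roughly as follows. This is only an illustration: the field weights, the term-counting score, and the names `FIELD_WEIGHTS`, `score`, and `rank` are assumptions and were not shown in the session. Only the 25-result limit and the fact that title outweighs description come from the talk.

```python
MAX_RESULTS_PER_REQUEST = 25  # cap mentioned in the session; discourages harvesting

# Hypothetical weights: the notes only say that title outweighs description.
FIELD_WEIGHTS = {"title": 3.0, "description": 1.0}


def score(record, query):
    """Sum weighted counts of the query terms across the weighted fields."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = record.get(field, "").lower().split()
        total += weight * sum(words.count(term) for term in terms)
    return total


def rank(records, query, limit=MAX_RESULTS_PER_REQUEST):
    """Return the best-scoring records, never more than the cap."""
    limit = min(limit, MAX_RESULTS_PER_REQUEST)
    ranked = sorted(records, key=lambda r: score(r, query), reverse=True)
    return ranked[:limit]
```

A record with the query term in its title ranks above one that only mentions it in the description, and asking for more than 25 results still returns at most 25.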
- How are controlled vocabularies managed?
- not at all. vocabulary agnostic
- Demo
- Merlot
- EdNA
- SMETE
- Relevancy ranking applied at the fed. search client level (not in sources)
- Can you run a federated search against Merlot? What API?
- based on Google WebService API
- A tweaked version used by Merlot and its partners (DN: CAREO should probably support this)
- search is open to partners only (both ways) - not open to the world
- No RSS feed or bookmarkable URL for searches
- Federated Search Collections
- Current partners: MERLOT, EdNA, SMETE
- additional partners needed
- general collections
- discipline-specific collections
- Fed. Search Architecture
- proxies
- service dispatch mechanism
- result handlers
- user interface customization
- future requirements
- Discussing putting their implementation into Open Source, or Shared Source with their partners
- Federated search community
- can’t solve these problems individually:
- search syntax - what is the query?
- results requirements - what info is returned?
- sharing knowledge and solutions
- Community charter: develop simple standards for searching multiple collections and a federated search framework as an implementation of those standards
- RE-USE EXISTING SIMPLE STANDARDS
- e.g. used Google as model, not Lucene.
- What about network latencies?
- different services respond at different speeds
- use timeout - if no result after so long, disregard source.
- use intermediary page before results to show status of search (progress bar)
- EdNA is in Australia and is one of the faster responders, so latency is not really an issue.
- Caching?
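A minimal sketch of the timeout approach described above, assuming each source can be wrapped as a plain callable that takes the query string. The function name `search_all`, the 5-second default, and the thread-pool approach are assumptions for illustration, not the presenters' JSP implementation. It needs Python 3.9 or later for `cancel_futures`.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def search_all(sources, query, timeout_seconds=5.0):
    """Query every source in parallel; disregard any that miss the deadline.

    `sources` maps a source name to a callable that takes the query string.
    Returns (results_by_source, sorted_names_of_skipped_sources).
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = {pool.submit(fn, query): name for name, fn in sources.items()}
    done, not_done = wait(futures, timeout=timeout_seconds)
    # Do not block on stragglers; whatever they return later is ignored.
    pool.shutdown(wait=False, cancel_futures=True)

    results = {}
    skipped = [futures[f] for f in not_done]
    for future in done:
        name = futures[future]
        try:
            results[name] = future.result()
        except Exception:
            skipped.append(name)  # a failing source is treated like a slow one
    return results, sorted(skipped)
```

Because all sources are queried at once, the total wait is bounded by the timeout rather than by the sum of the individual response times.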
- How to handle scalability?
- searches run in parallel, so all sources are queried at the same time
- little added cost for additional sources - the entire search is only as slow as the single slowest source
- have a resultlistener that gets callbacks from each source query, aggregates and ranks all results together.
- assume that the individual sources are giving their results with the “best” first, since we use only the first X records…
- Aggregated results from all sources are then sorted together for overall relevancy at the fed. search client level
- If there are missing fields, they just aren’t displayed (if there is no author returned, it’s not put as part of the result display item)
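The listener idea above (callbacks from each source, first X records only, one combined re-ranking, missing fields left out of the display) might look something like this. The class and method names (`ResultListener`, `on_results`, `merged`, `format_hit`) and the record layout are hypothetical; the session did not show the actual interfaces.

```python
import threading


class ResultListener:
    """Collects per-source callbacks and produces one merged hit list."""

    def __init__(self, per_source_limit=25):
        self.per_source_limit = per_source_limit
        self._hits = []
        self._lock = threading.Lock()  # callbacks may arrive from several threads

    def on_results(self, source_name, records):
        """Callback for one source; keeps only its first N records."""
        kept = [dict(r, source=source_name) for r in records[: self.per_source_limit]]
        with self._lock:
            self._hits.extend(kept)

    def merged(self, score):
        """Re-rank all collected hits together using the given scoring function."""
        with self._lock:
            return sorted(self._hits, key=score, reverse=True)


def format_hit(hit, fields=("title", "author", "description")):
    """Render only the fields a source actually returned."""
    return " | ".join(f"{name}: {hit[name]}" for name in fields if hit.get(name))
```

Truncating each source to its first N records relies on the assumption stated in the notes that sources return their best matches first.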
- Built it to grow easily
- just add 2 classes to the server to manage fed. queries on new source
- source and listener?