Merlot session notes: Federated Search


Notes from the Federated Search session.

  • Merlot Federated Search
    • Martin Koning Bastiaan
    • Sam Shamseldin
    • Alyssa Lalanne
    • http://fedsearch.merlot.org
    • Why?
      • original problem: hard to find/evaluate learning materials
      • emergent problem: number of collections/repositories/communities
      • various ways of addressing the emergent problem - they chose federated search over harvesting
        • 2 issues with harvesting
          • lots of authors - how to get info together?
          • if lots of collections, we could create one "union catalog" with all collections harvested in it, BUT that removes the value added by the individual collections
            • Harvesting would "take away the life" of the communities and collections that are harvested
      • 2 parts
        • services
          • expose partner resources
        • clients
          • connect to partner resources
      • Federated search = cross collection client
      • Simultaneous search of all partners, collecting results into integrated hitlist
      • Limit number of results, to prevent harvesting (can't get more than 25 results at a time)
      • use Long Response Page to show progress bar during search (like WOLongResponse)
      • Built in JSP
      • Ranking weighs title over description, etc...
      • How are controlled vocabularies managed?
        • not at all. vocabulary agnostic
    • Demo
      • Merlot
      • EdNA
      • SMETE
      • Relevancy ranking applied at the fed. search client level (not in sources)
  • Can you run a federated search against Merlot? What API?
    • based on Google WebService API
    • A tweaked version used by Merlot and its partners (DN: CAREO should probably support this)
    • search is open to partners only (both ways) - not open to the world
  • No RSS feed or bookmarkable URL for searches
  • Federated Search Collections
    • Current partners: MERLOT, EdNA, SMETE
    • additional partners needed
      • general collections
      • discipline-specific collections
  • Fed. Search Architecture
    • proxies
    • service dispatch mechanism
    • result handlers
    • user interface customization
    • future requirements
  • Discussing putting their implementation into Open Source, or Shared Source with their partners
  • Federated search community
    • can't solve these problems individually:
      • search syntax - what is the query?
      • results requirements - what info is returned?
      • sharing knowledge and solutions
    • Community charter: develop simple standards for searching multiple collections and a federated search framework as an implementation of those standards
      • RE-USE EXISTING SIMPLE STANDARDS
        • e.g. used Google as model, not Lucene.
  • What about network latencies?
    • different services respond at different speeds
    • use timeout - if no result after so long, disregard source.
    • use intermediary page before results to show status of search (progress bar)
    • EdNA is in Australia and is one of the faster responders, so latency is not really an issue.
  • Caching?
  • How to handle scalability?
    • searches run in parallel, so all sources are queried at the same time
    • no real cost for increased sources - the entire search is only as slow as the single slowest source
    • have a resultlistener that gets callbacks from each source query, aggregates and ranks all results together.
    • assume that the individual sources are giving their results with the "best" first, since we use only the first X records...
    • Aggregated results from all sources are then sorted together for overall relevancy at the fed. search client level
    • If there are missing fields, they just aren't displayed (if there is no author returned, it's not put as part of the result display item)
  • Built it to grow easily
    • just add 2 classes to the server to manage fed. queries on new source
      • source and listener?
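
The parallel search described in these notes (query every source at once, disregard sources that miss a timeout, cap results per source, then rank the merged list with title weighted over description) can be sketched roughly as follows. This is an illustrative Python sketch, not the MERLOT implementation, which was built in JSP. The names `Hit`, `score`, `federated_search`, the 3:1 title weighting, the sample sources, and the timeout values are all assumptions made for the example; only the 25-result cap comes from the session.

```python
import concurrent.futures as cf
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

MAX_PER_SOURCE = 25  # cap mentioned in the session: at most 25 results at a time


@dataclass
class Hit:
    source: str
    title: str
    description: str = ""
    author: Optional[str] = None  # missing fields are simply left out of the display


def score(hit: Hit, query: str) -> int:
    """Hypothetical ranking: a title match counts three times a description match."""
    terms = query.lower().split()
    title, desc = hit.title.lower(), hit.description.lower()
    return sum(3 * (t in title) + (t in desc) for t in terms)


def federated_search(
    query: str,
    sources: Dict[str, Callable[[str], List[Hit]]],
    timeout: float = 2.0,
) -> List[Hit]:
    """Query all sources in parallel, drop slow or failing ones, rank the merged hits."""
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
    done, _ = cf.wait(list(futures.values()), timeout=timeout)
    pool.shutdown(wait=False)  # do not block on sources that missed the timeout
    hits: List[Hit] = []
    for name, fut in futures.items():
        if fut not in done or fut.exception() is not None:
            continue  # disregard this source
        # Assume each source returns its best results first; keep only the first few.
        hits.extend(fut.result()[:MAX_PER_SOURCE])
    # Relevancy ranking happens here, at the client level, not in the sources.
    return sorted(hits, key=lambda h: score(h, query), reverse=True)


# Made-up stand-in sources for demonstration only.
def fast_a(q: str) -> List[Hit]:
    return [Hit("fast_a", "Teaching resources", "physics lesson plans")]


def fast_b(q: str) -> List[Hit]:
    return [Hit("fast_b", "Intro to physics", "mechanics basics", "A. Author")]


def slow_c(q: str) -> List[Hit]:
    time.sleep(0.5)  # slower than the timeout used below
    return [Hit("slow_c", "physics", "physics")]


results = federated_search(
    "physics", {"fast_a": fast_a, "fast_b": fast_b, "slow_c": slow_c}, timeout=0.1
)
for hit in results:
    byline = f" by {hit.author}" if hit.author else ""
    print(f"[{hit.source}] {hit.title}{byline}")
```

In this sketch the total wait is bounded by the timeout rather than by the slowest source, and adding a source means adding one more entry to the `sources` mapping. A callback-based listener, as mentioned in the session, would be an alternative to waiting on the futures in one place.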
