Expose Blob

It just struck me that the fancy Expose Blob on Mac OS X 10.3 is really quite useless. It’s not that convenient to mouse over to it and click the button. Much easier to just hit F9 or whatever you’ve set your Expose key to…

It would be MUCH more useful if you had a touch screen. Like, say, a SMART Board, or, say, a tablet. The Newton could have used the Expose Blob.

I’m just saying… Wonder what’s under wraps for MWSF next week…

eXistDB Prototype Database Server

I’ve got a prototype eXistDB server, built as a WebObjects application, running on an iMac on my desk. Works pretty well, and it does some great XQuery stuff. I’ve entered the CAREO metadata, current as of a week ago, so it’s got 3733 IMS LOM records to play with.

Check it out.

It has a simple search (just enter a term and hit the button), as well as a generic XQuery entry panel. Feel free to experiment with any XQuery statements you want (do a simple search and look at the source HTML of the result page for a starting point…). Go ahead and try some funky boolean searches like “earth image satellite”. There are still some refinements left to go, like limits on search results (it’s currently possible to return the entire database as the result of a query, which is not recommended). I also have to play around with handling multiple schemas (IMS LOM, Dublin Core, MPEG-7, IMS CP, METS, …) for both querying and retrieval.

Basically, eXist is pretty darned good, and Wolfgang (the developer) is quite responsive. Not sure how it compares to XStreamDB performance-wise, but it does beat it on the cost…

Merry Christmas!

I’ve been trying to be offline for the last few days (and will keep trying until Jan. 5, when I’m Back In The Office). I’ll likely be lurking online, though.

Took Evan to see Santa last night. He has been fine with Santa before – sat on his knee several times, for lotsa photos – but last night, he decided that Santa Is Evil. Crying and fussing must be associated with attempted Santa Knee Sittings. Oh, well… It was still pretty cute.

Evan on Santa's Knee, 2003

Merry Christmas, everyone, and I’ll see you in 2004 (hopefully refreshed and raring to go…)

Federated Identity Management

Looking into techniques to allow us to decentralize user management in cross-institutional (and non-institutional) software, such as APOLLO.

Here are some links I’ve come across on the topic:

Many of these articles look like corporate shovelware “Read about how smart we are – give us money” but maybe there’s some good stuff in there, too.

This is stuff waaaay outside my normal realm of things, so I’ll be doing some reading/thinking about this stuff, and how it might affect CAREO/APOLLO.

The goal is to be able to do something like this scenario:

Bill is a professor at the University of Calgary. He securely logs into an APOLLO search application using his U of C login, and APOLLO is aware of the groups and roles that Bill has as part of his U of C identity.

Mary is a grad student at the University of British Columbia. She logs into an APOLLO collaborative application using her UBC login, and is able to access resources defined by her groups and roles described by her UBC identity.

Bill and Mary are working together on a project, and Bill creates an ad-hoc group in APOLLO for them to share resources privately while collaborating on their development. Once ready for publication, these resources are made available to individuals at both the U of C and UBC.
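To make the scenario a bit more concrete, here is a rough Python sketch of the data model it implies. Everything in it is hypothetical: the `Identity`, `Resource`, `AdHocGroups` and `can_access` names, the group strings, and the institution labels are made up for illustration and are not part of APOLLO or of any federation standard. It also assumes the home institution has already authenticated the user and handed over their groups, which is the hard part and is not shown.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Identity:
    """A user as asserted by their home institution (hypothetical model)."""
    username: str
    institution: str
    groups: frozenset = frozenset()

    @property
    def qualified_name(self):
        return f"{self.username}@{self.institution}"


@dataclass
class Resource:
    title: str
    # Groups (institutional or ad-hoc) that are allowed to see this resource.
    allowed_groups: set = field(default_factory=set)


class AdHocGroups:
    """Application-side registry of groups created by users, across institutions."""

    def __init__(self):
        self._members = {}

    def create(self, name, members):
        self._members[name] = {m.qualified_name for m in members}

    def groups_for(self, identity):
        return {name for name, members in self._members.items()
                if identity.qualified_name in members}


def can_access(identity, resource, adhoc):
    """True if any institutional or ad-hoc group of the user is allowed."""
    effective = set(identity.groups) | adhoc.groups_for(identity)
    return bool(effective & resource.allowed_groups)


# The scenario: Bill (U of C) and Mary (UBC) share a private project group.
bill = Identity("bill", "ucalgary", frozenset({"ucalgary:faculty"}))
mary = Identity("mary", "ubc", frozenset({"ubc:grad-students"}))
outsider = Identity("sam", "ubc", frozenset({"ubc:grad-students"}))

adhoc = AdHocGroups()
adhoc.create("project-x", [bill, mary])

draft = Resource("Draft learning object", {"project-x"})
private_ok = [can_access(u, draft, adhoc) for u in (bill, mary, outsider)]

# Publication: open the resource up to people at both institutions.
draft.allowed_groups |= {"ucalgary:faculty", "ubc:grad-students"}
published_ok = can_access(outsider, draft, adhoc)
```

The point of the sketch is only that institutional groups arrive with the login, while ad-hoc groups live in the application and can span institutions.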

eXistDB and WebObjects

I’ve spent the morning building a prototype WebObjects app to act as an XML metadata server. I’ve embedded eXistDB into the application, and it created the necessary database files and indices for me.

Then, I wrote a short method to import XML documents from a path (with the added bonus of importing a whole directory if that was given). There are now 3600+ records in the embedded database.
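The "file or whole directory" logic is simple enough to sketch. This is a Python illustration rather than the actual WebObjects/Java method, and the `store` callback is a hypothetical stand-in for whatever call hands a document to the database.

```python
import tempfile
from pathlib import Path


def xml_files(path):
    """Return the files to import: the file itself, or every *.xml under a directory."""
    p = Path(path)
    if p.is_dir():
        return sorted(f for f in p.rglob("*.xml") if f.is_file())
    if p.is_file():
        return [p]
    raise FileNotFoundError(path)


def import_documents(path, store):
    """Pass each file's name and text to store(name, xml_text); return the count."""
    count = 0
    for f in xml_files(path):
        store(f.name, f.read_text(encoding="utf-8"))
        count += 1
    return count


# Demo against a throwaway directory; a dict stands in for the database.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.xml", "b.xml", "notes.txt"):
        Path(tmp, name).write_text("<lom/>", encoding="utf-8")
    stored = {}
    imported = import_documents(tmp, stored.__setitem__)
    single = import_documents(Path(tmp, "a.xml"), stored.__setitem__)
```

Given a directory it walks everything underneath and skips non-XML files; given a single file it imports just that one.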

And boy, is it fast. Queries are almost instantaneous (~100ms typical), but document retrievals are a wee bit slower, increasing linearly with the number of hits. I haven’t added any limits, so you can do a query for something lame like “*a*” and get the whole database back in one page.
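One obvious way to add the missing limit is to hand back a page of hits at a time. A minimal Python sketch, assuming the hits come back as something iterable; `page_of` and the page size of 25 are made-up names and numbers, not anything eXist provides.

```python
from itertools import islice


def page_of(hits, page, page_size=25):
    """Return one page of hits (pages are numbered from 1)."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    start = (page - 1) * page_size
    return list(islice(hits, start, start + page_size))


# Pretend a query matched every one of 3647 documents.
all_hits = range(3647)
first_page = page_of(all_hits, 1)
last_page = page_of(all_hits, 146)
```

Since retrieval cost grows with the number of hits, only the documents on the requested page would need to be pulled out of the database.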

The embedded eXist database doesn’t use the XML-RPC API like the standalone database does, so there isn’t any marshalling/unmarshalling overhead. Just native Java calls.

Considering that document retrieval isn’t optimized (it’s basically a debug “dump the entire LOM as the item to display”), performance is already quite acceptable.

Here are the stats from a simple search for “biology”:
Query: //text() &= 'biology'
Hits: 377
Query Time: 124 ms
Retrieval Time: 6059 ms

That retrieval time includes pulling the ENTIRE LOM for each and every one of the 377 results, which works out to about 16 ms per record.

UPDATE: Just ran some more tests, and cracked open the debug log file. Here’s what I found:

09 Dec 2003 18:42:00,401 – loading 3647 documents from 2collections took 3ms.
09 Dec 2003 18:42:00,411 – found image: 2800 in 4ms.
09 Dec 2003 18:42:00,414 – found nasa: 9 in 0ms.
09 Dec 2003 18:42:00,417 – found space: 13 in 1ms.
query: //text() &= 'image nasa space'
hits: 3
query time: 86
retrieve time: 22

Finding 2800 records containing the string “image” took 4ms. Holy freaking cow.

From the various test queries I’ve run, the vast majority of the time is spent retrieving the documents out of the database. The query runs extremely fast, but yanking the entire LOM out takes some time. I’m going to look at ways to pull only selected XPath values rather than the full record. That may be faster…
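To show what “only pull selected values” might look like, here is a small Python sketch using the standard library’s ElementTree as a stand-in. The sample record is invented and has no namespaces, and `SUMMARY_PATHS` and `summarize` are hypothetical names. In the real setup the paths would be evaluated inside the database so that only those values come back; this only illustrates the shape of the result.

```python
import xml.etree.ElementTree as ET

# A made-up, heavily trimmed LOM-like record for illustration.
SAMPLE_LOM = """<lom>
  <general>
    <title><langstring>Earth At Night</langstring></title>
    <description><langstring>Composite satellite image.</langstring></description>
  </general>
  <technical><format>image/jpeg</format></technical>
</lom>"""

# The handful of fields a results page actually needs (paths relative to <lom>).
SUMMARY_PATHS = {
    "title": "general/title/langstring",
    "format": "technical/format",
}


def summarize(xml_text, paths=SUMMARY_PATHS):
    """Return just the selected values; missing elements become empty strings."""
    root = ET.fromstring(xml_text)
    return {name: (root.findtext(path) or "").strip()
            for name, path in paths.items()}


summary = summarize(SAMPLE_LOM)
```

A results list built from summaries like this would only need the full record when someone clicks through to a single item.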

eXist XPath Extensions

One of the really cool things about eXist is the XPath extensions for fulltext searching. They mimic (using XPath) the stuff that is done in XStreamDB via XQuery.

I can do stuff like:

document(*)//text() &= "*image*"

and eXist will return any XML document (from its entire set of collections) that contains the string “image” somewhere in it. That could be in /lom/general/title/langstring/Images Of Bangalore, or /lom/technical/format/image/jpeg, or wherever. It doesn’t care. And it’s very fast.

What’s more, I can do stuff like:

document(*)/*[ //format &= "*image*" and //text() &= "*earth*"]

which says: find me XML documents that have “image” somewhere in a “format” element (could be, say, /lom/technical/format) and contain the string “earth” somewhere (like, say, /lom/general/title/langstring/Earth At Night or /lom/general/title/langstring/Earthquakes).

I can also do something like:

document(*)//text() &= "*image* *kyoto*"

Which will give me different results than

document(*)//text() &= "*image* *kyoto* *relig*"

because the second query will restrict the search to stuff to do with “relig” – religion, religious, whatever (in this case, a Buddhist temple in Kyoto is returned, as opposed to the Kyoto Accord presentations at the University of Calgary, which are returned by the query before it…)

Queries based on the fulltext extensions (using the &= operator to indicate “boolean and”; you can also use the |= operator to indicate “boolean or”) are amazingly fast. I’m getting results from rather complicated test queries on the entire 3600+ CAREO record set in a fraction of a second. Nice.
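For anyone who wants to poke at the semantics without an eXist install, here is a toy Python emulation of the and/or keyword matching as described above (every wildcard term has to turn up somewhere in the document for “and”, any one of them for “or”). It is emphatically not how eXist does it: eXist works from its indexes, while this just tokenizes the text nodes and uses `fnmatch` for the wildcards. The two sample records are invented, loosely modelled on the Kyoto example, and `tokens` and `matches` are made-up names.

```python
import re
import xml.etree.ElementTree as ET
from fnmatch import fnmatchcase


def tokens(xml_text):
    """Lower-cased word tokens from every text node in the document."""
    root = ET.fromstring(xml_text)
    text = " ".join(root.itertext())
    return {t.lower() for t in re.findall(r"\w+", text)}


def matches(xml_text, query, require_all=True):
    """Emulate the 'and' behaviour (require_all=True) or the 'or' behaviour (False)."""
    patterns = [p.lower() for p in query.split()]
    if not patterns:
        return False
    words = tokens(xml_text)
    found = [any(fnmatchcase(w, p) for w in words) for p in patterns]
    return all(found) if require_all else any(found)


temple = ("<lom><general><title><langstring>Image of a Buddhist temple in Kyoto"
          "</langstring></title><description><langstring>Religious architecture"
          "</langstring></description></general></lom>")
accord = ("<lom><general><title><langstring>Kyoto Accord presentation"
          "</langstring></title></general>"
          "<technical><format>image/jpeg</format></technical></lom>")

both_hit = [matches(d, "*image* *kyoto*") for d in (temple, accord)]
narrowed = [matches(d, "*image* *kyoto* *relig*") for d in (temple, accord)]
either = matches(accord, "*relig* *accord*", require_all=False)
```

Adding the third term knocks the Kyoto Accord record out of the results, which is the narrowing effect described above.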

eXist: Open Source XML Database

I initially sent this as an email to the group, but thought it might serve better on the weblog…

I’ve been playing around with eXist today. Holy crap.

I used Rob’s JUD export script to suck all 3600+ records out of the CAREO JUD (took almost 2 hours to process that), then ran the import function on eXist (took maybe 5 minutes to import them all).

It looks like it’s going to be able to do some pretty freaky stuff, search-wise. I’ve been playing around with some pretty loose XPath queries, and it returns excellent hits, pretty darned fast. It can be slow if I request, say, all documents with the letter “a” in them somewhere, but for normal queries, it’s stinky fast.

Even for some pretty compound queries, it’s fast, too.

Here’s an example:

document(*)//text() &= '*image* *biology* *water*'

This basically says: Return any xml document that contains, somewhere in the various elements in the document, the strings “image”, “biology”, and “water”.

It might match “image” in /lom/technical/format, and “biology” in /lom/classification/keywords/langstring, and “water” in /lom/general/description/langstring.

This particular search returned 60 hits, taking a total of 638ms of processing. Without having added any indexing.

I did another search for:
document(*)//text() &= "*biology* *video*"

and it found records that would otherwise have been hard to identify as videos (the technical/location element had a value with “/VIDEOS/” in it, so it matched).

Also, it seems to cache search results on the fly, so subsequent searches for the same thing return instantly. Very nice.