reclaiming website search

I’ve been withdrawing from relying on Google wherever possible, for various reasons. One place where I was still stuck in the Googleverse was with the embedded site search I was using on my self-hosted static file photo gallery site. That was one of the few places where I couldn’t find a decent replacement for Google, so it stayed there. And I wasn’t comfortable with that – I don’t think Google needs to be informed every time someone visits a page I host[1]. I use that embedded search pretty regularly, and cringe every time the page loads.

There had to be a good search utility that could be self-hosted. I went looking, and tried a few. My requirements were pretty basic – I don’t need multiple administrators, or shards of database replication, or multiple crawling schedulers etc… I don’t want to have to install a new application framework or runtime environment just for a search engine. I want it to be a simple install – ideally either a simple CGI script or something that can trivially drop onto a standard LAMP server.

Today, I installed a website indexer on a fresh new subdomain. Currently, the only website it indexes is darcynorman.net/gallery, but I can add any site to it, and then index and search on my own terms, without feeding data into or out of Google (or any other third party).

The search tool is powered by Sphider and seems pretty decent. It’s a simple installation process, and uses a MySQL database to store the index. Seems pretty fast – on my single-site index, with one user (me).

The biggest flaw I’ve found with Sphider so far is in how it handles relative links. Say you have a website structure like this:

  • index.html
    • page1.html
    • page2.html

If index.html uses a simple relative link like <a href="page1.html">Page 1</a>, Sphider skips it, unless the index.html page has a <base> element to tell Sphider explicitly how to regenerate full URLs for the relative links. Something like this:

<base href="http://photos.darcynorman.net/" />

Sphider can then use that to turn relative links into fully resolved absolute links.
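
The resolution step itself is standard URL handling. As a rough sketch in Python (Sphider is written in PHP, so this is not its code), urljoin from the standard library shows how a relative href becomes an absolute URL, with or without a <base> element. The resolve_link helper and its parameters are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_link(page_url, href, base_href=None):
    """Resolve href against the <base href> if one is given, else against the page URL.

    Hypothetical helper for illustration only; not part of Sphider.
    """
    # A <base href> may itself be relative, so resolve it against the page first.
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, href)


# No <base> element: the page's own URL is enough to build the full link.
print(resolve_link("http://photos.darcynorman.net/index.html", "page1.html"))

# With the <base> element from the example above.
print(resolve_link(
    "http://photos.darcynorman.net/index.html",
    "page1.html",
    base_href="http://photos.darcynorman.net/",
))
```

Both calls give the same absolute URL here, which is why a crawler can normally follow a relative link using nothing more than the URL of the page it is reading.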

But this is strange – I had 2 choices:

  1. hack the Sphider code to teach it how to behave properly (and then re-hack the code if there’s an update)
  2. update each gallery menu page to add the <base> head element

I chose #2, because I just didn’t have the energy to fix Sphider, and the HTML fix was simple enough. It definitely feels like a bug – there’s no way that editing every page to add a <base> element should be required, but whatever.
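
Option #2 doesn’t have to be done by hand. Here is a minimal sketch in Python of one way to script it, assuming each menu page has an ordinary <head> tag. The add_base_element function is hypothetical and is not necessarily how the pages on this site were edited.

```python
import re


def add_base_element(html, base_href):
    """Insert a <base> element right after the opening <head> tag.

    Hypothetical helper: returns the HTML unchanged if it already has a
    <base> element or if no <head> tag is found.
    """
    if re.search(r"<base\b", html, flags=re.IGNORECASE):
        return html
    tag = '<base href="{}" />'.format(base_href)
    # \b keeps the pattern from matching <header>; only the first <head> is touched.
    return re.sub(
        r"(<head\b[^>]*>)",
        lambda m: m.group(1) + "\n" + tag,
        html,
        count=1,
        flags=re.IGNORECASE,
    )


page = "<html><head><title>Gallery</title></head><body></body></html>"
print(add_base_element(page, "http://photos.darcynorman.net/"))
```

To apply it to a gallery, read each menu page, pass its contents through the function, and write the result back. Running it twice is harmless, because pages that already have a <base> element are left alone.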

Bottom line, Sphider works perfectly for my needs. It’s now powering the site search for my photo gallery site, and works quite well for that. And, it’s going to be available to index any of my other projects if needed.

  1. as would happen when the embedded search JavaScript is loaded – that activity data could then be tracked/stored/analyzed by Google to better model what you’re interested in, who you know, etc…