reclaiming website search

I’ve been withdrawing from relying on Google wherever possible, for various reasons. One place where I was still stuck in the Googleverse was with the embedded site search I was using on my self-hosted static-file photo gallery site. That was one of the few places where I couldn’t find a decent replacement for Google, so it stayed there. And I wasn’t comfortable with that – I don’t think Google needs to be informed every time someone visits a page I host[1]. I use that embedded search pretty regularly, and cringe every time the page loads.

There had to be a good search utility that could be self-hosted. I went looking, and tried a few. My requirements were pretty basic – I don’t need multiple administrators, or shards of database replication, or multiple crawling schedulers etc… I don’t want to have to install a new application framework or runtime environment just for a search engine. I want it to be a simple install – ideally either a simple CGI script or something that can trivially drop onto a standard LAMP server.

Today, I installed a website indexer on a fresh new subdomain. Currently, the only website it indexes is darcynorman.net/gallery, but I can add any site to it, and then index and search on my own terms, without feeding data into or out of Google (or any other third party).

The search tool is powered by Sphider and seems pretty decent. It’s a simple installation process, and uses a MySQL database to store the index. Seems pretty fast – on my single-site index, with one user (me).

The biggest flaw I’ve found with Sphider so far is in how it handles relative links. Say you have a website structure like this:

  • index.html
    • page1.html
    • page2.html

If index.html uses a simple relative link like <a href="page1.html">Page 1</a>, Sphider skips it, unless the index.html page has a <base> element to tell Sphider explicitly how to regenerate full URLs for the relative links. Something like this:

<base href="http://photos.darcynorman.net/" />

Which Sphider can then use to turn relative links into fully resolved absolute links.
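For comparison, here is a minimal Python sketch of how a relative link is normally resolved: against the <base> href if the page has one, otherwise against the URL of the page itself. This is not Sphider's code (Sphider is PHP), and the `resolve_link` helper and the index.html page URL are my own assumptions for illustration.

```python
from urllib.parse import urljoin


def resolve_link(page_url, href, base_href=None):
    """Resolve href against <base href> when given, else against the page URL."""
    # A <base href> may itself be relative, so resolve it against the page first.
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, href)


# Assumed page URL, for illustration only.
page_url = "http://photos.darcynorman.net/index.html"

# No <base> element: the page's own URL is enough to build the absolute link.
print(resolve_link(page_url, "page1.html"))

# With the <base> element from the post: same result.
print(resolve_link(page_url, "page1.html", base_href="http://photos.darcynorman.net/"))
```

Both calls produce the same absolute URL, which is why a crawler shouldn't need the <base> element for a link like this.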

But this is strange – I had 2 choices:

  1. hack the Sphider code to teach it how to behave properly (and then re-hack the code if there’s an update)
  2. update each gallery menu page to add the <base> head element

I chose #2, because I just didn’t have the energy to fix Sphider, and the HTML fix was simple enough. It definitely feels like a bug – there’s no way that editing every page to add a <base> element should be required, but whatever.
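Option #2 doesn't have to be done by hand. The sketch below shows one way it could be scripted in Python; it is not what I actually ran. The `add_base_tag` function is hypothetical, and it assumes each page has an ordinary <head> tag and no existing <base> element.

```python
import re

# Matches the opening <head> tag (but not <header>), with or without attributes.
HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)
HAS_BASE = re.compile(r"<base\b", re.IGNORECASE)


def add_base_tag(html, base_href):
    """Return html with a <base> element inserted right after the opening <head>."""
    if HAS_BASE.search(html):
        return html  # already has one, leave the page alone
    tag = '<base href="%s" />' % base_href
    # Only the first <head> is touched; pages without a <head> come back unchanged.
    return HEAD_OPEN.sub(lambda m: m.group(0) + "\n" + tag, html, count=1)


page = "<html><head><title>Gallery</title></head><body></body></html>"
print(add_base_tag(page, "http://photos.darcynorman.net/"))
```

Running it over the gallery would just mean reading each menu page, passing it through the function, and writing the result back. The function is safe to run twice, since it skips pages that already have a <base> element.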

Bottom line, Sphider works perfectly for my needs. It’s now powering the site search for my photo gallery site, and works quite well for that. And, it’s going to be available to index any of my other projects if needed.

  1. as would happen when the embedded search JavaScript files are loaded – that activity data could then be tracked/stored/analyzed by Google to better model what you’re interested in, who you know, etc…

DuckDuckGo – search engine that doesn’t track you

I’ve been trying to find ways to reduce the amount of information about me that’s tracked every time I do anything online. I don’t like that every piece of activity is tracked, analyzed, and sold. I’ve said over and over, that if any government or agency had proposed tracking this much data on every citizen, there would be an uproar. But we just shrug it off when it’s done by the big online properties.

So I have just sworn off using the Google search engine. Well, as much as physically possible – it’s still built into so much that it’s absolutely impossible to completely withdraw from Google.

But, I’m going to try switching to DuckDuckGo. Some early tests show that the search engine is decent. The results may be ranked differently than Google’s, but that may be a good thing. What interests me isn’t the search engine so much as the fact that *they give a crap [about privacy](http://donttrack.us/)*.

They have a [pretty detailed description of what they do to prevent personal info leakage](https://duckduckgo.com/privacy.html), which makes me want to use DuckDuckGo far more than any of the others. And, the search engine has some pretty cool features, including [!Bang](https://duckduckgo.com/bang.html).

My biggest concern with DuckDuckGo is how they pay their bills. I’m not seeing any ads on the site, nor in the search results. That’s great, but then how are they paying for bandwidth and infrastructure, and who’s paying the people running it? So far, [it looks to be self-funding](http://duck.co/#Topic/28469000000154001), but how will that be sustainable?

Regardless, it’s an interesting new search engine that puts privacy to the forefront, rather than quietly tracking everything you do while simultaneously shouting “DO NO EVIL.”

I found out about DuckDuckGo [through a thread on Reddit](http://www.reddit.com/r/reddit.com/comments/evirl/google_tracks_you_we_dont_an_illustrated_guide/), where [the developer was responding to additional privacy concerns by modifying the search engine on the fly](http://www.reddit.com/r/reddit.com/comments/evirl/google_tracks_you_we_dont_an_illustrated_guide/c1bbob1). That’s pretty cool stuff.

common words

I just updated the excellent [Relevanssi](http://wordpress.org/extend/plugins/relevanssi/) search index plugin (it makes the search feature of WordPress actually WORK, with relevant results rather than the lame built-in search). It reports on the top words in the search index. I’m a little surprised at the results (but, looking over the words in just this short post, I probably shouldn’t be…).

1. just (1226)
2. like (846)
3. i’m (820)
4. i’ve (675)
5. really (557)
6. new (538)
7. time (517)
8. use (500)
9. stuff (494)
10. got (477)
11. way (474)
12. using (461)
13. pretty (443)
14. blog (441)
15. cool (428)
16. that’s (426)
17. i’ll (408)
18. don’t (388)
19. going (387)
20. update (387)
21. work (376)
22. people (375)
23. things (370)
24. post (368)
25. sure (365)

I’m kinda surprised that “awesome” isn’t high up that list…
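A list like that is easy enough to approximate outside the plugin. This is a rough Python sketch of the general idea (count the words, skip the boring ones), not how Relevanssi actually builds its report. The `top_words` function and the tiny stopword set are made up for illustration.

```python
import re
from collections import Counter

# A deliberately tiny, made-up stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "it", "in", "that"}


def top_words(text, n=25, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword words as (word, count) pairs."""
    # Keep contractions like i'm / i’ve together as a single word.
    words = re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)


sample = "I'm just testing. Just the test, just like that."
for word, count in top_words(sample, n=3):
    print(f"{word} ({count})")
```

Feeding it the text of every post would give a ranking in the same shape as the one above, though the exact counts depend on what the stopword list leaves in.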

Drupal Search Funkiness

I've been noticing that the search feature of this Drupal blog has been acting up for a while – searching for "drupal" turns up only 4 items, but I've written many, many posts mentioning Drupal. I didn't think it was a big deal, but I've actually been getting emails and IMs asking me wtf wrt searching.

So, I dug a bit deeper. Turns out, Drupal is refusing to index my content when cron.php is called. It's called every hour, but the /admin/settings/search status indicator is stuck at:

[Screenshot: Drupal Search Index Not Updating. Taken on 2006/06/07, showing the search index not updating, even though cron.php is called every hour (and I even manually triggered it several times) and the number of items to process is turned down to 10.]

Some poking around on the Drupal site didn't turn up anything useful. I'll keep poking around to hopefully find out wtf is going on with searching. It's a puzzler…

Update: Temporarily fixed. Something's definitely borked. It's only updating the first batch of nodes, even if cron.php is called multiple times. The hack fix involves editing search.module to allow larger batches so all nodes make it into the first run. I added a "2000" item to the $items array on line 217, then cleared the old index by clicking the "Re-index site" button. Manually called cron.php and let it chew, and now all nodes are properly indexed. No idea if I'll have to keep re-indexing. That would be an ugly hack…
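For reference, this is roughly what I'd expect batched indexing to do: every cron run picks up the next batch of not-yet-indexed nodes until nothing is left. The Python below is a toy simulation under that assumption, not Drupal's search.module code. The `run_cron` function, the node ids, and the node count are all hypothetical.

```python
def run_cron(nodes, indexed, batch_size):
    """Index up to batch_size not-yet-indexed nodes; return how many were indexed."""
    pending = [n for n in nodes if n not in indexed]
    batch = pending[:batch_size]
    indexed.update(batch)
    return len(batch)


nodes = list(range(1, 36))  # 35 made-up node ids
indexed = set()
runs = 0

# Keep calling "cron" until a run finds nothing left to index.
while run_cron(nodes, indexed, batch_size=10):
    runs += 1
    print(f"run {runs}: {100 * len(indexed) // len(nodes)}% indexed")
```

With a batch size of 10, the percentage should climb on every run. What I was seeing instead looked like the first batch being processed and then nothing further, which is why bumping the batch size past the total node count worked as a stopgap.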

Update the Second: Looks like everything's updating ok now… I'll try dropping the batch size back down to a sane value to see if it still works (or if it really is just indexing the first batch of records only)

Update the Third: Yeah. All's well now. New content is being automatically indexed, and all old content is properly indexed. Wonder what happened…

[Screenshot: Drupal Search Funkiness – resolved: it's now 100% indexed. no idea what was wrong before…]

Shaking the Google Addiction

I’ve been such a total Google junkie since it kicked all of the other search engines’ collective asses. Nothing else has come close, so I haven’t even bothered looking anywhere else for perhaps a couple of years now. It just hit me that I’m a little uncomfortable with that total reliance on one source (and their algorithms) for my searching.

So, following Mark Evans’ lead, I’m going to try going a week without Google. I’m not approaching this from a “Google is EVIL” angle – I think they’re the exact opposite – they’ve had opportunity to be evil, and have shown that they want to make the effort to be Good. I just need to take a look around to see what else is coming along…

First, I’m going to try Ice Rocket – kinda Google-like, but it’s been doing some cool stuff with RSS and blogs long before The Goog rolled that stuff out.

I think I should poke around and see if the Meta Search Aggregators are progressing. Remember Dog Pile? They were teh cool before Google ruled us all. (heh – just checked and it’s still running! I’ll have to check it out…)

Update: Woah. Just did a quick (Dogpile) search for “metasearch”, and came up with a bunch of candidates:

I haven’t done any research into legitimacy of any of these tools yet, and haven’t tried them out (except for Dog Pile), but there’s the (short, incomplete) list.

Update: Just did a search for “Calgary” on each of these metasearchers, and only 2 engines returned stuff that wasn’t mostly ads, or just piping in Google’s results. Dog Pile and Search AllInOne – of the two, Dog Pile was more useful.
