reclaiming website search

I’ve been withdrawing from relying on Google wherever possible, for various reasons. One place where I was still stuck in the Googleverse was with the embedded site search I was using on my self-hosted static file photo gallery site. That was one of the few places where I couldn’t find a decent replacement for Google, so it stayed there. And I wasn’t comfortable with that – I don’t think Google needs to be informed every time someone visits a page I host[1]. I use that embedded search pretty regularly, and cringe every time the page loads.

There had to be a good search utility that could be self-hosted. I went looking, and tried a few. My requirements were pretty basic – I don’t need multiple administrators, or shards of database replication, or multiple crawling schedulers etc… I don’t want to have to install a new application framework or runtime environment just for a search engine. I want it to be a simple install – ideally either a simple CGI script or something that can trivially drop onto a standard LAMP server.

Today, I installed a website indexer on a fresh new subdomain. Currently, the only website it indexes is darcynorman.net/gallery, but I can add any site to it, and then index and search on my own terms, without feeding data into or out of Google (or any other third party).

The search tool is powered by Sphider and seems pretty decent. It’s a simple installation process, and uses a MySQL database to store the index. Seems pretty fast – on my single-site index, with one user (me).

The biggest flaw I’ve found with Sphider so far is in how it handles relative links. Say you have a website structure like this:

  • index.html
    • page1.html
    • page2.html

If index.html uses a simple relative link like <a href="page1.html">Page 1</a>, Sphider skips it, unless the index.html page has a <base> element to tell Sphider explicitly how to rebuild full URLs from the relative links. Something like this:

<base href="http://photos.darcynorman.net/" />

Sphider can then use that to turn relative links into fully resolved absolute links.

But this is strange – I had 2 choices:

  1. hack the Sphider code to teach it how to behave properly (and then re-hack the code if there’s an update)
  2. update each gallery menu page to add the <base> head element

I chose #2, because I just didn’t have the energy to fix Sphider, and the HTML fix was simple enough. It definitely feels like a bug – there’s no way that editing every page to add a <base> element should be required, but whatever.
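For anyone with more than a handful of pages, option #2 could be scripted instead of done by hand. This is only a minimal sketch: the add_base_tag function name is hypothetical (it is not part of Sphider or any gallery software), and it assumes each page has its opening <head> tag on a line of its own.

```shell
#!/bin/sh
# Hypothetical helper: insert a <base> element right after the opening
# <head> tag of an HTML file. Assumes <head> sits on its own line.
# Files that already contain a <base> element are left untouched.
add_base_tag() {
    file=$1
    base_url=$2

    # Nothing to do if the page already declares a base URL.
    if grep -qi '<base ' "$file"; then
        return 0
    fi

    tmp=$(mktemp) || return 1
    # Copy every line through; after the first <head> line, emit the tag.
    awk -v url="$base_url" '
        { print }
        !inserted && tolower($0) ~ /<head[ >]/ {
            printf "<base href=\"%s\" />\n", url
            inserted = 1
        }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Example (the file name is a placeholder):
#   add_base_tag index.html "http://photos.darcynorman.net/"
```

Running it twice on the same file is harmless, since the second run sees the existing <base> element and skips the file.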

Bottom line, Sphider works perfectly for my needs. It’s now powering the site search for my photo gallery site, and works quite well for that. And, it’s going to be available to index any of my other projects if needed.

  1. as would happen when the embedded search JavaScript is loaded – that activity data could then be tracked/stored/analyzed by Google to better model what you’re interested in, who you know, etc…

how to repair all tables in all databases on a mysql server

This comes in handy, and I have to google it every time I need it[1][2]. So, here’s a copy for reference later…

mysqlcheck --repair --use-frm --all-databases

Run it as root, with MySQL running. It’ll repair every table in every database. Give it time to chew for a while. It spews out the status of every table as it works. Here’s what it found with my Fever˚ database tables (which now work just fine):

dnorman_fever.fever__config
warning  : Number of rows changed from 0 to 1
status   : OK
dnorman_fever.fever_favicons
warning  : Number of rows changed from 0 to 408
status   : OK
dnorman_fever.fever_feeds
warning  : Number of rows changed from 0 to 240
status   : OK
dnorman_fever.fever_feeds_groups
warning  : Number of rows changed from 0 to 305
status   : OK
dnorman_fever.fever_groups
warning  : Number of rows changed from 0 to 17
status   : OK
dnorman_fever.fever_items
warning  : Number of rows changed from 0 to 13660
status   : OK
dnorman_fever.fever_links
warning  : Number of rows changed from 0 to 46208
status   : OK

Better.

Looks like it doesn’t like InnoDB tables, throwing this:

note     : The storage engine for the table doesn't support repair

So, if you’re using MyISAM tables, this should do the trick. Not sure how to fix the InnoDB tables, or if they even need fixing…
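With a lot of databases, the repair output scrolls by quickly, so it can help to pull out just the tables that were skipped. A rough sketch, assuming the output looks like the sample above (a table name on its own line, followed by note/warning/status lines). The list_unsupported_tables name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical filter: read mysqlcheck output on stdin and print the names
# of tables whose storage engine reported that it cannot be repaired.
list_unsupported_tables() {
    awk '
        # Message lines look like "note     : ..." or "status   : OK".
        /^(note|warning|error|info|status) *:/ {
            if (index($0, "support repair") > 0) print table
            next
        }
        # Any other non-empty line starts with the current table name.
        NF > 0 { table = $1 }
    '
}

# Example:
#   mysqlcheck --repair --use-frm --all-databases | list_unsupported_tables
```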

  1. usually coming up with the top-voted answer for a question posted to stackoverflow.com
  2. actually, I use DuckDuckGo, so I get the tip inline in the search results…

Hippie Hosting server resource usage update

The server’s been feeling pokey lately, so I wanted to dig around to see if anything was acting up (or if it was just gremlins in the machine). Did some reporting using sar and got this:

sar -r gets the memory dump. sar -u gets the CPU usage dump.

sar -r > memory.txt
sar -u > cpu.txt

Then, grab the text files, clean them up a bit, bring them into Excel, select the columns, and insert chart. Yeah. The whole Excel thing is probably a non-traditional-linux-admin tool. Whatever. Pretty pictures.
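The "clean them up a bit" step could be done before the files ever reach Excel. A minimal sketch, assuming the usual sar layout (a banner line starting with "Linux", blank lines, whitespace-separated columns, and "Average:" summary rows). The sar_to_csv name is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: turn sar's whitespace-aligned report into CSV.
# Drops blank lines, the "Linux ..." banner, and "Average:" summary rows.
sar_to_csv() {
    awk -v OFS=, '
        NF == 0 || /^Linux / || /^Average:/ { next }
        # Reassigning a field makes awk rebuild the line using OFS.
        { $1 = $1; print }
    '
}

# Example:
#   sar -r | sar_to_csv > memory.csv
#   sar -u | sar_to_csv > cpu.csv
```

On a 12-hour clock the AM/PM marker ends up in its own column, which still lines up with the header row.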

CPU is consistently idle. Memory is consistently not pegged. I think we’re doing ok.

resource management on Hippie Hosting

The Hippie Hosting Co-op server has been humming along for several months now. We’ve had our share of growing pains, and recently we’d been seeing memory usage on the server spiking pretty severely.

For the non-Hippies: the Co-op is run on a MediaTemple (dv) dedicated virtual server, and we have 197 domains for 84 members running on the box. Most are pretty low demand, simple blogs with low traffic. Some are higher demand. But, on average, we should be well within the limits of what the server can do.

To try tracking down the resource issues, I fired up top to see what was sucking up memory. It looked like MySQL was using waaaaay more than it needed to, followed by Apache and PHP (all expected on a server that’s basically just a LAMP webserver…). Thankfully, MediaTemple provides some really great recipes to help track down and solve these issues. I followed the MySQL Tuning howto, and it seems to have stuck. Memory spikes are gone, and we’re back well into the normal range. Awesome.
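For a non-interactive version of what top shows, per-process memory can be rolled up by command name. This is a sketch only, not the MediaTemple recipe: the memory_by_command name is made up, and summing resident set sizes counts shared memory more than once, so treat the totals as a rough ranking.

```shell
#!/bin/sh
# Hypothetical filter: read "RSS-in-KB command" pairs on stdin and print
# the total per command name, largest first.
memory_by_command() {
    awk '
        { rss[$2] += $1 }
        END { for (cmd in rss) printf "%d %s\n", rss[cmd], cmd }
    ' | sort -nr
}

# Example:
#   ps -e -o rss= -o comm= | memory_by_command | head
```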

We’re still seeing higher-than-expected memory usage, and the CPU is running a little higher than I’d like.

Next, we need to try to optimize the apache/PHP side of things a bit more, and maybe come back to MySQL to tweak performance a little. But, we’re back on our feet (touch wood).

Ramping up the Co-op in Hippie Hosting Co-op

The Hippie Hosting Co-op started with the idea of friends and colleagues pitching in to share resources to run a server together. It kind of took off from there. In the months since the launch, it’s grown to over 80 members[1], most of whom were attracted by the idea[2]. And it’s continued to grow.

But, we’re reaching a point where we need to make some decisions as a co-op. For this to work, it has to be more than just a discount web hosting provider[3]. We need to be in this together. For the server to handle the number of users it has now, the costs to the co-op are $186/month[4][5]. We’ve been lucky enough to have a bucket of cash to kick things off, but we’re going to burn through what’s left of that in a couple of months.

The system administration and design work has all been done by the awesome Tim Owens. I’ve tried to pitch in where I can, but it’s easily been 99% Tim (including the actual setup, management of the Plesk interface, and configuration of the server itself). That’s not going to be sustainable. Tim has a job, a life, etc… and tweaking the co-op server is placing a growing demand on him.

We just had a kind of major server outage, caused by some server upgrades that went kerblooey. Luckily, Tim was able to fit some debugging and recovery into his day job, but he also volunteered many hours, working into the wee hours of the night to get the co-op back online. We need more people who can help shoulder the load. I’m just a google jockey, so wasn’t much use in this case.

So, what can we do as members of the HHC? Well, I’m glad you asked. There are 2 main ways:

  1. Fat stacks of Benjamins, yo. The co-op needs cash to stay afloat. The $1/month plans are great, but we need to figure out a more sustainable financial model. What should that be? I don’t know. Annual telethons? Kickstarter campaigns? Bake sales? Higher membership fees? Something else? Some combination of models? Not all members use the same level of resources – there are a few (myself included) who use significantly more resources (bandwidth and disk space) than the typical “let’s set up a new site to see what this is about” type of member.

  2. Hippies pitching in. Barns need raising. Hamsters need feeding. There’s stuff that needs to get done if the co-op is going to stay on the air. You have some skills. We need them. Not sure what skills, or how they’re needed yet, but we’ll all need to pitch in. Maybe you can write documentation. Maybe you can mess around with MySQL and server packages. Maybe you have some other awesomeness. Cool. Roll up your sleeves and dig in.

The goal of the HHC is to provide a place for people to come together to work on stuff – to build their online spaces in a community. We can’t just be a discount webserver provider – there’s more to the co-op idea than just a server. It’s about the members, working together.

Have an idea for how to make the co-op more sustainable? Let the hippies know.

[image: everybody on the bus!]

  1. that’s a pretty big co-op!
  2. and, really, we’ve never heard of many of the members, so the co-op feel needs some cultivation
  3. there are lots of those around, although our insanely cheap minimum membership fees make the HHC pretty attractive
  4. that gets us a MediaTemple dv virtual server, with 2GB of RAM, 100GB of disk space, and 2TB of bandwidth per month. we’re not hitting the limits on those, but need to have room to grow as the hippies start using their websites more
  5. we started with a much cheaper dv server option, but that was quickly outgrown by the number of members that joined so quickly at the beginning

Hippie Hosting server now has room to grow

I’d been getting nervous, seeing the storage on our Hippie Hosting Co-op server filling up. We were over 80% full, with less than 18GB left until we were in serious trouble. So, I did some digging. I was getting ready to start deleting some of the bigger video files in my web hosting account, to make space. Turns out, that wasn’t necessary.

I use this to find directories that have lots of stuff in them:

du -Psckx * | sort -nr

For bonus marks, run that as root. For extra-special bonus marks, set it as an alias – I have this in my .bash_profile on several servers:

alias dus='du -Psckx * | sort -nr'

Start where you suspect trouble. I started in the /var/www/vhosts directory, thinking one of the stinking hippies was filling the place up with their free love. Turns out, the hippies are only using just over 30GB of space, on a 100GB volume. No problem. So, I moved to the root directory / and tried it there. /var was the biggest directory, so I moved into it and ran it from there. 2 big directories in it – /var/lib and /var/www. I’d already checked out /var/www (where /var/www/vhosts is), so I popped into /var/lib and ran it again. /var/lib/mysql. Metric buttloads of files in there. Ruh roh.

I took a look, and saw LOTS of BIG mysql-bin files in there. Some quick poking around the great MediaTemple documentation site, and I came across this piece on mysql binary logging. It’s used for replication and disaster recovery. We don’t replicate the server, and aren’t using the binary logs for disaster recovery. That’s what backups are for. I’ve had to disable binary logging on servers before, so it wasn’t a surprise. Well, I was surprised that it was enabled by default, but yeah…

The reporting command described on the MediaTemple documentation site dumped this:

MySQL binlog consuming 34.98 Gigabytes of disk space
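The MediaTemple command itself isn’t reproduced here, but a similar number can be had with standard tools. A minimal sketch, assuming the logs are the mysql-bin.* files sitting directly in /var/lib/mysql as described above. The binlog_bytes name is hypothetical, and the tiny mysql-bin.index file gets counted along with the logs.

```shell
#!/bin/sh
# Hypothetical helper: print the total size, in bytes, of the mysql-bin.*
# files directly inside a directory (default: /var/lib/mysql).
binlog_bytes() {
    dir=${1:-/var/lib/mysql}
    find "$dir" -maxdepth 1 -type f -name 'mysql-bin.*' -exec wc -c {} + |
        # wc adds a "total" row when given several files; skip it and sum
        # the per-file counts ourselves.
        awk '$2 != "total" { sum += $1 } END { printf "%.0f\n", sum }'
}

# Example, converted to gigabytes:
#   binlog_bytes | awk '{ printf "%.2f GB\n", $1 / (1024 * 1024 * 1024) }'
```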

35 GB. On a 100GB drive. When the next step up in server specs is about triple the monthly cost. Yeah… We don’t need binary logs that badly…

So I disabled binary logging, restarted the mysql server, and nuked the binary log files. Hey, presto! We’re now back under 50% of storage space used, with LOTS of room to grow. Awesome.
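The exact config change isn’t shown above, so this is a guess at the general shape rather than a record of what was done. It assumes binary logging was switched on by a log-bin line in the MySQL config file, and the comment_out_log_bin name is hypothetical. It prints the edited config to stdout instead of touching the file, so the result can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print a copy of a MySQL config file with any
# "log-bin" / "log_bin" setting commented out. Longer option names that
# merely start with log_bin (e.g. log_bin_trust_function_creators) are
# left alone.
comment_out_log_bin() {
    awk '
        /^[ \t]*log[-_]bin[ \t]*(=|$)/ { print "#" $0; next }
        { print }
    ' "$1"
}

# Example (the path is an assumption; check where your config lives):
#   comment_out_log_bin /etc/my.cnf > /tmp/my.cnf.new
#   diff /etc/my.cnf /tmp/my.cnf.new
```

MySQL still needs a restart after the real config is changed, as described above.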

the long tail of hippie hosting sites

There are now over 170 sites hosted on the Hippie Hosting Co-op server. Most are low-traffic, low-resource sites, with a handful of big sites.

The folks with big sites are paying more than the nominal fee, so it all works out. I was surprised to see how few large sites there are on the server. I was also surprised to only be #3 on the list. This whole Reclaim project wants to suck up drive space…

Self-hosting video with WordPress and Hippie Hosting Co-op

I’ve been messing around with hosting my own videos, but that’s one area where the third party services have the functionality nailed. They magically transcode video file formats. They create thumbnails. They provide embeds to make it easy to use the video. But, Jim posted about how he’s having to take on some copyfighting, because YouTube is bending over for some pretty outrageous false copyright claims. The only way to prevent a third party from misusing your content is to not use a third party.

So… I took another look for a decent, fully-featured video hosting plugin for WordPress. And, I found one that looks pretty decent – the creatively named Video Embed & Thumbnail Generator plugin. It integrates with the WordPress media library, uses ffmpeg for transcoding and thumbnail generation, and provides Flash and HTML5 embeds for easy use of the videos.

It looks like ffmpeg doesn’t understand the “up” orientation flag on videos shot with an iPhone (and probably other devices), so the only caveat is that you have to be careful to hold the device so that it’s facing “up” (I actually had to figure out what’s the “proper” way to hold an iPhone – turns out, with the volume buttons on the bottom. oops.). Windows seems to have trouble with this, as well, showing photos and videos upside down…

all along the watchtower
[FMP poster="http://www.darcynorman.net/wp-content/uploads/2012/06/20120616-214038_thumb1.jpg" width="840" height="473"]http://www.darcynorman.net/wp-content/uploads/2012/06/20120616-214038.mov[/FMP]

If you’re using the Hippie Hosting Co-op, ffmpeg is now available. After installing the plugin, set your “path to ffmpeg” setting to point to “/usr/bin”, and you’re off and running. Adjust the default settings however you like (I set mine to embed video 840px wide).

[FMP poster="http://www.darcynorman.net/wp-content/uploads/2012/06/Screen-Recording_thumb4.jpg" width="840" height="525"]http://www.darcynorman.net/wp-content/uploads/2012/06/Screen-Recording.mov[/FMP]

goaccess live webserver stats on hippie hosting

I just installed the GoAccess apache log processing application on the Hippie Hosting Co-op server, giving users a way to watch the stats for their sites in realtime, without having to rely on privacy-invading analytics bugging software. This software works on the command line, so just SSH into your account and type:

goaccess -f statistics/logs/access_log

That tells goaccess to load with the logfile at the specified location. You can feed it other logfiles, but the default one for a Hippie Hosting account should be at statistics/logs/access_log.

It will prompt you for the type of log file. Select NCSA Combined (arrow-down, hit enter to select, then F10 to continue. yeah. intuitive software…)

It’ll give you something like this, updating live: