I’ve been using DuckDuckGo’s site-specific search as a way to make this site searchable after moving from WordPress to Hugo. Since static websites don’t have a database, search is harder to implement, so I’d let that go and just used an embedded search form that fired off a DuckDuckGo query.
Which worked. Mostly. But it also got results from other subdomains at *.darcynorman.net, and didn’t sort them too well. So it was not as useful as the WordPress Relevanssi search plugin had been.
The Hugo Book theme’s built-in search works great on my relatively small thesis website, but I doubted it would scale to handle my blog, with over 8,000 posts going back more than 18 years. Surely the index would be ginormous, and would take hours to generate.
I needed to shift gears to clear my head after some meetings, so I took a stab at adapting the Hugo Book search into my blog to see a) if it would even work, and b) how much overhead it would add to generating the site.
About half an hour of tinkering later, and it works. It works quite well. Searches are fast, and much more relevant than the DDG searches were.
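For context, the Hugo Book theme controls its flexsearch-based search with a site parameter (this is my reading of the theme’s config; adapting the search into a different theme also means copying over the theme’s search layouts and assets):

```
[params]
  # Hugo Book theme: enable the built-in flexsearch search
  BookSearch = true
```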
And, it only adds a few milliseconds to generating my blog website. If I run the build a few times, the run-to-run variation from other stuff running on my computer is larger than the difference between the non-indexed and indexed builds. Awesome.
But. The index is rather large - there are two files, one just over 5 MB, the other just a few KB. So, searching means the browser has to download those index files so the search can run locally. Not ideal, but back when I was running WordPress, it wasn’t unusual for every single page to weigh in at that order of magnitude. Now, the entire website is much leaner, and the search index loads only if I actively go to the Search page.
There’s still a bit of tweaking I want to do with the layout, but it’s working well enough to replace the site search. I’ll want to delete older search index files from the server - they’re all given unique-ified names to avoid browser caching, but that means they will collect on the server with each site build. I don’t want to use rsync --delete, because there are a bunch of files on the webserver that aren’t on my local computers, and that would delete them. Which would be bad. Maybe an easy script to run nightly, deleting older search index files…
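Something like this sketch could do it (hypothetical - the function name is mine, and the `en.search*.js` pattern matches this site’s index filenames). It keeps the newest index file in a directory and deletes the older leftovers from previous builds:

```shell
# Keep only the newest en.search*.js index file in a directory,
# removing the older copies left behind by earlier site builds.
prune_old_indexes() {
  dir="$1"
  # list matching index files newest-first, skip the newest, remove the rest
  ls -t "$dir"/en.search*.js 2>/dev/null | tail -n +2 | while read -r f; do
    rm -f -- "$f"
  done
}
```

It could run nightly from cron on the server, e.g. `prune_old_indexes sites/blog`.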
Update: Easy fix for preventing the search index files from building up on the server. I just added a line to my publishing script, deleting the search index .js files before rsyncing the fresh copy of the website into place:
```shell
#!/bin/sh

figlet "Hugo PUBLISH"

echo "Running hugo to generate blog website"
cd ~/Documents/Blog/blog
time hugo --cleanDestinationDir

echo "Starting rsync publish of blog website to server"

# delete search indices
ssh darcynorman.net 'rm sites/blog/en.search*.js'

# upload generated site via rsync
time rsync -az /tmp/public/ darcynorman.net:sites/blog

echo "Rsync publish completed"
figlet "Complete"
```