This week had risked turning into a disappointment, what with Richard Sennett’s lecture on “The Open City” being cancelled. But last night more than made up for it. The Data Insights Cambridge meetup group, er, met up, for a brilliant talk by Charlie Hull, founder of Flax, about open source search engines.
Instead of talking about how search engines are implemented, as fascinating as that stuff is, he largely talked about industry politics, which turns out to be fascinating too.
If you ignore web search — which, thanks to Google, you kind of have to — the trend in search over the past few years is for the proprietary search engine vendors to get bought up by much larger companies, who only want to sell search as part of a larger package. The famous local example is Autonomy, who got bought by HP. (Are we going to get printers with search capabilities?)
This leaves any large organization that has internal search requirements with a hard choice: either stick with a closed source vendor, who just became less interested in supporting what the organization really needs, or go for an open source option.
As ever with open source, the problem is more one of image and culture than technical capability. From a computer science point of view, “building a search engine is a solved problem”. The major search engines (both proprietary and free) are broadly equivalent at a technical level. What’s interesting is what can be done with them.
People often say “You get what you pay for”, so it can be difficult for someone selling an open source solution to persuade a client that, in reality, there’s less magic going on than there might appear; and that a lower cost solution isn’t necessarily going to be lower quality or missing functionality.
Of course, in these sorts of situations, it’s “non-functionals” that people are really paying for. Although, from some of the horror stories we heard about proprietary Content Management Systems, other meanings of “non-functional” spring to mind…
The winning phrase is “economic scalability”. As an organization’s data management needs grow — and they generally will — it can add servers without having to pay through the nose for extra software licences.
Charlie went through the open source options. Apache Solr/Lucene is the obvious big player. Elasticsearch is what the cool kids — including those in the Cabinet Office — use these days. Xapian is a Bayesian search engine; but it’s not really “the open source Autonomy”, oh no. There are other players too.
He gave some case studies of stuff Flax had been involved in. To me, the one about the Cabinet Office and its website, www.gov.uk, was the most interesting. It made a conversation I’d had the other week with an old colleague, who now works in that area, suddenly make much more sense. (Slightly weird coincidence: someone at the meetup is currently at the same company where the two of us worked, back during the dot com boom.)
The Q&A was good too. There was a question about automated tagging of images based on their content, but it sounds like that’s as much of a pipe dream as it ever was. I remember that being a hot thing back when Microsoft first opened their Cambridge research centre.
The problematic issue of software patents came up. Perhaps that problem will go away more quickly than people expect, if Francis Maude is now constantly dealing with a bunch of developers telling him that patents stifle innovation and allow for price-gouging on the part of big vendors.
Personally, I’m not so great with post-presentation networking. But a bunch of people sat in a pub, nattering about stuff they’re interested in and shared acquaintances? Now that I can handle. And that — as I’ve been trying to explain to various people recently — is more important to Cambridge’s success than is commonly realized.
I got an inside tip on doing presentations too: it’s just a performance, so treat it as such. Perhaps I should revive that plan to take up improv theatre…