
What’s really under the hood of your surveillance software?

If everyone uses the same underlying data management and search engines, then what is the value-add of external surveillance software? This question arises because most external software vendors who need these capabilities in their products, whether in surveillance or not, default to open-source software based on Apache Lucene plus one of a number of extensions to Lucene, the most popular of which is Elasticsearch.

Without going too deep into the world of coding, Lucene is a full-text search library written in Java which makes it easy to add search functionality to an application or website. It does so by adding content to a full-text index, a structure that records which documents contain which terms so that queries do not have to scan every document. Elasticsearch is a distributed, open-source search and analytics engine built on Lucene which allows you to store, search, and analyze huge volumes of data quickly and in near real-time. There are other open-source projects that extend Lucene’s capabilities, including the well-known Apache Solr.
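To make the idea of a full-text index concrete, here is a minimal sketch in Python of an inverted index, the general kind of structure a full-text index is built on. It is a toy illustration only and not Lucene's actual implementation or API; the names `TinyIndex`, `tokenize`, `add` and `search` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index: maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index a document by recording its id under every term it contains."""
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents that contain every term in the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


# Hypothetical sample messages, invented for illustration.
index = TinyIndex()
index.add(1, "Please keep this off the record")
index.add(2, "Quarterly record of trades attached")
index.add(3, "Lunch on Friday?")
```

Searching for "record" looks up one term and returns the matching document ids directly, which is why lookups stay fast as the volume of indexed text grows. Real engines add much more on top (relevance scoring, phrase positions, distribution across machines), none of which this sketch attempts.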

The issue for banks buying surveillance software is that they are essentially buying engines that ingest data and search it. An e-comms surveillance application ingests emails, messages and chat text, and then looks through it for words, phrases or more complex units of meaning or intent which the bank defines as relevant to its search for misconduct. Other surveillance tools work the same way: they ingest data and then use a set of search criteria to isolate potential instances of misconduct.
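The ingest-then-search pattern described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any particular vendor product works: the phrases, message ids and the names `Alert`, `compile_lexicon` and `scan` are all hypothetical, and real tools typically layer far more sophisticated matching on top.

```python
import re
from dataclasses import dataclass


@dataclass
class Alert:
    """One potential hit: which message matched which lexicon phrase."""
    message_id: str
    phrase: str


def compile_lexicon(phrases):
    """Build one case-insensitive, word-bounded pattern per phrase."""
    return {
        phrase: re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        for phrase in phrases
    }


def scan(messages, lexicon):
    """Check every ingested message against every phrase in the lexicon."""
    alerts = []
    for message_id, text in messages.items():
        for phrase, pattern in lexicon.items():
            if pattern.search(text):
                alerts.append(Alert(message_id, phrase))
    return alerts


# Hypothetical lexicon a bank might define, and hypothetical ingested messages.
LEXICON = compile_lexicon(["off the record", "delete this"])
MESSAGES = {
    "m1": "Let's keep this Off The Record.",
    "m2": "Invoice attached, thanks.",
    "m3": "Please delete this after reading.",
}
ALERTS = scan(MESSAGES, LEXICON)
```

Here `ALERTS` ends up holding one alert for message `m1` and one for `m3`, while `m2` passes cleanly. The point of the sketch is that the bank supplies the search criteria; the engine simply applies them to whatever has been ingested.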

But if everyone is using the same underlying, free, open-source search technology (say, Elasticsearch running over Lucene), then what are banks paying for when they buy surveillance tools? When vendors say that they can ingest multiple data types, or ingest and search unstructured data, what they are actually saying is that these free pieces of open-source software can do that. And when they say that they can search huge, unstructured data sets, again, they mean that this can be done in these pieces of free software via established query languages and APIs. And while the queries may be complex, they ultimately derive either from mappings to regulations or from instructions from the banks buying the products.
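As an illustration of what "established query languages" means in practice, the Python sketch below builds a search request body in Elasticsearch's JSON Query DSL, using its `bool` and `match_phrase` query types to find messages containing any of a set of phrases. The field name `body`, the phrases and the helper `build_phrase_query` are assumptions made up for this example; the sketch only constructs and prints the request and does not contact a cluster.

```python
import json


def build_phrase_query(phrases, field="body", size=50):
    """Build a search body matching documents that contain any listed phrase.

    `field` is a hypothetical name for the field holding the message text.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # One match_phrase clause per phrase of interest.
                "should": [{"match_phrase": {field: p}} for p in phrases],
                # At least one of the clauses must match.
                "minimum_should_match": 1,
            }
        },
    }


# Hypothetical phrases supplied by the bank.
QUERY = build_phrase_query(["off the record", "delete this"])
print(json.dumps(QUERY, indent=2))
```

A body like this would typically be sent to an index's `_search` endpoint. The structure is generic; what varies from one buyer to the next is the list of phrases and rules fed into it.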

There are other questions. What are the possible issues with relying on open-source software that in some cases was created more than a decade ago? For example, Solr once had a thriving community of open-source developers who kept the code and documentation up to date. Lately it has been overtaken by Elasticsearch and the Solr community is now less active. So, what happens if your surveillance software provider uses Solr?

And what about security? Open-source code is, by definition, open for anyone to inspect, including attackers looking for weaknesses. If you upload your communications to the cloud for ingestion into a system based on these search engines, how do you know the underlying software (not the vendor’s interface or cloud) is not compromised? The Log4j / Log4Shell security incident was (and remains in some cases) extremely serious, allowing cybercriminals to compromise vulnerable systems with just a single malicious string injected into a logged message. This was a vulnerability in an open-source logging library freely distributed by the Apache Software Foundation. What if Lucene becomes insecure?

None of this is to say that banks should avoid new surveillance products. And obviously there is more to a modern surveillance tool than just data ingestion and search. But it is a call to banks to ask what these systems are based on and what the implications are. It also surely complicates the buy-build question: if the core functions of these systems are free and can be accessed by any coder writing an application, then what is the exact trade-off between buy and build?