
The Problem with Document Search Technology


Author: Mike Adams | Date: 06/26/2018

The most valuable resource any organization has is its institutional knowledge: the combined knowledge and wisdom of your people, plus the thousands upon thousands of records, transactions and documents created in the course of business. Corralling and maintaining that knowledge base is why Enterprise Content Management (ECM) technology is one of the most popular and important investments for companies large and small.

In fact, ECM investment is growing at a rate of 18.7%, with the market expected to reach $66.27 billion by 2021. Unfortunately, the dirty secret of ECM is that as much as 85% of all data stored in documents isn't searchable and is considered "dark data." How can a technology that is supposed to capture an organization's document library and make it searchable be so ineffective?

The Italian Restaurant Problem

The traditional technology used to search most ECM systems is notoriously problematic. ECM applications streamline access to records through keyword and full-text search, allowing employees to get to the information they need directly from their desktops in seconds rather than searching multiple applications or digging through paper records. But keywords are unreliable.
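To see why, consider a minimal sketch of exact-term matching, the kind of search most ECM systems still rely on. The documents and query below are invented for illustration:

```python
# A naive exact-term search; the documents and query are hypothetical examples.

documents = {
    "claim_001": "The vehicle was damaged in a collision on Main Street.",
    "claim_002": "The car was wrecked in a crash downtown.",
}

def keyword_search(query, docs):
    """Return the documents that contain every query term verbatim."""
    terms = query.lower().split()
    return [name for name, text in docs.items()
            if all(term in text.lower() for term in terms)]

# Both records describe the same kind of event, but the exact-term search
# finds only the one that happens to share the searcher's vocabulary.
print(keyword_search("car crash", documents))  # ['claim_002'] -- claim_001 is missed
```

A searcher who types "car crash" never sees the record that says "vehicle" and "collision," even though it is exactly what they were looking for. Multiply that by millions of documents and the scale of the problem becomes clear.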

The most commonly cited research on information retrieval is a 1985 article, "An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System," by David C. Blair and M.E. Maron[i]. Blair and Maron used keyword searches to find documents on a given topic among 40,000 documents comprising 350,000 pages. The research showed that keyword searches found only 20% of the relevant documents in the collection. More recently, a study by the Text REtrieval Conference (TREC)[ii] found that searches using Boolean operators such as AND and OR, combined with "within so many words" proximity constraints, found only between 22% and 57% of all relevant documents across a range of hypothetical topics.

Every organization collects, processes and stores a tremendous volume and variety of data, but what good is that data if only 20% of the relevant information is even discoverable? Whether evaluating loan applications, processing insurance claims or managing shipping invoices, human intervention is often needed to review and make sense of that unstructured data. Unfortunately, this type of handling is usually slow, labor-intensive, costly and error-prone.

And despite their ubiquity in modern life, Google, Bing and other internet search engines are not particularly useful to researchers looking for the most precise answer. We call it the "Italian Restaurant Problem." That's because if you search "Italian restaurant near me" on Google, you'll get hundreds or thousands of answers, but most people will only ever see the handful of results that appear on the first page.

That kind of search result might help you get a quick meal, but Google has no idea what kind of Italian food you like, what dietary restrictions you may have, or other important facts. For that, you will need to search again or visit the web pages for individual restaurants yourself. 

The Paris Hilton Problem

In attacking the ambiguity of human language, computer science is now closing in on what researchers refer to as the "Paris Hilton problem." That is the ability, for example, to determine whether a query is being made by someone trying to reserve a hotel in France or by someone simply passing time surfing the internet looking up has-been celebutantes.

Many search tools today make allowances for synonyms and so-called fuzzy logic that matches related forms of a word. In a cognitive system, however, the key is not the word but the intent. Users no longer just search; they query, often in full sentences. The background application deconstructs the query, drawing on intensive training and refinement, to determine the user's intent and find results that match it.
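As a toy illustration of the difference, the sketch below contrasts word-level fuzzy matching with intent classification. The synonym table and intent rules are invented stand-ins for the trained models a real cognitive system would use:

```python
import difflib

# Word-level matching vs. intent-level matching (illustrative only).
SYNONYMS = {"hotel": ["inn", "lodging"], "reserve": ["book"]}

def fuzzy_match(term, vocabulary, cutoff=0.8):
    """Word-level: catch misspellings and listed synonyms of a single term."""
    candidates = [term] + SYNONYMS.get(term, [])
    matches = set()
    for c in candidates:
        matches.update(difflib.get_close_matches(c, vocabulary, cutoff=cutoff))
    return matches

def classify_intent(query):
    """Intent-level: infer what the user is trying to do. A real cognitive
    system would use a trained model; this hand-written rule stands in for
    that training."""
    q = query.lower()
    if any(word in q for word in ("book", "reserve", "room", "night")):
        return "travel_booking"
    return "celebrity_gossip"

print(fuzzy_match("hotle", ["hotel", "hostel", "inn"]))       # catches the typo
print(classify_intent("reserve a room at the Paris Hilton"))  # travel_booking
print(classify_intent("Paris Hilton latest photos"))          # celebrity_gossip
```

Fuzzy matching still operates one word at a time; only the intent classifier can tell the traveler from the gossip reader, because it looks at what the whole query is trying to accomplish.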

We believe the ultimate answer is a cognitive system that can be trained to consider the particular demands of your inquiry and will learn over time which results are most relevant for your organization. For example, if you research a legal issue, a cognitive system will understand that your search results should not simply feature the most popular link, but the document that applies to your jurisdiction, reflects the most recent precedent and accounts for pending regulatory changes.
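One simplified way to picture that learning is relevance feedback: the system quietly boosts the documents an organization's users keep opening. The document names and scoring weights below are hypothetical:

```python
from collections import defaultdict

click_counts = defaultdict(int)  # accumulated implicit user feedback

def record_click(doc_id):
    """Each time a user opens a result, remember it as a vote for relevance."""
    click_counts[doc_id] += 1

def rerank(results):
    """Blend the engine's base score with what this organization's users
    have historically found useful (0.1 is an arbitrary feedback weight)."""
    return sorted(results,
                  key=lambda r: r["score"] + 0.1 * click_counts[r["id"]],
                  reverse=True)

results = [{"id": "smith-v-jones-2017", "score": 0.82},
           {"id": "state-statute-1998", "score": 0.85}]
record_click("smith-v-jones-2017")  # attorneys keep opening the recent precedent
record_click("smith-v-jones-2017")
record_click("smith-v-jones-2017")
print([r["id"] for r in rerank(results)])  # the learned favorite now ranks first
```

A production cognitive system would learn far richer signals than click counts, but the principle is the same: the ranking adapts to the organization rather than staying frozen at install time.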

Now that we've established the challenge for most ECM systems, it's clear that a cognitive solution involving a combination of artificial intelligence, machine learning and possibly natural language processing will be needed to overcome it. In next week's blog post, I will explain the solution in more depth. In the meantime, I want to remind everyone that at the Gordon Flesch Company, we're on the forefront of cutting-edge technology and can provide solutions for all your computing needs. Contact us today for a free consultation.

Also, look here for exciting upcoming information from the Gordon Flesch Company about how we hope to solve the Italian Restaurant and Paris Hilton problems in one fell swoop.

[i] Blair, David C., and M.E. Maron. "An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System." Communications of the Association for Computing Machinery, 1985.

[ii] Text REtrieval Conference (TREC), https://trec-legal.umiacs.umd.edu/



Written by Mike Adams

Mike has been with the GFConsulting Group, formerly Cambridge Connections, since 2009. His role today as Manager of Development is to identify, develop and promote the value of enhanced IT products and services for our customers, with a focus on cloud and hybrid technologies that provide practical solutions to a variety of business challenges.
