What if you could instantly find anything in terabytes of office files, email archives, and even web data formats? What if you could search for data from anywhere and extend that search capability to all of your colleagues? Think about how much time that would save you. This article breaks down the processes that go into enterprise search, then follows with some more advanced tips.
Indexed search for enterprise search
The key to instant terabyte search is to let the search engine build a search index first. Enterprise search can include indexed or non-indexed search. dtSearch®, for example, offers both. But while unindexed search allows you to query data without the overhead of a search index, it is much slower for multi-user concurrent search across terabytes of data.
So what happens in a search index?
An index is simply an internal guide for the search engine that stores every unique word and number, along with the location of each in the data. For the end user, indexing is easy: just point to the folders and other sources to index, and the search engine does the rest.
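To make the idea of "every unique word and number and the location of each" concrete, here is a minimal sketch of an inverted index in Python. It is an illustration only, not how dtSearch or any particular product stores its index; the function name `build_index` and the sample documents are made up for this example.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each unique word or number to the (document, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Split the lowercased text into runs of letters and digits.
        for position, token in enumerate(re.findall(r"\w+", text.lower())):
            index[token].append((doc_id, position))
    return index


# Hypothetical sample data standing in for real files.
documents = {
    "memo.txt": "Quarterly budget review for 2023",
    "mail.txt": "Budget approved, see review notes",
}
index = build_index(documents)
print(index["budget"])
print(index["2023"])
```

Looking up a word is then a single dictionary access rather than a scan of every file, which is why a prebuilt index makes searching so much faster than unindexed search.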
A single index can contain up to one terabyte of text, and there is no limit to the number of indexes the search engine can create and search simultaneously.
Indexed search is light on resources
Indexed search is resource-efficient. There is no limit to the number of concurrent search threads that can query the same index in a network environment. Online, each search thread can operate completely statelessly, making it very easy to scale on a busy site.
Datasets can continue to evolve
Our sample search engine, dtSearch, supports automatic updating of all indexes using the Windows Task Scheduler to accommodate file changes, new files, and file deletions. Updating indexes does not block searching, so individual and concurrent searching can continue even while indexes are updating.
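As a rough illustration of what an update pass has to detect, the sketch below compares two snapshots of a folder tree and reports new, changed, and deleted files. It assumes modification time is a good enough change signal, and the helper names `snapshot` and `diff_snapshots` are invented for this example; a real search engine's update logic is likely more involved.

```python
import os


def snapshot(root):
    """Return {path: modification time} for every file under root."""
    state = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            state[path] = os.path.getmtime(path)
    return state


def diff_snapshots(old, new):
    """Compare two snapshots and return (added, modified, deleted) path lists."""
    added = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and new[p] != old[p])
    return added, modified, deleted
```

A scheduled job could call `snapshot` on each run, compare it with the snapshot saved from the previous run, and re-index only the paths that `diff_snapshots` reports.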
Different data formats for enterprise search
Ultimately, what makes enterprise search so helpful is that a single search request can cover several different data formats and different data repositories. Here’s how it works.
Viewing files in their native applications
To view a file outside of a search engine, you usually open that file in its native application, such as viewing a Word document in Microsoft Word, an email in Outlook, etc.
Build an index in the search engine
This is great for viewing individual files. But for a search engine to effectively build its index on terabytes of data, the search engine needs a different approach. This approach consists of reading each file in its binary format, bypassing the native application entirely.
The problem is that when you look at most “Office” files and similar documents in binary format, they look like a hodgepodge of binary codes. The main text can range from hard to read to completely impenetrable. Effective text filtering requires applying a file format specification.
File Format Specification
The file format specification for “Office” formats can be hundreds of pages long and varies for different file types. The Microsoft Word file format is very different from the Access format, which is, in turn, very different from the file format for Excel, PowerPoint, OneNote, PDF, emails, HTML, XML, etc. Correctly determining the file format of each binary file is therefore critical.
Do not mistakenly apply a file format extension
However, it is all too easy to misapply a file extension, for example by saving a PDF with a .DOCX extension or a Word document with a .PDF extension. While a mismatched file extension can be accidental, it can also result from a desire to hide a particular file from scrutiny.
The surefire way to determine the file format is for the search engine to look inside each binary file.
After determining the file format from the binary file itself, the search engine can then apply the correct file format specification to parse the full text and metadata of each item. Then the resulting information goes into building the index.
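One common way to "look inside" a file is to check its leading signature bytes rather than its extension. The sketch below does this for three well-known signatures: `%PDF-` for PDF, `PK\x03\x04` for ZIP containers (which DOCX, XLSX, and PPTX use), and the OLE2 compound-file header used by legacy Office formats. The function names are made up for this example, and a production search engine would recognize far more formats and dig deeper than the first few bytes.

```python
# Leading "magic bytes" for a few common containers, paired with a short label.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),  # DOCX, XLSX and PPTX are ZIP containers
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy DOC, XLS, PPT
]


def sniff_format(data: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first few bytes of a file and guess its format."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(8))


# A PDF saved with a .DOCX extension is still recognized as a PDF.
print(sniff_format(b"%PDF-1.7\n..."))
```

Because the check ignores the file name completely, a renamed file is still routed to the right file format specification.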
After a search, the search engine will usually present a “mini-display” showing the search terms in context
The search engine can also display the full text of retrieved files along with highlighted hits. To do this, the search engine will usually return to the binary format version and convert it to HTML for display in a browser window inside the search engine, adding click-through hit navigation for convenience.
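The idea of showing a search term in context with highlighting can be sketched in a few lines. This example assumes plain text as input and uses the HTML `<mark>` element for the highlight; the function name `highlight` and the `context` parameter are illustrative choices, not part of any search product's API.

```python
import html
import re


def highlight(text, term, context=30):
    """Return an HTML snippet around the first hit, with the hit wrapped in <mark>."""
    match = re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - context)
    end = min(len(text), match.end() + context)
    # Escape the surrounding text so it is safe to place in an HTML page.
    before = html.escape(text[start:match.start()])
    hit = html.escape(match.group(0))
    after = html.escape(text[match.end():end])
    return f"{before}<mark>{hit}</mark>{after}"


print(highlight("The Budget was approved", "budget", context=4))
```

A full-document display would apply the same wrapping to every hit and add links that jump from one hit to the next.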
Types of Indexed Enterprise Search
Since indexed search is based on a prebuilt index, there are over 25 different search options available for instant search. These include almost any combination of word and phrase searches, boolean and/or non-boolean search expressions, and two-way or one-way proximity search. The search can cover the full text of the indexed data or focus on specific metadata, such as the subject line of an email.
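Proximity search falls out naturally once word positions are known. The sketch below reads "two-way" as "either word may come first" and "one-way" as "the first word must precede the second"; that reading, and the function names, are assumptions made for this example rather than a description of any product's query syntax.

```python
import re


def positions(text):
    """Map each lowercased word to the list of word offsets where it occurs."""
    found = {}
    for offset, token in enumerate(re.findall(r"\w+", text.lower())):
        found.setdefault(token, []).append(offset)
    return found


def within(pos, first, second, distance, ordered=False):
    """True if `second` occurs within `distance` words of `first`.

    With ordered=True, `second` must come after `first` (one-way proximity).
    """
    for i in pos.get(first, []):
        for j in pos.get(second, []):
            gap = j - i
            if ordered and 0 < gap <= distance:
                return True
            if not ordered and 0 < abs(gap) <= distance:
                return True
    return False


pos = positions("The contract was signed before the deadline")
print(within(pos, "contract", "signed", 3))                # either order
print(within(pos, "signed", "contract", 3, ordered=True))  # signed must come first
print("contract" in pos and "deadline" in pos)             # simple boolean AND
```

Boolean operators reduce to set logic on which documents contain each word, while proximity operators compare the stored positions within a document.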
Beyond word-based search, an indexed search can also encompass number-based queries.
A numeric query can search for specific numbers or numeric ranges, as well as specific dates or date ranges, even if the dates are in different formats, such as 05/07/21 and June 11, 2022. The search engine can also find different patterns of characters and numbers, including regular expressions and digit pattern matching.
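Date range search across mixed formats depends on normalizing each date to a common form first. The following sketch tries a short list of formats in turn; the list itself is an assumption, and it reads 05/07/21 month-first (US style), which a real engine would need to make configurable. Month names are matched using the default English locale.

```python
from datetime import date, datetime

# Candidate formats, tried in order. Month-first is an assumption.
FORMATS = ["%m/%d/%y", "%m/%d/%Y", "%B %d, %Y", "%Y-%m-%d"]


def parse_date(text):
    """Return a date for the first format that fits, or None if none do."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def in_range(text, start, end):
    """True if the text parses as a date between start and end, inclusive."""
    parsed = parse_date(text)
    return parsed is not None and start <= parsed <= end


for raw in ["05/07/21", "June 11, 2022"]:
    print(raw, "->", parse_date(raw), in_range(raw, date(2021, 1, 1), date(2021, 12, 31)))
```

Once every date is a comparable value, a range query is an ordinary comparison, no matter how the date was originally written.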
As the general standard for text files, Unicode covers hundreds of international languages, including English and other European languages, Asian languages, right-to-left languages such as Hebrew and Arabic, and many more. Unicode allows any mixture of languages to coexist in a single document. All of this is in the binary format of a file and therefore available to a search engine.
Advanced Tips for Enterprise Search Engines
The description above covers the basics of how a search engine instantly searches terabytes. The following are more advanced tips.
Tip #1. Black writing on a black background, red writing on a red background, etc. can practically disappear in the native application view of a file. However, since a search engine accesses files in binary format, all of that text remains available to a search engine.
Tip #2. When viewing a file in its native application, it can take an awful lot of clicks in the right order to even know that some metadata is there. But all metadata is on an equal footing in the binary format, which makes all metadata accessible to a search engine.
Tip #3. It’s easy to forget when viewing a document in its final form that tracked changes may still exist in another view of the document. If these are not entirely eliminated from a draft, they will remain accessible to a search engine, both in the search phase and in the display phase of the file.
Tip #4. Have you ever tried to copy what looks like words from a PDF file and got nothing when you tried to paste those words? This is what happens in an “image only” PDF. These PDF files can be mixed with other documents and are very difficult to spot on their own. As these are images only, there is no digital text (other than the filename and metadata). This means that they are effectively empty for a text search engine. But search engines may flag “image only” PDF files at indexing time, telling you that you should run them through an OCR program like Adobe Acrobat and then send them back to the search engine for full-text indexing.
Tip #5. Certain documents such as emails and OCR files may be full of typos. Setting the fuzzy search to a low level, like 1 or 2, will still find words despite common typographical errors. And fuzzy search works in addition to most other search options.
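One way to picture a low fuzziness level is as a limit on how many single-character edits a word may differ by. The sketch below uses Levenshtein edit distance for that; treating "level 1 or 2" as an edit-distance limit is an assumption for illustration, and an actual search engine may define its fuzziness levels differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(query, words, max_edits=1):
    """Return the words that are within max_edits edits of the query."""
    return [w for w in words if edit_distance(query.lower(), w.lower()) <= max_edits]


# "inv0ice" mimics an OCR error, "invoise" a typing error.
print(fuzzy_matches("invoice", ["invoise", "invoice", "inv0ice", "involve"], max_edits=1))
```

Keeping the limit low matters: at higher limits, unrelated words such as "involve" start to match as well.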
Tip #6. A search engine may flag some personal information in files, such as credit card numbers. During the indexing process, the search engine may take a series of numbers that could represent a credit card and run those numbers through a credit card validation algorithm. Identifying where credit card numbers may appear in shared data allows you to separately take steps to address the risk of such exposed personal information.
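The article does not name the validation algorithm, but the Luhn checksum is the standard check digit scheme for payment card numbers, so the sketch below assumes it. The regular expression and the function names are illustrative, and 4111 1111 1111 1111 is a widely used test number, not a real card.

```python
import re


def luhn_valid(number: str) -> bool:
    """Check a candidate card number with the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, digit in enumerate(reversed(digits)):
        if i % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# 13 to 19 digits, optionally separated by single spaces or hyphens.
CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def find_card_like_numbers(text):
    """Return digit runs in the text that pass the Luhn check."""
    return [m.group(0) for m in CANDIDATE.finditer(text) if luhn_valid(m.group(0))]


sample = "Card 4111 1111 1111 1111 on file, ref 1234567890123."
print(find_card_like_numbers(sample))
```

The checksum filters out most random digit runs, such as the reference number in the sample, but a passing number is only a candidate and still needs human review.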
Tip #7. Normally, the search engine returns to the original source of the data to display it with the results highlighted. But if the original data is far from where the search is running, or if the original data may disappear completely, enabling caching will still allow the display of files with highlighted results to work seamlessly. The downside of enabling caching is that it will make the index much larger than it would otherwise be.
Featured image credit: Photo by Vlada Karpovich; pexels; Thanks!