Abstract
Literature review is time-consuming because relevant articles are hard to find, yet it is essential: it lets researchers find solutions to their questions and problems in work already performed and published by others. It is difficult to wade through documents quickly and assess their quality from the title, abstract, or even the full text alone. The human visual system, by contrast, lets us glance at images, infer the main subject of an article, and decide whether it is worth reading further. In some fields, such as biology, figures showing photos of experimental results allow a researcher to quickly judge the quality of the work by its results. This work describes a literature review system that uses content-based image retrieval (CBIR) techniques to search for relevant documents by the content of their figures, refined through relevance feedback rather than keyword-search guesswork. The long-term goal is to use it as a subsystem of a content-based document retrieval system in which figures and their captions are combined with the document's body text. This paper describes how documents are processed to extract the available raster graphics, as well as the text with its layout and formatting information intact. It then describes how each figure is matched to its caption using this layout information. Figure-caption matching is complete; caption-based search is implemented but not yet merged into the system. Two novel modified tf-idf measures under consideration, which use bold/italic text, font size, and document structure to infer text importance rather than relying on term frequency alone, are detailed mathematically and explained intuitively. CBIR queries that consist of multiple images are issued as separate per-image queries, and their results are then merged.
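To make the multi-image query step concrete, the following is a minimal sketch of one way per-image result lists could be merged. The abstract does not state the merge rule, so the rule used here (keep each figure's smallest distance across all queries, then sort) is an assumption, and the names `Ranking` and `merge_rankings` are hypothetical.

```python
from typing import Dict, List, Tuple

# One result list per query image: (figure_id, distance) pairs,
# where a smaller distance means a closer visual match (assumed convention).
Ranking = List[Tuple[str, float]]


def merge_rankings(rankings: List[Ranking], top_k: int = 10) -> Ranking:
    """Merge per-image result lists into one ranked list.

    Assumed rule: a figure's merged score is its best (smallest)
    distance over all of the separate queries.
    """
    best: Dict[str, float] = {}
    for ranking in rankings:
        for figure_id, distance in ranking:
            if figure_id not in best or distance < best[figure_id]:
                best[figure_id] = distance
    # Sort by distance, breaking ties by figure id for a stable order.
    merged = sorted(best.items(), key=lambda item: (item[1], item[0]))
    return merged[:top_k]


results_a = [("fig1", 0.1), ("fig2", 0.4)]
results_b = [("fig2", 0.2), ("fig3", 0.3)]
merged = merge_rankings([results_a, results_b])
```

Taking the minimum distance favors figures that match any one query image strongly; averaging distances instead would favor figures that resemble all query images moderately well.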