PDF Crawler Using Inverted Index and Interval Lists
Abstract
Searching within PDF documents has become indispensable, and a great deal of research has been devoted to storing and processing the index required for search in a simple and effective manner. Stored indexes typically suffer from long access times and require a large amount of storage space. Existing techniques are also limited in that they work only for a small number of PDF documents. To reduce both access time and storage space, we use an inverted index together with an interval list. The inverted index of the keywords found in the PDFs makes it easy to retrieve the documents that contain a given keyword. Each document in the repository is assigned a unique identifier (docID). The interval list records the lower and upper bounds of the documents present in the repository. Together, the inverted index and the interval list make it easy to retrieve information about a PDF document from a keyword. Combining the two can improve the information retrieval (IR) system and allows us to search millions of PDF documents.
Index terms: keyword search, key-phrase search, inverted index, interval list.
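The combination described in the abstract can be illustrated with a minimal sketch. It assumes that the interval list stores each keyword's docIDs as (lower, upper) runs of consecutive IDs, which is one plausible reading of "lower bound and upper bound" and not necessarily the exact scheme used in this work. The names `to_intervals`, `build_index`, and `search` are hypothetical, and the sample documents stand in for text already extracted from PDFs.

```python
from collections import defaultdict


def to_intervals(doc_ids):
    """Collapse docIDs into (lower, upper) runs of consecutive IDs."""
    intervals = []
    for doc_id in sorted(set(doc_ids)):
        if intervals and doc_id == intervals[-1][1] + 1:
            # Extends the current run: only the upper bound changes.
            intervals[-1][1] = doc_id
        else:
            # Starts a new run with equal lower and upper bounds.
            intervals.append([doc_id, doc_id])
    return [tuple(iv) for iv in intervals]


def build_index(documents):
    """Map each keyword to the interval list of docIDs containing it.

    `documents` is an assumed dict of docID -> extracted text.
    """
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: to_intervals(ids) for word, ids in postings.items()}


def search(index, keyword):
    """Expand the keyword's interval list back into individual docIDs."""
    result = []
    for lower, upper in index.get(keyword.lower(), []):
        result.extend(range(lower, upper + 1))
    return result


# Illustrative data only; docID 4 is deliberately absent.
documents = {
    1: "inverted index for search",
    2: "interval list and inverted index",
    3: "keyword search with index",
    5: "pdf keyword search",
}
index = build_index(documents)
```

Under this assumption, a keyword that occurs in many consecutively numbered documents is stored as a single pair of bounds instead of one entry per document, which is where the storage saving would come from. A lookup such as `search(index, "index")` expands the stored intervals back into docIDs.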