Best practices for searchable archive of thousands of documents (pdf and/or xml)

songania · Jul 20, 2012

I'm revisiting a stalled project and looking for advice on modernizing thousands of "old" documents and making them available via the web. The documents exist in various formats, some obsolete: .doc, PageMaker, hardcopy (OCR), PDF, etc. Funds are available to migrate the documents into a "modern" format, and many of the hardcopies have already been OCR'd into PDFs. We had originally assumed that PDF would be the final format, but we're open to suggestions (XML?).

Once all docs are in a common format, we would like to make their contents available and searchable via a web interface. We'd like the flexibility to return only the portions (pages?) of a document where a search "hit" is found (I believe Lucene/Elasticsearch makes this possible?).

Might it be more flexible if the content was all XML? If so, how and where should the XML be stored: directly in a database, or as discrete files in the filesystem? What about embedded images/graphs in the documents?

I'm curious how others might approach this. There is no "wrong" answer; I'm just looking for as many inputs as possible to help us proceed. Thanks for any advice.
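
To make the "return only the pages with a hit" idea concrete, here is a minimal sketch in plain Python. It assumes the text has already been extracted page by page from each document; the document names, the sample text, and the helper functions are all hypothetical. It is a toy inverted index, not Lucene or Elasticsearch, but the point carries over: if each page is indexed as its own unit, a search can return (document, page) pairs instead of whole documents.

```python
import re
from collections import defaultdict

# Hypothetical sample data: each document is a list of page texts,
# e.g. as produced by some upstream OCR/PDF text-extraction step.
DOCUMENTS = {
    "annual-report-1987": [
        "Introduction and summary of the fiscal year.",
        "Rainfall totals by county, with a graph of monthly rainfall.",
        "Appendix and staff listing.",
    ],
    "field-manual": [
        "Safety procedures for field staff.",
        "Measuring rainfall with a standard gauge.",
    ],
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_page_index(documents):
    """Map each term to the set of (doc_id, page_number) pairs containing it."""
    index = defaultdict(set)
    for doc_id, pages in documents.items():
        for page_no, text in enumerate(pages, start=1):
            for term in tokenize(text):
                index[term].add((doc_id, page_no))
    return index


def search(index, query):
    """Return the sorted (doc_id, page_number) pairs containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return sorted(hits)


index = build_page_index(DOCUMENTS)
print(search(index, "rainfall"))
print(search(index, "rainfall gauge"))
```

The design choice that matters here is the unit of indexing (one entry per page rather than one per document). The same choice would apply whichever search engine or storage format is eventually used.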
