I'm creating a section on my intranet to store documents relating to research projects. These documents could be PDFs and/or Word documents, and they could be lengthy and numerous.
One of the main things people are asking for is that this resource be searchable, i.e. the document contents themselves.
I can code a method of linking an uploaded file in the file system to a corresponding database record at upload time, but that doesn't make it searchable. Unless, of course, I get the user uploading the document to enter keywords or something, but this seems a little labour-intensive and prone to error.
Anyone had to do this or something similar or got any suggestions?
Making it searchable, looking for ideas and suggestions
- CoderGoblin
- DevNet Resident
- Posts: 1425
- Joined: Tue Mar 16, 2004 10:03 am
- Location: Aachen, Germany
Keyword entry is one way of doing it.
Another method (if you have the storage space available) is to look at what you can do through the operating system. Example: a person uploads a PDF document, and the document is automatically converted by the operating system to a plain-text file. When performing a search, all the text files are searched (the simplest *nix example would be a grep command). If the text file's name is the id held in the database, you can then present the user with the document in its original format. The downside is that you need additional resources (disk space) on your server. You may also run into problems with the number of files within a single directory and need some scheme to work around that.
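To make the convert-then-search idea concrete, here's a minimal Python sketch. It assumes a hypothetical `text_store/` directory holding one plain-text file per document, named after the document's database id (the directory name and layout are illustrative, not anything from the thread):

```python
import re
from pathlib import Path

# Assumed layout: extracted text lives in text_store/<db_id>.txt,
# mirroring the "grep over converted text files" idea above.
TEXT_STORE = Path("text_store")

def search_documents(term):
    """Return the database ids of every converted document containing term."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    hits = []
    for txt in TEXT_STORE.glob("*.txt"):
        # errors="ignore" skips any bytes the PDF/Word converter mangled
        if pattern.search(txt.read_text(errors="ignore")):
            hits.append(txt.stem)  # the file name *is* the database id
    return sorted(hits)
```

With the ids in hand you would look up the original PDF/Word file in the database and hand that to the user, exactly as described above.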
Yet another method, if the documents are structured in some way, is to automatically extract keywords from the text: keep a list of "available" keywords and search/index the documents according to their contents. Again, you would probably need to perform some conversions, and you would also need to give users the ability to update the keyword list.
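A rough sketch of that keyword approach, assuming a maintained keyword list and a very naive word tokeniser (both are placeholder assumptions; a real system would want stemming and phrase handling):

```python
import re

def extract_keywords(text, keyword_list):
    """Return the subset of keyword_list that occurs in text, case-insensitively."""
    # Naive tokenisation: lowercase runs of letters only (illustrative).
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(k for k in keyword_list if k.lower() in words)
```

The extracted keywords would then be stored against the document's database record, so searching is just a query on the keyword table rather than on the documents themselves.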
These are just a couple of ideas off the top of my head; I'm not sure how practical they are in your situation...
I suppose I could put a text box on the document-upload form, make the user paste the full document text into it, store that text in the database, create a full-text index on it, and search on that. Not being particularly experienced in this area, though: is all that text in the database likely to affect performance?
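For what that full-text-index idea looks like in practice, here is a small sketch using SQLite's FTS5 extension (purely illustrative; an intranet app would more likely use its own RDBMS's feature, e.g. MySQL's FULLTEXT indexes, and the table/column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# doc_id is UNINDEXED so only the body column is full-text indexed.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(doc_id UNINDEXED, body)")

def add_document(doc_id, body):
    """Store the pasted/extracted document text under its database id."""
    conn.execute("INSERT INTO docs (doc_id, body) VALUES (?, ?)", (doc_id, body))

def search(term):
    """Return doc_ids whose body matches the full-text query term."""
    rows = conn.execute("SELECT doc_id FROM docs WHERE docs MATCH ?", (term,))
    return [r[0] for r in rows]
```

The point is that the database engine maintains the inverted index for you, so searching stays fast even as the text volume grows; the main cost is the storage for the index itself.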
- feyd
- Neighborhood Spidermoddy
- Posts: 31559
- Joined: Mon Mar 29, 2004 3:24 pm
- Location: Bothell, Washington, USA
Eventually, it will, but it's far easier to search a full-text field set than to search individual documents. If Word is installed on the server, you can fetch the text straight out of it in PHP using the COM extension. As for PDFs on Windows, I believe there is a tool out there that can do it, probably a port of the Unix one: http://www.google.com/search?hl=en&q=co ... df+%7Etext