Hiya,
I got a PHP script which indexes the web-pages on my intranet.
At the moment I only have it working with .html files, using fopen() and fread(). Is it possible to read the content of a .pdf?
This would make my life easier, as a number of documents to be placed on the intranet are Word documents, and I'd rather convert them to .pdf than .word
cheers,
Stuart

Create PDF files... yes (using PHP's PDF functions). Read PDF files... I don't think that's possible.
You could read in Word files if you use PHP's COM functions, but it would be tricky, frustrating, and require PHP to be running on a Windows server.
*wanders off realising he hasn't really helped here*

luckily I have PHP on Win2003
google can index .pdf AND convert them to html...
so it must be possible - even if I have to use a different scripting language, I suppose...
but seeing as how I've built the rest of the intranet on PHP/MySQL technology, i would prefer to keep to that for simplicity.
and I just realised the thread title is wrong... dammit...

why read them? just have the browser load the pdf.

i want to index them for an intranet search engine

this is one thing I haven't worked with yet. so with that said, what if you used
pdf_open_file()
and just open it? would it give you anything?

It looks like pdf_open_file() is used in the CREATION of PDFs.
It looks like PDFlib+PDI offers support for parsing PDF files. I've never done it, but you might check there.
http://www.pdflib.com/products/pdflib/pdi.html

Aaron, that's right, the PHP PDF functions are for creation only (or at least that's my understanding too).
Nice link though. Especially as the API supports PHP. :thumbup:

but what if you tried it without the new() on it? just open it and see what it does.

just had a look at the php.net manual:
pdf_open_pdi (http://uk.php.net/manual/en/function.pdf-open-pdi.php)
looks good...

Just opening it will likely return a pointer (like when you open a text file with fopen() you just get a pointer to use in other functions).
Also, opening the file is one thing, but being able to understand the contents is another. The file would be full of odd commands, numbers, codes and god knows what else. You have to have something which can parse the file and understand its contents.
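
To illustrate the point about fopen() only handing back a pointer, here is a minimal sketch. The helper name readPdfHeader() and the stand-in file contents are made up for illustration; the only PDF fact it relies on is that a PDF file starts with "%PDF-" followed by a version number. It shows that fread() gives you raw bytes, which still need a parser before they are useful for indexing.

```php
<?php
// Hypothetical helper (not from the thread): read the first bytes of a file
// and return them if they look like a PDF header, otherwise null.
function readPdfHeader(string $path): ?string
{
    $handle = fopen($path, 'rb');   // fopen() only returns a handle (resource)
    if ($handle === false) {
        return null;
    }
    $bytes = fread($handle, 8);     // fread() returns raw bytes, not parsed text
    fclose($handle);

    // A PDF file begins with "%PDF-" followed by a version number.
    return strncmp((string) $bytes, '%PDF-', 5) === 0 ? $bytes : null;
}

// Build a tiny stand-in file so the sketch runs without a real document.
// The content below is invented and is not a valid, complete PDF.
$path = tempnam(sys_get_temp_dir(), 'pdf');
file_put_contents($path, "%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n");

$header = readPdfHeader($path);
echo $header === null ? "not a PDF\n" : "header: $header\n";
unlink($path);
```

Everything after the header in a real file is objects and (often compressed) streams, so echoing it between HTML tags would not give readable text.
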
Be interested if you get this working somehow to see how you did it though

yeah, i know....
yikes!!
how does google do it

google will use a library function to do it too. Check out Aaron's link.

Originally posted by torrent
Just opening it will likely return a pointer (like when you open a text file with fopen() you just get a pointer to use in other functions).
but once you open it doesn't the content end up in a variable? this way you can just stick the contents between some html tags and echo it? well, after thinking about it, you would still need something similar to fread for pdf, wouldn't ya?
yeah, I would be very curious as to how you go about it HK.

i find it kinda funny that it is so easy to CREATE a .pdf document when you have to pay Adobe for their editor, but it seems to be more difficult to READ it when the Adobe Reader is free?
i'm going to give it a go sometime tonight and let you know...

Ask yourself which makes the better business sense? The more people they can get to want to receive documents in pdf format, the more editors they sell. Therefore give away the read client for free. Make it accessible.
Besides, it's no different to reading any file. You want to read an XML file, you still have to go through an XML parser, or code the parsing rules in PHP yourself. You still have to know the rules of the document structure.
This is exactly the same for PDF. The PDF documents will have their own structure and formatting tags, etc. You will have to know how to parse them all, and I believe the only efficient and easy way of doing that is to use a purpose-built library (parser).
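
The XML comparison can be sketched like this. It is only an illustration of "let a parser that knows the format do the work", using PHP's SimpleXML extension; the helper name extractIndexText() and the page/title/body element names are assumptions made up for the example. For PDF you would swap in a PDF-aware library in the same role.

```php
<?php
// Hypothetical helper (not from the thread): hand the raw text to a parser
// that knows the format's rules, then pull out the text worth indexing.
function extractIndexText(string $xml): ?string
{
    libxml_use_internal_errors(true);     // collect parse errors quietly
    $doc = simplexml_load_string($xml);   // returns false on malformed input
    if ($doc === false) {
        return null;
    }
    // Element names are assumptions chosen for this made-up document.
    return trim((string) $doc->title . ' ' . (string) $doc->body);
}

// Made-up document for illustration.
$xml = '<page><title>Staff handbook</title><body>Holiday policy and forms</body></page>';
echo extractIndexText($xml), "\n";
```
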
Good luck though and keep us informed how you get on.

Where did you get your indexer? Or did you write it yourself? Is it available online anywhere for download?

You find a solution for this Horus?

Hello?

and you replied why? this is an old thread.

I wasn't sure if Horus_Kol saw my post about where he found his indexer or if it's available for download.

you can write one, it is at php.net. some of the users' comments under opendir() will give you some code.

i wrote my own - but it has a problem with https - not good for my intranet site...
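
For anyone wanting a starting point along the lines of the opendir() suggestion, here is a rough sketch of the file-gathering half of an indexer. It is not Horus_Kol's script; the function name listIndexableFiles() and the extension list are assumptions for illustration. It walks the filesystem directly rather than fetching pages over http/https.

```php
<?php
// Hypothetical helper (not from the thread): recursively collect files under
// $dir whose extension (case-insensitive) appears in $extensions.
function listIndexableFiles(string $dir, array $extensions): array
{
    $found = [];
    $handle = is_readable($dir) ? opendir($dir) : false;
    if ($handle === false) {
        return $found;                       // unreadable directory: skip it
    }
    while (($entry = readdir($handle)) !== false) {
        if ($entry === '.' || $entry === '..') {
            continue;                        // skip the self/parent entries
        }
        $path = $dir . DIRECTORY_SEPARATOR . $entry;
        if (is_dir($path)) {
            // Descend into subdirectories and merge their matches.
            $found = array_merge($found, listIndexableFiles($path, $extensions));
        } elseif (in_array(strtolower(pathinfo($path, PATHINFO_EXTENSION)), $extensions, true)) {
            $found[] = $path;
        }
    }
    closedir($handle);
    sort($found);
    return $found;
}

// Example call (the path is a placeholder, adjust to your own document root):
// print_r(listIndexableFiles('/path/to/intranet', ['html', 'pdf']));
```

Each path returned could then be passed to whatever reader fits its type: fopen()/fread() for .html, a PDF-aware library for .pdf.
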