Text File Crawling
Moderator: General Moderators
- beautifuldawn
- Forum Newbie
- Posts: 4
- Joined: Sat Aug 22, 2009 7:29 pm
Text File Crawling
Greetz,
I am looking for a working PHP function that can crawl a text file. For simplicity's sake, let's say I have the following text file:
// mytextfile.txt
// It has 6 lines, one sentence per line!
--------------------------------------------------------------------------------------------------
My favorite color is green.
My car is red.
My favorite car is blue.
My bike is red.
I love my green car.
Yellow is my favorite color.
--------------------------------------------------------------------------------------------------
// These 6 lines above are the only text in the file; the first line is line 0, obviously!
Now what I want to do in PHP is read the entire text file line by line, search it (case-insensitively) for a certain word or phrase, and, if found, echo the entire matching line (the contents, not the line number) to the screen. Do this for each line where the search term is found. So for example:
$search = "green";
// Should echo lines 0 and 4, because the word "green" exists in both lines!
My favorite color is green.
I love my green car.
$search = "yellow";
// Should echo line 5, because the word "yellow" exists only in line 5 (the 6th sentence)!
Yellow is my favorite color.
$search = "my favorite";
// Should echo lines 0, 2 and 5, because the phrase "my favorite" exists in all these lines!
My favorite color is green.
My favorite car is blue.
Yellow is my favorite color.
--------------------------------------------------------------------------------------------------
Can someone show me working code for this?
All help appreciated!
Dawn
Re: Text File Crawling
This should do it:
Code:
// file() reads the file into an array of lines;
// stristr() does a case-insensitive substring search.
foreach (file('filename.txt') as $line) {
    if (stristr($line, $searchterm) !== false) {
        echo $line;
    }
}
- beautifuldawn
- Forum Newbie
- Posts: 4
- Joined: Sat Aug 22, 2009 7:29 pm
Re: Text File Crawling
And indeed it does, Jack! Wonderful, and many thanks ...
I already had a working function, but it could only tell me how many times a certain word/phrase exists (if at all) and at what position. I never could manage to get it to show the entire line instead of just the position of the word. Now I can. I have the entire KJV Bible and the Pickthall Quran in 2 large text files, line by line, verse by verse so to speak. I use this function to crawl both files, return the verses where the search term appears, compare the results between the two "books", and have these weighted by a "pattern matching" neural network I am running at home.
This works great!
Dawn
Re: Text File Crawling
Cool, no problem.
- John Cartwright
- Site Admin
- Posts: 11470
- Joined: Tue Dec 23, 2003 2:10 am
- Location: Toronto
- Contact:
Re: Text File Crawling
You may also look into matching your entire word list against the entire file at once with preg_match_all() and the PREG_OFFSET_CAPTURE flag. I'm not exactly sure whether that would be faster or slower, though.
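A minimal sketch of that idea, searching the whole file in one call and then recovering the full line around each hit from its byte offset (the file name and the pattern "green" are placeholders here):

```php
<?php
// Load the whole file as one string and match against it once.
$contents = file_get_contents('filename.txt');
$matches  = [];

// The 'i' modifier makes the match case-insensitive;
// PREG_OFFSET_CAPTURE records the byte offset of every hit.
preg_match_all('/green/i', $contents, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as [$hit, $offset]) {
    // Walk back to the previous newline and forward to the next one
    // to recover the complete line containing this match.
    $start = strrpos(substr($contents, 0, $offset), "\n");
    $start = ($start === false) ? 0 : $start + 1;
    $end   = strpos($contents, "\n", $offset);
    $end   = ($end === false) ? strlen($contents) : $end;
    echo substr($contents, $start, $end - $start), "\n";
}
```

Note this prints a line once per match, so a line containing the term twice would appear twice; deduplicating by line offset would fix that.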
- beautifuldawn
- Forum Newbie
- Posts: 4
- Joined: Sat Aug 22, 2009 7:29 pm
Re: Text File Crawling
Well,
As I wrote before, I'm using this for a "pattern matching" neural network application, running some "data mining" functions as a test on the entire English Pickthall Quran, stored in 1 plain text file, line by line, verse by verse. So far, with the function from Jack (see above), it works fine, for either single words or whole or partial phrases. And it is fast enough: processing about 38,000 categories in less than 20 seconds.
But there are also issues: when, for example, $searchterm = "allah", the word appears at least once or twice in almost every single verse, all 6229 verses! So eventually, even though I am testing here on XAMPP with a 180-second (3 minute) server timeout, it stops and more or less crashes. This is due to the server timeout, I must say.
Eventually this will run from a private AI supercomputer for more speed and power, made of 3 PlayStation 3s clustered together, running Linux. But the main console that I am building now, the UI terminal so to speak, is done with PHP and Flash. So the terminal computer will be a normal Windows machine, or anything that can run Flash and connect online to this supercomputer. I believe that will solve a lot of the speed problems I already face, but I also understand that the limits of my PC are a major factor in these issues.
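One way around the timeout described above, sketched here with a placeholder file name and search term: stream the file line by line with fgets() instead of loading it all at once with file(), and lift the script's execution time limit.

```php
<?php
// Disable PHP's execution time limit for this long-running script
// (has no effect on a separate web-server timeout, if one is set).
set_time_limit(0);

$handle = fopen('quran.txt', 'r');   // placeholder file name
if ($handle === false) {
    die('Cannot open file');
}

// fgets() reads one line at a time, so memory use stays constant
// no matter how large the file is.
while (($line = fgets($handle)) !== false) {
    if (stristr($line, 'allah') !== false) {   // placeholder term
        echo $line;
    }
}
fclose($handle);
```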
Dawn
- John Cartwright
- Site Admin
- Posts: 11470
- Joined: Tue Dec 23, 2003 2:10 am
- Location: Toronto
- Contact:
Re: Text File Crawling
You most certainly won't need a supercomputer to handle the searching in a timely manner. If performance is becoming an issue, you should import your files into a normalized database structure. From there, a full-text search would be quite fast.
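A rough sketch of that database approach, assuming a hypothetical MySQL table `verses` with a TEXT column `body` carrying a FULLTEXT index (table name, column name and credentials are illustrative only):

```php
<?php
// Assumed schema, for illustration:
//   CREATE TABLE verses (id INT AUTO_INCREMENT PRIMARY KEY,
//                        body TEXT, FULLTEXT (body));
$pdo = new PDO('mysql:host=localhost;dbname=books', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// MATCH ... AGAINST uses the FULLTEXT index instead of scanning
// every row, which is what makes repeated searches fast.
$stmt = $pdo->prepare(
    'SELECT body FROM verses WHERE MATCH(body) AGAINST (:term)'
);
$stmt->execute([':term' => 'green']);   // placeholder search term

foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $verse) {
    echo $verse, "\n";
}
```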
Re: Text File Crawling
I concur.
With a well indexed database, you'll likely reduce the execution time drastically.
Re: Text File Crawling
It doesn't even have to be a database; indexing of any kind will boost your performance tremendously. Check out Lucene and Sphinx for general-purpose text indexing.
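To illustrate the indexing idea itself (this is not how Lucene or Sphinx work internally, just a toy sketch with a placeholder file name): build a small inverted index mapping each lowercased word to the line numbers it occurs on, so repeated searches become array lookups instead of full scans.

```php
<?php
$lines = file('filename.txt');   // placeholder file name
$index = [];

// Build the inverted index: word => set of line numbers.
foreach ($lines as $n => $line) {
    $words = preg_split('/\W+/', strtolower($line), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $index[$word][$n] = true;   // array key dedupes per line
    }
}

// Lookup: one array access per word, then print the matching lines.
foreach (array_keys($index['green'] ?? []) as $n) {
    echo $lines[$n];
}
```

The trade-off is paying the indexing cost once up front; for a file that is searched many times, as in this thread, that cost amortizes quickly.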
- beautifuldawn
- Forum Newbie
- Posts: 4
- Joined: Sat Aug 22, 2009 7:29 pm
Re: Text File Crawling
The PS3X supercomputer is actually a project of its own; this small program is just one of the many that will run on it, and it will run many programs at the same time. It also works with SAPI 5 TTS and SR, which take some memory too. But I agree that for comparing and searching a large file of data, a normalized DB is the best approach. At this moment the main module of the program engine has:
1 main db (MySQL), which can progressively connect to any second, third ... nth db if needed. So if the main db doesn't have a matching category, it will then try all the other dbs. This way it can work with as many dbs as a certain project needs. The main db is a normalized (XML/AIML) driven db running AIML version 3.0. Each category can be plain text, a definition, chat, or even a small program of its own, running a PHP and SimpleXML combination to interact with all kinds of data, from NLP user input to external media files. There are also 3 flat, text-based (XML/AIML) dbs holding, at this moment, the KJV Bible, the Pickthall Quran and a global dictionary. Besides this, it also reads from and writes to text files (TXT), logfiles, word dictionaries, word lists, facts, poetry and Japanese haiku ... So while the main program runs the "data mining" functions, there is still room for interacting with about 10,000 people at the same time!
For my project I've split the processes into an STM and an LTM process: Short Term Memory and Long Term Memory. Math, chat, basic functionality, etc. work with the STM, and the major "static" data are stored in the LTM. I've set it up this way because having to "reload" 40,000 categories every time you change just 1 character ain't fun hehehe ...
Dawn