I should have explained what I wanted to do first. Basically, I'm doing a project for my university to develop an application/system in PHP that compares two PHP files and finds any similarities between them. There are two main classes needed for this system:
1. A tokeniser class that, once the files are selected, generates a set of tokens similar to the following:
Java source code                              Generated tokens
public class Count {                          BEGINCLASS
  public static void main(String[] args)      VARDEF,BEGINMETHOD
      throws java.io.IOException {
    int count = 0;                            VARDEF,ASSIGN

    while (System.in.read() != -1)            APPLY,BEGINWHILE
      count++;                                ASSIGN,ENDWHILE
    System.out.println(count+" chars.");      APPLY
  }                                           ENDMETHOD
}                                             ENDCLASS
But the tokens would be for a PHP class instead of a Java one.
2. The tokens are then passed to the second class, which will be the greedy string tiling (GST) algorithm implemented in PHP to compare the two token sequences.
The GST has two parts: scanpattern and markarrays. The original pseudocode for it is as follows:
length_of_tokens_tiled := 0
Repeat
    maxmatch := minimum-match-length
    Scanpattern:
    starting at the first unmarked token of P, for each P[p] do
        starting at the first unmarked token of T, for each T[t] do
            j := 0
            while P[p+j] = T[t+j] AND unmarked(P[p+j]) AND unmarked(T[t+j]) do
                j := j + 1
            if j = maxmatch then add match(p, t, j) to list of matches of length j
            else if j > maxmatch then start new list with match(p, t, j) and maxmatch := j
    Markarrays:
    for each match(p, t, maxmatch) in list
        if not occluded then
            for j := 0 to maxmatch - 1 do
                mark_token(P[p+j])
                mark_token(T[t+j])
            length_of_tokens_tiled := length_of_tokens_tiled + maxmatch
Until maxmatch = minimum-match-length
Which gives the Java pseudocode:
Greedy-String-Tiling(String A, String B)
{
    tiles = {};
    do
    {
        maxmatch = MinimumMatchLength;
        matches = {};
        Forall unmarked tokens A[a] in A
        {
            Forall unmarked tokens B[b] in B
            {
                j = 0;
                while (A[a+j] == B[b+j] && unmarked(A[a+j]) && unmarked(B[b+j]))
                    j++;
                if (j == maxmatch)
                    matches = matches + match(a, b, j);
                else if (j > maxmatch)
                {
                    matches = {match(a, b, j)};
                    maxmatch = j;
                }
            }
        }
        Forall match(a, b, maxmatch) ∈ matches
        {
            For j = 0 ... (maxmatch - 1)
            {
                mark(A[a+j]);
                mark(B[b+j]);
            }
            tiles = tiles ∪ match(a, b, maxmatch);
        }
    } while (maxmatch > MinimumMatchLength);
    return tiles;
}
I think A and B are supposed to be arrays of tokens, and a and b are positions in those arrays, so the "forall ... in A" loop should mean going through the tokens in array A one position at a time. Looking at it now, I think I need a variable $a to hold the current position in array A.
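To make that reading concrete, here is a rough PHP sketch of the tiling loop. It is only one possible translation of the pseudocode above, not a tested implementation: the function name greedyStringTiling, the $minMatch parameter and the shape of the returned tiles are my own choices, and the "occluded" check is taken to mean "some token of the match has already been marked".

```php
<?php
// Hypothetical sketch: greedy string tiling over two arrays of token strings.
function greedyStringTiling(array $A, array $B, int $minMatch = 3): array
{
    $markedA = array_fill(0, count($A), false);
    $markedB = array_fill(0, count($B), false);
    $tiles = [];

    do {
        $maxMatch = $minMatch;
        $matches = [];

        // Scanpattern: collect the longest common runs of unmarked tokens.
        for ($a = 0; $a < count($A); $a++) {
            if ($markedA[$a]) continue;
            for ($b = 0; $b < count($B); $b++) {
                if ($markedB[$b]) continue;
                $j = 0;
                while ($a + $j < count($A) && $b + $j < count($B)
                    && $A[$a + $j] === $B[$b + $j]
                    && !$markedA[$a + $j] && !$markedB[$b + $j]) {
                    $j++;
                }
                if ($j === $maxMatch) {
                    $matches[] = [$a, $b, $j];
                } elseif ($j > $maxMatch) {
                    $matches = [[$a, $b, $j]]; // longer run found: start a new list
                    $maxMatch = $j;
                }
            }
        }

        // Markarrays: turn every match that is not occluded into a tile.
        foreach ($matches as [$a, $b, $len]) {
            $occluded = false;
            for ($j = 0; $j < $len; $j++) {
                if ($markedA[$a + $j] || $markedB[$b + $j]) { $occluded = true; break; }
            }
            if ($occluded) continue;
            for ($j = 0; $j < $len; $j++) {
                $markedA[$a + $j] = true;
                $markedB[$b + $j] = true;
            }
            $tiles[] = ['a' => $a, 'b' => $b, 'length' => $len];
        }
    } while ($maxMatch > $minMatch);

    return $tiles;
}

// Made-up token arrays, just to exercise the function.
$A = ['BEGINCLASS', 'VARDEF', 'ASSIGN', 'APPLY', 'ENDCLASS'];
$B = ['VARDEF', 'ASSIGN', 'APPLY', 'BEGINCLASS', 'ENDCLASS'];
$tiles = greedyStringTiling($A, $B, 2);
print_r($tiles);
```

With these two arrays and a minimum match length of 2, the only tile should be the run VARDEF, ASSIGN, APPLY, starting at position 1 in $A and position 0 in $B. The bounds checks in the while loop are an addition the pseudocode leaves implicit.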
Hope this sheds some more light on what I am doing.
By the way, does anyone know how to create a token set for PHP?
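One possible starting point might be PHP's built-in token_get_all() function (from the tokenizer extension), with its fine-grained tokens mapped down to coarse ones like those in the Java example. The sketch below is only a guess at how that could look: the coarseTokens name and the mapping are made up for illustration, it only covers classes, functions, while loops and simple assignments, and it leaves out APPLY and VARDEF altogether.

```php
<?php
// Hypothetical helper: reduce PHP source code to coarse structural tokens.
function coarseTokens(string $source): array
{
    // Which keyword opens which kind of block (names chosen for illustration).
    $openers = [T_CLASS => 'CLASS', T_FUNCTION => 'METHOD', T_WHILE => 'WHILE'];
    $out = [];
    $stack = [];      // kinds of the blocks that are currently open
    $pending = null;  // keyword already seen, still waiting for its '{'

    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Array tokens have the form [id, text, line].
            $id = $token[0];
            if (isset($openers[$id])) {
                $pending = $openers[$id];
                $out[] = 'BEGIN' . $pending;
            } elseif ($id === T_INC || $id === T_DEC) {
                $out[] = 'ASSIGN';
            } elseif ($id === T_CURLY_OPEN || $id === T_DOLLAR_OPEN_CURLY_BRACES) {
                $stack[] = 'BLOCK'; // "{$x}" inside a string also closes with '}'
            }
        } elseif ($token === '=') {
            $out[] = 'ASSIGN';
        } elseif ($token === '{') {
            $stack[] = $pending ?? 'BLOCK';
            $pending = null;
        } elseif ($token === '}') {
            $kind = array_pop($stack);
            if ($kind !== null && $kind !== 'BLOCK') {
                $out[] = 'END' . $kind;
            }
        } elseif ($token === ';') {
            $pending = null; // e.g. a brace-less while or an abstract method
        }
    }
    return $out;
}

$source = '<?php class Count { function run() { $count = 0; while ($count < 3) { $count++; } } }';
echo implode(',', coarseTokens($source)), "\n";
```

Compound assignments such as += and uses of the class keyword outside a declaration (Foo::class, for instance) are not handled properly here, so the mapping would need a lot more work before it could feed a real comparison.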
Clevie