
REGEX prob for catching three ways of input

Posted: Fri Aug 22, 2008 2:37 am
by DavidTheSlayer
Hi everyone, this is my first post, and I hope to be around here for a long time... :D
I have a problem where the user can input some text or numbers, but only in three formats; any other format has to result in an error. I need the match to be case-insensitive and to support Unicode (Japanese characters, for example).

If there is no match, I would like to display an error for the user to correct. If all is good, I will explode the user's input into an array using a comma as the delimiter.

Here's what I have worked on...

Firstly
The examples of the three valid formats of input...

sometext
sometext,sometext
somenumber.somenumber sometext

Further examples
somenumber.somenumber sometext, sometext
sometext,somenumber.somenumber sometext
sometext, somenumber.somenumber sometext

I use http://www.lumadis.be/regex/test_regex.php for my testing, including testing on my test page to make sure.

The user's input comes via a text box on a form. I run trim, strip_tags and stripslashes first, then the regex.

Secondly

My regexes...
I tried separating them, testing them, and then combining them, but I can't get the input to validate correctly :banghead: .

Code: Select all

/[a-z-A-Z]+|[a-zA-Z]\,\s[a-zA-z]+|[1-9]\.[1-9]\s[a-zA-Z]+/iu

Code: Select all

/[a-z-A-Z]+|[a-zA-Z]\,\s[a-zA-z]+|[1-9]\.[1-9]\s[a-zA-Z]+/

Code: Select all

/^[^a-z-A-Z]+|^[a-zA-Z]\,\s[a-zA-z]+|^[1-9]\.[1-9]\s[a-zA-Z]+/iu
Each entry needs a limit of 60 characters, but I test for that elsewhere.
I hope someone can help, as I've spent a week trying to get this to work.

Thanks in advance for any replies.

Re: REGEX prob for catching three ways of input

Posted: Fri Aug 22, 2008 4:46 am
by prometheuzz
Since the input is of variable length AND the order of the numbers and words is not fixed, a classic one-line match isn't practical in this case.
What you want is really a small compiler that validates your input string.

So, I recommend that you "walk" through your input string and match the tokens in it. For each matched token, remember its length, and once the entire input string has been processed, compare the length of the input string with the sum of the lengths of all tokens. If these two numbers are the same, your input string is valid; if not, it's invalid.

You basically have four different tokens (everything is in pseudo-regex-code!):

Code: Select all

/* WORD   */   [a-zA-Z]+
/* NUMBER */   [0-9]\.[0-9]
/* COMMA  */   ,
/* SPACE  */   \s
 
Now, looking at your valid example strings, a valid token is:

Code: Select all

// a WORD must be followed by a COMMA or 'the end of the string'
WORD(?=COMMA|$)
 
// a NUMBER must be followed by a SPACE
NUMBER(?=SPACE)
 
// a SPACE must be followed by a WORD or a NUMBER
SPACE(?=WORD|NUMBER)
 
// a COMMA must be followed by a SPACE, a NUMBER or a WORD
COMMA(?=SPACE|NUMBER|WORD)
If you combine these four "token rules" into one regex and, for example, perform a preg_match_all(...) on the input string, then adding up the lengths of all the matched strings tells you whether the input string is valid.

I have tested this and it worked fine (the first 6 tests are based on your example input strings):

Code: Select all

1) Is 'sometext' valid ? true
 2) Is 'sometext,sometext' valid ? true
 3) Is '1.2 sometext' valid ? true
 4) Is '3.4 sometext, sometext' valid ? true
 5) Is 'sometext,5.6 sometext' valid ? true
 6) Is 'sometext, 7.8 sometext' valid ? true
 7) Is 'sometext 7.8 sometext' valid ? false
 8) Is 'sometext, 7.8, sometext' valid ? false
 9) Is 'sometext, , 7.8 sometext' valid ? false
10) Is 'sometext,, 7.8 sometext' valid ? false
Good luck!
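
As a minimal sketch of how the four token rules above might be combined in PHP: the function name `is_valid_input` is made up for illustration, and the pattern is simply the four rules joined with `|`, with `/i` for case-insensitivity. It is an illustration of the idea, not a tested production validator.

```php
<?php
// Hypothetical helper: returns true when the matched tokens cover the whole input.
function is_valid_input($input)
{
    // WORD(?=COMMA|$) | NUMBER(?=SPACE) | SPACE(?=WORD|NUMBER) | COMMA(?=SPACE|NUMBER|WORD)
    $pattern = '/[a-z]+(?=$|,)|\d\.\d(?=\s)|\s(?=[a-z]+|\d\.\d)|,(?=\s|\d\.\d|[a-z]+)/i';
    preg_match_all($pattern, $input, $tokens);

    // $tokens[0] holds the full matches; add up their lengths.
    $sum = array_sum(array_map('strlen', $tokens[0]));

    // Valid only if every character of the input belongs to some token.
    return $sum === strlen($input);
}

foreach (array('sometext', 'sometext,5.6 sometext', 'sometext 7.8 sometext') as $test) {
    echo "Is '$test' valid ? " . (is_valid_input($test) ? 'true' : 'false') . "\n";
}
```

Note that an empty string also passes this check, because zero tokens add up to zero characters, so a separate emptiness test may be needed.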

Re: REGEX prob for catching three ways of input

Posted: Fri Aug 22, 2008 5:52 am
by DavidTheSlayer
Thanks for the reply! :D

I've exploded the input string first and built the regex as you said, and I believe I have the counting set up correctly, but there's one small problem. How do I combine those small rules into one regex using preg_match_all?

I couldn't find anything on how to do this reading the advanced tutorial on this site.

Here's my regex

Code: Select all

if(preg_match_all("/$word(?=$comma|$)|$number(?=$space)|$space(?=$word|$number)|$comma(?=$space|$word|$number)/", $temparr[$i]))
{
   echo "<br />";
   echo "preg_match hit";
   echo $temparr[$i];
   echo "<br />";
   $length[$i] = strlen($temparr[$i]);
   echo " length of $temparr[$i] is " . $length[$i];
}
If that's correct, then it keeps saying it's not valid when I enter 'sometext'.

Thanks again for the reply.

Re: REGEX prob for catching three ways of input

Posted: Fri Aug 22, 2008 6:00 am
by prometheuzz
DavidTheSlayer wrote:Thanks for the reply! :D

I've exploded the input string first and built the regex as you said, and I believe I have the counting set up correctly, but there's one small problem. How do I combine those small rules into one regex using preg_match_all?

I couldn't find anything on how to do this reading the advanced tutorial on this site.
...
Try this code:

Code: Select all

$test = 'sometext,5.6 sometext';
preg_match_all('/[a-z]+(?=$|,)|\d\.\d(?=\s)|\s(?=[a-z]+|\d\.\d)|,(?=\s|\d\.\d|[a-z]+)/i', $test, $tokens);
$count = 0;
foreach ($tokens[0] as $t) { // $tokens[0] holds the full matches
  $count += strlen($t);
}
echo "Is '$test' valid ? " . ($count == strlen($test) ? 'true' : 'false');
Disclaimer: I did not run this code since I have no access to a PHP interpreter! So, since my PHP skills are lousy, there might well be an error in it! But the overall idea should be ok.

Re: REGEX prob for catching three ways of input

Posted: Fri Aug 22, 2008 6:05 am
by DavidTheSlayer
prometheuzz wrote:
DavidTheSlayer wrote:Thanks for the reply! :D

I've exploded the input string first and built the regex as you said, and I believe I have the counting set up correctly, but there's one small problem. How do I combine those small rules into one regex using preg_match_all?

I couldn't find anything on how to do this reading the advanced tutorial on this site.
...
Try this (untested) code:

Code: Select all

$test = 'sometext,5.6 sometext';
preg_match_all('/[a-z]+(?=$|,)|\d\.\d(?=\s)|\s(?=[a-z]+|\d\.\d)|,(?=\s|\d\.\d|[a-z]+)/i', $test, $tokens);
// count the lengths of the matches from $tokens
I'll try it very shortly.

Re: REGEX prob for catching three ways of input

Posted: Fri Aug 22, 2008 6:34 am
by DavidTheSlayer
Okay, here's my code for this. Since I'm a bit of a newbie, I think I'm getting the counting wrong...

Code: Select all

$temparr = explode(',', $_POST['newUserTag']);
var_dump($temparr);
echo "<br />";
 
$max = count($temparr);
print "counted max is " . count($temparr);
echo "<br />";
        
for ($i = 0; $i < $max; $i++)
{
   $temparr[$i] = trim($temparr[$i]);
            
   //if(preg_match_all("/$word(?=$comma|$)$number(?=$space)$space(?=$word|$number)$comma(?=$space|$word|$number)/", $temparr[$i]))
   if(preg_match_all('/[a-z]+(?=$|,)|\d\.\d(?=\s)|\s(?=[a-z]+|\d\.\d)|,(?=\s|\d\.\d|[a-z]+)/i', $temparr[$i], $tokens))
   {
      echo "preg_match hit";
      $length = $length + strlen($tokens[$i]);
      echo " length is " . $length;       
      echo "<br />";
   } 
}
 
echo "length is " . $length;
        
if($length == $fulllength)
{
   echo "valid";
}
else
{
   echo "not valid";
}

Re: REGEX prob for catching three ways of input

Posted: Fri Aug 22, 2008 6:38 am
by prometheuzz
I found a PHP interpreter. This works:

Code: Select all

<?php
$test = 'sometext,5.6 sometext';
preg_match_all('/[a-z]+(?=$|,)|\d\.\d(?=\s)|\s(?=[a-z]+|\d\.\d)|,(?=\s|\d\.\d|[a-z]+)/i', $test, $tokens);
$sum = 0;
foreach ($tokens[0] as $t) {
  $sum += strlen($t);
}
if(strlen($test) == $sum) {
  echo "'$test' is valid!";
}
else {
  echo "'$test' is invalid...";
}
?>
Good luck!

Re: REGEX prob for catching three ways of input

Posted: Fri Aug 22, 2008 6:43 am
by DavidTheSlayer
Okay I'll try that soon and I'll post back if I can't get it working :banghead:

Re: REGEX prob for catching three ways of input

Posted: Fri Aug 22, 2008 6:51 am
by prometheuzz
DavidTheSlayer wrote:Okay I'll try that soon and I'll post back if I can't get it working :banghead:
No problem.
Note that $tokens is a 2-dimensional array, so you'll have to iterate over the individual tokens in the (1-dimensional) array $tokens[0].
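
To make the shape of `$tokens` concrete, here is a small sketch using the same pattern and test string as above. With the default `PREG_PATTERN_ORDER` flag, `preg_match_all()` puts all full matches in `$tokens[0]`; because this pattern uses only lookaheads and no capturing groups, that is the only element.

```php
<?php
$test = 'sometext,5.6 sometext';
preg_match_all('/[a-z]+(?=$|,)|\d\.\d(?=\s)|\s(?=[a-z]+|\d\.\d)|,(?=\s|\d\.\d|[a-z]+)/i', $test, $tokens);

// $tokens is an array of arrays; $tokens[0] is the flat list of matched tokens.
print_r($tokens[0]);

// Sum the length of each individual token.
$sum = 0;
foreach ($tokens[0] as $t) {
    $sum += strlen($t);
}
echo "$sum of " . strlen($test) . " characters matched\n";
```

Calling `strlen($tokens[$i])` on the outer array, as in the earlier attempt, passes an array rather than a string, which is why the counting went wrong there.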

Re: REGEX prob for catching three ways of input

Posted: Fri Aug 22, 2008 8:11 am
by DavidTheSlayer
Okay, sorry for asking for more help, but I'm not good with nested arrays, despite some quick tutorials online.

Here's my code.....again :roll:

Code: Select all

 
 
$_POST['newUserTag'] = trim($_POST['newUserTag']); //trims the beginning and end excess
$_POST['newUserTag'] = strip_tags($_POST['newUserTag']); //Strips HTML and PHP tags (does not prevent SQL injection)
$_POST['newUserTag'] = stripslashes($_POST['newUserTag']); //Removes backslashes, e.g. those added by magic quotes
$fulllength = strlen($_POST['newUserTag']);
 
$sum = 0;
for ($i = 0; $i < $max; $i++)
{
    $temparr[$i] = trim($temparr[$i]);
  //Changed REGEX slightly with an added comma \d\.\d(?=\s|\,)...
    if(preg_match_all('/[a-z]+(?=$|,)|\d\.\d(?=\s|\,)|\s(?=[a-z]+|\d\.\d)|,(?=\s|\d\.\d|[a-z]+)/i', $temparr[$i], $tokens)) 
    {
        echo "<br />";
        echo "preg_match hit";
        echo $temparr[$i];
        echo "<br />";
        //$length[$i] = strlen($temparr[$i]);
        //echo " length of $temparr[$i] is " . $length[$i];
        //echo "<br />";
 
    }
}
 
$max = count($tokens[0]);
print "max is " . $max;
 
foreach ($tokens[0] as $t)
{
    for ($i = 0; $i < $max; $i++)
    {
        $sum += strlen($t);
    }
}
        
        /*foreach ($tokens[0] as $t)
        {
            
            $sum += strlen($t);
        }*/
        
if ($fulllength == $sum)
{
    echo $_POST['newUserTag'] . " is valid!";
}
else
{
    echo $_POST['newUserTag'] . " is not valid...";
}
Just spotted that there's nothing in the loop, :banghead: bear with me...

Re: REGEX prob for catching three ways of input

Posted: Fri Aug 22, 2008 8:42 am
by DavidTheSlayer
Got it working :) slightly! :x

error when I put in something like ,sometext

working code....

Code: Select all

$_POST['newUserTag'] = trim($_POST['newUserTag']);
$_POST['newUserTag'] = strip_tags($_POST['newUserTag']);
$_POST['newUserTag'] = stripslashes($_POST['newUserTag']);
$fulllength = strlen($_POST['newUserTag']);
 
$sum = 0;
 
preg_match_all('/[a-z]+(?=$|,)|\d\.\d(?=\s|\,)|\s(?=[a-z]+|\d\.\d)|,(?=\s|\d\.\d|[a-z]+)/i', $_POST['newUserTag'], $tokens);
    
  foreach ($tokens[0] as $t)
  {
     $sum += strlen($t);
  }
 
var_dump($sum);
var_dump($tokens[0]);
 
  if ($fulllength == $sum)
  {
     echo $_POST['newUserTag'] . " is valid!";
  }
  else
  {
     echo $_POST['newUserTag'] . " is not valid...";
  }
 
 
I'm going to try the other regex above, but if that fails I'll go into detail later on. I can read some of it... so it's not too bad.
Actually, it does work!!!! ( 8O blames 6AM starts :roll: ), I just have to explode it now and continue.
I will post back if it fails to catch something. I'll make this Unicode-aware by adding 'u' right next to the 'i'. I'm just wondering whether the '\s' should be '\t' in case someone enters '0xFF', or whatever the Unicode character for a space is...
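
On the Unicode question, a hedged sketch: adding `u` makes PCRE treat the pattern and subject as UTF-8, but `[a-z]` still only covers Latin letters, so the WORD token would need something like `\p{L}` (any Unicode letter) to accept Japanese text. `\t` matches only a tab, so it is narrower than `\s`, not broader; whether `\s` also matches non-ASCII spaces under `u` depends on the PHP/PCRE version, so that part should be tested on the target server. The function name below is made up for illustration.

```php
<?php
// Hypothetical Unicode-aware variant: \p{L} replaces [a-z] for the WORD token.
function is_valid_unicode_input($input)
{
    $pattern = '/\p{L}+(?=$|,)|\d\.\d(?=\s)|\s(?=\p{L}+|\d\.\d)|,(?=\s|\d\.\d|\p{L}+)/u';
    if (preg_match_all($pattern, $input, $tokens) === false) {
        return false; // e.g. the input is not valid UTF-8
    }

    // strlen() counts bytes on both sides, so the comparison stays consistent
    // even though the tokens contain multi-byte characters.
    $sum = array_sum(array_map('strlen', $tokens[0]));
    return $sum === strlen($input);
}

var_dump(is_valid_unicode_input('東京,1.5 さくら'));
var_dump(is_valid_unicode_input('東京 1.5 さくら'));
```

The `i` flag is dropped here because `\p{L}` already matches letters of either case.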

Thank you very much for your time. :bow:

Re: REGEX prob for catching three ways of input

Posted: Fri Aug 22, 2008 9:02 am
by prometheuzz
DavidTheSlayer wrote:...
Thank you very much for your time. :bow:
You're welcome.

Re: REGEX prob for catching three ways of input

Posted: Fri Aug 22, 2008 9:20 am
by DavidTheSlayer
I've (dare I say) improved the REGEX slightly...

Code: Select all

preg_match_all('/[a-z]+(?=$|,)|\d\.\d(?=\s)|\s(?=[a-z]+|\d\.\d)|,(?=\s|\d\.\d|[a-z]+)/i', $test, $tokens); //Original
preg_match_all('/[a-z]+(?=$|,)|\d\.\d(?=\s|\,)|\s(?=[a-z]+|\d\.\d)|,(?=\s|\d\.\d|\s[a-z]|[a-z]+)/iu', $_POST['newUserTag'], $tokens);
so now it catches...

'sometext, 7.8, sometext'

as my results look like this

array(7) { [0]=> string(8) "sometext" [1]=> string(1) "," [2]=> string(1) " " [3]=> string(3) "7.8" [4]=> string(1) "," [5]=> string(1) " " [6]=> string(8) "sometext" } sometext, 7.8, sometext is valid!

I need to kill those spaces and comma.

When exploded...

array(3) { [0]=> string(8) "sometext" [1]=> string(0) "" [2]=> string(13) " 7.8 sometext" }

Code: Select all

$temparr = explode(',', $_POST['newUserTag']);
var_dump($temparr);
I'll try a preg_replace on exploded one.

Re: REGEX prob for catching three ways of input

Posted: Fri Aug 22, 2008 9:34 am
by prometheuzz
DavidTheSlayer wrote:...

so now it catches...

'sometext, 7.8, sometext'

...
Ah, so that is also acceptable. In the examples from your original post, there was never a comma after a number.
Perhaps you could explain in more detail what is and what isn't acceptable: there may well be an easier solution.
A good deal of example strings that should match and shouldn't match will also make it easier to understand what you're really after.

Re: REGEX prob for catching three ways of input

Posted: Fri Aug 22, 2008 10:04 am
by DavidTheSlayer
Yes sorry I realised that long afterwards.

sometext, sometext is also allowed along with...
somenumber.somenumber sometext, sometext (also applies vice-versa).

So....

Valid...

sometext
sometext,sometext
sometext, sometext
somenumber.somenumber, somenumber

Invalid...

sometext, , sometext
sometext, , , ,
sometext,
sometext, sometext
1.5 sometext, , sometext
,,
,sometext sometext
(same as above but with 'somenumber')
+ any other tests I can't think of,
including the standard chars (!"£$%^&*()_+=-;...) etc.

However I cannot strip the excess whitespace from the exploded array's elements.

'array(3) { [0]=> string(8) "sometext" [1]=> string(4) " 7.8" [2]=> string(9) " sometext" } '

Note those little extra spaces...

Here's what I tried...

Code: Select all

$temparr = explode(',', $_POST['newUserTag']);
var_dump($temparr);
echo "<br />";
$max = count($temparr);
 
//Remove excess spaces again from each element after exploding
for ($i = 0; $i < $max; $i++)
{
    preg_replace("/\s/", '', $temparr[$i]);
    //preg_replace("/^[\s]$/", '', $temparr[$i]);
    //$temparr[$i] = trim($temparr[$i]);
}
        
var_dump($temparr);
I would have thought that trimming or replacing (I tried a couple of variant regexes) the elements with the spaces would have got rid of the excess whitespace, but I have no idea why this doesn't work. Logically, it makes sense.
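
A likely cause, sketched below: `preg_replace()` and `trim()` do not modify their argument; they return the new string, so the result has to be assigned back to the array element. The example input is taken from the var_dump above. Note that `preg_replace("/\s/", '', ...)` would also remove the space inside an element such as '7.8 sometext', whereas `trim()` only strips the ends.

```php
<?php
$temparr = explode(',', 'sometext, 7.8, sometext');

// Assign the return value back; calling trim() without using its result changes nothing.
foreach ($temparr as $i => $value) {
    $temparr[$i] = trim($value);
}
var_dump($temparr);

// The same thing in one call, using array_map() to apply trim() to every element.
$trimmed = array_map('trim', explode(',', 'sometext, 7.8, sometext'));
var_dump($trimmed);
```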