Primitive filters

Posted: Sat Nov 08, 2008 5:48 pm
by alex.barylski
Because filtering is important to get right, I'd like a quick review of the regexes, etc., in the hope that any errors get spotted. If I've missed a simple filter, please recommend it as well.

NOTE: These are meant to be primitives, nothing really fancy, although I have considered using HTML_Purifier instead of strip_tags. The convention is Filter_X, with X being the characters to filter or remove.

I'm not sure I could consider encoding or escaping a logical part of this collection of static classes. Something higher level like a Filter_Email is not really necessary, as I use a validator which parses the email address according to the RFC standards and it MUST match, so filtering here would be redundant.

What I am interested in, though, is maybe filtering numerics and not just digits. For instance, is the number a hex value, in which case a leading 0x might be allowed? Currency filters would not make sense, as that data depends on locale as well, which is not part of the end goal for this.

Here are my four trivial filters so far:

Code: Select all

 class Filter_Alpha implements Filter_Interface{
    public static function filterMe($value)
    {
      return preg_replace('/[^0-9\.\+\-]/', '', $value);  
    }  
  }
 
  class Filter_Html implements Filter_Interface{
    public static function filterMe($value, $safe_tags = null)
    {
      if(is_array($safe_tags)){
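        // Build the '<b><i>' style string strip_tags() expects; a closure replaces create_function(), which was removed in PHP 8
        $safe_tags = array_map(function($element){ return '<'.strtolower($element).'>'; }, $safe_tags);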
        $safe_tags = implode('', $safe_tags);
      }
      else{
        $safe_tags = '';
      }     
      
      return strip_tags($value, $safe_tags);
    }  
  }
 
  class Filter_Digit implements Filter_Interface{
    public static function filterMe($value)
    {
      return preg_replace('/\d/', '', $value);  
    }  
  }
 
  class Filter_Space implements Filter_Interface{
    public static function filterMe($value)
    {
      return trim($value);  
    }  
  }

Re: Primitive filters

Posted: Sun Nov 09, 2008 2:12 am
by prometheuzz
About your current regex: within a character class, the "normal" regex metacharacters don't have any special meaning. Only ^ and - might need escaping (and the backslash, [ and ] themselves, of course). I say "might" because it depends on where these metacharacters occur: ^ is a negation metacharacter only if it's placed at the beginning of the character class, otherwise it just matches the character '^'. Likewise, - just matches the character '-' if it's placed at the start or at the end of the character class; anywhere else within the class it serves as a range metacharacter.

Code: Select all

[^ac]   // matches any character except 'a' and 'c'
[a^c]   // matches 'a', '^' or 'c'
[\^ac]  // matches '^', 'a' or 'c'
 
[a-c]   // matches 'a', 'b' or 'c'
[-ac]   // matches '-', 'a' or 'c'
[ac-]   // matches 'a', 'c' or '-'
 
[.+*-]  // matches '.', '+', '*' or '-'
That said, your regex should look like this:

Code: Select all

'/[^0-9.+-]/'
But to get to the "real" topic, you could match non-ascii digits (Hebrew, Chinese etc digits) by using their Unicode values in a character class:

Code: Select all

[\x??-\x??]
I am not too sure what kind of numerical values you want to match/filter, but it sounds rather tricky and error prone, IMO. What I mean is that numerical values are frequently used as "plain strings". Take telephone numbers or serial numbers for example, although they can be, or are, made from digits, they don't hold a "real" numerical value.
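To make the Unicode suggestion above concrete, here is a minimal sketch. It assumes the input is UTF-8 and uses the Arabic-Indic digits (U+0660 to U+0669) purely as an illustrative range; the variable names are hypothetical. In PCRE, code points above 0xFF are written as \x{hhhh} and need the /u modifier, and \p{Nd} is the generic "decimal digit" property if you would rather not list ranges by hand.

```php
<?php
// Assumption: all strings are UTF-8. "\u{0663}" is ARABIC-INDIC DIGIT THREE (PHP 7+ escape syntax).
$sample = "a1\u{0663}b";

// Keep ASCII digits plus one explicit Unicode digit range; /u makes PCRE treat pattern and subject as UTF-8.
$keepListedDigits = preg_replace('/[^0-9\x{0660}-\x{0669}]+/u', '', $sample);

// Keep every character with the Unicode "decimal digit" property, whatever the script.
$keepAnyDigits = preg_replace('/\P{Nd}+/u', '', $sample);
```

Both calls should leave only the two digits from the sample string. Listing ranges explicitly keeps you in control of which scripts are accepted, while \p{Nd} accepts all of them.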

Re: Primitive filters

Posted: Sun Nov 09, 2008 2:10 pm
by GeertDD
Note that by adding a + quantifier to that character class you can speed up the regex a tad. The replacement is then triggered once per run of consecutive unwanted characters instead of once per character.

Code: Select all

/[^0-9.+-]+/
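A small sketch of what the quantifier changes, using the optional count argument of preg_replace to see how many replacements were performed. The sample value and variable names are made up for illustration.

```php
<?php
$value = 'abc12.5xyz';

// Without the quantifier: one replacement per unwanted character.
$single = preg_replace('/[^0-9.+-]/', '', $value, -1, $singleCount);

// With the quantifier: one replacement per run of unwanted characters.
$grouped = preg_replace('/[^0-9.+-]+/', '', $value, -1, $groupedCount);

// Both results are '12.5'; $singleCount is 6, $groupedCount is 2.
```

The output is identical either way; only the number of replacement operations differs, so any speed gain will be most noticeable on long strings with long runs of filtered characters.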

Re: Primitive filters

Posted: Sun Nov 09, 2008 4:40 pm
by alex.barylski
prometheuzz wrote:
"But to get to the "real" topic, you could match non-ascii digits (Hebrew, Chinese etc digits) by using their Unicode values in a character class"

Interesting...didn't even think of that, thanks. :)

prometheuzz wrote:
"I am not too sure what kind of numerical values you want to match/filter, but it sounds rather tricky and error prone, IMO. What I mean is that numerical values are frequently used as "plain strings". Take telephone numbers or serial numbers for example, although they can be, or are, made from digits, they don't hold a "real" numerical value."

I've dropped the idea of supporting advanced filters such as phone, etc.

Now I just want to strip non-alpha and non-digit characters, but in a Unicode-friendly way.

Does preg_replace respect the localized charset when I use generic matchers (for lack of a better word) such as \d?
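One possible approach, sketched under the assumption that the input is UTF-8: rather than relying on how \d or \w behave under a given locale, name the Unicode properties explicitly and add the /u modifier. The class name Filter_AlphaNumeric is hypothetical and only follows the naming pattern used earlier in the thread; the interface is left out so the snippet runs on its own.

```php
<?php
// Hypothetical filter: keeps Unicode letters (\p{L}) and numbers (\p{N}), strips everything else.
class Filter_AlphaNumeric{
  public static function filterMe($value)
  {
    // /u treats pattern and subject as UTF-8; preg_replace returns null if $value is not valid UTF-8.
    return preg_replace('/[^\p{L}\p{N}]+/u', '', $value);
  }
}

// "\u{00DC}" is U-umlaut and "\u{0663}" is ARABIC-INDIC DIGIT THREE (PHP 7+ escape syntax).
$clean = Filter_AlphaNumeric::filterMe("\u{00DC}n 12-\u{0663}!");
```

Here the space, hyphen and exclamation mark are removed while the accented letter and both kinds of digit survive. Because invalid UTF-8 makes preg_replace return null under /u, a caller may want to check for that before using the result.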