Word Count

protokol
Forum Contributor
Posts: 353
Joined: Fri Jun 21, 2002 7:00 pm
Location: Cleveland, OH

Post by protokol »

I was bored and wanted something that could give me some useful information about strings, files, etc. That's when the WordCount class was born. This is PHP5-specific code, so don't try it if you only have PHP4 installed. If you REALLY want, I can give you a PHP4 version. However, porting it is trivial, and you can do it yourself if you are in the mood. Here it is:

Code: Select all

<?php
/**
 * This class counts the characters, words, and lines in the contents of a file or a string,
 * and finds the maximum line length. The behavior is modeled on the UNIX 'wc' program, but
 * word counts can differ because str_word_count() only counts alphabetic words.
 * 
 * Example usage:
 *
 * // Create the WordCount object 
 * $wc = new WordCount();
 *
 * // Process a file on the filesystem
 * try {
 *     $wc->processFile($file_name);
 *     // Use the getter methods to retrieve the counts
 * } catch (Exception $e) {
 *     // Handle the file exception here
 * }
 * 
 * // Process data read in from standard input (STDIN)
 * try {
 *     // Read from standard input.
 *     $wc->processFile('php://stdin');
 *     // Use the getter methods to retrieve the counts
 * } catch (Exception $e) {
 *     // Handle the file exception here
 * }
 * 
 * // Process data from a string
 * $wc->processString($string);
 * // Use the getter methods to retrieve the counts
 *
 * @author Craig Slusher <cslusher@acm.org>
 * @version 1.0
 */
class WordCount
{
    private $character_count;
    private $word_count;
    private $line_count;
    private $max_line_length;
    
    /**
     * Create a new WordCount object with all counts reset to 0.
     *
     * @author Craig Slusher <cslusher@acm.org>
     * @version 1.0
     */
    public function __construct()
    {
        $this->resetCounts();
    }
    
    /**
     * Reset all of the counts.
     *
     * @author Craig Slusher <cslusher@acm.org>
     * @version 1.0
     */
    public function resetCounts()
    {
        $this->character_count = 0;
        $this->word_count = 0;
        $this->line_count = 0;
        $this->max_line_length = 0;
    }
    
    /**
     * Process the character, word, and line count for the contents of a file.
     *
     * @author Craig Slusher <cslusher@acm.org>
     * @version 1.0
     * 
     * @param string $file_name The name of the file to process, or 'php://stdin' for standard input
     * 
     * @throws Exception If the file does not exist or cannot be opened
     */
    public function processFile($file_name)
    {
        // Only check for file existence if we are NOT reading from STDIN
        if (strcasecmp($file_name, 'php://stdin') != 0) {
            if (!file_exists($file_name)) {
                throw new Exception($file_name.': No such file or directory');
            }
        }
        
        $file_handle = @fopen($file_name, 'r');
        if ($file_handle === false) {
            throw new Exception($file_name.': Unable to read file');
        }
        
        // Reset the counts in case they haven't been reset already
        $this->resetCounts();
        
        // fgets() stops at a newline or after 4095 bytes, so a long line may
        // arrive in several chunks; accumulate chunks until the line is complete
        $whole_line = '';
        while (($data = fgets($file_handle, 4096)) !== false) {
            $whole_line .= $data;
            $this->character_count += strlen($data);
            
            // We found a whole line, so let's update our counts
            if (substr($data, -1) == "\n") {
                $this->line_count++;
                
                // Measure the whole line without its trailing newline
                $line_length = strlen($whole_line) - 1;
                if ($line_length > $this->max_line_length) {
                    $this->max_line_length = $line_length;
                }
                
                $this->word_count += str_word_count($whole_line);
                $whole_line = '';
            }
        }
        
        // Count a final line that does not end with a newline
        if ($whole_line !== '') {
            $this->max_line_length = max($this->max_line_length, strlen($whole_line));
            $this->word_count += str_word_count($whole_line);
        }
        
        fclose($file_handle);
    }
    
    /**
     * Process the character, word, and line count for the contents of a string.
     * 
     * @author Craig Slusher <cslusher@acm.org>
     * @version 1.0
     * 
     * @param string $string The string to process
     */
    public function processString($string)
    {
        $this->resetCounts();
        
        $this->character_count = strlen($string);
        $lines = explode("\n", $string);
        $this->line_count = count($lines) - 1;
        
        foreach ($lines as $line) {
            $strlen = strlen($line);
            
            // There is a new longest line
            if ($strlen > $this->max_line_length) {
                $this->max_line_length = $strlen;
            }
            
            $this->word_count += str_word_count($line);
        }
    }
    
    /**
     * Get the total number of bytes.
     *
     * @author Craig Slusher <cslusher@acm.org>
     * @version 1.0
     * 
     * @return int The total number of bytes
     */
    public function getByteCount()
    {
        return $this->getCharacterCount();
    }
    
    /**
     * Get the total number of characters.
     *
     * @author Craig Slusher <cslusher@acm.org>
     * @version 1.0
     * 
     * @return int The total number of characters
     */
    public function getCharacterCount()
    {
        return $this->character_count;
    }
    
    /**
     * Get the total number of words.
     *
     * @author Craig Slusher <cslusher@acm.org>
     * @version 1.0
     * 
     * @return int The total number of words
     */
    public function getWordCount()
    {
        return $this->word_count;
    }
    
    /**
     * Get the total number of lines.
     *
     * @author Craig Slusher <cslusher@acm.org>
     * @version 1.0
     * 
     * @return int The total number of lines
     */
    public function getLineCount()
    {
        return $this->line_count;
    }
    
    /**
     * Get the length of the longest line.
     *
     * @author Craig Slusher <cslusher@acm.org>
     * @version 1.0
     * 
     * @return int The length of the longest line
     */
    public function getMaxLineLength()
    {
        return $this->max_line_length;
    }
}
?>
Last edited by protokol on Thu Apr 07, 2005 3:12 pm, edited 1 time in total.
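
As a quick sanity check of the counting rules, here is a minimal standalone sketch that mirrors processString() as a plain function. The function name wc_counts() and the array keys are made up for this illustration and are not part of the class above.

```php
<?php
// Hypothetical helper for illustration; mirrors the logic of
// WordCount::processString() without the class.
function wc_counts($string)
{
    $lines = explode("\n", $string);
    $counts = array(
        'chars' => strlen($string),      // bytes, not multibyte characters
        'words' => 0,
        'lines' => count($lines) - 1,    // number of "\n" characters, like wc -l
        'max_line_length' => 0,
    );

    foreach ($lines as $line) {
        $counts['max_line_length'] = max($counts['max_line_length'], strlen($line));
        $counts['words'] += str_word_count($line);
    }

    return $counts;
}

$counts = wc_counts("one two\nthree\n");
echo $counts['lines'], ' ', $counts['words'], ' ', $counts['chars'], "\n";
```

For the sample string this should report 2 lines, 3 words, and 14 bytes, with a maximum line length of 7.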
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Post by pickle »

This is probably useful on Windows, but on Linux, the `wc` command can do all of this (except maybe the maximum line length).
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.
protokol
Forum Contributor
Posts: 353
Joined: Fri Jun 21, 2002 7:00 pm
Location: Cleveland, OH

Post by protokol »

The wc command on Linux does have the --max-line-length parameter. This PHP code is for anyone who cannot run shell commands from within PHP. Plus it's a lot cooler 8)
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Post by pickle »

Ah, good point. It didn't even occur to me that some people might not have access to shell scripts. Good thinking and good job.
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

You might improve your class by using str_word_count() instead of:

Code: Select all

//....
    $words = preg_split('/\s+/', trim($whole_line));
    $tot_words = count($words);                
    $this->word_count += $tot_words;
//....
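
To see why the swap matters, here is a small comparison of the two counting approaches. It is only a sketch: the sample strings and the helper names count_by_split() and count_by_builtin() are made up for illustration, and the note about digits assumes the default C locale.

```php
<?php
// Old approach from the snippet above: split on runs of whitespace.
function count_by_split($line)
{
    return count(preg_split('/\s+/', trim($line)));
}

// New approach: let PHP's built-in str_word_count() do the work.
function count_by_builtin($line)
{
    return str_word_count($line);
}

$samples = array('the quick brown fox', '', 'PHP 5 rocks');
foreach ($samples as $line) {
    printf(
        "%-22s split=%d builtin=%d\n",
        '"' . $line . '"',
        count_by_split($line),
        count_by_builtin($line)
    );
}
```

The empty line is the interesting case: preg_split() returns an array holding one empty string, so the split version counts a word on every blank line, while str_word_count() returns 0. On the other hand, str_word_count() only treats letters (plus ' and -) as word characters, so "PHP 5 rocks" counts as 2 words where wc would report 3.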
protokol
Forum Contributor
Posts: 353
Joined: Fri Jun 21, 2002 7:00 pm
Location: Cleveland, OH

Post by protokol »

Nice tip. Thanks!

The code has been updated to reflect that change.