explode - memory hog?

PHP programming forum. Ask questions about or help others with PHP code. Don't understand a function? Need help implementing a class? Here is where to ask. Remember to do your homework!


SidewinderX
Forum Contributor
Posts: 407
Joined: Fri Jul 16, 2004 9:04 pm
Location: NY

explode - memory hog?

Post by SidewinderX »

I was doing some statistical data analysis on some pretty large documents (about 4mb of text).

1. I read the contents of the file into a buffer using file_get_contents
2. I wanted to know how many words there were, so I exploded the content on spaces.

Code:

$content = file_get_contents("foo");
$words = explode(" ", $content);
echo count($words);
The result is:
Fatal error: Allowed memory size of 67108864 bytes exhausted...
My question is, why in the world is explode trying to use more than 64mb of memory? I figured I would ask here to try and save myself from having to look through the string.c source (it's not very pleasant)
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: explode - memory hog?

Post by requinix »

It's not like a string in PHP is the same as a string in C. There's a lot of stuff attached to it like encoding and length. Then the internal structures to hold all that.

If the average word is 5 characters long, you're looking at about 700,000 separate strings PHP has to track. If a string structure is 100 bytes (conjecture), that's ~66MB alone. And don't forget the 8MB of actual string data, plus the memory for the rest of your script.
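
The blow-up is easy to reproduce on a smaller, synthetic input. This is a sketch; the exact per-string overhead varies by PHP version and platform, but the exploded array always costs far more than the raw text it came from:

```php
<?php
// Sketch: explode a ~1MB string of 5-byte "words" and watch how much
// more memory the resulting array costs than the raw text did.
$content = str_repeat("abcd ", 200000);   // ~1MB, 200,000 words
$before  = memory_get_peak_usage();
$words   = explode(" ", $content);
$after   = memory_get_peak_usage();
printf("%d pieces, +%.1f MB for the array\n",
       count($words), ($after - $before) / 1048576);
```

Scaled up to a 4MB input, the same ratio lands well past a 64MB limit.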
davex
Forum Contributor
Posts: 101
Joined: Sat Feb 27, 2010 4:10 pm
Location: Namibia

Re: explode - memory hog?

Post by davex »

Hi,

As tasairis says, it is no surprise that explode() produces an array consuming more than 64MB of memory for that data size.

For your needs I'd suggest using substr_count:

Code:

<?php
$massive_string = file_get_contents("bigfile.txt");
$num_words = substr_count($massive_string, " ");
?>
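
One caveat with this approach: substr_count counts separators, not words, so consecutive or trailing spaces inflate the tally. Matching runs of non-whitespace instead gives an accurate count while still avoiding the huge array (an editorial sketch):

```php
<?php
// substr_count tallies every space, including doubled and trailing ones;
// preg_match_all on runs of non-whitespace counts actual words.
$s = "one  two three ";                      // double space + trailing space
echo substr_count($s, " "), "\n";            // prints 4 (separators)
echo preg_match_all('/\S+/', $s, $m), "\n";  // prints 3 (words)
```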
Cheers,

Dave.
SidewinderX
Forum Contributor
Posts: 407
Joined: Fri Jul 16, 2004 9:04 pm
Location: NY

Re: explode - memory hog?

Post by SidewinderX »

Granted my knowledge of the core is very limited, but I believe PHP stores all its data as a zval union which is defined in zend.h as

Code:

typedef union _zvalue_value {
        long lval;
        double dval;
        struct {
                char *val;
                int len;
        } str;
        HashTable *ht;
        zend_object_value obj;
} zvalue_value;
 
struct _zval_struct {
        /* Variable information */
        zvalue_value value;             /* value */
        zend_uint refcount__gc;
        zend_uchar type;        /* active type */
        zend_uchar is_ref__gc;
};
 
And a pointer is 4 bytes and an int is (usually) 4 bytes. Thus the string structure is only 8 bytes. So,
(700,000 * 8 bytes) + 4MB ≈ 10MB - nowhere near 64+
davex
Forum Contributor
Posts: 101
Joined: Sat Feb 27, 2010 4:10 pm
Location: Namibia

Re: explode - memory hog?

Post by davex »

Hi,

I think that would be 8 bytes per character.

And there looks like all the other overheads:

long (8 bytes), double (16 bytes), char* (4 bytes), int (4 bytes), HashTable* (4 bytes) and let's just ignore the (non-pointer) instance of the zend_object_value.

So I make that around 36 bytes and, unless I am mistaken (entirely possible and in fact likely), that is per character. So, totally ignoring array overheads etc., a 700,000-word array of 5 chars each would give us:

700,000 * 5 * 36 = 126,000,000 bytes ≈ 120MB

Cheers,

Dave.
SidewinderX
Forum Contributor
Posts: 407
Joined: Fri Jul 16, 2004 9:04 pm
Location: NY

Re: explode - memory hog?

Post by SidewinderX »

Each (ASCII) character is one byte (8 bits). This is accounted for with the +4MB.

davex wrote:
long (8 bytes), double (16 bytes), char* (4 bytes), int (4 bytes), HashTable* (4 bytes) and let's just ignore the (non-pointer) instance of the zend_object_value.
That is how you would treat a struct. However with a union, you can only use one field at a time, since they overlay each other. (See: http://www.c-faq.com/struct/union.html). Therefore you should only take into account the field that is being used. Since it is a string, we are going to use the struct, which consists of a pointer and an int or 8 bytes. Each struct represents a word (since the pointer points to a character array/string and the "length" just makes it binary safe in the event the string is not null terminated).

(# of words * sizeof(union)) + (# of words * 5) ≈ 10MB

Note: # of words * 5 is actually the size of the file, hence the +4MB.

Edit: I realized I am assuming sizeof(char) = 1 (as it is when I just tested it with gcc on my system). However, I am going to assume that is _not_ the case.
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: explode - memory hog?

Post by requinix »

I count:

The entire file as a string = 4MB (actual data) + 8B (_zvalue_value) + 6B (rest of the _zval_struct) = ~4MB
Each separate string = 700,000 * (5B + 8B + 6B) = ~12.7MB

(Assuming:
- No NUL terminators on the strings since it's keeping track of the length anyways
- pointers are 4 bytes
- a zend_uint is 4 bytes
- a zend_object_value is <= 8 bytes long)

That's almost 17MB, but it doesn't account for:
- The array holding all the 700K strings = 14B (_zval_struct) + sizeof a HashTable with 700K entries
- The internal variable/value mappings = small (don't know how much)
- The fact that PHP allocates memory in (small) chunks (I forget the exact quantity)
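
Those struct-field tallies can also be cross-checked empirically. A sketch; the per-element figure bundles zval, string data, and hash-bucket overhead together, and differs across PHP versions:

```php
<?php
// Build an array of short, distinct strings and divide the memory
// delta by the element count to get an all-in per-element cost.
$n      = 100000;
$before = memory_get_usage();
$a      = array();
for ($i = 0; $i < $n; $i++) {
    $a[] = "word" . $i;
}
$after = memory_get_usage();
printf("~%d bytes per array element\n", ($after - $before) / $n);
```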


If the maximum attachment size here were large enough (it's only 2MB) I'd ask the OP to post it. Maybe it can be uploaded someplace else...?
SidewinderX
Forum Contributor
Posts: 407
Joined: Fri Jul 16, 2004 9:04 pm
Location: NY

Re: explode - memory hog?

Post by SidewinderX »

The original file is War and Peace from Project Gutenberg (http://www.gutenberg.org/files/2600/2600.txt)
requinix wrote:
- The array holding all the 700K strings = 14B (_zval_struct) + sizeof a HashTable with 700K entries
Hmm, I hadn't thought of that. I just assumed the array holding the strings was simply a pointer to a pointer, but I suppose a HashTable is how PHP would implement associative arrays. Still, wouldn't a 50MB HashTable be horribly inefficient?
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: explode - memory hog?

Post by requinix »

Only have the time for a quick FYI: when I ran

Code:

echo memory_get_peak_usage(), "\n";
$file = file_get_contents("http://www.gutenberg.org/files/2600/2600.txt");
echo memory_get_peak_usage(), "\n";
$array = explode(" ", $file);
echo count($array), " items; ", memory_get_peak_usage(), "\n";
I got
- PHP 5.2.12: +3299736B (3.14MB) to download, +48446016B (46.2MB) to split
- PHP 5.3.1: +3295432B (3.14MB) to download, +52570976B (50.1MB) to split
with 515,621 words. The file is 1214723B (1.15MB).
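
If peak memory is the real constraint, the file doesn't have to be held (or split) in memory at all. A streaming counter is one way out, sketched here under the assumptions that words are whitespace-delimited and that the filename is purely illustrative:

```php
<?php
// Read the file in small chunks and count runs of non-whitespace,
// carrying a possibly-cut word across chunk boundaries. Peak memory
// stays near the chunk size instead of tens of MB.
function count_words_streaming($path, $chunkSize = 8192) {
    $fh    = fopen($path, "rb");
    $count = 0;
    $carry = '';
    while (!feof($fh)) {
        $chunk = $carry . fread($fh, $chunkSize);
        $words = preg_split('/\s+/', $chunk, -1, PREG_SPLIT_NO_EMPTY);
        if ($words && !preg_match('/\s$/', $chunk)) {
            // The chunk ended mid-token; hold the tail for the next round.
            $carry = array_pop($words);
        } else {
            $carry = '';
        }
        $count += count($words);
    }
    fclose($fh);
    return $carry === '' ? $count : $count + 1;
}

if (isset($argv[1])) {
    echo count_words_streaming($argv[1]), "\n";
}
```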