MySQL Database of an English Dictionary

PHP programming forum. Ask questions or help people concerning PHP code. Don't understand a function? Need help implementing a class? Don't understand a class? Here is where to ask. Remember to do your homework!

Moderator: General Moderators

gth759k
Forum Commoner
Posts: 76
Joined: Mon Jun 15, 2009 3:04 am

MySQL Database of an English Dictionary

Post by gth759k »

I'm looking for a comprehensive list of words you would find in an english dictionary, so I don't need definitions. It would be helpful if it was already in an sql file that I could import into phpmyadmin. Has anyone seen anything like this before?
Any help would be appreciated. Thanks.
User avatar
McInfo
DevNet Resident
Posts: 1532
Joined: Wed Apr 01, 2009 1:31 pm

Re: MySQL Database of an English Dictionary

Post by McInfo »

Search: "english word list"

I have a database of words used in Wheel of Fortune* puzzles, but it's not a comprehensive list of English words.

Edit: This post was recovered from search engine cache.
Last edited by McInfo on Thu Jun 17, 2010 2:32 pm, edited 1 time in total.
gth759k
Forum Commoner
Posts: 76
Joined: Mon Jun 15, 2009 3:04 am

Re: MySQL Database of an English Dictionary

Post by gth759k »

Surprisingly I was able to get exactly what I wanted, with just a little bit of work so I thought I'd share what I did.

First, I found the free dictionary GCIDE which is the GNU dictionary in xml format and can be downloaded here:
http://www.ibiblio.org/webster/gcide_xml.zip

I thought I wouldn't be able to use it because the xml data was pretty cryptic, but I had an idea so I put the unzipped xml fils in an folder labled "xml" in my script directory on my server and after some trial and error I came up with this script.

Code: Select all

 
    // standard mysql
    require('config.php');
    $connection = mysql_connect(SQL_HOST, SQL_USER, SQL_PASS) or die('Could not connect to MySQL database. ' . mysql_error());
    $db = mysql_select_db(SQL_DB,$connection);
    
    $xml = array('xml/gcide_a.xml', 'xml/gcide_b.xml', 'xml/gcide_c.xml', 'xml/gcide_d.xml', 'xml/gcide_e.xml','xml/gcide_f.xml','xml/gcide_g.xml', 'xml/gcide_h.xml', 'xml/gcide_i.xml', 'xml/gcide_j.xml', 'xml/gcide_k.xml', 'xml/gcide_l.xml', 'xml/gcide_m.xml', 'xml/gcide_n.xml', 'xml/gcide_o.xml', 'xml/gcide_p.xml', 'xml/gcide_q.xml', 'xml/gcide_r.xml', 'xml/gcide_s.xml', 'xml/gcide_t.xml', 'xml/gcide_u.xml', 'xml/gcide_v.xml', 'xml/gcide_w.xml', 'xml/gcide_x.xml', 'xml/gcide_y.xml', 'xml/gcide_z.xml');
    $numberoffiles = count($xml);
    
    for ($i = 0; $i <= $numberoffiles-1; $i++) {
        $xmlfile = $xml[$i];
        
        // original file contents
        $original_file = @file_get_contents($xmlfile);
        
        // if file_get_contents fails to open the link do nothing
        if(!$original_file) {}
        else {
            // find words in original file contents
    
            preg_match_all("/<hw>(.*?)<\/hw>/", $original_file, $matches);
            
            $result = array_unique($matches[1]);
            
            $numberofwords = count($result);
            
            // traverse words array
            for ($j = 0; $j <= $numberofwords-1; $j++) {
                $word = preg_replace("/[-]/", " ", $result[$j]);
                $word = preg_replace("/[^a-zA-Z0-9\s]/", "", $word);
                $word = strtolower($word);
                if ($word != "") {
                    
                    $uniquesql = "SELECT word FROM ind_words WHERE word='$word'";
                    $uniqueresult = mysql_query($uniquesql) or die(mysql_error());
                    $uniquenum = mysql_num_rows($uniqueresult);
                    
                    // prevent duplicates
                    if ($uniquenum == 0) {
                        $insertsql = "INSERT INTO ind_words (word) VALUES ('$word')";
                        $insertresult = mysql_query($insertsql) or die(mysql_error());
                    }
                }
            }
        }
    }
    echo 'Done!';
 
And that did the trick! Running the script creates a list of over 100,000 unique english words. This requires that you have a mysql table called ind_words with a column called words in it.

I went ahead and made a mysql dump of my words list and deleted a few rows which didn't have actual words because the beginning of the first xml file had some funky stuff at the beginning. Its still not perfect, but you can download the sql file if you want from:

http://capicy.com/extra/ind_words.sql

Update** Link works now, I moved the file from free hosting to my own. :)
Last edited by gth759k on Mon Aug 29, 2011 12:42 pm, edited 1 time in total.
User avatar
McInfo
DevNet Resident
Posts: 1532
Joined: Wed Apr 01, 2009 1:31 pm

Re: MySQL Database of an English Dictionary

Post by McInfo »

Nice work. Thanks for sharing. :)

Edit: This post was recovered from search engine cache.
Last edited by McInfo on Thu Jun 17, 2010 2:32 pm, edited 1 time in total.
gth759k
Forum Commoner
Posts: 76
Joined: Mon Jun 15, 2009 3:04 am

Re: MySQL Database of an English Dictionary

Post by gth759k »

Wooohoooo! I got it to grab definitions too!

Here's the new script:

Code: Select all

 
<?php
 
    // standard mysql
    require('config.php');
    $connection = mysql_connect(SQL_HOST, SQL_USER, SQL_PASS) or die('Could not connect to MySQL database. ' . mysql_error());
    $db = mysql_select_db(SQL_DB,$connection);
    
    $xml = array('xml/gcide_a.xml', 'xml/gcide_b.xml', 'xml/gcide_c.xml', 'xml/gcide_d.xml', 'xml/gcide_e.xml','xml/gcide_f.xml','xml/gcide_g.xml', 'xml/gcide_h.xml', 'xml/gcide_i.xml', 'xml/gcide_j.xml', 'xml/gcide_k.xml', 'xml/gcide_l.xml', 'xml/gcide_m.xml', 'xml/gcide_n.xml', 'xml/gcide_o.xml', 'xml/gcide_p.xml', 'xml/gcide_q.xml', 'xml/gcide_r.xml', 'xml/gcide_s.xml', 'xml/gcide_t.xml', 'xml/gcide_u.xml', 'xml/gcide_v.xml', 'xml/gcide_w.xml', 'xml/gcide_x.xml', 'xml/gcide_y.xml', 'xml/gcide_z.xml');
    $numberoffiles = count($xml);
    
    for ($i = 0; $i <= $numberoffiles-1; $i++) {
        $xmlfile = $xml[$i];
        
        // original file contents
        $original_file = @file_get_contents($xmlfile);
        
        // if file_get_contents fails to open the link do nothing
        if(!$original_file) {}
        else {
            // find words in original file contents
    
            preg_match_all("/<hw>(.*?)<\/hw>(.*?)<def>(.*?)<\/def>/", $original_file, $results);
            
            $blocks = $results[0];
            
            // traverse blocks array
            for ($j = 0; $j <= count($blocks)-1; $j++) {
                
                preg_match_all("/<hw>(.*?)<\/hw>/", $blocks[$j], $wordarray);
                $words = $wordarray[0];
                $word = addslashes(strip_tags($words[0]));
                $word = preg_replace('{-}', ' ', $word);
                $word = preg_replace("/[^a-zA-Z0-9\s]/", "", $word);
                
                
                preg_match_all("/<def>(.*?)<\/def>/", $blocks[$j], $definitionarray);
                $definitions = $definitionarray[0];
                $definition = addslashes(strip_tags($definitions[0]));
                $definition = preg_replace('{-}', ' ', $definition);
                $definition = preg_replace("/[^a-zA-Z0-9\s]/", "", $definition);
                
                // select all words from the dictionary matching the current word
                $wordsql = "SELECT * FROM dictionary WHERE word ='$word'";
                $wordresult = mysql_query($wordsql) or die(mysql_error());
                $worddata = mysql_fetch_array($wordresult, MYSQL_ASSOC);
                $wordnum = mysql_num_rows($wordresult);
                
                if ($wordnum == 0) {
                    $insertsql = "INSERT INTO dictionary (word, definition) VALUES ('$word', '$definition')";
                    $insertresult = mysql_query($insertsql) or die(mysql_error());
                }
            }
        }
    }
    
    echo 'Done!';
 
?>
 
And here's the new english dictionary mysql dump with definitions (106134 unique english words and their definitions):

http://capicy.com/extra/dictionary.sql

Update** Link works now, I moved the file from free hosting to my own. :)

P.S. This script takes quite a bit longer to run (about 75 minutes on my laptop).
Last edited by gth759k on Mon Aug 29, 2011 12:43 pm, edited 1 time in total.
User avatar
McInfo
DevNet Resident
Posts: 1532
Joined: Wed Apr 01, 2009 1:31 pm

Re: MySQL Database of an English Dictionary

Post by McInfo »

You are introducing some errors into your data with the character limits you have set and by not converting the character entities that are defined in the DTD in gcide.xml.

For example,
Auf"kl&auml;*rung
should become
aufklärung
Instead, it becomes
aufklaumlrung
Also, hyphens should not be removed. The suffix
-ation
incorrectly becomes
ation
It's unfortunate that using entity references to include external XML files is not supported by PHP's DOM classes (nor Firefox). DOMDocument will validate the XML only if it is in a single file. A work-around is to combine the letter files (gcide_*.xml) and the main file (gcide.xml) into a single XML file using a text editor that is capable of handling massive files. However, it might be best to use the main file as a template and make a separate, valid XML file for each alphabet letter.

The benefit of using PHP's DOM classes to read the files is that DOMDocument will convert the XML entities automatically. Also, it is trivial to getElementsByTagName('hw'). The only cleanup you would need to do is strip out the three characters: " (chr(34)), * (chr(42)), and ` (chr(96)).

Edit: This post was recovered from search engine cache.
Last edited by McInfo on Thu Jun 17, 2010 2:35 pm, edited 1 time in total.
gth759k
Forum Commoner
Posts: 76
Joined: Mon Jun 15, 2009 3:04 am

Re: MySQL Database of an English Dictionary

Post by gth759k »

I only removed the hyphens because in another script I'm working on I wanted to cross reference my input with the dictionary, for instance, the input is a url like:

Code: Select all

 
http://www.guardian.co.uk/environment/2009/nov/15/obama-copenhagen-climate-talks-treaty
 
and the script breaks the link down into each sub folder listed in the url and then queries the database to see if any of the terms in each index of the array have a hit in the database, if they do, they're left in the array, otherwise they're removed from the array. The result is something like this:

Code: Select all

 
Array
(
    [0] => environment
    [4] => copenhagen climate treaty
)
 
so if there happend to be a url that had a word like low-budget in it, the words low and budget wouldn't get excluded from the array just because the word low-budget isn't in the database. This still only fixes a portion of what I'm trying to do, but its a start.

Anyways, you're right about the words that have symbols in them. I didn't realize it was leaving characters in that shouldn't be there. I guess its back to the drawing board. Lol. Thanks.
User avatar
McInfo
DevNet Resident
Posts: 1532
Joined: Wed Apr 01, 2009 1:31 pm

Re: MySQL Database of an English Dictionary

Post by McInfo »

If you store the hyphens, the data will be more useful because it is easier to take the hyphens out than to put them back in.

When you need non-hyphenated data, you can do the replacement in SQL.

Code: Select all

SELECT REPLACE('high-definition', '-', ' ') FROM DUAL
Or you could have an additional field in the database.

Edit: This post was recovered from search engine cache.
Coder007
Forum Newbie
Posts: 5
Joined: Sun Feb 20, 2011 1:06 pm

Re: MySQL Database of an English Dictionary

Post by Coder007 »

Hello just to start out sorry for the necro of this forum but this seems to be the only place I have found a potentially working version of the english dictionary which could be dumped into a mysql forum. Now I have downloaded the GNU dictionary and I tried the code that gth759k listed (and I tried the code that he posted later that grabbed the definitions too which is what I really want for my database) and was to no avail. I did rename the folder to xml and I did create a "ind_words" table with a column named "words" in phpmyadmin and I uploaded the xml files in the "xml" folder and the script. I loaded my page and nothing seemed to happen. I was wondering if someone has been able to do this or if someone could help me to get this to work. (or maybe if someone could re-upload the .sql files so I could add it to my db directly)

Any help would be appreciated,
Thanks
User avatar
McInfo
DevNet Resident
Posts: 1532
Joined: Wed Apr 01, 2009 1:31 pm

Re: MySQL Database of an English Dictionary

Post by McInfo »

Coder007 wrote:I loaded my page and nothing seemed to happen.
Configure the following php.ini directives:
  • error_reporting
  • display_errors
  • max_execution_time
  • memory_limit
The script likely timed out. gth759k said it took 75 minutes. If you can, execute such long-running scripts from the command-line instead of the browser.
Coder007
Forum Newbie
Posts: 5
Joined: Sun Feb 20, 2011 1:06 pm

Re: MySQL Database of an English Dictionary

Post by Coder007 »

Thaks for the reply. I just had to mess around with how I set up a few things and I eventually got it to work, although I could only get up to "b" into my db. But I found out that you can access the xml files from your site directly by using a perl script that they also provide, except for I can't seem to get it to actaully find any matches. What I really want is to have a working "search" button like on the webiste here (http://www.ibiblio.org/webster/#search_form). I added the files to my website via ftp and seemed to put them in the correct folders (webster/cgi-bin/headwordscript, and webster/sml_files/dictionaryfiles). It lets me enter a search term but it always comes up with no matches found. I'm not sure if there's a problem with how I set it up on my ftp or the perl script is bugged, or some other problem. But I would like to know if anyone else has been able to get a working dictionary search like on the ibiblio website and how to do it.

Thanks
User avatar
McInfo
DevNet Resident
Posts: 1532
Joined: Wed Apr 01, 2009 1:31 pm

Re: MySQL Database of an English Dictionary

Post by McInfo »

Make sure $xml_file_dir points to the directory where the XML files are.

Code: Select all

$xml_file_dir = "/public/html/webster/xml_files";
Coder007
Forum Newbie
Posts: 5
Joined: Sun Feb 20, 2011 1:06 pm

Re: MySQL Database of an English Dictionary

Post by Coder007 »

Ok this is my setup

In my main "/" on my ftp account I have
-cgi-bin
-xml_files

In the cgi-bin I have the perl script
-headword_search.pl
and the file dir is
-$xml_file_dir = "/xml_files";

And in the xml_files bin there is of course all of the xml files.

Here is a screenshot of the folders in the main "/" on my ftp account

Image

So did I set up the $xml_file_dir wrong? I can't really see why this isn't working. :banghead:
User avatar
McInfo
DevNet Resident
Posts: 1532
Joined: Wed Apr 01, 2009 1:31 pm

Re: MySQL Database of an English Dictionary

Post by McInfo »

There might be more to the "/xml_file" path that Perl needs to know about. Upload this script into the xml_files folder and execute it from your browser.

getcwd.php

Code: Select all

<?php echo getcwd();
Edit: Added echo
Last edited by McInfo on Mon Feb 21, 2011 5:24 pm, edited 1 time in total.
Coder007
Forum Newbie
Posts: 5
Joined: Sun Feb 20, 2011 1:06 pm

Re: MySQL Database of an English Dictionary

Post by Coder007 »

McInfo wrote:There might be more to the "/xml_file" path that Perl needs to know about. Upload this script into the xml_files folder and execute it from your browser.

getcwd.php

Code: Select all

<?php getcwd();
Ok I tried that and uh nothing seemed to happen. I'm kind of lost on what this is supposed to do?
Post Reply