MySQL Database of an English Dictionary
Moderator: General Moderators
MySQL Database of an English Dictionary
I'm looking for a comprehensive list of words you would find in an english dictionary, so I don't need definitions. It would be helpful if it was already in an sql file that I could import into phpmyadmin. Has anyone seen anything like this before?
Any help would be appreciated. Thanks.
Any help would be appreciated. Thanks.
Re: MySQL Database of an English Dictionary
Search: "english word list"
I have a database of words used in Wheel of Fortune* puzzles, but it's not a comprehensive list of English words.
Edit: This post was recovered from search engine cache.
I have a database of words used in Wheel of Fortune* puzzles, but it's not a comprehensive list of English words.
Edit: This post was recovered from search engine cache.
Last edited by McInfo on Thu Jun 17, 2010 2:32 pm, edited 1 time in total.
Re: MySQL Database of an English Dictionary
Surprisingly I was able to get exactly what I wanted, with just a little bit of work so I thought I'd share what I did.
First, I found the free dictionary GCIDE which is the GNU dictionary in xml format and can be downloaded here:
http://www.ibiblio.org/webster/gcide_xml.zip
I thought I wouldn't be able to use it because the xml data was pretty cryptic, but I had an idea so I put the unzipped xml fils in an folder labled "xml" in my script directory on my server and after some trial and error I came up with this script.
And that did the trick! Running the script creates a list of over 100,000 unique english words. This requires that you have a mysql table called ind_words with a column called words in it.
I went ahead and made a mysql dump of my words list and deleted a few rows which didn't have actual words because the beginning of the first xml file had some funky stuff at the beginning. Its still not perfect, but you can download the sql file if you want from:
http://capicy.com/extra/ind_words.sql
Update** Link works now, I moved the file from free hosting to my own.
First, I found the free dictionary GCIDE which is the GNU dictionary in xml format and can be downloaded here:
http://www.ibiblio.org/webster/gcide_xml.zip
I thought I wouldn't be able to use it because the xml data was pretty cryptic, but I had an idea so I put the unzipped xml fils in an folder labled "xml" in my script directory on my server and after some trial and error I came up with this script.
Code: Select all
// standard mysql
require('config.php');
$connection = mysql_connect(SQL_HOST, SQL_USER, SQL_PASS) or die('Could not connect to MySQL database. ' . mysql_error());
$db = mysql_select_db(SQL_DB,$connection);
$xml = array('xml/gcide_a.xml', 'xml/gcide_b.xml', 'xml/gcide_c.xml', 'xml/gcide_d.xml', 'xml/gcide_e.xml','xml/gcide_f.xml','xml/gcide_g.xml', 'xml/gcide_h.xml', 'xml/gcide_i.xml', 'xml/gcide_j.xml', 'xml/gcide_k.xml', 'xml/gcide_l.xml', 'xml/gcide_m.xml', 'xml/gcide_n.xml', 'xml/gcide_o.xml', 'xml/gcide_p.xml', 'xml/gcide_q.xml', 'xml/gcide_r.xml', 'xml/gcide_s.xml', 'xml/gcide_t.xml', 'xml/gcide_u.xml', 'xml/gcide_v.xml', 'xml/gcide_w.xml', 'xml/gcide_x.xml', 'xml/gcide_y.xml', 'xml/gcide_z.xml');
$numberoffiles = count($xml);
for ($i = 0; $i <= $numberoffiles-1; $i++) {
$xmlfile = $xml[$i];
// original file contents
$original_file = @file_get_contents($xmlfile);
// if file_get_contents fails to open the link do nothing
if(!$original_file) {}
else {
// find words in original file contents
preg_match_all("/<hw>(.*?)<\/hw>/", $original_file, $matches);
$result = array_unique($matches[1]);
$numberofwords = count($result);
// traverse words array
for ($j = 0; $j <= $numberofwords-1; $j++) {
$word = preg_replace("/[-]/", " ", $result[$j]);
$word = preg_replace("/[^a-zA-Z0-9\s]/", "", $word);
$word = strtolower($word);
if ($word != "") {
$uniquesql = "SELECT word FROM ind_words WHERE word='$word'";
$uniqueresult = mysql_query($uniquesql) or die(mysql_error());
$uniquenum = mysql_num_rows($uniqueresult);
// prevent duplicates
if ($uniquenum == 0) {
$insertsql = "INSERT INTO ind_words (word) VALUES ('$word')";
$insertresult = mysql_query($insertsql) or die(mysql_error());
}
}
}
}
}
echo 'Done!';
I went ahead and made a mysql dump of my words list and deleted a few rows which didn't have actual words because the beginning of the first xml file had some funky stuff at the beginning. Its still not perfect, but you can download the sql file if you want from:
http://capicy.com/extra/ind_words.sql
Update** Link works now, I moved the file from free hosting to my own.
Last edited by gth759k on Mon Aug 29, 2011 12:42 pm, edited 1 time in total.
Re: MySQL Database of an English Dictionary
Nice work. Thanks for sharing. 
Edit: This post was recovered from search engine cache.
Edit: This post was recovered from search engine cache.
Last edited by McInfo on Thu Jun 17, 2010 2:32 pm, edited 1 time in total.
Re: MySQL Database of an English Dictionary
Wooohoooo! I got it to grab definitions too!
Here's the new script:
And here's the new english dictionary mysql dump with definitions (106134 unique english words and their definitions):
http://capicy.com/extra/dictionary.sql
Update** Link works now, I moved the file from free hosting to my own.
P.S. This script takes quite a bit longer to run (about 75 minutes on my laptop).
Here's the new script:
Code: Select all
<?php
// standard mysql
require('config.php');
$connection = mysql_connect(SQL_HOST, SQL_USER, SQL_PASS) or die('Could not connect to MySQL database. ' . mysql_error());
$db = mysql_select_db(SQL_DB,$connection);
$xml = array('xml/gcide_a.xml', 'xml/gcide_b.xml', 'xml/gcide_c.xml', 'xml/gcide_d.xml', 'xml/gcide_e.xml','xml/gcide_f.xml','xml/gcide_g.xml', 'xml/gcide_h.xml', 'xml/gcide_i.xml', 'xml/gcide_j.xml', 'xml/gcide_k.xml', 'xml/gcide_l.xml', 'xml/gcide_m.xml', 'xml/gcide_n.xml', 'xml/gcide_o.xml', 'xml/gcide_p.xml', 'xml/gcide_q.xml', 'xml/gcide_r.xml', 'xml/gcide_s.xml', 'xml/gcide_t.xml', 'xml/gcide_u.xml', 'xml/gcide_v.xml', 'xml/gcide_w.xml', 'xml/gcide_x.xml', 'xml/gcide_y.xml', 'xml/gcide_z.xml');
$numberoffiles = count($xml);
for ($i = 0; $i <= $numberoffiles-1; $i++) {
$xmlfile = $xml[$i];
// original file contents
$original_file = @file_get_contents($xmlfile);
// if file_get_contents fails to open the link do nothing
if(!$original_file) {}
else {
// find words in original file contents
preg_match_all("/<hw>(.*?)<\/hw>(.*?)<def>(.*?)<\/def>/", $original_file, $results);
$blocks = $results[0];
// traverse blocks array
for ($j = 0; $j <= count($blocks)-1; $j++) {
preg_match_all("/<hw>(.*?)<\/hw>/", $blocks[$j], $wordarray);
$words = $wordarray[0];
$word = addslashes(strip_tags($words[0]));
$word = preg_replace('{-}', ' ', $word);
$word = preg_replace("/[^a-zA-Z0-9\s]/", "", $word);
preg_match_all("/<def>(.*?)<\/def>/", $blocks[$j], $definitionarray);
$definitions = $definitionarray[0];
$definition = addslashes(strip_tags($definitions[0]));
$definition = preg_replace('{-}', ' ', $definition);
$definition = preg_replace("/[^a-zA-Z0-9\s]/", "", $definition);
// select all words from the dictionary matching the current word
$wordsql = "SELECT * FROM dictionary WHERE word ='$word'";
$wordresult = mysql_query($wordsql) or die(mysql_error());
$worddata = mysql_fetch_array($wordresult, MYSQL_ASSOC);
$wordnum = mysql_num_rows($wordresult);
if ($wordnum == 0) {
$insertsql = "INSERT INTO dictionary (word, definition) VALUES ('$word', '$definition')";
$insertresult = mysql_query($insertsql) or die(mysql_error());
}
}
}
}
echo 'Done!';
?>
http://capicy.com/extra/dictionary.sql
Update** Link works now, I moved the file from free hosting to my own.
P.S. This script takes quite a bit longer to run (about 75 minutes on my laptop).
Last edited by gth759k on Mon Aug 29, 2011 12:43 pm, edited 1 time in total.
Re: MySQL Database of an English Dictionary
You are introducing some errors into your data with the character limits you have set and by not converting the character entities that are defined in the DTD in gcide.xml.
For example,
The benefit of using PHP's DOM classes to read the files is that DOMDocument will convert the XML entities automatically. Also, it is trivial to getElementsByTagName('hw'). The only cleanup you would need to do is strip out the three characters: " (chr(34)), * (chr(42)), and ` (chr(96)).
Edit: This post was recovered from search engine cache.
For example,
should becomeAuf"klä*rung
Instead, it becomesaufklärung
Also, hyphens should not be removed. The suffixaufklaumlrung
incorrectly becomes-ation
It's unfortunate that using entity references to include external XML files is not supported by PHP's DOM classes (nor Firefox). DOMDocument will validate the XML only if it is in a single file. A work-around is to combine the letter files (gcide_*.xml) and the main file (gcide.xml) into a single XML file using a text editor that is capable of handling massive files. However, it might be best to use the main file as a template and make a separate, valid XML file for each alphabet letter.ation
The benefit of using PHP's DOM classes to read the files is that DOMDocument will convert the XML entities automatically. Also, it is trivial to getElementsByTagName('hw'). The only cleanup you would need to do is strip out the three characters: " (chr(34)), * (chr(42)), and ` (chr(96)).
Edit: This post was recovered from search engine cache.
Last edited by McInfo on Thu Jun 17, 2010 2:35 pm, edited 1 time in total.
Re: MySQL Database of an English Dictionary
I only removed the hyphens because in another script I'm working on I wanted to cross reference my input with the dictionary, for instance, the input is a url like:
and the script breaks the link down into each sub folder listed in the url and then queries the database to see if any of the terms in each index of the array have a hit in the database, if they do, they're left in the array, otherwise they're removed from the array. The result is something like this:
so if there happend to be a url that had a word like low-budget in it, the words low and budget wouldn't get excluded from the array just because the word low-budget isn't in the database. This still only fixes a portion of what I'm trying to do, but its a start.
Anyways, you're right about the words that have symbols in them. I didn't realize it was leaving characters in that shouldn't be there. I guess its back to the drawing board. Lol. Thanks.
Code: Select all
http://www.guardian.co.uk/environment/2009/nov/15/obama-copenhagen-climate-talks-treaty
Code: Select all
Array
(
[0] => environment
[4] => copenhagen climate treaty
)
Anyways, you're right about the words that have symbols in them. I didn't realize it was leaving characters in that shouldn't be there. I guess its back to the drawing board. Lol. Thanks.
Re: MySQL Database of an English Dictionary
If you store the hyphens, the data will be more useful because it is easier to take the hyphens out than to put them back in.
When you need non-hyphenated data, you can do the replacement in SQL.
Or you could have an additional field in the database.
Edit: This post was recovered from search engine cache.
When you need non-hyphenated data, you can do the replacement in SQL.
Code: Select all
SELECT REPLACE('high-definition', '-', ' ') FROM DUALEdit: This post was recovered from search engine cache.
Re: MySQL Database of an English Dictionary
Hello just to start out sorry for the necro of this forum but this seems to be the only place I have found a potentially working version of the english dictionary which could be dumped into a mysql forum. Now I have downloaded the GNU dictionary and I tried the code that gth759k listed (and I tried the code that he posted later that grabbed the definitions too which is what I really want for my database) and was to no avail. I did rename the folder to xml and I did create a "ind_words" table with a column named "words" in phpmyadmin and I uploaded the xml files in the "xml" folder and the script. I loaded my page and nothing seemed to happen. I was wondering if someone has been able to do this or if someone could help me to get this to work. (or maybe if someone could re-upload the .sql files so I could add it to my db directly)
Any help would be appreciated,
Thanks
Any help would be appreciated,
Thanks
Re: MySQL Database of an English Dictionary
Configure the following php.ini directives:Coder007 wrote:I loaded my page and nothing seemed to happen.
- error_reporting
- display_errors
- max_execution_time
- memory_limit
Re: MySQL Database of an English Dictionary
Thaks for the reply. I just had to mess around with how I set up a few things and I eventually got it to work, although I could only get up to "b" into my db. But I found out that you can access the xml files from your site directly by using a perl script that they also provide, except for I can't seem to get it to actaully find any matches. What I really want is to have a working "search" button like on the webiste here (http://www.ibiblio.org/webster/#search_form). I added the files to my website via ftp and seemed to put them in the correct folders (webster/cgi-bin/headwordscript, and webster/sml_files/dictionaryfiles). It lets me enter a search term but it always comes up with no matches found. I'm not sure if there's a problem with how I set it up on my ftp or the perl script is bugged, or some other problem. But I would like to know if anyone else has been able to get a working dictionary search like on the ibiblio website and how to do it.
Thanks
Thanks
Re: MySQL Database of an English Dictionary
Make sure $xml_file_dir points to the directory where the XML files are.
Code: Select all
$xml_file_dir = "/public/html/webster/xml_files";Re: MySQL Database of an English Dictionary
Ok this is my setup
In my main "/" on my ftp account I have
-cgi-bin
-xml_files
In the cgi-bin I have the perl script
-headword_search.pl
and the file dir is
-$xml_file_dir = "/xml_files";
And in the xml_files bin there is of course all of the xml files.
Here is a screenshot of the folders in the main "/" on my ftp account

So did I set up the $xml_file_dir wrong? I can't really see why this isn't working.
In my main "/" on my ftp account I have
-cgi-bin
-xml_files
In the cgi-bin I have the perl script
-headword_search.pl
and the file dir is
-$xml_file_dir = "/xml_files";
And in the xml_files bin there is of course all of the xml files.
Here is a screenshot of the folders in the main "/" on my ftp account

So did I set up the $xml_file_dir wrong? I can't really see why this isn't working.
Re: MySQL Database of an English Dictionary
There might be more to the "/xml_file" path that Perl needs to know about. Upload this script into the xml_files folder and execute it from your browser.
getcwd.php
Edit: Added echo
getcwd.php
Code: Select all
<?php echo getcwd();
Last edited by McInfo on Mon Feb 21, 2011 5:24 pm, edited 1 time in total.
Re: MySQL Database of an English Dictionary
Ok I tried that and uh nothing seemed to happen. I'm kind of lost on what this is supposed to do?McInfo wrote:There might be more to the "/xml_file" path that Perl needs to know about. Upload this script into the xml_files folder and execute it from your browser.
getcwd.phpCode: Select all
<?php getcwd();