
Simple Solution for Huge Problem

Posted: Wed May 18, 2011 7:46 am
by Weiry
Hey everyone,

I have started writing a simple PHP script to read the data out of an XML file and into a database.
Simple, right? Well, here are a few more details.

The current XML file is about 9.4 million lines long.
I don't have a full description for each field, meaning I don't know which fields share common values or what those values might be.
It needs to be efficient enough that it won't take hundreds of hours to import.

So, where I'm at...
Currently I have a basic script that will loop through all the XML data and format it nicely into a very large associative array:


<?php
class xmlImport{
	
	protected $xmlArr;
	protected $xml;
	
	public function xmlImport($file){

		$this->xml = simplexml_load_file($file);
		
		$this->xmlArr = $this->processXML($this->xml);
		
		print "<pre>";print_r($this->xmlArr);print "</pre>";
		
	}
	
	private function processXML(&$xml){
		
		$arr = array();
		
		foreach($xml->children() as $child){

			if(count($child->children()) > 0){
				
				// Branch node: recurse, appending so repeated
				// names (e.g. <Item>) don't overwrite each other
				$arr[$child->getName()][] = $this->processXML($child);

			}else{

				// Leaf node: keep its text content
				$arr[$child->getName()] = (string)$child;
				
			}
			
		}
		
		return $arr;
	
	}
	
	private function importData(&$arr){
			
	}
	
}

$xmlImport = new xmlImport("example.xml");
?>
example.xml


<?xml version="1.0" encoding="UTF-8"?>
<Items>
  <Item>
    <ItemKey>D9B6982B01010101</ItemKey>
    <Name>
      <English>A Scrap of Paper</English>
      <French>Morceau de papier</French>
      <German>Ein Stück Papier</German>
    </Name>
    <Description>
      <English>It&apos;s beginning to crumble</English>
      <French>Le papier commence à s&apos;effriter.</French>
      <German>Beginnt zu verfallen.</German>
    </Description>
    <Value>1</Value>
    <Rarity>Trash</Rarity>
    <SoulboundTrigger>BindOnPickup</SoulboundTrigger>
    <Icon>item_icons\blank_parchment</Icon>
    <FirstLootedBy>
      <EU_Akala>
        <Name>Eleonas</Name>
        <Guild>Old Heroes</Guild>
        <Date>2011-02-24T22:28:14</Date>
      </EU_Akala>
      <EU_Argent>
        <Name>Welf</Name>
        <Date>2011-02-24T21:38:13</Date>
      </EU_Argent>
      <EU_Blightweald>
        <Name>Doradan</Name>
        <Guild>Daedalus</Guild>
        <Date>2011-02-25T01:03:56</Date>
      </EU_Blightweald>
    </FirstLootedBy>
  </Item>
  <Item>
    <ItemKey>B6BBBD1F01D2D6F7D805663C</ItemKey>
    <Name>
      <English>Templar&apos;s Bow of the Sinister</English>
      <French>Arc de Templier du Sinistre</French>
      <German>Templerbogen</German>
    </Name>
    <Value>2350</Value>
    <Rarity>Uncommon</Rarity>
    <Icon>item_icons\ranged_bow_043</Icon>
    <Slot>Weapon_Ranged</Slot>
    <WeaponType>2h_ranged_bow</WeaponType>
    <MinimumDamage>42</MinimumDamage>
    <MaximumDamage>78</MaximumDamage>
    <Speed>2.600000</Speed>
    <Range>30.000000</Range>
    <OnEquip>
      <Strength>11</Strength>
      <Dexterity>15</Dexterity>
      <Endurance>9</Endurance>
    </OnEquip>
    <RequiredLevel>49</RequiredLevel>
    <RunebreakSkillLevel>226</RunebreakSkillLevel>
    <FirstLootedBy>
      <EU_Blightweald>
        <Name>Mels</Name>
        <Guild>Desire</Guild>
        <Date>2011-03-06T00:56:43</Date>
      </EU_Blightweald>
      <EU_Bloodiron>
        <Name>Miaki</Name>
        <Guild>Serenity</Guild>
        <Date>2011-04-01T14:35:38</Date>
      </EU_Bloodiron>
    </FirstLootedBy>
  </Item>
</Items>
My main problem now relates to the database and the importing of the data.

Basically, with no full structure available, I don't know how many fields there could possibly be inside the XML file.
Is it possible to create tables/fields as the script goes along and finds new children in the XML?
So basically, if it finds a new child, it creates a new database field and inserts the data; otherwise it moves on to the next one and inserts the data.

However, herein lies my third problem: efficiency.
I would like to split the database into several tables, to help reduce the size of any single table.
So in the example.xml above, OnEquip has several sub-fields and, to my knowledge, can have up to 21 individual children. This is the point where I would simply make OnEquip (in the database) a pointer to another table containing these fields.
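The split I have in mind would look something like this (just a sketch; the table and column names here are invented, and the remaining stat columns would be filled in as they turn up):

```sql
-- Main items table: OnEquip becomes a pointer into the stats table
CREATE TABLE rift_items (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    ItemKey VARCHAR(32),
    Value   INT,
    Rarity  VARCHAR(16),
    OnEquip INT  -- references rift_items_stats.id
);

-- One row per OnEquip block, up to 21 possible stat columns
CREATE TABLE rift_items_stats (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    Strength  INT,
    Dexterity INT,
    Endurance INT
    -- ...the other stat columns
);
```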

Open to all thoughts and suggestions; if you need more info, just ask and hopefully I can provide it. :D

Re: Simple Solution for Huge Problem

Posted: Wed May 18, 2011 8:48 am
by Jade
I would worry about getting the data into a database first; once the data is in, you can much more easily move fields around, optimize, and normalize the tables.

To answer your question: yes, you can add fields to a table and create tables on the fly, as long as the database user has permission to create and alter tables. There's also a way to check whether a table or field already exists, so you don't try to add duplicate tables and fields.
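A rough sketch of that check-then-add step (untested, and the function and table names here are just for illustration — you'd feed it the real column list from SHOW COLUMNS and then run the statements it returns):

```php
<?php
// Given the columns a table already has and the keys found in a new
// record, build the ALTER TABLE statements needed for any missing fields.
function missingFieldAlters($table, array $existingColumns, array $record){

	$alters = array();

	foreach(array_keys($record) as $key){

		if(!in_array($key, $existingColumns)){

			// TEXT is a safe default while the field types are unknown
			$alters[] = "ALTER TABLE `".$table."` ADD `".$key."` TEXT";

		}

	}

	return $alters;

}

$existing = array("id", "ItemKey", "Value");
$record   = array("ItemKey" => "D9B6982B01010101", "Value" => "1", "Rarity" => "Trash");

print_r(missingFieldAlters("rift_items", $existing, $record));
// Only Rarity is missing, so a single ALTER TABLE statement comes back
?>
```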

Creating A Table: http://dev.mysql.com/doc/refman/5.1/en/ ... table.html

Altering A Table: http://dev.mysql.com/doc/refman/5.1/en/alter-table.html

Re: Simple Solution for Huge Problem

Posted: Wed May 18, 2011 8:43 pm
by Weiry
Jade wrote:I would worry about getting the data into a database first; once the data is in, you can much more easily move fields around, optimize, and normalize the tables.
I can see why this would be a quicker way of doing things, and it would definitely make importing the data very simple. The problem, though, is that if I needed to import more data after the database had been normalized, there would be an inconsistency, and it would require normalizing again.
It would make checking for modified data a lot easier if it stayed like this, though. :D

My current thoughts on solving this problem on the fly seem quite simple (pseudocode incoming):

Generate Assoc array of XML data. ($key => $value)
For each item in assoc array, read the $key.
IF
--the $key matches a string located in the haystack, call a processing function ( processSubTree() )
----processSubTree sends $key through a switch($key), and directs the $value to a specific function designed for handling that data
------The return from the specific function returns the ID number after the data is inserted into the database (auto_increment)
----processSubTree returns this ID as $value
--the $key now contains $value which is an ID number of the record in the database.
ENDIF
add the current data into empty array ( $data[$key] = $value ) (this hopefully should normalize the main array)
loop through the $data array and generate insert queries based on this info.
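In actual PHP, that loop might come out something like this (just a sketch — insertSubTree() here is a stub that hands back fake auto_increment IDs so the flow can run without a database):

```php
<?php
// Stub standing in for the real insert; the real version would run an
// INSERT on the sub-tree's table and return the new auto_increment ID.
function insertSubTree($key, $value){
	static $nextId = 1;
	return $nextId++;
}

// Normalize one item: any sub-tree named in $haystack is inserted into
// its own table and replaced by the ID of the row it went into.
function normalizeItem(array $item, array $haystack){

	$data = array();

	foreach($item as $key => $value){

		if(in_array($key, $haystack)){

			$value = insertSubTree($key, $value);

		}

		$data[$key] = $value;

	}

	return $data;

}

$item = array(
	"ItemKey" => "B6BBBD1F01D2D6F7D805663C",
	"Value"   => "2350",
	"OnEquip" => array("Strength" => 11, "Dexterity" => 15, "Endurance" => 9),
);

$flat = normalizeItem($item, array("OnEquip", "Name", "Description"));
print_r($flat); // OnEquip is now an integer ID instead of an array
?>
```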

This process is where the fields would be created; of course, that depends on processSubTree(), the specific handler functions, and the main function.

Thoughts?

Re: Simple Solution for Huge Problem

Posted: Wed May 18, 2011 11:32 pm
by Weiry
OK, I have updated the code to:
  • Use XMLReader (streaming over loading a file)
  • Allow for subtrees to be processed into a different table
I still need to implement the auto-creation of database fields in case I've missed any. All the keys are the field names, so I can dynamically generate the queries.

The code hasn't been tested against a database because I'm currently on a different PC.


<?php
require_once 'api/api.database.php';
require_once 'api/api.error.php';

class xmlImport extends RAPIDatabase{

	protected $xmlArr;
	protected $xml;

	public function xmlImport($file){
		
		parent::__construct();
		
		/* Define some globals */
		define("TBL_PREFIX", 		"rift_");
		define("TBL_ITEMS", 		'`'.TBL_PREFIX.'items`');
		define("TBL_ITEMS_RARITY", 	'`'.TBL_PREFIX.'items_rarity`');
		define("TBL_ITEMS_SLOTS", 	'`'.TBL_PREFIX.'items_slots`');
		define("TBL_ITEMS_STATS", 	'`'.TBL_PREFIX.'items_stats`');
		define("TBL_LANG",		'`'.TBL_PREFIX.'language`');

		$this->xml = new XMLReader();
		$this->xml->open($file);
			
		$this->xmlArr = $this->processXML($this->xml);
		$this->xml->close();
		
		//print "<pre>";print_r($this->xmlArr);print "</pre>";

		$arr = $this->importData($this->xmlArr);
		print "<pre>";print_r($arr);print "</pre>";
		

		return $this->xmlArr;

	}

	private function processXML(&$xml){

		$assoc = null;

		while($xml->read()){
			 
			switch ($xml->nodeType) {
				 
				case XMLReader::END_ELEMENT: return $assoc;

				case XMLReader::ELEMENT:
					if(!$xml->isEmptyElement){

						if($xml->name == "Item"){
							$assoc[$xml->name][] = $this->processXML($xml);
						}else{
							$assoc[$xml->name] = $this->processXML($xml);
						}

					}
					 
					if($xml->hasAttributes){
						 
						$el =& $assoc[$xml->name][count($assoc[$xml->name]) - 1];

						while($xml->moveToNextAttribute()){
							 
							$el['attributes'][$xml->name] = $xml->value;
							 
						}

					}
					break;

				case XMLReader::TEXT:
				case XMLReader::CDATA: $assoc .= $xml->value;

			}

		}

		return $assoc;

	}
	
	/*
	 * Somewhere in importData() needs to have a data check
	 * against data existing in the database, to prevent duplicates
	 * and to update existing data.
	 */
	private function importData(&$arr){

		$data = array();
		//$query = "INSERT INTO ".TBL_ITEMS." VALUES ";

		$haystack = array("OnEquip","Name","Rarity","Slot","WeaponType","Description");

		foreach($arr as $key => $value){
				
			if(in_array($key, $haystack)){

				$value = $this->processSubTree($key,$value);

			}
				
			$data[$key] = $value;
				
		}

		return $data;

	}

	private function processSubTree($key,$value){

		switch($key){
			case "OnEquip":
				return $this->setData(TBL_ITEMS_STATS, $value);
				break;
			case "Name":
				return $this->setData(TBL_LANG, $value);
				break;
			case "Rarity":
				return $this->setData(TBL_ITEMS_RARITY, $value);
				break;
			case "Slot":
				return $this->setData(TBL_ITEMS_SLOTS, $value);
				break;
			/*case "WeaponType":
				return $this->setData($value);
				break;*/
			case "Description":
				return $this->setData(TBL_LANG, $value);
				break;
		}

		return NULL;

	}
	
	private function setData($table,$arr){
		
		$fields = ""; $values = "";
		
		foreach($arr as $dataKey => $dataValue){
							
			$fields .= "`".mysql_real_escape_string($dataKey)."`,";
			$values .= "'".mysql_real_escape_string($dataValue)."',";
							
		}
	
		$cArr['table'] = $table;
		$cArr['fields'] = $this->cleanAutoGenQuery($fields);
		$cArr['values'] = $this->cleanAutoGenQuery($values);
		
		/*
		 * Process the $cArr['fields'] list to make sure that
		 * all fields are available to have data inserted.
		 */

		if($this->create($cArr)){
			
			return $this->getLastInsertID();
			
		}
		
		return NULL;
		
	}

}

$xmlImport = new xmlImport("example.xml");
?>

Re: Simple Solution for Huge Problem

Posted: Thu May 19, 2011 12:46 pm
by Jade
Why would you need to import more data after the fact? The point behind this is to convert existing data into a database so you can use the database from that point on.

If you're going to constantly import files into the database, you may want to rethink the strategy altogether. At that point it might be easier to keep a data dictionary than to try to normalize anything.