A regex to parse data

Posted: Mon Oct 03, 2005 1:58 pm
by PatrickE
Hello all,

I seriously suck at writing regular expressions, so if anyone could help me that'd be great.

I'd like to write a regex to parse data from a web page into an array (or arrays) so I can insert it into a database. Basically, I'm trying to turn a large static list of data into a searchable database containing the same data. Here is an example of the data I'm starting with:

http://home.att.net/~jbaugher/1938.html

These are aircraft serial numbers. I can parse the data to get the serial numbers easily enough, but I need the data that is associated with the serial numbers, not just the serial numbers.

I'm not entirely sure that it's possible, but if it is then it would save me from months of manually entering data.. Does anyone know if this is possible, and, if so, how I can go about doing this?

Thanks

Posted: Mon Oct 03, 2005 4:57 pm
by timvw
I would start looping through the lines..
And count how many tabs/spaces there are before there are characters..
This should give you a good idea at which "level" you are..

And as soon as you know the level, you can lookup the values in the previous levels, and build a complete number with explanation..
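The indentation idea above could be sketched in PHP roughly as follows. This is only an illustration: `indentWidth` and `groupByIndent` are hypothetical names, the tab width of 8 is an assumption, and the sketch handles just two levels (an unindented header line and the indented lines under it).

```php
<?php
// Hypothetical helper: width of the leading whitespace on a line,
// counting each tab as $tabWidth columns (8 is an assumption).
function indentWidth(string $line, int $tabWidth = 8): int
{
    $width = 0;
    $length = strlen($line);
    for ($i = 0; $i < $length; $i++) {
        if ($line[$i] === ' ') {
            $width++;
        } elseif ($line[$i] === "\t") {
            $width += $tabWidth;
        } else {
            break; // first non-whitespace character reached
        }
    }
    return $width;
}

// Hypothetical helper: attach every indented line to the most recent
// unindented line, which is treated as its parent (level 0).
function groupByIndent(array $lines): array
{
    $groups = [];
    $current = null;
    foreach ($lines as $line) {
        if (trim($line) === '') {
            continue; // skip blank lines
        }
        if (indentWidth($line) === 0) {
            $groups[] = ['header' => rtrim($line), 'children' => []];
            $current = count($groups) - 1;
        } elseif ($current !== null) {
            $groups[$current]['children'][] = trim($line);
        }
    }
    return $groups;
}
```

Deeper nesting would need a stack of parent lines keyed by indent width instead of a single `$current` index.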

Posted: Mon Oct 03, 2005 5:11 pm
by Chris Corbyn
Specifically what data do you need? Is it consistent (or at least uniform)? I love regex and am more than willing to do this out of sheer boredom, publicly, for free. I'm 99% certain it can be done on the data you gave as an example, I'm just not sure how long the function/regex will take to build :P

Posted: Tue Oct 04, 2005 6:42 am
by shailendra
Can you say exactly what data you need? Is it static, or does it come from a searchable database? I think this is possible by using cURL to read the page and then a regular expression to pull out the data you want. Please tell me exactly what data you need; I'm trying to write a regex function for that.


Thanks

shailendra

Posted: Tue Oct 04, 2005 9:21 am
by PatrickE
All of the data..

It is a static list which is relatively uniform (some things aren't, but I don't mind a few errors).

As I mentioned before, here is (one of) the page(s) I'm working with..
http://home.att.net/~jbaugher/1938.html

This is aircraft serial number data, so I want to extract the info on the serial. As an example we'll try to extract data for serial number 38-214.

Serial numbers and ranges are listed on the left side. In the case of 38-214 the serial itself is not listed; instead it's part of the range 38-211/223 (which is to say, serial numbers 38-211 through 38-223). I've written a regular expression which gets all the serials and ranges, and a for loop which converts the ranges into individual serials.

So once we've actually got the serial 38-214, I want to get the manufacturer and designation. In this case the manufacturer for all aircraft in the range is Boeing and the designation is B-17B (note, Fortress is just a nickname and not part of the designation; I don't want that data). Then I want to get any range-wide data, which is the first piece of info below the line with the serial range and aircraft designation. In this case the range-wide data is c/n 2004/2016. Not all ranges or serials will have this.

Then I want to get the data specific to 38-214. Once again, 38-214 is not listed, but 214 is, so we'll grab the data within that range whose row begins with 214. This data is: 214 crashed in Santa Catalina Mts near Davis Monthan AAF after inflight engine fire Apr 6, 1942. 2 bailed out, 6 killed.
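To make the range expansion and the manufacturer/designation split concrete, here is a rough PHP sketch. The function names `expandSerialRange` and `parseHeader` are hypothetical, the header line is assumed to have its trailing `<br>` already stripped, and the designation is assumed to be the first hyphenated upper-case token (such as B-17B or BT-9C), so any nickname after it is ignored.

```php
<?php
// Hypothetical helper: turn "38-211/223" into ["38-211", ..., "38-223"].
// A plain serial such as "38-224" comes back as a one-element array.
// Assumes serial numbers carry no leading zeros that must be preserved.
function expandSerialRange(string $token): array
{
    if (!preg_match('~^(\d{2})-(\d+)(?:/(\d+))?$~', $token, $m)) {
        return []; // not a serial or range
    }
    $first = (int) $m[2];
    $last  = isset($m[3]) ? (int) $m[3] : $first;
    $serials = [];
    for ($n = $first; $n <= $last; $n++) {
        $serials[] = $m[1] . '-' . $n;
    }
    return $serials;
}

// Hypothetical helper: split a header line into the serial token,
// the manufacturer and the designation; returns null if it does not match.
function parseHeader(string $line): ?array
{
    $pattern = '~^(\d{2}-\d+(?:/\d+)?)\s+(.+?)\s+([A-Z0-9]+-[A-Z0-9]+)(?:\s|$)~';
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    return [
        'serials'      => $m[1],
        'manufacturer' => $m[2],
        'designation'  => $m[3],
    ];
}

$header  = parseHeader('38-211/223 Boeing B-17B Fortress');
$serials = expandSerialRange($header['serials']);
// $header['manufacturer'] is "Boeing", $header['designation'] is "B-17B",
// and $serials holds the 13 serials from "38-211" through "38-223".
```

Manufacturers with a hyphenated upper-case word in their name would confuse this pattern, so it is worth spot-checking the output on real pages.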

I would like to retrieve all that data, but I'm not very good at this.. So far I've written some crap which actually works (however inefficiently) except when a serial has more than one line of data, or when the range-wide data is more than one line; in these situations it only grabs the first line. Also, if there's no data for a serial, it still uses the next line, which is incorrect because the next line is actually another serial or range. Please note that I'm using an altered form of the data which I've created manually; I don't know if it's actually easier to parse, but it's easier for me. This format is easily achieved by copying and pasting between a few things. Here is an example of the data I'm using..

Code:

38-211/223 Boeing B-17B Fortress<br>
c/n 2004/2016<br>
214 crashed in Santa Catalina Mts near Davis Monthan AAF after inflight engine<br>
fire Apr 6, 1942. 2 bailed out, 6 killed.<br>
215 attached to Cold Weather Testing Detachment at Ladd Field, Alaska 1941-42. <br>
Participated in bomb strikes against Japanese fleet during the Dutch Harbor<br>
operation and was involved in air battle above Umnak Pass June 4, 1942.<br>
Crashed Jul 18, 1942 while returning from weather recon to Kiska. All 6 crew KIA.<br>
217 crashed near Lovelock, NV while enroute to Wright Field Feb 6, 1942. All 8 killed.<br>
38-224/257 North American BT-9C<br>

Here's the code I'm using..

Code:

preg_match_all("/^[0-9]{2}-[0-9]{1,7}(\/[0-9]{1,7}){0,1}\s.+?<br>/ms",$contents,$matches);
for($x=0;$x<count($matches[0]);$x++){ // loop only over the matches actually found
	// DESIGNATION
	$dsgnt = NULL;
	$dsgnt0 = explode(" ",$matches[0][$x]);
	for($v=1;$v<count($dsgnt0);$v++) $dsgnt .= $dsgnt0[$v] . " ";
	$dsgnt = str_replace("<br>","",rtrim($dsgnt));
	preg_match("/^(.+)\s((\w+)-(\w+))(\s(\w+)){0,1}/",$dsgnt,$d_matches);
	$man = $d_matches[1];
	$des = $d_matches[2];
	
	$match = str_replace("<br>","",$dsgnt0[0]);
	
	$ex = explode("/",$match);
	if(count($ex)>1){
		$ex0 = explode("-",$ex[0]);
		$pre = $ex0[0];
		for($y=$ex0[1];$y<($ex[1]+1);$y++){
			$details = NULL; $details_pre = NULL;
			$details = explode($match,$contents);
			$details = explode("<br>",$details[1]);
			$details = ltrim(rtrim($details[1]));
			$details0 = explode(" ",$details);
			if(is_numeric($details0[0])){ $bah = 1; }
			else{
				$details = strtoupper($details{0}).substr($details,1);
				if($details AND substr($details,-1,1)!=".") $details .= ".";
				$details_pre = $details . "<br />";
			}
			
			$details = NULL;
			$details = explode($match,$contents);
			$details = explode("<br>
$y ",$details[1]);
			$details = explode("<br>",$details[1]);
			$details = ltrim(rtrim($details[0]));
			$details = strtoupper($details{0}).substr($details,1);
			if($details AND substr($details,-1,1)!=".") $details .= ".";
			$details = $details_pre . $details;
			
			$serial = $pre . "-" . $y;
			
			if($serial) print "$serial --- match: $match - y: $y<br />\n";
			if($serial) mysql_db_query($database,"INSERT INTO aviation_aircraft_serials2 VALUES ('$serial', '1', '$man', '', '$des', '$details')") or die(mysql_error());
		}
	}else{		
		$details = NULL;
		$details = explode($match,$contents);
		$details = explode("<br>",$details[1]);
		$details = ltrim(rtrim($details[1]));
		$details0 = explode(" ",$details);
		$details = strtoupper($details{0}).substr($details,1);
		if($details AND substr($details,-1,1)!=".") $details .= ".";
		
		if($match) print $match." - $dsgnt<br />\n";
		if($match) mysql_db_query($database,"INSERT INTO aviation_aircraft_serials2 VALUES ('$match', '1', '$man', '', '$des', '$details')") or die(mysql_error());
	}
}
If anyone can do anything to help me with this, I'd really appreciate it! I'm sure that regular expressions can be used to do this much more efficiently, but I don't really know how..

Thanks for all the help!
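
For reference, one possible way to handle the multi-line cases described above is to walk the lines once and remember which range and which serial the current text belongs to. The following is only a sketch under assumptions: `parseSerialBlocks` is a hypothetical name, the input is the `<br>`-separated format shown earlier, and a line counts as a new serial entry only if its leading number falls inside the current range (so a continuation line that starts with something like "2 bailed out" is not mistaken for an entry).

```php
<?php
// Hypothetical helper: group "<br>"-separated lines into one record per
// header line, joining continuation lines onto the entry they belong to.
function parseSerialBlocks(string $contents): array
{
    $records = [];
    $rec = null;    // index of the record currently being filled
    $serial = null; // serial currently collecting text, if any
    foreach (preg_split('~<br\s*/?>\s*~i', $contents) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if (preg_match('~^(\d{2})-(\d+)(?:/(\d+))?\s+(.*)$~', $line, $m)) {
            // Header line: start a new record for this serial or range.
            $first = (int) $m[2];
            $last  = $m[3] !== '' ? (int) $m[3] : $first;
            $records[] = [
                'prefix' => $m[1], 'first' => $first, 'last' => $last,
                'title' => $m[4], 'rangeNotes' => '', 'serials' => [],
            ];
            $rec = count($records) - 1;
            $serial = null;
        } elseif ($rec === null) {
            continue; // text before the first header is ignored
        } elseif (preg_match('~^(\d+)\s+(.*)$~', $line, $m)
            && (int) $m[1] >= $records[$rec]['first']
            && (int) $m[1] <= $records[$rec]['last']) {
            // Entry for one serial inside the current range.
            $serial = (int) $m[1];
            $records[$rec]['serials'][$serial] = $m[2];
        } elseif ($serial !== null) {
            // Continuation of the current serial's text.
            $records[$rec]['serials'][$serial] .= ' ' . $line;
        } else {
            // No serial entry seen yet, so this is range-wide data.
            $records[$rec]['rangeNotes'] =
                ltrim($records[$rec]['rangeNotes'] . ' ' . $line);
        }
    }
    return $records;
}
```

A range with no detail lines simply ends up with an empty `serials` array and an empty `rangeNotes` string, so the next header is never consumed as data. The values would still need escaping (or a prepared statement) before going into an INSERT.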