Address parsing

Posted: Sun Jan 22, 2012 4:53 pm
by chunkymoves
Input
LEVEL 1 1234 EXAMPLE SMELBOURNE VIC
LOT 1234 EXAMPLE ST PORT HEDLAND WA

Desired output
LEVEL 1 1234 EXAMPLE S:MELBOURNE:VIC
LOT 1234 EXAMPLE ST:PORT HEDLAND:WA

The subsections have been merged badly into a short text field, and I can't access the original data.
I'm new to regular expressions. I found it easy to isolate the State field, but I don't see a way to isolate the city.

Any ideas?

Re: Address parsing

Posted: Sun Jan 22, 2012 5:22 pm
by ragax
Inserting the colon before the State is straightforward. Something like:
Search: (?m)[ ](\w+)$
Replace: :\1

In PHP:

Code: Select all

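<?php
// (?m) lets $ match at the end of every line; the lookahead tolerates
// an optional \r so both LF and CRLF line endings work.
$regex=',(?m)[ ](\w+)(?=\r?$),';
$string='LEVEL 1 1234 EXAMPLE SMELBOURNE VIC
LOT 1234 EXAMPLE ST PORT HEDLAND WA
';
// Replace the space before the last word on each line with a colon.
echo '<pre>'.preg_replace($regex, ':$1', $string).'</pre>';
?>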
Output:
LEVEL 1 1234 EXAMPLE SMELBOURNE:VIC
LOT 1234 EXAMPLE ST PORT HEDLAND:WA

For the colon with the street, I don't have a good idea right now: what is the rule to let regex know that SMELBOURNE is not the town?
If you can give me a rule in plain English, I'm happy to have a look.

:)

Re: Address parsing

Posted: Sun Jan 22, 2012 5:32 pm
by twinedev
To be able to help with this, would need a lot more examples of data to help determine what would break up the address from the city.

What if you have LOT 1234 EXAMPLE ST ST MARY WA assuming there is a city named St. Mary. Not trying to be a pain, but without a better sampling, would only be able to say use preg_match() to strip out data. (and look on here, you will see I love figuring things like this out)

Re: Address parsing

Posted: Sun Jan 22, 2012 5:49 pm
by chunkymoves
"can give me a rule in plain English"
Understood, thanks for the advice, and for the welcoming writing style.

With the state, it's the last two or three characters, separated from the rest by a space. That part I've coded now.

With the city, I'm not sure I can. I know from experience that the word "MELBOURNE" is a town and that "SMELBOURNE" isn't, but from what I've read so far, that's not the way regex works.
I know that the first part is never longer than 22 characters, and the third part (state) runs from the last space onwards, but I don't see how to isolate the town.
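
For anyone following along, the state rule described above (everything after the last space, two or three letters) could be sketched roughly like this. The function name split_state is hypothetical, and the sketch assumes the state code is always upper-case letters:

```php
<?php
// Hypothetical helper: splits a line into [rest, state] using the rule
// "the state is the final two or three letters after the last space".
function split_state(string $line): ?array
{
    // The greedy (.*) takes as much as it can, so the captured state
    // is always the LAST space-separated token on the line.
    if (preg_match('/^(.*) ([A-Z]{2,3})$/', trim($line), $m)) {
        return [$m[1], $m[2]];
    }
    return null; // the line does not end in a 2-3 letter code
}

print_r(split_state('LOT 1234 EXAMPLE ST PORT HEDLAND WA'));
```

This only peels off the state. Finding where the street ends and the town begins still needs some outside knowledge about town names.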

I'll look into what distinguishes the town and just work on describing it in English.

Thanks for your time.

Re: Address parsing

Posted: Sun Jan 22, 2012 6:17 pm
by ragax
No worries!

If you can get it in plain English, regex probably has the grammar to make it work for you.
But the plain English rule looks hairy to me. :)

Re: Address parsing

Posted: Sun Jan 22, 2012 7:05 pm
by chunkymoves
twinedev wrote:To be able to help with this, would need a lot more examples of data to help determine what would break up the address from the city.

What if you have LOT 1234 EXAMPLE ST ST MARY WA assuming there is a city named St. Mary. Not trying to be a pain, but without a better sampling, would only be able to say use preg_match() to strip out data. (and look on here, you will see I love figuring things like this out)
I'm working with customer data, so can't post the list, so I was posting examples. I read through "how to ask good questions" and it said to narrow it down to one that works and one that doesn't. But I can see now that more data is needed, so will build up a list and post it shortly. Thanks.

Just so I'm on the same page, you're saying ST can be short for either STREET or SAINT. Yep, this may certainly come up.

Another example input
12/1234 ST EXAMPLE TCEPERTH WA
Desired output
12/1234 ST EXAMPLE TCE:PERTH:WA

This one works as I can break it on the 22nd character, but often the street name is shorter, so the town starts earlier.

Re: Address parsing

Posted: Sun Jan 22, 2012 7:42 pm
by abareplace
Here is an idea: take a list of cities and towns (click "view more cities and places" link, hold Ctrl and select the place names with your mouse in Mozilla) and match it against your lines.

Code: Select all

<?php

$cities = file('cities.txt'); // place names from World Gazetteer
$input = file('input.txt');

foreach($cities as &$city)
   $city = strtoupper(rtrim($city));
unset($city);

foreach($input as $line) {
   foreach($cities as $city) {
       $pos = strpos($line, $city);
       if ($pos !== false) {
          $line = substr($line, 0, $pos) . ':' . $city .
                   ':' . substr($line, $pos + strlen($city));
          break;
       }
   }
   print($line);
}

?>
Output:

Code: Select all

LEVEL 1 1234 EXAMPLE S:MELBOURNE: VIC
LOT 1234 EXAMPLE ST :PORT HEDLAND: WA
12/1234 ST EXAMPLE TCE:PERTH: WA
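
The output above keeps the spaces that surrounded the city (" VIC", "ST :"). A possible variation that trims them, so the result matches the format requested in the first post, might look like the sketch below. The function name mark_city and the three-entry sample list are made up for illustration, and the list is assumed to contain no empty strings:

```php
<?php
// Hypothetical helper: wraps the first matching city in colons and trims
// the spaces on either side so the three fields join cleanly.
function mark_city(string $line, array $cities): string
{
    $line = trim($line);
    foreach ($cities as $city) {
        $pos = strpos($line, $city);
        if ($pos !== false) {
            $street = rtrim(substr($line, 0, $pos));
            $state  = ltrim(substr($line, $pos + strlen($city)));
            return $street . ':' . $city . ':' . $state;
        }
    }
    return $line; // no known city found; leave the line unchanged
}

$cities = ['MELBOURNE', 'PORT HEDLAND', 'PERTH']; // sample list only
echo mark_city('LEVEL 1 1234 EXAMPLE SMELBOURNE VIC', $cities), "\n";
echo mark_city('LOT 1234 EXAMPLE ST PORT HEDLAND WA', $cities), "\n";
```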

Re: Address parsing

Posted: Sun Jan 22, 2012 8:42 pm
by chunkymoves
@abareplace

A fine idea, and nice bit of googling too. Will try it out.

------

As a newbie to this forum and PHP, I'm pleasantly stunned by the speed, quality and diversity of answers. Cheers.

Re: Address parsing

Posted: Sun Jan 22, 2012 9:00 pm
by ragax
Very cool idea, ABA!

Re: Address parsing

Posted: Sun Jan 22, 2012 11:15 pm
by chunkymoves
95% there. Hurrah - I can figure out the rest.

An epilogue, if you are interested...

I've done as you suggested and am running it against a list of town names, and it's working.

One fell through the net though.
"EXAMPLE ST BELL BAY TAS"
There is a town called BELL, and also a town called BELL BAY, and it picked up the first one.

I'm working on trimming off the three-letter state code, then finding the match position, and using the rest of the string as the town.
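
A rough sketch of that idea, as I read it: cut the state off at the last space, find where the first known town starts, and treat everything from there to the end as the town. The function name town_from_match is hypothetical, and this assumes no town name also appears inside the street part of the line:

```php
<?php
// Hypothetical helper: returns [street, town, state], or null when the
// line has no space or no known town is found.
function town_from_match(string $line, array $cities): ?array
{
    $line = trim($line);
    $lastSpace = strrpos($line, ' ');
    if ($lastSpace === false) {
        return null;
    }
    $state = substr($line, $lastSpace + 1); // text after the last space
    $rest  = substr($line, 0, $lastSpace);  // street + town, still merged

    foreach ($cities as $city) {
        $pos = strpos($rest, $city);
        if ($pos !== false) {
            // Everything from the match position onwards is the town,
            // so a hit on 'BELL' still yields 'BELL BAY'.
            return [rtrim(substr($rest, 0, $pos)), substr($rest, $pos), $state];
        }
    }
    return null;
}

print_r(town_from_match('EXAMPLE ST BELL BAY TAS', ['BELL', 'BELL BAY']));
```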

Thanks again.

Re: Address parsing

Posted: Sun Jan 22, 2012 11:50 pm
by twinedev
To keep from matching against the wrong value, put the list in order from longest to shortest. Also, when you are looping through the list of cities and find a match, do a break; to stop the loop so it can't go on to match a shorter one.

To take your raw list (cities.txt) and put it in order, use:

Code: Select all

$cities = file('cities.txt');

$arySize = array();
foreach($cities as $strCity) {
	$arySize[$strCity] = strlen($strCity);
}
arsort($arySize);

$fp = fopen('citiesBySize.txt','w');
foreach($arySize as $key=>$val) {
	fwrite($fp,trim($key)."\n");
}
fclose($fp);

echo "Done!";
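
If writing a second file isn't needed, the same longest-first ordering can probably be done in memory with usort(). This is only a sketch with a made-up three-entry list, showing that the BELL / BELL BAY case then resolves to the longer name:

```php
<?php
// Sample list only; in practice this would be loaded from cities.txt.
$cities = ['BELL', 'PERTH', 'BELL BAY'];

// Sort longest first so 'BELL BAY' is tried before 'BELL'.
usort($cities, function ($a, $b) {
    return strlen($b) - strlen($a);
});

$line  = 'EXAMPLE ST BELL BAY TAS';
$found = null;
foreach ($cities as $city) {
    if (strpos($line, $city) !== false) {
        $found = $city;
        break; // stop at the first (longest) match
    }
}
echo $found, "\n"; // BELL BAY
```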

Re: Address parsing

Posted: Mon Jan 23, 2012 4:08 pm
by chunkymoves
Of course!

That's more elegant than my solution.