Address parsing

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

chunkymoves
Forum Newbie
Posts: 6
Joined: Sun Jan 22, 2012 4:44 pm

Address parsing

Post by chunkymoves »

Input
LEVEL 1 1234 EXAMPLE SMELBOURNE VIC
LOT 1234 EXAMPLE ST PORT HEDLAND WA

Desired output
LEVEL 1 1234 EXAMPLE S:MELBOURNE:VIC
LOT 1234 EXAMPLE ST:PORT HEDLAND:WA

The subsections have been merged badly into a short text field, and I can't access the original data.
I'm new to regular expressions. I found it easy to isolate the State field, but I don't see a way to isolate the city.

Any ideas?
ragax
Forum Commoner
Posts: 85
Joined: Thu Dec 15, 2011 1:40 pm
Location: Nelson, NZ

Re: Address parsing

Post by ragax »

Inserting the colon before the State is straightforward: something like
Search: (?m)[ ](\w+)$
Replace: :\1

In PHP:

Code:

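<?php
// (?m) makes $ match at the end of every line; the lookahead tolerates
// Windows line endings without consuming the \r.
$regex = ',(?m)[ ](\w+)(?=\r?$),';
$string = 'LEVEL 1 1234 EXAMPLE SMELBOURNE VIC
LOT 1234 EXAMPLE ST PORT HEDLAND WA
';
// Replace the space before the last word of each line with a colon.
echo '<pre>'.preg_replace($regex, ':$1', $string).'</pre>';
?>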
Output:
LEVEL 1 1234 EXAMPLE SMELBOURNE:VIC
LOT 1234 EXAMPLE ST PORT HEDLAND:WA

For the colon with the street, I don't have a good idea right now: what is the rule to let regex know that SMELBOURNE is not the town?
If you can give me a rule in plain English, I'm happy to have a look.

:)
Last edited by ragax on Sun Jan 22, 2012 5:34 pm, edited 1 time in total.
twinedev
Forum Regular
Posts: 984
Joined: Tue Sep 28, 2010 11:41 am
Location: Columbus, Ohio

Re: Address parsing

Post by twinedev »

To be able to help with this, I would need a lot more examples of data to help determine what separates the address from the city.

What if you have LOT 1234 EXAMPLE ST ST MARY WA, assuming there is a city named St. Mary? I'm not trying to be a pain, but without a better sampling, I would only be able to say use preg_match() to strip out data. (And look around on here, you will see I love figuring things like this out.)
chunkymoves
Forum Newbie
Posts: 6
Joined: Sun Jan 22, 2012 4:44 pm

Re: Address parsing

Post by chunkymoves »

"can give me a rule in plain English"
Understood, thanks for the advice, and for the welcoming writing style.

With the state, it's either the last two or three characters, separated from the rest by a space. That part I've coded now.

With the city, I'm not sure I can. I know from experience that the word "MELBOURNE" is a town, but that "SMELBOURNE" isn't, but from what I've read so far, that's not the way regex works.
I know that the first part is never longer than 22 characters, and the third part (state) is from the last space onwards, but I don't see how to isolate the town.

I'll look into what distinguishes the town and just work on describing it in English.

Thanks for your time.
ragax
Forum Commoner
Posts: 85
Joined: Thu Dec 15, 2011 1:40 pm
Location: Nelson, NZ

Re: Address parsing

Post by ragax »

No worries!

If you can get it in plain English, regex probably has the grammar to make it work for you.
But the plain English rule looks hairy to me. :)
chunkymoves
Forum Newbie
Posts: 6
Joined: Sun Jan 22, 2012 4:44 pm

Re: Address parsing

Post by chunkymoves »

twinedev wrote:To be able to help with this, I would need a lot more examples of data to help determine what separates the address from the city.

What if you have LOT 1234 EXAMPLE ST ST MARY WA, assuming there is a city named St. Mary? I'm not trying to be a pain, but without a better sampling, I would only be able to say use preg_match() to strip out data. (And look around on here, you will see I love figuring things like this out.)
I'm working with customer data, so can't post the list, so I was posting examples. I read through "how to ask good questions" and it said to narrow it down to one that works and one that doesn't. But I can see now that more data is needed, so will build up a list and post it shortly. Thanks.

Just so I'm on the same page, you're saying ST can be short for either STREET or SAINT. Yep, this certainly may come up.

Another example input
12/1234 ST EXAMPLE TCEPERTH WA
Desired output
12/1234 ST EXAMPLE TCE:PERTH:WA

This one works as I can break it on the 22nd character, but often the street name is shorter, so the town starts earlier.
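
The 22-character rule can be sketched as below. This is only an illustration of why a fixed cut works for some lines and not others; `splitAtWidth` and `STREET_WIDTH` are hypothetical names, and the width of 22 is taken from the description above rather than from any known data layout.

```php
<?php
// Assumption (from the thread): the street part never exceeds 22 characters.
const STREET_WIDTH = 22;

// Hypothetical helper: cut a line at the fixed street width.
function splitAtWidth(string $line, int $width = STREET_WIDTH): array
{
    return [substr($line, 0, $width), ltrim(substr($line, $width))];
}

// The street fills all 22 characters, so the cut lands right before the town.
print_r(splitAtWidth('12/1234 ST EXAMPLE TCEPERTH WA'));

// The street is shorter, so the same cut lands inside the town name.
print_r(splitAtWidth('LOT 1234 EXAMPLE ST PORT HEDLAND WA'));
```

The first call gives "12/1234 ST EXAMPLE TCE" and "PERTH WA"; the second cuts PORT HEDLAND in half, which is why the width alone is not enough.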
abareplace
Forum Newbie
Posts: 9
Joined: Fri Jan 06, 2012 1:43 am

Re: Address parsing

Post by abareplace »

Here is an idea: take a list of cities and towns from World Gazetteer (click the "view more cities and places" link, then hold Ctrl and select the place names with your mouse in Mozilla) and match it against your lines.

Code:

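<?php

$cities = file('cities.txt'); // place names from World Gazetteer, one per line
$input = file('input.txt');   // merged address lines (trailing newlines are kept)

// Normalise the place names: strip the line ending and upper-case them.
foreach($cities as &$city)
   $city = strtoupper(rtrim($city));
unset($city); // drop the reference so the loop below cannot overwrite the last entry

foreach($input as $line) {
   foreach($cities as $city) {
       $pos = strpos($line, $city);
       if ($pos !== false) {
          // Wrap the first place name found in colons, then stop searching.
          $line = substr($line, 0, $pos) . ':' . $city .
                   ':' . substr($line, $pos + strlen($city));
          break;
       }
   }
   print($line);
}

?>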
Output:

Code:

LEVEL 1 1234 EXAMPLE S:MELBOURNE: VIC
LOT 1234 EXAMPLE ST :PORT HEDLAND: WA
12/1234 ST EXAMPLE TCE:PERTH: WA
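
The output above still carries the original spaces next to the colons, while the desired output has none. One possible way to tidy that up is a small regex pass afterwards; `tidyColons` is a hypothetical helper name, not something from the posts.

```php
<?php
// Hypothetical helper: removes any whitespace on either side of each colon.
function tidyColons(string $line): string
{
    return preg_replace('/\s*:\s*/', ':', $line);
}

echo tidyColons('LOT 1234 EXAMPLE ST :PORT HEDLAND: WA'), "\n";
// LOT 1234 EXAMPLE ST:PORT HEDLAND:WA
```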
chunkymoves
Forum Newbie
Posts: 6
Joined: Sun Jan 22, 2012 4:44 pm

Re: Address parsing

Post by chunkymoves »

@abareplace

A fine idea, and nice bit of googling too. Will try it out.

------

As a newbie to this forum and PHP, I'm pleasantly stunned by the speed, quality and diversity of answers. Cheers.
ragax
Forum Commoner
Posts: 85
Joined: Thu Dec 15, 2011 1:40 pm
Location: Nelson, NZ

Re: Address parsing

Post by ragax »

Very cool idea, ABA!
chunkymoves
Forum Newbie
Posts: 6
Joined: Sun Jan 22, 2012 4:44 pm

Re: Address parsing

Post by chunkymoves »

95% there. Hurrah - I can figure out the rest.

An epilogue if you are interested...

I've done as you suggested: I'm running it against a list of town names, and it's working.

One fell through the net though.
"EXAMPLE ST BELL BAY TAS"
There is a town called BELL, and also a town called BELL BAY, and it picked up the first one.

I'm working on trimming off the 3 letter state code, then finding the match position, and using the rest of the string as the town.

Thanks again.
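
A rough sketch of that idea, in a slightly different form: strip the state after the last space, then look for the longest town name that sits at the very end of what remains. `splitAddress` is a hypothetical helper, and `$cities` is assumed to be an upper-case list of town names with no line endings.

```php
<?php
// Hypothetical helper: returns "street:town:state", or null if nothing matches.
function splitAddress(string $line, array $cities): ?string
{
    $line = rtrim($line);
    $statePos = strrpos($line, ' ');
    if ($statePos === false) {
        return null;
    }
    $state = substr($line, $statePos + 1);   // everything after the last space
    $rest  = substr($line, 0, $statePos);    // street and town, still merged

    $best = '';
    foreach ($cities as $city) {
        $len = strlen($city);
        // The town must end exactly where $rest ends; keep the longest hit.
        if ($len > strlen($best) && $len <= strlen($rest)
            && substr($rest, -$len) === $city) {
            $best = $city;
        }
    }
    if ($best === '') {
        return null;
    }
    $street = rtrim(substr($rest, 0, strlen($rest) - strlen($best)));
    return $street . ':' . $best . ':' . $state;
}

echo splitAddress('EXAMPLE ST BELL BAY TAS', ['BELL', 'BAY', 'BELL BAY']), "\n";
// EXAMPLE ST:BELL BAY:TAS
```

Because the match is anchored to the end of the line, BELL on its own cannot win over BELL BAY, whatever order the list is in.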
twinedev
Forum Regular
Posts: 984
Joined: Tue Sep 28, 2010 11:41 am
Location: Columbus, Ohio

Re: Address parsing

Post by twinedev »

To keep from matching against the wrong value, put the list in order from longest to shortest. Also, when you are looping through the list of cities, do a break; as soon as you find a match, so the loop can't go on to match a shorter name.

To take your raw list (cities.txt) and put it in order, use:

Code:

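<?php
$cities = file('cities.txt');

// Map each city name to its length (keys are unique, so duplicates collapse).
$arySize = array();
foreach($cities as $strCity) {
	$arySize[$strCity] = strlen($strCity);
}
arsort($arySize); // sort by length, longest first, keeping the names as keys

// Write the names back out, one per line, in the new order.
$fp = fopen('citiesBySize.txt','w');
foreach($arySize as $key=>$val) {
	fwrite($fp,trim($key)."\n");
}
fclose($fp);

echo "Done!";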
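
If writing a second file is not wanted, the same ordering could be done in memory before the matching loop. This is a minimal sketch of that alternative; `sortLongestFirst` is a hypothetical helper, and the sample names are only there to show the BELL / BELL BAY case.

```php
<?php
// Hypothetical helper: trims the names, drops blanks and duplicates,
// and returns the list ordered longest name first.
function sortLongestFirst(array $cities): array
{
    $cities = array_values(array_unique(array_filter(array_map('trim', $cities), 'strlen')));
    usort($cities, function (string $a, string $b): int {
        return strlen($b) <=> strlen($a);
    });
    return $cities;
}

$cities = sortLongestFirst(["BELL\n", "BELL BAY\n", "PERTH\n"]);
$line = 'EXAMPLE ST BELL BAY TAS';

foreach ($cities as $city) {
    $pos = strpos($line, $city);
    if ($pos !== false) {
        // Longest names are tried first, so BELL BAY is found before BELL.
        echo rtrim(substr($line, 0, $pos)), ':', $city, ':',
             ltrim(substr($line, $pos + strlen($city))), "\n";
        break;
    }
}
// EXAMPLE ST:BELL BAY:TAS
```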
chunkymoves
Forum Newbie
Posts: 6
Joined: Sun Jan 22, 2012 4:44 pm

Re: Address parsing

Post by chunkymoves »

Of course!

That's more elegant than my solution.