Parsing html page with russian content using regex

littlebiker
Forum Newbie
Posts: 1
Joined: Tue Feb 07, 2006 7:17 am

Post by littlebiker »

Hey guys,

I am trying to parse an HTML file from a Russian webpage, which I fetch with curl. I need to extract the values of a few fixed fields:

Like

Product Id: 400212
Product Type: engine

Here the labels, such as product ID and product type, are in Russian. I need to extract their values.

If the content were in English I could do it without a problem, but I am not sure how to handle foreign languages. Has anyone done this before?

Thanks!
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Have you tried running it through Google translation or Babelfish? It should be possible to script through them, or at least to use them to get your bearings as to where in the text the information is actually stored.

If we could see several examples of the text (not just the specific fields, but many lines around them), we may be able to write a pattern, or give you more direction.
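Without the real markup only a rough sketch is possible. The one below is in Python and assumes the labels sit in table cells followed by a colon; the sample HTML, the label strings "Код товара" and "Тип товара", and the helper name `extract_field` are all made up for illustration. The point is that once the page is decoded to Unicode text, a Russian label can be matched like any English one.

```python
import re

# Hypothetical sample; the real page's markup and label text are unknown.
SAMPLE_HTML = """
<table>
  <tr><td>Код товара:</td><td>400212</td></tr>
  <tr><td>Тип товара:</td><td>двигатель</td></tr>
</table>
"""


def extract_field(html, label):
    """Return the text following `label:` (skipping any tags), or None."""
    pattern = re.compile(
        # Literal label, a colon, then any run of tags before the value.
        re.escape(label) + r"\s*:\s*(?:<[^>]+>\s*)*([^<\r\n]+)",
        re.IGNORECASE,
    )
    match = pattern.search(html)
    return match.group(1).strip() if match else None


product_id = extract_field(SAMPLE_HTML, "Код товара")
product_type = extract_field(SAMPLE_HTML, "Тип товара")
print(product_id, product_type)
```

If the real page puts the value somewhere else (a different cell layout, an attribute, a nested span), the part of the pattern after the colon would need to change accordingly.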
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

Yeah, post the HTML source. It would be even better if you posted the URL (there could be issues with charsets, etc.).
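To illustrate the charset point: Russian pages are commonly served as UTF-8, windows-1251 or KOI8-R, and a pattern will only match if the bytes are decoded with the right one. Here is a minimal Python sketch, assuming the raw bytes are already downloaded; `decode_html` and the fallback order are illustrative choices, not a standard recipe.

```python
import re


def decode_html(raw, header_charset=None):
    """Decode raw page bytes, trying declared charsets before fallbacks."""
    candidates = []
    if header_charset:
        # Charset taken from the HTTP Content-Type header, if known.
        candidates.append(header_charset)
    # Look for a charset declaration near the top of the markup.
    meta = re.search(rb"charset=[\"']?([\w-]+)", raw[:2048], re.IGNORECASE)
    if meta:
        candidates.append(meta.group(1).decode("ascii"))
    # Assumed fallbacks: strict UTF-8 first, since the single-byte
    # encodings accept almost any byte sequence and would mask errors.
    candidates += ["utf-8", "windows-1251", "koi8-r"]
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: keep going with replacement characters.
    return raw.decode("utf-8", errors="replace")


raw_page = "Код товара: 400212".encode("windows-1251")
text = decode_html(raw_page)
```

A wrong guess between windows-1251 and KOI8-R does not raise an error, it just yields gibberish, so the declared charset should be preferred whenever the page provides one.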

In short, let's see the code ;)