PHP Developers Network
http://forums.devnetwork.net/

Search & replace regex gone wrong
http://forums.devnetwork.net/viewtopic.php?f=38&t=138882
Page 1 of 1

Author:  Ann DH [ Wed Dec 11, 2013 6:17 am ]
Post subject:  Search & replace regex gone wrong

Hi all,

Background:
I work at a multilingual communication company, where we’re working with quite a good CMS. Since its last update, however, all files exported or downloaded from the system are ‘polluted’ with metadata. And I don’t want to see the metadata. Boo.

Situation:
To clean and preprocess the files for further processing and translation, we use a couple of search & replace regexes. One of the preprocessing steps we apply to our files goes as follows:

(?<!=)"\b(.+?)\b"(?! \[)


Replacement:
1


The ‘find’ regex has straight quotation marks, the ‘replace’ string has curly quotation marks. This regex finds all words between straight quotation marks and changes the quotation marks into curly ones. (This is done so the curly quotation marks are automatically converted into the right ones for each language: « » for French, „ “ for German, etc.)
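
As a rough illustration of what this find/replace does, here is a small Python sketch. It assumes that Python's `re` module behaves like the search & replace tool used here, and that the replacement wraps capture group 1 in curly quotes. The sample strings and the name `curl_quotes` are made up for the example.

```python
import re

# The 'find' regex from the post, unchanged.
FIND = re.compile(r'(?<!=)"\b(.+?)\b"(?! \[)')

def curl_quotes(text):
    """Wrap each matched term (capture group 1) in curly quotes."""
    return FIND.sub(lambda m: "“" + m.group(1) + "”", text)

# Hypothetical sample lines, not taken from the real files.
body = '<p class="note">Press "Start" to begin.</p>'
label = 'See "Save" [1]'

print(curl_quotes(body))   # the attribute value keeps its straight quotes
print(curl_quotes(label))  # left alone: the term is followed by a space and "["
```

The lookbehind `(?<!=)` is what protects attribute values, because their opening quote always follows an equals sign. Quoted strings that do not follow an equals sign get no such protection.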

Problem:
The metadata. All of a sudden, all our files are cluttered with, among others, “concept.dtd” and “map.dtd”. As these metadata are part of the file, I don’t want to replace their quotation marks, so as not to change anything crucial. With the existing regex, they do get replaced.

So I tried rewriting it, and after a lot of trial and error, this is what I came up with:

(?<!=)”\b(.+?[\.d])\b”(?! \[)


Somehow, this almost seems to work: the metadata are skipped and most of the terms between quotation marks that should be found are found. Exactly: most of them. I want all of the terms. I tested the regex on some extra files where I added short sentences between brackets (five words). The regex doesn’t seem to find those groups…

Help?
What am I missing or doing wrong? I've tried
(?<!=)”\b(.+?[\.dtd])\b”(?! \[)
as well (okay, that and about 300 other regexes), but it didn't do the trick.
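
A note on the square brackets, with a hedged Python sketch. A character class such as `[\.d]` matches exactly one character (a dot or a `d`), and `[\.dtd]` is the same set as `[\.dt]`; neither means "the text .dtd". The sketch assumes straight quotes in the find pattern and Python's `re` semantics. The `SKIP_DTD` pattern is only one possible alternative, not something from this thread: it uses a negative lookahead to refuse quoted strings that end in `.dtd`.

```python
import re

# Which single characters does the class [\.dtd] accept?
accepted = [c for c in ".dtx" if re.fullmatch(r'[\.dtd]', c)]

# The attempt from the post, with straight quotes (assumption).
ATTEMPT = re.compile(r'(?<!=)"\b(.+?[\.d])\b"(?! \[)')

# Hypothetical alternative: reject a quoted string that ends in .dtd.
SKIP_DTD = re.compile(r'(?<!=)"\b(?![^"]*\.dtd")(.+?)\b"(?! \[)')

def curl(pattern, text):
    """Replace each match with its first group wrapped in curly quotes."""
    return pattern.sub(lambda m: "“" + m.group(1) + "”", text)

phrase = 'He said "this is a small test" today.'
doctype = ('<!DOCTYPE concept PUBLIC '
           '"-//OASIS//DTD DITA Concept//EN" "concept.dtd"[')

print(accepted)                    # characters the class accepts
print(ATTEMPT.search(phrase))      # None: the phrase does not end in "d"
print(curl(SKIP_DTD, phrase))      # phrase gets curly quotes
print(curl(SKIP_DTD, doctype) == doctype)  # DOCTYPE line is untouched
```

Because of the `\b` right after the class, the attempt can in practice only match quoted terms whose last character is `d`, which would explain why longer phrases are missed.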

Thank you very very much for any advice :)

edit: corrected wrong closing bracket

Author:  requinix [ Wed Dec 11, 2013 1:47 pm ]
Post subject:  Re: Search & replace regex gone wrong

Sample text would be great because I have no idea what this "metadata" stuff is.

Author:  Ann DH [ Fri Dec 13, 2013 5:17 am ]
Post subject:  Re: Search & replace regex gone wrong

Eh, well, I'll try :)

There's a bunch of regular XML files, but besides proper code & text, they also contain "concept.dtd" and "map.dtd" or "dita11todita12.dtd". Those things are extra information about the system, put into the files automatically by the system. I don't want to change anything in them, since I believe they're important (I'm afraid I won't be able to import the files again without this metadata).

<?xml version="1.0" encoding="UTF-16" standalone="no"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"[
]>
<?ish ishref="GUID-6B84EF92-DA99-4C54-BA91-FD0A113D4A96" version="1" lang="sv" srclng="en"?>


These are the first lines of the xml. Perfectly normal, but I just want to avoid finding
"concept.dtd"
all the time when I apply my regex search & replace.

Did this help?
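
For reference, the effect of the original find regex on these four header lines can be checked with a short Python sketch. It assumes Python's `re` semantics and a replacement that wraps group 1 in curly quotes; the variable names are made up.

```python
import re

# The original 'find' regex from the first post.
FIND = re.compile(r'(?<!=)"\b(.+?)\b"(?! \[)')

# The four header lines quoted in this post.
header = '''<?xml version="1.0" encoding="UTF-16" standalone="no"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"[
]>
<?ish ishref="GUID-6B84EF92-DA99-4C54-BA91-FD0A113D4A96" version="1" lang="sv" srclng="en"?>'''

converted = FIND.sub(lambda m: "“" + m.group(1) + "”", header)

# List only the lines that the replacement changed.
for before, after in zip(header.splitlines(), converted.splitlines()):
    if before != after:
        print(after)
```

In this sketch only the DOCTYPE line changes: every other quoted value follows an equals sign, and the public identifier starts with `-`, so `\b` cannot match right after its opening quote.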

Author:  requinix [ Fri Dec 13, 2013 1:44 pm ]
Post subject:  Re: Search & replace regex gone wrong

XML? Then use an XML parser like SimpleXML to go through each regular node in the document and convert the quotes in their values. Because that's all that you should be replacing: values and not markup.
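
A minimal sketch of that idea follows. SimpleXML is a PHP extension; the sketch swaps in Python's `xml.etree.ElementTree` purely to show the principle of touching text nodes and leaving markup alone. The function names and the sample fragment are hypothetical. Note that ElementTree does not write the DOCTYPE declaration back out, so this shows the text-only replacement on a fragment rather than a full round trip of a DITA file.

```python
import re
import xml.etree.ElementTree as ET

# Quoted term in running text; no lookarounds needed for attribute
# values, because attributes are never passed to this pattern.
QUOTED = re.compile(r'"\b(.+?)\b"')

def curl(text):
    """Convert straight-quoted terms in a text node; None stays None."""
    if text is None:
        return None
    return QUOTED.sub(lambda m: "“" + m.group(1) + "”", text)

def curl_text_nodes(xml_fragment):
    """Rewrite only element text and tail text, never attributes."""
    root = ET.fromstring(xml_fragment)
    for element in root.iter():
        element.text = curl(element.text)
        element.tail = curl(element.tail)
    return ET.tostring(root, encoding="unicode")

# Hypothetical fragment for illustration.
sample = '<row id="r1"><entry colname="c1">Press "Start" now</entry></row>'
result = curl_text_nodes(sample)
print(result)
```

Since the parser separates values from markup, the regex no longer has to guess which quotes belong to attributes or declarations.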

Author:  Ann DH [ Tue Dec 24, 2013 4:07 am ]
Post subject:  Re: Search & replace regex gone wrong

Hi all,

Thanks for the input :) aaand: yay, problem solved!

This is what the regex looks like now:
(?<=<entry)[^>]+>[^<>]*?"(.+?[^0-9])"[^<>]*?(?=<\x2Fentry>)


replacement hasn't changed
1


This matches words between straight quotation marks, skips metadata between quotation marks and skips inches (like 4"7'). Happydance.
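
A hedged Python sketch for checking this pattern. It assumes Python's `re` semantics, which may differ from the tool actually used. In `re`, the overall match runs from the entry's attributes up to just before the closing tag, so a replacement that writes back only group 1 would drop the surrounding text. The `KEEP` variant, with its two extra capture groups, is an addition for illustration and not part of the posted pattern; the sample entries are made up.

```python
import re

# The pattern from this post, unchanged.
POSTED = re.compile(
    r'(?<=<entry)[^>]+>[^<>]*?"(.+?[^0-9])"[^<>]*?(?=<\x2Fentry>)')

# Illustrative variant: also capture what precedes and follows the term.
KEEP = re.compile(
    r'(?<=<entry)([^>]+>[^<>]*?)"(.+?[^0-9])"([^<>]*?)(?=<\x2Fentry>)')

def curl_entries(text):
    """Curl the quoted term and write the surrounding parts back."""
    return KEEP.sub(
        lambda m: m.group(1) + "“" + m.group(2) + "”" + m.group(3), text)

# Hypothetical table entries.
term = '<entry colname="c1">Press "Start" now</entry>'
inches = '<entry colname="c2">Length 4"7\' max</entry>'

match = POSTED.search(term)
print(repr(match.group(0)))  # the whole span the posted pattern matches
print(match.group(1))        # the quoted term only
print(curl_entries(term))
print(curl_entries(inches) == inches)  # the inch mark is left alone
```

Lines outside `<entry>` elements, such as the DOCTYPE line, never satisfy the lookbehind and are therefore not touched.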

Grtz,
Ann

All times are UTC - 5 hours