Search & replace regex gone wrong
Posted: Wed Dec 11, 2013 5:17 am
Hi all,
Background:
I work at a multilingual communication company, where we’re working with quite a good CMS. Since its last update, however, all files exported or downloaded from the system are ‘polluted’ with metadata. And I don’t want to see the metadata. Boo.
Situation:
To clean and preprocess the files for further processing and translation, we use a couple of search & replace regexes. One of the preprocessing steps we apply to our files goes as follows:
Code:
(?<!=)"\b(.+?)\b"(?! \[)
Replacement:
Code:
“1”
The ‘find’ regex has straight quotation marks, the ‘replace’ regex has curly quotation marks. This regex finds all words between straight quotation marks and changes the quotation marks into curly ones (this is done so the curly quotation marks automatically change into the right ones for each language: « » for French, „ “ for German, etc.).
Problem:
The metadata. All of a sudden, all our files are cluttered with, among others, “concept.dtd” and “map.dtd”. As these metadata are part of the file, I don’t want to replace their quotation marks, so as not to change anything crucial. With the existing regex, they do get replaced.
So I tried rewriting it, and after a lot of trial and error, this is what I came up with:
Code:
(?<!=)”\b(.+?[\.d])\b”(?! \[)
Somehow, this almost seems to work: the metadata are skipped and most of the terms between quotation marks that should be found are found. But only most of them, and I want all of the terms. I tested the regex on some extra files where I added small sentences between brackets (five words). The regex doesn’t seem to find those groups…
Help?
What am I missing or doing wrong? I've tried
Code:
(?<!=)”\b(.+?[\.dtd])\b”(?! \[)
as well (okay, that and about 300 other regexes), but it didn't do the trick.
Thank you very very much for any advice
edit: corrected wrong closing bracket
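To make the problem easier to reproduce, here is a minimal Python sketch of the original find/replace step. It rests on a few assumptions: Python's re module stands in for our actual tool (whose regex flavour may differ), the replacement is taken to be capture group 1 wrapped in curly quotes, and the sample lines, including the DOCTYPE line, are made up for illustration.

```python
import re

# Original 'find' pattern from the post (straight quotes):
#   (?<!=)   the opening quote must not follow '=' (skips attribute values)
#   "\b      opening quote directly followed by a word character
#   (.+?)    the quoted text, as short as possible
#   \b"      closing quote directly preceded by a word character
#   (?! \[)  the closing quote must not be followed by ' ['
FIND = re.compile(r'(?<!=)"\b(.+?)\b"(?! \[)')

# Assumption: the replacement is group 1 wrapped in curly quotes.
REPLACE = r'“\1”'


def curl_quotes(line: str) -> str:
    """Apply the quote replacement to a single line."""
    return FIND.sub(REPLACE, line)


# Hypothetical sample lines, not taken from our real files.
samples = [
    'Press "Start" to begin.',                  # should be changed
    '<p class="note">text</p>',                 # attribute value, left alone
    'See "term" [1]',                           # followed by ' [', left alone
    '<!DOCTYPE concept SYSTEM "concept.dtd">',  # metadata, changed (unwanted)
]

if __name__ == "__main__":
    for line in samples:
        print(curl_quotes(line))
```

With these samples, the first line and the DOCTYPE line both come out with curly quotes, which is exactly the unwanted effect on the metadata described above.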