PHP Developers Network

A community of PHP developers offering assistance, advice, discussion, and friendship.
 

All times are UTC - 5 hours




Posted: Wed Dec 11, 2013 6:17 am
Forum Newbie

Joined: Wed Dec 11, 2013 6:15 am
Posts: 4
Hi all,

Background:
I work at a multilingual communication company, where we’re working with quite a good CMS system. Since its last update, however, all exported or downloaded files from the system are ‘polluted’ with metadata. And I don’t want to see the metadata. Boo.

Situation:
To clean and preprocess the files for further processing and translation, we use a couple of search & replace regexes. One of the preprocessing steps we apply to our files goes as follows:

(?<!=)"\b(.+?)\b"(?! \[)


Replacement:
“\1”


The ‘find’ regex has straight quotation marks; the replacement has curly quotation marks. This regex finds all words between straight quotation marks and changes the quotation marks into curly ones (this is done so the curly quotation marks automatically change into the right ones for each language: « » for French, „ “ for German, etc.)

Problem:
The metadata. All of a sudden, all our files are cluttered with, among others, “concept.dtd” and “map.dtd”. As these metadata are part of the file, I don’t want to replace their quotation marks, so as not to change anything crucial. With the existing regex, they will get replaced.
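
For readers who want to reproduce the behaviour described above, here is a minimal Python sketch. Python is an assumption on my part; the thread never names the tool that runs the regexes. It applies the find regex from this post with a curly-quote replacement and shows it converting both an ordinary quoted term and the "concept.dtd" identifier in a DOCTYPE line like the one quoted later in the thread. The function name `curl_quotes` and the sample strings are made up for illustration.

```python
import re

# Find regex from the post: a straight-quoted run that starts and ends on a
# word boundary, is not preceded by "=" and is not followed by " [".
FIND = re.compile(r'(?<!=)"\b(.+?)\b"(?! \[)')


def curl_quotes(text):
    """Wrap each match's captured text in curly quotes (U+201C / U+201D)."""
    return FIND.sub('\u201c\\1\u201d', text)


# Illustrative inputs: ordinary prose and a DOCTYPE line.
prose = 'Press "Start" to begin.'
doctype = '<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"['

print(curl_quotes(prose))    # the quoted term gets curly quotes
print(curl_quotes(doctype))  # "concept.dtd" gets curly quotes too: the problem
```

The public identifier is left alone only because it does not start on a word boundary; "concept.dtd" does, so it is rewritten.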

So I tried rewriting it, and after a lot of trials & errors, this is what I came up with:

(?<!=)”\b(.+?[\.d])\b”(?! \[)


Somehow, this almost seems to work: the metadata are skipped and most of the terms between quotation marks that should be found, are found. Exactly, most of. I want all of the terms. I tested the regex on some extra files where I added small sentences between brackets (five words). The regex doesn’t seem to find those groups…

Help?
What am I missing or doing wrong? I've tried
(?<!=)”\b(.+?[\.dtd])\b”(?! \[)
as well (okay, that and about 300 other regexes), but it didn't do the trick.

Thank you very very much for any advice :)

edit: corrected wrong closing bracket
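
One possible reading of why the attempts above miss some terms: `[\.d]` and `[\.dtd]` are character classes, so each matches exactly one character (a dot, a `d`, or a `t`) rather than the string `.dtd`. The captured text is therefore forced to end in one of those characters. The Python sketch below assumes the pattern is run with straight quotation marks, which may not match the setup in this post, and contrasts it with a negative lookahead. The lookahead is a technique I am swapping in, not something proposed in the thread; `ATTEMPT`, `SKIP_DTD`, and the sample strings are illustrative names.

```python
import re

# The character-class attempt from the post, written with straight quotes
# (assumption): the captured text must END in "." or "d".
ATTEMPT = re.compile(r'(?<!=)"\b(.+?[\.d])\b"(?! \[)')

# Hypothetical alternative: refuse to start a match at a quote whose quoted
# content ends in ".dtd", using a negative lookahead.
SKIP_DTD = re.compile(r'(?<!=)"(?![^"]*\.dtd")\b([^"]+?)\b"(?! \[)')

sentence = 'He said "press the red button now" twice.'
dtd = '"concept.dtd"'

# The attempt misses the five-word sentence (it ends in "w") ...
print(ATTEMPT.search(sentence))
# ... but still matches the DTD name, because it ends in "d".
print(ATTEMPT.search(dtd).group(1))

# The lookahead variant does the opposite.
print(SKIP_DTD.sub('\u201c\\1\u201d', sentence))
print(SKIP_DTD.search(dtd))
```

Note that `[^"]+?` in the second pattern also stops a match from running across a closing quotation mark, which `.+?` can do.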


 
Posted: Wed Dec 11, 2013 1:47 pm
Spammer :|

Joined: Wed Oct 15, 2008 2:35 am
Posts: 6617
Location: WA, USA
Sample text would be great because I have no idea what this "metadata" stuff is.


 
Posted: Fri Dec 13, 2013 5:17 am
Forum Newbie

Joined: Wed Dec 11, 2013 6:15 am
Posts: 4
Eh, well, I'll try :)

There's a bunch of regular XML files, but besides proper code & text, they also contain "concept.dtd" and "map.dtd" or "dita11todita12.dtd". Those things are extra information about the system, put into the files automatically by the system. I don't want to change anything in them, since I believe they're important (I'm afraid I won't be able to import the files again without this metadata).

<?xml version="1.0" encoding="UTF-16" standalone="no"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"[
]>
<?ish ishref="GUID-6B84EF92-DA99-4C54-BA91-FD0A113D4A96" version="1" lang="sv" srclng="en"?>


These are the first lines of the xml. Perfectly normal, but I just want to avoid finding
"concept.dtd"
all the time when I apply my regex search & find.

Did this help?


 
Posted: Fri Dec 13, 2013 1:44 pm
Spammer :|

Joined: Wed Oct 15, 2008 2:35 am
Posts: 6617
Location: WA, USA
XML? Then use an XML parser like SimpleXML to go through each regular node in the document and convert the quotes in their values. Because that's all that you should be replacing: values and not markup.
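
The parser-based approach suggested above could look roughly like this. The sketch uses Python's standard `xml.etree.ElementTree` as a stand-in for SimpleXML, purely so it can be run as-is; the helper names (`curl`, `curl_text_nodes`) and the sample document are invented for illustration. It rewrites quotes only in text content (each element's `text` and `tail`) and never touches attribute values or markup.

```python
import re
import xml.etree.ElementTree as ET

# Simplified quote pattern: inside text nodes there is no markup to avoid.
QUOTED = re.compile(r'"\b(.+?)\b"')


def curl(text):
    """Convert straight-quoted terms to curly quotes; pass None through."""
    if text is None:
        return None
    return QUOTED.sub('\u201c\\1\u201d', text)


def curl_text_nodes(root):
    """Apply curl() to the text and tail of every element under root."""
    for node in root.iter():
        node.text = curl(node.text)
        node.tail = curl(node.tail)
    return root


sample = (
    '<concept id="c1"><title>The "Start" button</title>'
    '<p>Press "Start" and then <b>wait</b> for "Ready".</p></concept>'
)
root = curl_text_nodes(ET.fromstring(sample))
result = ET.tostring(root, encoding="unicode")
print(result)
```

One caveat for this use case: ElementTree does not write the DOCTYPE declaration back out when serializing, so the declaration would have to be preserved separately if the files must be re-imported unchanged.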


 
Posted: Tue Dec 24, 2013 4:07 am
Forum Newbie

Joined: Wed Dec 11, 2013 6:15 am
Posts: 4
Hi all,

Thanks for the input :) aaand: yay, problem solved!

This is what the regex looks like now:
(?<=<entry)[^>]+>[^<>]*?"(.+?[^0-9])"[^<>]*?(?=<\x2Fentry>)


replacement hasn't changed
“\1”


This matches words between straight quotation marks, skips metadata between quotation marks and skips inches (like 4"7'). Happydance.

Grtz,
Ann
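
A note for anyone adapting the final regex to another engine: its match spans the tag's attributes and the text around the quoted term, so in an engine such as Python's `re`, a replacement containing only group 1 would overwrite all of that. The search tool used in this thread may behave differently. The sketch below is a hypothetical variant, not the author's exact pattern: it captures the surrounding text in named groups (`before`, `term`, `after`, names of my choosing) and writes it back around the curly quotes.

```python
import re

# Variant of the final regex: same idea (only inside <entry ...> elements,
# closing quote must not follow a digit), but the surrounding text is
# captured so the replacement can put it back.
ENTRY_QUOTED = re.compile(
    r'(?P<before><entry[^>]+>[^<>]*?)'
    r'"(?P<term>.+?[^0-9])"'
    r'(?P<after>[^<>]*?)(?=</entry>)'
)


def curl_entries(xml_text):
    """Curl the first straight-quoted term in each matching <entry> element."""
    return ENTRY_QUOTED.sub(
        '\\g<before>\u201c\\g<term>\u201d\\g<after>', xml_text
    )


term_entry = '<entry colname="c1">Press "Start" now</entry>'
inch_entry = '<entry colname="c2">4" pipe, 7" pipe</entry>'

print(curl_entries(term_entry))  # quoted term is converted, markup intact
print(curl_entries(inch_entry))  # inch marks are left alone
```

Like the pattern in the post, `[^>]+` requires at least one character after `<entry`, so an `<entry>` tag without attributes is not matched, and each pass converts at most one quoted term per entry.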


 