PCDATA or CDATA?
Posted: Sat Dec 10, 2005 2:26 pm
by Ambush Commander
I'm building an HTML lexer as part of my DTD-sensitive HTML validator. This is so trivial but...
I have this HTML:
This <b>is bold</b>
which would parse into these tokens:
Code:
0 => HTML_PCDATA('This ')
1 => HTML_StartTag('b')
2 => HTML_PCDATA('is bold')
3 => HTML_EndTag('b')
So, two petty naming questions:
1. In the context I am using it, does PCDATA make sense, or should it be CDATA?
2. Is HTML the wrong namespace? If so, what should I use? XML?
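For readers following along, here is a minimal sketch of a lexer that produces a token stream of the shape listed above. It is not the poster's implementation: the class names `Text`, `StartTag`, and `EndTag`, the `lex` function, and the input string `This <b>is bold</b>` (inferred from the token list) are all assumptions made for illustration. Attributes, comments, and entities are deliberately left out.

```python
import re
from dataclasses import dataclass


# Hypothetical token types; the names are illustrative only.
@dataclass(frozen=True)
class Text:
    data: str


@dataclass(frozen=True)
class StartTag:
    name: str


@dataclass(frozen=True)
class EndTag:
    name: str


# Matches a bare start or end tag such as <b> or </b>.
# Attributes, comments, and entities are not handled in this sketch.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\s*>")


def lex(html):
    """Split html into Text, StartTag, and EndTag tokens."""
    tokens = []
    pos = 0
    for match in TAG.finditer(html):
        # Anything between the previous tag and this one is character data.
        if match.start() > pos:
            tokens.append(Text(html[pos:match.start()]))
        closing, name = match.groups()
        tokens.append(EndTag(name) if closing else StartTag(name))
        pos = match.end()
    # Trailing character data after the last tag, if any.
    if pos < len(html):
        tokens.append(Text(html[pos:]))
    return tokens


if __name__ == "__main__":
    for index, token in enumerate(lex("This <b>is bold</b>")):
        print(index, "=>", token)
```

Whatever the text token ends up being called, the lexer itself only needs to separate character data from tags; whether that data counts as PCDATA or CDATA is a question for the DTD layer that sits on top.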
Posted: Sat Dec 10, 2005 5:11 pm
by shoebappa
I'm no DTD expert, but I think the difference between PCDATA and CDATA is that PCDATA could have further tags, but CDATA isn't parsed. PCDATA stands for Parsed Character Data...
So in your example, if the bold tag couldn't contain any child tags, then CDATA would be the better one. If it could, it would be PCDATA, but those child tags would have to be declared. And there's also the "ANY" option...
Posted: Sat Dec 10, 2005 5:13 pm
by Ambush Commander
That makes sense. I just renamed it to Text to avoid ambiguity.
Posted: Sat Dec 10, 2005 5:20 pm
by shoebappa
Heh, I was hoping you wouldn't see the last part of my previous post before I edited it... I'm pretty sure on the DTD side CDATA doesn't have to be enclosed in the <![CDATA[ ]]> tags, but when validating, it can't contain tags unless they are escaped by <![CDATA[ ]]> tags.
Posted: Sat Dec 10, 2005 5:23 pm
by Ambush Commander
I suppose entities like &amp; are not considered "parsed".
This is XML then?
Posted: Sat Dec 10, 2005 5:24 pm
by shoebappa
Yeah, I know in DOM crap it calls them #text nodes, but I'm pretty sure that doesn't jibe with the DTD lingo.
Posted: Sat Dec 10, 2005 5:28 pm
by shoebappa
Technically the snippet wouldn't be XML because it would need a root node <html></html>... Usually when I hear DTD I think XML, and then your HTML might or might not be valid XML, and it might or might not validate off whatever DTD you are using.
I think entities are parsed. They are also declared in the DTD... I know <![CDATA[ ]]> pretty much ignores everything enclosed in it, so &amp; doesn't display as &, it stays &amp;; same goes for tags (nodes).
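The three points in this post (entities are resolved in ordinary text, a CDATA section leaves everything literal, and a fragment without a root element is not XML) can be checked with Python's standard `xml.etree.ElementTree`. This is only an illustrative sketch; the sample strings and variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Ordinary (parsed) character data: the entity reference is resolved
# and <b> becomes a child element.
parsed = ET.fromstring("<p>Fish &amp; <b>chips</b></p>")
parsed_text = parsed.text                      # text before the first child
parsed_children = [child.tag for child in parsed]

# CDATA section: markup and entity references inside it stay literal.
literal = ET.fromstring("<p><![CDATA[Fish &amp; <b>chips</b>]]></p>")
literal_text = literal.text
literal_children = [child.tag for child in literal]

# A fragment with no single root element is not well-formed XML.
try:
    ET.fromstring("This <b>is bold</b>")
    rootless_ok = True
except ET.ParseError:
    rootless_ok = False

if __name__ == "__main__":
    print(repr(parsed_text), parsed_children)
    print(repr(literal_text), literal_children)
    print("rootless fragment parsed:", rootless_ok)
```

In the first case the text comes back as `'Fish & '` with one `b` child; in the second the whole string, tags and `&amp;` included, comes back as text with no children.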
Posted: Sat Dec 10, 2005 5:32 pm
by Ambush Commander
Ah, so that's what I didn't get: Parsed means you parse both tags and entities, but when you're looking at a DOM, you'll have entities parsed but not tags, so it's still technically PCDATA although you're not parsing the tags.
True, XML does have to have a root element. I'm trying to bend the rules here because the data I'm going to be receiving will be loaded with errors.
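As a rough illustration of bending the rules this way, Python's standard `html.parser.HTMLParser` accepts rootless and unclosed fragments and hands back text with entities already resolved, while tags arrive as separate events. This is a sketch of the idea, not the validator being discussed; `TokenCollector`, `tokenize`, and the tuple-based token format are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Collects (kind, value) tuples from tolerant HTML parsing."""

    def __init__(self):
        # convert_charrefs=True resolves entities such as &amp; in text.
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored in this sketch.
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        self.tokens.append(("text", data))


def tokenize(html):
    collector = TokenCollector()
    collector.feed(html)
    collector.close()  # flush any buffered text
    return collector.tokens


if __name__ == "__main__":
    # No root element, and the second fragment never closes <b>.
    print(tokenize("Fish &amp; <b>chips</b>"))
    print(tokenize("This <b>is bold"))
```

Neither input is well-formed XML, yet both yield a usable token stream, which is the behaviour wanted when the incoming markup is expected to be full of errors.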