PCDATA or CDATA?

Ambush Commander
DevNet Master
Posts: 3698
Joined: Mon Oct 25, 2004 9:29 pm
Location: New Jersey, US

Post by Ambush Commander »

I'm building an HTML lexer as part of my DTD-sensitive HTML validator. This is so trivial but...

Take this HTML:

Code:

This <b>is bold</b>

It would parse into these tokens:

Code:

0 => HTML_PCDATA('This ')
1 => HTML_StartTag('b')
2 => HTML_PCDATA('is bold')
3 => HTML_EndTag('b')

So, two petty naming questions:

1. In the context I am using it, does PCDATA make sense, or should it be CDATA?
2. Is HTML the wrong namespace? If so, what should I use? XML?
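
For readers following along, the token stream above can be reproduced with a very small lexer. This is only a sketch under stated assumptions: it is written in Python rather than the poster's own code, the names `lex` and `TAG_RE` and the token kinds `Text`, `StartTag` and `EndTag` are hypothetical, and it ignores attributes, comments and entities entirely.

```python
import re
from typing import List, Tuple

# Hypothetical token shape: (kind, value). Not taken from any real library.
Token = Tuple[str, str]

# Matches a start or end tag; anything between the tag name and '>' is skipped.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def lex(html: str) -> List[Token]:
    """Split html into Text, StartTag and EndTag tokens."""
    tokens: List[Token] = []
    pos = 0
    for match in TAG_RE.finditer(html):
        # Any characters between the previous tag and this one are text.
        if match.start() > pos:
            tokens.append(("Text", html[pos:match.start()]))
        kind = "EndTag" if match.group(1) else "StartTag"
        tokens.append((kind, match.group(2).lower()))
        pos = match.end()
    # Trailing text after the last tag, if any.
    if pos < len(html):
        tokens.append(("Text", html[pos:]))
    return tokens


print(lex("This <b>is bold</b>"))
# [('Text', 'This '), ('StartTag', 'b'), ('Text', 'is bold'), ('EndTag', 'b')]
```

Calling the text token something neutral like `Text` sidesteps the PCDATA/CDATA question at the lexer level, since the lexer itself does not know what the DTD allows inside each element.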
shoebappa
Forum Contributor
Posts: 158
Joined: Mon Jul 11, 2005 9:14 pm
Location: Norfolk, VA

Post by shoebappa »

I'm no DTD expert, but I think the difference between PCDATA and CDATA is that PCDATA can contain further tags, while CDATA isn't parsed. PCDATA stands for Parsed Character Data...

So in your example, if the bold tag couldn't contain any child tags, then CDATA would be the better choice. If it could, it would be PCDATA, but those child tags would have to be declared. And there's also the "ANY" option...
Last edited by shoebappa on Sat Dec 10, 2005 5:17 pm, edited 1 time in total.

Post by Ambush Commander »

That makes sense. I just renamed it to Text to avoid ambiguity.

Post by shoebappa »

Heh, I was hoping you wouldn't see the last part of my previous post before I edited it... I'm pretty sure that on the DTD side CDATA doesn't have to be enclosed in <![CDATA[ ]]> tags, but when validating it can't contain tags unless they are escaped by <![CDATA[ ]]> tags.

Post by Ambush Commander »

I suppose entities like &amp; are not considered "parsed".

This is XML then?

Post by shoebappa »

Yeah, I know in DOM crap it calls them #text nodes, but I'm pretty sure that doesn't jibe with the DTD lingo.
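
The "#text" naming is easy to check. Here is a minimal sketch using Python's standard `xml.dom.minidom`; the `<p>` wrapper is an assumption added only because an XML parser needs a root element, and it is not part of the original snippet.

```python
from xml.dom import minidom

# The <p> root is made up for illustration; the thread's snippet has no root.
doc = minidom.parseString("<p>This <b>is bold</b></p>")

# Collect the DOM node names of the root element's direct children.
names = [child.nodeName for child in doc.documentElement.childNodes]
print(names)  # ['#text', 'b']
```

So the DOM describes a run of characters as a text node, while the DTD describes where such runs are allowed (#PCDATA in a content model). They are two vocabularies for different layers.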

Post by shoebappa »

Technically the snippet wouldn't be XML because it would need a root node <html></html>... Usually when I hear DTD I think XML, and then your HTML might or might not be valid XML, and it might or might not validate against whatever DTD you are using.

I think entities are parsed. They are also declared in the DTD... I know <![CDATA[ ]]> pretty much ignores everything enclosed in it, so &amp; isn't displayed as &, it stays &amp;. The same goes for tags (nodes).
Last edited by shoebappa on Sat Dec 10, 2005 5:32 pm, edited 1 time in total.
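
The entity versus CDATA-section behaviour described in this post can be seen with Python's standard `xml.etree.ElementTree`. This is just an illustrative sketch: the `<p>` wrapper and the sample string are assumptions, not something from the thread.

```python
import xml.etree.ElementTree as ET

# Outside the CDATA section, &amp; is an entity reference and gets expanded.
# Inside it, both the markup and the entity reference are kept as literal text.
elem = ET.fromstring("<p>a &amp; b <![CDATA[<b>&amp;</b>]]></p>")

print(elem.text)  # a & b <b>&amp;</b>
print(len(elem))  # 0 (the <b> inside the CDATA section never became an element)
```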

Post by Ambush Commander »

Ah, so that's what I didn't get: Parsed means you parse both tags and entities, but when you're looking at a DOM, you'll have entities parsed but not tags, so it's still technically PCDATA although you're not parsing the tags.

True, XML does have to have a root element. I'm trying to bend the rules here because the data I'm going to be receiving will be loaded with errors.
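
One common way to bend the root-element rule is to wrap the fragment in a synthetic root before handing it to an XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `parses_as_xml` and the `<root>` wrapper are hypothetical, and this is not the poster's actual approach.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(source: str) -> bool:
    """Return True if source is a well-formed XML document."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


fragment = "This <b>is bold</b>"

# The bare fragment has text before any element, so it is rejected.
print(parses_as_xml(fragment))  # False

# Wrapping it in a made-up <root> element makes it a well-formed document.
wrapped = ET.fromstring(f"<root>{fragment}</root>")
print(wrapped.text, wrapped[0].tag, wrapped[0].text)  # This  b is bold
```

Note that wrapping only fixes the missing root. Input with unclosed tags or stray ampersands would still be rejected by a strict XML parser, which is presumably why a hand-written, error-tolerant lexer is being built here.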