HTML Scraping Special Characters

CodeLab
Forum Newbie
Posts: 7
Joined: Sun Apr 18, 2010 11:14 am

Post by CodeLab »

Hi...

I am using the simple_html_dom class to parse an HTML page, and the parsed results are then written to a simple text file.

The extracted result has some special/unreadable characters.

For example:

HTML page has - œSaporiti Italia” label
Extracted result is - Å“Saporiti Italiaâ€ label

HTML page has - Jos» Zanine
Extracted result is - JosÂ» Zanine

Besides the unreadable characters, there are some understandable things that I want to get rid of.

Example:
HTML page has - 30"w x 30"d x 42"h
Extracted result is - 30&quot;w x 30&quot;d x 42&quot;h

About the last example, I understand why it happens. Is there a simple workaround to get rid of it without a complicated regex?

Thanks
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Post by requinix »

That's generally not stuff you need to deal with yourself. The first part is because of mixed character encodings and the second part may be because of the code used.

So
1. What page are you scraping?
2. Do you have any header("Content-Type:..."); in your code?
3. What does the top of your HTML output look like?
4. What's the code?

Post by CodeLab »

Hi tasairis
This is one of the pages I am scraping:
http://www.treadwaygallery.com/lots.php ... ctionID=10

You can see there are lots of " (quotes), which appear as &quot; in the output text file.
Then there are some names like Jer», which appear differently in the output.

2) I don't have header("Content-Type:...") in the code.
Should I include it?
What exactly should the Content-Type be?

3) The top of the HTML is exactly what you see when you open the page in a browser and view the source.

4) The code is very simple, the kind a beginner can write by looking at examples.
It uses PHP cURL to download the source of the HTML page first, because some pages use cookies, POST data, etc.

Then it uses the simple_html_dom class (http://simplehtmldom.sourceforge.net/) to parse the HTML source.

Post by requinix »

Their site seems to have a few encoding problems... Inconsistencies, disagreements...

Best I can figure, the pages are encoded in UTF-8 while the HTML itself claims to be ISO-8859-1. Lovely. That presents a problem, but I don't think your code or Simple HTML DOM is getting tripped up by it.

So you get a UTF-8 encoded string: that explains the "Å“Saporiti Italiaâ€" problem. The Simple HTML DOM class doesn't seem to do any HTML entity work, so I guess the entities are coming from your code.
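
A minimal sketch of how a mismatch like that could be spotted, assuming the charset is declared in a meta tag. The helper names declaredCharset and isValidUtf8 and the sample markup are made up for illustration, and the validity check is only a heuristic (plain ASCII text is valid UTF-8 too).

```php
<?php
// Hypothetical helper: returns the charset named in a <meta> tag, or null if none is found.
function declaredCharset(string $html): ?string
{
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Hypothetical helper: true when the bytes form valid UTF-8.
// preg_match() with the /u modifier returns false on invalid UTF-8 input.
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// Made-up fragment: the meta tag says ISO-8859-1, but the body bytes are UTF-8.
$html = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
      . "<p>\xC5\x93Saporiti Italia</p>";

$declared  = declaredCharset($html);
$looksUtf8 = isValidUtf8($html);

if ($declared !== 'UTF-8' && $looksUtf8) {
    echo "Page declares $declared but the bytes are valid UTF-8\n";
}
```

If the declared charset and the actual bytes disagree, it is usually safer to trust the bytes than the meta tag.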

1. Do you have any calls to htmlentities in your code?
2. Before writing anything to the file, run it through utf8_decode first.
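
For reference, a rough sketch of what goes wrong and what the conversion in step 2 does. It uses iconv rather than utf8_decode, which newer PHP versions deprecate. The byte strings are hand-written samples based on the characters mentioned in this thread, not data taken from the actual page.

```php
<?php
// "œSaporiti" as UTF-8 bytes: the "œ" is the two bytes 0xC5 0x93.
$utf8 = "\xC5\x93Saporiti";

// Reading those bytes as Windows-1252 turns one character into two ("Å“"),
// which is the kind of garbage that shows up in the output file.
$misread = iconv("Windows-1252", "UTF-8", $utf8);

// Converting from UTF-8 to ISO-8859-1 (what utf8_decode does) works for characters
// that exist in ISO-8859-1, such as "ê": two UTF-8 bytes become the single byte 0xEA.
$latin1 = iconv("UTF-8", "ISO-8859-1", "\xC3\xAA");

echo $misread, "\n";
echo bin2hex($latin1), "\n";
```

Note that "œ" itself has no ISO-8859-1 equivalent, so a conversion to that charset cannot keep every character the page uses.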

Post by CodeLab »

Hi...
I don't really understand everything you described in your previous post (I'm quite a beginner here).

But some of the functions you mentioned seem to have solved the problem for now.

In the section of code that parses the HTML, I added this line:

Code: Select all

$description = htmlentities($description, ENT_NOQUOTES, "UTF-8");
It seems to have solved the problem of unreadable characters like ”.

One more thing:
how can I solve &quot; and similar?
I mean, replace &quot; with its equivalent ", and likewise for other similar things, in the output text file.

Is there a PHP function that can easily do this?

Thanks

Post by requinix »

Before you added that code, was there an htmlentities anywhere?

Post by CodeLab »

No.
Before I added that code, there was no htmlentities call.
I checked the simple_html_dom.php file.
It does not contain any htmlentities calls either.

Post by requinix »

I suspect you'll have problems with it, so at least for testing purposes: what if you run everything through html_entity_decode before writing it to the file?

Post by CodeLab »

I tried that already.
Before writing to the file, I added this line:

Code: Select all

$description=html_entity_decode($description,ENT_NOQUOTES,"UTF-8");
I tried all three flags:
ENT_COMPAT, ENT_QUOTES, ENT_NOQUOTES

I still get &quot; in the output file, and also a lot of others, like &#234; for ê and &#039; for a single quote.

Post by requinix »

Can you post all of the code you have?

Post by CodeLab »

Thank you very much for your help so far.

I've PMed you the code.

Post by requinix »

Ah...

The original page has entities already. You need an html_entity_decode, but you have an htmlentities too. Put those together and you get the same thing you had before - no changes. You need one more html_entity_decode to actually decode entities. Also you can't use the ENT_NOQUOTES flag because you want the quotes converted.
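
A small sketch of why the pair cancels out, using a made-up sample string in place of the real scraped text:

```php
<?php
// The scraped text already contains entities, just like the page source.
$scraped = '30&quot;w x 30&quot;d';

// htmlentities() escapes the "&" again, and html_entity_decode() only undoes that step,
// so the pair hands back the input unchanged.
$encoded   = htmlentities($scraped, ENT_NOQUOTES, 'UTF-8');
$roundTrip = html_entity_decode($encoded, ENT_NOQUOTES, 'UTF-8');

// A single decode with ENT_QUOTES converts both kinds of quotes and numeric entities.
$decoded = html_entity_decode('30&quot;w, it&#039;s, &#234;', ENT_QUOTES, 'UTF-8');

echo $encoded, "\n", $roundTrip, "\n", $decoded, "\n";
```

Here $encoded holds 30&amp;quot;w x 30&amp;quot;d, and $roundTrip is back to the original string with &quot; still in it.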

Except the only thing your htmlentities/html_entity_decode pair accomplishes is handling character sets. iconv is better.
Replace

Code: Select all

$iDesc=$table_1TD->plaintext;
$iDesc=htmlentities($iDesc,ENT_NOQUOTES,"UTF-8");
$iDesc=html_entity_decode($iDesc,ENT_NOQUOTES);
with

Code: Select all

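$iDesc=$table_1TD->plaintext;
// Decode entities first (ENT_QUOTES so &quot; and &#039; are converted too).
$iDesc=html_entity_decode($iDesc,ENT_QUOTES,"UTF-8");
// Plain "ASCII" makes iconv fail on the first non-ASCII character; //TRANSLIT approximates instead.
$iDesc=iconv("UTF-8","ASCII//TRANSLIT",$iDesc);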
There's one issue though: the Euro symbol is not an ASCII character, so it cannot survive a conversion to plain ASCII (at best it gets approximated or dropped). The better solution is to settle on one real encoding for the output file, such as UTF-8; what you're trying to do now only permits standard ASCII characters.
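
A rough sketch of the "adopt an encoding" route, assuming UTF-8 is chosen for the output file. The helper name cleanScrapedText, the sample text, and the file name are placeholders, not part of the original code.

```php
<?php
// Hypothetical helper: decode entities and keep the result as UTF-8,
// so characters like the Euro sign are preserved instead of dropped.
function cleanScrapedText(string $raw): string
{
    $text = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');
    return trim($text);
}

// Made-up sample line containing entities.
$line = cleanScrapedText(' Estimate: &euro;500, 30&quot;w x 30&quot;d ');

// Write the UTF-8 bytes as they are; the path is a placeholder.
$path = sys_get_temp_dir() . '/scrape-output.txt';
file_put_contents($path, $line . "\n");
```

The text file then has to be opened as UTF-8 in whatever editor or program reads it; otherwise the same garbled characters come back.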