HTML Scrapping Special Characters

Posted: Sun Apr 18, 2010 11:29 am
by CodeLab
Hi...

I am using the simple_html_dom class to parse an HTML page, and the parsed results are then written to a simple text file.

The extracted result has some special/unreadable characters.

For example:

HTML page has: “Saporiti Italia” label
Extracted result is: œSaporiti Italia” label

HTML page has: José Zanine
Extracted result is: Jos» Zanine

Besides the unreadable characters, there are some things I understand but want to get rid of.

Example:
HTML page has: 30"w x 30"d x 42"h
Extracted result is: 30&quot;w x 30&quot;d x 42&quot;h

About the last example: I understand why it happens. Is there a simple workaround to get rid of it without a complicated regex?

Thanks

Re: HTML Scrapping Special Characters

Posted: Sun Apr 18, 2010 5:12 pm
by requinix
That's generally not stuff you need to deal with yourself. The first part is because of mixed character encodings and the second part may be because of the code used.

So
1. What page are you scraping?
2. Do you have any header("Content-Type:..."); in your code?
3. What does the top of your HTML output look like?
4. What's the code?

Re: HTML Scrapping Special Characters

Posted: Mon Apr 19, 2010 3:29 am
by CodeLab
Hi tasairis,
This is one of the pages I am scraping:
http://www.treadwaygallery.com/lots.php ... ctionID=10

You can see there are lots of " (quotes), which appear as &quot; in the output text file.
Then there are some names like Jer», which appear differently in the output.

2) I don't have header("Content-Type:...") in the code. Should I include it? What exactly should the Content-Type be?

3) The top of the HTML is exactly what you see when you open the page in a browser and view the source.

4) The code is very simple, the kind a beginner can write by looking at examples. It uses PHP cURL to download the HTML source first, since some pages use cookies, POST data, etc.

Then it uses the simple_html_dom class (http://simplehtmldom.sourceforge.net/) to parse the HTML source.

Re: HTML Scrapping Special Characters

Posted: Mon Apr 19, 2010 5:34 am
by requinix
Their site seems to have a few encoding problems... Inconsistencies, disagreements...

Best I can figure, pages are encoded with UTF-8 (while the HTML itself claims to be ISO-8859-1. Lovely.) That presents a problem, but I don't think your code or Simple HTML DOM is getting tripped up by it.
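
For anyone wanting to check this kind of mismatch themselves, here is a minimal sketch. The function names (declared_charset, has_non_ascii, is_valid_utf8) are made up for illustration and are not part of Simple HTML DOM, and the sample HTML is an invented stand-in for the real page.

```php
<?php
// Hypothetical helpers, not part of Simple HTML DOM or of the code in this thread.

// Return the charset named in a <meta> tag (upper-cased), or null if none is found.
function declared_charset(string $html): ?string
{
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// True if the string contains at least one byte outside plain ASCII.
function has_non_ascii(string $bytes): bool
{
    return preg_match('/[\x80-\xFF]/', $bytes) === 1;
}

// True if the bytes form valid UTF-8 (the /u modifier makes PCRE validate the subject).
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// Invented sample: the meta tag claims ISO-8859-1, the body holds UTF-8 curly quotes.
$html = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
      . "<p>\xE2\x80\x9CSaporiti Italia\xE2\x80\x9D label</p>";

$declared = declared_charset($html);
$looksUtf8 = has_non_ascii($html) && is_valid_utf8($html);

echo "declared: ", $declared, "\n";
echo "looks like UTF-8: ", $looksUtf8 ? "yes" : "no", "\n";
```

Pure ASCII is valid UTF-8 too, which is why the sketch only draws a conclusion when non-ASCII bytes are present. Accented Latin-1 text rarely happens to validate as UTF-8, so "declared ISO-8859-1 but validates as UTF-8" is a reasonable hint, not proof.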

So you get a UTF-8 encoded string: that explains the "Å“Saporiti Italiaâ€" problem. The Simple HTML DOM class doesn't seem to do any HTML entity work so I guess it's coming from your code.

1. Do you have any calls to htmlentities in your code?
2. Before writing anything to the file, run it through utf8_decode first.
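
To illustrate what is going on with the bytes, here is a small sketch. It uses iconv rather than utf8_decode, and the sample name is only an assumed example of an accented string, not something taken from your code.

```php
<?php
// "José" as UTF-8 bytes; the name is only an illustrative assumption.
$utf8 = "Jos\xC3\xA9";

// What a viewer that assumes ISO-8859-1 effectively does: it treats each byte
// as one Latin-1 character, so the two bytes of "é" show up as "Ã©".
$misread = iconv('ISO-8859-1', 'UTF-8', $utf8);

// Converting instead of misreading: UTF-8 in, real ISO-8859-1 bytes out.
// For characters that exist in ISO-8859-1 this is the job utf8_decode() does.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);

echo bin2hex($utf8), "\n";    // the original UTF-8 bytes
echo $misread, "\n";          // the mojibake, printed as UTF-8
echo bin2hex($latin1), "\n";  // one byte per character after conversion
```

Note that curly quotes do not exist in ISO-8859-1 at all, so utf8_decode turns them into "?". That is one reason converting to a single-byte charset only goes so far.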

Re: HTML Scrapping Special Characters

Posted: Mon Apr 19, 2010 6:46 am
by CodeLab
Hi,
I don't really understand everything you described in your previous post (quite a beginner here).

But some of the functions you mentioned seem to have solved the problem for now.

In the section of code that parses the HTML, I added this line:

Code:

$description = htmlentities($description, ENT_NOQUOTES, "UTF-8");

It seems to have solved the problem of unreadable characters like ”.

One more thing: how can I fix &quot; and similar? I mean, replace &quot; with its equivalent ", and likewise for other entities, in the output text file.

Is there a PHP function which can easily do this?

Thanks

Re: HTML Scrapping Special Characters

Posted: Mon Apr 19, 2010 7:45 am
by requinix
Before you added that code, was there an htmlentities anywhere?

Re: HTML Scrapping Special Characters

Posted: Mon Apr 19, 2010 8:16 am
by CodeLab
No.
Before I added that code, there was no htmlentities call.
I checked the simple_html_dom.php file; it does not contain any htmlentities call either.

Re: HTML Scrapping Special Characters

Posted: Mon Apr 19, 2010 8:30 am
by requinix
I suspect you'll have problems with it, so at least for testing purposes: what if you run everything through html_entity_decode before writing it to the file?

Re: HTML Scrapping Special Characters

Posted: Mon Apr 19, 2010 10:10 am
by CodeLab
Tried that already.
Before writing to the file I added this line:

Code:

$description=html_entity_decode($description,ENT_NOQUOTES,"UTF-8");

I tried all three flags: ENT_COMPAT, ENT_QUOTES, ENT_NOQUOTES.

I still get &quot; in the output file, and also a lot of others, like &#234; for ê, &#039; for a single quote, and so on.
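
For reference, this is how the three flags behave on a string whose entities have been encoded only once. The sample text is invented for illustration. If the output still contains entities even with ENT_QUOTES, something earlier in the code has probably encoded them a second time.

```php
<?php
// Invented sample containing a named entity and two numeric ones.
$scraped = '30&quot;w, artist&#039;s f&#234;te';

// ENT_NOQUOTES: neither kind of quote is decoded; other entities are.
$none = html_entity_decode($scraped, ENT_NOQUOTES, 'UTF-8');

// ENT_COMPAT: double quotes are decoded, single quotes are left alone.
$compat = html_entity_decode($scraped, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES: both double and single quotes are decoded.
$all = html_entity_decode($scraped, ENT_QUOTES, 'UTF-8');

echo $none, "\n", $compat, "\n", $all, "\n";
```

The third argument matters as well: with 'UTF-8', &#234; comes out as the two UTF-8 bytes for ê, which only displays correctly if the output file is then read as UTF-8.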

Re: HTML Scrapping Special Characters

Posted: Mon Apr 19, 2010 10:12 am
by CodeLab
I still get &quot; in the output file, and also a lot of others, like &-#234; for ê, &-#039; for a single quote, and so on.

(I added a - after the & in &-#039; because otherwise the forum was parsing it.)

Re: HTML Scrapping Special Characters

Posted: Mon Apr 19, 2010 3:41 pm
by requinix
Can you post all of the code you have?

Re: HTML Scrapping Special Characters

Posted: Mon Apr 19, 2010 3:55 pm
by CodeLab
Thank you very much for your help so far.

I've PMed you the code.

Re: HTML Scrapping Special Characters

Posted: Mon Apr 19, 2010 6:28 pm
by requinix
Ah...

The original page has entities already. You need an html_entity_decode, but you have an htmlentities too. Put those together and you get the same thing you had before - no changes. You need one more html_entity_decode to actually decode entities. Also you can't use the ENT_NOQUOTES flag because you want the quotes converted.
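
A small sketch of why the pair cancels out, using an invented sample string rather than your actual data:

```php
<?php
// Text as scraped: the page itself already contains the entity.
$scraped = '30&quot;w';

// htmlentities() encodes the leading & of the existing entity as &amp;
$encoded = htmlentities($scraped, ENT_NOQUOTES, 'UTF-8');

// One html_entity_decode() only undoes that step, giving back the scraped text.
$decodedOnce = html_entity_decode($encoded, ENT_NOQUOTES, 'UTF-8');

// A second decode, with a flag that allows quotes, finally yields the real quote.
$decodedTwice = html_entity_decode($decodedOnce, ENT_QUOTES, 'UTF-8');

echo $encoded, "\n", $decodedOnce, "\n", $decodedTwice, "\n";
```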

Except the only thing your htmlentities/html_entity_decode pair accomplishes is handling character sets. iconv is better.
Replace

Code:

$iDesc=$table_1TD->plaintext;
$iDesc=htmlentities($iDesc,ENT_NOQUOTES,"UTF-8");
$iDesc=html_entity_decode($iDesc,ENT_NOQUOTES);
with

Code:

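$iDesc=$table_1TD->plaintext;
// //TRANSLIT approximates characters with no ASCII equivalent; without it iconv() fails on them
$iDesc=iconv("UTF-8","ASCII//TRANSLIT",$iDesc);
// decode the entities that were already in the page; ENT_QUOTES also covers &#039;
$iDesc=html_entity_decode($iDesc,ENT_QUOTES);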

There's one issue though: the Euro symbol is technically Unicode, which plain text can't always display (it depends on the font). The best solution here is to adopt some kind of encoding for the output; what you're doing now only permits standard characters.
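
If you do go the "adopt an encoding" route, one option is to keep everything as UTF-8 from start to finish. Below is a minimal sketch of that idea. It assumes the scraped text is already UTF-8, as this site's pages appear to be. The helper name clean_scraped_text, the sample string and the temp-file path are all made up for illustration.

```php
<?php
// Hypothetical helper: decode the page's entities and keep the text as UTF-8.
function clean_scraped_text(string $text): string
{
    // ENT_QUOTES so that &quot; and &#039; are decoded too; 'UTF-8' so that
    // entities such as &euro; or &#234; become UTF-8 bytes.
    return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

// Invented sample line with a named quote, the Euro sign and a numeric entity.
$line = clean_scraped_text('30&quot;w x 30&quot;d, &euro;500, f&#234;te');

// Write the UTF-8 bytes unchanged; the path is a throwaway temp file.
$path = tempnam(sys_get_temp_dir(), 'scrape');
file_put_contents($path, $line . "\n");
```

The file then has to be opened as UTF-8 in whatever reads it, otherwise the same garbled characters come back on the viewing side.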