Python-parser running Beautiful Soup needs some review

lin

Post by lin »

Good day, it's me again, lin, your new fan. I love this place! First of all, I have to admit that I am very happy to have found this great community!

I am currently trying to get a new scraper up and running. I want to write it in Python, using Beautiful Soup. To be frank, I am new to both Python and Beautiful Soup. It is said to be a great tool for parsing and extracting content. So here I am:

I want to extract the content of a <td> tag from a table in an HTML document. For example, I have this table:

Code: Select all

<table class="bp_ergebnis_tab_info">
   <tr>
           <td>
                    This is a sample text
           </td>

           <td>
                    This is the second sample text
           </td>
   </tr>
</table>


How can I use Beautiful Soup to extract the text "This is a sample text"?

Should I use

Code: Select all

soup.findAll('table', attrs={'class': 'bp_ergebnis_tab_info'})
to get the whole table?

See the target http://www.schulministerium.nrw.de/BP/S ... pDO=142323

Here is my approach.

The first thing to do is find the table. I do this with find rather than findAll: find returns the first match, whereas findAll returns a list of all matches, in which case we would have to add an extra [0] to take the first element of the list:

Code: Select all

table = soup.find('table', attrs={'class': 'bp_ergebnis_tab_info'})
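
To make the difference between the two calls concrete, here is a small sketch. It assumes the newer bs4 package under Python 3, where the method is spelled find_all; the two-table HTML string is made up for illustration and is not the real target page.

```python
from bs4 import BeautifulSoup

# Made-up markup with two tables of the same class (not the real page).
html = """
<table class="bp_ergebnis_tab_info"><tr><td>first table</td></tr></table>
<table class="bp_ergebnis_tab_info"><tr><td>second table</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
attrs = {"class": "bp_ergebnis_tab_info"}

# find_all returns a list-like result holding every match ...
all_tables = soup.find_all("table", attrs=attrs)

# ... while find returns only the first match, or None if nothing matches.
first_table = soup.find("table", attrs=attrs)
missing = soup.find("table", attrs={"class": "no_such_class"})

print(len(all_tables))               # how many tables matched
print(all_tables[0] == first_table)  # indexing [0] gives the same table as find
print(missing)                       # None, so check before using the result
```

Since find can return None, it seems safer to check the result before calling anything on it.
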
Then use find again to find the first td:

Code: Select all

# search inside the table found above, not the whole document
first_td = table.find('td')
Then we have to use renderContents() to extract the textual contents:

Code: Select all

text = first_td.renderContents()
... and the job is done, though we may also want to use strip() to remove leading and trailing whitespace:

Code: Select all

trimmed_text = text.strip()
This should give us:

Code: Select all

print trimmed_text
This is a sample text
as desired.
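
Put together, the steps could look like the sketch below. It is only a sketch under a few assumptions: the newer bs4 package with Python 3 (not the older BeautifulSoup 3 API used in the snippets above), the sample table pasted in as a string (the html variable is hypothetical; for the real page the HTML would have to be downloaded first), and get_text(strip=True) standing in for renderContents() followed by strip().

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; stands in for the real page.
html = """
<table class="bp_ergebnis_tab_info">
   <tr>
           <td>
                    This is a sample text
           </td>

           <td>
                    This is the second sample text
           </td>
   </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: find the first table with the wanted class (None if absent).
table = soup.find("table", attrs={"class": "bp_ergebnis_tab_info"})
if table is None:
    raise ValueError("table with class bp_ergebnis_tab_info not found")

# Step 2: look for the first <td> inside that table only.
first_td = table.find("td")

# Step 3: take the cell's text with surrounding whitespace removed.
trimmed_text = first_td.get_text(strip=True)

print(trimmed_text)  # This is a sample text
```

As far as I understand, renderContents() gives back the rendered inner markup of the cell, so any nested tags would show up in the result, while get_text() returns only the text. For a plain cell like the one here the two should amount to the same thing.
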

What do you think of the code? I would love to hear from you!

greetings
your lin