Page 1 of 1

share advice on data/implementation design

Posted: Wed Oct 19, 2005 2:00 pm
by newmember
Hi,

The project i'm working on deals with report forms.
There expected to be as much as 10000 instances of such forms.

On one hand, each report form has definite hierarchical structure that can be naturally implemented as xml document. It is not practical to store it in database for at least 2 reasons:

1. the hierarchy of this form is quite complex and using normalized approach is quite a mess that will require 7+ extra tables.

2. the hierarchy elements do not need to be accessed in any special way like sorting.

On the other hand if i store each report as xml there will be huge space overhead...(although i could improve that by compressing xml data)

What do you about this space overhead issue or the decision to use xml format??

Re: share advice on data/implementation design

Posted: Wed Oct 19, 2005 4:41 pm
by timvw
newmember wrote: 1. the hierarchy of this form is quite complex and using normalized approach is quite a mess that will require 7+ extra tables.
As you say: the model is a hierarchy. And most people struggle implementing a hierarchical model/tree with SQL.
But there are a couple usable solutions: Adjacency List, Materialized path, Nested Sets, ..
newmember wrote: 2. the hierarchy elements do not need to be accessed in any special way like sorting.
Imho these are the tasks of a database:
- make data accessible
- secure access to data
- avoid redundancy
- guard consistency of data

The fact that you are going to manipulate them or not, doesn't really matter imho.
newmember wrote: On the other hand if i store each report as xml there will be huge space overhead...(although i could improve that by compressing xml data)
An XML file is a hierarchical model by design. So it can be convenient to store your hierarchical data in such a model. I don't think that XML's verboseness should be a reason to (not) use it, certainly if you consider that datastorage is extremely cheep these days...

Issues with an XML file:
- no support for concurrent access
- no real guards to maintain data in a consistent state (dtd and xsd can help here)
- less performant than a standard rdbms

Basically, i would look for a solution that can store a tree/xml document in a SQL database. This way you have best of two worlds.

Posted: Wed Oct 19, 2005 6:03 pm
by newmember
i'll start quoting you back :D
Basically, i would look for a solution that can store a tree/xml document in a SQL database
yes, i was thinking about this too...


thanks for the link... very interesting... :D
particularly the part about nested sets...once i was thinking about using number ranges,i don't recall now what for, but i couldn't figure out the mathematics :D
Imho these are the tasks of a database:
- make data accessible
- secure access to data
- avoid redundancy
- guard consistency of data

The fact that you are going to manipulate them or not, doesn't really matter imho.
not less important is the ability to find the data fast...that's why i mentioned 'sorting' earlier...
- no support for concurrent access
as you suggested, if i store xml data in database then no problem with that...

- less performant than a standard rdbms


two cases:
1. entire xml tree(report) is stored in database
2. storing report form using nested sets approach.

consider the operation of displaying the report form.

in case 1: i retrive entire xml blob from db and using dom api or xsl api to generate html.

in case 2: i need to perfom certain number of non trivial queries to get all the fields (the hierarchy has at least 5 levels of nesting and these levels are all different).

do you believe that database will outperfom the dom api/xsl?
(imho in these particular cases dom will be faster)

Posted: Thu Oct 20, 2005 9:19 am
by timvw
newmember wrote: two cases:
1. entire xml tree(report) is stored in database
I wouldn't consider that a good idea.. In that case you are simply storing a file.. And files are best stored in a filesystem...
newmember wrote: 2. storing report form using nested sets approach.
This is the approach i like most...

newmember wrote: in case 1: i retrive entire xml blob from db and using dom api or xsl api to generate html.
In my experience performing XSL is a CPU hungry process...
newmember wrote: in case 2: i need to perfom certain number of non trivial queries to get all the fields (the hierarchy has at least 5 levels of nesting and these levels are all different).
With the Nested Sets, or any other Tree "solution" for a SQL product i think you can reduce that number of queries... Might want to have a look at http://pear.php.net/package/DB_NestedSet

Posted: Thu Oct 20, 2005 12:22 pm
by newmember
the host doesn't have DB_NestedSet package installed and doesn't want to install it either :?
and i don't have the time to write my own implementation...

so what another solution would you use?
newmember wrote:

two cases:
1. entire xml tree(report) is stored in database


I wouldn't consider that a good idea..
can you please explain why? (because of waste of space?)

Posted: Thu Oct 20, 2005 1:03 pm
by newmember
Consider these 3 points:

1.
as i already mentioned earlier, is that in my case levels in hierarchy are all completely different from each other...
it not the same as maintaining hierarchy in single table...

various algorithems for maintaing hierarchy, that i saw, are dealing with hierarchy of identical elements(like menu entries for example).
so if i want to use them for structuring reports storage i probably would need additional tables for cross-linking...


2.
another point that comes to mind is that if i use nested intervals kind of approach then it will be very hard later to expand the report format because it is actually is hardcoded into cross-linked table structure...

3.
In my experience performing XSL is a CPU hungry process...
on the contrary,in many cases, i can ease on the server by using the visitor's machine for doing xsl transformation.
(ofcourse the user's browser must on of the recent ones...but it is not really much of a problem)



Do you still think that
Nested Sets, or any other Tree "solution"
is suitable one for my needs?

thanks, very appriciate your answers :D

Posted: Thu Oct 20, 2005 6:31 pm
by timvw
newmember wrote:the host doesn't have DB_NestedSet package installed and doesn't want to install it either :?
You can have a custom/local install of pear packages..

newmember wrote: 1. entire xml tree(report) is stored in database
Because in that case your xml would only be a bit of blob.. binary data.. And a dbms is usually not specialized for this.. A filesystem on the other hand is..

Posted: Thu Oct 20, 2005 6:38 pm
by timvw
newmember wrote: as i already mentioned earlier, is that in my case levels in hierarchy are all completely different from each other...
it not the same as maintaining hierarchy in single table...

various algorithems for maintaing hierarchy, that i saw, are dealing with hierarchy of identical elements(like menu entries for example).
so if i want to use them for structuring reports storage i probably would need additional tables for cross-linking...
Imho, a XML document is a hierarchical structure that nodes and/or attributes.
So basically, i only see 2 tables at most (nodes and attributes)
newmember wrote: another point that comes to mind is that if i use nested intervals kind of approach then it will be very hard later to expand the report format because it is actually is hardcoded into cross-linked table structure...
I've got the feeling that you might want to rethink your complete "report/data" solution. Because i don't think it's very wise to store a specific format/report of your data anyway. All you need is the actual data, and then retrieve it from your database and decorate it with any markup/xml you want.

Posted: Fri Oct 21, 2005 6:08 am
by newmember
All you need is the actual data, and then retrieve it from your database and decorate it with any markup/xml you want
Well, xml format itself is meant for describing data....so i wouldn't call it just "decorate"...
"decoration" is when you transform your xml data/file into some other target-specific format...

Posted: Fri Oct 21, 2005 6:56 am
by timvw
newmember wrote: Well, xml format itself is meant for describing data....
Imho, XML does not describe data. And it never will..

from http://www.w3.org/XML/
Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.

Posted: Fri Oct 21, 2005 9:45 am
by newmember
bear with me just a little longer :)

i still thinking about the right approach...and
the thing is that i have the feeling that for my particular needs storing such a report as xml blob is the right solution...
i can't write here the exact hierarchy of report form simply because i'm still don't have all the details.
they showed me some example reports and i basicly looked at the structure...

so imagine heirarchy something like here:

Image

if i'm going for "nested sets" kind of solution then there are two important thing about this hierarchy:

1. each level describes different data so there must be a separate table for each level (i don't see any other way)

2. all this data will never be manupilated (from users point of view) in any way exept viewing it...

now as i mentioned the expected number of such report forms is of 10^4 order.
calculate now how many rows will such a report require... two last levels are expected to have very high branching factor...
so in the end we going to have millions if not more of rows.
add to that all the huge extra space taken by cross-referencing all this data...
and finally recall that all these millions of rows by itself have no information value...

compare all this mess with 10000+ records of xml blobs...
one query to extract the blob and ....a little parsing for display...

tell me what you think about it...
thanks

Posted: Fri Oct 21, 2005 3:30 pm
by timvw
newmember wrote:and
the thing is that i have the feeling that for my particular needs storing such a report as xml blob is the right solution...
Imho, filesystems are optimized for storing files. So why do you think that storing a file in a sql database is better? Where's the added value?

newmember wrote: if i'm going for "nested sets" kind of solution then there are two important thing about this hierarchy:
1. each level describes different data so there must be a separate table for each level (i don't see any other way)
I've got the feeling you don't understand the model. It's not about the actual data is contains, it's about the relation betweeen the data (tree). Just as in an xml document you have a root node with childnodes.
newmember wrote: 2. all this data will never be manupilated (from users point of view) in any way exept viewing it...
In that case, simply storing your xml file in a filesystem seems most appropriate.


But i still got the feeling the data analysis smells..

Posted: Fri Oct 21, 2005 4:53 pm
by newmember
So why do you think that storing a file in a sql database is better?
then i don't need to track filenames and paths...i mean everything is put in one place...
I've got the feeling you don't understand the model. It's not about the actual data is contains, it's about the relation betweeen the data (tree).
i might not understand... :?

how would you model the report data hierarchy? how many tables would you use and how?

Posted: Sat Oct 22, 2005 4:20 am
by timvw
newmember wrote: then i don't need to track filenames and paths...i mean everything is put in one place...
You still have to track the primary key. Just as the absolute file path is the primary key in a filesystem.
And you can store all files in the same directory, so they are all in one place too...
I've got the feeling you don't understand the model. It's not about the actual data is contains, it's about the relation betweeen the data (tree).
It all depends on the actual data (and what the data means) and the relationships between the elements of data.

Probably the "report" itself is just a product/process of other data sources.. Fe: You know that the age of a person is 15. But you know that age equals difference between now and birthdate years. Which one do you consider most flexible? :)

On the other hand if the reports aren't going to change anymore, and you don't need to perform searches in the report data storing them in a static file isn't a bad idea. This way you don't need to (re)calculate stuff that isn't going to change anyway.

Posted: Sat Oct 22, 2005 6:23 am
by newmember
earlier i didn't explain correctly why i wanted to use xml blobs instead filesystem...
it is because i need to track some data along with each report(for example order is important)...
so that there will be anyway record in db for each report and i thought that if so then why not store xml file itself in db as blob.
but after some thinking i guess that filesystem is more appropriate...


ok thanks for all your help,
now it is the time to start coding all the stuff...
thanks again :D