Page 1 of 1

Complex query -- any suggestions

Posted: Wed Jan 08, 2003 12:50 pm
by Tbull
Very complex Grouping and counting challenge

Can anyone offer guidance and suggest SQL which will assist in
resolving this complex and challenging (for me) issue? I hope that
my attempt to clearly state this problem is successful.

The Problem:

Queries and presentation using PHP Version 4.0.4pl1, have an
MySQL 3.23.33 dataset with the following columns of interest.
Due to the number of records (literally 10s of thousands which must
be calculated) I think that the more that can be done in the query using
the MySQL engine the better. Using array_walk() or stepping through row
by row of thousands of records takes too long!

Running a simple query for EvDate between 2001-09-01 and 2001-09-07 (7 day
period used in AveConc) will return a dataset from db rows similar to those
in the table A.

In the query they are ordered using EvDate so that the earliest hour/start
time will be in order. I have shown "ifCalc'd" and the "Conc" to show what
needs to process. (Conc means concurrent)

Using Table A as an example, indicates that an event, RID 31911
starting at 01:12:20 on the 1st of Sep occurred for a Duration of
9 hrs, 53 mins and 25 seconds. If one were to add the duration
to the EvDate_Time the "ifCalc'd" or end time would be 11:05:25.

The event in RID 31912 starts later then RID 31911, and occurs during
the window when 31911 is occurring. Likewise, RID 31913 has a period of
concurrent time with both 31911 and 31912. The number of events which
are occuring concurrently is 3. Looking at the remaining rows shows that
RID 31915 on the 4th is concurrent with 31914 as the start time for 31915
is less then or equal and therefore concurrent with the "ifCalc'd" end
time of 31914.

Neither 31916 or 31917 overlap so the conc for those records is 1, even
though 2 events occured.

31918, 31919 and 31920 have concurrency for 31918 and 31919 so the
concurrency for that date is 2, even though 3 events occured.

Looking at the remaining rows should look similar and familiar to those
above. The AveConc uses the MAXConc on each day divided by number of
days in the original request. In this example it was 7 days, 9/1 - 9/6,
but could have been 3 to X.

Code: Select all

Table A -----------

  RID    Event          EvDate_Time         Duration  ifCalc'd    
 31911  EventType1    2001-09-01 01:12:20   09:53:05  11:05:25 x \
 31912  EventType1    2001-09-01 10:12:40   05:56:39  16:09:19 x -- Conc= 3
 31913  EventType1    2001-09-01 11:08:05   05:53:36  17:01:41 x /
 31914  EventType1    2001-09-04 00:01:42   01:11:09  01:12:51 x - Conc= 2
 31915  EventType1    2001-09-04 01:12:51   22:48:12  24:01:03 x /
 31916  EventType1    2001-09-05 00:01:03   01:11:32  01:12:35 -- Conc= 1
 31917  EventType1    2001-09-05 01:12:36   22:48:46  24:01:22 /
 31918  EventType1    2001-09-06 00:01:23   01:11:02  01:12:25 x - Conc= 2
 31919  EventType1    2001-09-06 01:12:28   07:27:36  08:40:03 x /
 31920  EventType1    2001-09-06 09:45:32   00:00:16  09:45:48 - --
  n+1...
 33111  EventType42    2001-09-01 00:12:20  08:53:05  11:05:25 -- Conc= 1
 33112  EventType42    2001-09-01 11:12:40  05:56:39  17:09:19 /
 33113  EventType42    2001-09-03 18:08:05  05:53:36  24:01:41 - Conc= 1
N+1 to end of data set


Desired output results to be returned as an HTML table.

 Table B ------------

  Event     Events    MAXConc   AveConc    TotalDuration
EventType1    10         3       1.14       78:21:53  

 ... other eventtypes rows ......

EventType42    3         1       0.29       20:43:20
Any and all thoughts or suggestions (well almost any *grin*) will be
appreciated.

Thanks,
Tbull

Posted: Thu Jan 09, 2003 4:36 am
by DaiWelsh
Hmm, the site disappeared when I posted my response and I dont have time to go through it all again, but in essence I had what I thought could be a solution by

querying total duration for all events
joining the table to itself to get the total duration where conc=2 or more
joining to itself twice to get total duration where conc=3 or more

etc. until there are no matches so you have reached max conc

Then you have the split of time spent at conc=0-x so you can get average conc.

Hope that still makes sense in the greatly condensed form.

Regards,

Dai

Posted: Thu Jan 09, 2003 12:52 pm
by Tbull
First and most important, THANK YOU for looking at and offering your suggestion.

I am not certain that I followed each of your steps, but the concept of duration isn't the problem. It is the number of concurrent events. MAXIMUM number in anyone day that are occuring at the same time.

Thanks again and I sincerely appreciate your input.

Regards,
Tbull

Posted: Thu Jan 09, 2003 1:24 pm
by hedge
Ok, this is how I would do this with Oracle, it uses a sub-select so not sure if your DB supports that or not.

Code: Select all

select event, count(*), max(conc.conc_count), avg(conc.conc_count), sum(duration)
  from A, (select rid, sum(a2.rid)
            from a a1, a a2
           where a.evdate_time <= a2.ifCalc'd
              and a.ifcalc'd >= a.2.evDate_Time) conc
 where A.rid = conc.rid
group by event

Posted: Thu Jan 09, 2003 1:58 pm
by Tbull
Hedge,

Thank you. MySQL doesn't support subquery as Oracle does at this time. One of life's annoyances, and MySQL's flaws.

I see what you are suggesting and will explore this from a mutliple array standpoint. As to the data have to work with what is given. Actually the duration time may be a blessing, because when converted into seconds and added for the calc time, it has the anamoly of 24:mm:ss which is easy to evaluate against *smirk*.

Thanks again,
Tbull

Posted: Thu Jan 09, 2003 3:38 pm
by DaiWelsh
Tbull wrote:the concept of duration isn't the problem. It is the number of concurrent events. MAXIMUM number in anyone day that are occuring at the same time.
In the original post I explained a little better, max concurrency should be ok to find if that is all you need, though it is a two step process, writing a temporary table then reading the max values out of that. I can flesh that out if required, however you also asked about average concurrency in your original post I think. To find the mean concurrency over a period of time you will need to find what proportion of that period was spent at each level of concurrency, e.g. of 72 hours, 48 hours had no events occurring, 20 hours had only 1, 3 hours had 2 concurrent events and 1 hour had 3 concurrent. then you multiply number of concurrent events by time spent with that number of concurrent events to get the average (mean).

It was the need to get this mean value that lead to my more complicated approach, but if I have misunderstood your requirements then there may be an easier way to do it.

Dai

Posted: Fri Jan 10, 2003 6:11 am
by Tbull
Dai,
I will welcome any expansion (fleshing out) you might be able to present. The MAX is always the maximum number of concurrent events for the eventype in a 1 day (24 clock hr) period. As an example, if 4 records of eventype1 had times which overlapped on day 1 and 2 records out of 5 on day 2, then the MAX is 4. The average would be (4+2)/2 for a 2 day sample. Likewise if the same sample was for a 7 day, then it would be (4+2)/7. The level of complexity is straight forward. The number of days is always known as one of the selection criteria is a beginning and ending date, and that must be at least 1.

I will be curious about performance and disk space using a temp. The other question is whether "permission" to write a file will be an issue.

Thank you for any consideration you may provide.

Best to you,
Tbull

Posted: Fri Jan 10, 2003 7:37 am
by DaiWelsh
Ah, I see, so it is the average of maximum concurrency in a day you want, rather than the average concurrency over the period, my bad.

I have seen some flaws with my approach so I dont think it will help but I will show you what I was thinking in case it gives you any inspiration. The method I was thinking of involves joining the table to itself. The approximate SQL would be

SELECT t1.RID,t1.Event,DayofYear(t1.EvDate_Time) AS EventDay,count(t2.RID) AS Conc FROM Test t1 INNER JOIN Test As t2 ON ((t1.Event = t2.Event) AND ((t1.EvDate_Time + t1.Duration) >= t2.EvDate_Time) AND ((t2.EvDate_Time + t2.Duration) >= t1.EvDate_Time)) GROUP BY t1.RID

and I was thinking of doing an INSERT INTO with this SELECT to store the results in a temporary table you could then do something like (untested)

SELECT Event,Max(Conc),Avg(Conc) FROM TempTable GROUP BY Event,EventDay

However I doubt how efficient that is going to be given that you are creating a near duplicate of all records in the chosen period in the temp table just to get the conc value. Also as mentioned, I noticed a flaw while writing this, namely that this gives for each event the number of other events that overlap with it, but does not tell you whether they overlap with each other, e.g.

if x starts then y starts then x finishes then z starts then y finishes then z finishes, I believe you would want comnc=2 as only 2 events are concurrent, but this query would give 3 as both x and z overlap with y, but not with each other.

Guess it would be back to the multiple joins with itself approach but you are talking a lot of queries there and I have to feel you would be better of doing it in PHP from the raw data or preparing the data in advance to get the columns you need ready myself.

Sorry I cant get closer to what you need.

Dai

Thank you.

Posted: Fri Jan 10, 2003 12:47 pm
by Tbull
Dai,

Thank you for your assistance, and providing the logic and structure you had in mind. I am going to have a play with the approach you provided. I know that there is some kind of processing that I can do in the query that will assist in this.

In the meantime, know that you your efforts are very much appreciated.

Tbull