
Head scratching regex problem for a newbie

Posted: Sat Aug 23, 2008 12:39 pm
by omit46
Hi everyone,
I am a complete newbie with regular expressions. I went through a lot of regular expression tutorials but still can't figure out how to solve this problem. Maybe the regex gurus can help.

What I am trying to do is this: from the following text I first want to find the lines matching a particular pattern. The expression "\\S+([ \\t]+-?[0-9.]+){8}" gives me all the lines I am looking for. Out of these lines I want to check whether two lines start with the same word. If such a match is found, I want to add the open, high, low and close values of the 2nd line to the 1st line and then remove the 2nd line from the text. I hope it doesn't sound too complicated. Is it possible? Or is it too difficult and too much for regex to handle?

E.g. "\\S+([ \\t]+-?[0-9.]+){8}" matches all the stock lines from the "A Group" section through the Spot Transactions section. The stock "LEGACYFOOT" appears on two lines (once in the Z group and once in the spot transactions). I want to add the open, high, low and close of LEGACYFOOT in Spot Transactions to the "LEGACYFOOT" line in the Z group, and then delete the 2nd "LEGACYFOOT" line from the text.

thanks
omit


the text:(edited to simplify)

DHAKA STOCK EXCHANGE LTD.




TODAY'S SHARE MARKET : 2008-08-21
=================================
(If the page is not updated please press the refresh button)


EQUITY : 745081109873.65
DEBT SECURITIES : 202154936500.00


TOTAL : 947236046373.65







PRICES IN PUBLIC TRANSACTIONS : 2008-08-21
==========================================
A Group
-------

Instr Code Open High Low Close %Chg Trade Volume Value(Lc)

1STBSRS 705.00 710.00 686.00 691.25 -.18 85 5650 39.365
1STICB 5200.00 5250.00 5200.00 5224.75 4.22 6 40 2.090
2NDICB 1650.00 1650.00 1561.00 1583.00 -.07 9 75 1.187
3RDICB 1020.25 1036.00 1020.25 1029.50 -.50 6 85 .875
4THICB 1006.25 1050.00 1006.25 1035.00 1.42 11 160 1.656
MIRACLEIND 26.20 27.00 26.10 26.80 3.87 64 60000 15.965
MITHUNKNIT 184.50 185.00 176.00 180.75 .97 21 960 1.739
QSMDRYCELL 37.50 38.50 37.30 38.00 3.26 191 150500 57.158
RAHIMTEXT 390.00 420.00 390.00 410.00 5.12 2 30 .123
RANFOUNDRY 59.50 62.00 58.90 61.50 5.12 126 81000 49.217
UTTARABANK 2849.00 2956.50 2848.00 2900.25 2.94 2892 48130 1404.152
UTTARAFIN 766.00 825.00 766.00 819.75 5.06 179 15350 124.758
----- -------- ---------
----- -------- ---------
55122 12922983 22780.243
"A Group" Scrips traded in Public Market = 146



B Group
-------

Instr Code Open High Low Close %Chg Trade Volume Value(Lc)

AGRANINS 213.00 239.00 213.00 226.50 8.76 179 17550 39.805
BDAUTOCA 157.00 159.75 153.00 156.00 -.63 23 875 1.366
NITOLINS 332.25 357.00 332.25 340.00 2.10 67 6250 21.441
SONARBAINS 145.00 153.00 143.75 150.50 6.54 99 11800 17.340
----- -------- ---------
----- -------- ---------
741 223380 154.313
"B Group" Scrips traded in Public Market = 12




G Group
-------

"G Group" Scrips traded in Public Market = 0




N Group
-------

Instr Code Open High Low Close %Chg Trade Volume Value(Lc)

CONTININS 228.00 240.00 215.00 231.75 8.29 153 11450 26.258
DBH 1180.00 1249.00 1155.00 1224.75 6.63 94 5150 61.723
MPETROLEUM 131.50 133.00 129.90 130.60 2.03 496 96800 126.857
TITASGAS 354.50 357.75 344.00 350.75 .64 2126 379400 1331.344
----- -------- ---------
----- -------- ---------
3979 742515 1729.643

"N Group" Scrips traded in Public Market = 8




Z Group
-------

Instr Code Open High Low Close %Chg Trade Volume Value(Lc)

ALLTEX 68.75 73.00 68.50 71.75 4.36 18 1500 1.078
ANLIMAYARN 50.25 50.25 50.00 50.00 3.62 2 150 .075
LAFSURCEML 568.00 582.00 567.00 577.50 1.27 206 18550 107.275
LEGACYFOOT 14.80 17.00 14.80 16.50 10.73 77 64000 10.261
LEXCO 122.00 124.00 122.00 122.50 4.25 2 70 .086
SHYAMPSUG 10.90 10.90 10.90 10.90 3.80 6 700 .076
SOCIALINV 365.50 375.00 365.00 371.00 2.77 584 52100 193.397
WATACHEM 305.25 312.25 305.25 311.25 4.01 6 180 .560
WONDERTOYS 60.75 62.50 59.25 61.50 2.50 21 2700 1.662
ZEALBANGLA 14.50 14.90 14.50 14.60 .68 7 3900 .570
----- -------- ---------
----- -------- ---------
2888 467200 962.587
"Z Group" Scrips traded in Public Market = 60

===========================

62730 14356078 25626.792

Total number of scrips traded in Public Market = 226







PRICES IN SPOT TRANSACTIONS : 2008-08-21
==========================================

Instr Code Open High Low Close %Chg Trade Volume Value(Lc)

LEGACYFOOT 14.80 16.80 16.00 16.50 10.73 9 9000 1.461
PUBALIBANK 859.00 872.75 853.00 857.00 1.48 1216 38105 328.444
----- -------- ---------
----- -------- ---------
1225 47105 329.904


Total number of scrips traded in Spot Market = 2






PRICES IN SPOT TRANSACTIONS (BONDs) : 2008-08-21
==================================================

Total number of BONDs traded in Spot Market = 0






PRICES IN ODDLOT TRANSACTIONS : 2008-08-21
============================================

Instr Code Max Price Min Price Trades Quantity Value(In lakhs)

ABBANK 909.00 902.00 2 4 .036
ACI 475.00 475.00 2 30 .143
AGNISYSL 67.00 60.10 4 540 .345
ALARABANK 465.00 395.00 19 354 1.534
APEXADELFT 2600.00 2600.00 3 30 .780
UTTARABANK 2950.00 2950.00 1 1 .030
UTTARAFIN 800.00 800.00 3 62 .496
------ -------- ------------
------ -------- ------------
438 12122 27.815
Total number of scrips traded in Oddlot = 75






PRICES IN BLOCK TRANSACTIONS : 2008-08-21
===========================================

Total number of scrips traded in Block = 0

Re: Head scratching regex problem for a newbie

Posted: Sat Aug 23, 2008 3:51 pm
by prometheuzz
Parts can be done using regex, but not all of it: comparing strings is not something regex can handle. In what language are you implementing this? Java?

Re: Head scratching regex problem for a newbie

Posted: Sat Aug 23, 2008 4:07 pm
by omit46
thanks. I am doing this in c# and .net. I know how to replace it once I find the duplicate stocks. But I can't figure out how to search for it using regex.

Re: Head scratching regex problem for a newbie

Posted: Sat Aug 23, 2008 4:17 pm
by prometheuzz
omit46 wrote:thanks. I am doing this in c# and .net. I know how to replace it once I find the duplicate stocks. But I can't figure out how to search for it using regex.
Like I said: comparing/searching strings is not something a regex engine can/should do. Matching the lines you're interested in (as you are doing now) is indeed a job for regex. Those matches can then be stored in a map-like collection (in C# that would be a Dictionary<string, T>; a NameValueCollection only holds string values). The name of the stock is your key, and a custom class/object holding those values is the value belonging to that key. It seems like the regex part of your problem has already been done: the rest is up to C#.
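
A minimal sketch of that idea, written in Python only to keep it short (the language, the function name group_by_symbol and the sample rows are my assumptions, not part of the thread; in C# the dict would be a Dictionary). It reuses the line pattern from the first post and collects the numeric columns per stock name, so a duplicate is simply a key with more than one row:

```python
import re

# Line pattern from the first post: a symbol followed by 8 numeric columns.
LINE = re.compile(r"^(\S+)((?:[ \t]+-?[0-9.]+){8})[ \t]*$", re.MULTILINE)

def group_by_symbol(text):
    """Map each symbol to the list of numeric rows found for it, in order."""
    rows = {}
    for m in LINE.finditer(text):
        symbol = m.group(1)
        values = [float(v) for v in m.group(2).split()]
        rows.setdefault(symbol, []).append(values)
    return rows

# Hypothetical excerpt built from rows of the posted text.
sample = """\
ALLTEX 68.75 73.00 68.50 71.75 4.36 18 1500 1.078
LEGACYFOOT 14.80 17.00 14.80 16.50 10.73 77 64000 10.261
LEGACYFOOT 14.80 16.80 16.00 16.50 10.73 9 9000 1.461
PUBALIBANK 859.00 872.75 853.00 857.00 1.48 1216 38105 328.444
"""

rows = group_by_symbol(sample)
# A symbol is a duplicate when more than one row was stored under its key.
duplicates = {s: r for s, r in rows.items() if len(r) > 1}
print(sorted(duplicates))
```

The comparison of stock names happens in the dictionary lookup, not in the regex, which is the split of work described above.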

Good luck!

Re: Head scratching regex problem for a newbie

Posted: Sat Aug 23, 2008 5:09 pm
by omit46
Thanks. After the matching I should be using C#. But once I have my matching result, all I have to do is "match two lines that start with the same word". Isn't that regex's job?

right now this is what I want to do: Match lines that start with the same word

In the following text two lines start with "2NDICB" and two lines start with "3RDICB". I just want to get the matching lines. Is regex or C#'s string library the better choice for extracting the values from the lines?
1STICB 5200.00 5250.00 5200.00 5224.75 4.22 6 40 2.090
2NDICB 1650.00 1650.00 1561.00 1583.00 -.07 9 75 1.187
3RDICB 1020.25 1036.00 1020.25 1029.50 -.50 6 85 .875
4THICB 1006.25 1050.00 1006.25 1035.00 1.42 11 160 1.656
3RDICB 1650.00 1650.00 1561.00 1583.00 -.07 9 75 1.187
2NDICB 5200.00 5250.00 5200.00 5224.75 4.22 6 40 2.090

Re: Head scratching regex problem for a newbie

Posted: Sun Aug 24, 2008 12:37 am
by prometheuzz
omit46 wrote:...

thanks. After the matching I should be using the c#. But once I get my matching result all I have to do is "match two lines that starts with the same word". Isn't it regex's job?
...
No, the functionality you describe isn't really a regex's job. You could, however, (ab)use regex back references.
Since this is a PHP forum, I'll post an example in PHP:

Code:

<?php
$contents_of_file = '
1STICB 5200.00 5250.00 5200.00 5224.75 4.22 6 40 2.090
2NDICB 1650.00 1650.00 1561.00 1583.00 -.07 9 75 1.187
3RDICB 1020.25 1036.00 1020.25 1029.50 -.50 6 85 .875
4THICB 1006.25 1050.00 1006.25 1035.00 1.42 11 160 1.656
3RDICB 50.00 10.00 61.00 13.00 -.07 9 75 1.187
2NDICB 50.00 50.00 50.00 52.75 4.22 6 40 2.090
';
 
if(preg_match_all(
    '/(^\w++)(?:\s++-?(?:(?:\d+)?\.)?\d+)+$(?=.*?(\1[^\n]++))/ms', 
    $contents_of_file, $matches)) {
  print_r($matches);
}
/* the output when running this example:
Array
(
    [0] => Array
        (
            [0] => 2NDICB 1650.00 1650.00 1561.00 1583.00 -.07 9 75 1.187
            [1] => 3RDICB 1020.25 1036.00 1020.25 1029.50 -.50 6 85 .875
        )
 
    [1] => Array
        (
            [0] => 2NDICB
            [1] => 3RDICB
        )
 
    [2] => Array
        (
            [0] => 2NDICB 50.00 50.00 50.00 52.75 4.22 6 40 2.090
            [1] => 3RDICB 50.00 10.00 61.00 13.00 -.07 9 75 1.187
        )
)
*/
?>
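The regex is this one. It should be reproducible in C#, but as far as I know .NET does not support possessive quantifiers, so there you would replace each "++" with a plain "+" (or use atomic groups):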

Code:

'(^\w++)(?:\s++-?(?:(?:\d+)?\.)?\d+)+$(?=.*?(\1[^\n]++))'
the flags behind it (see the original code snippet) mean:
m = multi line, so that on each line ^ matches the beginning of the line and $ the end of the line; otherwise ^ and $ would only match the beginning and end of the entire string;
s = dot-all, which causes the . (dot) meta character to match all characters. If not used, it wouldn't match a new-line character.

A short explanation:

Code:

'(^\w++)(?:\s++-?(?:(?:\d+)?\.)?\d+)+$'
// Matches a complete line you're interested in and because of the ( and ) 
// around the '^\w+' it "remembers" the first word of your match and stores
// it in back reference "\1".
 
'(?=.*?(\1[^\n]++))'
// (?=X) is called a positive look ahead: it checks that X can be matched from
// the current position without consuming any characters. Here X skips ahead
// lazily over any number of characters (including new-lines, thanks to the s
// flag) until it finds the text stored in back reference 1, and then keeps
// matching until it encounters a new-line character. And again: because of
// the ( and ) around '\1[^\n]++' that match is remembered (as you can see in
// the output of my code snippet).
But now you have only found the duplicates (this won't work for entries that occur more than twice!!!) and, if I understand you correctly, you also want to replace certain matches (that part is not quite clear to me). This will get rather messy if you want to do it mainly with regexes. And for larger strings/files it can (and most probably will!) become very slow, because for every matched line the look ahead has to scan the rest of the text.
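
For comparison, here is a hedged sketch of the same back-reference trick in Python (my choice of language; the names DUP and pairs are made up for illustration). It is not the exact regex above: it uses plain quantifiers, [ \t] instead of \s so a row cannot run over a line break, and it anchors the back reference with ^ and \b so that a name is not found in the middle of a line or as the prefix of a longer name. It pairs each matching line with the first later line that starts with the same word:

```python
import re

sample = """\
1STICB 5200.00 5250.00 5200.00 5224.75 4.22 6 40 2.090
2NDICB 1650.00 1650.00 1561.00 1583.00 -.07 9 75 1.187
3RDICB 1020.25 1036.00 1020.25 1029.50 -.50 6 85 .875
4THICB 1006.25 1050.00 1006.25 1035.00 1.42 11 160 1.656
3RDICB 50.00 10.00 61.00 13.00 -.07 9 75 1.187
2NDICB 50.00 50.00 50.00 52.75 4.22 6 40 2.090
"""

DUP = re.compile(
    r"^(\w+)(?:[ \t]+-?(?:\d*\.)?\d+)+$"  # a full data line; group 1 = first word
    r"(?=.*?^(\1\b[^\n]+))",              # look ahead: a later line starting with it
    re.MULTILINE | re.DOTALL,
)

# (first word, text of the later duplicate line) for every line that has one.
pairs = [(m.group(1), m.group(2)) for m in DUP.finditer(sample)]
for name, later_line in pairs:
    print(name, "->", later_line)
```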

So, my recommendation still stands: find the strings you're interested in using regex and store those matches in a key-value based collection and take steps when a key is found twice. It will be easier to program, and much, much easier to maintain.
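
To make the "take steps when a key is found twice" part concrete, here is a rough sketch in Python (again my assumption; the thread itself is about C#). It assumes that "add" in the first post means numerically summing the Open, High, Low and Close columns and leaving the other columns of the first line untouched, which the thread never confirms. The function name merge_duplicates and the sample text are hypothetical:

```python
import re

# Line pattern from the first post: a symbol followed by 8 numeric columns.
LINE = re.compile(r"^(\S+)((?:[ \t]+-?[0-9.]+){8})[ \t]*$")

def merge_duplicates(text):
    """Sum Open/High/Low/Close of a repeated symbol into its first line
    and drop the repeated line; all other lines are kept unchanged."""
    first_seen = {}  # symbol -> index of its first line in `out`
    out = []
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            out.append(line)          # headers, totals, blank lines
            continue
        symbol, cols = m.group(1), m.group(2).split()
        if symbol not in first_seen:
            first_seen[symbol] = len(out)
            out.append(line)
            continue
        # Key found twice: merge into the first line, do not append this one.
        idx = first_seen[symbol]
        kept = LINE.match(out[idx]).group(2).split()
        summed = [f"{float(a) + float(b):.2f}" for a, b in zip(kept[:4], cols[:4])]
        out[idx] = " ".join([symbol] + summed + kept[4:])
    return "\n".join(out)

sample = """\
Z Group
LEGACYFOOT 14.80 17.00 14.80 16.50 10.73 77 64000 10.261
LEXCO 122.00 124.00 122.00 122.50 4.25 2 70 .086
PRICES IN SPOT TRANSACTIONS : 2008-08-21
LEGACYFOOT 14.80 16.80 16.00 16.50 10.73 9 9000 1.461
PUBALIBANK 859.00 872.75 853.00 857.00 1.48 1216 38105 328.444
"""

merged = merge_duplicates(sample)
print(merged)
```

One pass over the lines is enough, because the dictionary remembers where each stock was first seen.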

Best of luck.

More information on:
back references: http://www.regular-expressions.info/brackets.html
(positive) lookarounds: http://www.regular-expressions.info/lookaround.html

Re: Head scratching regex problem for a newbie

Posted: Sun Aug 24, 2008 6:21 am
by prometheuzz
Thanks for letting me know you've moved the discussion here:
http://regexadvice.com/forums/thread/45573.aspx

Bye.

Re: Head scratching regex problem for a newbie

Posted: Wed Aug 27, 2008 1:39 pm
by omit46
prometheuzz wrote:Thanks for letting me know you've moved the discussion here:
http://regexadvice.com/forums/thread/45573.aspx

Bye.
Finally figured it out.
I posted in other forums to get a reply sooner. Your solution is very simple. Thank you for giving me your time.
cheers
omi