# V RegEx (Regular expression) 1.0 alpha
[TOC]
## Introduction
Here are the assumptions made while writing the implementation; they are valid
for all the `regex` module features:
1. The matching stops at the end of the string, *not* at newline characters.
2. The basic atomic elements of this regex engine are the tokens.
In a query string a simple character is a token.
## Differences with PCRE:
NB: We must point out that the **V-Regex module is not PCRE compliant**, and thus
some behaviour will be different. This difference stems from the V philosophy
of having one way to do things and keeping it simple.
The main differences can be summarized in the following points:
- The basic element **is the token, not the sequence of symbols**, and the
simplest token is a single character.
- `|` **the OR operator acts on tokens,** for example `abc|ebc` is not
`abc` OR `ebc`. Instead it is evaluated like `ab`, followed by `c OR e`,
followed by `bc`, because the **token is the base element**,
not the sequence of symbols.
- The **match operation stops at the end of the string**. It does *NOT* stop
at new line characters.
## Tokens
The tokens are the atomic units, used by this regex engine.
They can be one of the following:
### Simple char
This token is a simple single character like `a` or `b` etc.
### Match positional delimiters
`^` Matches the start of the string.
`$` Matches the end of the string.
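A minimal sketch of the anchors in use, assuming only the `regex_opt` and
`match_string` functions described later in this document:
```v ignore
// '^' and '$' anchor the match to the whole input string
mut re := regex.regex_opt(r'^ciao$') or { panic(err) }
start, end := re.match_string('ciao')
// expected: start == 0 and end == 4, since the whole string is matched
```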
### Char class (cc)
The character classes match all the chars specified inside. Use square
brackets `[ ]` to enclose them.
The sequence of the chars in the character class, is evaluated with an OR op.
For example, the cc `[abc]`, matches any character, that is `a` or `b` or `c`,
but it doesn't match `C` or `z`.
Inside a cc, it is possible to specify a "range" of characters, for example
`[ad-h]` is equivalent to writing `[adefgh]`.
A cc can have different ranges at the same time, for example `[a-zA-Z0-9]`
matches all the latin lowercase, uppercase and numeric characters.
It is possible to negate the meaning of a cc, using the caret char at the
start of the cc like this: `[^abc]`. That matches every char that is NOT
`a` or `b` or `c`.
A cc can contain meta-chars like `[a-z\d]`, which matches all the lowercase
latin chars `a-z` and all the digits `\d`.
It is possible to mix all the properties of the char class together.
NB: In order to match the `-` (minus) char, it must be preceded by
a backslash in the cc, for example `[\-_\d\a]` will match:
`-` minus,
`_` underscore,
`\d` numeric chars,
`\a` lower case chars.
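Putting these rules together, here is a short sketch of a character class in
use (only the `regex_opt` and `match_string` functions described later in this
document are assumed):
```v ignore
// a cc with two ranges: matches one or more hex digits
mut re := regex.regex_opt(r'[a-fA-F0-9]+') or { panic(err) }
start, end := re.match_string('C0FFEE')
// expected: start == 0 and end == 6, the whole string is made of hex chars
```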
### Meta-chars
A meta-char is specified by a backslash before a character.
For example, `\w` is the meta-char `w`.
A meta-char can match different types of characters, as shown in the sketch
after this list.
* `\w` matches a word char `[a-zA-Z0-9_]`
* `\W` matches a non word char
* `\d` matches a digit `[0-9]`
* `\D` matches a non digit
* `\s` matches a space char, one of `[' ','\t','\n','\r','\v','\f']`
* `\S` matches a non space char
* `\a` matches only a lowercase char `[a-z]`
* `\A` matches only an uppercase char `[A-Z]`
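A minimal sketch using a couple of the meta-chars above (the expected indexes
follow directly from the definitions):
```v ignore
// \d matches a digit, \s a space char, \w a word char
mut re := regex.regex_opt(r'\d\d\s\w+') or { panic(err) }
start, end := re.match_string('42 apples')
// expected: start == 0 and end == 9 (two digits, a space, then word chars)
```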
### Quantifier
Each token can have a quantifier that specifies how many times the token
must be matched.
#### **Short quantifiers**
- `?` matches 0 or 1 time, `a?b` matches both `ab` and `b`
- `+` matches *at least* 1 time, for example, `a+` matches both `aaa` and `a`
- `*` matches 0 or more times, for example, `a*b` matches `aaab`, `ab` or `b`
#### **Long quantifiers**
- `{x}` matches exactly x times, `a{2}` matches `aa`, but not `aaa` or `a`
- `{min,}` matches at least min times, `a{2,}` matches `aaa` or `aa`, not `a`
- `{,max}` matches at least 0 times and at maximum max times,
for example, `a{,2}` matches `a` and `aa`, but doesn't match `aaa`
- `{min,max}` matches from min times, to max times, for example
`a{2,3}` matches `aa` and `aaa`, but doesn't match `a` or `aaaa`
A long quantifier may have a `greedy off` flag, i.e. the `?`
character after the brackets. `{2,4}?` means to match the minimum
possible number of tokens, in this case 2.
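A small sketch that combines short and long quantifiers (the expected results
follow from the rules above):
```v ignore
mut re := regex.regex_opt(r'a{2,3}b?c+') or { panic(err) }
s1, e1 := re.match_string('aabcc') // expected: matches the whole string (s1 == 0, e1 == 5)
s2, _ := re.match_string('abc') // expected: s2 == -1, only one 'a' while at least 2 are required
println('$s1 $e1 $s2')
```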
### Dot char
The dot is a special meta-char that matches "any char".
It is easier to explain with an example:
Suppose you have `abccc ddeef` as a source string that you want to parse
with a regex. The following table shows the query strings and the result of
parsing the source string.
| query string | result |
|--------------|-------------|
| `.*c` | `abc` |
| `.*dd` | `abcc dd` |
| `ab.*e` | `abccc dde` |
| `ab.{3} .*e` | `abccc dde` |
The dot matches any character, until the next token match is satisfied.
**Important Note:** *Consecutive dots, for example `...`, are not allowed.*
*This will cause a syntax error. Use a quantifier instead.*
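As a sketch, the second row of the table above can be reproduced like this:
```v ignore
txt := 'abccc ddeef'
mut re := regex.regex_opt(r'.*dd') or { panic(err) }
start, end := re.match_string(txt)
// expected: txt[start..end] == 'abccc dd', as in the table above
```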
### OR token
The token `|` means a logic OR operation between two consecutive tokens,
i.e. `a|b` matches a character that is `a` or `b`.
The OR token can work in a "chained way": `a|(b)|cd` means test first `a`,
if the char is not `a`, then test the group `(b)`, and if the group doesn't
match either, finally test the token `c`.
NB: **unlike in PCRE, the OR operation works at token level!**
It doesn't work at concatenation level!
That also means that a query string like `abc|bde` is not equal to
`(abc)|(bde)`, but instead to `ab(c|b)de`.
The OR operation works only for `c|b`, not at char concatenation level.
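A short sketch of the token-level OR: per the rule above, the query `gra|ey`
behaves like `gr`, then `a` OR `e`, then `y` (the expected results assume that
semantics):
```v ignore
mut re := regex.regex_opt(r'gra|ey') or { panic(err) }
s1, _ := re.match_string('gray') // expected: matches, the OR picks 'a'
s2, _ := re.match_string('grey') // expected: matches, the OR picks 'e'
println('$s1 $s2')
```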
### Groups
Groups are a method to create complex patterns with repetitions of blocks
of tokens. The groups are delimited by round brackets `( )`. Groups can be
nested. Like all other tokens, groups can have a quantifier too.
`c(pa)+z` matches `cpapaz` or `cpaz` or `cpapapaz`.
`(c(pa)+z ?)+` matches `cpaz cpapaz cpapapaz` or `cpapaz`
Let's analyze this last case: first we have the group `#0`, that is the
outermost pair of round brackets `(...)+`. This group has a quantifier `+`,
which says to match its content *at least one time*.
Then we have a simple char token `c`, and a second group `#1`: `(pa)+`.
This group also tries to match the sequence `pa`, *at least one time*,
as specified by the `+` quantifier.
Then, we have another simple token `z` and another simple token ` ?`,
i.e. the space char (ascii code 32) followed by the `?` quantifier,
which means that the preceding space should be matched 0 or 1 time.
This explains why the `(c(pa)+z ?)+` query string,
can match `cpaz cpapaz cpapapaz` .
In this implementation the groups are "capture groups". This means that the
most recent match of each group can be retrieved from the `RE` struct.
The "capture groups" are stored as indexes in the field `groups`,
that is an `[]int` inside the `RE` struct.
**example:**
```v oksyntax
text := 'cpaz cpapaz cpapapaz'
query := r'(c(pa)+z ?)+'
mut re := regex.regex_opt(query) or { panic(err) }
println(re.get_query())
// #0(c#1(pa)+z ?)+
// #0 and #1 are the ids of the groups; they are shown if re.debug is 1 or 2
start, end := re.match_string(text)
// [start=0, end=20] match => [cpaz cpapaz cpapapaz]
mut gi := 0
for gi < re.groups.len {
if re.groups[gi] >= 0 {
println('${gi / 2} :[${text[re.groups[gi]..re.groups[gi + 1]]}]')
}
gi += 2
}
// groups captured
// 0 :[cpapapaz]
// 1 :[pa]
```
**note:** *to show the `group id number` in the result of `get_query()`,*
*the `debug` field of the RE object must be `1` or `2`.*
In order to simplify the use of the captured groups, it is possible to use the
utility function `get_group_list`.
This function returns a list of groups using this support struct:
```v oksyntax
pub struct Re_group {
pub:
start int = -1
end int = -1
}
```
Here is an example of use:
```v oksyntax
/*
This simple function converts an HTML RGB value with 3 or 6 hex digits to
a u32 value. This function is not optimized and is only for didactic
purposes. Example: #A0B0CC #A9F
*/
fn convert_html_rgb(in_col string) u32 {
mut n_digit := if in_col.len == 4 { 1 } else { 2 }
mut col_mul := if in_col.len == 4 { 4 } else { 0 }
// this is the regex query; it uses V string interpolation to customize the regex query
// NOTE: if you want to use escape codes you must use the r"" (raw) strings,
// *** please remember that V interpolation doesn't work on raw strings. ***
query := '#([a-fA-F0-9]{$n_digit})([a-fA-F0-9]{$n_digit})([a-fA-F0-9]{$n_digit})'
mut re := regex.regex_opt(query) or { panic(err) }
start, end := re.match_string(in_col)
println('start: $start, end: $end')
mut res := u32(0)
if start >= 0 {
group_list := re.get_group_list() // this is the utility function
r := ('0x' + in_col[group_list[0].start..group_list[0].end]).int() << col_mul
g := ('0x' + in_col[group_list[1].start..group_list[1].end]).int() << col_mul
b := ('0x' + in_col[group_list[2].start..group_list[2].end]).int() << col_mul
println('r: $r g: $g b: $b')
res = u32(r) << 16 | u32(g) << 8 | u32(b)
}
return res
}
```
Other utility functions are `get_group_by_id` and `get_group_bounds_by_id`,
which retrieve the string and the boundaries of a group directly, using its `id`:
```v ignore
txt := "my used string...."
for g_index := 0; g_index < re.group_count ; g_index++ {
println("#${g_index} [${re.get_group_by_id(txt, g_index)}] \
bounds: ${re.get_group_bounds_by_id(g_index)}")
}
```
More helper functions are listed in the **Groups query functions** section.
### Groups Continuous saving
In particular situations, it is useful to have a continuous group saving.
This is possible by initializing the `group_csave` field in the `RE` struct.
This feature allows you to collect data in a continuous/streaming way.
In the example, we can pass a text, followed by an integer list,
that we wish to collect. To achieve this task, we can use the continuous
group saving, by enabling the right flag: `re.group_csave_flag = true`.
The `.group_csave` array will then be filled following this logic:
`re.group_csave[0]` - number of total saved records
`re.group_csave[1+n*3]` - id of the saved group
`re.group_csave[2+n*3]` - start index in the source string of the saved group
`re.group_csave[3+n*3]` - end index in the source string of the saved group
The regex will save groups until it finishes, or until the array runs out of
space. If the space ends, no error is raised, and further records will simply
not be saved.
```v ignore
import regex
fn main(){
txt := "http://www.ciao.mondo/hello/pippo12_/pera.html"
query := r"(?P<format>https?)|(?P<format>ftps?)://(?P<token>[\w_]+.)+"
mut re := regex.regex_opt(query) or { panic(err) }
//println(re.get_code()) // uncomment to see the print of the regex execution code
re.debug=2 // enable maximum log
println("String: ${txt}")
println("Query : ${re.get_query()}")
re.debug=0 // disable log
re.group_csave_flag = true
start, end := re.match_string(txt)
if start >= 0 {
println("Match ($start, $end) => [${txt[start..end]}]")
} else {
println("No Match")
}
if re.group_csave_flag == true && start >= 0 && re.group_csave.len > 0{
println("cg: $re.group_csave")
mut cs_i := 1
for cs_i < re.group_csave[0]*3 {
g_id := re.group_csave[cs_i]
st := re.group_csave[cs_i+1]
en := re.group_csave[cs_i+2]
println("cg[$g_id] $st $en:[${txt[st..en]}]")
cs_i += 3
}
}
}
```
The output will be:
```
String: http://www.ciao.mondo/hello/pippo12_/pera.html
Query : #0(?P<format>https?)|{8,14}#0(?P<format>ftps?)://#1(?P<token>[\w_]+.)+
Match (0, 46) => [http://www.ciao.mondo/hello/pippo12_/pera.html]
cg: [8, 0, 0, 4, 1, 7, 11, 1, 11, 16, 1, 16, 22, 1, 22, 28, 1, 28, 37, 1, 37, 42, 1, 42, 46]
cg[0] 0 4:[http]
cg[1] 7 11:[www.]
cg[1] 11 16:[ciao.]
cg[1] 16 22:[mondo/]
cg[1] 22 28:[hello/]
cg[1] 28 37:[pippo12_/]
cg[1] 37 42:[pera.]
cg[1] 42 46:[html]
```
### Named capturing groups
This regex module partially supports the PCRE question mark `?` syntax for groups.
`(?:abcd)` **non capturing group**: the content of the group will not be saved.
`(?P<mygroup>abcdef)` **named group:** the group content is saved and labeled
as `mygroup`.
The labels of the groups are saved in the `group_map` of the `RE` struct,
which is a map from `string` to `int`, where the value is the index in the
`group_csave` list of indexes.
Here is an example of how to use them:
```v ignore
import regex
fn main(){
txt := "http://www.ciao.mondo/hello/pippo12_/pera.html"
query := r"(?P<format>https?)|(?P<format>ftps?)://(?P<token>[\w_]+.)+"
mut re := regex.regex_opt(query) or { panic(err) }
//println(re.get_code()) // uncomment to see the print of the regex execution code
re.debug=2 // enable maximum log
println("String: ${txt}")
println("Query : ${re.get_query()}")
re.debug=0 // disable log
start, end := re.match_string(txt)
if start >= 0 {
println("Match ($start, $end) => [${txt[start..end]}]")
} else {
println("No Match")
}
for name in re.group_map.keys() {
println("group:'$name' \t=> [${re.get_group_by_name(txt, name)}] \
bounds: ${re.get_group_bounds_by_name(name)}")
}
}
```
Output:
```
String: http://www.ciao.mondo/hello/pippo12_/pera.html
Query : #0(?P<format>https?)|{8,14}#0(?P<format>ftps?)://#1(?P<token>[\w_]+.)+
Match (0, 46) => [http://www.ciao.mondo/hello/pippo12_/pera.html]
group:'format' => [http] bounds: (0, 4)
group:'token' => [html] bounds: (42, 46)
```
In order to simplify the use of the named groups, it is possible to use the
name map in the `re` struct, through functions like `re.get_group_by_name`
and `re.get_group_bounds_by_name`.
Here is a more complex example of using them:
```v oksyntax
// This function demonstrates the use of named groups
fn convert_html_rgb_n(in_col string) u32 {
mut n_digit := if in_col.len == 4 { 1 } else { 2 }
mut col_mul := if in_col.len == 4 { 4 } else { 0 }
query := '#(?P<red>[a-fA-F0-9]{$n_digit})' + '(?P<green>[a-fA-F0-9]{$n_digit})' +
'(?P<blue>[a-fA-F0-9]{$n_digit})'
mut re := regex.regex_opt(query) or { panic(err) }
start, end := re.match_string(in_col)
println('start: $start, end: $end')
mut res := u32(0)
if start >= 0 {
red_s, red_e := re.get_group_bounds_by_name('red')
r := ('0x' + in_col[red_s..red_e]).int() << col_mul
green_s, green_e := re.get_group_bounds_by_name('green')
g := ('0x' + in_col[green_s..green_e]).int() << col_mul
blue_s, blue_e := re.get_group_bounds_by_name('blue')
b := ('0x' + in_col[blue_s..blue_e]).int() << col_mul
println('r: $r g: $g b: $b')
res = u32(r) << 16 | u32(g) << 8 | u32(b)
}
return res
}
```
Other utilities are `get_group_by_name` and `get_group_bounds_by_name`,
which return the string and the boundaries of a group using its `name`:
```v ignore
txt := "my used string...."
for name in re.group_map.keys() {
println("group:'$name' \t=> [${re.get_group_by_name(txt, name)}] \
bounds: ${re.get_group_bounds_by_name(name)}")
}
```
### Groups query functions
These functions are helpers to query the captured groups:
```v ignore
// get_group_bounds_by_name gets a group's boundaries by its name
pub fn (re RE) get_group_bounds_by_name(group_name string) (int, int)
// get_group_by_name gets a group string by its name
pub fn (re RE) get_group_by_name(in_txt string, group_name string) string
// get_group_bounds_by_id gets a group's boundaries by its id
pub fn (re RE) get_group_bounds_by_id(group_id int) (int,int)
// get_group_by_id gets a group string by its id
pub fn (re RE) get_group_by_id(in_txt string, group_id int) string
struct Re_group {
pub:
start int = -1
end int = -1
}
// get_group_list returns a list of Re_group for the found groups
pub fn (re RE) get_group_list() []Re_group
```
## Flags
It is possible to set some flags in the regex parser that change the behavior
of the parser itself; a short usage sketch follows the flag list below.
```v ignore
// example of flag settings
mut re := regex.new()
re.flag = regex.F_BIN
```
- `F_BIN`: parse the string as bytes, with utf-8 management disabled.
- `F_EFM`: exit on the first char that matches in the query; used by the
  find function.
- `F_MS`: matches only if the start index of the match is 0,
  same as `^` at the start of the query string.
- `F_ME`: matches only if the end index of the match is the last char
  of the input string, same as `$` at the end of the query string.
- `F_NL`: stop the matching if a new line char `\n` or `\r` is found.
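A minimal sketch of setting a flag on a regex created with `regex_opt` (the
flag names are those listed above; the expected results assume the documented
semantics of `F_MS`):
```v ignore
mut re := regex.regex_opt(r'world') or { panic(err) }
re.flag = regex.F_MS // match only if the match starts at index 0
s1, _ := re.match_string('world peace') // expected: s1 == 0
s2, _ := re.match_string('hello world') // expected: s2 == -1, the match would not start at 0
println('$s1 $s2')
```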
## Functions
### Initializer
These functions are helpers that create the `RE` struct;
an `RE` struct can also be created manually if needed.
#### **Simplified initializer**
```v ignore
// regex_opt creates a regex object from the query string and compiles it
pub fn regex_opt(in_query string) ?RE
```
#### **Base initializer**
```v ignore
// new creates a regex object of small size, usually sufficient for ordinary use
pub fn new() RE
```
#### **Custom initialization**
For some particular needs, it is possible to initialize a fully customized regex:
```v ignore
pattern = r"ab(.*)(ac)"
// init custom regex
mut re := regex.RE{}
// max program length, can not be longer then the pattern
re.prog = []Token {len: pattern.len + 1}
// can not be more char class the the length of the pattern
re.cc = []CharClass{len: pattern.len}
re.group_csave_flag = false // true enable continuos group saving if needed
re.group_max_nested = 128 // set max 128 group nested possible
re.group_max = pattern.len>>1 // we can't have more groups than the half of the pattern legth
re.group_stack = []int{len: re.group_max, init: -1}
re.group_data = []int{len: re.group_max, init: -1}
```
### Compiling
After an initializer is used, the regex expression must be compiled with:
```v ignore
// compile_opt compiles the regex, returning an error if the compilation fails
pub fn (mut re RE) compile_opt(in_txt string) ?
```
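Putting the base initializer and the compiler together, a minimal sketch
(only functions listed in this document are used; the expected indexes assume
the first run of word chars is matched):
```v ignore
mut re := regex.new()
re.compile_opt(r'\w+') or { panic(err) }
start, end := re.match_string('hello world')
// expected: start == 0 and end == 5, matching 'hello'
```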
### Matching Functions
These are the matching functions
```v ignore
// match_string tries to match the input string; it returns the start and end index if found, else start is -1
pub fn (mut re RE) match_string(in_txt string) (int,int)
```
## Find and Replace
The following find and replace functions are available:
#### Find functions
```v ignore
// find tries to find the first match in the input string;
// it returns the start and end index if found, else start is -1
pub fn (mut re RE) find(in_txt string) (int,int)
// find_all finds all the "non overlapping" occurrences of the matching pattern;
// it returns a list of start/end indexes like: [3,4,6,8]
// the matches are [3,4] and [6,8]
pub fn (mut re RE) find_all(in_txt string) []int
// find_all_str finds all the "non overlapping" occurrences of the matching pattern;
// it returns a list of strings
// the result is like ["first match","second match"]
pub fn (mut re RE) find_all_str(in_txt string) []string
```
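A sketch showing `find_all` and `find_all_str` on the same input (the expected
values follow the flattened start/end index convention described in the
comments above):
```v ignore
import regex

fn main() {
	txt := 'one 1 two 22 three 333'
	mut re := regex.regex_opt(r'\d+') or { panic(err) }
	idx := re.find_all(txt)
	// expected: [4, 5, 10, 12, 19, 22] => the matches are txt[4..5], txt[10..12] and txt[19..22]
	println(idx)
	strs := re.find_all_str(txt)
	// expected: ['1', '22', '333']
	println(strs)
}
```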
#### Replace functions
```v ignore
// replace returns a string where the matches are replaced with the repl string;
// this function supports groups in the replace string
pub fn (mut re RE) replace(in_txt string, repl string) string
```
The replace string can include group references:
```v ignore
txt := "Today it is a good day."
query := r'(a\w)[ ,.]'
mut re := regex.regex_opt(query)?
res := re.replace(txt, r"__[\0]__")
```
In this example we used group `0` in the replace string (`\0`); the result will be:
```
Today it is a good day. => Tod__[ay]__it is a good d__[ay]__
```
**Note:** only groups from `0` to `9` can be used in replace strings.
If groups are not needed in the replace process, it is possible
to use a quicker function:
```v ignore
// replace_simple returns a string where the matches are replaced with the replace string
pub fn (mut re RE) replace_simple(in_txt string, repl string) string
```
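For instance, a quick sketch of `replace_simple` (the result assumes every
non-overlapping match is replaced):
```v ignore
mut re := regex.regex_opt(r'\d+') or { panic(err) }
res := re.replace_simple('a1 b22 c333', 'N')
// expected: res == 'aN bN cN'
```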
#### Custom replace function
For complex find and replace operations, you can use `replace_by_fn`.
`replace_by_fn` uses a custom replace callback function, thus
allowing customizations. The custom callback function is called for
every non-overlapping find.
The custom callback function must be of this type:
```v ignore
// type of function used for custom replace
// in_txt source text
// start index of the start of the match in in_txt
// end index of the end of the match in in_txt
// --- the match is in in_txt[start..end] ---
fn (re RE, in_txt string, start int, end int) string
```
The following example will clarify its usage:
```v ignore
import regex
// customized replace functions
// it will be called on each non overlapped find
fn my_repl(re regex.RE, in_txt string, start int, end int) string {
g0 := re.get_group_by_id(in_txt, 0)
g1 := re.get_group_by_id(in_txt, 1)
g2 := re.get_group_by_id(in_txt, 2)
return "*$g0*$g1*$g2*"
}
fn main(){
txt := "today [John] is gone to his house with (Jack) and [Marie]."
query := r"(.)(\A\w+)(.)"
mut re := regex.regex_opt(query) or { panic(err) }
result := re.replace_by_fn(txt, my_repl)
println(result)
}
```
Output:
```
today *[*John*]* is gone to his house with *(*Jack*)* and *[*Marie*]*.
```
## Debugging
This module has a few small utilities to help you write regex patterns.
### **Syntax errors highlight**
The next example code shows how to visualize regex pattern syntax errors
in the compilation phase:
```v oksyntax
query := r'ciao da ab[ab-]'
// there is an error, a range not closed!!
mut re := new()
re.compile_opt(query) or { println(err) }
// output!!
// query: ciao da ab[ab-]
// err : ----------^
// ERROR: ERR_SYNTAX_ERROR
```
### **Compiled code**
It is possible to view the compiled code by calling the function `get_code()`.
The result will be something like this:
```
========================================
v RegEx compiler v 1.0 alpha output:
PC: 0 ist: 92000000 ( GROUP_START #:0 { 1, 1}
PC: 1 ist: 98000000 . DOT_CHAR nx chk: 4 { 1, 1}
PC: 2 ist: 94000000 ) GROUP_END #:0 { 1, 1}
PC: 3 ist: 92000000 ( GROUP_START #:1 { 1, 1}
PC: 4 ist: 90000000 [\A] BSLS { 1, 1}
PC: 5 ist: 90000000 [\w] BSLS { 1,MAX}
PC: 6 ist: 94000000 ) GROUP_END #:1 { 1, 1}
PC: 7 ist: 92000000 ( GROUP_START #:2 { 1, 1}
PC: 8 ist: 98000000 . DOT_CHAR nx chk: -1 last! { 1, 1}
PC: 9 ist: 94000000 ) GROUP_END #:2 { 1, 1}
PC: 10 ist: 88000000 PROG_END { 0, 0}
========================================
```
`PC`:`int` is the program counter or step of execution, each single step is a token.
`ist`:`hex` is the token instruction id.
`[a]` is the char used by the token.
`query_ch` is the type of token.
`{m,n}` is the quantifier; the greedy off flag `?` will be shown if present in the token.
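As a sketch, a listing like the one above can be obtained by compiling a query
with three groups and printing its code (the query here is the one used in the
custom replace example above; the exact listing may vary between versions):
```v ignore
mut re := regex.regex_opt(r'(.)(\A\w+)(.)') or { panic(err) }
println(re.get_code()) // prints the compiled program listing
```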
### **Log debug**
The log debugger allows printing the status of the regex parser while the
parser is running. Two different levels of debug information are
available: 1 is normal, while 2 is verbose.
Here is an example:
*normal* - lists only the token instructions with their values
```ignore
// re.debug = 1 // log level normal
flags: 00000000
# 2 s: ist_load PC: i,ch,len:[ 0,'a',1] f.m:[ -1, -1] query_ch: [a]{1,1}:0 (#-1)
# 5 s: ist_load PC: i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [b]{2,3}:0? (#-1)
# 7 s: ist_load PC: i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
# 10 PROG_END
```
*verbose* - lists all the instructions and states of the parser
```ignore
flags: 00000000
# 0 s: start PC: NA
# 1 s: ist_next PC: NA
# 2 s: ist_load PC: i,ch,len:[ 0,'a',1] f.m:[ -1, -1] query_ch: [a]{1,1}:0 (#-1)
# 3 s: ist_quant_p PC: i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [a]{1,1}:1 (#-1)
# 4 s: ist_next PC: NA
# 5 s: ist_load PC: i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [b]{2,3}:0? (#-1)
# 6 s: ist_quant_p PC: i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
# 7 s: ist_load PC: i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
# 8 s: ist_quant_p PC: i,ch,len:[ 3,'b',1] f.m:[ 0, 2] query_ch: [b]{2,3}:2? (#-1)
# 9 s: ist_next PC: NA
# 10 PROG_END
# 11 PROG_END
```
The columns have the following meaning:
`# 2` the number of the current step since the start of parsing
`s: ist_next` the state of the present step
`PC: 1` the program counter of the step
`=>7fffffff ` the hex code of the instruction
`i,ch,len:[ 0,'a',1]` `i` is the index in the source string, `ch` is the char parsed,
`len` is the length in bytes of the char parsed
`f.m:[ 0, 1]` `f` is the index of the first match in the source string, `m` is the index that is currently matching
`query_ch: [b]` the token in use and its char
`{2,3}:1?` the quantifier `{min,max}`; `:1` is the current repetition counter,
`?` is the greedy off flag if present.
### **Custom Logger output**
The debug functions write to `stdout` by default. It is possible to provide
an alternative output by setting a custom output function:
```v oksyntax
// custom print function, the input will be the regex debug string
fn custom_print(txt string) {
println('my log: $txt')
}
mut re := new()
re.log_func = custom_print
// every debug output from now will call this function
```
## Example code
Here is an example that performs some basic string matching:
```v ignore
import regex
fn main(){
txt := "http://www.ciao.mondo/hello/pippo12_/pera.html"
query := r"(?P<format>https?)|(?P<format>ftps?)://(?P<token>[\w_]+.)+"
mut re := regex.regex_opt(query) or { panic(err) }
start, end := re.match_string(txt)
if start >= 0 {
println("Match ($start, $end) => [${txt[start..end]}]")
for g_index := 0; g_index < re.group_count ; g_index++ {
println("#${g_index} [${re.get_group_by_id(txt, g_index)}] \
bounds: ${re.get_group_bounds_by_id(g_index)}")
}
for name in re.group_map.keys() {
println("group:'$name' \t=> [${re.get_group_by_name(txt, name)}] \
bounds: ${re.get_group_bounds_by_name(name)}")
}
} else {
println("No Match")
}
}
```
Here is an example of fully customized creation of the regex environment:
```v ignore
import regex
fn main(){
txt := "today John is gone to his house with Jack and Marie."
query := r"(?:(?P<word>\A\w+)|(?:\a\w+)[\s.]?)+"
// init regex
mut re := regex.RE{}
// max program length, can not be longer than the query
re.prog = []regex.Token {len: query.len + 1}
// there cannot be more char classes than the length of the query
re.cc = []regex.CharClass{len: query.len}
// enable continuous group saving
re.group_csave_flag = true
// set a max of 128 nested groups
re.group_max_nested = 128
// we can't have more groups than half of the query length
re.group_max = query.len>>1
// compile the query
re.compile_opt(query) or { panic(err) }
start, end := re.match_string(txt)
if start >= 0 {
println("Match ($start, $end) => [${txt[start..end]}]")
} else {
println("No Match")
}
// show results for continuous group saving
if re.group_csave_flag == true && start >= 0 && re.group_csave.len > 0{
println("cg: $re.group_csave")
mut cs_i := 1
for cs_i < re.group_csave[0]*3 {
g_id := re.group_csave[cs_i]
st := re.group_csave[cs_i+1]
en := re.group_csave[cs_i+2]
println("cg[$g_id] $st $en:[${txt[st..en]}]")
cs_i += 3
}
}
// show results for captured groups
if start >= 0 {
println("Match ($start, $end) => [${txt[start..end]}]")
for g_index := 0; g_index < re.group_count ; g_index++ {
println("#${g_index} [${re.get_group_by_id(txt, g_index)}] \
bounds: ${re.get_group_bounds_by_id(g_index)}")
}
for name in re.group_map.keys() {
println("group:'$name' \t=> [${re.get_group_by_name(txt, name)}] \
bounds: ${re.get_group_bounds_by_name(name)}")
}
} else {
println("No Match")
}
}
```
More examples are available in the test code for the `regex` module,
see `vlib/regex/regex_test.v`.