# V RegEx (Regular expression) 1.0 alpha

[TOC]

## Introduction

Here are the assumptions made during the writing of the implementation, which
are valid for all the `regex` module features:

1. The matching stops at the end of the string, *not* at newline characters.

2. The basic atomic elements of this regex engine are the tokens.
In a query string a simple character is a token.


## Differences with PCRE

NB: the **V regex module is not PCRE compliant**, so some behaviour will be
different. This difference is due to the V philosophy: have one way to do
things and keep it simple.

The main differences can be summarized in the following points:

- The basic element **is the token, not the sequence of symbols**, and the
simplest token is a single character.

- `|` **the OR operator acts on tokens**, for example `abc|ebc` is not
`abc` OR `ebc`. Instead it is evaluated like `ab`, followed by `c OR e`,
followed by `bc`, because the **token is the base element**,
not the sequence of symbols.

- The **match operation stops at the end of the string**. It does *NOT* stop
at newline characters.


## Tokens

The tokens are the atomic units used by this regex engine.
They can be one of the following:


### Simple char

This token is a simple single character like `a` or `b`.


### Match positional delimiters

`^` Matches the start of the string.

`$` Matches the end of the string.


### Char class (cc)

A character class matches all the chars specified inside it. Use square
brackets `[ ]` to enclose them.

The sequence of the chars in the character class is evaluated with an OR op.

For example, the cc `[abc]` matches any character that is `a` or `b` or `c`,
but it doesn't match `C` or `z`.

Inside a cc it is possible to specify a "range" of characters, for example
`[ad-h]` is equivalent to writing `[adefgh]`.

A cc can have different ranges at the same time, for example `[a-zA-Z0-9]`
matches all the latin lowercase, uppercase and numeric characters.

It is possible to negate the meaning of a cc, using the caret char at the
start of the cc like this: `[^abc]`. That matches every char that is NOT
`a` or `b` or `c`.

A cc can contain meta-chars like `[a-z\d]`, which matches all the lowercase
latin chars `a-z` and all the digits `\d`.

It is possible to mix all the properties of the char class together.

NB: In order to match the `-` (minus) char, it must be preceded by
 a backslash in the cc, for example `[\-_\d\a]` will match:
 `-` minus,
 `_` underscore,
 `\d` numeric chars,
 `\a` lower case chars.

### Meta-chars

A meta-char is specified by a backslash before a character.
For example `\w` is the meta-char `w`.

A meta-char can match different types of characters.

* `\w` matches a word char `[a-zA-Z0-9_]`
* `\W` matches a non word char
* `\d` matches a digit `[0-9]`
* `\D` matches a non digit
* `\s` matches a space char, one of `[' ','\t','\n','\r','\v','\f']`
* `\S` matches a non space char
* `\a` matches only a lowercase char `[a-z]`
* `\A` matches only an uppercase char `[A-Z]`
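To make char classes and meta-chars concrete, here is a minimal sketch; the
pattern and the input string are illustrative, not taken from the module's
test suite:

```v ignore
import regex

fn main() {
	txt := 'ab45_x'
	// `[a-z]+` is a char class with a range, `\d` and `\w` are meta-chars
	mut re := regex.regex_opt(r'[a-z]+\d\d\w') or { panic(err) }
	start, end := re.match_string(txt)
	if start >= 0 {
		println('[${txt[start..end]}]') // expected output: [ab45_]
	} else {
		println('No Match')
	}
}
```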
### Quantifier

Each token can have a quantifier that specifies how many times the token
must be matched.

#### **Short quantifiers**

- `?` matches 0 or 1 time, `a?b` matches both `ab` or `b`
- `+` matches *at least* 1 time, for example, `a+` matches both `aaa` or `a`
- `*` matches 0 or more times, for example, `a*b` matches `aaab`, `ab` or `b`

#### **Long quantifiers**

- `{x}` matches exactly x times, `a{2}` matches `aa`, but not `aaa` or `a`
- `{min,}` matches at least min times, `a{2,}` matches `aaa` or `aa`, not `a`
- `{,max}` matches at least 0 times and at maximum max times,
  for example, `a{,2}` matches `a` and `aa`, but doesn't match `aaa`
- `{min,max}` matches from min times, to max times, for example
  `a{2,3}` matches `aa` and `aaa`, but doesn't match `a` or `aaaa`

A long quantifier may have a `greedy off` flag, that is the `?`
character after the brackets. `{2,4}?` means to match the minimum
number of possible tokens, in this case 2.

### Dot char

The dot is a particular meta-char that matches "any char".

It is simpler to explain it with an example:

Suppose you have `abccc ddeef` as a source string that you want to parse
with a regex. The following table shows the query strings and the result of
parsing the source string.

| query string | result      |
|--------------|-------------|
| `.*c`        | `abc`       |
| `.*dd`       | `abccc dd`  |
| `ab.*e`      | `abccc dde` |
| `ab.{3} .*e` | `abccc dde` |

The dot matches any character, until the next token match is satisfied.

**Important Note:** *Consecutive dots, for example `...`, are not allowed.*
*This will cause a syntax error. Use a quantifier instead.*

### OR token

The token `|` means a logic OR operation between two consecutive tokens,
i.e. `a|b` matches a character that is `a` or `b`.

The OR token can work in a "chained way": `a|(b)|cd` means test first `a`,
if the char is not `a`, then test the group `(b)`, and if the group doesn't
match either, finally test the token `c`.

NB: **Unlike in PCRE, the OR operation works at token level!**
It doesn't work at concatenation level!

That also means that a query string like `abc|bde` is not equal to
`(abc)|(bde)`, but instead to `ab(c|b)de`.
The OR operation works only for `c|b`, not at char concatenation level.
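Here is a minimal sketch that combines a long quantifier and an OR token;
the pattern and the input string are illustrative:

```v ignore
import regex

fn main() {
	txt := 'baaad'
	// `a{2,3}` matches two or three `a` tokens, `c|d` is an OR between the single tokens c and d
	mut re := regex.regex_opt(r'ba{2,3}(c|d)') or { panic(err) }
	start, end := re.match_string(txt)
	if start >= 0 {
		println('[${txt[start..end]}]') // expected output: [baaad]
	} else {
		println('No Match')
	}
}
```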
### Groups

Groups are a method to create complex patterns with repetitions of blocks
of tokens. The groups are delimited by round brackets `( )`. Groups can be
nested. Like all other tokens, groups can have a quantifier too.

`c(pa)+z` matches `cpapaz` or `cpaz` or `cpapapaz`.

`(c(pa)+z ?)+` matches `cpaz cpapaz cpapapaz` or `cpapaz`.

Let's analyze this last case. First we have the group `#0`, that is the
outermost pair of round brackets `(...)+`. This group has a quantifier `+`,
that says to match its content *at least one time*.

Then we have a simple char token `c`, and a second group `#1`: `(pa)+`.
This group also tries to match the sequence `pa`, *at least one time*,
as specified by the `+` quantifier.

Then we have another simple token `z` and another simple token ` ?`,
i.e. the space char (ASCII code 32) followed by the `?` quantifier,
which means that the preceding space should be matched 0 or 1 time.

This explains why the `(c(pa)+z ?)+` query string
can match `cpaz cpapaz cpapapaz`.

In this implementation the groups are "capture groups". This means that the
last temporary result for each group can be retrieved from the `RE` struct.

The "capture groups" are stored as indexes in the field `groups`,
which is an `[]int` inside the `RE` struct.

**example:**

```v oksyntax
text := 'cpaz cpapaz cpapapaz'
query := r'(c(pa)+z ?)+'
mut re := regex.regex_opt(query) or { panic(err) }
println(re.get_query())
// #0(c#1(pa)+z ?)+
// #0 and #1 are the ids of the groups, shown when re.debug is 1 or 2
start, end := re.match_string(text)
// [start=0, end=20] match => [cpaz cpapaz cpapapaz]
mut gi := 0
for gi < re.groups.len {
	if re.groups[gi] >= 0 {
		println('${gi / 2} :[${text[re.groups[gi]..re.groups[gi + 1]]}]')
	}
	gi += 2
}
// groups captured
// 0 :[cpapapaz]
// 1 :[pa]
```

**note:** *to show the `group id number` in the result of `get_query()`,*
*the `debug` flag of the RE object must be `1` or `2`.*

In order to simplify the use of the captured groups, it is possible to use
the utility function `get_group_list`.

This function returns a list of groups using this support struct:

```v oksyntax
pub struct Re_group {
pub:
	start int = -1
	end   int = -1
}
```

Here is an example of its use:

```v oksyntax
/*
This simple function converts an HTML RGB value with 3 or 6 hex digits to
a u32 value. This function is not optimized, it is only for didactic
purposes. Example: #A0B0CC #A9F
*/
fn convert_html_rgb(in_col string) u32 {
	mut n_digit := if in_col.len == 4 { 1 } else { 2 }
	mut col_mul := if in_col.len == 4 { 4 } else { 0 }
	// this is the regex query, it uses V string interpolation to customize the regex query
	// NOTE: if you want to use escaped codes you must use the r"" (raw) strings,
	// *** please remember that V interpolation doesn't work on raw strings. ***
	query := '#([a-fA-F0-9]{$n_digit})([a-fA-F0-9]{$n_digit})([a-fA-F0-9]{$n_digit})'
	mut re := regex.regex_opt(query) or { panic(err) }
	start, end := re.match_string(in_col)
	println('start: $start, end: $end')
	mut res := u32(0)
	if start >= 0 {
		group_list := re.get_group_list() // this is the utility function
		r := ('0x' + in_col[group_list[0].start..group_list[0].end]).int() << col_mul
		g := ('0x' + in_col[group_list[1].start..group_list[1].end]).int() << col_mul
		b := ('0x' + in_col[group_list[2].start..group_list[2].end]).int() << col_mul
		println('r: $r g: $g b: $b')
		res = u32(r) << 16 | u32(g) << 8 | u32(b)
	}
	return res
}
```

Other utility functions are `get_group_by_id` and `get_group_bounds_by_id`,
that directly get the string or the bounds of a group using its `id`:

```v ignore
txt := "my used string...."
for g_index := 0; g_index < re.group_count ; g_index++ {
	println("#${g_index} [${re.get_group_by_id(txt, g_index)}] \
	bounds: ${re.get_group_bounds_by_id(g_index)}")
}
```

More helper functions are listed in the **Groups query functions** section.
### Groups Continuous saving

In particular situations, it is useful to have a continuous group saving.
This is possible by initializing the `group_csave` field in the `RE` struct.

This feature allows you to collect data in a continuous/streaming way.

In the example below, we want to collect every part of the source string
matched by the groups. To achieve this task, we use continuous group saving,
enabled by the right flag: `re.group_csave_flag = true`.

The `group_csave` array will then be filled following this logic:

- `re.group_csave[0]` - number of total saved records
- `re.group_csave[1+n*3]` - id of the saved group
- `re.group_csave[2+n*3]` - start index in the source string of the saved group
- `re.group_csave[3+n*3]` - end index in the source string of the saved group

The regex will save groups until it finishes or finds that the array has no
more space. If the space ends, no error is raised, and further records will
not be saved.

```v ignore
import regex
fn main(){
	txt := "http://www.ciao.mondo/hello/pippo12_/pera.html"
	query := r"(?P<format>https?)|(?P<format>ftps?)://(?P<token>[\w_]+.)+"

	mut re := regex.regex_opt(query) or { panic(err) }
	//println(re.get_code())   // uncomment to see the print of the regex execution code
	re.debug=2 // enable maximum log
	println("String: ${txt}")
	println("Query : ${re.get_query()}")
	re.debug=0 // disable log
	re.group_csave_flag = true
	start, end := re.match_string(txt)
	if start >= 0 {
		println("Match ($start, $end) => [${txt[start..end]}]")
	} else {
		println("No Match")
	}

	if re.group_csave_flag == true && start >= 0 && re.group_csave.len > 0{
		println("cg: $re.group_csave")
		mut cs_i := 1
		for cs_i < re.group_csave[0]*3 {
			g_id := re.group_csave[cs_i]
			st   := re.group_csave[cs_i+1]
			en   := re.group_csave[cs_i+2]
			println("cg[$g_id] $st $en:[${txt[st..en]}]")
			cs_i += 3
		}
	}
}
```

The output will be:

```
String: http://www.ciao.mondo/hello/pippo12_/pera.html
Query : #0(?P<format>https?)|{8,14}#0(?P<format>ftps?)://#1(?P<token>[\w_]+.)+
Match (0, 46) => [http://www.ciao.mondo/hello/pippo12_/pera.html]
cg: [8, 0, 0, 4, 1, 7, 11, 1, 11, 16, 1, 16, 22, 1, 22, 28, 1, 28, 37, 1, 37, 42, 1, 42, 46]
cg[0] 0 4:[http]
cg[1] 7 11:[www.]
cg[1] 11 16:[ciao.]
cg[1] 16 22:[mondo/]
cg[1] 22 28:[hello/]
cg[1] 28 37:[pippo12_/]
cg[1] 37 42:[pera.]
cg[1] 42 46:[html]
```

### Named capturing groups

This regex module partially supports the question mark `?` PCRE syntax for groups.

`(?:abcd)` **non capturing group**: the content of the group will not be saved.

`(?P<mygroup>abcdef)` **named group**: the group content is saved and labeled
as `mygroup`.

The labels of the groups are saved in the `group_map` of the `RE` struct,
which is a map from `string` to `int`, where the value is the index in the
`group_csave` list of indexes.
Here is an example of how to use them:

```v ignore
import regex
fn main(){
	txt := "http://www.ciao.mondo/hello/pippo12_/pera.html"
	query := r"(?P<format>https?)|(?P<format>ftps?)://(?P<token>[\w_]+.)+"

	mut re := regex.regex_opt(query) or { panic(err) }
	//println(re.get_code())   // uncomment to see the print of the regex execution code
	re.debug=2 // enable maximum log
	println("String: ${txt}")
	println("Query : ${re.get_query()}")
	re.debug=0 // disable log
	start, end := re.match_string(txt)
	if start >= 0 {
		println("Match ($start, $end) => [${txt[start..end]}]")
	} else {
		println("No Match")
	}

	for name in re.group_map.keys() {
		println("group:'$name' \t=> [${re.get_group_by_name(txt, name)}] \
		bounds: ${re.get_group_bounds_by_name(name)}")
	}
}
```

Output:

```
String: http://www.ciao.mondo/hello/pippo12_/pera.html
Query : #0(?P<format>https?)|{8,14}#0(?P<format>ftps?)://#1(?P<token>[\w_]+.)+
Match (0, 46) => [http://www.ciao.mondo/hello/pippo12_/pera.html]
group:'format' => [http] bounds: (0, 4)
group:'token' => [html] bounds: (42, 46)
```

In order to simplify the use of the named groups, it is possible to query
them through the name map stored in the `re` struct, for example with the
function `re.get_group_bounds_by_name`.

Here is a more complex example of using them:

```v oksyntax
// This function demonstrates the use of the named groups
fn convert_html_rgb_n(in_col string) u32 {
	mut n_digit := if in_col.len == 4 { 1 } else { 2 }
	mut col_mul := if in_col.len == 4 { 4 } else { 0 }
	query := '#(?P<red>[a-fA-F0-9]{$n_digit})' + '(?P<green>[a-fA-F0-9]{$n_digit})' +
		'(?P<blue>[a-fA-F0-9]{$n_digit})'
	mut re := regex.regex_opt(query) or { panic(err) }
	start, end := re.match_string(in_col)
	println('start: $start, end: $end')
	mut res := u32(0)
	if start >= 0 {
		red_s, red_e := re.get_group_bounds_by_name('red')
		r := ('0x' + in_col[red_s..red_e]).int() << col_mul
		green_s, green_e := re.get_group_bounds_by_name('green')
		g := ('0x' + in_col[green_s..green_e]).int() << col_mul
		blue_s, blue_e := re.get_group_bounds_by_name('blue')
		b := ('0x' + in_col[blue_s..blue_e]).int() << col_mul
		println('r: $r g: $g b: $b')
		res = u32(r) << 16 | u32(g) << 8 | u32(b)
	}
	return res
}
```

Other utilities are `get_group_by_name` and `get_group_bounds_by_name`,
that return the string or the bounds of a group using its `name`:

```v ignore
txt := "my used string...."
for name in re.group_map.keys() {
	println("group:'$name' \t=> [${re.get_group_by_name(txt, name)}] \
	bounds: ${re.get_group_bounds_by_name(name)}")
}
```

### Groups query functions

These functions are helpers to query the captured groups:

```v ignore
// get_group_bounds_by_name get a group boundaries by its name
pub fn (re RE) get_group_bounds_by_name(group_name string) (int, int)

// get_group_by_name get a group string by its name
pub fn (re RE) get_group_by_name(in_txt string, group_name string) string

// get_group_bounds_by_id get a group boundaries by its id
pub fn (re RE) get_group_bounds_by_id(group_id int) (int,int)

// get_group_by_id get a group string by its id
pub fn (re RE) get_group_by_id(in_txt string, group_id int) string

struct Re_group {
pub:
	start int = -1
	end   int = -1
}

// get_group_list return a list of Re_group for the found groups
pub fn (re RE) get_group_list() []Re_group
```
## Flags

It is possible to set some flags in the regex parser that change
the behavior of the parser itself.

```v ignore
// example of flag settings
mut re := regex.new()
re.flag = regex.F_BIN
```

- `F_BIN`: parse a string as bytes, utf-8 management disabled.

- `F_EFM`: exit on the first char that matches in the query, used by the
  find function.

- `F_MS`: matches only if the index of the start match is 0,
  same as `^` at the start of the query string.

- `F_ME`: matches only if the end index of the match is the last char
  of the input string, same as `$` at the end of the query string.

- `F_NL`: stop the matching if a new line char `\n` or `\r` is found.

## Functions

### Initializer

These functions are helpers that create the `RE` struct;
an `RE` struct can also be created manually if needed.

#### **Simplified initializer**

```v ignore
// regex_opt creates a regex object from the query string and compiles it
pub fn regex_opt(in_query string) ?RE
```

#### **Base initializer**

```v ignore
// new creates an RE of small size, usually sufficient for ordinary use
pub fn new() RE
```

#### **Custom initialization**

For some particular needs, it is possible to initialize a fully customized regex:

```v ignore
pattern := r"ab(.*)(ac)"
// init custom regex
mut re := regex.RE{}
// max program length, cannot be longer than the pattern
re.prog = []Token {len: pattern.len + 1}
// there cannot be more char classes than the length of the pattern
re.cc = []CharClass{len: pattern.len}

re.group_csave_flag = false // true enables continuous group saving if needed
re.group_max_nested = 128 // set max 128 nested groups
re.group_max = pattern.len>>1 // we can't have more groups than half of the pattern length
re.group_stack = []int{len: re.group_max, init: -1}
re.group_data = []int{len: re.group_max, init: -1}
```

### Compiling

After an initializer is used, the regex expression must be compiled with:

```v ignore
// compile_opt compiles the regex, returning an error if the compilation fails
pub fn (mut re RE) compile_opt(in_txt string) ?
```

### Matching Functions

These are the matching functions:

```v ignore
// match_string tries to match the input string, returns start and end index if found, else start is -1
pub fn (mut re RE) match_string(in_txt string) (int,int)
```

## Find and Replace

There are the following find and replace functions:

#### Find functions

```v ignore
// find tries to find the first match in the input string
// returns start and end index if found, else start is -1
pub fn (mut re RE) find(in_txt string) (int,int)

// find_all finds all the "non overlapping" occurrences of the matching pattern
// returns a list of start/end indexes like: [3,4,6,8]
// the matches are [3,4] and [6,8]
pub fn (mut re RE) find_all(in_txt string) []int

// find_all_str finds all the "non overlapping" occurrences of the matching pattern
// returns a list of strings
// the result is like ["first match","second match"]
pub fn (mut re RE) find_all_str(in_txt string) []string
```
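As a minimal sketch of the find functions in use (the pattern and input
string are illustrative):

```v ignore
import regex

fn main() {
	txt := 'one tap, two taps, three taps'
	mut re := regex.regex_opt(r'ta\w+') or { panic(err) }
	// expected: [4, 7, 13, 17, 25, 29], i.e. the matches are [4,7], [13,17] and [25,29]
	println(re.find_all(txt))
	// expected: ['tap', 'taps', 'taps']
	println(re.find_all_str(txt))
}
```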
#### Replace functions

```v ignore
// replace returns a string where the matches are replaced with the repl string,
// this function supports groups in the replacement string
pub fn (mut re RE) replace(in_txt string, repl string) string
```

The replacement string can include group references:

```v ignore
txt := "Today it is a good day."
query := r'(a\w)[ ,.]'
mut re := regex.regex_opt(query)?
res := re.replace(txt, r"__[\0]__")
```

In this example we used group `0` in the replacement string (`\0`); the result will be:

```
Today it is a good day.
=> Tod__[ay]__it is a good d__[ay]__
```

**Note:** only groups from `0` to `9` can be used in replacement strings.

If groups are not needed in the replace process, it is possible
to use a quicker function:

```v ignore
// replace_simple returns a string where the matches are replaced with the replace string
pub fn (mut re RE) replace_simple(in_txt string, repl string) string
```

#### Custom replace function

For complex find and replace operations, you can use `replace_by_fn`.
`replace_by_fn` uses a custom replace callback function, thus allowing
customization. The custom callback function is called for every
non overlapping match.

The custom callback function must be of the type:

```v ignore
// type of function used for custom replace
// in_txt source text
// start index of the start of the match in in_txt
// end index of the end of the match in in_txt
// --- the match is in in_txt[start..end] ---
fn (re RE, in_txt string, start int, end int) string
```

The following example will clarify its usage:

```v ignore
import regex
// customized replace function
// it will be called on each non overlapping match
fn my_repl(re regex.RE, in_txt string, start int, end int) string {
	g0 := re.get_group_by_id(in_txt, 0)
	g1 := re.get_group_by_id(in_txt, 1)
	g2 := re.get_group_by_id(in_txt, 2)
	return "*$g0*$g1*$g2*"
}

fn main(){
	txt := "today [John] is gone to his house with (Jack) and [Marie]."
	query := r"(.)(\A\w+)(.)"

	mut re := regex.regex_opt(query) or { panic(err) }

	result := re.replace_by_fn(txt, my_repl)
	println(result)
}
```

Output:

```
today *[*John*]* is gone to his house with *(*Jack*)* and *[*Marie*]*.
```

## Debugging

This module has a few small utilities to help you write regex patterns.

### **Syntax errors highlight**

The following example code shows how to visualize regex pattern syntax errors
in the compilation phase:

```v oksyntax
query := r'ciao da ab[ab-]'
// there is an error, a range not closed!!
mut re := new()
re.compile_opt(query) or { println(err) }
// output!!
// query: ciao da ab[ab-]
// err  : ----------^
// ERROR: ERR_SYNTAX_ERROR
```

### **Compiled code**

It is possible to view the compiled code by calling the function `get_code()`.
The result will be something like this:

```
========================================
v RegEx compiler v 1.0 alpha output:
PC: 0 ist: 92000000 ( GROUP_START #:0 { 1, 1}
PC: 1 ist: 98000000 . DOT_CHAR nx chk: 4 { 1, 1}
PC: 2 ist: 94000000 ) GROUP_END #:0 { 1, 1}
PC: 3 ist: 92000000 ( GROUP_START #:1 { 1, 1}
PC: 4 ist: 90000000 [\A] BSLS { 1, 1}
PC: 5 ist: 90000000 [\w] BSLS { 1,MAX}
PC: 6 ist: 94000000 ) GROUP_END #:1 { 1, 1}
PC: 7 ist: 92000000 ( GROUP_START #:2 { 1, 1}
PC: 8 ist: 98000000 . DOT_CHAR nx chk: -1 last! { 1, 1}
PC: 9 ist: 94000000 ) GROUP_END #:2 { 1, 1}
PC: 10 ist: 88000000 PROG_END { 0, 0}
========================================
```

`PC`:`int` is the program counter or step of execution, each single step is a token.

`ist`:`hex` is the token instruction id.

`[a]` is the char used by the token.

`query_ch` is the type of token.

`{m,n}` is the quantifier, the greedy off flag `?` will be shown if present in the token.
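For reference, a listing like the one above can be obtained with a snippet
along these lines (the query is the same one used in the custom replace
example):

```v ignore
import regex

fn main() {
	mut re := regex.regex_opt(r'(.)(\A\w+)(.)') or { panic(err) }
	println(re.get_code()) // dump the compiled program of the regex
}
```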
### **Log debug**

The log debugger allows printing the status of the regex parser while the
parser is running. Two different levels of debug information are available:
1 is normal, while 2 is verbose.

Here is an example:

*normal* - lists only the token instructions with their values

```ignore
// re.debug = 1 // log level normal
flags: 00000000
# 2 s: ist_load PC: i,ch,len:[ 0,'a',1] f.m:[ -1, -1] query_ch: [a]{1,1}:0 (#-1)
# 5 s: ist_load PC: i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [b]{2,3}:0? (#-1)
# 7 s: ist_load PC: i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
# 10 PROG_END
```

*verbose* - lists all the instructions and states of the parser

```ignore
flags: 00000000
# 0 s: start PC: NA
# 1 s: ist_next PC: NA
# 2 s: ist_load PC: i,ch,len:[ 0,'a',1] f.m:[ -1, -1] query_ch: [a]{1,1}:0 (#-1)
# 3 s: ist_quant_p PC: i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [a]{1,1}:1 (#-1)
# 4 s: ist_next PC: NA
# 5 s: ist_load PC: i,ch,len:[ 1,'b',1] f.m:[ 0, 0] query_ch: [b]{2,3}:0? (#-1)
# 6 s: ist_quant_p PC: i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
# 7 s: ist_load PC: i,ch,len:[ 2,'b',1] f.m:[ 0, 1] query_ch: [b]{2,3}:1? (#-1)
# 8 s: ist_quant_p PC: i,ch,len:[ 3,'b',1] f.m:[ 0, 2] query_ch: [b]{2,3}:2? (#-1)
# 9 s: ist_next PC: NA
# 10 PROG_END
# 11 PROG_END
```

The columns have the following meaning:

`# 2` number of the current step since the start of parsing

`s: ist_next` state of the current step

`PC: 1` program counter of the step

`=>7fffffff` hex code of the instruction

`i,ch,len:[ 0,'a',1]` `i` index in the source string, `ch` the char parsed,
`len` the length in bytes of the char parsed

`f.m:[ 0, 1]` `f` index of the first match in the source string, `m` index of the current match

`query_ch: [b]` token in use and its char

`{2,3}:1?` quantifier `{min,max}`, `:1` is the current repetition counter,
`?` is the greedy off flag if present.

### **Custom Logger output**

The debug functions write to `stdout` by default; it is possible
to provide an alternative output by setting a custom output function:

```v oksyntax
// custom print function, the input will be the regex debug string
fn custom_print(txt string) {
	println('my log: $txt')
}

mut re := new()
re.log_func = custom_print
// every debug output from now on will call this function
```

## Example code

Here is an example that performs some basic matching of strings:

```v ignore
import regex

fn main(){
	txt := "http://www.ciao.mondo/hello/pippo12_/pera.html"
	query := r"(?P<format>https?)|(?P<format>ftps?)://(?P<token>[\w_]+.)+"

	mut re := regex.regex_opt(query) or { panic(err) }

	start, end := re.match_string(txt)
	if start >= 0 {
		println("Match ($start, $end) => [${txt[start..end]}]")
		for g_index := 0; g_index < re.group_count ; g_index++ {
			println("#${g_index} [${re.get_group_by_id(txt, g_index)}] \
			bounds: ${re.get_group_bounds_by_id(g_index)}")
		}
		for name in re.group_map.keys() {
			println("group:'$name' \t=> [${re.get_group_by_name(txt, name)}] \
			bounds: ${re.get_group_bounds_by_name(name)}")
		}
	} else {
		println("No Match")
	}
}
```

Here is an example of total customization of the regex environment creation:

```v ignore
import regex

fn main(){
	txt := "today John is gone to his house with Jack and Marie."
	query := r"(?:(?P<word>\A\w+)|(?:\a\w+)[\s.]?)+"

	// init regex
	mut re := regex.RE{}
	// max program length, cannot be longer than the query
	re.prog = []regex.Token {len: query.len + 1}
	// there cannot be more char classes than the length of the query
	re.cc = []regex.CharClass{len: query.len}
	// enable continuous group saving
	re.group_csave_flag = true
	// set max 128 nested groups
	re.group_max_nested = 128
	// we can't have more groups than half of the query length
	re.group_max = query.len>>1

	// compile the query
	re.compile_opt(query) or { panic(err) }

	start, end := re.match_string(txt)
	if start >= 0 {
		println("Match ($start, $end) => [${txt[start..end]}]")
	} else {
		println("No Match")
	}

	// show results for continuous group saving
	if re.group_csave_flag == true && start >= 0 && re.group_csave.len > 0{
		println("cg: $re.group_csave")
		mut cs_i := 1
		for cs_i < re.group_csave[0]*3 {
			g_id := re.group_csave[cs_i]
			st   := re.group_csave[cs_i+1]
			en   := re.group_csave[cs_i+2]
			println("cg[$g_id] $st $en:[${txt[st..en]}]")
			cs_i += 3
		}
	}

	// show results for captured groups
	if start >= 0 {
		println("Match ($start, $end) => [${txt[start..end]}]")
		for g_index := 0; g_index < re.group_count ; g_index++ {
			println("#${g_index} [${re.get_group_by_id(txt, g_index)}] \
			bounds: ${re.get_group_bounds_by_id(g_index)}")
		}
		for name in re.group_map.keys() {
			println("group:'$name' \t=> [${re.get_group_by_name(txt, name)}] \
			bounds: ${re.get_group_bounds_by_name(name)}")
		}
	} else {
		println("No Match")
	}
}
```

More examples are available in the test code for the `regex` module,
see `vlib/regex/regex_test.v`.