
RegEx Exercises

RegEx Cheat Sheet:

  • x match character 'x' literally
  • C match character 'C' literally
  • . match almost any character (everything except line breaks)
  • \. match '.' literally
  • \\ match character '\' literally
  • / match character '/' literally
  • .* repeat any character except '\n', zero or more times.
  • ? makes the preceding token optional, e.g., ab?c matches abc and ac.
  • [0-5] match any char in 0 to 5 (bracket expression)
  • [a-g] match any char in {a,b,c,d,e,f,g}
  • [^0-2] match any char except 0, 1, 2 (here ^ negates the set)
  • $ end of line
  • + quantifier: match one or more of the preceding token
  • [0-9]+ repeat any char in 0 to 9, one or more times
  • ^ start of a line; as the first char inside [ ] it negates the set instead (see above)
  • \< the beginning of a word. A "word" is a run of letters, digits and underscores.
  • \> the end of a word.

Here, the solutions should be written as extended regular expressions (ERE), see e.g. https://www.regular-expressions.info/posix.html.

Note: Neither \d nor \w is available in POSIX regular expressions (see https://www.regular-expressions.info/posixbrackets.html for POSIX bracket expressions).

To avoid "false positives" (matches that shouldn't match), a regular expression should be as specific as possible.
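
The effect of specificity can be sketched with two patterns that are not part of the exercises. The helper `matches` and the sample strings are assumptions made for illustration only:

```bash
# Hypothetical helper (not part of the notebook): prints yes/no for a regex and a string.
matches() { [[ $2 =~ $1 ]] && echo yes || echo no; }

loose='gr.y'       # the dot accepts any character between "gr" and "y"
strict='gr[ae]y'   # only "a" or "e" is accepted in that position

for s in gray grey gr8y; do
    echo "$s: loose=$(matches "$loose" "$s") strict=$(matches "$strict" "$s")"
done
```

The string "gr8y" is a false positive for the loose pattern; the stricter pattern rejects it.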

In [1]:
# don't modify this cell!
should_match(){
    regEx=$1
    for string in "${@:2}"
        do 
            if [[ ! ($string =~ $regEx) ]]; then
                echo "Error: '$string' don't match to the regex '$regEx', but should match!"
            else 
                echo "OK: '$string' match to the regex."
            fi    
    done
}
In [2]:
# don't modify this cell!
should_not_match(){
    regEx=$1
    for string in "${@:2}"
        do 
            if [[ ($string =~ $regEx) ]]; then
                echo "Error: '$string' match to the regex '$regEx', but should not match!"
            else 
                echo "OK: '$string' don't match to the regex."
            fi    
    done
}

Simple character match

A simple character match matches the chars of the pattern literally in the text/string.

Example: The regex "ab" matches "abcd" and "aabbcc", but not "acba".
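
The example above can be tried directly with the bash operator =~, which the test functions in this notebook also use. This is only a small sketch; BASH_REMATCH[0] holds the part of the string that matched:

```bash
regex='ab'
for s in abcd aabbcc acba; do
    if [[ $s =~ $regex ]]; then
        echo "'$s' contains '${BASH_REMATCH[0]}'"
    else
        echo "'$s' has no match"
    fi
done
```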

Exercise

Write a RegEx that matches if there is a char sequence xyz in the text.

In [3]:
regex='YOUR_REG_EX_HERE' # replace the regex string
In [5]:
should_match "$regex" xyz xyzab abxyzcd 
should_not_match "$regex" cab abc xcd yz xz
OK: 'xyz' match to the regex.
OK: 'xyzab' match to the regex.
OK: 'abxyzcd' match to the regex.
OK: 'cab' don't match to the regex.
OK: 'abc' don't match to the regex.
OK: 'xcd' don't match to the regex.
OK: 'yz' don't match to the regex.
OK: 'xz' don't match to the regex.

Numbers

Exercise

Write a RegEx that matches if there is at least one digit in the string.

In [6]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string
In [8]:
should_match "$regex" 8 "var 43" "z=7" 123 62 11 2i 7 34 5z a73 09 r7a25r 342 
should_not_match "$regex" abcd ztd xyz one
OK: '8' match to the regex.
OK: 'var 43' match to the regex.
OK: 'z=7' match to the regex.
OK: '123' match to the regex.
OK: '62' match to the regex.
OK: '11' match to the regex.
OK: '2i' match to the regex.
OK: '7' match to the regex.
OK: '34' match to the regex.
OK: '5z' match to the regex.
OK: 'a73' match to the regex.
OK: '09' match to the regex.
OK: 'r7a25r' match to the regex.
OK: '342' match to the regex.
OK: 'abcd' don't match to the regex.
OK: 'ztd' don't match to the regex.
OK: 'xyz' don't match to the regex.
OK: 'one' don't match to the regex.

The Dot

The dot . is a wildcard metacharacter that matches any single character (letter, digit, whitespace, everything). Note that ? plays this role only in shell glob patterns, not in regular expressions.

As a consequence, a plain dot does not match only the period character. To match a literal period, escape the dot with a backslash and write \. instead. In general, to literally match chars that have a special meaning, escape them with a \.
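
A minimal sketch of the difference, using a made-up version string rather than the exercise data:

```bash
unescaped='v1.2'    # the dot matches any single character
escaped='v1\.2'     # the backslash makes the dot literal

for s in "v1.2" "v1x2"; do
    [[ $s =~ $unescaped ]] && u=yes || u=no
    [[ $s =~ $escaped ]]   && e=yes || e=no
    echo "$s: unescaped=$u escaped=$e"
done
```

Only the escaped pattern rejects "v1x2".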

Exercise

Write a RegEx that matches only if there is a period . in the text.

In [9]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string
In [11]:
# contains a period?
should_match "$regex" "xyz." "*=-." "179." "17.9."  "17.9"
should_not_match "$regex" "xyz1" "*=-:" "1797"
OK: 'xyz.' match to the regex.
OK: '*=-.' match to the regex.
OK: '179.' match to the regex.
OK: '17.9.' match to the regex.
OK: '17.9' match to the regex.
OK: 'xyz1' don't match to the regex.
OK: '*=-:' don't match to the regex.
OK: '1797' don't match to the regex.

Exercise

Write a RegEx that matches only if there is an "a" followed by an arbitrary character and then a "c", i.e., exactly one char between "a" and "c":

In [12]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string
In [14]:
# exact one char between "a" and "c"
should_match "$regex" "axc" 'a!c' "aqc" "aßc" "paoc" "paoci" "opa3cip"
should_not_match "$regex" "ac" "avvc" "agsec" "agsec" "agsec" "wagsec" "cxa"
OK: 'axc' match to the regex.
OK: 'a!c' match to the regex.
OK: 'aqc' match to the regex.
OK: 'aßc' match to the regex.
OK: 'paoc' match to the regex.
OK: 'paoci' match to the regex.
OK: 'opa3cip' match to the regex.
OK: 'ac' don't match to the regex.
OK: 'avvc' don't match to the regex.
OK: 'agsec' don't match to the regex.
OK: 'agsec' don't match to the regex.
OK: 'agsec' don't match to the regex.
OK: 'wagsec' don't match to the regex.
OK: 'cxa' don't match to the regex.

Exercise

Write a RegEx that matches only the string "question?" and nothing else.

In [15]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string.
In [17]:
should_match "$regex" "question?"
should_not_match "$regex" "questions"
OK: 'question?' match to the regex.
OK: 'questions' don't match to the regex.

Exercise

Write a RegEx that matches only the string "abc\.xyz" and nothing else.

In [18]:
regex='YOUR_REG_EX_HERE' # replace the regex string
In [20]:
should_match "$regex" 'abc\.xyz' 
should_not_match "$regex" 'abc.xyz' 'abcd.xyz' 'abc..xyz'
OK: 'abc\.xyz' match to the regex.
OK: 'abc.xyz' don't match to the regex.
OK: 'abcd.xyz' don't match to the regex.
OK: 'abc..xyz' don't match to the regex.

Char sets by Bracket Expression

The pattern '[xyz]' will only match a single x, y, or z letter and nothing else.
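
As an illustration with a hypothetical pattern (not the solution of the exercise below), a bracket expression in the middle of a word accepts exactly the listed chars at that position:

```bash
regex='b[aiu]g'   # "b", then one of a, i, u, then "g"
for s in bag big bug beg bog; do
    if [[ $s =~ $regex ]]; then echo "$s: match"; else echo "$s: no match"; fi
done
```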

Exercise

Write a RegEx that matches only the strings "can" "man" "fan" and nothing else.

In [21]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string
In [23]:
should_match "$regex" can man fan
should_not_match "$regex" han san ban
OK: 'can' match to the regex.
OK: 'man' match to the regex.
OK: 'fan' match to the regex.
OK: 'han' don't match to the regex.
OK: 'san' don't match to the regex.
OK: 'ban' don't match to the regex.

Excluding specific characters

Square brackets [ ] with the hat ^ as the first element inside them exclude the listed characters. Examples:

  • [^ac] matches any char except a and c.
  • [^A-Z] matches any char except capital letters.
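
A short sketch with the first example embedded in a made-up pattern. Note that a negated set still has to consume one character, so the string "bd" does not match:

```bash
regex='b[^ac]d'   # "b", then any single char except a or c, then "d"
for s in bxd b1d bad bcd bd; do
    if [[ $s =~ $regex ]]; then echo "$s: match"; else echo "$s: no match"; fi
done
```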

Exercise

Write a RegEx that matches an arbitrary char followed by "an", but does not match "han", "san" and "ban".

In [24]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string using square brackets and the hat
In [26]:
should_match "$regex" can man fan lan dan 6an tanxx # etc.
should_not_match "$regex" han san ban an # not an "h","s" or "b" before "an"

# to check your solution: you should use square brackets and the hat!
should_match '\[\^.*\]' "$regex"
OK: 'can' match to the regex.
OK: 'man' match to the regex.
OK: 'fan' match to the regex.
OK: 'lan' match to the regex.
OK: 'dan' match to the regex.
OK: '6an' match to the regex.
OK: 'tanxx' match to the regex.
OK: 'han' don't match to the regex.
OK: 'san' don't match to the regex.
OK: 'ban' don't match to the regex.
OK: 'an' don't match to the regex.
OK: '[^hsb]an' match to the regex.

For the following exercises, try to write a specific pattern that matches, or respectively does not match, the example strings.

Character ranges

[x-z] matches x, y or z.
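
A minimal sketch with a hypothetical pattern. Which chars fall into a range can depend on the locale, so the sketch sets the C locale as an assumption to make [a-c] mean exactly a, b, c:

```bash
# Assumption: force the C locale so that ranges follow plain ASCII order.
LC_ALL=C

regex='[a-c][0-3]'   # hypothetical pattern: one of a..c, then one of 0..3
for s in b2 a0 d2 b7 B2; do
    if [[ $s =~ $regex ]]; then echo "$s: match"; else echo "$s: no match"; fi
done
```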

Write a regex that first matches one of A, B, ..., E, then an arbitrary char, followed again by one of A, B, ..., E.

In [27]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string. Use char-range two times!
In [29]:
should_match "$regex" AnA BoB CpC AxC BwA BDC BxA EdE DwE
should_not_match "$regex" aax bby ccC Aay Cpy bob Bob anA AaF FaA

# to check your solution: use "[ .. ]" two times
should_match '\[.*\].\[.*\]' "$regex"
OK: 'AnA' match to the regex.
OK: 'BoB' match to the regex.
OK: 'CpC' match to the regex.
OK: 'AxC' match to the regex.
OK: 'BwA' match to the regex.
OK: 'BDC' match to the regex.
OK: 'BxA' match to the regex.
OK: 'EdE' match to the regex.
OK: 'DwE' match to the regex.
OK: 'aax' don't match to the regex.
OK: 'bby' don't match to the regex.
OK: 'ccC' don't match to the regex.
OK: 'Aay' don't match to the regex.
OK: 'Cpy' don't match to the regex.
OK: 'bob' don't match to the regex.
OK: 'Bob' don't match to the regex.
OK: 'anA' don't match to the regex.
OK: 'AaF' don't match to the regex.
OK: 'FaA' don't match to the regex.
OK: '[A-E].[A-E]' match to the regex.

Repetition with curly brackets notation

Examples:

  • e{4,} at least four 'e'.
  • [ab]{3,4} matches e.g. 'abb', 'baa', 'abab', 'bbbb', but not 'ab' or 'abxa'.
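
The two examples above can be checked as follows. The helper `matches` and the strings "wheee" and "wheeeee" are assumptions added for illustration:

```bash
# Hypothetical helper (not part of the notebook): prints yes/no for a regex and a string.
matches() { [[ $2 =~ $1 ]] && echo yes || echo no; }

for s in abb baa abab bbbb ab abxa; do
    echo "[ab]{3,4} on $s: $(matches '[ab]{3,4}' "$s")"
done

for s in wheee wheeeee; do
    echo "e{4,} on $s: $(matches 'e{4,}' "$s")"
done
```

Without anchors or surrounding literals, an interval only sets a lower bound in practice: a longer run still contains a shorter run that matches.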

Write a regex that matches "wa" followed by 3 or 4 "z" followed by "up".

In [30]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string. Use curly braces notation.
In [32]:
should_match "$regex" wazzzzup wazzzup 
should_not_match "$regex" wazzup wazup wazzzzzup

# to check your solution: it has to use curly brackets
should_match '\{.,.\}' "$regex"
OK: 'wazzzzup' match to the regex.
OK: 'wazzzup' match to the regex.
OK: 'wazzup' don't match to the regex.
OK: 'wazup' don't match to the regex.
OK: 'wazzzzzup' don't match to the regex.
OK: 'waz{3,4}up' match to the regex.

Write a regex that matches "wa" followed by 0, 1, 2 or 3 "z" followed by "up".

In [33]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string. Use curly braces notation.
In [35]:
# at most three z
should_match "$regex" wazzup wazup waup wazzzup
should_not_match "$regex" wazzzzup wazzzzzup wazzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzup 


# to check your solution: use curly brackets
should_match '\{,.\}' "$regex"
OK: 'wazzup' match to the regex.
OK: 'wazup' match to the regex.
OK: 'waup' match to the regex.
OK: 'wazzzup' match to the regex.
OK: 'wazzzzup' don't match to the regex.
OK: 'wazzzzzup' don't match to the regex.
OK: 'wazzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzup' don't match to the regex.
OK: 'waz{,3}up' match to the regex.

Kleene Star and the Kleene Plus

  • Kleene Star *: Zero or more, e.g. ab*c matches 'abc', 'abbbc' and 'ac'.
  • Kleene Plus +: One or more, e.g. ab+c matches 'abc', 'abbbc', but not 'ac'.
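
Both examples from the list can be compared side by side. This is a small sketch; the helper `matches` is an assumption and not part of the notebook:

```bash
# Hypothetical helper: prints yes/no for a regex and a string.
matches() { [[ $2 =~ $1 ]] && echo yes || echo no; }

for s in abc abbbc ac; do
    echo "$s: ab*c=$(matches 'ab*c' "$s") ab+c=$(matches 'ab+c' "$s")"
done
```

The two patterns differ only for "ac", where the star accepts zero "b" and the plus does not.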

Write a regex for:

  • first, two or more "a"
  • then arbitrarily many "b" (including none at all)
  • finally, one or more "c"
In [36]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string. Use Kleene Star and Kleene Plus
# Don't use curly braces notation!
In [38]:
should_match "$regex" aaaabcc aabbbc aaaacccc
should_not_match "$regex" a ac aaaabb aaabb
OK: 'aaaabcc' match to the regex.
OK: 'aabbbc' match to the regex.
OK: 'aaaacccc' match to the regex.
OK: 'a' don't match to the regex.
OK: 'ac' don't match to the regex.
OK: 'aaaabb' don't match to the regex.
OK: 'aaabb' don't match to the regex.

Optional characters

  • ? for an optional char, e.g. ab?c matches 'ac' and 'abc'.
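
A brief sketch with a made-up pattern: the question mark applies only to the token directly before it, here the "u":

```bash
regex='colou?r'   # hypothetical pattern: the "u" may appear once or not at all
for s in color colour colouur; do
    if [[ $s =~ $regex ]]; then echo "$s: match"; else echo "$s: no match"; fi
done
```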
In [40]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string.
In [42]:
should_match "$regex" "1 file found?" "3 files found?" "24 files found?"
should_not_match "$regex" "No files found." "no file found?" "3 files found"
OK: '1 file found?' match to the regex.
OK: '3 files found?' match to the regex.
OK: '24 files found?' match to the regex.
OK: 'No files found.' don't match to the regex.
OK: 'no file found?' don't match to the regex.
OK: '3 files found' don't match to the regex.

Whitespaces

  • Whitespace chars are " " (space), \t (tab) and \n (new line).
  • [:space:] is a POSIX character class. It is used inside a bracket expression, e.g. [[:space:]].
  • Note: the whitespace shorthand \s is not part of POSIX regular expressions.
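
A minimal sketch of the character class in use, with a hypothetical pattern. The tab is written with bash's $'...' quoting:

```bash
regex='a[[:space:]]+b'   # "a", at least one whitespace char, then "b"
for s in "a b" $'a\tb' "a   b" "ab"; do
    if [[ $s =~ $regex ]]; then echo "match:    '$s'"; else echo "no match: '$s'"; fi
done
```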
In [43]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string.

At least one digit, followed by a dot ".", then at least one space, followed by "xyz".

In [45]:
should_match "$regex" "1. xyz" "2.      xyz" "3.    xyz"
should_not_match "$regex" "1.xyz" ". xyz" "3. xz"
OK: '1. xyz' match to the regex.
OK: '2.      xyz' match to the regex.
OK: '3.    xyz' match to the regex.
OK: '1.xyz' don't match to the regex.
OK: '. xyz' don't match to the regex.
OK: '3. xz' don't match to the regex.

A whitespace char in the middle of the string (not only at the start or the end):

In [46]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string.
In [48]:
should_match "$regex" "d xyz" "we  xyz " "dgg xz"
should_not_match "$regex" "xyz  " "daADahga" " adaagXyz"
OK: 'd xyz' match to the regex.
OK: 'we  xyz ' match to the regex.
OK: 'dgg xz' match to the regex.
OK: 'xyz  ' don't match to the regex.
OK: 'daADahga' don't match to the regex.
OK: ' adaagXyz' don't match to the regex.

Start and End of a line

  • Start metacharacter: ^ (hat), e.g., ^bla matches "blase" but not "nabla-operator".
  • End metacharacter: $ (dollar sign), e.g., bla$ matches "nabla" but not "blase".

Note that this is different from the hat inside a bracket expression [^...], which excludes characters.
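
The examples above, plus the combination of both anchors, can be sketched like this. The helper `matches` is an assumption for illustration:

```bash
# Hypothetical helper: prints yes/no for a regex and a string.
matches() { [[ $2 =~ $1 ]] && echo yes || echo no; }

for s in bla blase nabla nabla-operator; do
    echo "$s: ^bla=$(matches '^bla' "$s") bla\$=$(matches 'bla$' "$s") ^bla\$=$(matches '^bla$' "$s")"
done
```

With both anchors, the pattern has to cover the whole string, so only "bla" itself matches ^bla$.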

In [49]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string.
In [51]:
should_match "$regex" "Mission: successful"
should_not_match "$regex" "Last Mission: successful" "Mission: successful upon capture of target"
OK: 'Mission: successful' match to the regex.
OK: 'Last Mission: successful' don't match to the regex.
OK: 'Mission: successful upon capture of target' don't match to the regex.

No whitespace in the middle of the string:

In [52]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string.
In [54]:
should_match "$regex" "xyz  " "daADahga" " adaagXyz"
should_not_match "$regex" "d xyz" "we  xyz " "dgg xz" "  bla bla"
OK: 'xyz  ' match to the regex.
OK: 'daADahga' match to the regex.
OK: ' adaagXyz' match to the regex.
OK: 'd xyz' don't match to the regex.
OK: 'we  xyz ' don't match to the regex.
OK: 'dgg xz' don't match to the regex.
OK: '  bla bla' don't match to the regex.

Begin and end of a word

  • The symbol \< matches the beginning of a word.
  • The symbol \> matches the end of a word.

Write a regex with "man" at the beginning of a word.
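
A small sketch with a different word than the one used in the exercises. It assumes a regex library that supports \< and \> (for example glibc on Linux); they are an extension and may not be available everywhere:

```bash
# Hypothetical helper: prints yes/no for a regex and a string.
matches() { [[ $2 =~ $1 ]] && echo yes || echo no; }

begin='\<cat'   # "cat" at the beginning of a word
end='cat\>'     # "cat" at the end of a word

for s in "a cat sat" "catalog entry" "concat strings"; do
    echo "'$s': begin=$(matches "$begin" "$s") end=$(matches "$end" "$s")"
done
```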

In [55]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string.
In [124]:
should_match "$regex" "A man was here." "... that the mankind survive."
should_not_match "$regex" "A woman was here." "... calling for humans .."


# to check your solution, uncomment the following line
#should_match '\\<' "$regex"
OK: 'A man was here.' match to the regex.
OK: '... that the mankind survive.' match to the regex.
OK: 'A woman was here.' don't match to the regex.
OK: '... calling for humans ..' don't match to the regex.

Write a regex with "man" at the end of a word.

In [125]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string.
In [127]:
should_match "$regex" "A man was here." "A woman was here." 
should_not_match "$regex" "... calling for humans .." "... that the mankind survive."

# to check your solution
should_match '\\>' "$regex"
OK: 'A man was here.' match to the regex.
OK: 'A woman was here.' match to the regex.
OK: '... calling for humans ..' don't match to the regex.
OK: '... that the mankind survive.' don't match to the regex.
OK: 'man\>' match to the regex.

Groups

  • A subpattern inside a pair of parentheses ( ... ) is captured as a group.
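
In bash, the captured groups of the last successful =~ match are available in the array BASH_REMATCH: index 0 holds the whole match, index 1 the first group, and so on. A minimal sketch with a made-up pattern and input:

```bash
regex='^([[:alpha:]]+)-([[:digit:]]+)$'   # hypothetical: letters, a dash, digits
input="item-42"

if [[ $input =~ $regex ]]; then
    whole=${BASH_REMATCH[0]}
    name=${BASH_REMATCH[1]}
    number=${BASH_REMATCH[2]}
    echo "whole match: $whole"
    echo "group 1:     $name"
    echo "group 2:     $number"
fi
```

BASH_REMATCH is overwritten by every new =~ match, which is why the test function below copies it before doing nested matches.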
In [129]:
# don't modify this
groups_should_match(){
    regEx=$1
    groups=$2
    for string in "${@:3}"; do 
        if [[ ! ($string =~ $regEx) ]]; then
            echo "Error: '$string' don't match to '$regEx', but should match!"
        else
            # we need to store BASH_REMATCH in another variable, because 
            #  we do a nested regex-match which overrides the content of BASH_REMATCH
            match=("${BASH_REMATCH[@]}")  # copy of an array
            i=1
            for group_match in ${groups[@]}; do
                if  [[ ! ${match[$i]} =~ $group_match  ]]; then
                    echo "Error: Group $i don't match to '${match[$i]}'."
                fi
                ((i=i+1))
                echo extracted: ${BASH_REMATCH[@]}
            done
            echo
        fi    
    done
}
In [130]:
regex='YOUR_REG_EX_HERE' # replace the regex string.

Your solution should match all "pdf"-files (files with suffix ".pdf") that begin with "file". Extract the filename without the suffix .pdf inside the first group.

In [133]:
groups=('^file.*')
groups_should_match $regex $groups "file_record_transcript.pdf" "file_07241999.pdf"
should_not_match "$regex" "file_fake.pdf.tmp" "starts_not_with_file.pdf"
extracted: file_record_transcript

extracted: file_07241999

OK: 'file_fake.pdf.tmp' don't match to the regex.
OK: 'starts_not_with_file.pdf' don't match to the regex.

Your solution should match all "txt"-files (files with suffix .txt) that begin with file. After file there should be arbitrary chars followed by an _ and a number.

  • Extract the filename without the trailing numbers in the first group
  • Extract the numbers in the second group
  • Extract the suffix .txt in the third group.

E.g. file_record_transcript_66.txt should give the following groups:

  1. file_record_transcript_
  2. 66
  3. .txt
In [134]:
regex='YOUR_REG_EX_HERE' # replace the regex string.
In [136]:
groups=('^file.*' '^[[:digit:]]+$' '^\.txt$')
groups_should_match $regex $groups "file_record_transcript_66.txt" "file_a_7_and_more_07241999.txt"
should_not_match "$regex" "file_fake.txt.tmp" "starts_not_with_file.txt"
extracted: file_record_transcript_
extracted: 66
extracted: .txt

extracted: file_a_7_and_more_
extracted: 07241999
extracted: .txt

OK: 'file_fake.txt.tmp' don't match to the regex.
OK: 'starts_not_with_file.txt' don't match to the regex.

Or-Conditional

  • (cats|dogs) can be used to match 'cats' or 'dogs'.
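
The parentheses matter as soon as anchors or other tokens surround the alternatives, because | binds more weakly than everything else. A sketch under that assumption, with a hypothetical helper `matches` and made-up strings:

```bash
# Hypothetical helper: prints yes/no for a regex and a string.
matches() { [[ $2 =~ $1 ]] && echo yes || echo no; }

grouped='^(cats|dogs)$'   # the whole string is "cats" or "dogs"
ungrouped='^cats|dogs$'   # reads as: starts with "cats" OR ends with "dogs"

for s in cats dogs catsup hotdogs birds; do
    echo "$s: grouped=$(matches "$grouped" "$s") ungrouped=$(matches "$ungrouped" "$s")"
done
```

Only the grouped pattern rejects "catsup" and "hotdogs".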
In [137]:
regex='YOUR_REG_EX_HERE' # replace the reg-ex string.
In [139]:
should_match "$regex" 'I love cats' 'I love bats' 'I love dogs' 'I love hogs'
should_not_match "$regex" 'I love rats' 'I love rogs' 'I love vogs' 'I love mats'
OK: 'I love cats' match to the regex.
OK: 'I love bats' match to the regex.
OK: 'I love dogs' match to the regex.
OK: 'I love hogs' match to the regex.
OK: 'I love rats' don't match to the regex.
OK: 'I love rogs' don't match to the regex.
OK: 'I love vogs' don't match to the regex.
OK: 'I love mats' don't match to the regex.