Introduction to Regular Expressions with Examples
A regular expression is a sequence of characters that specifies a search pattern. This tutorial is written in Scala, but the tokens/patterns can be used in other programming languages as well.
Summary
Char
Digit / Alphanumeric / Whitespace
\d: Matches any digit from 0 to 9.
\D: Matches any non-digit character.
\w: Matches any alphanumeric character.
\W: Matches any non-alphanumeric character.
\s: Matches any whitespace character.
\S: Matches any non-whitespace character.
Note: In Scala strings, you need to double the backslash for \d, as in "\\d".r.
Wildcard
.: The wildcard character matches any single character (letter, digit, whitespace, etc.).
Match Character
[abc]: Matches specific characters ‘a’, ‘b’, or ‘c’.
[^abc]: Excludes specific characters ‘a’, ‘b’, or ‘c’.
Range
[a-z]: Matches a character within the range from ‘a’ to ‘z’.
[^a-z]: Excludes a character within the range from ‘a’ to ‘z’.
[a-z0-9]: Matches a character within multiple ranges.
String
Match String
"abc": Matches a substring that is the same as the pattern.
Repetitions
{m}: Matches ‘m’ repetitions of the preceding character or group.
{m,n}: Matches between ‘m’ and ‘n’ repetitions of the preceding character or group.
{m,}: Matches ‘m’ or more repetitions of the preceding character or group.
*: Kleene Star – Matches 0 or more repetitions.
+: Kleene Plus – Matches 1 or more repetitions.
Starting and Ending
^: Matches the start of a line.
$: Matches the end of a line.
Capture Group
(…): Defines a capture group.
match ... case ...: Captures groups in Scala.
Optional
?: Matches either zero or one of the preceding character or group.
(foo|bar): Matches either ‘foo’ or ‘bar’.
Capture All
.*: Matches everything.
// Scala dependency
import scala.util.matching.Regex
Char
Digit / Alphanumeric / Whitespace
\d: Any digit from 0 to 9
The preceding backslash \ distinguishes it from the plain ‘d’ character and marks it as a metacharacter.
Note: In a Scala string, you need to double the backslash for \d, as in "\\d".r
For example, "\\d":
- Matches 1 in 1234.
- Matches 2 in 2 foo.
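The escaping note can be checked directly. The sketch below builds the same one-digit pattern three ways: with a doubled backslash, with the raw interpolator (used later in this tutorial), and with a triple-quoted string. The object and value names are illustrative only.

```scala
import scala.util.matching.Regex

object EscapingDemo {
  // Three spellings of the same one-digit pattern.
  val escaped: Regex = "\\d".r          // ordinary string: the backslash is doubled
  val rawForm: Regex = raw"\d".r        // raw interpolator: written as-is
  val tripleQuoted: Regex = """\d""".r  // triple-quoted string: written as-is

  def main(args: Array[String]): Unit = {
    // Each Regex exposes its pattern text through the `regex` field.
    println(escaped.regex == rawForm.regex)       // true
    println(rawForm.regex == tripleQuoted.regex)  // true
    println(escaped.findFirstIn("2 foo"))         // Some(2)
  }
}
```

Which spelling to use is mostly a matter of taste; the raw and triple-quoted forms tend to be easier to read once a pattern contains several backslashes.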
\D: Any non-digit character
For example, "\\D":
- Matches " " (space) in 1234 a.
- Matches a in a 2 foo.
\w: Any alphanumeric character
Equivalent to the character range [A-Za-z0-9_], so the underscore is included as well.
For example, "\\w" matches:
- A in Ana.
- 0 in *012.
And skips "***".
\W: Any non-alphanumeric character
For example, "\\W" matches * in "***" and skips the letters and digits in:
- Ana.
- 0123 Bob (only the space matches).
\s: Any whitespace character
Whitespace includes space " ", tab \t, new line \n, and carriage return \r.
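val pattern = "\\d".r
val text = "1234"      // pattern.findFirstIn(text) == Some("1")

val pattern = "\\D".r
val text = "1234 a"    // pattern.findFirstIn(text) == Some(" ")

val pattern = "\\w".r
val text = "*012"      // pattern.findFirstIn(text) == Some("0")

val pattern = "\\W".r
val text = "***"       // pattern.findFirstIn(text) == Some("*")

val pattern = "\\d.\\s+abc".r
val text = "3. abc"    // pattern.findFirstIn(text) == Some("3. abc")

val pattern = "\\d.\\s+abc".r
val text = "4.abc"     // pattern.findFirstIn(text) == None, \s+ needs at least one whitespace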
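One way to run pattern and text pairs like the ones above is `findFirstIn`, which returns an Option holding the first match, and `findAllIn`, which iterates over every match. The following is a minimal sketch; the object name and the sample sentence are illustrative and not part of the original examples.

```scala
import scala.util.matching.Regex

object CharClassDemo {
  val digit: Regex = "\\d".r
  val whitespace: Regex = "\\s".r

  // Hypothetical sample text containing digits, letters, and spaces.
  val sample = "4 cats, 12 dogs"

  // First digit in the sample, wrapped in an Option.
  val firstDigit: Option[String] = digit.findFirstIn(sample)     // Some("4")
  // Every digit in the sample, in order of appearance.
  val allDigits: List[String] = digit.findAllIn(sample).toList   // List("4", "1", "2")
  // Number of whitespace characters in the sample.
  val spaceCount: Int = whitespace.findAllIn(sample).size        // 3

  def main(args: Array[String]): Unit = {
    println(firstDigit)
    println(allDigits)
    println(spaceCount)
  }
}
```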
Wildcard
.: The wildcard character
A wildcard is a card that can represent any card in the deck in poker games. Similarly, . (dot) can match any single character: a letter, a digit, whitespace, and so on. By default, the only exception is a line terminator such as \n.
Note: . is the wildcard, while \\. (in a Scala string) is the literal dot symbol, or period.
For example, ...\\. matches:
- "cat."
- "896."
- "?=+."
And skips "abc1".
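val pattern = "...\\.".r
val text = "cat."    // pattern.findFirstIn(text) == Some("cat.")

val pattern = "...\\.".r
val text = "abc1"    // pattern.findFirstIn(text) == None, there is no literal period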
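The wildcard examples can be run over a list in one go. This sketch filters the sample strings from this section with `findFirstIn`, and also checks that, by default, the dot does not match a newline. The object and value names are illustrative.

```scala
import scala.util.matching.Regex

object WildcardDemo {
  // Three wildcards followed by a literal period.
  val pattern: Regex = "...\\.".r
  val candidates = List("cat.", "896.", "?=+.", "abc1")

  // Keep only the strings in which the pattern is found.
  val matched: List[String] =
    candidates.filter(s => pattern.findFirstIn(s).isDefined)     // List("cat.", "896.", "?=+.")

  // Without extra flags, a lone dot finds nothing in a string that is just a newline.
  val dotOnNewline: Option[String] = ".".r.findFirstIn("\n")     // None

  def main(args: Array[String]): Unit = {
    println(matched)
    println(dotOnNewline)
  }
}
```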
Match Character
[abc]: Match specific characters
Define the specific characters you want to match inside square brackets. The pattern [abc] will only match a single a, b, or c letter and nothing else.
For example, [cmf]an matches:
- "can"
- "man"
- "fan"
And skips:
- "dan"
- "ran"
- "pan"
[^abc]: Exclude specific characters
Exclude specific characters by using square brackets and the ^ (hat). For example, the pattern [^abc] will match any single character except for the letters a, b, or c.
Note: This hat, used inside brackets to exclude characters, is different from the hat used as “start of the line” (as in ^start), which can be confusing when reading regular expressions.
For example, [^cmf]an matches:
- "dan"
- "ran"
- "pan"
And skips:
- "can"
- "man"
- "fan"
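val pattern = "[cmf]an".r
val text = "can"    // pattern.findFirstIn(text) == Some("can")

val pattern = "[cmf]an".r
val text = "dan"    // pattern.findFirstIn(text) == None

val pattern = "[^cmf]an".r
val text = "dan"    // pattern.findFirstIn(text) == Some("dan")

val pattern = "[^cmf]an".r
val text = "can"    // pattern.findFirstIn(text) == None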
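To see both bracket forms side by side, the sketch below runs [cmf]an and [^cmf]an over the six sample words from this section. The object and value names are illustrative.

```scala
import scala.util.matching.Regex

object CharSetDemo {
  val words = List("can", "man", "fan", "dan", "ran", "pan")

  val included: Regex = "[cmf]an".r    // first letter must be c, m, or f
  val excluded: Regex = "[^cmf]an".r   // first letter must NOT be c, m, or f

  // Words in which each pattern is found.
  val hitsIncluded: List[String] =
    words.filter(w => included.findFirstIn(w).isDefined)   // List("can", "man", "fan")
  val hitsExcluded: List[String] =
    words.filter(w => excluded.findFirstIn(w).isDefined)   // List("dan", "ran", "pan")

  def main(args: Array[String]): Unit = {
    println(hitsIncluded)
    println(hitsExcluded)
  }
}
```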
Range
[a-z]: Match a character within the range
Match a character in a list of sequential characters by using the dash to indicate a character range.
For example, [0-6] matches any single digit character from 0 to 6.
[^a-z]: Exclude a character within the range
For example, [^n-p] matches any single character except for the letters n to p.
[a-z0-9]: Match a character within multiple ranges
Multiple character ranges can also be used in the same set of brackets.
For example, [A-Z0-9] matches any single character from A to Z or from 0 to 9.
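val pattern = "[A-C][n-p][a-c]".r
val text = "Ana"    // pattern.findFirstIn(text) == Some("Ana")

val pattern = "[A-C][n-p][a-c]".r
val text = "aax"    // pattern.findFirstIn(text) == None

val pattern = "[A-C0-9][A-C0-9]".r
val text = "A0x"    // pattern.findFirstIn(text) == Some("A0")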
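The three range patterns described in this section ([0-6], [^n-p], and [A-Z0-9]) can be tried with `findAllIn`, which collects every matching character. This is a small sketch; the input strings are made up for illustration.

```scala
import scala.util.matching.Regex

object RangeDemo {
  val upToSix: Regex = "[0-6]".r         // digits 0 through 6
  val notNtoP: Regex = "[^n-p]".r        // anything except n, o, p
  val upperOrDigit: Regex = "[A-Z0-9]".r // an uppercase letter or a digit

  // Concatenate every single-character match found in each sample.
  val digitsKept: String = upToSix.findAllIn("0123456789").mkString   // "0123456"
  val lettersKept: String = notNtoP.findAllIn("mnopq").mkString       // "mq"
  val mixedKept: String = upperOrDigit.findAllIn("aB3-dE").mkString   // "B3E"

  def main(args: Array[String]): Unit = {
    println(digitsKept)
    println(lettersKept)
    println(mixedKept)
  }
}
```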
String
Match String
"abc": Match a substring that is the same as the pattern
For example, "foo 1" matches "foo 1" in "foo 1 fooo".
val pattern = "foo 1".r
val text = "foo 1 fooo"    // pattern.findFirstIn(text) == Some("foo 1")
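Literal matching works as long as the string contains no metacharacters. When it does, one option is `Regex.quote` from the standard library, which wraps the text so that every character is taken literally. The sketch below contrasts the two; the version string and sample text are hypothetical.

```scala
import scala.util.matching.Regex

object LiteralDemo {
  // A plain string with no metacharacters matches itself.
  val plain: Regex = "foo 1".r
  val plainHit: Option[String] = plain.findFirstIn("foo 1 fooo")   // Some("foo 1")

  // Hypothetical literal that contains a metacharacter (the dot).
  val version = "1.5"
  val unquoted: Regex = version.r              // the dot acts as a wildcard here
  val quoted: Regex = Regex.quote(version).r   // the dot is taken literally

  val sample = "125 or 1.5"
  val unquotedHit: Option[String] = unquoted.findFirstIn(sample)   // Some("125")
  val quotedHit: Option[String] = quoted.findFirstIn(sample)       // Some("1.5")

  def main(args: Array[String]): Unit = {
    println(plainHit)
    println(unquotedHit)
    println(quotedHit)
  }
}
```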
Repetitions
{m}: m repetitions
For example, B{3} matches the B character exactly three times.
{m,n}: m to n repetitions
For example, B{1,3} matches the B character 1 to 3 times.
{m,}: m or more repetitions
For example, B{3,} matches the B character at least 3 times.
Note: {,m} is illegal and will result in an error. Use {0,m} instead.
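val pattern = "pur{3}".r
val text = "purrrrr"            // pattern.findFirstIn(text) == Some("purrr"), {3} applies to the r only

val pattern = "pur{1,3}".r
val text = "purrr"              // pattern.findFirstIn(text) == Some("purrr")

val pattern = "pur{1,3}".r
val text = "pu"                 // pattern.findFirstIn(text) == None, at least one r is required

val pattern = "pur{3,}".r
val text = "purrrrrrr"          // pattern.findFirstIn(text) == Some("purrrrrrr")

val pattern = "\\w+".r
val text = ""                   // pattern.findFirstIn(text) == None, + needs at least one character

val pattern = "\\w*".r
val text = ""                   // pattern.findFirstIn(text) == Some(""), * accepts zero characters

val pattern = "\\w*".r
val text = "anyAlphanumeric"    // pattern.findFirstIn(text) == Some("anyAlphanumeric")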
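The repetition forms can be compared on a single input. In this sketch, every quantifier applies only to the character just before it (the r), and each one takes as many repetitions as it is allowed. The names are illustrative.

```scala
import scala.util.matching.Regex

object RepetitionDemo {
  // "pu" followed by five r characters.
  val text = "purrrrr"

  val exactlyThree: Regex = "pur{3}".r
  val oneToThree: Regex = "pur{1,3}".r
  val threeOrMore: Regex = "pur{3,}".r

  val exact: Option[String] = exactlyThree.findFirstIn(text)    // Some("purrr")
  val bounded: Option[String] = oneToThree.findFirstIn(text)    // Some("purrr"), stops at three r's
  val open: Option[String] = threeOrMore.findFirstIn(text)      // Some("purrrrr"), takes all five

  // Star accepts an empty match; plus does not.
  val starOnEmpty: Option[String] = "\\w*".r.findFirstIn("")    // Some("")
  val plusOnEmpty: Option[String] = "\\w+".r.findFirstIn("")    // None

  def main(args: Array[String]): Unit = {
    println(exact)
    println(bounded)
    println(open)
    println(starOnEmpty)
    println(plusOnEmpty)
  }
}
```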
Starting and Ending
^: Start of the line
For example, ^success matches only a line that begins with the word “success”, but not the line “Error: unsuccessful operation”.
Note: It is different from the hat used inside a set of brackets [^...] for excluding characters, which can be confusing when reading regular expressions.
$: End of the line
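val pattern = "end$".r
val text = "The end"      // pattern.findFirstIn(text) == Some("end")

val pattern = "^start".r
val text = "starting"     // pattern.findFirstIn(text) == Some("start")

val pattern = "^start".r
val text = "Now start"    // pattern.findFirstIn(text) == None, "start" is not at the beginning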
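The two anchors can be exercised on the “success” example from this section. This sketch uses made-up single-line inputs; the object and value names are illustrative.

```scala
import scala.util.matching.Regex

object AnchorDemo {
  val startsWithSuccess: Regex = "^success".r
  val endsWithEnd: Regex = "end$".r

  // Hypothetical log lines; only the first begins with "success".
  val lines = List("success: 3 files copied", "Error: unsuccessful operation")
  val successLines: List[String] =
    lines.filter(l => startsWithSuccess.findFirstIn(l).isDefined)   // first line only

  val atEnd: Option[String] = endsWithEnd.findFirstIn("The end")       // Some("end")
  val notAtEnd: Option[String] = endsWithEnd.findFirstIn("The ending") // None

  def main(args: Array[String]): Unit = {
    println(successLines)
    println(atEnd)
    println(notAtEnd)
  }
}
```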
Capture Group
Regular expressions allow us not just to match text but also to extract information for further processing. This is done by defining groups of characters and capturing them using the special parentheses ( and ) metacharacters. Any subpattern inside a pair of parentheses will be captured as a group.
(…): Capture Group
Imagine that you had a command line tool to list all the image files you have in the cloud. You could then use a pattern such as ^(IMG\d+\.png)$ to capture and extract the full filename, but if you only wanted to capture the filename without the extension, you could use the pattern ^(IMG\d+)\.png$, which only captures the part before the period.
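The two image-file patterns can be compared in Scala with `findFirstMatchIn`, which returns a Match object whose `group(1)` is the first captured group. This is a sketch under assumptions: the helper `firstGroup` and the file names are hypothetical.

```scala
import scala.util.matching.Regex

object CaptureDemo {
  // Triple-quoted strings avoid doubling the backslashes.
  val fullName: Regex = """^(IMG\d+\.png)$""".r   // group 1 = whole file name
  val baseName: Regex = """^(IMG\d+)\.png$""".r   // group 1 = name without extension

  // Hypothetical helper: the first captured group, if the pattern matches.
  def firstGroup(pattern: Regex, file: String): Option[String] =
    pattern.findFirstMatchIn(file).map(_.group(1))

  val withExtension: Option[String] = firstGroup(fullName, "IMG0042.png")   // Some("IMG0042.png")
  val withoutExtension: Option[String] = firstGroup(baseName, "IMG0042.png") // Some("IMG0042")
  val notAnImage: Option[String] = firstGroup(baseName, "notes.txt")        // None

  def main(args: Array[String]): Unit = {
    println(withExtension)
    println(withoutExtension)
    println(notAnImage)
  }
}
```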
match ... case: Capture Group in Scala
val date = raw"(\d{4})-(\d{2})-(\d{2})".r
To extract the capturing groups when a Regex is matched, use it as an extractor in a pattern match:
"2004-01-20" match {
case date(year, month, day) => s"$year $month $day"
}
To check only whether the Regex matches, ignoring any groups, use a sequence wildcard:
"2004-01-20" match {
case date(_*) => "It's a date!"
}
Extracting only the year from a date could also be expressed with a sequence wildcard:
"2004-01-20" match {
case date(year, _*) => s"$year"
}
In a pattern match, a Regex normally has to match the entire input. An unanchored Regex, however, finds the pattern anywhere in the input.
val embeddedDate = date.unanchored
"Date: 2004-01-20 17:25:18 GMT (10 years, 28 weeks, 5 days, 17 hours and 51 minutes ago)" match {
case embeddedDate("2004", "01", "20") => "A Scala is born."
}
In comparison, the match fails if we use the anchored date pattern, because it has to match the entire input.
val date = raw"(\d{4})-(\d{2})-(\d{2})".r
"Date: 2004-01-20 17:25:18 GMT (10 years, 28 weeks, 5 days, 17 hours and 51 minutes ago)" match {
case date("2004", "01", "20") => "A Scala is born."
}
Error message:
scala.MatchError:
Date: 2004-01-20 17:25:18 GMT (10 years, 28 weeks, 5 days, 17 hours and 51 minutes ago)
(of class java.lang.String)
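A MatchError like this one arises because the match expression has no case for an input the pattern does not cover. A common way to avoid it, sketched below, is to add a fallback case, or to return an Option. The function names `describe` and `findYear` are hypothetical.

```scala
import scala.util.matching.Regex

object SafeMatchDemo {
  val date: Regex = raw"(\d{4})-(\d{2})-(\d{2})".r
  val embeddedDate: Regex = date.unanchored

  // Anchored: the whole input must be a date, otherwise the fallback case applies.
  def describe(input: String): String = input match {
    case date(year, month, day) => s"$year $month $day"
    case _                      => "no date"
  }

  // Unanchored: the date may appear anywhere; None signals that none was found.
  def findYear(input: String): Option[String] = input match {
    case embeddedDate(year, _*) => Some(year)
    case _                      => None
  }

  def main(args: Array[String]): Unit = {
    println(describe("2004-01-20"))                      // 2004 01 20
    println(describe("Date: 2004-01-20 17:25:18 GMT"))   // no date
    println(findYear("Date: 2004-01-20 17:25:18 GMT"))   // Some(2004)
    println(findYear("no digits here"))                  // None
  }
}
```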
Optional
?: Match either zero or one of the preceding character or group
For example, ab?c matches either the string "abc" or "ac" because the ‘b’ is considered optional.
Note: The question mark is a special character, and you will have to escape it using a backslash (\?) to match a plain question mark character in a string.
(foo|bar): Match foo or bar
For example, (abc|def) matches abc or def.
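The optional marker, the alternation, and the escaped question mark can each be tried in a few lines. This is a minimal sketch with made-up inputs; the names are illustrative.

```scala
import scala.util.matching.Regex

object OptionalDemo {
  val optionalB: Regex = "ab?c".r       // a, then zero or one b, then c
  val either: Regex = "(abc|def)".r     // abc or def
  val literalQuestion: Regex = "found\\?".r // "found" followed by a real question mark

  // "abbc" has two b's, so it is not accepted.
  val optionalHits: List[String] =
    List("abc", "ac", "abbc").filter(s => optionalB.findFirstIn(s).isDefined)   // List("abc", "ac")

  // Every occurrence of either alternative, in order.
  val alternatives: List[String] = either.findAllIn("abcdefghi").toList         // List("abc", "def")

  val question: Option[String] = literalQuestion.findFirstIn("1 file found?")   // Some("found?")

  def main(args: Array[String]): Unit = {
    println(optionalHits)
    println(alternatives)
    println(question)
  }
}
```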
Exercise
Match:
1 file found?
2 files found?
24 files found?
Skip:
No files found.
Solution: \d+ files? found\?
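The solution can be checked against the exercise strings in Scala. This sketch writes the pattern in a triple-quoted string so the backslashes need no doubling; the object and value names are illustrative.

```scala
import scala.util.matching.Regex

object ExerciseDemo {
  // One or more digits, "file" with an optional s, and a literal question mark.
  val solution: Regex = """\d+ files? found\?""".r

  val shouldMatch = List("1 file found?", "2 files found?", "24 files found?")
  val shouldSkip = "No files found."

  // True when the pattern is found in every string that should match.
  val allMatch: Boolean = shouldMatch.forall(s => solution.findFirstIn(s).isDefined)   // true
  // True when the pattern is not found in the string that should be skipped.
  val skipped: Boolean = solution.findFirstIn(shouldSkip).isEmpty                      // true

  def main(args: Array[String]): Unit = {
    println(allMatch)
    println(skipped)
  }
}
```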
Capture All
.*: Matches everything (any run of characters on a line, including an empty one)
val pattern = ".*".r
val text = "****** any text 123456 ------------"
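Running the pattern shows what “everything” means in practice. In this sketch, `.*` takes the whole of a single-line text, stops at a newline by default, and also succeeds on the empty string. The two-line input is made up for illustration.

```scala
import scala.util.matching.Regex

object CaptureAllDemo {
  val everything: Regex = ".*".r
  val text = "****** any text 123456 ------------"

  // On a single line, the match is the entire text.
  val whole: Option[String] = everything.findFirstIn(text)                       // Some(text)
  // By default the dot stops at a line break, so only the first line is taken.
  val firstLine: Option[String] = everything.findFirstIn("line one\nline two")   // Some("line one")
  // Zero repetitions are allowed, so the empty string matches too.
  val onEmpty: Option[String] = everything.findFirstIn("")                       // Some("")

  def main(args: Array[String]): Unit = {
    println(whole)
    println(firstLine)
    println(onEmpty)
  }
}
```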
Reference
- RegexOne – Learn Regular Expressions – Lesson 1: An Introduction, and the ABCs. [Online] Available at: RegexOne – Learn Regular Expressions [Accessed on June 5, 2021].
- Scala – Regular Expressions – Tutorialspoint. [Online] Available at: Scala – Regular Expressions – Tutorialspoint [Accessed on June 5, 2021].
- regex101: build, test, and debug regex. [Online] regex101. Available at: regex101 [Accessed on June 5, 2021].