Regular Expression

Introduction to Regular Expressions with Examples

A regular expression is a sequence of characters that specifies a search pattern. The examples in this tutorial are written in Scala, but most of the tokens and patterns can be used in other programming languages as well.

Summary

Char

Digit / Alphanumeric / Whitespace

  • \d: Matches any digit from 0 to 9.
  • \D: Matches any non-digit character.
  • \w: Matches any alphanumeric character.
  • \W: Matches any non-alphanumeric character.
  • \s: Matches any whitespace character.
  • \S: Matches any non-whitespace character.

Note: In Scala strings, you need to double the backslash, so \d is written as "\\d".r.

Wildcard

  • .: The wildcard character matches any single character (letter, digit, whitespace, etc.).

Match Character

  • [abc]: Matches specific characters ‘a’, ‘b’, or ‘c’.
  • [^abc]: Excludes specific characters ‘a’, ‘b’, or ‘c’.

Range

  • [a-z]: Matches a character within the range from ‘a’ to ‘z’.
  • [^a-z]: Excludes a character within the range from ‘a’ to ‘z’.
  • [a-z0-9]: Matches a character within multiple ranges.

String

Match String

  • "abc": Matches a substring that is the same as the pattern.

Repetitions

  • {m}: Matches ‘m’ repetitions of the preceding character or group.
  • {m,n}: Matches between ‘m’ and ‘n’ repetitions of the preceding character or group.
  • {m,}: Matches ‘m’ or more repetitions of the preceding character or group.
  • *: Kleene Star – Matches 0 or more repetitions.
  • +: Kleene Plus – Matches 1 or more repetitions.

Starting and Ending

  • ^: Matches the start of a line.
  • $: Matches the end of a line.

Capture Group

  • (…): Defines a capture group.
  • match { case ... }: Extracts capture groups in Scala.

Optional

  • ?: Matches either zero or one of the preceding character or group.
  • (foo|bar): Matches either ‘foo’ or ‘bar’.

Capture All

  • .*: Matches everything.

// Scala dependency
import scala.util.matching.Regex

Char

Digit / Alphanumeric / Whitespace

\d: Any digit from 0 to 9

The preceding backslash \ distinguishes it from the plain ‘d’ character and indicates a metacharacter.

Note: In a Scala string you need to double the backslash, so \d is written as "\\d".r.

For example, "\\d":

  • Matches 1 in 1234.
  • Matches 2 in 2 foo.

\D: Any non-digit character

For example, "\\D":

  • Matches " " (space) in 1234 a.
  • Matches a in a 2 foo.

\w: Any alphanumeric character

Equivalent to the character range [A-Za-z0-9_].

For example, "\\w" matches:

  • A in Ana.
  • 0 in *012.

And skips "***".

\W: Any non-alphanumeric character

For example, "\\W" matches * in "***" and skips:

  • Ana.
  • 0123 Bob.

\s: Any whitespace character

Whitespace includes space " ", tab \t, new line \n, and carriage return \r.

val pattern = "\\d".r
val text = "1234"
val pattern = "\\D".r
val text = "1234 a"
val pattern = "\\w".r
val text = "*012"
val pattern = "\\W".r
val text = "***"
val pattern = "\\d.\\s+abc".r
val text = "3.           abc"
val pattern = "\\d.\\s+abc".r
val text = "4.abc"
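
The snippets above only define patterns and sample texts. A minimal sketch of how they could be applied, assuming the findFirstIn and findAllIn methods of scala.util.matching.Regex (the object name CharClassDemo and the extra sample strings are illustrative, not part of the original tutorial):

```scala
import scala.util.matching.Regex

object CharClassDemo {
  val digit: Regex      = "\\d".r
  val nonDigit: Regex   = "\\D".r
  val wordChar: Regex   = "\\w".r
  val whitespace: Regex = "\\s".r

  // findFirstIn returns the first match wrapped in an Option
  val firstDigit: Option[String]    = digit.findFirstIn("2 foo")      // Some("2")
  val firstNonDigit: Option[String] = nonDigit.findFirstIn("a 2 foo") // Some("a")
  val noWordChar: Option[String]    = wordChar.findFirstIn("***")     // None

  // findAllIn iterates over every match in the text
  val allWordChars: List[String] = wordChar.findAllIn("*012").toList  // List("0", "1", "2")

  // Hypothetical sample containing a space, a tab, and a newline
  val whitespaceCount: Int = whitespace.findAllIn("a b\tc\nd").toList.length // 3

  def main(args: Array[String]): Unit = {
    println(firstDigit)
    println(firstNonDigit)
    println(noWordChar)
    println(allWordChars)
    println(whitespaceCount)
  }
}
```

Returning an Option makes the "skips" cases explicit: a text with no match yields None instead of throwing.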

Wildcard

.: The wildcard character

A wildcard is a card that can represent any card in the deck in poker games. Similarly, . (dot) can match any single character (letter, digit, whitespace, and so on). By default, the only exception is a line terminator such as \n.

Note: . is the wildcard, whereas \\. matches a literal dot symbol (period).

For example, ...\\. matches:

  • "cat."
  • "896."
  • "?=+."

And skips abc1.

val pattern = "...\\.".r
val text = "cat."
val pattern = "...\\.".r
val text = "abc1"
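
To make the difference between the wildcard and the escaped dot concrete, here is a small sketch. The patterns a.c and a\\.c, the sample list, and the name WildcardDemo are assumptions chosen for illustration:

```scala
object WildcardDemo {
  val anyMiddle     = "a.c".r   // unescaped dot: any single character
  val literalMiddle = "a\\.c".r // escaped dot: only a period

  val samples: List[String] = List("abc", "a c", "a.c", "ac")

  // Keep the samples in which each pattern finds a match
  val wildcardHits: List[String] =
    samples.filter(s => anyMiddle.findFirstIn(s).isDefined)     // List("abc", "a c", "a.c")
  val literalHits: List[String] =
    samples.filter(s => literalMiddle.findFirstIn(s).isDefined) // List("a.c")

  def main(args: Array[String]): Unit = {
    println(wildcardHits)
    println(literalHits)
  }
}
```

Note that "ac" is skipped by both patterns: the wildcard still requires exactly one character between a and c.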

Match Character

[abc]: Match specific characters

Define the specific characters you want to match inside square brackets. The pattern [abc] will only match a single a, b, or c letter and nothing else.

For example, [cmf]an matches:

  • "can"
  • "man"
  • "fan"

And skips:

  • dan
  • ran
  • pan

[^abc]: Exclude specific characters

Exclude specific characters by using the square brackets and the ^ (hat). For example, the pattern [^abc] will match any single character except for the letters a, b, or c.

Note: The hat used inside brackets to exclude characters is different from the hat used as “start of the line” (as in ^start), which can be confusing when reading regular expressions.

For example, [^cmf]an matches:

  • dan
  • ran
  • pan

And skips:

  • "can"
  • "man"
  • "fan"

val pattern = "[cmf]an".r
val text = "can"
val pattern = "[cmf]an".r
val text = "dan"
val pattern = "[^cmf]an".r
val text = "dan"
val pattern = "[^cmf]an".r
val text = "can"
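
The match and skip lists above can be reproduced by filtering a word list with each pattern. This is only a sketch; the object name CharSetDemo is hypothetical:

```scala
object CharSetDemo {
  val words: List[String] = List("can", "man", "fan", "dan", "ran", "pan")

  val include = "[cmf]an".r  // first letter must be c, m, or f
  val exclude = "[^cmf]an".r // first letter must NOT be c, m, or f

  val included: List[String] =
    words.filter(w => include.findFirstIn(w).isDefined) // List("can", "man", "fan")
  val excluded: List[String] =
    words.filter(w => exclude.findFirstIn(w).isDefined) // List("dan", "ran", "pan")

  def main(args: Array[String]): Unit = {
    println(included)
    println(excluded)
  }
}
```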

Range

[a-z]: Match a character within the range

Match a character in a list of sequential characters by using the dash to indicate a character range.

For example, [0-6] matches any single digit character from 0 to 6.

[^a-z]: Exclude a character within the range

For example, [^n-p] matches any single character except for letters n to p.

[a-z0-9]: Match a character within multiple ranges

Multiple character ranges can also be used in the same set of brackets.

For example, [A-Z0-9] matches any single character from A to Z or from 0 to 9.

val pattern = "[A-C][n-p][a-c]".r
val text = "Ana"
val pattern = "[A-C][n-p][a-c]".r
val text = "aax"
val pattern = "[A-C0-9][A-C0-9]".r
val text = "A0x"
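
As a quick check of the three range forms, the sketch below collects every character that each range accepts. The sample strings and the name RangeDemo are assumptions made for this example:

```scala
object RangeDemo {
  val zeroToSix    = "[0-6]".r
  val notNtoP      = "[^n-p]".r
  val upperOrDigit = "[A-Z0-9]".r

  // mkString joins all single-character matches back into one string
  val digitsKept: String  = zeroToSix.findAllIn("0123456789").mkString // "0123456"
  val lettersKept: String = notNtoP.findAllIn("mnopq").mkString        // "mq"
  val idChars: String     = upperOrDigit.findAllIn("aB3-x9Z").mkString // "B39Z"

  def main(args: Array[String]): Unit = {
    println(digitsKept)
    println(lettersKept)
    println(idChars)
  }
}
```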

String

Match String

“abc”: Match a substring that is the same as the pattern

For example, "foo 1" matches "foo 1" in "foo 1 fooo".

val pattern = "foo 1".r
val text = "foo 1 fooo"
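
A literal pattern only behaves literally as long as it contains no metacharacters. The sketch below shows the plain case and, as an aside not covered by the original tutorial, how Regex.quote from the Scala standard library could be used when the text to find contains characters such as +. The sample "1+1=2" and the name LiteralDemo are illustrative:

```scala
import scala.util.matching.Regex

object LiteralDemo {
  // A plain literal matches the identical substring
  val literal: Regex = "foo 1".r
  val hits: List[String] = literal.findAllIn("foo 1 fooo").toList // List("foo 1")

  // "1+1" is read as "one or more 1s, then a 1", so it does not find the text 1+1
  val unquoted: Option[String] = "1+1".r.findFirstIn("1+1=2") // None

  // Regex.quote escapes the whole string so it is matched character by character
  val quoted: Option[String] = Regex.quote("1+1").r.findFirstIn("1+1=2") // Some("1+1")

  def main(args: Array[String]): Unit = {
    println(hits)
    println(unquoted)
    println(quoted)
  }
}
```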

Repetitions

{m}: m repetitions

For example, B{3} matches the B character exactly three times.

{m,n}: m to n repetitions

For example, B{1,3} matches the B character one to three times.

{m,}: m or more repetitions

For example, B{3,} matches the B character at least three times.

Note: {,m} is illegal and will result in an error. Use {0,m} to match at most m repetitions.

val pattern = "pur{3}".r
val text = "purrrrr"
val pattern = "pur{1,3}".r
val text = "purrr"
val pattern = "pur{1,3}".r
val text = "pu"
val pattern = "pur{3,}".r
val text = "purrrrrrr"
val pattern = "\\w+".r
val text = ""
val pattern = "\\w*".r
val text = ""
val pattern = "\\w*".r
val text = "anyAlphanumeric"
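
The following sketch shows what these quantifiers actually return on the same input, how + and * differ on an empty string, and that a missing lower bound is rejected. It assumes Scala on the JVM, where patterns are compiled by java.util.regex; the name RepetitionDemo is hypothetical:

```scala
import scala.util.Try

object RepetitionDemo {
  // Quantifiers are greedy: they take as many repetitions as they are allowed
  val exactlyThree: Option[String] = "pur{3}".r.findFirstIn("purrrrr")   // Some("purrr")
  val upToThree: Option[String]    = "pur{1,3}".r.findFirstIn("purrrrr") // Some("purrr")
  val atLeastThree: Option[String] = "pur{3,}".r.findFirstIn("purrrrr")  // Some("purrrrr")

  // + needs at least one character, while * also accepts the empty string
  val plusOnEmpty: Option[String] = "\\w+".r.findFirstIn("") // None
  val starOnEmpty: Option[String] = "\\w*".r.findFirstIn("") // Some("")

  // A missing lower bound fails as soon as the pattern is compiled
  val missingLowerBoundFails: Boolean = Try("B{,3}".r).isFailure // true

  def main(args: Array[String]): Unit = {
    println(exactlyThree)
    println(upToThree)
    println(atLeastThree)
    println(plusOnEmpty)
    println(starOnEmpty)
    println(missingLowerBoundFails)
  }
}
```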

Starting and Ending

^: Start of the line

For example, ^success matches only a line that begins with the word “success”, but not the line “Error: unsuccessful operation”.

Note: It is different from the hat used inside a set of brackets [^...] for excluding characters, which can be confusing when reading regular expressions.

$: End of the line

val pattern = "end$".r
val text = "The end"
val pattern = "^start".r
val text = "starting"
val pattern = "^start".r
val text = "Now start"
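
A short sketch of how the anchors behave when applied. One detail worth knowing on the JVM: without a flag, ^ and $ anchor to the whole input rather than to each line, and the inline flag (?m) switches them to per-line mode. The two-line log string and the name AnchorDemo are assumptions for illustration:

```scala
object AnchorDemo {
  val atStart = "^start".r
  val atEnd   = "end$".r

  val starting: Option[String] = atStart.findFirstIn("starting")  // Some("start")
  val nowStart: Option[String] = atStart.findFirstIn("Now start") // None
  val theEnd: Option[String]   = atEnd.findFirstIn("The end")     // Some("end")

  // Hypothetical two-line input: "success" starts the second line only
  val log: String = "Error: unsuccessful operation\nsuccess: done"

  // Default mode: ^ matches only at the very start of the input
  val wholeInput: Option[String] = "^success".r.findFirstIn(log)     // None
  // Multiline mode: ^ also matches right after a line terminator
  val perLine: Option[String]    = "(?m)^success".r.findFirstIn(log) // Some("success")

  def main(args: Array[String]): Unit = {
    println(starting)
    println(nowStart)
    println(theEnd)
    println(wholeInput)
    println(perLine)
  }
}
```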

Capture Group

Regular expressions allow us not just to match text but also to extract information for further processing. This is done by defining groups of characters and capturing them using the special parentheses ( and ) metacharacters. Any subpattern inside a pair of parentheses will be captured as a group.

(…): Capture Group

Imagine that you had a command line tool to list all the image files you have in the cloud. You could then use a pattern such as ^(IMG\d+\.png)$ to capture and extract the full filename, but if you only wanted to capture the filename without the extension, you could use the pattern ^(IMG\d+)\.png$, which only captures the part before the period.
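
The two filename patterns can be compared with a small sketch that reads the first capture group through findFirstMatchIn. The file names and the helper firstGroup are hypothetical:

```scala
import scala.util.matching.Regex

object FileNameDemo {
  val withExtension: Regex    = "^(IMG\\d+\\.png)$".r
  val withoutExtension: Regex = "^(IMG\\d+)\\.png$".r

  // Returns the text of capture group 1, or None when the pattern does not match
  def firstGroup(regex: Regex, text: String): Option[String] =
    regex.findFirstMatchIn(text).map(_.group(1))

  val fullName: Option[String] = firstGroup(withExtension, "IMG0412.png")    // Some("IMG0412.png")
  val baseName: Option[String] = firstGroup(withoutExtension, "IMG0412.png") // Some("IMG0412")
  val notImage: Option[String] = firstGroup(withoutExtension, "notes.txt")   // None

  def main(args: Array[String]): Unit = {
    println(fullName)
    println(baseName)
    println(notImage)
  }
}
```

Group 0 always refers to the whole match, so numbered capture groups start at 1.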

match ... case: Capture groups in Scala

val date = raw"(\d{4})-(\d{2})-(\d{2})".r

To extract the capturing groups when a Regex is matched, use it as an extractor in a pattern match:

"2004-01-20" match {
  case date(year, month, day) => s"$year $month $day"
}

To check only whether the Regex matches, ignoring any groups, use a sequence wildcard:

"2004-01-20" match {
  case date(_*) => "It's a date!"
}

Extracting only the year from a date could also be expressed with a sequence wildcard:

"2004-01-20" match {
  case date(year, _*) => s"$year"
}

In a pattern match, a Regex normally has to match the entire input. However, an unanchored Regex finds the pattern anywhere in the input.

val embeddedDate = date.unanchored

"Date: 2004-01-20 17:25:18 GMT (10 years, 28 weeks, 5 days, 17 hours and 51 minutes ago)" match {
  case embeddedDate("2004", "01", "20") => "A Scala is born."
}

In comparison, the anchored date pattern cannot match here, because the date is embedded in a longer string, so the pattern match fails:

val date = raw"(\d{4})-(\d{2})-(\d{2})".r

"Date: 2004-01-20 17:25:18 GMT (10 years, 28 weeks, 5 days, 17 hours and 51 minutes ago)" match {
  case date("2004", "01", "20") => "A Scala is born."
}

Error message:

scala.MatchError: 
Date: 2004-01-20 17:25:18 GMT (10 years, 28 weeks, 5 days, 17 hours and 51 minutes ago) 
(of class java.lang.String)
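
One way to avoid the MatchError, sketched under the assumption that a fallback value is acceptable, is to add a default case or to return an Option via findFirstMatchIn. The names SafeDateDemo, describe, and yearOf are hypothetical:

```scala
object SafeDateDemo {
  val date = raw"(\d{4})-(\d{2})-(\d{2})".r

  // A default case turns a failed match into an ordinary value
  def describe(input: String): String = input match {
    case date(year, month, day) => s"$year $month $day"
    case _                      => "not a date"
  }

  // findFirstMatchIn searches anywhere in the input and never throws on a miss
  def yearOf(input: String): Option[String] =
    date.findFirstMatchIn(input).map(_.group(1))

  def main(args: Array[String]): Unit = {
    println(describe("2004-01-20"))                  // 2004 01 20
    println(describe("Date: 2004-01-20 17:25:18 GMT")) // not a date
    println(yearOf("Date: 2004-01-20 17:25:18 GMT"))   // Some(2004)
  }
}
```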

Optional

?: Match either zero or one of the preceding character or group

For example, ab?c matches either the strings "abc" or "ac" because the ‘b’ is considered optional.

Note: The question mark is a special character, so you have to escape it with a backslash (\?) to match a plain question mark character in a string.

(foo|bar): Match foo or bar

For example, (abc|def) matches abc or def.
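
A minimal sketch of the optional quantifier, alternation, and the escaped question mark. The helper fullMatch uses the underlying java.util.regex.Pattern to test the whole string; the sample strings and the name OptionalDemo are illustrative:

```scala
import scala.util.matching.Regex

object OptionalDemo {
  // True only when the pattern covers the entire string
  def fullMatch(regex: Regex, text: String): Boolean =
    regex.pattern.matcher(text).matches()

  val optionalB: Regex = "ab?c".r
  val withB: Boolean    = fullMatch(optionalB, "abc")  // true
  val withoutB: Boolean = fullMatch(optionalB, "ac")   // true
  val doubleB: Boolean  = fullMatch(optionalB, "abbc") // false: at most one b

  // Alternation picks whichever branch matches at each position
  val either: Regex = "(abc|def)".r
  val found: List[String] = either.findAllIn("abc xyz def").toList // List("abc", "def")

  // An escaped question mark matches the literal character
  val questionMarks: Int = "\\?".r.findAllIn("why? how?").toList.length // 2

  def main(args: Array[String]): Unit = {
    println(List(withB, withoutB, doubleB))
    println(found)
    println(questionMarks)
  }
}
```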

Exercise

Match:

  • 1 file found?
  • 2 files found?
  • 24 files found?

Skip No files found.

Solution: \d+ files? found\?
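
The solution can be checked against the four lines of the exercise with a short sketch (the name ExerciseDemo is hypothetical):

```scala
object ExerciseDemo {
  // \d+ = one or more digits, s? = optional plural, \? = literal question mark
  val solution = "\\d+ files? found\\?".r

  val lines: List[String] = List(
    "1 file found?",
    "2 files found?",
    "24 files found?",
    "No files found."
  )

  // Keep only the lines that the pattern matches in full
  val matched: List[String] =
    lines.filter(line => solution.pattern.matcher(line).matches())

  def main(args: Array[String]): Unit =
    matched.foreach(println) // prints the first three lines only
}
```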


Capture All

.*: Matches everything

val pattern = ".*".r
val text = "****** any text 123456 ------------"
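
A sketch of what .* returns in practice. On the JVM, "everything" means everything up to the next line terminator unless the inline flag (?s) is set; the two-line sample and the name CaptureAllDemo are assumptions for illustration:

```scala
object CaptureAllDemo {
  val everything = ".*".r
  val text: String = "****** any text 123456 ------------"

  // On a single line, .* consumes the whole text
  val whole: Option[String] = everything.findFirstIn(text)

  // By default the dot stops at a line terminator
  val twoLines: String = "line one\nline two"
  val firstLine: Option[String] = everything.findFirstIn(twoLines) // Some("line one")

  // The (?s) flag lets the dot match line terminators as well
  val bothLines: Option[String] = "(?s).*".r.findFirstIn(twoLines)

  def main(args: Array[String]): Unit = {
    println(whole)
    println(firstLine)
    println(bothLines)
  }
}
```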
