A regex, or regular expression, is a sequence of characters that describes a pattern to match in text. It can combine literal characters with special operator characters, symbols that control the search, which makes it suitable for advanced find and find-and-replace operations.
The regex is applied on the text from left to right. Once a source character has been used in a match, it cannot be reused. For example, the regex aba will match ababababa only 2 times (aba_aba__).
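The left-to-right, no-reuse behaviour can be checked with a minimal Java sketch. The class and method names here (NonOverlappingDemo, countMatches) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonOverlappingDemo {
    // Counts how many times the regex is found while scanning left to right.
    // Each find() resumes after the previous match, so characters are not reused.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("aba", "ababababa")); // prints 2
    }
}
```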
When to use Regex
Regex is a pattern-matching technique and should be used when the strings to be matched vary or only conform to a particular pattern. When a simple, fixed string search is enough, we can use the built-in methods of the String class instead. Typical uses of regex are input validation, string parsing, syntax highlighting, data scraping, string manipulation and data mapping.
Where regex patterns are used
In practice, they are used in search engines and in the find and find-and-replace features of word processors and text editors. Here are some of the things regular expressions are most commonly written for.
Password Strength Validation
Email Validation
Username
Phone number
IP Address
Dates
HTML tags
Time
Empty String
For email validation, the regex can be written as
^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$
^ - anchors the match at the start of the input.
[a-zA-Z0-9+_.-] - matches an uppercase or lowercase English letter, a digit, +, _, . or - before the @ symbol.
+ - means the preceding character class occurs one or more times.
@ - matches the character '@' itself.
[a-zA-Z0-9.-] - matches an uppercase or lowercase English letter, a digit, . or - after the @ symbol.
$ - anchors the match at the end of the input.
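As a rough illustration, the email pattern above can be tried out in Java as follows. The class name EmailCheck, the method isValid and the sample addresses are assumptions made for this sketch.

```java
import java.util.regex.Pattern;

class EmailCheck {
    // The email pattern from the text, compiled once and reused.
    private static final Pattern EMAIL =
            Pattern.compile("^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$");

    static boolean isValid(String candidate) {
        return EMAIL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("john.doe+news@example.com")); // true
        System.out.println(isValid("john doe@example.com"));      // false, space is not allowed
        System.out.println(isValid("john.doe@"));                 // false, nothing after @
    }
}
```

Note that this pattern is deliberately permissive: it does not require a dot in the domain part, so a value such as a@b is accepted.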
Regular expressions look much harder than they are
If you don't know them, they will almost never be required over other tools
But if you know them, you will constantly find situations where they apply, and they will make many problems much, much easier
Now let's see how regex can be used in Java.
Regex in Java
In Java, regular expressions are used for pattern matching on strings: searching, manipulating and editing text. A regular expression is a sequence of symbols representing a text pattern; it can be a single character or a more complicated pattern, and the syntax differs slightly from language to language. Regular expressions can be used to perform all types of text search and text replace operations. Java has no regex syntax built into the language itself, but the standard library provides the java.util.regex package, which we import to work with regular expressions. The package contains 3 classes and 1 interface. They are
Pattern Class
Matcher Class
PatternSyntaxException Class
MatchResult Interface
Pattern Class: It is a compiled representation of a regular expression and is used to define a pattern for the regex engine. The Pattern class has no public constructors. To create a pattern, we invoke its public static compile() method, passing the regular expression as the argument; it returns a Pattern object. Below are the most commonly used methods of the Pattern class.
Pattern.matches()
Pattern.compile()
Pattern.matcher()
Pattern.split()
Pattern.matches() - A static method that compiles the regular expression and checks, once, whether the entire text matches it. This is one of the simplest ways of testing a text against a regex.
The pattern .*tutorial.* allows zero or more characters before and after the string "tutorial" (the expression .* matches zero or more characters).
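A minimal sketch of Pattern.matches() with the .*tutorial.* pattern might look like this. The class name, the helper method and the sample sentences are invented for illustration.

```java
import java.util.regex.Pattern;

class PatternMatchesDemo {
    // True when the whole text matches .*tutorial.*, i.e. "tutorial" appears somewhere in it.
    static boolean mentionsTutorial(String text) {
        return Pattern.matches(".*tutorial.*", text);
    }

    public static void main(String[] args) {
        System.out.println(mentionsTutorial("A regex tutorial for Java")); // true
        System.out.println(mentionsTutorial("A regex guide for Java"));    // false
        // Without the surrounding .* the whole text would have to be exactly "tutorial".
        System.out.println(Pattern.matches("tutorial", "A regex tutorial for Java")); // false
    }
}
```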
Pattern.compile() - Compiles the given regular expression into a Pattern object. This method is used when we want a case-insensitive search or when the same regular expression will be matched against text multiple times. An optional second argument takes flags, which control how the pattern behaves.
Pattern flag | Description |
Pattern.CASE_INSENSITIVE | Enables case-insensitive matching. |
Pattern.COMMENTS | Whitespace in the pattern is ignored, and comments starting with # are ignored until the end of the line. |
Pattern.MULTILINE | ^ and $ match at the start and end of each line, not only at the start and end of the whole input. |
Pattern.DOTALL | The expression "." matches any character, including a line terminator. |
Pattern.UNIX_LINES | Only the '\n' line terminator is recognized in the behavior of ., ^, and $. |
Pattern.UNICODE_CASE | Enables Unicode-aware case folding when combined with CASE_INSENSITIVE. |
Pattern.LITERAL | The pattern is treated as a sequence of literal characters; metacharacters and escape sequences have no special meaning. |
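The effect of a few of these flags can be seen in a small sketch. The helper found() and the sample strings are assumptions for illustration; flags are combined with the bitwise OR operator.

```java
import java.util.regex.Pattern;

class PatternFlagsDemo {
    // True if the regex, compiled with the given flags, is found anywhere in the text.
    static boolean found(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String twoLines = "first line\nsecond line";

        System.out.println(found("java", 0, "I like JAVA"));                        // false
        System.out.println(found("java", Pattern.CASE_INSENSITIVE, "I like JAVA")); // true

        System.out.println(found("^second", 0, twoLines));                 // false, ^ is start of input only
        System.out.println(found("^second", Pattern.MULTILINE, twoLines)); // true, ^ also matches after \n

        System.out.println(found("first.*second", 0, twoLines));              // false, . stops at \n
        System.out.println(found("first.*second", Pattern.DOTALL, twoLines)); // true

        // Several flags can be combined with |
        System.out.println(found("^SECOND", Pattern.MULTILINE | Pattern.CASE_INSENSITIVE, twoLines)); // true
    }
}
```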
Once we have created a Pattern instance, we need a Matcher instance to perform the match, which we get from the Pattern.matcher() method.
Pattern.matcher() - Creates a Matcher instance from the Pattern instance for a given input text. The Matcher is then used for pattern matching, for example with its matches() method.
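A minimal sketch of the compile-then-match flow, assuming a made-up pattern that accepts a single word of letters:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherFromPatternDemo {
    // Compiled once; the same Pattern can create a Matcher for each new text.
    private static final Pattern WORD = Pattern.compile("[a-z]+", Pattern.CASE_INSENSITIVE);

    static boolean isSingleWord(String text) {
        Matcher matcher = WORD.matcher(text);
        return matcher.matches(); // the whole text must match the pattern
    }

    public static void main(String[] args) {
        System.out.println(isSingleWord("Regex"));         // true
        System.out.println(isSingleWord("Regex in Java")); // false, spaces are not in [a-z]
    }
}
```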
Pattern.split() - Splits a text into an array of strings, using the regular expression (the pattern) as the delimiter.
When the delimiter pattern is a character class in [] brackets followed by a '+' sign, the '+' indicates one or more occurrences of the characters in the brackets. The given text is searched with the regular expression, and wherever a match occurs the text is split there and the pieces are stored in the array.
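As an illustration of splitting on a bracketed character class followed by '+', here is a small sketch. The delimiter set [ ,.!]+ and the sample sentence are assumptions, not the original example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitOnDelimitersDemo {
    // One or more spaces, commas, full stops or exclamation marks act as a single delimiter.
    private static final Pattern DELIMITERS = Pattern.compile("[ ,.!]+");

    static String[] splitWords(String text) {
        return DELIMITERS.split(text);
    }

    public static void main(String[] args) {
        String[] words = splitWords("Hello, world. Regex is fun!");
        System.out.println(Arrays.toString(words)); // [Hello, world, Regex, is, fun]
    }
}
```

Because of the '+', the comma and the space after "Hello" count as one delimiter, so no empty strings appear between the words.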
To split the content on digits instead, we can use the metacharacter \d or the character class [0-9] as the regular expression.
Wherever a digit occurs, the text is split and the pieces are added to the array.
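Splitting on digits could be sketched like this. The sample inputs are made up, and the comparison with \d+ is added only to show what happens with consecutive digits.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitOnDigitsDemo {
    // Splits the text using the given regex as the delimiter.
    static String[] splitOn(String regex, String text) {
        return Pattern.compile(regex).split(text);
    }

    public static void main(String[] args) {
        // In Java source the backslash is doubled: "\\d" is the regex \d.
        System.out.println(Arrays.toString(splitOn("\\d", "one1two2three3four"))); // [one, two, three, four]
        // Two adjacent digits are two delimiters, which leaves an empty string between them.
        System.out.println(Arrays.toString(splitOn("\\d", "one12two")));  // [one, , two]
        System.out.println(Arrays.toString(splitOn("\\d+", "one12two"))); // [one, two]
    }
}
```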
Matcher Class: It is used to search through a text for multiple occurrences of a regular expression, or to search for the same regular expression in different texts. Below is a list of frequently used methods.
Method | Description |
boolean matches() | Matches the regular expression against the whole text. |
boolean find() | Finds the next occurrence of the regular expression in the text; calling it repeatedly walks through all occurrences. |
boolean find(int start) | Resets the matcher and finds the next occurrence of the regular expression starting from the given index. |
String group() | Returns the matched subsequence. |
int start() | Returns the starting index of the matched subsequence. |
int end() | Returns the index just after the last character of the matched subsequence. |
int groupCount() | Returns the number of capturing groups in the matcher's pattern. |
String replaceAll(String replacement) | Replaces every subsequence of the input that matches the pattern with the given replacement string. |
String replaceFirst(String replacement) | Replaces the first subsequence of the input that matches the pattern with the given replacement string. |
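The find(), group(), start() and end() methods are typically used together in a loop. The following is a sketch with an invented sample sentence and helper name (describeMatches).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherFindDemo {
    // Collects each match together with its start (inclusive) and end (exclusive) index.
    static List<String> describeMatches(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            found.add(matcher.group() + " [" + matcher.start() + "," + matcher.end() + ")");
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [66 [6,8), 3 [17,18), 2024 [28,32)]
        System.out.println(describeMatches("\\d+", "Order 66 shipped 3 items in 2024"));

        // groupCount() counts the capturing groups in the pattern, not the matches found.
        Matcher m = Pattern.compile("(\\w+)@(\\w+)").matcher("user@host");
        System.out.println(m.groupCount()); // 2
    }
}
```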
Replace method
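A short sketch of replaceAll() and replaceFirst(), assuming a made-up task of masking digits in a phone number:

```java
import java.util.regex.Pattern;

class ReplaceDemo {
    private static final Pattern DIGIT = Pattern.compile("\\d");

    // Replaces every digit with '#'.
    static String maskAllDigits(String text) {
        return DIGIT.matcher(text).replaceAll("#");
    }

    // Replaces only the first digit with '#'.
    static String maskFirstDigit(String text) {
        return DIGIT.matcher(text).replaceFirst("#");
    }

    public static void main(String[] args) {
        System.out.println(maskAllDigits("Call 555-1234"));  // Call ###-####
        System.out.println(maskFirstDigit("Call 555-1234")); // Call #55-1234
    }
}
```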
PatternSyntaxException Class: An unchecked (runtime) exception thrown when a syntax error occurs in a regular expression pattern. The following methods help to determine what went wrong.
Method | Description |
String getDescription() | Retrieves the description of the error. |
int getIndex() | Retrieves the error index. |
String getPattern() | Retrieves the erroneous regular expression pattern. |
String getMessage() | Returns a multi-line string containing the description of the syntax error and its index, the erroneous regular expression pattern, and a visual indication of the error index within the pattern. |
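A sketch of catching the exception and reading its details follows. The invalid pattern [a-z (an unclosed character class) and the helper name describeError are chosen only for illustration; the exact description text comes from the JDK and is not reproduced here.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxErrorDemo {
    // Returns "valid" if the regex compiles, otherwise a one-line summary of the error.
    static String describeError(String regex) {
        try {
            Pattern.compile(regex);
            return "valid";
        } catch (PatternSyntaxException e) {
            return e.getDescription() + " at index " + e.getIndex() + " in " + e.getPattern();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeError("[a-z]+")); // valid
        System.out.println(describeError("[a-z"));   // description, index and the pattern [a-z
    }
}
```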
MatchResult Interface: Represents the result of a match operation. This interface contains query methods used to determine the results of a match against a regular expression. The match boundaries, groups and group boundaries can be seen but not modified through a MatchResult.
Method | Description |
int end() | Returns the offset after the last character matched. |
int end(int group) | Returns the offset after the last character of the subsequence captured by the given group during this match. |
String group() | Returns the input subsequence matched by the previous match. |
String group(int group) | Returns the input subsequence captured by the given group during the previous match operation. |
int groupCount() | Returns the number of capturing groups in this match result's pattern. |
int start() | Returns the start index of the match. |
int start(int group) | Returns the start index of the subsequence captured by the given group during this match. |
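One way to obtain a MatchResult is Matcher.toMatchResult(), which takes a snapshot of the current match. The sketch below assumes a made-up year-month pattern and sample text.

```java
import java.util.Optional;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchResultDemo {
    // Returns a read-only snapshot of the first match, or empty if there is none.
    static Optional<MatchResult> firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? Optional.of(matcher.toMatchResult()) : Optional.empty();
    }

    public static void main(String[] args) {
        MatchResult r = firstMatch("(\\d{4})-(\\d{2})", "Released 2024-05 worldwide").get();
        System.out.println(r.group());      // 2024-05
        System.out.println(r.start());      // 9
        System.out.println(r.end());        // 16
        System.out.println(r.groupCount()); // 2
        System.out.println(r.group(1));     // 2024
        System.out.println(r.start(2));     // 14
        System.out.println(r.end(2));       // 16
    }
}
```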
Rules for writing Regex:
The following shows the overview of available character classes, metacharacters and quantifiers which can be used in regular expressions.
Character Classes:
A character class is a set of characters enclosed within square brackets []. It specifies the characters that will successfully match a single character from a given input string.
Character class | Description |
[abc] | Set Definition, matches a or b or c |
[^abc] | Negation: a caret as the first character inside the square brackets matches any character except a, b or c |
[abc][vz] | Set definition, can match a or b or c followed by either v or z. |
[a-zA-Z] | Range: matches a through z or A through Z, inclusive |
[a-d[m-p]] | Union: a through d, or m through p: [a-dm-p] |
[a-z&&[def]] | Intersection: matches d, e, or f |
[a-z&&[^bc]] | Subtraction: a through z, except for b and c: [ad-z] |
[a-z&&[^m-p]] | Subtraction: a through z, and not m through p: [a-lq-z] |
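The character classes in the table can be checked one by one against single characters. This is a small sketch; the test inputs are arbitrary.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {
    // True when the whole text matches the regex.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matches("[abc]", "b"));         // true
        System.out.println(matches("[^abc]", "b"));        // false, b is excluded
        System.out.println(matches("[abc][vz]", "az"));    // true, a followed by z
        System.out.println(matches("[a-d[m-p]]", "n"));    // true, n is in m-p
        System.out.println(matches("[a-z&&[def]]", "e"));  // true
        System.out.println(matches("[a-z&&[def]]", "a"));  // false, a is not in the intersection
        System.out.println(matches("[a-z&&[^bc]]", "b"));  // false, b is subtracted
        System.out.println(matches("[a-z&&[^m-p]]", "q")); // true, q is outside m-p
    }
}
```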
Metacharacters:
Metacharacters are the building blocks of regular expressions. They have a predefined, special meaning and turn literal characters into powerful expressions. For example, we can use \d for [0-9]. Many metacharacters take their letter from what they represent, e.g., d for digit, s for space, w for word and b for boundary. The uppercase form defines the opposite.
Regex | Description |
. | Any single character except a line terminator. With the Pattern.DOTALL flag it matches line terminators as well. |
\d | Any digit, short for [0-9] |
\D | Any non-digit, short for [^0-9] |
\s | Any whitespace character, short for [ \t\n\x0B\f\r] |
\S | Any non-whitespace character, short for [^\s] |
\w | Any word character, short for [a-zA-Z_0-9] |
\W | Any non-word character, short for [^\w] |
\b | A word boundary |
\B | A non-word boundary |
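To see what each metacharacter selects, one option is to replace whatever it matches with a visible marker. The helper name mark() and the sample strings are assumptions for this sketch.

```java
import java.util.regex.Pattern;

class MetacharacterDemo {
    // Replaces every match of the regex in the text with the given marker.
    static String mark(String regex, String text, String marker) {
        return Pattern.compile(regex).matcher(text).replaceAll(marker);
    }

    public static void main(String[] args) {
        System.out.println(mark("\\d", "a1 b2", "#")); // a# b#
        System.out.println(mark("\\D", "a1 b2", "#")); // #1##2
        System.out.println(mark("\\s", "a1 b2", "_")); // a1_b2
        System.out.println(mark("\\w", "a1 b2", "*")); // ** **
        // \b requires a word boundary, so only the standalone word is replaced.
        System.out.println(mark("\\bcat\\b", "cat concat", "dog")); // dog concat
        // \B requires a non-boundary, so only the "cat" inside "concat" is replaced.
        System.out.println(mark("\\Bcat", "cat concat", "dog"));    // cat condog
    }
}
```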
Quantifiers:
Quantifiers specify how many times a character, group or character class must occur for a match. The symbols ?, *, + and {} are quantifiers.
Regex | Description |
X? | X occurs 0 or 1 times, short form for {0,1} |
X* | X occurs 0 or more times, short form for {0,} |
X+ | X occurs 1 or more times, short form for {1,} |
X{n} | X occurs n times only. For ex: \d{3} searches for three digits, .{10} for any character sequence of length 10. |
X{n,} | X occurs n or more times. For ex: \d{2,} means the digit must occur at least 2 times. |
X{y,z} | X occurs at least y times but not more than z times. For ex: \d{1,4} means the digit must occur at least once and at most four times. |
*? | A ? after a quantifier makes it a reluctant quantifier, which tries to find the smallest possible match. The match stops at the first point where the rest of the pattern can succeed. |
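The difference between counted, greedy and reluctant quantifiers can be sketched as follows. The sample inputs and the helper firstMatch() are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first substring matched by the regex, or null if nothing matches.
    static String firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches("colou?r", "color"));   // true, u occurs 0 times
        System.out.println(Pattern.matches("\\d{3}", "123"));      // true, exactly three digits
        System.out.println(Pattern.matches("\\d{2,}", "7"));       // false, fewer than two digits
        System.out.println(Pattern.matches("\\d{1,4}", "12345"));  // false, more than four digits

        // Greedy .+ takes as much as possible; reluctant .+? takes as little as possible.
        System.out.println(firstMatch("<.+>", "<b>bold</b>"));  // <b>bold</b>
        System.out.println(firstMatch("<.+?>", "<b>bold</b>")); // <b>
    }
}
```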
Phone Number Validation using Regex
Pattern 1: Create a regular expression that accepts 10-digit numeric characters starting with 7, 8 or 9 only.
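One possible regex for Pattern 1 is [789]\d{9}: one leading digit from 7, 8 or 9, followed by exactly nine more digits. The sketch below is an assumption about how this could be written, not necessarily the original solution; the class and method names are made up.

```java
import java.util.regex.Pattern;

class PhonePattern1 {
    // First digit 7, 8 or 9, then exactly nine more digits (10 digits in total).
    private static final Pattern PHONE = Pattern.compile("[789]\\d{9}");

    static boolean isValid(String number) {
        return PHONE.matcher(number).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("9876543210"));  // true
        System.out.println(isValid("6876543210"));  // false, starts with 6
        System.out.println(isValid("987654321"));   // false, only 9 digits
        System.out.println(isValid("98765432100")); // false, 11 digits
    }
}
```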
Pattern 2: Create a regular expression that accepts 10-digit numeric characters starting with 7, 8 or 9 and after first 3 digits there should be a '-' or white space. Then again after the second 3 digits there should be a '-' or white space.
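Pattern 2 could be expressed as [789]\d{2}[-\s]\d{3}[-\s]\d{4}, where [-\s] accepts either a hyphen or a whitespace character. As before, this is a sketch under that assumption, with made-up class and method names.

```java
import java.util.regex.Pattern;

class PhonePattern2 {
    // 3 digits (first is 7, 8 or 9), a separator, 3 digits, a separator, 4 digits.
    private static final Pattern PHONE =
            Pattern.compile("[789]\\d{2}[-\\s]\\d{3}[-\\s]\\d{4}");

    static boolean isValid(String number) {
        return PHONE.matcher(number).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("987-654-3210")); // true
        System.out.println(isValid("987 654 3210")); // true
        System.out.println(isValid("9876543210"));   // false, separators are required
        System.out.println(isValid("687-654-3210")); // false, starts with 6
    }
}
```

This version lets the two separators differ, so 987-654 3210 is also accepted. If both separators must be the same, a capturing group and a backreference, as in ([-\s]) followed later by \1, would be one way to enforce that.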
Conclusion
Regular expressions look complex, and they are complex, because they’re so useful. But seriously, I regret not jumping in and learning them sooner, because logically they are no more difficult than any programming language, and syntactically they end up being very intuitive and easy to look up.