This blog is all about Regular Expressions in Tableau and how these will help you to tune the data. Regular Expressions are shortened as REGEX or REGEXP. Have you ever wondered what REGEX functions are? I am always curious about REGEX as they will follow some patterns. They will help you find what is hidden inside a stream of data. Mainly REGEX is used for extracting data elements from a string of data. As per Wikipedia, REGEX is a sequence of characters that specifies a search pattern in the text.
REGEX is commonly used by websites and application developers, to validate whether the user input matches the required structure of the data. For example, in email ids, where we could use REGEXP_MATCH to validate if the user input for that data field was entered properly or correctly.
Regular Expression Metacharacters Classes
A Metacharacter is a character that has special meaning in the REGEX engine. These classes are used in defining rules to find a specific pattern in a string. In REGEX, backlash with certain letters can do something with certain characters. A capitalized letter will have a different meaning and action than a lowercase one.
A few examples of the Metacharacters classes are shown below :
Character | Description |
^ | Match at the beginning of a line; Start of string, or start of the line in a multi-line pattern |
$ | Match at the end of a line; End of string, or End of line in a multi-line pattern |
\ | Escape following pattern |
\w | Match a word Character |
\W | Match a non-word character |
\d | Match any character with the Unicode General Category of Nd (Number, Decimal Digit.) |
\D | Match any character that is not digital |
. | Match any character |
+ | It is used for ‘one or more occurrences |
* | It is used for ‘zero or more occurrences |
Note: For more reference on regular expression, here is the link to the cheat sheet:
In Tableau, we have four different REGEXP functions, which either match, extract, or replace strings and they are :
· REGEXP_EXTRACT (string, pattern) - Looks for a particular pattern within a string or substring and extracts that data element.
· REGEXP_EXTRACT_NTH (string, pattern, index) - Looks for a particular pattern within a string or substring, starting at the nth position in the string, and extracts that data element.
· REGEXP_MATCH (string, pattern) - Looks for a particular pattern within a string or substring and returns TRUE if there’s an exact match.
· REGEXP_REPLACE (string, pattern, replacement) - Looks for a particular pattern within a string or substring and replaces that data element with a different data element.
Before starting with examples, we will be familiarising on how to create a calculation field in Tableau. From the top left area of the Tableau sheet, besides the search field, on clicking the down arrow you will see the options for creating a new Calculated Field
On clicking "Create Calculated Field" you can generate new calculated fields with new names and when you enter "Reg" in the below text area. All the above-mentioned REGEX functions will appear as shown below.
Now let’s discuss how each REGEX function can be used and their examples. I will be using the Product ID field from the Sample - superstore data set for applying REGEX functions. Initially, you can drop the Product ID to the Text field and you will get the sheet as shown below :
Now you can apply each REGEXP function for the Product ID Data.
REGEXP_EXTRACT - As mentioned above It is looking for a particular pattern in the string. So I created a new calculated field with the name 'PRD-Reg1' with syntax as mentioned below.
REGEXP_EXTRACT([Product ID],'(\w+)')
The function of the pattern (\w+) here tells us the parenthesis acts as our capture group, the data we are trying to extract or match upon, and the backslash mentions the starting point to start with. Now the small “w” (as mentioned in the above table) is matching on a word character and alphanumeric values. Followed by the “+ “ symbol means to extract all characters once and stop when it finds the next character which isn’t a word.
.
Once the new calculated field is ready you can drag this 'PRD-Reg1' to the Text Field and we can see that it will extract the first part of each product Id, i.e, 'FUR'
2. REGEXP_EXTRACT_NTH - Will create a new calculated field with the name 'PRD-Reg2' with syntax as mentioned below. Now the substring is matched to the nth capturing group, where 'n' is the given index. With this syntax, it looks for a particular pattern within a string or substring, starting at the 3rd position in the string, and extracts that data element. Here '\d' will match on a decimal digit.
REGEXP_EXTRACT_NTH([Product ID],'(\w+)-(\w+)-(\d+)',3)
So with this new calculated field moved to Text, it will extract the last digit field of the Product Id.
3. REGEXP_MATCH - In this example, on creating a new calculated field 'PRD-Reg3', we will look for a particular pattern ( 'FUR' ) within the 'Product Id' and it will return a 'TRUE' value if it finds an exact match. The syntax used will be as shown
REGEXP_MATCH([Product ID],'FUR')
And the result once 'PRD-Reg3' moves to the Text field is as shown :
4. REGEXP_REPLACE - As the function name, it will search for a particular pattern and replace it with the new data pattern provided. For this, a new calculated field with the name 'PRD-Reg4' with the below syntax can be created.
REGEXP_REPLACE([Product ID],'BO','HOM')
Now all the Product Ids with the string 'BO' will be replaced with the string 'HOM'.
In the below image you can see FUR - CH strings are not replaced, whereas, all the FUR-BO strings were replaced with FUR-HOM.
I hope this post provided a good, Tableau-focused introduction to regular expressions. Hope you enjoyed the post. If you like these I will be posting the next level of regular expressions with new examples.