Regular Expressions
Regular expressions allow more complex search and replace rules. With them you can, for example, replace every word that starts with an ‘a’ with your replace string. You can even reformat the found strings, for example changing HTML tags while keeping their content.
I really recommend using this Regex Tester while learning how regular expressions work.
Let’s use this pangram for the first example:
The quick brown fox jumps over the lazy dog
When matching it against
(quick|fox)
you will get 2 results: quick and fox.
The parentheses form a capturing group: whatever they match can be accessed in the replace string using $1, $2, and so on.
So if we perform a search and replace with the above search string and the replace string
<$1>
we will get this result:
The <quick> brown <fox> jumps over the lazy dog
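If you want to try the same replacement in code, here is a minimal sketch. It assumes Python's built-in re module rather than the search and replace tool described here, and Python writes the group reference as \1 where the replace string above uses $1.

```python
import re

text = "The quick brown fox jumps over the lazy dog"

# (quick|fox) captures either word as group 1.
matches = re.findall(r"(quick|fox)", text)

# Python spells the group reference \1 (the replace string above uses $1).
result = re.sub(r"(quick|fox)", r"<\1>", text)

print(matches)  # ['quick', 'fox']
print(result)   # The <quick> brown <fox> jumps over the lazy dog
```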
You can also search for whole words. Let’s use this as input text:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
The search string:
\b(\w+)\b
will simply match every word. The \w stands for a word character (a to z, A to Z, 0 to 9 and the underscore). The + after the \w quantifies it: a word character has to be matched 1 or more times. The leading and trailing \b mark word boundaries, so the pattern only matches whole words. To see what the boundaries do, compare two more search strings against the same text:
(labor)
would match one time (in ‘labore’) while
\b(labor)\b
would not match, as the text does not contain ‘labor’ as a standalone word.
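The effect of the word boundaries can be checked with a short sketch, again assuming Python's re module as a stand-in for the tool. The variable names are only for illustration.

```python
import re

text = ("Lorem ipsum dolor sit amet, consectetur adipisicing elit, "
        "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.")

# Every whole word: one or more word characters between two boundaries.
words = re.findall(r"\b(\w+)\b", text)

# Without boundaries, 'labor' is found inside 'labore'.
inside_word = re.findall(r"(labor)", text)

# With boundaries, 'labor' must stand alone, so nothing is found.
whole_word = re.findall(r"\b(labor)\b", text)

print(len(words))    # 19
print(inside_word)   # ['labor']
print(whole_word)    # []
```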
If you don’t want to use the presets you can define your own character classes using []. Put every character that should match between the brackets; each class matches exactly one character. Some examples:
- [A-Z] matches A,B,C,…,Z
- [A-Za-z] matches A,B,C,…,Z,a,b,….,z
- [0-9_] matches 0,1,…,9,_ (so every single character of 123_456 matches)
- [a] matches only a
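To see that each class matches one character at a time, here is a small sketch using Python's re module (an assumption, the tool itself may use a different regex engine). The sample strings are made up for illustration.

```python
import re

sample = "Abc 123_456"

# Each class matches a single character, so findall returns one entry per hit.
upper = re.findall(r"[A-Z]", sample)
letters = re.findall(r"[A-Za-z]", sample)
digits_or_underscore = re.findall(r"[0-9_]", sample)
only_a = re.findall(r"[a]", "banana")

print(upper)                 # ['A']
print(letters)               # ['A', 'b', 'c']
print(digits_or_underscore)  # ['1', '2', '3', '_', '4', '5', '6']
print(only_a)                # ['a', 'a', 'a']
```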
Those can be put into a capturing group. Some examples:
- ([A-Z]) matches a single capital letter
- ([A-Z]+) matches 1 or more capital letters
- ([0-9]*) matches 0 or more digits
Besides + and * you can also specify an exact number of times using {n}, where n is a number >= 0, or a range using {n,m}, where the character has to be matched n to m times. You can also set only a minimum (like 5 or more times) using {5,}.
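The quantifiers can be tried out with the sketch below. It assumes Python's re module, and the helper full_match is a hypothetical name of mine: it checks whether the whole string fits the pattern, which makes the counts easy to see (a search tool would instead look for the pattern anywhere in the text).

```python
import re

def full_match(pattern, s):
    """Return True if the entire string s fits the pattern (illustrative helper)."""
    return re.fullmatch(pattern, s) is not None

print(full_match(r"([A-Z])", "A"))          # True: exactly one capital letter
print(full_match(r"([A-Z])", "AB"))         # False: two letters are one too many
print(full_match(r"([A-Z]+)", "AB"))        # True: one or more
print(full_match(r"([A-Z]+)", ""))          # False: + needs at least one
print(full_match(r"([0-9]*)", ""))          # True: * also accepts zero digits
print(full_match(r"[0-9]{3}", "123"))       # True: exactly 3
print(full_match(r"[0-9]{3}", "12"))        # False
print(full_match(r"[0-9]{2,4}", "1234"))    # True: between 2 and 4
print(full_match(r"[0-9]{2,4}", "12345"))   # False
print(full_match(r"[0-9]{5,}", "123456"))   # True: 5 or more
print(full_match(r"[0-9]{5,}", "1234"))     # False
```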
Let’s look at more complex examples. We now have
123-456 233+12345 -456
as input. The rule “the word should consist of one or more digits, followed by a - or +, followed by exactly 3 digits”:
\b([0-9]+[-+][0-9]{3})\b
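will match only the first part (‘123-456’). The second part fails because more than 3 digits follow the +, and the last part fails because no digit comes before the -. When we modify this rule to (note that the + changed to a * and the leading \b was dropped, since there is no word boundary between a space and a -):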
([0-9]*[-+][0-9]{3})\b
it will match not only the first but also the last part (‘-456’), as the leading run of digits can now have a length of 0.
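Both rules can be verified with a short sketch, assuming Python's re module. The third pattern is my own addition for comparison: it keeps the leading \b together with the *, which shows why the boundary has to go for ‘-456’ to match.

```python
import re

text = "123-456 233+12345 -456"

# One or more digits, a - or +, exactly 3 digits, all as a whole word.
strict = re.findall(r"\b([0-9]+[-+][0-9]{3})\b", text)

# Zero or more leading digits and no leading word boundary.
relaxed = re.findall(r"([0-9]*[-+][0-9]{3})\b", text)

# Comparison only: * with the leading \b kept still rejects '-456',
# because there is no word boundary between the space and the '-'.
star_with_boundary = re.findall(r"\b([0-9]*[-+][0-9]{3})\b", text)

print(strict)              # ['123-456']
print(relaxed)             # ['123-456', '-456']
print(star_with_boundary)  # ['123-456']
```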
In case you still don’t understand how all this works, I really recommend using the Regex Tester. Play with it a little and you’ll get it! Also take a look at its Quick-Reference:
. | Any character except newline.
\. | A period (and so on for \*, \(, \\, etc.)
^ | The start of the string.
$ | The end of the string.
\d, \w, \s | A digit, word character [A-Za-z0-9_], or whitespace.
\D, \W, \S | Anything except a digit, word character, or whitespace.
[abc] | Character a, b, or c.
[a-z] | a through z.
[^abc] | Any character except a, b, or c.
aa|bb | Either aa or bb.
? | Zero or one of the preceding element.
* | Zero or more of the preceding element.
+ | One or more of the preceding element.
{n} | Exactly n of the preceding element.
{n,} | n or more of the preceding element.
{m,n} | Between m and n of the preceding element.
??, *?, +?, {n}?, etc. | Same as above, but as few as possible.
(expr) | Capture expr for use with $1, etc.
(?:expr) | Non-capturing group.
(?=expr) | Followed by expr.
(?!expr) | Not followed by expr.
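A few of the less obvious entries are easier to understand with an example. The sketch below assumes Python's re module (which writes \1 instead of $1); the sample strings and variable names are made up for illustration. It shows the "as few as possible" quantifier by changing HTML tags while keeping their content, then a non-capturing group and the two lookaheads.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy .* grabs as much as possible, lazy .*? as little as possible.
greedy = re.findall(r"<b>(.*)</b>", html)
lazy = re.findall(r"<b>(.*?)</b>", html)

# Change the tag but keep the content captured in group 1.
swapped = re.sub(r"<b>(.*?)</b>", r"<strong>\1</strong>", html)

# (?:...) groups the alternatives without creating a numbered group,
# so group 1 is the word that follows.
pangram = "The quick brown fox jumps over the lazy dog"
after_adjective = re.findall(r"(?:quick|lazy) (\w+)", pangram)

# Lookaheads check what follows without including it in the match.
prices = "10 USD, 20 EUR, 30 USD"
usd = re.findall(r"[0-9]+(?= USD)", prices)
not_usd = re.findall(r"\b[0-9]+\b(?! USD)", prices)

print(greedy)           # ['one</b> and <b>two']
print(lazy)             # ['one', 'two']
print(swapped)          # <strong>one</strong> and <strong>two</strong>
print(after_adjective)  # ['brown', 'dog']
print(usd)              # ['10', '30']
print(not_usd)          # ['20']
```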