I was writing scripts to parse a document in Python, and I found it hard to find a good, quick tutorial on Python's regular expression package. This is my best effort at showing how to use Python regular expressions.
The first thing you need to do is import the regular expression package from Python:
import re
To build a regular expression object, you call re.compile() with the pattern. Here, I am trying to find strings in the form of a Rice University net id, such as “yz17” (2-3 lowercase letters followed by 2-3 digits).
netidExpr = re.compile(r'\b[a-z]{2,3}[0-9]{2,3}\b')
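As a rough illustration of how the quantifiers shape what a pattern accepts, the sketch below compares * (zero or more) with the bounded {2,3} form for the net id description above. The candidate strings and variable names are made up for the example, and fullmatch() assumes Python 3.4 or later.

```python
import re

# Loose: * means "zero or more", so any run of letters then digits passes.
loose = re.compile(r'[a-z]*[0-9]*')
# Bounded: 2-3 lowercase letters followed by 2-3 digits, nothing else.
strict = re.compile(r'[a-z]{2,3}[0-9]{2,3}')

# Hypothetical test strings, not taken from any real data.
candidates = ["yz17", "abc123", "a1", "abcdefg", "yz12345"]

# fullmatch() requires the whole string to fit the pattern.
loose_ok = [s for s in candidates if loose.fullmatch(s)]
strict_ok = [s for s in candidates if strict.fullmatch(s)]

print(loose_ok)   # all five candidates
print(strict_ok)  # ['yz17', 'abc123']
```

The bounded form is the closer translation of "2-3 letters followed by 2-3 digits", which is why it rejects the last three candidates.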
Once you have a regular expression object, there are a few useful methods you can call to find a match for the pattern. The descriptions below are from the official Python documentation.
- The first one is re.search()
- documentation: Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
- The second method is re.match()
- documentation: If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
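To make the difference between the two methods concrete, here is a small sketch that runs both on the same line of text. The pattern and the sample strings are assumptions chosen for illustration; a compiled pattern object offers search() and match() methods that behave like the module-level functions described above.

```python
import re

# Assumed pattern: 2-3 lowercase letters followed by 2-3 digits.
pattern = re.compile(r'[a-z]{2,3}[0-9]{2,3}')
line = "Yunming Zhang, ( yz17 ) Active Student"

# search() scans the whole string for the first place the pattern fits.
found_by_search = pattern.search(line)
# match() only tries at the very beginning of the string.
found_by_match = pattern.match(line)

print(found_by_search.group())  # yz17
print(found_by_match)           # None, because the line starts with "Yunming"

# match() succeeds when the pattern sits at the start of the string.
at_start = pattern.match("yz17 is a net id")
print(at_start.group())         # yz17
```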
For me, search() is a lot more useful, as I am trying to extract the pattern from a line of text, but I can see where match() would come in handy as well. I will demonstrate how to use search() in this article.
Assuming I have a line of text, “ Yunming Zhang, ( yz17 ) Active Student”, this is the code I would write to extract yz17:
import re
netidExpr = re.compile(r'\b[a-z]{2,3}[0-9]{2,3}\b')
netid = netidExpr.search(" Yunming Zhang, ( yz17 ) Active Student")
print(netid.group())
The search method returns a match object, and I use netid.group() to read out the actual match. The above code prints “yz17”.
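Since search() returns None when nothing matches, calling group() directly on the result raises an AttributeError for lines without a net id. One way to guard against that is sketched below; the helper name extract_netid and the second sample line are hypothetical, and the pattern is an assumed strict form of the net id description.

```python
import re

# Assumed pattern: 2-3 lowercase letters, then 2-3 digits, as a whole word.
netid_expr = re.compile(r'\b[a-z]{2,3}[0-9]{2,3}\b')

def extract_netid(text_line):
    """Return the first net-id-shaped token in text_line, or None."""
    m = netid_expr.search(text_line)
    # Only call group() when search() actually found something.
    return m.group() if m is not None else None

print(extract_netid(" Yunming Zhang, ( yz17 ) Active Student"))  # yz17
print(extract_netid(" Yunming Zhang, Active Student"))           # None
```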
Hope this helps!
As a side note, you can also split a string on multiple delimiters, although this might be a good case for using a regular expression instead. This Stack Overflow post summarizes it all very well.
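For what it's worth, here is a minimal sketch of the regular expression route, using re.split() with a character class that lists the delimiters. The sample string and the choice of delimiters (commas, semicolons, whitespace) are assumptions for the example, not taken from that post.

```python
import re

# Hypothetical input that mixes several delimiters.
text = "yz17,ab123;cd45 ef678"

# Split on one or more commas, semicolons, or whitespace characters.
parts = re.split(r'[,;\s]+', text)

print(parts)  # ['yz17', 'ab123', 'cd45', 'ef678']
```

The trailing + treats a run of adjacent delimiters as a single separator, so something like ", " does not produce empty strings in the result.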