regex - Python Unicode Regular Expression
I am using Python 2.4 and I am having some problems with Unicode regular expressions. I have tried to put together a clear and concise example of my problem. It looks as though there is a problem with how Python handles the different character encodings, or there is a problem with my understanding. Thank you very much for taking a look!

    # This is a simple Python program that shows my problems with
    # regular expressions and character encoding in Python.
    # Written by Brian J. Stinar
    # Thanks for the help!

    import urllib   # to get files off the Internet
    import chardet  # to identify character encodings
    import re       # Python regular expressions
    #import ponyguruma  # Python Oniguruma regular expressions - this can be uncommented if you feel like messing with it, but I have the same issue no matter which RE library I am using

    rawdata = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read()
    print(chardet.detect(rawdata))
    #print(rawdata)

    ISO_8859_2_encoded = rawdata.decode('ISO-8859-2')   # let's grab this as text
    UTF_8_encoded = ISO_8859_2_encoded.encode('utf-8')  # and encode the text as UTF-8
    print(chardet.detect(UTF_8_encoded))

    # This does not work, even though you can see UNSUBSCRIBE in the HTML
    re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE)
    print(str(re_UNSUB_amsterdam.match(UTF_8_encoded)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on UTF-8")
    print(str(re_UNSUB_amsterdam.match(rawdata)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on raw data")

    re_amsterdam = re.compile(".*Adobe.*", re.UNICODE)
    print(str(re_amsterdam.match(rawdata)) + "\t--- RE for 'Adobe' on raw data")  # However, this works?!?
    print(str(re_amsterdam.match(UTF_8_encoded)) + "\t--- RE for 'Adobe' on UTF-8")

    '''
    # In addition, I tried this regular expression library, with the same unsatisfactory result
    new_re = ponyguruma.Regexp(".*UNSUBSCRIBE.*")
    if new_re.match(UTF_8_encoded) != None:
        print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on UTF-8")
    else:
        print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on UTF-8")
    if new_re.match(rawdata) != None:
        print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on raw data")
    else:
        print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on raw data")

    new_re = ponyguruma.Regexp(".*Adobe.*")
    if new_re.match(UTF_8_encoded) != None:
        print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on UTF-8")
    else:
        print("Ponyguruma RE did not match\t\t--- RE for Adobe on UTF-8")
    if new_re.match(rawdata) != None:
        print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on raw data")
    else:
        print("Ponyguruma RE did not match\t\t--- RE for Adobe on raw data")
    '''

I am working on a substitution project and am having a hard time with non-ASCII encoded files. This problem is part of a larger project: eventually I want to substitute the text with other text. (I have this working in ASCII, but I cannot yet identify occurrences in other encodings.) Thanks again.
- Brian J. Stinar
You probably want to either enable the DOTALL flag or use the search method instead of the match method. That is:
    # DOTALL makes . match newlines
    re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE | re.DOTALL)
Or:
    # search will find matches even if they are not at the start of the string
    ... re_UNSUB_amsterdam.search(foo) ...
These will give you different results, but both should give you a match. (Check which of the two results is the one you want.)
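To make the difference concrete, here is a minimal sketch. The string `text` is a made-up stand-in for the downloaded page (it is not the real content of legal.html), chosen so that the target word is not on the first line.

```python
import re

# Hypothetical stand-in for the downloaded page: the word is not on the first line.
text = u"<html>\n<body>\nClick here to UNSUBSCRIBE\n</body>\n</html>"

plain = re.compile(u".*UNSUBSCRIBE.*", re.UNICODE)
dotall = re.compile(u".*UNSUBSCRIBE.*", re.UNICODE | re.DOTALL)

# match() anchors at position 0 and "." stops at the first newline, so this fails.
match_plain = plain.match(text)

# search() tries every start position, so it finds the line containing the word.
search_plain = plain.search(text)

# With DOTALL, "." also matches newlines, so match() succeeds and spans the whole text.
match_dotall = dotall.match(text)

print(match_plain)                    # None
print(search_plain.group(0))          # Click here to UNSUBSCRIBE
print(match_dotall.group(0) == text)  # True
```

So search without DOTALL returns only the line containing the word, while match with DOTALL returns the entire document. Which one is right depends on what you plan to substitute later.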
As an aside: you seem to be confusing encoded text (which is bytes) with decoded text (which is characters). This is not uncommon, especially in pre-3.x Python. In particular, this line is very suspicious:
    ISO_8859_2_encoded = rawdata.decode('ISO-8859-2')
You are de-coding with ISO-8859-2, not en-coding, so call this variable "decoded". (Why not "ISO_8859_2_decoded"? Because ISO_8859_2 is an encoding. A decoded string no longer has an encoding.)
The rest of your code tries to match on rawdata and on UTF_8_encoded (both encoded strings) when it should probably be using the decoded Unicode string instead.
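A minimal sketch of that workflow, under stated assumptions: the bytes in `rawdata` are invented for illustration, and the replacement text "OPT OUT" is hypothetical. It is written to run on a recent Python; on 2.4 you would drop the `b` prefix, since a plain `str` plays the role of bytes there.

```python
import re

# Hypothetical bytes standing in for rawdata; 0xB3 is
# LATIN SMALL LETTER L WITH STROKE (U+0142) in ISO-8859-2.
rawdata = b"<p>Z\xb3oty</p>\n<p>UNSUBSCRIBE</p>"

# Decode once, at the boundary: bytes in, text out.
decoded = rawdata.decode("ISO-8859-2")

# Match against the decoded text, using a unicode pattern.
pattern = re.compile(u"UNSUBSCRIBE", re.UNICODE)
found = pattern.search(decoded)

# Substitute on the text as well, and encode only when writing the result out.
replaced = pattern.sub(u"OPT OUT", decoded)
output_bytes = replaced.encode("utf-8")

print(found.group(0))
print(repr(output_bytes))
```

Keeping every regular expression on the decoded side means the pattern never has to know which byte encoding the file arrived in.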