regex - Python Unicode Regular Expression -

- September 15, 2013

I am using Python 2.4 and am having some problems with Unicode regular expressions. I have tried to put together a very clear and concise example. It seems as though Python has some problem with the different character encodings, or there is a problem in my understanding. Thank you very much for taking a look!

  # This is a simple Python program designed to show my problems with
  # regular expressions and character encoding in Python.
  # Written by Brian J. Steiner
  # Thanks for the help!

  import urllib   # to get files off the Internet
  import chardet  # to identify character encodings
  import re       # Python regular expressions
  #import ponyguruma  # Python Oniguruma regular expressions - this can be uncommented
  #                   # if you feel like messing with it, but I have the same issue
  #                   # no matter which REs I am using

  rawdata = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read()
  print (chardet.detect(rawdata))
  #print (rawdata)

  ISO_8859_2_encoded = rawdata.decode('ISO-8859-2')   # let's grab this as text
  UTF_8_encoded = ISO_8859_2_encoded.encode('utf-8')  # and encode the text as UTF-8
  print (chardet.detect(UTF_8_encoded))

  # This makes no sense to me, even though you can see UNSUBSCRIBE in the HTML.
  # Eventually, I want to recognize the entire physical address and the UNSUBSCRIBE above it.
  re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE)
  print (str(re_UNSUB_amsterdam.match(UTF_8_encoded)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on UTF-8")
  print (str(re_UNSUB_amsterdam.match(rawdata)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on raw data")

  re_amsterdam = re.compile(".*Adobe.*", re.UNICODE)
  print (str(re_amsterdam.match(rawdata)) + "\t--- RE for 'Adobe' on raw data")  # However, this works?!?
  print (str(re_amsterdam.match(UTF_8_encoded)) + "\t--- RE for 'Adobe' on UTF-8")

  '''
  # In addition, I tried this regular expression library, with much the same unsatisfactory result
  new_re = ponyguruma.Regexp(".*UNSUBSCRIBE.*")
  if new_re.match(UTF_8_encoded) != None:
      print ("Ponyguruma RE matched!\t\t\t--- RE for UNSUBSCRIBE on UTF-8")
  else:
      print ("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on UTF-8")

  if new_re.match(rawdata) != None:
      print ("Ponyguruma RE matched!\t\t\t--- RE for UNSUBSCRIBE on raw data")
  else:
      print ("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on raw data")

  new_re = ponyguruma.Regexp(".*Adobe.*")
  if new_re.match(UTF_8_encoded) != None:
      print ("Ponyguruma RE matched!\t\t\t--- RE for Adobe on UTF-8")
  else:
      print ("Ponyguruma RE did not match\t\t\t--- RE for Adobe on UTF-8")

  new_re = ponyguruma.Regexp(".*Adobe.*")
  if new_re.match(rawdata) != None:
      print ("Ponyguruma RE matched!\t\t\t--- RE for Adobe on raw data")
  else:
      print ("Ponyguruma RE did not match\t\t\t--- RE for Adobe on raw data")
  '''

I'm working on a substitution project and am having a hard time with non-ASCII encoded files. This problem is part of a larger project: in the end I want to substitute the text with other text. (I have this working in ASCII, but I can't yet identify occurrences in other encodings.) Thanks again.

- Brian J. Steiner -

You probably want to either enable the DOTALL flag or use the search method instead of the match method. That is:

  # DOTALL makes . match newlines
  re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE | re.DOTALL)

Or:

  # search will find matches even if they aren't at the start of the string
  ... re_UNSUB_amsterdam.search(foo) ...

These will give you different results, but both should give you matches. (See which one is the kind you want.)
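To see how the two fixes differ, here is a minimal sketch on a small made-up string. The `text` sample is an assumption standing in for the downloaded HTML, and the code is written for Python 3 rather than the asker's Python 2.4.

```python
import re

# Hypothetical stand-in for the downloaded page: the keyword is not on the first line.
text = "<html>\n<body>\nClick to UNSUBSCRIBE here\n</body>\n</html>\n"

plain = re.compile(".*UNSUBSCRIBE.*", re.UNICODE)
dotall = re.compile(".*UNSUBSCRIBE.*", re.UNICODE | re.DOTALL)

# match() anchors at the start of the string, and without DOTALL "." stops at
# the first newline, so only the first line is ever considered.
match_plain = plain.match(text)

# search() scans forward, so it finds the single line that contains the keyword.
search_plain = plain.search(text)

# With DOTALL, "." also matches newlines, so match() spans the whole document.
match_dotall = dotall.match(text)

print(match_plain)                      # no match object
print(search_plain.group(0))            # just the line with the keyword
print(match_dotall.group(0) == text)    # the entire text is matched
```

Which one fits depends on the goal: `search` without DOTALL isolates the matching line, while DOTALL with `match` swallows everything around the keyword.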

As an aside: you seem to be confusing encoded text (which is bytes) with decoded text (which is characters). This isn't uncommon, especially in pre-3.x Python. In particular, this is very suspicious:

  ISO_8859_2_encoded = rawdata.decode('ISO-8859-2')

You are de-coding with ISO-8859-2, not en-coding, so call this variable "decoded". (Why not "ISO_8859_2_decoded"? Because ISO_8859_2 is an encoding. A decoded string doesn't have an encoding anymore.)

The rest of your code tries to match on rawdata and on UTF_8_encoded (both encoded strings) when it should probably be using the decoded unicode string instead.
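As a rough illustration of working on the decoded string, the sketch below builds a small ISO-8859-2 byte string in place of the downloaded page. The sample text, the `raw` and `decoded` names, and the replacement text are all assumptions, and it is written for Python 3, where bytes and text are distinct types. It decodes once, matches and substitutes on the text, and encodes only when producing output.

```python
import re

# Hypothetical sample ("Łódź office" plus the keyword), encoded as ISO-8859-2
# bytes to stand in for what urlopen(...).read() returns.
raw = "\u0141\xf3d\u017a office\nUNSUBSCRIBE\n".encode("iso-8859-2")

# Decode bytes -> text once, at the boundary.
decoded = raw.decode("iso-8859-2")

# On text, \w recognises non-ASCII letters; on raw bytes it only knows ASCII,
# so the non-ASCII word is broken apart.
words_text = re.findall(r"\w+", decoded)
words_bytes = re.findall(rb"\w+", raw)

# Substitute on the decoded text, then encode only for output.
replaced = re.sub(r"UNSUBSCRIBE", "opt out", decoded)
utf8_out = replaced.encode("utf-8")

# ascii() keeps the printout safe on terminals that cannot show non-ASCII.
print(ascii(words_text))
print(words_bytes)
print(ascii(replaced))
```

Keeping the encode step at the very end means the regular expressions never see encoding-specific bytes, which is what lets the same pattern work regardless of how the source file was encoded.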
