grep - Unexpected agrep() results related to max.distance in R


edit: This bug was found in 32-bit versions of R (version 2.9) and was fixed in 2..


This was tweeted to me today by @Lenyudu and I do not have an answer, so I thought I would post it here.

I have read the documentation for agrep() (fuzzy string matching) and it appears that I do not fully understand the max.distance parameter. Here's an example:

  pattern <- "Staatssekretar im Bundeskanzleramt"
  x <- "Bundeskanzleramt"
  agrep(pattern, x, max.distance = 18)
  agrep(pattern, x, max.distance = 19)

That behaves exactly as I expected. There are 18 characters of difference between the strings, so I would expect that to be the threshold for a match. Here is what confuses me:

  agrep(pattern, x, max.distance = 30)
  agrep(pattern, x, max.distance = 31)
  agrep(pattern, x, max.distance = 32)
  agrep(pattern, x, max.distance = 33)

Why are 30 and 33 matches, but not 31 and 32? To save you some counting,

  > nchar("Staatssekretar im Bundeskanzleramt")
  [1] 34
  > nchar("Bundeskanzleramt")
  [1] 16
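The figure of 18 can be checked instead of counted by hand. This is only a sketch, and it assumes a newer R than the 2.9.1 used in this question, because `adist()` (utils package) was added in a later release. It computes the generalized Levenshtein distance between the two strings; the variable names `len_diff` and `edit_dist` are just illustrative.

```r
pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"

# Difference in length: the characters of "Staatssekretar im ".
len_diff <- nchar(pattern) - nchar(x)

# Generalized Levenshtein distance (insertions, deletions, substitutions).
# adist() returns a matrix; with two single strings it has one cell.
edit_dist <- adist(pattern, x)[1, 1]

print(len_diff)
print(edit_dist)
```

Both values come out as 18: deleting the 18-character prefix turns the pattern into x, and no shorter sequence of edits can bridge an 18-character difference in length.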

I posted this on the R list some time ago and reported it as a bug on the R-bugs list. I did not get any useful feedback, so I tweeted to find out whether the bug was reproducible or whether I was just missing something. JD Long was able to reproduce it and posted the question here.

Note that, at least in R, agrep is a misnomer because it does not match regular expressions, while grep stands for "globally search for the regular expression and print". It should not have a problem with patterns longer than the target vector. (I think!)

On my Linux server all is fine, but not on my Mac and Windows machines.

Mac:

  sessionInfo()
  R version 2.9.1 (2009-06-26)
  i386-apple-darwin8.11.1
  locale: en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

  agrep(pattern, x, max.distance = 30)
  [1] 1
  agrep(pattern, x, max.distance = 31)
  integer(0)
  agrep(pattern, x, max.distance = 32

Linux:

  R version 2.9.1 (2009-06-26)
  x86_64-unknown-linux-gnu
  locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

  agrep(pattern, x, max.distance = 30)
  [1] 1
  agrep(pattern, x, m
  [1] 1
  agrep(pattern, x, max.distance = 33)
  [1] 1
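To compare machines without pasting each call by hand, something like the following sketch could be run on each one. It records the build (the behaviour above differed between the i386 Mac and the x86_64 Linux server) and then sweeps the same four thresholds. The names `build`, `distances` and `matched` are illustrative only, and no particular output is assumed here.

```r
pattern <- "Staatssekretar im Bundeskanzleramt"
x <- "Bundeskanzleramt"

# Record the build, since the results differ between 32-bit and 64-bit R.
build <- c(version = R.version.string,
           platform = R.version$platform,
           pointer_bytes = .Machine$sizeof.pointer)  # 4 on 32-bit, 8 on 64-bit

# Try each threshold and note whether x is reported as a match.
distances <- 30:33
matched <- vapply(distances, function(d) {
  length(agrep(pattern, x, max.distance = d)) > 0
}, logical(1))
names(matched) <- distances

print(build)
print(matched)
```

If a larger max.distance ever stops matching where a smaller one matched, as with 31 and 32 on the Mac above, that points to the bug and not to a misreading of the parameter.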
