Search for a pattern in numpy array

Question

Is there a simple way to find all relevant elements in a NumPy array according to some pattern?

For example, consider the following array:

a = array(['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
       'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
       'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
       'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd'], dtype=object)

I need to find all combinations that match '**dd'.

I basically need a function, which receives the array as input and returns a smaller array with all relevant elements:

>> b = func(a, pattern='**dd')
>> b = array(['zzdd', 'zddd', 'zndd', 'nddd', 'nndd', 'dddd'], dtype=object)

| numpy   | python   2017-01-05 19:01 5 Answers

Answers ( 5 )

  1. 2017-01-05 19:01
    import fnmatch
    import numpy as np
    a = ['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
           'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
           'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
           'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd']
    
    
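    b=[]
    for item in a:
        # "?" matches exactly one character, so "??dd" means any two characters followed by "dd"
        if fnmatch.fnmatch(item, "??dd"):
            b.append(item)
    print(b)
    

    output

    ['zzdd', 'zddd', 'zndd', 'nddd', 'nndd', 'dddd']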
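
If the result should be a NumPy array rather than a plain list, the explicit loop can be replaced by fnmatch.filter. The following is only a sketch: the helper name filter_wildcard is made up for illustration, and it assumes every element of the array is a Python str.

```python
import fnmatch

import numpy as np

a = np.array(['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
              'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
              'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
              'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd'], dtype=object)


def filter_wildcard(arr, pattern):
    """Hypothetical helper: keep the elements of arr that match a shell-style pattern."""
    # fnmatch.filter iterates over arr and returns the matching names as a list.
    matches = fnmatch.filter(arr, pattern)
    # Convert back to an object array so the result has the same dtype as the input.
    return np.array(matches, dtype=object)


b = filter_wildcard(a, "??dd")
print(b)
```

Note that fnmatch.filter and fnmatch.fnmatch normalise case according to the operating system; fnmatch.fnmatchcase is the case-sensitive variant.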
    
  2. 2017-01-05 19:01

    Python strings have a built-in method named .endswith(). As the name suggests, it checks whether a string ends with the value given in the parentheses. In your case you could do the following:

    for item in a:
        if item.endswith("dd"):
            print(item)
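
To return a smaller array instead of printing each match, the same str.endswith test can be turned into a boolean mask. This is a minimal sketch on a shortened sample array (the five strings below are taken from the question), assuming all elements are Python strings.

```python
import numpy as np

a = np.array(['zzzz', 'zzdd', 'dddn', 'nddd', 'dnnd'], dtype=object)

# One True/False entry per element: does the string end with "dd"?
mask = np.array([s.endswith("dd") for s in a], dtype=bool)

# Boolean indexing keeps only the elements where the mask is True.
b = a[mask]
print(b)
```

Keep in mind that endswith("dd") accepts strings of any length, whereas the pattern in the question also fixes the number of characters before dd.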
    
  3. 2017-01-05 19:01

    I'm not a numpy specialist. However, I understand that you want to create a filtered numpy array rather than a standard Python list; converting a Python list back to a numpy array takes time and memory, so that is a bad option.

    I don't think you mean a regex but rather a wildcard, in which case the right choice is the fnmatch module with the ??dd pattern (any 2 characters followed by dd at the end).

    (An alternative solution would use re.match() with ..dd$ as the pattern.)

    I would compute the indices matching your criteria, then use take to extract a sub-array:

    from numpy import array
    import fnmatch
    
    a = array(['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
           'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
           'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
           'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd'], dtype=object)
    
    def func(ar,pattern):
        indices = [i for i,x in enumerate(ar) if fnmatch.fnmatch(x,pattern)]
        return ar.take(indices)
    
    print(func(a,"??dd"))
    

    result:

    ['zzdd' 'zddd' 'zndd' 'nddd' 'nndd' 'dddd']
    

    regex version (same result in the end of course):

    from numpy import array
    import re
    
    def func(ar,pattern):
        indices = [i for i,x in enumerate(ar) if re.match(pattern,x)]
        return ar.take(indices)
    
    print(func(a,"..dd$"))
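
The question writes its pattern as '**dd', with each * apparently standing for exactly one character. A possible sketch of func that accepts that notation directly is shown below. The wildcard parameter is an assumption added for illustration, and re.fullmatch is used instead of re.match so that no trailing $ is needed.

```python
import re

import numpy as np

a = np.array(['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
              'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
              'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
              'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd'], dtype=object)


def func(arr, pattern="**dd", wildcard="*"):
    """Sketch: treat each `wildcard` character as 'exactly one character'."""
    # Turn every wildcard into '.', and escape everything else so it is matched literally.
    regex = "".join("." if ch == wildcard else re.escape(ch) for ch in pattern)
    compiled = re.compile(regex)
    # fullmatch requires the whole string to match, not just a prefix.
    mask = np.array([compiled.fullmatch(s) is not None for s in arr], dtype=bool)
    return arr[mask]


b = func(a, pattern="**dd")
print(b)
```

Boolean-mask indexing is used here instead of take; both return a new array containing only the selected elements.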
    
  4. 2017-01-05 19:01

    Here's an approach using numpy.core.defchararray.rfind (also available as numpy.char.rfind) to get the last index of a match; we then check whether that index equals the length of each string minus 2. The length of each string is 4 here, so we look for last indices equal to 4 - 2 = 2.

    Thus, an implementation would be -

    a[np.core.defchararray.rfind(a.astype(str),'dd')==2]
    

    If the strings are not of equal lengths, we need to get the lengths, subtract 2 and then compare -

    len_sub = np.array(list(map(len,a)))-len('dd')
    a[np.core.defchararray.rfind(a.astype(str),'dd')==len_sub]
    

    To test this out, let's add a longer string ending with dd at the end of the given sample -

    In [121]: a = np.append(a,'ewqjejwqjedd')
    
    In [122]: len_sub = np.array(list(map(len,a)))-len('dd')
    
    In [123]: a[np.core.defchararray.rfind(a.astype(str),'dd')==len_sub]
    Out[123]: array(['zzdd', 'zddd', 'zndd', 'nddd', 'nndd', 'dddd',
                     'ewqjejwqjedd'], dtype=object)
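
One edge case of the length comparison is worth checking: for a one-character string, rfind returns -1 (not found) and the length minus 2 is also -1, so the two sides compare equal. The sketch below illustrates this on a small made-up sample and contrasts it with np.char.endswith, which tests the suffix directly. It uses the public np.char namespace rather than np.core.defchararray.

```python
import numpy as np

# Small made-up sample, including two one-character strings.
a = np.array(['zzdd', 'dddn', 'ewqjejwqjedd', 'd', 'x'], dtype=object)
s = a.astype(str)  # np.char functions need a string dtype, not object

# Approach from the answer: last index of 'dd' must equal len(string) - 2.
len_sub = np.char.str_len(s) - len('dd')
rfind_mask = np.char.rfind(s, 'dd') == len_sub

# Direct suffix test.
endswith_mask = np.char.endswith(s, 'dd')

print(a[rfind_mask])     # also picks up the one-character strings
print(a[endswith_mask])  # only strings that really end with 'dd'
```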
    
  5. 2017-01-05 20:01

    Since it turns out you're actually working with pandas, there are simpler ways to do it at the Series level instead of just an ndarray, using the vectorized string operations:

    In [32]: s = pd.Series(['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
        ...:        'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
        ...:        'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
        ...:        'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd'])
    
    In [33]: s[s.str.endswith("dd")]
    Out[33]: 
    2     zzdd
    3     zddd
    10    zndd
    11    nddd
    20    nndd
    29    dddd
    dtype: object
    

    which produces a Series, or if you really insist on an ndarray:

    In [34]: s[s.str.endswith("dd")].values
    Out[34]: array(['zzdd', 'zddd', 'zndd', 'nddd', 'nndd', 'dddd'], dtype=object)
    

    You can also use regular expressions, if you prefer:

    In [49]: s[s.str.match(".*dd$")]
    Out[49]: 
    2     zzdd
    3     zddd
    10    zndd
    11    nddd
    20    nndd
    29    dddd
    dtype: object
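
Both endswith("dd") and the regex ".*dd$" accept strings of any length. If exactly two characters must precede dd, as the '**dd' pattern in the question suggests, Series.str.fullmatch is one option. This is a sketch on a made-up sample, assuming a pandas version that provides str.fullmatch (1.1 or later).

```python
import pandas as pd

# Made-up sample: 'xxzzdd' ends with "dd" but is longer than four characters.
s = pd.Series(['zzdd', 'zddd', 'dddn', 'xxzzdd', 'dnnd'])

# Exactly two arbitrary characters followed by "dd"; the whole string must match.
exact = s[s.str.fullmatch("..dd")]

# Any string ending in "dd", regardless of length.
suffix = s[s.str.endswith("dd")]

print(exact.to_numpy())
print(suffix.to_numpy())
```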
    