## Search for a pattern in numpy array

Question

Is there a simple way to find all elements of a NumPy array that match a given pattern?

For example, consider the following array:

``````a = array(['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd'], dtype=object)
``````

I need to find all combinations that match '**dd'.

I basically need a function that receives the array as input and returns a smaller array with all matching elements:

``````>> b = func(a, pattern='**dd')
>> b = array(['zzdd', 'zddd', 'zndd', 'nddd', 'nndd', 'dddd'], dtype=object)
``````


## Answers to Search for a pattern in numpy array ( 5 )

1. ``````import fnmatch
import numpy as np
a = ['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd']

# Collect every item that matches the wildcard pattern.
b = []
for item in a:
    if fnmatch.fnmatch(item, "z*dd"):
        b.append(item)
print(b)
``````

output

``````['zzdd', 'zddd', 'zndd']
``````
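Note that `z*dd` only keeps strings that start with `z`, which is why the output above is shorter than the result the question asks for. As a rough sketch (using a shortened sample list, which is an assumption made for brevity), `fnmatch.filter` condenses the loop, and `??dd` reproduces the kind of result shown in the question:

```python
import fnmatch

# Shortened sample of the question's data (an assumption for brevity).
a = ['zzzz', 'zzdd', 'zddd', 'dddn', 'zndd', 'nddd', 'nndd', 'dddd', 'dnnd']

# fnmatch.filter returns the items of the list that match the pattern.
# '?' matches exactly one character, '*' matches any run of characters.
starts_with_z = fnmatch.filter(a, "z*dd")
any_two_then_dd = fnmatch.filter(a, "??dd")

print(starts_with_z)
print(any_two_then_dd)
```

The first list contains only the `z...dd` entries, while the second contains every four-character string ending in `dd`.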
2. Python strings have a built-in method named `.endswith()`. As the name suggests, it returns `True` when the string ends with the given suffix. In your case you could do the following:

``````i = 0
while i < len(a):
    # Print every element whose last two characters are "dd".
    if a[i].endswith("dd"):
        print(a[i])
    i += 1
``````
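The loop above only prints the matches. If a filtered NumPy array is wanted, as in the question, one possible sketch is to build a boolean mask from `.endswith()` and index with it. The shortened sample array and the names `mask` and `b` are assumptions for illustration:

```python
import numpy as np

# Shortened sample of the question's data (an assumption for brevity).
a = np.array(['zzzz', 'zzdd', 'zddd', 'dnnd', 'nndd'], dtype=object)

# Build a boolean mask with one entry per element, then index with it.
mask = np.array([s.endswith("dd") for s in a], dtype=bool)
b = a[mask]

print(b)
```

The result keeps the `object` dtype of the input array.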
3. I'm not a `numpy` specialist. However, I understand that you want a filtered `numpy` array rather than a standard Python list, and converting from a Python list to a `numpy` array costs time and memory, so that is a poor option.

You probably mean a wildcard rather than a regex, in which case the right choice is the `fnmatch` module with the `??dd` pattern (any two characters followed by `dd` at the end).

(An alternative solution would use `re.match()` with `..dd$` as the pattern.)

I would compute the indices that match your criteria, then use `take` to extract the matching sub-array:

``````from numpy import array
import fnmatch

a = array(['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd'], dtype=object)

def func(ar, pattern):
    # Indices of the elements that match the wildcard pattern.
    indices = [i for i, x in enumerate(ar) if fnmatch.fnmatch(x, pattern)]
    return ar.take(indices)

print(func(a,"??dd"))
``````

result:

``````['zzdd' 'zddd' 'zndd' 'nddd' 'nndd' 'dddd']
``````

Regex version (same result, of course):

``````from numpy import array
import re

def func(ar, pattern):
    # Indices of the elements that match the regular expression.
    indices = [i for i, x in enumerate(ar) if re.match(pattern, x)]
    return ar.take(indices)

print(func(a, r"..dd$"))
``````
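`re.match` only anchors at the start of the string, which is why the regex needs a trailing `$`. A possible variant, sketched here under the assumption that a whole-string match is what is wanted, compiles the pattern once and uses `fullmatch`. The name `func_regex` and the small sample array are hypothetical:

```python
import re
import numpy as np

def func_regex(ar, pattern):
    # Hypothetical variant of func: compile once, require a full-string match.
    rx = re.compile(pattern)
    mask = np.array([rx.fullmatch(x) is not None for x in ar], dtype=bool)
    return ar[mask]

# Small made-up sample, including a five-character string that must not match.
a = np.array(['zzdd', 'zddd', 'ddnz', 'dddn', 'xzzdd'], dtype=object)
result = func_regex(a, r"..dd")
print(result)
```

With `fullmatch`, the pattern `..dd` accepts only strings of exactly four characters, so `'xzzdd'` is rejected without an explicit anchor.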
4. Here's an approach using `numpy.core.defchararray.rfind` (also exposed as `numpy.char.rfind`) to get the last index of a match in each string, and then check whether that index equals the length of the string minus 2. The length of each string is `4` here, so we look for last indices equal to `4 - 2 = 2`.

Thus, an implementation would be -

``````a[np.core.defchararray.rfind(a.astype(str),'dd')==2]
``````

If the strings are not of equal lengths, we need to get the lengths, subtract `2` and then compare -

``````len_sub = np.array(list(map(len,a)))-len('dd')
a[np.core.defchararray.rfind(a.astype(str),'dd')==len_sub]
``````

To test this out, let's add a longer string ending with `dd` at the end of the given sample -

``````In [121]: a = np.append(a,'ewqjejwqjedd')

In [122]: len_sub = np.array(list(map(len,a)))-len('dd')

In [123]: a[np.core.defchararray.rfind(a.astype(str),'dd')==len_sub]
Out[123]: array(['zzdd', 'zddd', 'zndd', 'nddd', 'nndd', 'dddd',
'ewqjejwqjedd'], dtype=object)
``````
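One edge case worth keeping in mind with the `rfind` comparison: for a string shorter than the suffix, both `rfind` (no match, so `-1`) and "length minus 2" can come out as `-1`, which gives a false positive. The sketch below illustrates this on a small made-up array and shows `np.char.endswith` as a possible alternative. It uses the public `np.char` namespace rather than `np.core.defchararray`, and the variable names are assumptions:

```python
import numpy as np

# Made-up sample that includes the one-character string 'd'.
a = np.array(['zzdd', 'zddd', 'dnnd', 'd', 'ewqjejwqjedd'], dtype=object)
s = a.astype(str)

# rfind returns -1 when 'dd' is absent; for the one-character string 'd',
# len - 2 is also -1, so the comparison reports a false match.
len_sub = np.char.str_len(s) - len('dd')
via_rfind = a[np.char.rfind(s, 'dd') == len_sub]

# np.char.endswith tests the suffix directly and avoids that edge case.
via_endswith = a[np.char.endswith(s, 'dd')]

print(via_rfind)
print(via_endswith)
```

Here `via_rfind` wrongly includes `'d'`, while `via_endswith` keeps only the strings that really end in `dd`.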
5. Since it turns out you're actually working with pandas, there are simpler ways to do it at the Series level instead of just an ndarray, using the vectorized string operations:

``````In [32]: s = pd.Series(['zzzz', 'zzzd', 'zzdd', 'zddd', 'dddn', 'ddnz', 'dnzn', 'nznz',
...:        'znzn', 'nznd', 'zndd', 'nddd', 'ddnn', 'dnnn', 'nnnz', 'nnzn',
...:        'nznn', 'znnn', 'nnnn', 'nnnd', 'nndd', 'dddz', 'ddzn', 'dznn',
...:        'znnz', 'nnzz', 'nzzz', 'zzzn', 'zznn', 'dddd', 'dnnd'])

In [33]: s[s.str.endswith("dd")]
Out[33]:
2     zzdd
3     zddd
10    zndd
11    nddd
20    nndd
29    dddd
dtype: object
``````

which produces a Series, or if you really insist on an ndarray:

``````In [34]: s[s.str.endswith("dd")].values
Out[34]: array(['zzdd', 'zddd', 'zndd', 'nddd', 'nndd', 'dddd'], dtype=object)
``````

You can also use regular expressions, if you prefer:

``````In [49]: s[s.str.match(".*dd$")]
Out[49]:
2     zzdd
3     zddd
10    zndd
11    nddd
20    nndd
29    dddd
dtype: object
``````