Index out of bounds when replacing NaNs through a function in Pandas

Question

I wrote a function that replaces the NaNs in a Pandas dataframe with the means of the respective columns. It worked on a small test dataframe, but when I applied it to a much larger one (30,000 rows, 9 columns) I got the error message: IndexError: index out of bounds

The function is the following:

# The 'update' function will replace all the NaNs in a dataframe with the mean of the respective columns

def update(df):  # takes one argument: the dataframe that will be updated
    ncol = df.shape[1]  # number of columns in the dataframe
    for i in range(0, ncol):  # loops over all the columns
        # subsets the df using the isnull() method, extracting the NaN
        # positions in column i, and assigns that column's mean to them
        df.iloc[:, i][df.isnull().iloc[:, i]] = df.mean()[i]
    return df

The small dataframe I used to test the function is the following:

     0     1   2   3
0  NaN   NaN   3   4
1  NaN   NaN   7   8
2  9.0  10.0  11  12

Could you explain the error? Your advice will be appreciated.


Tags: function | pandas | python | nan | indexoutofboundsexception   2017-01-06 01:01

Answers ( 2 )

  1. 2017-01-06 01:01

    I would use the DataFrame.fillna() method in conjunction with the DataFrame.mean() method:

    In [130]: df.fillna(df.mean())
    Out[130]:
         0     1   2   3
    0  9.0  10.0   3   4
    1  9.0  10.0   7   8
    2  9.0  10.0  11  12
    

    Mean values:

    In [138]: df.mean()
    Out[138]:
    0     9.0
    1    10.0
    2     7.0
    3     8.0
    dtype: float64
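
For readers who want to run this outside an IPython session, here is a minimal self-contained sketch. It rebuilds the 3x4 test frame from the question (the construction itself is an assumption, since the question only shows the printed frame) and assumes every column is numeric.

```python
import numpy as np
import pandas as pd

# Rebuild the small test frame shown in the question.
df = pd.DataFrame([[np.nan, np.nan, 3, 4],
                   [np.nan, np.nan, 7, 8],
                   [9.0, 10.0, 11, 12]])

# fillna() aligns the Series of column means with the column labels
# and returns a new frame; df itself is left untouched.
filled = df.fillna(df.mean())
print(filled)
```

Because the means are matched to columns by label, no loop over column positions is needed.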
    
  2. 2017-01-06 01:01

    You get "index out of bounds" because of how you look up the mean: df.mean()[i]. df.mean() is a Series whose index is the columns of df, so df.mean()[something] expects something to be a column label. Your i, however, is an ordinal position from range(0, ncol), not a column label, and that mismatch is what triggers the error.
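
As a minimal sketch of the label-versus-position distinction: the frame below and its column labels 10 and 20 are made up for illustration and are not taken from the question. With an integer-labelled index, [] looks up labels while .iloc looks up positions. Note that this sketch raises KeyError, not the IndexError from the question; the exact exception can depend on the pandas version and on the column labels involved.

```python
import pandas as pd

# Hypothetical frame whose column labels are integers, but not 0..n-1.
df = pd.DataFrame({10: [1.0, None], 20: [3.0, 4.0]})
means = df.mean()            # Series indexed by the column labels 10 and 20

by_position = means.iloc[0]  # first entry, whatever its label is
by_label = means[10]         # the entry whose label is 10

try:
    means[0]                 # label lookup: there is no column labelled 0
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True
```

Using .iloc makes the intent explicit whenever the loop variable is a position.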

    Your code, fixed:

    def update(df):  # takes one argument: the dataframe that will be updated
        ncol = df.shape[1]  # number of columns in the dataframe
        for i in range(0, ncol):  # loops over all the columns
            # subsets the df using the isnull() method, extracting the NaN
            # positions in column i, and assigns that column's mean to them
            df.iloc[:, i][df.isnull().iloc[:, i]] = df.mean().iloc[i]
        return df
    

    Also, your function modifies df in place, so the dataframe you pass in is changed as well. You may want to be careful; I'm not sure that's what you intended.
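
If the in-place change is not wanted, one option is to work on a copy. The sketch below is a hypothetical variant (the name update_copy is made up here), assuming all columns are numeric. It also writes through a single .iloc call instead of chaining two indexers, which is an assumption about what is safer across pandas versions, not part of the original function.

```python
import numpy as np
import pandas as pd

def update_copy(df):
    """Hypothetical variant: fill NaNs with column means on a copy."""
    out = df.copy()                  # leave the caller's frame untouched
    means = out.mean()               # one mean per column
    for i in range(out.shape[1]):
        mask = out.iloc[:, i].isnull().to_numpy()
        if mask.any():               # skip columns that have no NaNs
            out.iloc[mask, i] = means.iloc[i]
    return out

df = pd.DataFrame([[np.nan, np.nan, 3, 4],
                   [np.nan, np.nan, 7, 8],
                   [9.0, 10.0, 11, 12]])
result = update_copy(df)
```

After the call, df still contains its NaNs and result holds the filled values.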


    All that said, I'd recommend another approach:

    def update(df):
        return df.where(df.notnull(), df.mean(), axis=1)
    

    You could use any number of methods to fill missing values with the mean. I'd suggest using @MaxU's answer.

    df.where
    keeps the value from df where the first argument is True and takes the second argument otherwise

    df.where(df.notnull(), df.mean(), axis=1)
    

    df.combine_first with awkward pandas broadcasting

    df.combine_first(pd.DataFrame([df.mean()], df.index))
    

    np.where

    pd.DataFrame(
        np.where(
            df.notnull(), df.values,
            np.nanmean(df.values, 0, keepdims=1)),
        df.index, df.columns)
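
As a rough check, the sketch below runs three of these variants (fillna, df.where and np.where) on the question's test frame and compares the results. The frame construction and the variable names are assumptions made for illustration, and all columns are assumed to be numeric.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, np.nan, 3, 4],
                   [np.nan, np.nan, 7, 8],
                   [9.0, 10.0, 11, 12]])

via_fillna = df.fillna(df.mean())

# axis=1 aligns the Series of means with the columns of df.
via_where = df.where(df.notnull(), df.mean(), axis=1)

# Pure NumPy: nanmean over rows gives a 1 x ncol array that broadcasts.
via_numpy = pd.DataFrame(
    np.where(df.notnull(), df.values,
             np.nanmean(df.values, 0, keepdims=True)),
    df.index, df.columns)

# Compare as floats, since the variants may differ in column dtypes.
same = (np.allclose(via_fillna.to_numpy(dtype=float),
                    via_where.to_numpy(dtype=float))
        and np.allclose(via_fillna.to_numpy(dtype=float),
                        via_numpy.to_numpy(dtype=float)))
```

The values agree on this frame, but the resulting dtypes can differ: the NumPy route, for instance, turns every column into float64.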
    