Do the individual series contained in the DataFrame have their own index?

Consider a DataFrame df

    df = pd.DataFrame(dict(A=[1, 2, 3]))
    df

       A
    0  1
    1  2
    2  3

Now I will assign the series df.A to the variable a

    a = df.A
    a

    0    1
    1    2
    2    3
    Name: A, dtype: int64

Now I will increment the index of a by one

    a.index = a.index + 1
    print(a)
    print()
    print(df)

    1    1
    2    2
    3    3
    Name: A, dtype: int64

       A
    0  1
    1  2
    2  3

Nothing surprising here. Everything is as expected...
But now I reassign a = df.A

    a = df.A
    print(a)
    print()
    print(df)

    1    1
    2    2
    3    3
    Name: A, dtype: int64

       A
    0  1
    1  2
    2  3

I just reassigned a directly from df. The index of df is what it was, but the index of a is different: it is still what it was after I incremented it and before I reassigned a.

Of course, if I rebuild df, everything is reset.

    df = pd.DataFrame(dict(A=[1, 2, 3]))
    a = df.A
    print(a)
    print()
    print(df)

    0    1
    1    2
    2    3
    Name: A, dtype: int64

       A
    0  1
    1  2
    2  3

But this must mean that the pd.Series object tracked inside the pd.DataFrame object keeps its own index, and that this index is not necessarily the one shown at the pd.DataFrame level.

Question
Am I interpreting this correctly?

It even leads to this weirdness:

    pd.concat([df, df.A], axis=1)

         A    A
    0  1.0  NaN
    1  2.0  1.0
    2  3.0  2.0
    3  NaN  3.0
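The NaN rows are consistent with pd.concat aligning on index labels rather than on position when axis=1. Below is a minimal sketch that reproduces the same shape without relying on the cached column. The series `shifted` is a hypothetical stand-in for df.A after its index was incremented, not something taken from the session above.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3]))

# Hypothetical stand-in for df.A after `a.index = a.index + 1`:
# same values, but labels 1..3 instead of 0..2.
shifted = pd.Series([1, 2, 3], index=[1, 2, 3], name="A")

# With axis=1 the row labels of both inputs are combined (outer join),
# so a label that exists on only one side is filled with NaN there.
result = pd.concat([df, shifted], axis=1)
print(result)
```

Label 0 exists only in df and label 3 only in `shifted`, which is where the missing values appear.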
2 answers

This looks like a bug, or an unintended consequence of Python object identity. Before the assignment we can see that the indices are the same object:

    In [175]:
    df = pd.DataFrame(dict(A=[1, 2, 3]))
    df

    Out[175]:
       A
    0  1
    1  2
    2  3

    In [176]:
    print(id(df.index))
    print(id(df['A']))
    print(id(df['A'].index))
    a = df.A
    a

    132848496
    135123240
    132848496
    Out[176]:
    0    1
    1    2
    2    3
    Name: A, dtype: int64

Now, if we modify our reference, the indices become distinct objects, while a and df['A'] are still the same object:

    In [177]:
    a.index = a.index + 1
    print(a)
    print(id(a))
    print(id(df.A))
    print()
    print(df)
    print(id(df.A.index))
    print(id(a.index))

    1    1
    2    2
    3    3
    Name: A, dtype: int64
    135123240
    135123240

       A
    0  1
    1  2
    2  3
    135125144
    135125144

But now df.index is a different object from df['A'].index and a.index:

    In [181]:
    print(id(df.index))
    print(id(a.index))
    print(id(df['A'].index))

    132848496
    135124808
    135124808

Personally, I think this is an unintended consequence: once you take a reference a to column 'A', it is unclear what the original df should do when you start mutating that reference. I suspect this is even harder to catch than the usual SettingWithCopyWarning.

To avoid this, it is best to call copy() explicitly to make a deep copy, so that mutations do not affect the original df:

    In [183]:
    df = pd.DataFrame(dict(A=[1, 2, 3]))
    a = df['A'].copy()
    a.index = a.index+1
    print(a)
    print(df['A'])
    print(df['A'].index)
    print(df.index)
    print()
    print(id(df['A']))
    print(id(a))
    print(id(df['A'].index))
    print(id(a.index))

    1    1
    2    2
    3    3
    Name: A, dtype: int64
    0    1
    1    2
    2    3
    Name: A, dtype: int64
    RangeIndex(start=0, stop=3, step=1)
    RangeIndex(start=0, stop=3, step=1)

    135125984
    135165376
    135165544
    135125816
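Another way to get relabelled data without assigning to .index on an object obtained from df is Series.set_axis, which returns a new Series. This is only a sketch of an alternative to copy(), not something suggested in the original answer, and the name `b` is illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3]))

# set_axis builds and returns a new Series carrying the given labels;
# nothing is assigned to the .index of the column taken from df.
b = df["A"].set_axis(df.index + 1)

print(b.index.tolist())
print(df["A"].index.tolist())
print(df.index.tolist())
```

Because no attribute of the column object is reassigned, the column as seen through df keeps the labels of the frame.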

This is a game of references (pointers): each DataFrame has its own index array, and the series in the DataFrame hold references to that same index array.

When a.index = a.index + 1 is executed, the reference held by the series is changed, so a.index is the same as df.A.index, which is now different from df.index.
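The rebinding described here can be seen in isolation: arithmetic on an Index produces a new object and leaves the original untouched, so the assignment only swaps which object the series refers to. A small sketch, with illustrative names `idx` and `shifted`:

```python
import pandas as pd

idx = pd.RangeIndex(3)

# `idx + 1` does not modify idx in place; it builds a new Index.
shifted = idx + 1

print(shifted is idx)
print(idx.tolist(), shifted.tolist())
```

So the original index array that df refers to is never altered. Only the reference stored on the series is replaced.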

Now, if you clear the item cache of df, the series is reset:

    print(df.A.index)
    df._clear_item_cache()
    print(df.A.index)
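Since _clear_item_cache is a private method, it is safer to assume it may not exist in every pandas version. The sketch below looks it up defensively and then checks, using only public API, whether the column and the frame agree on their labels. The names `clear_cache` and `in_sync` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3]))
a = df["A"]
a.index = a.index + 1

# Assumption: _clear_item_cache is private and may be absent,
# so fetch it with getattr instead of calling it directly.
clear_cache = getattr(df, "_clear_item_cache", None)
if callable(clear_cache):
    clear_cache()

# Public check: does the column seen through df carry the frame's labels?
in_sync = df["A"].index.equals(df.index)
print(in_sync)
```

Comparing with Index.equals rather than id() tests the labels themselves, which is usually what matters for alignment.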

By default, the index of a series inside a DataFrame cannot be changed independently of the frame, but holding a reference to the series offers a workaround for replacing its index reference.


Source: https://habr.com/ru/post/1267026/

