How to create an edge list dataframe from an adjacency matrix in Python?

I have a pandas DataFrame (think of it as a weighted adjacency matrix of nodes in a network) of the form df:

         A    B    C    D
    A    0  0.5  0.5    0
    B    1    0    0    0
    C  0.8    0    0  0.2
    D    0    0    1    0

I want a dataframe that instead represents a list of edges. For the above example, I would need something of the form edge_list_df:

       Source Target  Weight
    0       A      B     0.5
    1       A      C     0.5
    2       A      D     0
    3       B      A     1
    4       B      C     0
    5       B      D     0
    6       C      A     0.8
    7       C      B     0
    8       C      D     0.2
    9       D      A     0
    10      D      B     0
    11      D      C     1

What is the most efficient way to create this?

4 answers

Mark the diagonal as NaN, then stack:

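    # Set the diagonal (self-loops) to NaN in place; the tuple index pairs row i with column i
    df.values[(np.arange(len(df)),) * 2] = np.nan
    df
    Out[172]:
         A    B    C    D
    A  NaN  0.5  0.5  0.0
    B  1.0  NaN  0.0  0.0
    C  0.8  0.0  NaN  0.2
    D  0.0  0.0  1.0  NaN

    # stack() drops the NaN entries, leaving only the off-diagonal pairs
    df.stack().reset_index()
    Out[173]:
       level_0 level_1    0
    0        A       B  0.5
    1        A       C  0.5
    2        A       D  0.0
    3        B       A  1.0
    4        B       C  0.0
    5        B       D  0.0
    6        C       A  0.8
    7        C       B  0.0
    8        C       D  0.2
    9        D       A  0.0
    10       D       B  0.0
    11       D       C  1.0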
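
On recent pandas or NumPy versions, writing into df.values and relying on stack() to drop NaN may not behave as shown. The sketch below reaches the same result by comparing the two index levels instead. The sample frame and the column names Source, Target and Weight come from the question; everything else is an illustrative assumption, not the answerer's code.

```python
import pandas as pd

# Sample adjacency matrix from the question
df = pd.DataFrame(
    [[0, 0.5, 0.5, 0], [1, 0, 0, 0], [0.8, 0, 0, 0.2], [0, 0, 1, 0]],
    index=list("ABCD"), columns=list("ABCD"),
)

# Long format: one row per (row label, column label) pair
stacked = df.stack()

# Drop self-loops by comparing the two index levels rather than using NaN
src = stacked.index.get_level_values(0)
tgt = stacked.index.get_level_values(1)
edge_list_df = (
    stacked[src != tgt]
    .rename_axis(["Source", "Target"])
    .reset_index(name="Weight")
)
print(edge_list_df)
```

This leaves df itself unchanged and names the columns in one step.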

Using rename_axis + reset_index + melt:

    df.rename_axis('Source')\
      .reset_index()\
      .melt('Source', value_name='Weight', var_name='Target')\
      .query('Source != Target')\
      .reset_index(drop=True)

       Source Target  Weight
    0       B      A     1.0
    1       C      A     0.8
    2       D      A     0.0
    3       A      B     0.5
    4       C      B     0.0
    5       D      B     0.0
    6       A      C     0.5
    7       B      C     0.0
    8       D      C     1.0
    9       A      D     0.0
    10      B      D     0.0
    11      C      D     0.2

melt was introduced as a DataFrame method in 0.20; for older versions, use pd.melt instead:

    v = df.rename_axis('Source').reset_index()
    df = pd.melt(
        v,
        id_vars='Source',
        value_name='Weight',
        var_name='Target'
    ).query('Source != Target')\
     .reset_index(drop=True)
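
The melt result is ordered column by column (by Target first), whereas the question lists edges row by row. If the row-wise order matters, one possible sketch is to sort afterwards. The sample frame is the one from the question; the sort_values step is an addition for illustration, not part of the answer above.

```python
import pandas as pd

# Sample adjacency matrix from the question
df = pd.DataFrame(
    [[0, 0.5, 0.5, 0], [1, 0, 0, 0], [0.8, 0, 0, 0.2], [0, 0, 1, 0]],
    index=list("ABCD"), columns=list("ABCD"),
)

edges = (
    df.rename_axis("Source")
      .reset_index()
      .melt("Source", value_name="Weight", var_name="Target")
      .query("Source != Target")
      .sort_values(["Source", "Target"])  # restore row-by-row order
      .reset_index(drop=True)
)
print(edges)
```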

Timings

    x = np.random.randn(1000, 1000)
    x[(np.arange(len(x)),) * 2] = 0   # zero the diagonal (tuple index: row i, column i)
    df = pd.DataFrame(x)

    %%timeit
    df.index.name = 'Source'
    df.reset_index()\
      .melt('Source', value_name='Weight', var_name='Target')\
      .query('Source != Target')\
      .reset_index(drop=True)
    1 loop, best of 3: 139 ms per loop

    # Wen's solution
    %%timeit
    df.values[(np.arange(len(df)),) * 2] = np.nan
    df.stack().reset_index()
    10 loops, best of 3: 45 ms per loop

Two approaches using NumPy tools -

Approach #1

    def edgelist(df):
        a = df.values
        c = df.columns
        n = len(c)

        # Build an (n, n, 2) grid of (source, target) label pairs
        c_ar = np.array(c)
        out = np.empty((n, n, 2), dtype=c_ar.dtype)
        out[..., 0] = c_ar[:, None]
        out[..., 1] = c_ar

        # Keep only the off-diagonal entries
        mask = ~np.eye(n, dtype=bool)
        df_out = pd.DataFrame(out[mask], columns=['Source', 'Target'])
        df_out['Weight'] = a[mask]
        return df_out

Sample run -

    In [155]: df
    Out[155]:
         A    B    C    D
    A  0.0  0.5  0.5  0.0
    B  1.0  0.0  0.0  0.0
    C  0.8  0.0  0.0  0.2
    D  0.0  0.0  1.0  0.0

    In [156]: edgelist(df)
    Out[156]:
       Source Target  Weight
    0       A      B     0.5
    1       A      C     0.5
    2       A      D     0.0
    3       B      A     1.0
    4       B      C     0.0
    5       B      D     0.0
    6       C      A     0.8
    7       C      B     0.0
    8       C      D     0.2
    9       D      A     0.0
    10      D      B     0.0
    11      D      C     1.0

Approach #2

    # https://stackoverflow.com/a/46736275/ @Divakar
    def skip_diag_strided(A):
        # Strided view of all off-diagonal elements, in row-major order
        m = A.shape[0]
        strided = np.lib.stride_tricks.as_strided
        s0, s1 = A.strides
        return strided(A.ravel()[1:], shape=(m-1, m), strides=(s0+s1, s1))

    # https://stackoverflow.com/a/48234170/ @Divakar
    def combinations_without_repeat(a):
        n = len(a)
        out = np.empty((n, n-1, 2), dtype=a.dtype)
        out[:, :, 0] = np.broadcast_to(a[:, None], (n, n-1))
        out.shape = (n-1, n, 2)
        # onecold is not defined in this post (presumably a helper from the linked answer)
        out[:, :, 1] = onecold(a)
        out.shape = (-1, 2)
        return out

    cols = df.columns.values.astype('S1')
    df_out = pd.DataFrame(combinations_without_repeat(cols))
    df_out['Weight'] = skip_diag_strided(df.values.copy()).ravel()
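
The snippet above calls onecold, which is not defined anywhere in this post. As an assumption about what that helper has to return for the pairing to work, the sketch below defines it so that row i holds the labels rotated left by i + 1 positions. It uses np.roll rather than strides, and reshape views rather than assigning to .shape, so it is a stand-in for illustration and not the original author's implementation.

```python
import numpy as np

def onecold(a):
    # Assumed behaviour: row i is `a` rotated left by i + 1, shape (n - 1, n)
    n = len(a)
    return np.stack([np.roll(a, -(i + 1)) for i in range(n - 1)])

def combinations_without_repeat(a):
    n = len(a)
    out = np.empty((n, n - 1, 2), dtype=a.dtype)
    # First slot: each label repeated n - 1 times (the source node)
    out[:, :, 0] = a[:, None]
    # Second slot: filled through an (n - 1, n, 2) view of the same buffer
    out.reshape(n - 1, n, 2)[:, :, 1] = onecold(a)
    return out.reshape(-1, 2)

labels = np.array(list("ABCD"))
pairs = combinations_without_repeat(labels)
print(pairs)
```

With this definition the pairs come out row by row (A-B, A-C, A-D, B-A, ...), which matches the order of skip_diag_strided(...).ravel().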

Runtime test

Using @cᴏʟᴅsᴘᴇᴇᴅ's timing setup:

    In [704]: x = np.random.randn(1000, 1000)
         ...: x[(np.arange(len(x)),) * 2] = 0
         ...:
         ...: df = pd.DataFrame(x)

    # @cᴏʟᴅsᴘᴇᴇᴅ soln
    In [705]: %%timeit
         ...: df.index.name = 'Source'
         ...: df.reset_index()\
         ...:   .melt('Source', value_name='Weight', var_name='Target')\
         ...:   .query('Source != Target')\
         ...:   .reset_index(drop=True)
    10 loops, best of 3: 67.4 ms per loop

    # @Wen soln
    In [706]: %%timeit
         ...: df.values[(np.arange(len(df)),) * 2] = np.nan
         ...: df.stack().reset_index()
    100 loops, best of 3: 19.6 ms per loop

    # Proposed in this post - Approach #1
    In [707]: %timeit edgelist(df)
    10 loops, best of 3: 24.8 ms per loop

    # Proposed in this post - Approach #2
    In [708]: %%timeit
         ...: cols = df.columns.values.astype('S1')
         ...: df_out = pd.DataFrame(combinations_without_repeat(cols))
         ...: df_out['Weight'] = skip_diag_strided(df.values.copy()).ravel()
    100 loops, best of 3: 17.4 ms per loop

Using the NetworkX 2.x API:

    import networkx as nx

    In [246]: G = nx.from_pandas_adjacency(df, create_using=nx.MultiDiGraph())

    In [247]: G.edges(data=True)
    Out[247]: OutMultiEdgeDataView([('A', 'B', {'weight': 0.5}), ('A', 'C', {'weight': 0.5}), ('B', 'A', {'weight': 1.0}), ('C', 'A', {'weight': 0.8}), ('C', 'D', {'weight': 0.2}), ('D', 'C', {'weight': 1.0})])

    In [248]: nx.to_pandas_edgelist(G)
    Out[248]:
      source target  weight
    0      A      B     0.5
    1      A      C     0.5
    2      B      A     1.0
    3      C      A     0.8
    4      C      D     0.2
    5      D      C     1.0
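
Note that this result has only six rows: NetworkX treats zero entries of the adjacency matrix as missing edges, so the zero-weight pairs from the question's expected output are absent. A pandas-only sketch of the same filtering follows. It assumes the sample frame from the question and reuses the lowercase column names from the NetworkX output.

```python
import pandas as pd

# Sample adjacency matrix from the question
df = pd.DataFrame(
    [[0, 0.5, 0.5, 0], [1, 0, 0, 0], [0.8, 0, 0, 0.2], [0, 0, 1, 0]],
    index=list("ABCD"), columns=list("ABCD"),
)

stacked = df.stack()

# Keep only nonzero weights; the zero diagonal is dropped along the way
edges = (
    stacked[stacked != 0]
    .rename_axis(["source", "target"])
    .reset_index(name="weight")
)
print(edges)
```

If zero-weight edges must be kept, as in the question, one of the other answers is the better fit.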

Source: https://habr.com/ru/post/1274648/

