Indexing and data columns in Pandas / PyTables

http://pandas.pydata.org/pandas-docs/stable/io.html#indexing

I am really confused by the concept of data columns in Pandas HDF5 IO, and googling turns up very little information about it. Since I am diving into Pandas for a large project that involves HDF5 storage, I would like to understand such concepts clearly.

The docs say:

You can designate (and index) certain columns that you want to be able to perform queries (other than the indexable columns, which you can always query). For instance say you want to perform this common operation, on-disk, and return just the frame that matches this query. You can specify data_columns = True to force all columns to be data_columns

This is confusing:

  • other than the indexable columns, which you can always query: What are indexable columns? Aren't all columns indexable? What does this term mean?

  • For instance say you want to perform this common operation, on-disk, and return just the frame that matches this query. How is this different from a normal query on a PyTables table, with or without any indexes on the data_columns?

  • What is the fundamental difference between a non-indexed column, an indexed column, and a data_column?

1 answer

You should just give it a try.

    In [22]: df = DataFrame(np.random.randn(5,2),columns=['A','B'])

    In [23]: store = pd.HDFStore('test.h5',mode='w')

    In [24]: store.append('df_only_indexables',df)

    In [25]: store.append('df_with_data_columns',df,data_columns=True)

    In [26]: store.append('df_no_index',df,data_columns=True,index=False)

    In [27]: store
    Out[27]:
    <class 'pandas.io.pytables.HDFStore'>
    File path: test.h5
    /df_no_index                     frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])
    /df_only_indexables              frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index])
    /df_with_data_columns            frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])

    In [28]: store.close()

  • You automatically get the index of the stored frame as a queryable column. By default, no other columns can be queried.

  • If you specify data_columns=True or data_columns=list_of_columns, those columns are stored separately and can then be queried.

  • If you specify index=False, a PyTables index is not created automatically for the queryable columns (that is, the index and/or the data_columns).
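
As a rough illustration of the first two points, the sketch below queries the same frame stored with and without data columns. The file name, the keys, and the sample values are made up for the example, and it assumes pandas with PyTables installed. The exact exception type for the rejected query may vary between pandas versions, so the sketch catches Exception broadly.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, fixed so the query results are predictable.
df = pd.DataFrame({"A": [-1.0, 0.5, 2.0, -0.3, 1.5],
                   "B": [0.1, 0.2, 0.3, 0.4, 0.5]})
path = os.path.join(tempfile.mkdtemp(), "query_demo.h5")

with pd.HDFStore(path, mode="w") as store:
    store.append("df_only_indexables", df)
    store.append("df_with_data_columns", df, data_columns=True)

    # The index is an indexable, so it can be queried in either table.
    by_index = store.select("df_only_indexables", where="index >= 3")

    # Column A can be queried here because it was stored as a data column.
    by_column = store.select("df_with_data_columns", where="A > 0")

    # The same column query is rejected when A is not a data column.
    try:
        store.select("df_only_indexables", where="A > 0")
        column_query_failed = False
    except Exception as exc:
        column_query_failed = True
        print("query on a non-data column failed:", type(exc).__name__)

print(by_index)
print(by_column)
```

In both successful cases the filtering happens on disk, and only the matching rows are read back into a frame.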

To see the actual indexes being created (the PyTables indexes), see the output below. colindexes shows which columns have an actual PyTables index. (I trimmed the output a bit.)

    /df_no_index/table (Table(5,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "A": Float64Col(shape=(), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2)}
      byteorder := 'little'
      chunkshape := (2730,)
      /df_no_index/table._v_attrs (AttributeSet), 15 attributes:
       [A_dtype := 'float64',
        A_kind := ['A'],
        B_dtype := 'float64',
        B_kind := ['B'],
        CLASS := 'TABLE',
        FIELD_0_FILL := 0,
        FIELD_0_NAME := 'index',
        FIELD_1_FILL := 0.0,
        FIELD_1_NAME := 'A',
        FIELD_2_FILL := 0.0,
        FIELD_2_NAME := 'B',
        NROWS := 5,
        TITLE := '',
        VERSION := '2.7',
        index_kind := 'integer']

    /df_only_indexables/table (Table(5,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1)}
      byteorder := 'little'
      chunkshape := (2730,)
      autoindex := True
      colindexes := {
        "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
      /df_only_indexables/table._v_attrs (AttributeSet), 11 attributes:
       [CLASS := 'TABLE',
        FIELD_0_FILL := 0,
        FIELD_0_NAME := 'index',
        FIELD_1_FILL := 0.0,
        FIELD_1_NAME := 'values_block_0',
        NROWS := 5,
        TITLE := '',
        VERSION := '2.7',
        index_kind := 'integer',
        values_block_0_dtype := 'float64',
        values_block_0_kind := ['A', 'B']]

    /df_with_data_columns/table (Table(5,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "A": Float64Col(shape=(), dflt=0.0, pos=1),
      "B": Float64Col(shape=(), dflt=0.0, pos=2)}
      byteorder := 'little'
      chunkshape := (2730,)
      autoindex := True
      colindexes := {
        "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "B": Index(6, medium, shuffle, zlib(1)).is_csi=False}
      /df_with_data_columns/table._v_attrs (AttributeSet), 15 attributes:
       [A_dtype := 'float64',
        A_kind := ['A'],
        B_dtype := 'float64',
        B_kind := ['B'],
        CLASS := 'TABLE',
        FIELD_0_FILL := 0,
        FIELD_0_NAME := 'index',
        FIELD_1_FILL := 0.0,
        FIELD_1_NAME := 'A',
        FIELD_2_FILL := 0.0,
        FIELD_2_NAME := 'B',
        NROWS := 5,
        TITLE := '',
        VERSION := '2.7',
        index_kind := 'integer']

So, if you want to query a column, make it a data_column. If you do not, the columns are stored in blocks by dtype (faster / less space).
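
To make the storage difference concrete, here is a small sketch that promotes only column A to a data column and then lists the column names of the underlying PyTables table. The file name, key, and values are hypothetical; it assumes pandas with PyTables installed, and reaches the raw table through get_storer(...).table.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"A": [-1.0, 0.5, 2.0], "B": [0.1, 0.2, 0.3]})
path = os.path.join(tempfile.mkdtemp(), "layout_demo.h5")

with pd.HDFStore(path, mode="w") as store:
    # Only A becomes a data column; B stays in a dtype block.
    store.append("df", df, data_columns=["A"])

    # Column names of the on-disk PyTables table.
    colnames = list(store.get_storer("df").table.colnames)

    # A is queryable; the full rows (including B) come back.
    positive_a = store.select("df", where="A > 0")

print(colnames)
print(positive_a)
```

A shows up as its own on-disk column, while B is folded into a values_block column together with any other non-data columns of the same dtype.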

You usually want to index a column that you retrieve on, BUT if you are creating a store and then appending multiple files to it, you usually turn off index creation and build the index at the end (since it is quite expensive to create as you go).
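
A minimal sketch of that append-first, index-last pattern follows. The chunks stand in for the separate files, and the file name and key are invented for the example. It assumes pandas with PyTables installed, uses HDFStore.create_table_index to build the indexes afterwards, and checks the result through the PyTables is_indexed attribute of each column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data, split into chunks that stand in for separate input files.
df = pd.DataFrame({"A": [-1.0, 0.5, 2.0, -0.3, 1.5],
                   "B": [0.1, 0.2, 0.3, 0.4, 0.5]})
chunks = [df.iloc[:3], df.iloc[3:]]
path = os.path.join(tempfile.mkdtemp(), "bulk_demo.h5")

with pd.HDFStore(path, mode="w") as store:
    # Append everything with index creation turned off.
    for chunk in chunks:
        store.append("df", chunk, data_columns=True, index=False)

    table = store.get_storer("df").table
    indexed_before = table.cols.A.is_indexed

    # Build the PyTables indexes once, after all the data is in place.
    store.create_table_index("df", columns=["A", "B"])
    indexed_after = store.get_storer("df").table.cols.A.is_indexed

    nrows = store.get_storer("df").nrows
    result = store.select("df", where="A > 0")

print(indexed_before, indexed_after, nrows)
```

Queries on data columns work either way; the index only affects how quickly PyTables can evaluate them.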

See the cookbook for more questions and examples.


Source: https://habr.com/ru/post/974945/

