Matlab data structure for mixed types: what is time- and space-efficient?

I need to process large amounts of tabular data of mixed types: strings and doubles. A standard problem, I would think. What is the best data structure in Matlab for working with this?

A cell array is definitely not the answer. It is extremely memory-inefficient (tests shown below). A dataset (from the Statistics Toolbox) is horribly inefficient in both time and space. That leaves me with a struct array or a struct of arrays. I tested all four options for both time and memory below, and it seems to me that the struct of arrays is the best option for the things I tested.

I'm relatively new to Matlab and this is a little disappointing, to be honest. In any case, I am looking for advice on whether I am missing something, and whether my tests are accurate and reasonable. Am I overlooking other considerations besides access, conversion, and memory usage that are likely to come up as I write more code with this stuff? (FYI, I am using R2010b.)

** Test #1: Access speed. Accessing a data item.

    cellarray:0.002s
    dataset:36.665s       %<<< This is horrible
    structarray:0.001s
    struct of array:0.000s

** Test #2: Conversion speed and memory usage. I dropped dataset from this test.

    Cellarray(doubles)->matrix:d->m: 0.865s
    Cellarray(mixed)->structarray:c->sc: 0.268s
    Cellarray(doubles)->structarray:d->sd: 0.430s
    Cellarray(mixed)->struct of arrays:c->sac: 0.361s
    Cellarray(doubles)->struct of arrays:d->sad: 0.887s

      Name           Size         Bytes  Class     Attributes

      c         100000x10      68000000  cell
      d         100000x10      68000000  cell
      m         100000x10       8000000  double
      sac            1x1       38001240  struct
      sad            1x1        8001240  struct
      sc        100000x1       68000640  struct
      sd        100000x1       68000640  struct

=================== CODE: TEST # 1

    %% cellarray
    c = cell(100000,10);
    c(:,[1,3,5,7,9]) = num2cell(zeros(100000,5));
    c(:,[2,4,6,8,10]) = repmat( {'asdf'}, 100000, 5);
    cols = strcat('Var', strtrim(cellstr(num2str((1:10)'))))';
    te = tic;
    for iii=1:1000
        x = c(1234,5);
    end
    te = toc(te);
    fprintf('cellarray:%0.3fs\n', te);

    %% dataset
    ds = dataset( { c, cols{:} } );
    te = tic;
    for iii=1:1000
        x = ds(1234,5);
    end
    te = toc(te);
    fprintf('dataset:%0.3fs\n', te);

    %% structarray
    s = cell2struct( c, cols, 2 );
    te = tic;
    for iii=1:1000
        x = s(1234).Var5;
    end
    te = toc(te);
    fprintf('structarray:%0.3fs\n', te);

    %% struct of arrays
    for iii=1:numel(cols)
        if iii/2==floor(iii/2) % even => string
            sac.(cols{iii}) = c(:,iii);
        else
            sac.(cols{iii}) = cell2mat(c(:,iii));
        end
    end
    te = tic;
    for iii=1:1000
        x = sac.Var5(1234);
    end
    te = toc(te);
    fprintf('struct of array:%0.3fs\n', te);

=================== CODE: TEST # 2

    %% cellarray
    % c - cellarray containing mixed type
    c = cell(100000,10);
    c(:,[1,3,5,7,9]) = num2cell(zeros(100000,5));
    c(:,[2,4,6,8,10]) = repmat( {'asdf'}, 100000, 5);
    cols = strcat('Var', strtrim(cellstr(num2str((1:10)'))))';
    % d - cellarray containing doubles only
    d = num2cell( zeros( 100000, 10 ) );

    %% matrix
    % doubles only
    te = tic;
    m = cell2mat(d);
    te = toc(te);
    fprintf('Cellarray(doubles)->matrix:d->m: %0.3fs\n', te);

    %% structarray
    % mixed
    te = tic;
    sc = cell2struct( c, cols, 2 );
    te = toc(te);
    fprintf('Cellarray(mixed)->structarray:c->sc: %0.3fs\n', te);
    % doubles
    te = tic;
    sd = cell2struct( d, cols, 2 );
    te = toc(te);
    fprintf('Cellarray(doubles)->structarray:d->sd: %0.3fs\n', te);

    %% struct of arrays
    % mixed
    te = tic;
    for iii=1:numel(cols)
        if iii/2==floor(iii/2) % even => string
            sac.(cols{iii}) = c(:,iii);
        else
            sac.(cols{iii}) = cell2mat(c(:,iii));
        end
    end
    te = toc(te);
    fprintf('Cellarray(mixed)->struct of arrays:c->sac: %0.3fs\n', te);
    % doubles
    te = tic;
    for iii=1:numel(cols)
        sad.(cols{iii}) = cell2mat(d(:,iii));
    end
    te = toc(te);
    fprintf('Cellarray(doubles)->struct of arrays:d->sad: %0.3fs\n', te);

    %%
    clear iii cols te;
    whos
+6
2 answers

The way to make Matlab code space- and time-efficient is to work with large arrays of primitives, that is, arrays of doubles, ints, or chars. This gives you a tighter layout in memory and allows you to perform vectorized operations.

For tabular data, each column will be homogeneous in type, but different columns can be of different types, and usually you will have many more rows than columns. You will also often perform operations (comparisons or math) on all elements of a column, or on a masked selection of a column, which lends itself to vectorized operations. So store each column as a columnar array, preferably of primitives. You can put these columns in the fields of a struct or the elements of a cell vector; it doesn't matter much in terms of performance, and the struct form will be more readable and look more like a table. A 2-D cell array, or any other data structure that breaks every element out into its own little mxArray, will not perform acceptably.

That is, if you have a table of 100,000 rows by 10 columns, you want a 10-long cell array or a 10-field struct, with each of those elements or fields holding a 100,000-long primitive column vector.
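As a rough illustration of that layout, here is a minimal sketch of a table held as a scalar struct with one column vector per field. The field names (id, price, label) and the values are made up for the example; they are not from the question's data.

```matlab
% Sketch only: a 100,000-row table stored as a struct of column vectors.
% Field names and values are hypothetical.
nRows = 100000;
t = struct();                           % scalar struct, one field per column
t.id    = (1:nRows)';                   % numeric column: one contiguous double vector
t.price = zeros(nRows, 1);              % numeric column
t.label = repmat({'asdf'}, nRows, 1);   % string column kept as a cellstr vector

% Vectorized operations act on whole columns at once.
mask = t.id > 99990;                    % logical mask over all rows
t.price(mask) = 1.5;                    % masked assignment
selectedLabels = t.label(mask);         % masked selection from the string column
total = sum(t.price);                   % column-wise reduction
```

Only the string column pays the per-cell overhead here; the numeric columns are each stored as a single array.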

The dataset object is basically a wrapper around a struct of columnar arrays as described above, stuck inside an object. But objects in Matlab have more overhead than regular structs and cells; you pay for one or more method calls each time you access it. Have a look at "Is MATLAB OOP slow or am I doing something wrong?" (full disclosure: that is one of my answers).

The test you set up is not indicative of how good Matlab code will perform, because it does scalar single-element access. That is, it pays for the column access and then the row element access on every pass through the loop. If your Matlab code does that, you are already out of luck. To be fast, you need to pull the columns out outside the loop, that is, hoist the expensive column access operation to an outer loop or setup code, and then either perform vectorized operations (such as +, ==, <, ismember, and so on) on whole column vectors, or loop over primitive numeric vectors (which the JIT can optimize). If you do that, then dataset and other object-based table structures can have decent performance.

Strings in Matlab kind of suck, unfortunately. You want to get away from cellstrs. You have a couple of options.
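The three access patterns described above can be sketched as follows. The table t and its fields are hypothetical stand-ins, and no timings are claimed here; with a plain struct the difference may be small, while with an object such as dataset every access in the first loop would go through method calls.

```matlab
% Hypothetical table stored as a struct of columns.
n = 100000;
t = struct();
t.Var1 = (1:n)';
t.Var2 = repmat({'asdf'}, n, 1);

% Pattern 1: column access plus element access on every iteration.
count1 = 0;
for i = 1:n
    if t.Var1(i) > 50000
        count1 = count1 + 1;
    end
end

% Pattern 2: hoist the column access out of the loop, then loop over
% a plain double vector.
v = t.Var1;
count2 = 0;
for i = 1:n
    if v(i) > 50000
        count2 = count2 + 1;
    end
end

% Pattern 3: vectorized operation on the whole column.
count3 = nnz(v > 50000);
```

All three compute the same count; the point is only where the column access happens.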

  • If the strings in the column are all the same length and you do not have long strings, you can store the column of strings as a 2-D char array. This is a single contiguous array in memory, is more space-efficient than a cell array, and can be faster for comparison operations and so on. It is also one of Matlab's native string representations, so the regular string functions will work with it.
  • If the strings have low cardinality (that is, the number of distinct values is small relative to the total number of elements), you can encode them as "symbols", storing them as an array of primitive ints that are indexes into a list of the distinct string values. The unique and ismember functions can help implement these encodings. As long as you only do equality tests and no sorting, these encoded string columns will operate at numeric speed.
  • I believe that one of the toolboxes, possibly the one with dataset, has support for "classifier" or "categorical" variables, which are basically a ready-made implementation of this low-cardinality encoding.
  • Do not bother with Java Strings; the overhead of crossing the Matlab-to-Java barrier will make it a net loss.
  • Hopefully someone else has come up with something better.
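The "symbol" encoding from the list above can be sketched with unique and ismember. The column contents and the variable names (colors, levels, codes) are invented for illustration, and storing the codes as int32 is just one possible choice.

```matlab
% Hypothetical low-cardinality string column.
colors = {'red'; 'green'; 'red'; 'blue'; 'green'; 'red'};

% Encode: levels holds the distinct strings (sorted), and codes holds,
% for each row, the index of its string within levels.
[levels, ~, codes] = unique(colors);
codes = int32(codes(:));                 % primitive integer column vector

% Equality test: look the string up once, then compare integers.
[found, redCode] = ismember('red', levels);
isRed = (codes == redCode);              % logical mask over all rows

% Decode back to strings when needed.
decoded = levels(codes);
```

After encoding, comparisons such as isRed touch only the integer vector, not the cell array of strings.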

If you need to stick with cellstrs, store them as cellstr column vectors inside the struct, as described above; that way you only pay the price of cell access when you are actually operating on a string column.

+3

I would say that if you need to manage large amounts of data, then MATLAB is not the best choice to start with. Go for a proper database and then import only the data you need into MATLAB.

However, if you plan to use MATLAB anyway, I would still choose cell arrays, provided you do not need syntactic references to your data in the form of field names, as in structs.

When using cell arrays, keep in mind that each cell imposes 112 bytes of overhead. Therefore, I would create one cell for each column (and not one cell for each scalar double):

    c = cell(1,10);
    c(1,1:2:10) = num2cell(rand(1e5,5),1);
    c(1,2:2:10) = {cellstr(repmat('asdf', 100000, 1))};

and memory-wise (no change in time):

      Name      Size         Bytes  Class    Attributes

      c         1x10      38000600  cell

Also, what you call a struct of arrays is usually referred to as a scalar struct, as opposed to a struct array (or non-scalar struct).

If I remember correctly, structs tend to degrade in read/write performance when you start nesting fields (I need to find the specific thread, though).

+1

Source: https://habr.com/ru/post/943744/

