I need to process a large amount of tabular data of mixed type: strings and doubles. A standard problem, I would think. What is the best data structure in Matlab for working with this?
A cell array is definitely not the answer: it is extremely memory inefficient (tests shown below). A dataset (from the Statistics Toolbox) is horribly inefficient in both time and space. That leaves me with a struct array or a struct of arrays. I tested all four options for both time and memory below, and it seems to me that the struct of arrays is the best option for the things I tested.
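To make the terminology concrete, here is a toy sketch of the two struct layouts as I understand them. The field names (`price`, `name`) and the values are made up for illustration and are not from my real data.

```matlab
% Toy data: 3 rows, one numeric column and one string column.
% Field names 'price' and 'name' are hypothetical.

% Struct array: one struct per row (1x3 struct with two fields).
sa = struct('price', {1.5, 2.5, 3.5}, 'name', {'a', 'b', 'c'});
p2        = sa(2).price;     % read one element
allPrices = [sa.price];      % gathering a column builds a new 1x3 double

% Struct of arrays: one field per column (1x1 struct).
soa.price = [1.5; 2.5; 3.5];     % numeric column stored as a plain double vector
soa.name  = {'a'; 'b'; 'c'};     % string column stored as a cell array of strings
q2        = soa.price(2);        % read one element
n2        = soa.name{2};         % read one string
meanPrice = mean(soa.price);     % vectorized operations apply directly to the column

% Compare the memory footprint of the two layouts.
whos sa soa
```

In the struct of arrays, each numeric column is a single contiguous double array, which is presumably why it comes out so much smaller in the `whos` output of my second test.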
I'm relatively new to Matlab, and this is a little disappointing, to be honest. In any case, I'm looking for advice on whether I'm missing something, and on whether my tests are accurate/reasonable. Am I overlooking other considerations besides access, conversion, and memory usage that are likely to come up as I write more code with this stuff? (FYI, I am using R2010b.)
** Test # 1: Access speed. Accessing a single item.

    cellarray:0.002s
    dataset:36.665s        %<<< This is horrible
    structarray:0.001s
    struct of array:0.000s
** Test # 2: Conversion speed and memory usage. I dropped dataset from this test.

    Cellarray(doubles)->matrix:d->m: 0.865s
    Cellarray(mixed)->structarray:c->sc: 0.268s
    Cellarray(doubles)->structarray:d->sd: 0.430s
    Cellarray(mixed)->struct of arrays:c->sac: 0.361s
    Cellarray(doubles)->struct of arrays:d->sad: 0.887s

      Name           Size         Bytes  Class     Attributes

      c         100000x10      68000000  cell
      d         100000x10      68000000  cell
      m         100000x10       8000000  double
      sac            1x1       38001240  struct
      sad            1x1        8001240  struct
      sc        100000x1       68000640  struct
      sd        100000x1       68000640  struct
=================== CODE: TEST # 1

    %% cellarray
    c = cell(100000,10);
    c(:,[1,3,5,7,9]) = num2cell(zeros(100000,5));
    c(:,[2,4,6,8,10]) = repmat( {'asdf'}, 100000, 5);
    cols = strcat('Var', strtrim(cellstr(num2str((1:10)'))))';
    te = tic;
    for iii=1:1000
        x = c(1234,5);
    end
    te = toc(te);
    fprintf('cellarray:%0.3fs\n', te);

    %% dataset
    ds = dataset( { c, cols{:} } );
    te = tic;
    for iii=1:1000
        x = ds(1234,5);
    end
    te = toc(te);
    fprintf('dataset:%0.3fs\n', te);

    %% structarray
    s = cell2struct( c, cols, 2 );
    te = tic;
    for iii=1:1000
        x = s(1234).Var5;
    end
    te = toc(te);
    fprintf('structarray:%0.3fs\n', te);

    %% struct of arrays
    for iii=1:numel(cols)
        if iii/2==floor(iii/2) % even => string
            sac.(cols{iii}) = c(:,iii);
        else
            sac.(cols{iii}) = cell2mat(c(:,iii));
        end
    end
    te = tic;
    for iii=1:1000
        x = sac.Var5(1234);
    end
    te = toc(te);
    fprintf('struct of array:%0.3fs\n', te);
=================== CODE: TEST # 2

    %% cellarray
    % c - cellarray containing mixed type
    c = cell(100000,10);
    c(:,[1,3,5,7,9]) = num2cell(zeros(100000,5));
    c(:,[2,4,6,8,10]) = repmat( {'asdf'}, 100000, 5);
    cols = strcat('Var', strtrim(cellstr(num2str((1:10)'))))';
    % d - cellarray containing doubles only
    d = num2cell( zeros( 100000, 10 ) );

    %% matrix
    % doubles only
    te = tic;
    m = cell2mat(d);
    te = toc(te);
    fprintf('Cellarray(doubles)->matrix:d->m: %0.3fs\n', te);

    %% structarray
    % mixed
    te = tic;
    sc = cell2struct( c, cols, 2 );
    te = toc(te);
    fprintf('Cellarray(mixed)->structarray:c->sc: %0.3fs\n', te);
    % doubles
    te = tic;
    sd = cell2struct( d, cols, 2 );
    te = toc(te);
    fprintf('Cellarray(doubles)->structarray:d->sd: %0.3fs\n', te);

    %% struct of arrays
    % mixed
    te = tic;
    for iii=1:numel(cols)
        if iii/2==floor(iii/2) % even => string
            sac.(cols{iii}) = c(:,iii);
        else
            sac.(cols{iii}) = cell2mat(c(:,iii));
        end
    end
    te = toc(te);
    fprintf('Cellarray(mixed)->struct of arrays:c->sac: %0.3fs\n', te);
    % doubles
    te = tic;
    for iii=1:numel(cols)
        sad.(cols{iii}) = cell2mat(d(:,iii));
    end
    te = toc(te);
    fprintf('Cellarray(doubles)->struct of arrays:d->sad: %0.3fs\n', te);

    %%
    clear iii cols te;
    whos