Split a large SAS dataset into smaller datasets

I need help splitting a large SAS dataset into smaller datasets.

Each month I will have a data set containing several million records. This number will vary from month to month. I need to split this data set into several smaller data sets containing 250,000 records each. For example, if I have 1,050,000 records in the original dataset, then I need the final result to be 4 datasets containing 250,000 records and 1 dataset containing 50,000 records.

From what I looked at it, this will require the use of macros. Unfortunately, I'm pretty new to SAS (unfamiliar with using macros), and I don't have much time for this. Any help would be greatly appreciated.

+5
source share
6 answers

Based on Joe's answer, maybe you could try something like this:

%MACRO SPLIT(DATASET);

%LET DATASET_ID = %SYSFUNC(OPEN(&DATASET.));
%LET NOBS = %SYSFUNC(ATTRN(&DATASET__ID., NLOBS));
%LET NB_DATASETS = %SYSEVALF(&NOBS. / 250000, CEIL);

DATA 
  %DO I=1 %TO &NB_DATASETS.;
    WANT&I. 
  %END;;

  SET WANT;

  %DO I=1 %TO &NB_DATASETS.;

    %IF &I. > 1 %THEN %DO; ELSE %END; IF _N_ LE 2.5E5 * &I. THEN OUTPUT WANT&I.;

  %END;
RUN;
%MEND SPLIT;    
+3
source

You can do this without macros at all, if you don't mind asking for a dataset that might not exist and have a reasonable binding to things.

data want1 want2 want3 want4 want5 want6 want7 want8 want9;
if _n_ le 2.5e5 then output want1;
else if _n_ le 5e5 then output want2;
else if _n_ le 7.5e5 then output want3;
... etc....
run;

Macros will make it more efficient for programming and cleaning to read, but would not change how it actually works in reality.

+1
source

, CALL EXECUTE(). SAS- " " .

data _null_;
    if 0 then set have nobs=n;
    do i=1 to ceil(n/250000);
        call execute (cats("data want",i)||";");
        call execute ("set have(firstobs="||(i-1)*250000+1||" obs="||i*250000||");");
        call execute ("run;");
    end;
run;
+1

Google - SAS User Group International (SUGI) - .

: http://www2.sas.com/proceedings/sugi27/p083-27.pdf

:

%macro split(ndsn=2); 
data %do i = 1 %to &ndsn.; dsn&i. %end; ; 
 retain x; 
 set orig nobs=nobs; 
 if _n_ eq 1 
 then do; 
 if mod(nobs,&ndsn.) eq 0 
 then x=int(nobs/&ndsn.); 
 else x=int(nobs/&ndsn.)+1; 
 end; 
 if _n_ le x then output dsn1; 
 %do i = 2 %to &ndsn.; 
 else if _n_ le (&i.*x) 
 then output dsn&i.; 
 %end; 
 run; 
%mend split; 

%split(ndsn=10);

, , 10 "% split (ndsn = 10)"; . 4 "set orig nobs = nobs", orig .

, !

0

, , -. , , :

data in_data;
  do recid = 1 to 1.000001e7;
    datavar = 1;
    output;
  end;
run;


data _null_;
  if 0 then set in_data;
  declare hash h_out();
  h_out.defineKey('_n_');
  h_out.defineData('recid','datavar');
  h_out.defineDone();

  do filenum = 1 by 1 until (eof);
    do _n_ = 1 to 250000 until (eof);
      set in_data end=eof;
      h_out.add();
    end;
    h_out.output(dataset:cats('file_',filenum));
    h_out.clear();
  end;
  stop;
run;

- 250 . . - , , " 250 . ", , - , 250 . .

, , :

data _null_;
  if 0 then set in_data;
  declare hash h_out(dataset:'in_data(obs=0)');
  h_out.defineKey('recid');
  h_out.defineData(all:'y');
  h_out.defineDone();

  do filenum = 1 by 1 until (eof);
    do _n_ = 1 to 250000 until (eof);
      set in_data end=eof;
      h_out.add();
    end;
    h_out.output(dataset:cats('file_',filenum));
    h_out.clear();
  end;
  stop;
run;

Since we can no longer use the _n_hash identifier due to the use of the option datasetin the constructor (necessary for the function all:'y'), we must have a record identifier. Hope there is such a variable, or can be added with a view.

0
source

Here is the basic approach. This requires manual spacing but is easy to understand.

* split data;
data output1;
set df;
if 1 <= _N_ < 5 then output;
run;


data output2;
set df;
if 5 <= _N_ < 10 then output;
run;


data output3;
set df;
if 10 <= _N_ < 15 then output;
run;


data output4;
set df;
if 15 <= _N_ < 22 then output;
run;
0
source

Source: https://habr.com/ru/post/1524349/


All Articles