SIgnificatif
Quartz | Level 8

Hi all

 

I have a table that I need to transpose; the format of the table is like this:

 

ID Varname value
5642121 age 50
5642121 city New york
5642121 firstname john
5642121 lastname smith
5642125 age 40
5642125 city New jersey
5642125 firstname Paul
5642125 lastname lennon
5642125 children yes
5642125 car ferrari
............
......

 

and so on

 

Each ID has a different number of variables; one subject could have 10, another one 30, etc.

 

  • I use sorting and transposing successfully.

But after creating the table, I found that disk space was running low: the initial table was 2 GB, while the 'transposed' table was 20-25 GB (because there are many 'empty variable values').

 

  • I format the variables after the transposition to preserve the length.

I cut the big table into small ones for better analysis.

 

Are there other ways (queries / SQL in this case?), or is transposition the only option in this situation?
The only purpose is to save disk space (even if I can just delete the table after creation and processing).

 

Interestingly, when I export the data to Excel, it takes only 5 MB.

 

Thank you for your answers.

 

5 REPLIES
andreas_lds
Jade | Level 19

@SIgnificatif wrote:

[...]

Please post a small excerpt of the dataset before and after transposing, using a datalines statement and with the types and lengths you have in your datasets.

 

Which analyses are performed that require transposing the data at all? If many null values exist after transposing, the transposition should be questioned.

Kurt_Bremser
Super User

@andreas_lds the source dataset is a name-value list, and the transposition probably looks like this:

data have;
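/* the & modifier lets VALUE keep single embedded blanks, e.g. "New york" */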
input ID $ Varname :$32. value & :$32.;
datalines;
5642121 age 50
5642121 city New york
5642121 firstname john
5642121 lastname smith
5642125 age 40
5642125 city New jersey
5642125 firstname Paul
5642125 lastname lennon
5642125 children yes
5642125 car ferrari
;
run;

proc transpose data=have out=want (drop=_name_ compress=yes);
by id;
var value;
id varname;
run;

If the number of values in varname is limited, it would be better to write a data step where variable types can be set:

data want (compress=yes);
set have;
by id;
format firstname lastname city $32. age 3. car $32. children 1.;
retain firstname lastname city age car children;
/* start a fresh set of values for each ID */
if first.id
then call missing (firstname,lastname,city,age,car,children);
select (varname);
  when ('firstname') firstname = value;
  when ('lastname') lastname = value;
  when ('city') city = value;
  when ('age') age = input(value,32.);
  when ('car') car = value;
  when ('children') children = ifn(value='yes',1,0);
  otherwise; /* ignore any varname not listed above */
end;
/* output one observation per ID, without the original name-value columns */
if last.id;
drop varname value;
run;

The compress option should take care of space wasted by empty strings. The "manual" method also lets one define correct lengths for short string values (e.g. if value is defined as $254 but contains only Y or N for a certain varname, you'd waste 253 bytes with the automatic transpose).

Astounding
PROC Star
Try adding this statement at the beginning of your program:

options compress=yes;
Tom
Super User

Try using the COMPRESS option when creating the dataset. Without that option, every observation takes the full amount of space defined for each variable. With compression you can frequently reduce the size by a lot.
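A minimal sketch of both places the option can go (HAVE and WANT are just placeholder names here):

/* as a system option, affects every dataset created from here on */
options compress=yes;

/* or as a dataset option on a single output table */
data want (compress=yes);
   set have;
run;

/* PROC CONTENTS reports whether the dataset was actually compressed */
proc contents data=want;
run;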

 

An XLSX file is essentially a ZIP file and so has compression already. Plus other features of how it stores the data can also reduce the size as empty cells are not stored and repeated values are stored just once and referenced instead.

 

Note also that changing the FORMAT attached to a variable does nothing to change the LENGTH used to store the variable. Make sure that your process for defining the new variables defines them with the appropriate length. Also, if you are attaching $ formats to character variables, make sure the widths match the lengths of the variables; otherwise it can cause a lot of confusion and possibly wrong analyses.
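A small sketch of that difference (the variable and dataset names are only examples): the LENGTH statement controls how many bytes are stored, while the FORMAT statement only controls how the value is displayed.

data demo;
   length city $20;   /* 20 bytes are stored for CITY          */
   format city $10.;  /* but only 10 characters are displayed  */
   city = 'New york';
run;

proc contents data=demo;  /* Len shows 20, Format shows $10.   */
run;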

RichardDeVen
Barite | Level 11

The original table design is called an "entity attribute value" (EAV) table. It is very suitable for cases where additional attributes may be needed, but a little more difficult when you want the transpose. Presuming a TRANSPOSE that is

    BY ID;
    ID VARNAME;
    VAR VALUE;

the caveats are that TRANSPOSE does not minimize the width of the columns created for each attribute, nor does it evaluate values for conversion to useful types and formats. So, in the worst case, VALUE is something like $4096 and you have hundreds of attributes over all the IDs... the result will be hundreds of columns that are $4096.
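If you want to check how wide the transposed columns actually ended up, one option (assuming the output table is WORK.WANT) is to query the column metadata:

proc sql;
   select name, type, length
      from dictionary.columns
      where libname = 'WORK' and memname = 'WANT';
quit;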

 

Kurt's reply is the best for the tightest pivot transformation of your EAV data.
