Re: Reduce a dataset

Anita_n · Posted 07-02-2019 05:19 AM

Hello all,

I don't really know if this question makes sense but I will just ask it.

Is there anyway to reduce a sas dataset of about 9 million observations

to lets say 20 000 and still maintain major variables. There are some duplicates

but I will exclude that. I am only familiar with proc summary/proc freq. I don't know

if there is something better

PeterClemmensen · Posted 07-02-2019 05:21 AM

I don't understand this question. Do you simply want to remove duplicates? Or do you want to remove some specific variabels?

The DATA to DATA Step Macro
Blog: SASnrd

Anita_n · Posted 07-02-2019 05:50 AM

The duplicates have been removed but the dataset is still too large. I want to reduce this

PeterClemmensen · Posted 07-02-2019 06:06 AM

Too large to do what?

There are several ways to reduce the size of your data set. You can:

- Drop variables you do not need

- Compress your data set.

- Delete observations you do not need

.....

The DATA to DATA Step Macro
Blog: SASnrd

Anita_n · Posted 07-02-2019 06:26 AM

I need to check the plausibility of data in a databank according to some given standards using a specific program.

To that, I have to do some preformating/and reduce the dataset in sas before exporting the data to the other program.

The extern program cannot handle large datasets as sas does. That is why I need to reduce the data.

I have two ids in the original data set, one is the patient id and the other is the ailment id (the ailment id is reponsible for the duplicates)

that means a patients can have two or more different kinds of ailment.

I used proc freq/proc summary to remove this, so I have the freq in the new dataset.

My problem is that I also need to get rid of the patientid because the data is still very huge, but I still need to get some variables from original data. and use may left join to merge then to new data again. But if I do that then I don't have any id to use as identifier.

Is there any solution to that?

KachiM · Posted 07-02-2019 06:49 AM

It depends what you are going to do with "extern" program.

For instance, the first record for a given patient id may be sufficient to do further work as it may have all variables of interest except "ailment". Data Step programming knowledge may be required in addition to PROCs.

Better think of a small SAS data set. Build from there what you want to take out of it.

It is vague now and hence practical suggestion can't be given

Anita_n · Posted 07-02-2019 06:56 AM

ok thanks , I will try using a datastep

Patrick · Posted 07-02-2019 07:59 AM

@Anita_n

It sounds like you need to take a sample of the data for your further analysis. That's something you can use Proc Surveyselect for or if SAS/Stat is not licensed then there are also SAS Datastep approaches for the same.

http://support.sas.com/kb/24/722.html

It depends of course if a sample is suitable for your downstream data assessment as it won't contain all the records with outliers.

Anita_n · Posted 07-02-2019 08:51 AM

Hi Patrick, thanks for the reply, I used data step to extract a certain number of obs. I will first work with that and see how I will progress later

Kurt_Bremser · Posted 07-02-2019 08:58 AM

Extract every Nth observation, and determine N in a separate step:

data _null_;
call symputx('factor',int(9000000/20000));
run;

data want;
set have;
if not mod(_n_,&factor); /* not true = 0 */
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Anita_n · Posted 07-02-2019 09:11 AM

@Kurt_Bremser

this really reduces the dataset to about 1900 obs. I only wonder what happend here

Kurt_Bremser · Posted 07-02-2019 01:59 PM

What does the log say?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 07-03-2019 01:35 AM

Just to give you an easy proof:

data have;
do i = 1 to 9000000;
  output;
end;
run;

data _null_;
call symputx('factor',int(9000000/20000));
run;

data want;
set have;
if not mod(_n_,&factor); /* not true = 0 */
run;

Log of the final data step:

34         data want;
35         set have;
36         if not mod(_n_,&factor); /* not true = 0 */
37         run;

NOTE: There were 9000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 20000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           1.27 seconds
      cpu time            0.72 seconds

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away