BookmarkSubscribeRSS Feed
Anita_n
Pyrite | Level 9

Hello all,

I don't really know if this question makes sense but I will just ask it.

Is there anyway to reduce a sas dataset of about 9 million observations

to lets say 20 000 and still maintain major variables. There are some duplicates

but I will exclude that. I am only familiar with proc summary/proc freq. I don't know 

if there is something better

12 REPLIES 12
PeterClemmensen
Tourmaline | Level 20

I don't understand this question. Do you simply want to remove duplicates? Or do you want to remove some specific variabels?

Anita_n
Pyrite | Level 9

The duplicates have been removed but the dataset is still too large. I want to reduce this

PeterClemmensen
Tourmaline | Level 20

Too large to do what?

 

There are several ways to reduce the size of your data set. You can:

 

- Drop variables you do not need

- Compress your data set.

- Delete observations you do not need

 

.....

Anita_n
Pyrite | Level 9

I need to check the plausibility of data in a databank according to some given standards using a specific program.

To that, I have to do some preformating/and reduce the dataset in sas before exporting the data to the other program.

The extern program cannot handle large datasets as sas does. That is why I need to reduce the data.

 

I have two ids in the original data set, one is the patient id and the other is the ailment id (the ailment id is reponsible for the duplicates)

that means a patients can have two or more different kinds of ailment.

 

I used proc freq/proc summary to remove this, so I have the freq in the new dataset.

 

My problem is that I also need to get rid of the patientid because the data is still very huge, but I still need to get some variables from original data. and use may left join to merge then to new data again. But if I do that then I don't have any id to use as identifier.

 

Is there any solution to that?

KachiM
Rhodochrosite | Level 12

It depends what you are going to do with "extern" program.

 

For instance, the first record for a given patient id may be sufficient to do further work as it may have all variables of interest except "ailment". Data Step programming knowledge may be required in addition to PROCs.

 

Better think of a small SAS data set. Build from there what you want to take out of it. 

 

It is vague now and hence practical suggestion can't be given

Anita_n
Pyrite | Level 9

ok thanks , I will try using a datastep

Patrick
Opal | Level 21

@Anita_n 

It sounds like you need to take a sample of the data for your further analysis. That's something you can use Proc Surveyselect for or if SAS/Stat is not licensed then there are also SAS Datastep approaches for the same.

http://support.sas.com/kb/24/722.html 

 

It depends of course if a sample is suitable for your downstream data assessment as it won't contain all the records with outliers.

Anita_n
Pyrite | Level 9

Hi Patrick, thanks for the reply, I used data step to extract a certain number of obs. I will first work with that and see how I will progress later

Kurt_Bremser
Super User

Extract every Nth observation, and determine N in a separate step:

data _null_;
call symputx('factor',int(9000000/20000));
run;

data want;
set have;
if not mod(_n_,&factor); /* not true = 0 */
run;
Anita_n
Pyrite | Level 9

@Kurt_Bremser 

this really reduces the dataset to about 1900 obs. I only wonder what happend here

Kurt_Bremser
Super User

Just to give you an easy proof:

data have;
do i = 1 to 9000000;
  output;
end;
run;

data _null_;
call symputx('factor',int(9000000/20000));
run;

data want;
set have;
if not mod(_n_,&factor); /* not true = 0 */
run;

Log of the final data step:

34         data want;
35         set have;
36         if not mod(_n_,&factor); /* not true = 0 */
37         run;

NOTE: There were 9000000 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 20000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           1.27 seconds
      cpu time            0.72 seconds

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1198 views
  • 0 likes
  • 5 in conversation