Hello all,
I don't really know if this question makes sense but I will just ask it.
Is there anyway to reduce a sas dataset of about 9 million observations
to lets say 20 000 and still maintain major variables. There are some duplicates
but I will exclude that. I am only familiar with proc summary/proc freq. I don't know
if there is something better
I don't understand this question. Do you simply want to remove duplicates? Or do you want to remove some specific variabels?
The duplicates have been removed but the dataset is still too large. I want to reduce this
Too large to do what?
There are several ways to reduce the size of your data set. You can:
- Drop variables you do not need
- Compress your data set.
- Delete observations you do not need
.....
I need to check the plausibility of data in a databank according to some given standards using a specific program.
To that, I have to do some preformating/and reduce the dataset in sas before exporting the data to the other program.
The extern program cannot handle large datasets as sas does. That is why I need to reduce the data.
I have two ids in the original data set, one is the patient id and the other is the ailment id (the ailment id is reponsible for the duplicates)
that means a patients can have two or more different kinds of ailment.
I used proc freq/proc summary to remove this, so I have the freq in the new dataset.
My problem is that I also need to get rid of the patientid because the data is still very huge, but I still need to get some variables from original data. and use may left join to merge then to new data again. But if I do that then I don't have any id to use as identifier.
Is there any solution to that?
It depends what you are going to do with "extern" program.
For instance, the first record for a given patient id may be sufficient to do further work as it may have all variables of interest except "ailment". Data Step programming knowledge may be required in addition to PROCs.
Better think of a small SAS data set. Build from there what you want to take out of it.
It is vague now and hence practical suggestion can't be given
ok thanks , I will try using a datastep
It sounds like you need to take a sample of the data for your further analysis. That's something you can use Proc Surveyselect for or if SAS/Stat is not licensed then there are also SAS Datastep approaches for the same.
http://support.sas.com/kb/24/722.html
It depends of course if a sample is suitable for your downstream data assessment as it won't contain all the records with outliers.
Hi Patrick, thanks for the reply, I used data step to extract a certain number of obs. I will first work with that and see how I will progress later
Extract every Nth observation, and determine N in a separate step:
data _null_;
call symputx('factor',int(9000000/20000));
run;
data want;
set have;
if not mod(_n_,&factor); /* not true = 0 */
run;
this really reduces the dataset to about 1900 obs. I only wonder what happend here
Just to give you an easy proof:
data have;
do i = 1 to 9000000;
output;
end;
run;
data _null_;
call symputx('factor',int(9000000/20000));
run;
data want;
set have;
if not mod(_n_,&factor); /* not true = 0 */
run;
Log of the final data step:
34 data want; 35 set have; 36 if not mod(_n_,&factor); /* not true = 0 */ 37 run; NOTE: There were 9000000 observations read from the data set WORK.HAVE. NOTE: The data set WORK.WANT has 20000 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 1.27 seconds cpu time 0.72 seconds
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.