Hi all,
I have a dataset from which I wish to drop (for subsequent analysis) those values of the variables with only one observation, that is with a Freq count of 1. In following table there is a similar example. The values in bold are those I do not want to consider in my analysis. I appreciate any suggestion. Thanks
Trait Var1 Var2 ...Var300
1 200/200 120/120 100/100
1 150/150 140/140 99/99
1 100/100 180/180 135/135
1 150/150 120/120 100/100
1 100/100 180/180 160/160
1 150/150 119/119 160/160
1 200/200 140/140 135/135
0 150/150 180/180
0 149/149 140/140 160/160
0 100/100 120/120 133/133
0 100/100 140/140 100/100
0 200/200 180/180 135/135
0 150/150 140/140 160/160
0 200/200 180/180 100/100
0 100/100 120/120 135/135
Hello,
It need a little bit of refinement (not creating the macro variable allvars manually), but may be a solution:
data have;
infile datalines missover;
input Trait $ Var1 :$7. Var2 :$7. Var300 :$7.;
datalines;
1 200/200 120/120 100/100
1 150/150 140/140 99/99
1 100/100 180/180 135/135
1 150/150 120/120 100/100
1 100/100 180/180 160/160
1 150/150 119/119 160/160
1 200/200 140/140 135/135
0 150/150 180/180
0 149/149 140/140 160/160
0 100/100 120/120 133/133
0 100/100 140/140 100/100
0 200/200 180/180 135/135
0 150/150 140/140 160/160
0 200/200 180/180 100/100
0 100/100 120/120 135/135
run;
%let allvars=Var1 Var2 Var300;
%macro h;
%let i=1;
%do %while (%scan(&allvars,&i) ne);
%let var=%scan(&allvars,&i);
proc sql;
create table int as
select &var, count(&var) as no
from have
group by &var
having no eq 1;
quit;
data have;
set have;
if _n_=1 then
do;
declare hash i(dataset:'int');
i.definekey("&var");
i.definedone();
end;
rc=i.check();
if rc=0 then call missing(&var);
run;
%let i=%eval(&i+1);
%end;
%mend h;
%h
Hello,
It need a little bit of refinement (not creating the macro variable allvars manually), but may be a solution:
data have;
infile datalines missover;
input Trait $ Var1 :$7. Var2 :$7. Var300 :$7.;
datalines;
1 200/200 120/120 100/100
1 150/150 140/140 99/99
1 100/100 180/180 135/135
1 150/150 120/120 100/100
1 100/100 180/180 160/160
1 150/150 119/119 160/160
1 200/200 140/140 135/135
0 150/150 180/180
0 149/149 140/140 160/160
0 100/100 120/120 133/133
0 100/100 140/140 100/100
0 200/200 180/180 135/135
0 150/150 140/140 160/160
0 200/200 180/180 100/100
0 100/100 120/120 135/135
run;
%let allvars=Var1 Var2 Var300;
%macro h;
%let i=1;
%do %while (%scan(&allvars,&i) ne);
%let var=%scan(&allvars,&i);
proc sql;
create table int as
select &var, count(&var) as no
from have
group by &var
having no eq 1;
quit;
data have;
set have;
if _n_=1 then
do;
declare hash i(dataset:'int');
i.definekey("&var");
i.definedone();
end;
rc=i.check();
if rc=0 then call missing(&var);
run;
%let i=%eval(&i+1);
%end;
%mend h;
%h
Hi Loko, thank you for your macro, it worked well, one more thing, if I want to remove those variable with only one class, e.g. suppose a variable (look below) with only two classes, this one is going to stay with an unique class (170/170) after running the macro, how I could drop those variables from my data using the same macro?
Var4
170/170
170/170
170/170
178/178
170/170
170/170
170/170
170/170
170/170
170/170
170/170
170/170
170/170
170/170
170/170
Look at proc sort with nounique option.
You'll have to run multiple iterations for each variable so maintaining a macro loop is probably still required.
Hi @Fersal,
First of all, you should be sure that your analysis will still be valid after excluding values just because they happen to occur only once within a trait.
If your dataset contains only 7-8 observations per trait (as your sample data), isn't it quite likely that some valid values (in some of the 300 variables) will occur only once? On the other hand, if your real dataset has thousands of observations, some "improbable" values might occur 2, 3 or 4 times, or only once, just by chance.
What I'm trying to say is that the rule "exclude values with frequency count 1" is not necessarily sensible.
Maybe you can specify exclusion criteria based on the values themselves and apply these by means of (possibly user-defined) functions or formats.
Also, it could simplify the code if you transposed the data from wide to long format (esp. if the 300 variables have identical types and lengths).
Hi FreelanceReinhard,
Thanks for your comments, actually the original dataset is pretty big, and the number of observation is much higher than number of variables (not like the example I showed). I have had a convergence problem with this data when I try to fit them to a logit model (even after apply the Firth option), so there is not likelihood estimates for these variables, for that reason I want to know what happen if I remove those classes.
Regards
I thought it was 0 cells that caused the issue, not 1's?
If you still want to do it:
data have;
infile datalines truncover;
input Trait (Var1 - Var3) ($);
datalines;
1 200/200 120/120 100/100
1 150/150 140/140 99/99
1 100/100 180/180 135/135
1 150/150 120/120 100/100
1 100/100 180/180 160/160
1 150/150 119/119 160/160
1 200/200 140/140 135/135
0 150/150 180/180
0 149/149 140/140 160/160
0 100/100 120/120 133/133
0 100/100 140/140 100/100
0 200/200 180/180 135/135
0 150/150 140/140 160/160
0 200/200 180/180 100/100
0 100/100 120/120 135/135
;
data list;
set have;
obs = _n_;
array a{*} var:;
do i = 1 to dim(a);
if not missing(a{i}) then do;
value = a{i};
output;
end;
end;
keep Trait obs value;
run;
proc sql;
create table newList as
select Trait, obs, value
from list
group by Trait, value
having count(*) > 1
order by Trait, obs, value;
quit;
proc transpose data=newList out=want(drop=_name_ obs) prefix=var;
var value;
by Trait obs;
run;
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.