BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
righcoastmike
Quartz | Level 8

Hi Everyone, 

 

I have a dataset that looks roughly like this (except there are about 30 variables and 5.5 million records, with each study ID having between 1 and 60 records each.) The binary variable is the only one pertinent to this question so it's the only one I've included here. Bi_var will never vary between records for an individual studyID - so if person #2 is assigned a 0, they will get a zero for every record in the dataset : 

 

data have;
input studyID Bi_var ;
cards;
1 1
1 1
1 1
2 0
2 0
2 0
2 0
2 0
3 1
3 1
4 0 
5 0
5 0
5 0
5 0
run;

And I have it split into 2 separate datasets using this code: 

 

data want1;
set have;
where Bi_var = 1;
run: 

data want2;
set have;
where Bi_var = 0;
run: 

 

So far so good. Now, I need to know how many unique study ID's are in each dataset. To do this I have tried both of the following methods: 

 

ods select nlevels;

proc freq data=have nlevels;
   tables studyid;
run; 

ods select nlevels;

proc freq data=want1 nlevels;
   tables studyid;
run; 

ods select nlevels;

proc freq data=want2 nlevels;
   tables studyid;
run; 

and 

 

proc sql;
   create table new as 
     select count(distinct(studyid)) as IDcount
            from have;
quit;

proc sql;
   create table new as 
     select count(distinct(studyid)) as IDcount
            from want1;
quit;

proc sql;
   create table new as 
     select count(distinct(studyid)) as IDcount
            from want2;
quit;

Here's the issue, for some reason in both methods,  'want2' (which should be just the records where bi_var=0) has the same number of distinct values as the 'have' dataset. This should be impossible because want1 and want2 are just subsets of the main 'have' dataset. The code seems to have worked fine for the want1 dataset, but I don't know if I can trust it because of this other weirdness.

 

Other things I have checked: 

1. 'Want2' has a different (smaller) number of records than 'have', which makes sense. 

2. just to check if the study id's were the same, i summed up the studyid columns in 'want2' and 'have' - if they were the same then their total should have been identical. it is not. 

 

If anyone can give me an idea of what could be happening that results in 'want2' having the same number of unique values as 'have' it would be much appreciated. In essence I have multiple records per person in my dataset, and I'm just trying to figure out how many unique people (studyid) I have.

 

Any thoughts would be really helpful, I'm at a total loss. 

 

Thanks so much.

 

Mike 

 

1 ACCEPTED SOLUTION

Accepted Solutions
righcoastmike
Quartz | Level 8

Hi @VDD , @novinosrin  and @Tom 

 

First off, thanks so much for your help. I went through and did the tests that you recommended and found out that the problem was actually in the original dataset.

 

I had been told that although there were multiple records for each studyid, the binary variable would stay the same for every record of a given studyid. For instance, if studyid #1 had ten records, then all ten of those records would be a 1, or all ten would be a zero.

 

 

It turns out that wasn't the case. When I went back and looked at the data more closely, It turns out that the same study ID would have either all zero's, or one "1" and then the rest zeros.  This is the reason why my nlevels for the Bi-Var=0 dataset was the same as the original "have" dataset, every studyid had at least one 0 in it. 

 

Once I figured that out (and wrote a strongly worded email to the person who provided me with the data) it was easy to sort out. 

 

I want to give "solve" credit to all of you for pointing me in the right direction but unfortunately I don't think the forum will let me do that. 

 

As always, I am humbled at the amazing support I get on here. Thank you all so much. 

 

Mike 

 

 

View solution in original post

4 REPLIES 4
VDD
Ammonite | Level 13 VDD
Ammonite | Level 13
data want1;
set sashelp.class;
where sex = 'F';
run;

data want2;
set sashelp.class;
where sex = 'M';
run;
proc sql;
	create table want1_count as
	select count(distinct(age) ) as age_count
	from want1;
quit;
proc sql;
	create table want2_count as
	select count(distinct(age) ) as age_count
	from want2;
quit;

 

or 

data want1;
set sashelp.class;
where sex = 'F';
run;

data want2;
set sashelp.class;
where sex = 'M';
run;
proc sql;
	create table have_count as
	select count(age) as total_age_records
	from sashelp.class;
quit;
proc sql;
	create table want1_count as
	select count(distinct(age) ) as age_count,
	count(age) as total_age_records
	from want1;
quit;
proc sql;
	create table want2_count as
	select count(distinct(age) ) as age_count,
		count(age) as total_age_records
	from want2;
quit;



or if we want to check our work

data want1;
set sashelp.class;
where sex = 'F';
run;

data want2;
set sashelp.class;
where sex = 'M';
run;
proc sql;
	create table have_count as
	select count(age) as total_age_records
	from sashelp.class
	order by total_age_records;
quit;
proc sql;
	create table want1_count as
	select count(distinct(age) ) as age_count,
	count(age) as total_age_records
	from want1
	order by total_age_records;
quit;
proc sql;
	create table want2_count as
	select count(distinct(age) ) as age_count,
		count(age) as total_age_records
	from want2
	order by total_age_records;
quit;
data counts(drop=age_count);
	merge have_count(rename=(total_age_records=have_records))
		want1_count(rename=(total_age_records=female_records))
		want2_count(rename=(total_age_records=male_records));
	misscount = have_records - (female_records+male_records);
	
run;



novinosrin
Tourmaline | Level 20

Hi @righcoastmike 

 

Take a step back and do this test

 

proc sql;
create table test as
select *
from have
group by studyid
having count(distinct bi_var)>1;
quit;

You should get zero records. If not , we got take one step back further to test the have.

 

One step at a time.

Tom
Super User Tom
Super User

Having the same counts doesn't seem that strange. Perhaps every STUDYID has both types of records.  Or perhaps the data was designed to have the same number of STUDYID values in each group.

 

Why not just count the original group and see?

proc sql;
  create table full_count as 
    select Bi_var,count(distinct studyid) as IDcount
    from want2
    group by Bi_var
  ;
quit;

Note: The DISTINCT keyword is not a function, so no need for ( ) after it.

righcoastmike
Quartz | Level 8

Hi @VDD , @novinosrin  and @Tom 

 

First off, thanks so much for your help. I went through and did the tests that you recommended and found out that the problem was actually in the original dataset.

 

I had been told that although there were multiple records for each studyid, the binary variable would stay the same for every record of a given studyid. For instance, if studyid #1 had ten records, then all ten of those records would be a 1, or all ten would be a zero.

 

 

It turns out that wasn't the case. When I went back and looked at the data more closely, It turns out that the same study ID would have either all zero's, or one "1" and then the rest zeros.  This is the reason why my nlevels for the Bi-Var=0 dataset was the same as the original "have" dataset, every studyid had at least one 0 in it. 

 

Once I figured that out (and wrote a strongly worded email to the person who provided me with the data) it was easy to sort out. 

 

I want to give "solve" credit to all of you for pointing me in the right direction but unfortunately I don't think the forum will let me do that. 

 

As always, I am humbled at the amazing support I get on here. Thank you all so much. 

 

Mike 

 

 

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1346 views
  • 3 likes
  • 4 in conversation