BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Mscarboncopy
Pyrite | Level 9

I need to be able to create a smaller data file that would only have the variables when they were answered  yes (1) and then be able to also select only the date variables that would be linked to them. 

 

I have a file with 200 observations and 300 variables and I only need the following variables:

X (25 variables X) 

Y (25 variables Y)

these are the ones I should only have in my data set - but only  if they were answered YES - in this data file no = 0.

 

and Date variables (25 variables) that would match the yes, let's call them dtevar

 

For example X1 and Y1 both have a corresponding (and the same) date variable  dtevar1 , X2 and Y2 both have dtevar2 etc.

but if I don't select only the variables that are yes (1) I end up with a lot of dates I do not need. 

 

I thought of doing an array but nothing I am trying to do works. I would really appreciate if anyone could help me figure out how to work with this.

 

Thank you in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

First, let's clear up the obvious piece.

 

The array statements are using  double-dash to indicate the last of variable names:

 

array outage {25} outVISIT1_AGE -- outVISIT25_AGE ;
array outcode {25} outantibi1--outantibi25;

 

The double-dash does not respect the numeric suffix at the end of the variable names.  You could just as easily code a list that reads cat--dog.  That would mean all variable names beginning with CAT and ending with DOG, in the order they were defined when creating the original data.  

 

One of the fixes is easy:

 

array outcode {25} outantibi1-outantibi25;

 

The single dash tells SAS that the list of variable names has a numeric suffix, with the same prefix.  The other ARRAY statement is more problematic.  There isn't a numeric suffix.  To fix it, you will need to spell out the full list of 25 names:

 

array outage {25} outVISIT1_AGE outVISIT2_AGE ... outVISIT24_AGE outVISIT25_AGE ;

There just are no shortcuts available when the variable names are in this form.

 

As always when debugging, we'll have to see if clearing this up resolves all the issues, or reveals additional issues.

View solution in original post

17 REPLIES 17
Astounding
PROC Star

I'm not sure this plan will work.  What should happen to a variable that is "yes" one one observation, but "no" on another observation?  

 

Whatever decision you make about a variable ... keep it or drop it ... has to apply to all the observations.

ballardw
Super User

@Mscarboncopy wrote:

I need to be able to create a smaller data file that would only have the variables when they were answered  yes (1) and then be able to also select only the date variables that would be linked to them. 

 

I have a file with 200 observations and 300 variables and I only need the following variables:

X (25 variables X) 

Y (25 variables Y)

these are the ones I should only have in my data set - but only  if they were answered YES - in this data file no = 0.

 

and Date variables (25 variables) that would match the yes, let's call them dtevar

 

For example X1 and Y1 both have a corresponding (and the same) date variable  dtevar1 , X2 and Y2 both have dtevar2 etc.

but if I don't select only the variables that are yes (1) I end up with a lot of dates I do not need. 

 

I thought of doing an array but nothing I am trying to do works. I would really appreciate if anyone could help me figure out how to work with this.

 

Thank you in advance.


Your description is tad confusing and it isn't clear if you want 1) all of the x variables to be "yes" 2) at least one of the x variables to be 'yes'.

 

If you can provide an example of your data, perhaps with 5 instead of 25 "variables x" and "variables y", including the actual names of variables with maybe 10 or fifteen records and what the result should be you my get a better answer. Your example data should include some where the observation is not included in the output.

 

The DROP or KEEP statements can reduce the variables but every record will have all of the variables kept.

Data new;

    set old;

    keep <list names of variables to keep>;

run;

for example.

Mscarboncopy
Pyrite | Level 9

Thank you. Yes that is my issue... I use the keep statement often, to create subsets, but this time I need to be able to select only the variables that have at least 1 yes for 1 ID. This is because of the 25 variables I mentioned we don't get a lot in the beginning so say from X1 to X10 I might not get any Yes.

 

Example pretend my data file has only 6 obs (A-F) and 5 Vars (X1-X5)

OBS                        X1  X2 X3 X4 X5

A                             1    0    0    0    0

B                             0    1    0    1    1
C                             0    0    0    0    0
D                             1   0     0    1    1
E                              1   1     0    1     1
F                              0   0     0     0    0

 

so the code would somehow get rid of X3 because X3 did not get any yes, It would keep X1 X2 X4 and X5 and then also get rid of dtevar3.

Or is there a way to create a table maybe ?

 

FREQ DATA = datasetname;

TABLES X1/dtevar1;

WHERE X1 =1;

 

Would that work ? I have not worked with Where statements much.

If this works, how could I create a statement that would allow me to run all of the X variables at once and not have a FREQ statement for each variable ?

 

 

 

 

 

 

Astounding
PROC Star

Let's see if we can agree on a simpler approach.  Don't count, instead get the mean of each X variable:

 

proc summary data=have;

var x1-x25;

output out=means mean=;

run;

 

That gives you a data set named MEANS, where X1 is the mean value of the original X1 variable, etc.  From that point, we're looking to remove variables (and the matching dtevar) when the mean is 0.

 

Does that sound right?

Mscarboncopy
Pyrite | Level 9

Hi Astounding,

 

Yes this works ! Thank you. 

Now what if I needed to select variables that were coded up to a certain age only ? I have a  data file that has variables coded for ages 0 - 15 years.

I have DOB and I want the variables only before the subject turned 1 year old. How can I create a subset that only has the variables coded from 0-1 years ? Can I use the DOB for this in a data command ? I have the dates that match the variables coded as in the example above and I would only need the variables up to 1 year. However each subject has different DOB so I don't know which dates I would need to keep to make sure the data for 0-1 is used for all. 

 

Similarly, how would I select only the most recent (independent of age)  coded if the final dates are all different for each observation ? 

 

Using my variable X from the prior example: say X was a medical diagnosis such as "cold" so I now can select the times when "cold" was coded (X=1) using your suggestion in the entire data set. But now I need to know from 0-1 year only and then also know when it was the last time (in my data set) the subject had a cold.

 

The last dates are all different just as DOB is different.

Thank you again.

M

Astounding
PROC Star

Again, this new result is problematic because you might not be able to eliminate variables.  If a single patient has all 25 values filled in before age 1, you need to keep all the variables for every patient.  SAS does not let you pick and choose which variables to keep on each observation.

 

You might have been better advised to keep a narrow data set, with many observations per patient:

 

patient, dob, age, diagnosis, date

 

That would make your life a lot easier now that you are on to the analysis phase.  Because of the wide structure to your data, you will need to learn about arrays:

 

data want;

set have;

array diags {25} x1-x25;

array dates {25} datevar1-datevar25;

do k=1 to 25;

   if /* x{k} indicates a cold diagnosis */

   and /* dates{k} indicates < 1 year old */

   then earliest = max(earliest, dates{k});

end;

run;

 

That's the idea of it, with the commented out conditions needing to be replaced with the actual logic.

Mscarboncopy
Pyrite | Level 9

Hi Astounding,

 

Sorry it took me awhile to get back to this.

 

I ended up calculating my ages and I am trying an array for antibiotic use for age up to 12 months (1 year). Trying to find out which ids got antibiotic coded before age 1 year.

 

I am running into errors and can't figure out why. 

 

Data want;

set have;

 


array inage {25} VISIT1_AGE VISIT2_AGE VISIT3_AGE VISIT4_AGE VISIT5_AGE VISIT6_AGE VISIT7_AGE VISIT8_AGE VISIT9_AGE VISIT10_AGE VISIT11_AGE
VISIT12_AGE VISIT13_AGE VISIT14_AGE VISIT15_AGE VISIT16_AGE VISIT17_AGE VISIT18_AGE VISIT19_AGE VISIT20_AGE VISIT21_AGE VISIT22_AGE VISIT23_AGE VISIT24_AGE VISIT25;
array incode {25} antibi1 antibi2 antibi3 antibi4 antibi5 antibi6 antibi7 antibi8 antibi9 antibi10 antibi11 antibi12 antibi13 antibi14 antibi15 antibi16
antibi17 antibi18 antibi19 antibi20 antibi21 antibi22 antibi23 antibi24 antibi25;
array outage {25} outVISIT1_AGE -- outVISIT25_AGE ;
array outcode {25} outantibi1--outantibi25;
j = 1;

do i = 1 to 25;
if (. < inage{i} le age1) and incode{i} in('antibi1' 'antibi2' 'antibi3' 'antibi4' 'antibi5' 'antibi6' 'antibi7'
'antibi8' 'antibi9' 'antibi10' 'antibi11' 'antibi12' 'antibi13' 'antibi14' 'antibi15' 'antibi16' 'antibi17'
'antibi18' 'antibi19' 'antibi20' 'antibi21' 'antibi22' 'antibi23' 'antibi24' 'antibi25') then do; /* if age is not missing and in time frame and you have one or more of the asthma codes */
outage{j} = inage{i}; /* move image to outage */
outcode{j} = 1; /* found antibi code and in time range*/
j = j + 1; /* when you find one then increment the j index */
end;

 

These are the errors

 


ERROR: Variable outVISIT1_AGE cannot be found on the list of previously defined variables.
ERROR: Too few variables defined for the dimension(s) specified for the array outage.
197 array inage {25} VISIT1_AGE VISIT2_AGE VISIT3_AGE VISIT4_AGE VISIT5_AGE VISIT6_AGE VISIT7_AGE
-----
124
197! VISIT8_AGE VISIT9_AGE VISIT10_AGE VISIT11_AGE
ERROR: Variable outantibi1 cannot be found on the list of previously defined variables.
ERROR: Too few variables defined for the dimension(s) specified for the array outcode.
ERROR 124-185: The variable inage has already been defined.

198 VISIT12_AGE VISIT13_AGE VISIT14_AGE VISIT15_AGE VISIT16_AGE VISIT17_AGE VISIT18_AGE
198! VISIT19_AGE VISIT20_AGE VISIT21_AGE VISIT22_AGE VISIT23_AGE VISIT24_AGE VISIT25;
199 array incode {25} antibi1 antibi2 antibi3 antibi4 antibi5 antibi6 antibi7 antibi8 antibi9
------
124
199! antibi10 antibi11 antibi12 antibi13 antibi14 antibi15 antibi16
ERROR 124-185: The variable incode has already been defined.

200 antibi17 antibi18 antibi19 antibi20 antibi21 antibi22 antibi23 antibi24 antibi25;
201 array outage {25} outVISIT1_AGE -- outVISIT25_AGE ;
------
124
ERROR 124-185: The variable outage has already been defined.

202 array outcode {25} outantibi1--outantibi25;
-------
124
ERROR 124-185: The variable outcode has already been defined.

203 j = 1;
204
205 do i = 1 to 25;
206 if (. < inage{i} le age1) and incode{i} in('antibi1' 'antibi2' 'antibi3' 'antibi4'
206! 'antibi5' 'antibi6' 'antibi7'
207 'antibi8' 'antibi9' 'antibi10' 'antibi11' 'antibi12' 'antibi13' 'antibi14' 'antibi15'
207! 'antibi16' 'antibi17'
208 'antibi18' 'antibi19' 'antibi20' 'antibi21' 'antibi22' 'antibi23' 'antibi24' 'antibi25') then
208! do; /* if age is not missing and in time frame and you have one or more of antibi coded
208! */
209 outage{j} = inage{i}; /* move image to outage */
210 outcode{j} = 1; /* found antibi code and in time range*/
211 j = j + 1; /* when you find one then increment the j index */
212 end;

 

I have been trying to figure it out unsuccessfully. I appreciate any input to point me in the right direction.

 

Thank you again

 

Astounding
PROC Star

First, let's clear up the obvious piece.

 

The array statements are using  double-dash to indicate the last of variable names:

 

array outage {25} outVISIT1_AGE -- outVISIT25_AGE ;
array outcode {25} outantibi1--outantibi25;

 

The double-dash does not respect the numeric suffix at the end of the variable names.  You could just as easily code a list that reads cat--dog.  That would mean all variable names beginning with CAT and ending with DOG, in the order they were defined when creating the original data.  

 

One of the fixes is easy:

 

array outcode {25} outantibi1-outantibi25;

 

The single dash tells SAS that the list of variable names has a numeric suffix, with the same prefix.  The other ARRAY statement is more problematic.  There isn't a numeric suffix.  To fix it, you will need to spell out the full list of 25 names:

 

array outage {25} outVISIT1_AGE outVISIT2_AGE ... outVISIT24_AGE outVISIT25_AGE ;

There just are no shortcuts available when the variable names are in this form.

 

As always when debugging, we'll have to see if clearing this up resolves all the issues, or reveals additional issues.

Mscarboncopy
Pyrite | Level 9

ok so when I do that this is what I get

array inage {25} VISIT1_AGE VISIT2_AGE VISIT3_AGE VISIT4_AGE VISIT5_AGE VISIT6_AGE VISIT7_AGE
-----
180
233! VISIT8_AGE VISIT9_AGE VISIT10_AGE VISIT11_AGE
ERROR 180-322: Statement is not valid or it is used out of proper order.

234 VISIT12_AGE VISIT13_AGE VISIT14_AGE VISIT15_AGE VISIT16_AGE VISIT17_AGE VISIT18_AGE
234! VISIT19_AGE VISIT20_AGE VISIT21_AGE VISIT22_AGE VISIT23_AGE VISIT24_AGE VISIT25;

235 array incode {25} antibi1 antibi2 antibi3 antibi4 antibi5 antibi6 antibi7 antibi8 antibi9
-----
180
235! antibi10 antibi11 antibi12 antibi13 antibi14 antibi15 antibi16
ERROR 180-322: Statement is not valid or it is used out of proper order.

236 antibi17 antibi18 antibi19 antibi20 antibi21 antibi22 antibi23 antibi24 antibi25;

237 array outage {25} VISIT1_AGE VISIT2_AGE VISIT3_AGE VISIT4_AGE VISIT5_AGE VISIT6_AGE
-----
180
237! VISIT7_AGE VISIT8_AGE VISIT9_AGE VISIT10_AGE VISIT11_AGE
ERROR 180-322: Statement is not valid or it is used out of proper order.

238 VISIT12_AGE VISIT13_AGE VISIT14_AGE VISIT15_AGE VISIT16_AGE VISIT17_AGE VISIT18_AGE
238! VISIT19_AGE VISIT20_AGE VISIT21_AGE VISIT22_AGE VISIT23_AGE VISIT24_AGE VISIT25;

239 array outcode {25} outantibi1- outantibi25;
-----
180

ERROR 180-322: Statement is not valid or it is used out of proper order.

240 j = 1;
-
180

ERROR 180-322: Statement is not valid or it is used out of proper order.

241
242 do i = 1 to 25;
--
180

ERROR 180-322: Statement is not valid or it is used out of proper order.

243 if (. < inage{i} le age1) and incode{i} in('antibi1' 'antibi2' 'antibi3' 'antibi4'
--
180
243! 'antibi5' 'antibi6' 'antibi7'
ERROR 180-322: Statement is not valid or it is used out of proper order.

244 'antibi8' 'antibi9' 'antibi10' 'antibi11' 'antibi12' 'antibi13' 'antibi14' 'antibi15'
244! 'antibi16' 'antibi17'
245 'antibi18' 'antibi19' 'antibi20' 'antibi21' 'antibi22' 'antibi23' 'antibi24' 'antibi25') then
245! do; /* if age is not missing and in time frame and you have one or more of the asthma codes

245! /* if age is not missing and in time frame and you have one or more of the asthma codes */
246 outage{j} = inage{i};
------
180

ERROR 180-322: Statement is not valid or it is used out of proper order.

247 outcode{j} = 1;
-------
180

ERROR 180-322: Statement is not valid or it is used out of proper order.

248 j = j + 1;
-
180

ERROR 180-322: Statement is not valid or it is used out of proper order.

249 end;
---
180

ERROR 180-322: Statement is not valid or it is used out of proper order.

 

 

Mscarboncopy
Pyrite | Level 9

Disregard my previous comment.

It worked as it showed no errors.

However, when I create an SPSS file this is what it says

 

 age1 is uninitialized so it is creating a var age1
 
from here  if (. < inage{i} le age1) and incode{i} in('antibi1' 'antibi2' 'antibi3' 'antibi4' 'antibi5' 'antibi6' 'antibi7

 

 

 

Astounding
PROC Star

So your programming depends on there being an existing variable named AGE1.  But SAS is finding that there is no such variable.  Did you mean to refer to AGE1 or did you mean to refer to something else?

Mscarboncopy
Pyrite | Level 9

yes I just realized that right after I  posted my last note. This is what I add to create age1 


age1 = intnx(year,dob,1); /* add 1 year to date of birth */

and kept the rest of the code the same and the variables are created but all empty ?

 

these are new errors

 

Argument 1 to function INTNX(' .',15803,1) at line 102 column 8 is invalid

 

and It looks like it is only for a few ids ? but it would be able to create for others ? wouldn't it ? 

 

and at the end it says

 

Mathematical operations could not be performed at the following places. The results of the
operations have been set to missing values.

 

 

This is what my code looks like now

 

is there anything wrong that I am missing ? Thank you for your help ! 

 

Data want;

Set have;

 

age1 = intnx(year,dob,1); /* add 1 year to date of birth */

array inage {25} VISIT1_AGE VISIT2_AGE VISIT3_AGE VISIT4_AGE VISIT5_AGE VISIT6_AGE VISIT7_AGE VISIT8_AGE VISIT9_AGE VISIT10_AGE VISIT11_AGE
VISIT12_AGE VISIT13_AGE VISIT14_AGE VISIT15_AGE VISIT16_AGE VISIT17_AGE VISIT18_AGE VISIT19_AGE VISIT20_AGE VISIT21_AGE VISIT22_AGE VISIT23_AGE VISIT24_AGE VISIT25_AGE;
array incode {25} antibi1 antibi2 antibi3 antibi4 antibi5 antibi6 antibi7 antibi8 antibi9 antibi10 antibi11 antibi12 antibi13 antibi14 antibi15 antibi16
antibi17 antibi18 antibi19 antibi20 antibi21 antibi22 antibi23 antibi24 antibi25;
array outage {25} outVISIT1_AGE outVISIT2_AGE outVISIT3_AGE outVISIT4_AGE outVISIT5_AGE outVISIT6_AGE outVISIT7_AGE outVISIT8_AGE outVISIT9_AGE outVISIT10_AGE outVISIT11_AGE
VISIT12_AGE outVISIT13_AGE outVISIT14_AGE outVISIT15_AGE outVISIT16_AGE outVISIT17_AGE outVISIT18_AGE outVISIT19_AGE outVISIT20_AGE outVISIT21_AGE outVISIT22_AGE outVISIT23_AGE outVISIT24_AGE outVISIT25_AGE;
array outcode {25} outantibi1 - outantibi25;
j = 1;

do i = 1 to 25;
if (. < inage{i} le age1) and incode{i} in('antibi1' 'antibi2' 'antibi3' 'antibi4' 'antibi5' 'antibi6' 'antibi7'
'antibi8' 'antibi9' 'antibi10' 'antibi11' 'antibi12' 'antibi13' 'antibi14' 'antibi15' 'antibi16' 'antibi17'
'antibi18' 'antibi19' 'antibi20' 'antibi21' 'antibi22' 'antibi23' 'antibi24' 'antibi25') then do; /* if age is not missing and in time frame and you have one or more of the asthma codes */
outage{j} = inage{i}; /* move image to outage */
outcode{j} = 1; /* found antibi code and in time range*/
j = j + 1; /* when you find one then increment the j index */
end;
end;

Astounding
PROC Star

If you want INTNX to use the year interval, you have to put year in quotes:

 

age1 = intnx("year", dob, 1);

 

Otherwise, SAS thinks YEAR is the name of a variable and its value contains the proper date interval.

Mscarboncopy
Pyrite | Level 9

that works !! Thank you.

When I create the spss file age is in a weird format 16071 I am assuming sas format ?

 

It should have been 0.1  0.5   0.6  etc. What to do ? format it ? how?

 

And all of the other outvars I am trying to create are still empty. 

 

This error remains Invalid numeric data, 'antibi1' , so I have to figure this one out. It is weird because I ran frequencies and the var antibi 1 is fine it is 0 and 1 like all the other ones. 

 

Really weird but so close though. 

Thanks again

 

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 17 replies
  • 1086 views
  • 0 likes
  • 3 in conversation