Hello ,
I am trying to loop through multiple rows for a same ID and date in order to create a derived variable.
My dataset is a follows:
ID DATE SUBJECT DERIVED VARIABLE
A Jan1 2015 Chemistry
Physics
Biology
A Jan 10 2015 Swimming
Track
Math
A Jan20 2015 History
Social Science
English
B Jan1 2015 Swimming
Track
Math
B Jan20 2015 History
Social Science
English
etc...
I want to assign a the value of the derived variable based on the all the records for one subject one date.
ID DATE SUBJECT DERIVED VARIABLE
A Jan1 2015 Chemistry
Physics Science
Biology
A Jan 10 2015 Swimming
Track Sports
Math
A Jan20 2015 History
Social Science Humanities
English
B Jan1 2015 Swimming
Track Sports
Math
B Jan20 2015 History
Social Science
English
How can I loop through all the records for one particular subject for each date and then assign the derived value?
The program should check all observation for the same id and day and if the records have "physics,chemistry, biology" assign the value Science. If the records have "Simming, track" assign the value Sport.
I did a sort by id and date and then tried with first.date but only the very first observation is taken into account.
thanks
KC
Is you subject value one variable that contains 3 values?
What happens with Math when associated with Swimming and Track?
The dataset is one row per observation so the subject variable is on each row. I presented the way I did for legibility purpose. This is how it looks.
ID DATE SUBJECT DERIVED VARIABLE
A Jan1 2015 Chemistry
A Jan1 2015 Physics
A Jan1 2015 Biology
Regarding Math, becasue two out or three fall inthe same category math is ignored.
If all three were different for example:
A Jan 10 2015 Swimming
A Jan 10 2015 English
A Jan 10 2015 Math
then the derived variabel would be empty.
I don't really see a way other than if/then statements with a combination of by group processing.
Here's a partial way. You'll need to then look at the final counts on last.id and determine what the derived variable is.
data want;
set have;
by id;
array subject(3) science sport humanities;
retain science sport humanities;
if first.id then call missing(science, sport, humanities);
if subject="Chemistry" then science+1;
else if subject="Swimming" then sport+1;
etc...
run;
Looping through the data is not difficult ... but you will need to loop through it twice. The first loop lets you assign your derived variable. The second loop lets you read/output the same records with the derived variable attached. For example:
data want;
science = 0;
humanities = 0;
sports = 0;
do until (last.date);
set have;
by id date;
if subject in ('Chemistry', 'Physics', 'Biology') then science + 1;
else if subject in ('Swimming', 'Track') then sports + 1;
else if subject in ('English', 'Social Science', 'History') then humanities + 1;
end;
*** Add your logic here to create DERIVED_VARIABLE based on SCIENCE, HUMANITIES and SPORTS;
do until (last.date);
set have;
by id date;
output;
end;
**** optionally, drop science sports humanities;
run;
You may have to account for additional SUBJECT values, and you may have to deal with variations in spelling as well.
The logic between the loops is up to you ... how to handle ties, and various mixes. But the data that you would need is all there.
Good luck.
None of the solution above seems to work.
I need the loop to go through all the observations per id and date and then assign the derived variable values.
I do not know which derived values to assign at the start it dpends on what the program will find in the dataset.
kc
None of the above are complete solutions. What code did you use that "didn't work"?
You will need to do some of the work, if you want to use my solution. There are too many rules that I cannot possibly know, and don't even want to ask. For example, what should happen if a student has 3 science courses and 1 sports? What if it is 2 of each? Instead of trying to answer those types of questions, I came up with 3 variables:
SCIENCE = # of science courses
HUMANITIES = # of humanities courses
SPORTS = # of SPORTS courses
It is up to you to make up the rules and write the code that creates DERIVED_VALUE based on these variables. I'm not going to ask for the rules for all the possible variations that might be found. I indicated in the program the spot where you would have to add your logic.
Good luck.
I suggest that you define a small dataset to specify which subjects are in which disciplines and then use a majority rule to set your derived variable:
data disciplines;
length SUBJECT Discipline $20;
input SUBJECT & Discipline ;
datalines;
Chemistry Science
History Humanities
Math Science
Track Sports
Biology Science
English Humanities
Physics Science
Social Science Humanities
Track Sports
Swimming Sports
;
data myDataset;
length ID DATE SUBJECT $20;
input ID DATE SUBJECT &;
datalines;
A Jan012015 Chemistry
A Jan012015 Physics
A Jan012015 Biology
A Jan102015 Swimming
A Jan102015 Track
A Jan102015 Math
A Jan202015 History
A Jan202015 Social Science
A Jan202015 English
B Jan012015 Swimming
B Jan012015 Track
B Jan012015 Math
B Jan202015 History
B Jan202015 Social Science
B Jan202015 English
C Jan302015 English
C Jan302015 Swimming
C Jan302015 Math
;
proc sql;
create table counts as
select ID, date, discipline, count(*) as n
from
myDataset as a inner join
disciplines as b on a.subject=b.subject
group by ID, date, discipline
having n > 1
order by ID, date, n descending;
quit;
data counts;
set counts; by ID date;
if first.date;
run;
proc sql;
create table want as
select a.*, b.discipline
from
myDataset as a left join
counts as b on a.ID=b.ID and a.date=b.date;
select * from want;
quit;
PG
I used "Astounding's" code". In the first loop,
do until (last.date);
set have;
by id date;
if subject in ('Chemistry', 'Physics', 'Biology') then science + 1;
else if subject in ('Swimming', 'Track') then sports + 1;
else if subject in ('English', 'Social Science', 'History') then humanities + 1;
end;
if takes only the first values and assign all the other flags to 0.
Thank you for you feedabck Reeza and Astounding. I will try to work with the hints you have provided.
KC
In theory, that shouldn't happen. If it does happen, it reflects what is in the data. For example:
(1) There might be only one SUBJECT for an ID/DATE combination.
(2) The spelling of SUBJECT might be slightly different from the values being counted.
(3) The list of values being counted might be incomplete.
But it would be right to look at the data, not the programming logic. This approach should definitely generate better counts. Perhaps start with a PROC FREQ on SUBJECT, so you are certain of what is there.
Good luck.
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.