Hi SAS forum,
I have two data tables that look like this:
data file1; input date $ weight; datalines; 07FEB20:08:10:08 68.9 07FEB20:09:10:08 68.2 07FEB20:10:10:08 68.4 07FEB20:11:10:08 68.7 07FEB20:12:10:08 68.1 07FEB20:13:10:08 67.4 07FEB20:14:10:08 67.9 07FEB20:15:10:08 80.2 07FEB20:16:10:08 81.2; run;
data file2;
input date $;
datalines;
07FEB20:10:40:08
10FEB20:16:45:08
14FEB20:12:20:08
20FEB20:11:54:08
02JAN20:12:58:08;
run;
As you can see, file1 has hourly values of weight, whereas file2 just has particular date-times. What I want to do is, for each row (date-time) in file2, look up the first date-time in file1 that precedes this date-time, and pull this date-time and the associated weight into file2. So, for this example I would want file2 to be modified to be file2_new where the first row of data would read as:
data file2_new; input date $ date_pulled $ weight_pulled; datalines; 07FEB20:10:40:08 07FEB20:10:10:08 68.4 run;
Any tips on how to do this would be really appreciated!
Thanks
linlin87
A simple way is to interleave the observations and retain the WEIGHT value in a new variable.
First let's make some more complex test data.
data file1;
input id date :datetime. weight;
format date datetime19.;
datalines;
1 07FEB2020:08:10:08 68.9
1 07FEB2020:09:10:08 68.2
1 10FEB2020:10:10:08 68.4
1 13FEB2020:11:10:08 68.7
1 14FEB2020:12:10:08 68.1
1 16FEB2020:13:10:08 67.4
1 18FEB2020:14:10:08 67.9
2 21FEB2020:15:10:08 80.2
2 07MAR2020:16:10:08 81.2
;
data file2;
input id date :datetime. ;
format date datetime19.;
datalines;
1 07FEB2020:10:40:08
1 10FEB2020:16:45:08
1 14FEB2020:12:20:08
2 20FEB2020:11:54:08
2 02MAR2020:12:58:08
;
So now just use a BY statement with SET to interleave the records. Use the IN= dataset option to figure out which observations are from FILE2.
data want;
set file1 file2(in=in2);
by id date;
if first.id then last_wt=.;
last_wt = coalesce(weight, last_wt);
retain last_wt;
if in2;
drop weight;
run;
Results
Obs id date last_wt 1 1 07FEB2020:10:40:08 68.2 2 1 10FEB2020:16:45:08 68.4 3 1 14FEB2020:12:20:08 68.1 4 2 20FEB2020:11:54:08 . 5 2 02MAR2020:12:58:08 80.2
First step:
Do not use character values for datetimes. With character values such as you show 14Feb would come before 20Jan.
Second: a warning bad things can happen to folks that habitually use 2-digit years.
Third: define in a bit more detail what "look up the first date-time in file1 that precedes this date-time" means. It would appear to me that except for the Jan date that 07FEB20:08:10:08 is the "first date that precedes" the values in file_2. There is something missing about your definition of "first". You really should supply more than one result example. That we can check whether any suggestion is correct.
Fourth: make sure that the semicolon that ends datalines is on it's own line.
data file1; input date :datetime. weight; format date datetime19.; datalines; 07FEB20:08:10:08 68.9 07FEB20:09:10:08 68.2 07FEB20:10:10:08 68.4 07FEB20:11:10:08 68.7 07FEB20:12:10:08 68.1 07FEB20:13:10:08 67.4 07FEB20:14:10:08 67.9 07FEB20:15:10:08 80.2 07FEB20:16:10:08 81.2 ; run; data file2; input date :datetime. ; format date datetime19.; datalines; 07FEB20:10:40:08 10FEB20:16:45:08 14FEB20:12:20:08 20FEB20:11:54:08 02JAN20:12:58:08 ; run;
If your values are character you will need to convert them to numeric values similar to above so any comparisons resolve correctly.
Apologies, ballardw. You are correct in that I incorrectly wrote down this problem.
The problem should have been: for each date-time in file2, find the LAST date-time and weight that precedes it. Saying "find the first that precedes it" is a bit ambiguous.
I also did have the date-time variable formatted as Tom states below (not as a character variable as I originally posted).
Also, thank you for correcting the semicolon on datalines.
How would you tackle this problem? I would be very grateful for your input!
Kind regards,
linlin87
A simple way is to interleave the observations and retain the WEIGHT value in a new variable.
First let's make some more complex test data.
data file1;
input id date :datetime. weight;
format date datetime19.;
datalines;
1 07FEB2020:08:10:08 68.9
1 07FEB2020:09:10:08 68.2
1 10FEB2020:10:10:08 68.4
1 13FEB2020:11:10:08 68.7
1 14FEB2020:12:10:08 68.1
1 16FEB2020:13:10:08 67.4
1 18FEB2020:14:10:08 67.9
2 21FEB2020:15:10:08 80.2
2 07MAR2020:16:10:08 81.2
;
data file2;
input id date :datetime. ;
format date datetime19.;
datalines;
1 07FEB2020:10:40:08
1 10FEB2020:16:45:08
1 14FEB2020:12:20:08
2 20FEB2020:11:54:08
2 02MAR2020:12:58:08
;
So now just use a BY statement with SET to interleave the records. Use the IN= dataset option to figure out which observations are from FILE2.
data want;
set file1 file2(in=in2);
by id date;
if first.id then last_wt=.;
last_wt = coalesce(weight, last_wt);
retain last_wt;
if in2;
drop weight;
run;
Results
Obs id date last_wt 1 1 07FEB2020:10:40:08 68.2 2 1 10FEB2020:16:45:08 68.4 3 1 14FEB2020:12:20:08 68.1 4 2 20FEB2020:11:54:08 . 5 2 02MAR2020:12:58:08 80.2
Thank you Tom, that is brilliant. It certainly looks to be doing what I was hoping. But why have you introduced this id variable? Because that is not how my data is structured. I do not have this ID variable in my dataset, and I am not sure how this variable you have generated relates to the original data structures.
The extra ID variable was to demonstrate what to do when you have data from multiple people/places/things/sample. In those cases you do not want the last WEIGHT value to be taken from a different person.
If you don't have multiple sampling units it will be very difficult to do any statistical analysis.
I see, thanks Tom. That is indeed very useful and needed.
How would I pull over the datetime from file1, as well as the weight?
Just extend the process to include that variable.
You probably want to only consider the dates where there is actually a WEIGHT value.
data want;
set file1 file2(in=in2);
by id date;
if first.id then call missing(last_wt,last_dt);
if not missing(weight) then do;
last_wt = coalesce(weight, last_wt);
last_dt = date;
end;
retain last_wt last_dt;
format last_dt datetime19. ;
if in2;
drop weight;
run;
Thank you Tom I really appreciate your help. I want to accept this, but I still don't understand how it works. What does in=in2 do? What is the argument in2? What is the coalesce command doing?
Read about dataset options. IN= allows you to provide the NAME of a variable that will be TRUE when that dataset contributed to the current observation. So because of the subsetting IF statement only the observations read from TABLE2 will be make to the output dataset.
Read about functions. COALESCE() just means take the first non missing value in the argument list. So if WEIGTH is not empty it replaces LAST_WT. With the extra IF NOT MISSING(weight) test in the last code I posted it is no longer needed.
Hi Tom,
Thanks for this. It is working but not quite how I expect, because of one difference in the datafile to how I described in my original post. The files are as follows:
data file1; input id date :datetime. weight; format date datetime19.; datalines; 1 07FEB2020:08:10:08 68.9 1 07FEB2020:09:10:08 68.2 1 10FEB2020:10:10:08 68.4 1 13FEB2020:11:10:08 68.7 1 14FEB2020:12:10:08 68.1 1 16FEB2020:13:10:08 67.4 1 18FEB2020:14:10:08 67.9 2 21FEB2020:15:10:08 80.2 2 08MAR2020:16:10:08 81.2 2 07FEB2020:08:10:08 68.9 2 07FEB2020:09:10:08 68.2 2 10FEB2020:10:10:08 68.4 2 13FEB2020:09:10:08 68.7 3 14FEB2020:12:10:08 68.1 3 16FEB2020:10:10:08 67.4 3 18FEB2020:14:10:08 67.9 3 21FEB2020:11:10:08 80.2 3 10FEB2020:23:10:08 82.9 ; run; data file2; input id refdate :datetime. ; format date datetime19.; datalines; 1 07FEB2020:10:40:08 1 10FEB2020:16:45:08 1 14FEB2020:12:20:08 1 20FEB2020:11:54:08 1 02MAR2020:12:58:08 1 07FEB2020:10:40:08 1 10FEB2020:16:45:08 1 14FEB2020:12:20:08 2 20FEB2020:11:54:08 2 02MAR2020:12:58:08 2 07FEB2020:13:40:08 2 10FEB2020:17:40:08 2 14FEB2020:12:20:08 2 20FEB2020:09:54:08 2 02MAR2020:12:08:08 3 07FEB2020:10:40:08 3 10FEB2020:14:40:08 3 14FEB2020:12:20:08 3 20FEB2020:19:54:08 3 02MAR2020:12:58:08 3 07FEB2020:11:00:08 3 10FEB2020:16:45:08 ; run;
Now obviously the code doesn't work because the variable name in file2 is refdtm not dtm. How would I make your code work for this? I could just change the name of refdtm to dtm, but wondering if you have a better fix.
Kind regards,
linlin87
Did you read the documentation on dataset options I sent the link for? For this you need the RENAME= dataset option.
merge table1(rename=(date=refdate)) .....
Make sure to modify the references to DATE to use new REFDATE name instead in the rest of the data step.
PS Using DATE for a variable that does not actually contain DATE (number of days) values will cause you confusion later on. Your variables contain DATETIME (number of seconds) values instead of DATE values.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.