Hi Guys,
I am new to SAS and I would love someone to crack for me this logic?
I have a dataset with patients and weight taken on several days. However, some patients have multiple readings on same day. I would like to keep one row per patient per day. I have concluded that entries with more decimal places are likely to be incorrect. Therefore, where one patient has multiple weights on same day, I would take the one with the lowest number of decimal places. However, where a patient has only one weight on a specific day, I would like to keep the the available weight (regardless of the decimal places). The data I have is table_1 and the data i want is table 2
table_1
patient |
mental_score |
date |
weight |
weight_type |
19000067 |
34 |
15-02-10 |
65 |
1 |
19000067 |
28 |
15-02-10 |
66.1 |
2 |
19000068 |
28 |
13-03-08 |
70.12 |
1 |
19000068 |
34 |
13-03-08 |
70.123 |
1 |
19000068 |
31 |
13-03-08 |
70.02262443 |
2 |
19000099 |
31 |
23-10-09 |
81.36 |
1 |
19100069 |
34 |
08-12-10 |
71 |
2 |
19100069 |
23 |
08-12-10 |
71.2 |
2 |
19100069 |
28 |
08-12-10 |
86.26 |
1 |
1910070 |
45 |
07-06-99 |
70 |
1 |
1910070 |
45 |
07-06-99 |
71.9 |
3 |
1910071 |
45 |
29-01-98 |
79.97737557 |
1 |
1910076 |
45 |
07-06-99 |
79.97737557 |
2 |
1910076 |
45 |
07-06-99 |
70.0226 |
1 |
1910567 |
45 |
09-02-00 |
100 |
2 |
1910567 |
45 |
09-02-00 |
99.022 |
1 |
1910567 |
45 |
09-02-00 |
99.2 |
2 |
table_2
patient |
mental_score |
date |
weight |
weight_type |
19000067 |
34 |
15-02-10 |
65 |
1 |
19000068 |
28 |
13-03-08 |
70.12 |
1 |
19000099 |
31 |
23-10-09 |
81.36 |
1 |
19100069 |
34 |
08-12-10 |
71 |
2 |
1910070 |
45 |
07-06-99 |
70 |
1 |
1910071 |
45 |
29-01-98 |
79.97737557 |
1 |
1910076 |
45 |
07-06-99 |
70.0226 |
1 |
1910567 |
45 |
09-02-00 |
100 |
2 |
Thank you, a lot,
Dathan Bonanebye
Actual example input data in the form of a data step.
Or at least print a before and after result.
Subset the input to some that has the problem. Use a where statement to subset the data to some of the Ids.
Largish sorts may run quicker using the TAGSORT option but that would not affect the First. processing.
This data set:
data want; set have; by patient lab_d; decimal=lengthn (scan(put(lab_v,best32. -L),2,'.')); run;
does not need By processing and appears to likely have a spelling issue between lab_d and lab_v.
Best approach for sharing code, especially if there may be questions about behavior, is to copy the steps or procedures in question from LOG with all the notes and warnings. Then on the forum open a text box using the </> icon that appears at the top of the message window and paste. This will preserve formatting of SAS warning and error messages. The main message windows reformat pasted text if you hadn't noticed.
Does your log for the data step where you add the decimals or select the First.date records show any warning or notes other than "data x was created with xxx observations and yyy varibles"?
Some possible causes:
Trailing characters other than spaces that don't display with your font, i.e the ASCII null or 255, which would make them look sorted but the first/last uses the value differently.
data example; x="123456 "; /* the "space" below is an ASCII null*/ /* entered by holding down the ALT key and typing 255 on the numeric keypad in windows */ y="123456 "; if x=y then put "same"; else put "different"; run;
Or a FORMAT applied to a character variable that is shorter than the actual values. This can make different values look the same in output
Decimal portions in your dates. If the dates are numeric and you use a SAS date format to make them legible the decimal portion is ignored by the format so the value looks the same to a human but the First/Last uses the numeric value. This is more likely to happen if the data were ever in a spreadsheet, possibly with a datetime value.
data example; x=22456; y=22456.0001; put x=date9. y= date9.; if x=y then put "same"; else put "different"; run;
Your dates aren't actually SAS dates but character values and perhaps you have some confusion from data entry of 0 and O characters, or one, 1, lower case L, l, and or upper case i , I .Or different types of dashes between the date components.
Here is an example of one way to get the number of decimal places.
data junk; x= 70.02262443; decimals = length(scan(put(x,best32. -L),2,'.')); run;
Working from the inside out on that equation: Put (x.best32 -L) creates a left justified (no leading spaces) character value that look like the numeric X. Scan(string,2,'.') says to treat the decimal point as a delimiter and the 2 is to get the bit after the decimal, i.e. 02262773. The Length function then counts the number of characters, i.e. digits, in that string of "decimal" values.
Sort the data by patient, date, decimals.
Then:
Proc sort data=have;/* this data set needs to have the added decimals variable*/ by patient date decimals; run; data want; set have; by patient date; if first.date; run;
When you use BY processing in a data step SAS creates automatic variables First. and Last. for each variable on the by statement. These automatic variables take on values of 1 (true) or 0(false) indicating whether the current record is the first or last of the group for the given variable. Note the dot between the keyword and variable name and no space allowed.
The If First.date ; keeps only records where the First.date is true. Since we sorted first the smaller number of decimals would be the first for each date within the patients records. If there is only one record it is first and kept.
Caution: if you have missing values for your weight variable the number of decimals will be 0. So if you have such in your data you will need to add logic about how to handle that situation.
If there is only a single record for by variable then both First and Last are true.
Hi Ballardw;
For some reason this doesnt work-I triend same approach earlier before (and I have retried). The output dataset still has multiple entries. I am not sure what the reason could be. My dataset is quite heavy (1 million rows with over 12 columns). Does the by statement become innefficient when sorting by more than 3 variables? I hacve tried also sorting in SAS enterprise guide (keeping first entry for every unique patient and date) and botha approaches still have some patients with multiple entries. Below is my code.
data want;
set have;
by patient lab_d;
decimal=lengthn (scan(put(lab_v,best32. -L),2,'.'));
run;
proc sort data= have;
by patient date decimal;
run;
data want2;
set want;
by patient date;
if first.date;
run;
Any other approah?
Thanks
Dathan
Actual example input data in the form of a data step.
Or at least print a before and after result.
Subset the input to some that has the problem. Use a where statement to subset the data to some of the Ids.
Largish sorts may run quicker using the TAGSORT option but that would not affect the First. processing.
This data set:
data want; set have; by patient lab_d; decimal=lengthn (scan(put(lab_v,best32. -L),2,'.')); run;
does not need By processing and appears to likely have a spelling issue between lab_d and lab_v.
Best approach for sharing code, especially if there may be questions about behavior, is to copy the steps or procedures in question from LOG with all the notes and warnings. Then on the forum open a text box using the </> icon that appears at the top of the message window and paste. This will preserve formatting of SAS warning and error messages. The main message windows reformat pasted text if you hadn't noticed.
Does your log for the data step where you add the decimals or select the First.date records show any warning or notes other than "data x was created with xxx observations and yyy varibles"?
Some possible causes:
Trailing characters other than spaces that don't display with your font, i.e the ASCII null or 255, which would make them look sorted but the first/last uses the value differently.
data example; x="123456 "; /* the "space" below is an ASCII null*/ /* entered by holding down the ALT key and typing 255 on the numeric keypad in windows */ y="123456 "; if x=y then put "same"; else put "different"; run;
Or a FORMAT applied to a character variable that is shorter than the actual values. This can make different values look the same in output
Decimal portions in your dates. If the dates are numeric and you use a SAS date format to make them legible the decimal portion is ignored by the format so the value looks the same to a human but the First/Last uses the numeric value. This is more likely to happen if the data were ever in a spreadsheet, possibly with a datetime value.
data example; x=22456; y=22456.0001; put x=date9. y= date9.; if x=y then put "same"; else put "different"; run;
Your dates aren't actually SAS dates but character values and perhaps you have some confusion from data entry of 0 and O characters, or one, 1, lower case L, l, and or upper case i , I .Or different types of dashes between the date components.
@DathanMD wrote:
Hi Ballardw;
For some reason this doesnt work-I triend same approach earlier before (and I have retried). The output dataset still has multiple entries. I am not sure what the reason could be. My dataset is quite heavy (1 million rows with over 12 columns). Does the by statement become innefficient when sorting by more than 3 variables? I hacve tried also sorting in SAS enterprise guide (keeping first entry for every unique patient and date) and botha approaches still have some patients with multiple entries. Below is my code.
data want;
set have;
by patient lab_d;
decimal=lengthn (scan(put(lab_v,best32. -L),2,'.'));
run;
proc sort data= have;
by patient date decimal;
run;data want2;
set want;
by patient date;
if first.date;
run;
Any other approah?
Thanks
Dathan
This code won't work as intended:
data want;
set have;
by patient lab_d;
decimal=lengthn (scan(put(lab_v,best32. -L),2,'.'));
run;
proc sort data= have; /* you sort the original dataset, not the "want" dataset that contains the new variable */
by patient date decimal;
run;
data want2;
set want;
by patient date;
if first.date;
run;
This code was tested with your data as posted and worked:
data with_length;
set have;
l = lengthn(scan(put(weight,best32.),2,"."));
run;
proc sql;
create table want as
select
patient,
mental_score,
date,
weight,
weight_type
from with_length
group by patient, date
having l = min(l)
;
quit;
This worked also, creating the same result:
proc sort data=with_length;
by patient date l;
run;
data want2;
set with_length;
by patient date;
if first.date;
run;
If this does not do it for you, post the complete log from your code by copy/pasting into a window opened with this button:
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.