I have a large table that looks something like this:
ID | CrossRef |
10001 | . |
10002 | . |
10003 | . |
... | ... |
12005 | . |
12006 | 10003 |
12007 | 10042 |
... | ... |
13015 | 10003 |
... | ... |
Any row that is missing a CrossRef is an original entry, and any row with a CrossRef is modifying the row with the matching ID. Larger ID values indicate later entries.
What I want is to sort this so that the original entry is immediately followed by any that modify it, e.g.:
ID | CrossRef |
10001 | . |
10002 | . |
10003 | . |
12006 | 10003 |
13015 | 10003 |
10005 | . |
... | ... |
10042 | . |
12007 | 10042 |
... | ... |
This does what I understand you want:
data need / view=need;
set have;
_sortvar=coalesce(crossref,id);
run;
proc sort data=need out=want (drop=_:);
by _sortvar id;
run;
This almost works - the problem I'm having is that each original entry (e.g. id = 10003) always has a CrossRef of . (i.e. missing), so the coalesce gets confused. What you're doing makes sense though.
@osbornejo wrote:
This almost works - the problem I'm having is that each original entry (e.g. id = 10003) always has a CrossRef of . (i.e. missing), so the coalesce gets confused. What you're doing makes sense though.
That does not make sense.
Try it.
data have;
input ID CrossRef ;
cards;
10001 .
10002 .
10003 .
12005 .
12006 10003
12007 10042
13015 10003
;
data step1;
set have;
orig_id = coalesce(crossref,id);
run;
proc sort;
by orig_id id;
run;
Results
Cross Obs ID Ref orig_id 1 10001 . 10001 2 10002 . 10002 3 10003 . 10003 4 12006 10003 10003 5 13015 10003 10003 6 12007 10042 10042 7 12005 . 12005
I just checked my actual data again and I misspoke - the original entries have a CrossRef of 0, not missing. I think that's what is throwing the sort off, because when I tried it, I ended up with
Obs ID CrossRef orig_id 1 10001 0 0 2 10002 0 0 3 10003 0 0 4 12005 0 0 5 12006 10003 10003 6 13015 10003 10003 7 12007 10042 10042
Also, ID and CrossRef are character variables, not numeric.
Is the issue that the CROSS_REF is not to the ORIGINAL id but to just the PREVIOUS id? Which might have itself been a CROSS_REF before?
@osbornejo wrote:
This almost works - the problem I'm having is that each original entry (e.g. id = 10003) always has a CrossRef of . (i.e. missing), so the coalesce gets confused. What you're doing makes sense though.
The task is "What I want is to sort this so that the original entry is immediately followed by any that modify it,".
which is what happens in the code I provided, given that "larger ID values indicate later entries.".
I would need some additional explanation of what you want, and some sample data and desired results in which the submitted code produces different results.
I just checked my actual data again and I misspoke - the original entries have a CrossRef of 0, not missing. I think that's what is throwing the coalesce off, because when I tried it, I ended up with
Obs ID CrossRef orig_id 1 10001 0 0 2 10002 0 0 3 10003 0 0 4 12005 0 0 5 12006 10003 10003 6 13015 10003 10003 7 12007 10042 10042
Also, ID and CrossRef are character variables, not numeric.
@osbornejo wrote:
I just checked my actual data again and I misspoke - the original entries have a CrossRef of 0, not missing. I think that's what is throwing the coalesce off
Yes, a zero is not a missing value, so coalesce will not skip over it to look for a subsequent non-missing value. This can be dealt with by some minor code modification.
But possibly more significant you also tell us
ID and CrossRef are character variables, not numeric.
Now the coalesce function will automatically convert arguments of entirely numeric characters to actual numeric values, with a note to that effect on your log. This is probably good, because any ID or CROSSREF values of less than 5 digits (forget the "0" for now), will be sorted in numeric order, rather than in lexicographic order if they had remained character. Otherwise "2345" would sort AFTER "12345".
So I would stick with the coalesce function, but change the first argument from CROSSREF to IFC(crossref='0','.',crossref), allowing a character "0" to be replaced by a ".", and treated by coalesce as a missing value.
Example:
data have;
infile datalines truncover;
input ID :$5. CrossRef :$5.;
cards;
10001 0
10002 0
10003 0
12005 0
12006 10003
12007 10042
13015 10003
;
data need / view=need;
set have;
_sortvar=coalesce(ifc(crossref='0','.',crossref),id);
run;
proc sort data=need out=want (drop=_:);
by _sortvar id;
run;
Edit note: actually the problem of character variables is not entirely solved above. The minor sort key (ID) can cause a problem it you apply the code above to the data below:
data have;
infile datalines truncover;
input ID :$5. CrossRef :$5.;
cards;
2345 0
10001 0
10002 0
10003 0
12005 0
12006 10003
12007 10042
13015 10003
19999 2345
33333 2345
;
The fix would be:
data need / view=need;
set have;
_sortvar=coalesce(ifc(crossref='0','.',crossref),id);
_sortvar2=right(id);
run;
proc sort data=need out=want (drop=_:);
by _sortvar _sortvar2;
run;
Only strings that contain all spaces will be treated as missing. Perhaps you are confused because the normal character informat $ will convert '.' into ' '.
@osbornejo wrote:
Ok, I tried this and when I look at _sortvar in the need table, it is equal to . when the CrossRef is 0. I don't understand why that is happening - I thought coalesce() was supposed to skip missing values. Does the period work as a missing value when it is a character variable?
If CrossRef="0" then the only time _sortvar should be missing is when ID has a non-numeric character or is completely blank. And in that case there would be an error message on your SAS log.
So I can't respond usefully to this result without relevant information. Please provide (1) the original data of the offending observations, and (2) the log of the data step that generated the _sortvar=. result. Otherwise, I am unable to reproduce the problem you describe.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.