Hi All,
I have a query; could you please help me with this?
Input dataset:
ID   Jbcode1  Jbcode2  Jbcode3
101  AA
101           BB
101                    CC
I need output like below.
Output dataset:
ID   Jbcode1  Jbcode2  Jbcode3
101  AA
101  AA       BB
101  AA       BB       CC
Thank you very much
data have;
input ID (Jbcode1 Jbcode2 Jbcode3) ($);
cards;
101 AA . .
101 . BB .
101 . . CC
;
data want;
update have(obs=0) have;
by id;
output;
run;
proc print;
run;
Thank you very much
I have seen people use this type of code on a number of occasions.
update have(obs=0) have;
Can someone explain how/why it works?
UPDATE basics:
It takes two data sets. The first is a master data set (meaning no duplicate BY values). The second is a set of transactions to apply to the master.
There can be mismatches ... BY values that appear in one data set but not the other.
UPDATE takes all the transactions, and uses only the nonmissing values to replace values in the master data set.
It outputs just one observation per BY group, once all the transactions have been applied.
UPDATE feature:
You can add an OUTPUT statement, since this is part of a DATA step. UPDATE then outputs each observation, instead of waiting until all the transactions have been applied.
Paper presentation at next SGF.
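A tiny illustration of those basics (my own sketch, not code from this thread; the dataset and variable names are made up):
data master;                /* one observation per BY value      */
   input id name $ score;
cards;
1 Ann 10
2 Bob 20
;
data trans;                 /* transactions to apply to master   */
   input id name $ score;
cards;
1 . 15
1 Anna .
;
/* nonmissing transaction values replace the master values;      */
/* one observation per BY group is written:                      */
/*    1  Anna  15      and      2  Bob  20                       */
data updated;
   update master trans;
   by id;
run;
Adding an OUTPUT statement to the last step would instead write a row each time a transaction is applied, which is the feature used in this thread.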
Thanks for the explanation 🙂
Now suppose the ID values are different, and some values of a particular variable are missing; I need to replace each missing value with the non-missing value from the row above.
Example:
Input:
Id JB1 JB2 JB3
101 A B C
102 . D F
103 C . .
I need output like below
Id JB1 JB2 JB3
101 A B C
102 A D F
103 C D F
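One way to get that output with the same UPDATE trick is to update against a constant dummy BY variable, so the fill-down carries across the different IDs. This is only a sketch from my side (have2 and dummy are names I made up) and is not tested against your real data:
data have;
   input Id (JB1 JB2 JB3) ($);
cards;
101 A B C
102 . D F
103 C . .
;
data have2;                       /* add a constant BY variable      */
   set have;
   dummy = 1;
run;
data want;
   update have2(obs=0) have2;     /* same obs=0 trick as above       */
   by dummy;                      /* one big BY group: values carry  */
   output;                        /* keep one row per input row      */
   drop dummy;
run;
With your example rows this should give 101 A B C, 102 A D F, 103 C D F.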
Is UPDATE a new statement?
@Ubai wrote:
Is UPDATE a new statement?
No. I used it in 1983.
@PaigeMiller wrote:
I have seen people use this type of code on a number of occasions.
update have(obs=0) have;
Can someone explain how/why it works?
The UPDATE statement requires two datasets. The first one is the original source (master) dataset and the second one is the set of transactions to be applied. In addition, the source dataset must have unique observations for the BY variables.
By using the same dataset for both and adding the OBS=0 dataset option, you start with an empty source dataset and treat all of the data as transactions. So for the first observation in the BY group, the values are simply copied from the transaction dataset as a new output record. For the 2nd, 3rd, etc. observations in the transaction dataset, the non-missing values overwrite the current values and the missing values are ignored.
The other trick that the usage in this thread adds is an explicit OUTPUT statement. Without the explicit OUTPUT, the UPDATE statement only outputs the final result of all of the transactions for the BY group.
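To tie that back to the code in question, here is the same step again with comments (nothing new, only annotation):
data want;
   /* have(obs=0): an empty master, so it is trivially unique per BY value */
   /* have       : the full dataset supplies the transactions              */
   update have(obs=0) have;
   by id;
   /* write a row after each transaction is applied; without OUTPUT,
      only the final collapsed row per BY group would be written */
   output;
run;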
Thanks for the explanations, @Tom and @Astounding
Good afternoon Sir @PaigeMiller. FWIW, I tried to pen down the concept, though not eloquently, and didn't even proofread it once, so apologies for the spelling errors and lack of flow from writing too fast. Anyway, you can treat it as garbage in, garbage out 🙂
When you have time:
Since we know that the UPDATE statement typically uses two datasets, namely a master dataset on the left and a transaction dataset on the right, and updates the master with non-missing values from the transaction dataset based on the BY-group value, it is essentially yet another look-up technique that does an update based on matches.
As usual, the BY-group values in both datasets need to be ordered or indexed for DATA step processing, the exception being the MODIFY statement, which requires neither dataset to be ordered or indexed. That is for the obvious reason that a look-up by means of the MODIFY statement does a rather dynamic WHERE lookup using the values of the transaction dataset. We shall delve into the fun of the MODIFY statement some other time, so as not to deviate from the scope of the UPDATE statement.
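In case the inputs are not already in order, the usual preparation is simply a sort on the BY variable (a tiny sketch with illustrative dataset names):
/* both inputs must be in BY-group order (or indexed) before UPDATE */
proc sort data=master; by id; run;
proc sort data=trans;  by id; run;
data updated;
   update master trans;
   by id;
run;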
The functionality of the UPDATE statement has similarities with both the MERGE and the MODIFY statement, which makes it a very interesting statement with flavors of both.
WRT MERGE: this could be called a MATCH-UPDATE, akin to a MATCH-MERGE based on a BY group. In other words, the BY-group values that contribute matches can be identified using the IN= dataset option, which flags the matches and non-matches.
WRT MODIFY: the functional similarity is that both read observations from the transaction dataset (the 2nd dataset) and do a look-up against the master dataset (the 1st dataset). The distinction from MERGE is that the combination is not based on the position of the BY-group values.
Rule of Thumb for UPDATE:
Let’s take a simple example with NON-MISSING values to understand how UPDATE works.
Dataset ONE has a unique ID and dataset TWO has duplicate ID values.
data one;
input id v2;
cards;
1 2
;
data two;
input id v2;
cards;
1 4
1 5
2 5
;
data want;
update one(in=a) two(in=b);
by id;
one=a;
two=b;
run;
Resulting dataset want (no OUTPUT statement):
id  v2  one  two
1   5   1    1
2   5   0    1
What happened is that for ID=1 in the one dataset, SAS found matching transactions in two for the BY-group value ID=1, applied them one after another, and wrote only the final updated record for the BY group.
So WHAT-IF we want to keep all the records of update occurrences?
This makes it obvious that we would need an explicit OUTPUT statement in the code. Adding the explicit OUTPUT makes sure that every record read sequentially from two into the PDV is written to the output dataset WANT.
/*Adding an output statement*/
data want;
update one(in=a) two(in=b);
by id;
one=a;
two=b;
output;
run;
Resulting dataset want (with the OUTPUT statement):
id  v2  one  two
1   4   1    1
1   5   1    1
2   5   0    1
Therefore, it becomes increasingly clear how the observations read from the transaction dataset (two) update the data values of matching BY groups.
In both of the above outputs, we can notice that when there is a non-matching BY-group value in two, that BY group gets appended to the output dataset. In other words, whenever there is a non-matching BY group in the transaction (two) dataset, the observations read sequentially into the PDV are also written to the output dataset.
So WHAT-IF there happen to be multiple occurrences of non-matching BY-group values in two?
Let's modify two to include one more record for ID=2 (2 6) below and see what happens during the update.
data two;
input id v2;
cards;
1 4
1 5
2 5
2 6
;
data one;
input id v2;
cards;
1 2
;
data want;
update one(in=a) two(in=b);
by id;
one=a;
two=b;
run;
Resulting dataset want (no OUTPUT statement):
id  v2  one  two
1   5   1    1
2   6   0    1
Let's go over how this works again:
Intuitively, including an explicit OUTPUT statement will write a record to the output dataset each time one is read into the PDV from TWO.
/*Adding an output statement*/
data want;
update one(in=a) two(in=b);
by id;
one=a;
two=b;
output;
run;
Resulting dataset want (with the OUTPUT statement):
id  v2  one  two
1   4   1    1
1   5   1    1
2   5   0    1
2   6   0    1
So WHAT-IF there happen to be multiple occurrences of the same BY-group value in one (the master) as well as in two?
This is where the crux of the UPDATE statement shows itself. Recall that only the 1st observation of the BY group in the master dataset (one) is updated with values from the matching BY group of the transaction dataset (two), and a warning is written to the LOG.
data one;
input id v2;
cards;
1 2
1 3
1 7
;
data two;
input id v2;
cards;
1 4
1 5
2 5
2 6
;
/*Adding an output statement*/
data want;
update one(in=a) two(in=b);
by id;
one=a;
two=b;
*output;
run;
WARNING: The MASTER data set contains more than one observation for a BY group.
a=1 b=0 id=1 v2=3 FIRST.id=0 LAST.id=0 one=1 two=0 _ERROR_=1 _N_=3
WARNING: The MASTER data set contains more than one observation for a BY group.
a=1 b=0 id=1 v2=7 FIRST.id=0 LAST.id=1 one=1 two=0 _ERROR_=1 _N_=4
From the above we expect that only the record 1 2 would be updated, as that is the 1st record of the BY group in one (the master dataset). Uncommenting the OUTPUT statement would of course write all the records read from the transaction dataset (two) to the output dataset want.
MISSING Values Consideration:
The default action of the UPDATE statement is to update master dataset (one) values only with non-missing values from the transaction dataset.
Suppose we have a missing value in the 2nd dataset. The default UPDATEMODE=MISSINGCHECK prevents the missing value from replacing a non-missing value in a PDV variable when a match is found; and when a match is not found, the missing value is not appended as it otherwise would be had the value been non-missing.
data one;
input id v2;
cards;
1 2
1 3
1 7
;
data two;
input id v2;
cards;
1 4
1 .
2 5
2 6
;
/*Adding an output statement*/
data want;
update one(in=a) two(in=b);
by id;
one=a;
two=b;
*output;
run;
Resulting dataset want (OUTPUT statement commented out):
id  v2  one  two
1   4   1    1
1   3   1    0
1   7   1    0
2   6   0    1
v2=2 is replaced by 4; however, v2=4 is not replaced by the subsequent missing value.
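For contrast (not demonstrated in this post), that default corresponds to UPDATEMODE=MISSINGCHECK on the UPDATE statement; with UPDATEMODE=NOMISSINGCHECK, missing transaction values are allowed to overwrite non-missing master values:
data want_nomiss;
   update one two updatemode=nomissingcheck;   /* missing values now overwrite */
   by id;
run;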
So UPDATE HAVE (OBS=0) HAVE; seems like LOCF (last observation carried forward) when used with an OUTPUT statement, or like collapsing all records into one complete record of non-missing values when used without an OUTPUT statement?
data one;
input id v2 v3 v4;
cards;
1 . . 4
1 . 3 .
1 2 . .
;
An intuitive thought, looking at this dataset and based upon what has been expounded so far, could be:
UPDATE HAVE (OBS=0) HAVE
Let’s test:
data one;
input id v2 v3 v4;
cards;
1 . . 4
1 . 3 .
1 2 . .
;
data want;
update one(obs=0) one;
by id;
run;
/*Adding an output*/
data want;
update one(obs=0 in=a) one(in=b);
by id;
one=a;
two=b;
output;
run;
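For reference, the result is not pasted here, but the first step (no OUTPUT) should collapse the BY group into the single record 1 2 3 4, while the second step (with OUTPUT) should write 1 . . 4, then 1 . 3 4, then 1 2 3 4, i.e. the last non-missing value carried forward.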
Now, to make it behave like LOCF, the obvious requirement is the inclusion of the explicit OUTPUT statement, i.e. to output every record read from the transaction dataset. The only reason it plays out as LOCF here is that the values are missing; with real (non-missing) values the usual meaning of UPDATE still holds. 🙂