BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ganeshsas764
Obsidian | Level 7

Hi All,

 

I have one query could you please help me on this?

 

Input Dataset :- 

 

ID Jbcode1 Jbcode2 Jbcode3

101 AA          

101             BB

101                              CC

 

I need output like below 

output dataset ;

 

ID  Jbcode1 Jbcode2 Jbcode3

101  AA          

101  AA           BB

101  AA           BB          CC

 

1 ACCEPTED SOLUTION

Accepted Solutions
ganeshsas764
Obsidian | Level 7

Thank you very much 

View solution in original post

10 REPLIES 10
Ksharp
Super User
data have;
input ID (Jbcode1 Jbcode2 Jbcode30) ($);
cards;
101 AA     . .     
101   .          BB .
101   .              .             CC
;
data want;
 update have(obs=0) have;
 by id;
 output;
 run;
 
 proc print;
 run;
 
ganeshsas764
Obsidian | Level 7

Thank you very much 

PaigeMiller
Diamond | Level 26

I have seen people use this type of code on a number of occasions.


update have(obs=0) have;

Can someone explain how/why it works?

 

 

--
Paige Miller
Astounding
PROC Star

UPDATE basics:

 

It takes two data sets.  The first is a master data set (meaning no duplicate values for BY values).  The second is a set of transactions to apply to the master.

 

There can be mismatches ... BY values that appear  in one data set but not the other.

 

UPDATE takes all the transactions, and uses the nonmissing values only to replace values in the master data set.

 

It outputs just one observation per BY group, once all the transactions have been applied.

 

UPDATE feature:

 

You can add an OUTPUT statement, since this is part of a DATA step.  UPDATE then outputs each observation, instead of waiting until all the transactions have been applied.

 

Paper presentation at next SGF.

ganeshsas764
Obsidian | Level 7

Thanks for the explanation 🙂 

 

for supposing id values different if any values are missing particular variable so  I need to replace with above non-missing value to below the missing value

 

Example:-  

input :- 

 

 Id    JB1    JB2  JB3

101   A         B     C

102   .          D      F

103    C       .         .

 

I need output like below 

 

 Id    JB1    JB2  JB3

101   A         B     C

102   A         D      F

103    C       D       F

Ubai
Quartz | Level 8

Is UPDATE a new statement?

Tom
Super User Tom
Super User

@Ubai wrote:

Is UPDATE a new statement?


No.  I used it in 1983.

Tom
Super User Tom
Super User

@PaigeMiller wrote:

I have seen people use this type of code on a number of occasions.


update have(obs=0) have;

Can someone explain how/why it works?


The UPDATE statement requires two datasets. The first one is the original source dataset and the second one is the transactions to be applied.  In addition the source dataset must have unique observations for the BY variables.

 

By using the same dataset for both and use the OBS= dataset option you are starting with a empty source dataset treating all of the data as transactions.  So for the first observation in the BY group the values are just copied from the transaction dataset as a new output record.  For the 2nd, 3rd, etc observation in the transaction dataset the non-missing values will overwrite the current value and the missing values will be ignored.

 

The other trick that the usage in this thread added as to include an explicit OUTPUT statement.  Without the explicit OUTPUT the UPDATE function will only output the final result of all of the transactions for the by group.

PaigeMiller
Diamond | Level 26

Thanks for the explanations, @Tom and @Astounding 

--
Paige Miller
novinosrin
Tourmaline | Level 20

Good afternoon Sir @PaigeMiller  FWIW, I tried to pen the concept though not eloquent and didn't even proof read once, so apologies for the spelling errors and lack of flow having written too fast. Anyways, you can treat it as Garbage in-out 🙂

 

When you have time:

 

Since we know that UPDATE statement typically uses two datasets namely master dataset on the left and transaction dataset on the right to update non missing values from transaction dataset based on the value of the BY group. This essentially means it’s yet another LOOK-UP technique that does an update based on matches.

As usual BY GROUP values in both datasets requires to be ordered or indexed for Datastep processing though the exception being Modify statement that does not require either of the datasets to be ordered or indexed. This is for the obvious reason that LOOK-UP by means of Modify statement does a rather “dynamic WHERE processing using values of Transaction dataset”. We shall delve into the fun of MODIFY statement some other time in order to not deviate from the scope limiting to just UPDATE statement.

The functionality of UPDATE statement has similarities with MERGE and MODIFY statement making it a very interesting statement having flavors of both.

WRT MERGE, This perspective of the nomenclature could be called MATCH-UPDATE akin to MATCH-MERGE based on a BY GROUP.   In other words, the matches By GROUP values that contribute can be identified using the IN= dataset option that flags the matches and NON-MATCHES.

WRT MODIFY, The functional similarity is that both reads observations from Transaction dataset(2nd dataset) and does a LOOK UP with a Master dataset(1st dataset). The distinction from merge being a combine not based on position of by group values.

Rule of Thumb for UPDATE:

  1. Master Dataset (1st dataset) cannot have duplicate values of the BY GROUP or in other words must be unique. Hmm, what’s the big deal about this restriction? Well, the matches from the Transaction dataset (2nd dataset) updates only the 1st value of the BY GROUP. Should 1st dataset have multiple values for a BY GROUP, SAS writes a WARNING note to the LOG stating “WARNING: The MASTER data set contains more than one observation for a BY group.”

 

  1. Transaction dataset (2nd dataset) can have more or values of BY GROUP and there is no restriction. If there’s just one, no biggie as the update becomes ONE to ONE update. However, when there are duplicates, there poses a question? I guess this is where pure diligence of business user’s knowledge of data and requirement applies at best. Do we want to update 1st records, 2nd record or the nth record of the matching By-GROUP from the transaction dataset or all records of the matching by-group?

 

Let’s take a simple example with NON MISSING Values to understand how UPDATE stuff works.

Dataset ONE has unique ID and Dataset two has multiples.

data one;

input id v2;

cards;

1 2

;

 

data two;

input id v2;

cards;

1 4

1 5

2 5

;

 

data want;

 update one(in=a) two(in=b);

 by id;

 one=a;

 two=b;

run;

id

v2

one

two

1

5

1

1

2

5

0

1

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

What happened is that for ID=1 in one dataset, SAS found a match from two for the BY GROUP value ID= 1.

  • So during the 1st Datastep iteration, SAS read the 1st values from one, found a match in two. In the very same datastep iteration, SAS continues to read all values of matching BY GROUP for ID=1 from dataset two.
  • So basically, it almost seems like a one to many match occurring within just one datastep iteration where v2=2 in one is replaced by v2=4 from two in the PDV. Then since continues to read from two, v2=4 is replaced by v2=5 in the PDV
  • The automatic output and return at the end of the datastep that we know writes the values of the PDV to the output buffer and eventually to the output dataset want
  • In essence, we have done two updates for v2 in ID=1 in one using v2 in two implicitly. Why I call implicit is because of the 2nd update v2=5 replacing v2=4 is not explicitly written to the output dataset in the above example.

 

So WHAT-IF we want to keep all the records of update occurrences?

 This makes it obvious that we would need the explicit OUTPUT statement in the code. So adding an explicit statement makes sure, all records read sequentially from two into the PDV are written to the output datset WANT.

/*Adding an output statement*/

data want;

 update one(in=a) two(in=b);

 by id;

 one=a;

 two=b;

 output;

run;

 

id

v2

one

two

1

4

1

1

1

5

1

1

2

5

0

1

 

 

 

 

 

 

 

       
 

 

Therefore, it is increasingly clear to discern that how observations read from Transaction dataset (two) update data values matching By-groups.

 

In both the above outputs, we could notice that if there is non-matching BY GROUP value in two, the by group gets appended to the output dataset.  Or in other words, whenever there is a non-matching By GROUP in transaction (two) dataset, the sequentially read observations read into the PDV is also written to the output dataset.

 

So WHAT-IF there happens to be multiple occurrences of nonmatching BY GROUP values in two?

Let’s modify two to include one more record for ID=2 (2  6) below and comprehend what happens during update

data two;

input id v2;

cards;

1 4

1 5

2 5

2 6

;

data one;

input id v2;

cards;

1 2

;

data want;

 update one(in=a) two(in=b);

 by id;

 one=a;

 two=b;

run;

id

v2

one

two

1

5

1

1

2

6

0

1

 

Let’s iterate how stuff works again:

  • A match is found for ID=1 from both datasets one and two
  • Observations from two are read sequentially replacing v2=2 with v2=4 and v2=6 in the PDV.
  • Since no explicit OUTPUT statement is used, the PDV variable V2 value is over written every time a record is read from two for the BY GROUP.
  • Finally, The implicit OUTPUT statement writes the record into the output dataset WANT

Intuitively, including an explicit OUTPUT statement will write each time a record that is read into the PDV from TWO to the output dataset.

/*Adding an output statement*/

data want;

 update one(in=a) two(in=b);

 by id;

 one=a;

 two=b;

 output;

run;

id

v2

one

two

1

4

1

1

1

5

1

1

2

5

0

1

2

6

0

1

 

 

 

 

 

 

 

 

 

So WHAT-IF there happens to be multiple occurrences of nonmatching BY GROUP values in one and two?

This is where crux of situation or rather the UPDATE statement apexes. We can recall that only the 1st observation of the BY GROUP in the master dataset (one) will be updated with values from matching BY GROUP of the transaction dataset (two) and a warning note is written to the LOG.

 

data one;

input id v2;

cards;

1 2

1 3

1 7

;

 

data two;

input id v2;

cards;

1 4

1 5

2 5

2 6

;

/*Adding an output statement*/

data want;

 update one(in=a) two(in=b);

 by id;

 one=a;

 two=b;

 *output;

run;

 

 

WARNING: The MASTER data set contains more than one observation for a BY group.

a=1 b=0 id=1 v2=3 FIRST.id=0 LAST.id=0 one=1 two=0 _ERROR_=1 _N_=3

WARNING: The MASTER data set contains more than one observation for a BY group.

a=1 b=0 id=1 v2=7 FIRST.id=0 LAST.id=1 one=1 two=0 _ERROR_=1 _N_=4

 

The above elucidation unvaryingly makes us to expect only the record 1 2 would be updated as that is the 1st record of the BY GROUP in one (master dataset). Uncommenting the output statement would of course write all the records read from Transaction dataset (two) to the output dataset want.

 

 

MISSING Values Consideration:

 

The default action for UPDATE statement is to only update master dataset (one) values with non-missing values from the transaction dataset.

 

Suppose we have a missing value in 2nd dataset. The UPDATEMODE Missingvalue default check prevents the missing value from replacing a non-missing value in a PDV variable when a match Is found or when a match isn’t found is not appended as it would otherwise had the value been non-missing.

 

data one;

input id v2;

cards;

1 2

1 3

1 7

;

 

data two;

input id v2;

cards;

1 4

1 .

2 5

2 6

;

/*Adding an output statement*/

data want;

 update one(in=a) two(in=b);

 by id;

 one=a;

 two=b;

 *output;

run;

 

id

v2

one

two

1

4

1

1

1

3

1

0

1

7

1

0

2

6

0

1

V2=2 is replaced by 4 however v2=4 is not replaced by the sequential missing value.

 

UPDATE HAVE (OBS=0) HAVE seems like LOCF (Last observation carried forward) with an OUTPUT statement or collapsing all records into one complete record with NON-MISSING Values without an OUTPUT statement?

data one;

input id v2 v3 v4;

cards;

1 . . 4

1 . 3 .

1 2 . .

;

 

An Intuitive thought looking into the dataset based upon what has been expounded so far could be

UPDATE HAVE (OBS=0) HAVE

  • No observation is read from the master dataset HAVE
  • Like a SELF-MERGE/JOIN= that we are used to and are familiar with, HAVE here is used as Transaction dataset
  • LOOK UP should fail because of OBS=0 used in master
  • Therefore records read from Transaction for each BY GROUP should append to the new dataset WANT
  • 1st record . . 4 is read into the PDV from Trans(one) beginning with the 1st by group
  • With MISSINGVALUE default check in place, 4 is read into the PDV while initial missing values remain as-is
  • Next record . 3 . is yet again checked for the default MISSINGVALUE and so only 3, replaces the missing value in PDV. 4 remains as-is. Now we have . 3 4 in the PDV.
  • The next trans record is check for missing and this time, 2 replaces the missing value in PDV resulting with values 2 3 4
  • The implicit OUTPUT writes the 2 3 4 to the output dataset.

 

Let’s test:

data one;

input id v2 v3 v4;

cards;

1 . . 4

1 . 3 .

1 2 . .

;

 

data want;

 update one(obs=0) one;

 by id;

run;

/*Adding an output*/

data want;

 update one(obs=0 in=a) one(in=b);

 by id;

 one=a;

 two=a;

 output;

run;

 

Now, to make it seem like a LOCF, the obvious would have to be the inclusion of explicit OUTPUT statement i.e. to output every occurrence of read from trans dataset. The only difference it makes here is because the values are missing, the logic works well otherwise, the meaning of UPDATE with real values holds true significance. 🙂

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1495 views
  • 8 likes
  • 7 in conversation