Solved: Re: Data restructure in sas

ganeshsas764 · Posted 12-21-2019 06:07 AM

Hi All,

I have one query could you please help me on this?

Input Dataset :-

ID Jbcode1 Jbcode2 Jbcode3

101 AA

101 BB

101 CC

I need output like below

output dataset ;

ID Jbcode1 Jbcode2 Jbcode3

101 AA

101 AA BB

101 AA BB CC

ganeshsas764 · Posted 12-21-2019 08:59 AM

Thank you very much

View solution in original post

Ksharp · Posted 12-21-2019 06:41 AM

data have;
input ID (Jbcode1 Jbcode2 Jbcode30) ($);
cards;
101 AA     . .     
101   .          BB .
101   .              .             CC
;
data want;
 update have(obs=0) have;
 by id;
 output;
 run;
 
 proc print;
 run;

ganeshsas764 · Posted 12-21-2019 08:59 AM

Thank you very much

PaigeMiller · Posted 12-21-2019 10:05 AM

I have seen people use this type of code on a number of occasions.


update have(obs=0) have;

Can someone explain how/why it works?

--
Paige Miller

Astounding · Posted 12-21-2019 10:40 AM

UPDATE basics:

It takes two data sets. The first is a master data set (meaning no duplicate values for BY values). The second is a set of transactions to apply to the master.

There can be mismatches ... BY values that appear in one data set but not the other.

UPDATE takes all the transactions, and uses the nonmissing values only to replace values in the master data set.

It outputs just one observation per BY group, once all the transactions have been applied.

UPDATE feature:

You can add an OUTPUT statement, since this is part of a DATA step. UPDATE then outputs each observation, instead of waiting until all the transactions have been applied.

Paper presentation at next SGF.

ganeshsas764 · Posted 12-21-2019 01:31 PM

Thanks for the explanation 🙂

for supposing id values different if any values are missing particular variable so I need to replace with above non-missing value to below the missing value

Example:-

input :-

Id JB1 JB2 JB3

101 A B C

102 . D F

103 C . .

I need output like below

Id JB1 JB2 JB3

101 A B C

102 A D F

103 C D F

Ubai · Posted 12-21-2019 02:26 PM

Is UPDATE a new statement?

Tom · Posted 12-21-2019 02:50 PM

@Ubai wrote:

Is UPDATE a new statement?

No. I used it in 1983.

Tom · Posted 12-21-2019 10:46 AM

@PaigeMiller wrote:

I have seen people use this type of code on a number of occasions.
update have(obs=0) have;
Can someone explain how/why it works?

The UPDATE statement requires two datasets. The first one is the original source dataset and the second one is the transactions to be applied. In addition the source dataset must have unique observations for the BY variables.

By using the same dataset for both and use the OBS= dataset option you are starting with a empty source dataset treating all of the data as transactions. So for the first observation in the BY group the values are just copied from the transaction dataset as a new output record. For the 2nd, 3rd, etc observation in the transaction dataset the non-missing values will overwrite the current value and the missing values will be ignored.

The other trick that the usage in this thread added as to include an explicit OUTPUT statement. Without the explicit OUTPUT the UPDATE function will only output the final result of all of the transactions for the by group.

PaigeMiller · Posted 12-21-2019 12:22 PM

Thanks for the explanations, @Tom and @Astounding

--
Paige Miller

novinosrin · Posted 12-21-2019 02:40 PM

Good afternoon Sir @PaigeMiller FWIW, I tried to pen the concept though not eloquent and didn't even proof read once, so apologies for the spelling errors and lack of flow having written too fast. Anyways, you can treat it as Garbage in-out 🙂

When you have time:

Since we know that UPDATE statement typically uses two datasets namely master dataset on the left and transaction dataset on the right to update non missing values from transaction dataset based on the value of the BY group. This essentially means it’s yet another LOOK-UP technique that does an update based on matches.

As usual BY GROUP values in both datasets requires to be ordered or indexed for Datastep processing though the exception being Modify statement that does not require either of the datasets to be ordered or indexed. This is for the obvious reason that LOOK-UP by means of Modify statement does a rather “dynamic WHERE processing using values of Transaction dataset”. We shall delve into the fun of MODIFY statement some other time in order to not deviate from the scope limiting to just UPDATE statement.

The functionality of UPDATE statement has similarities with MERGE and MODIFY statement making it a very interesting statement having flavors of both.

WRT MERGE, This perspective of the nomenclature could be called MATCH-UPDATE akin to MATCH-MERGE based on a BY GROUP. In other words, the matches By GROUP values that contribute can be identified using the IN= dataset option that flags the matches and NON-MATCHES.

WRT MODIFY, The functional similarity is that both reads observations from Transaction dataset(2^nd dataset) and does a LOOK UP with a Master dataset(1^st dataset). The distinction from merge being a combine not based on position of by group values.

Rule of Thumb for UPDATE:

Master Dataset (1^st dataset) cannot have duplicate values of the BY GROUP or in other words must be unique. Hmm, what’s the big deal about this restriction? Well, the matches from the Transaction dataset (2^nd dataset) updates only the 1^st value of the BY GROUP. Should 1^st dataset have multiple values for a BY GROUP, SAS writes a WARNING note to the LOG stating “WARNING: The MASTER data set contains more than one observation for a BY group.”

Transaction dataset (2^nd dataset) can have more or values of BY GROUP and there is no restriction. If there’s just one, no biggie as the update becomes ONE to ONE update. However, when there are duplicates, there poses a question? I guess this is where pure diligence of business user’s knowledge of data and requirement applies at best. Do we want to update 1^st records, 2^nd record or the nth record of the matching By-GROUP from the transaction dataset or all records of the matching by-group?

Let’s take a simple example with NON MISSING Values to understand how UPDATE stuff works.

Dataset ONE has unique ID and Dataset two has multiples.

data one;

input id v2;

cards;

1 2

;

data two;

input id v2;

cards;

1 4

1 5

2 5

;

data want;

update one(in=a) two(in=b);

by id;

one=a;

two=b;

run;

id	v2	one	two
1	5	1	1
2	5	0	1

What happened is that for ID=1 in one dataset, SAS found a match from two for the BY GROUP value ID= 1.

So during the 1^stDatastep iteration, SAS read the 1^st values from one, found a match in two. In the very same datastep iteration, SAS continues to read all values of matching BY GROUP for ID=1 from dataset two.
So basically, it almost seems like a one to many match occurring within just one datastep iteration where v2=2 in one is replaced by v2=4 from two in the PDV. Then since continues to read from two, v2=4 is replaced by v2=5 in the PDV
The automatic output and return at the end of the datastep that we know writes the values of the PDV to the output buffer and eventually to the output dataset want
In essence, we have done two updates for v2 in ID=1 in one using v2 in two implicitly. Why I call implicit is because of the 2^nd update v2=5 replacing v2=4 is not explicitly written to the output dataset in the above example.

So WHAT-IF we want to keep all the records of update occurrences?

This makes it obvious that we would need the explicit OUTPUT statement in the code. So adding an explicit statement makes sure, all records read sequentially from two into the PDV are written to the output datset WANT.

/*Adding an output statement*/

data want;

update one(in=a) two(in=b);

by id;

one=a;

two=b;

output;

run;

id	v2	one	two
1	4	1	1
1	5	1	1
2	5	0	1

Therefore, it is increasingly clear to discern that how observations read from Transaction dataset (two) update data values matching By-groups.

In both the above outputs, we could notice that if there is non-matching BY GROUP value in two, the by group gets appended to the output dataset. Or in other words, whenever there is a non-matching By GROUP in transaction (two) dataset, the sequentially read observations read into the PDV is also written to the output dataset.

So WHAT-IF there happens to be multiple occurrences of nonmatching BY GROUP values in two?

Let’s modify two to include one more record for ID=2 (2 6) below and comprehend what happens during update

data two;

input id v2;

cards;

1 4

1 5

2 5

2 6

;

data one;

input id v2;

cards;

1 2

;

data want;

update one(in=a) two(in=b);

by id;

one=a;

two=b;

run;

id	v2	one	two
1	5	1	1
2	6	0	1

Let’s iterate how stuff works again:

A match is found for ID=1 from both datasets one and two
Observations from two are read sequentially replacing v2=2 with v2=4 and v2=6 in the PDV.
Since no explicit OUTPUT statement is used, the PDV variable V2 value is over written every time a record is read from two for the BY GROUP.
Finally, The implicit OUTPUT statement writes the record into the output dataset WANT

Intuitively, including an explicit OUTPUT statement will write each time a record that is read into the PDV from TWO to the output dataset.

/*Adding an output statement*/

data want;

update one(in=a) two(in=b);

by id;

one=a;

two=b;

output;

run;

id	v2	one	two
1	4	1	1
1	5	1	1
2	5	0	1
2	6	0	1

So WHAT-IF there happens to be multiple occurrences of nonmatching BY GROUP values in one and two?

This is where crux of situation or rather the UPDATE statement apexes. We can recall that only the 1^st observation of the BY GROUP in the master dataset (one) will be updated with values from matching BY GROUP of the transaction dataset (two) and a warning note is written to the LOG.

data one;

input id v2;

cards;

1 2

1 3

1 7

;

data two;

input id v2;

cards;

1 4

1 5

2 5

2 6

;

/*Adding an output statement*/

data want;

update one(in=a) two(in=b);

by id;

one=a;

two=b;

*output;

run;

WARNING: The MASTER data set contains more than one observation for a BY group.

a=1 b=0 id=1 v2=3 FIRST.id=0 LAST.id=0 one=1 two=0 _ERROR_=1 _N_=3

WARNING: The MASTER data set contains more than one observation for a BY group.

a=1 b=0 id=1 v2=7 FIRST.id=0 LAST.id=1 one=1 two=0 _ERROR_=1 _N_=4

The above elucidation unvaryingly makes us to expect only the record 1 2 would be updated as that is the 1st record of the BY GROUP in one (master dataset). Uncommenting the output statement would of course write all the records read from Transaction dataset (two) to the output dataset want.

MISSING Values Consideration:

The default action for UPDATE statement is to only update master dataset (one) values with non-missing values from the transaction dataset.

Suppose we have a missing value in 2^nd dataset. The UPDATEMODE Missingvalue default check prevents the missing value from replacing a non-missing value in a PDV variable when a match Is found or when a match isn’t found is not appended as it would otherwise had the value been non-missing.

data one;

input id v2;

cards;

1 2

1 3

1 7

;

data two;

input id v2;

cards;

1 4

1 .

2 5

2 6

;

/*Adding an output statement*/

data want;

update one(in=a) two(in=b);

by id;

one=a;

two=b;

*output;

run;

id	v2	one	two
1	4	1	1
1	3	1	0
1	7	1	0
2	6	0	1

V2=2 is replaced by 4 however v2=4 is not replaced by the sequential missing value.

UPDATE HAVE (OBS=0) HAVE; seems like LOCF (Last observation carried forward) with an OUTPUT statement or collapsing all records into one complete record with NON-MISSING Values without an OUTPUT statement?

data one;

input id v2 v3 v4;

cards;

1 . . 4

1 . 3 .

1 2 . .

;

An Intuitive thought looking into the dataset based upon what has been expounded so far could be

UPDATE HAVE (OBS=0) HAVE

No observation is read from the master dataset HAVE
Like a SELF-MERGE/JOIN= that we are used to and are familiar with, HAVE here is used as Transaction dataset
LOOK UP should fail because of OBS=0 used in master
Therefore records read from Transaction for each BY GROUP should append to the new dataset WANT
1^st record . . 4 is read into the PDV from Trans(one) beginning with the 1^st by group
With MISSINGVALUE default check in place, 4 is read into the PDV while initial missing values remain as-is
Next record . 3 . is yet again checked for the default MISSINGVALUE and so only 3, replaces the missing value in PDV. 4 remains as-is. Now we have . 3 4 in the PDV.
The next trans record is check for missing and this time, 2 replaces the missing value in PDV resulting with values 2 3 4
The implicit OUTPUT writes the 2 3 4 to the output dataset.

Let’s test:

data one;

input id v2 v3 v4;

cards;

1 . . 4

1 . 3 .

1 2 . .

;

data want;

update one(obs=0) one;

by id;

run;

/*Adding an output*/

data want;

update one(obs=0 in=a) one(in=b);

by id;

one=a;

two=a;

output;

run;

Now, to make it seem like a LOCF, the obvious would have to be the inclusion of explicit OUTPUT statement i.e. to output every occurrence of read from trans dataset. The only difference it makes here is because the values are missing, the logic works well otherwise, the meaning of UPDATE with real values holds true significance. 🙂

SAS Innovate 2025: Call for Content

Classroom Training Available!