This is the data I have:
data have;
input cname $ id $ assets rev sales emp code1 $ code2 $;
datalines;
Company1 CN9360002629 4196808624000 2606362179000 2606362179000 908500 4911 .
. . . . . . 4931 .
Company2 CN9375780665 101027000 1477554657000 1477554657000 80 7372 .
Company3 AU092210127 . . . . 5172 4953
Company4 BE0840549639 130972000000 64776000000 41034000000 34 1521 1521
. . . . . . 1522 1522
. . . . . . 1531 1531
. . . . . . 1541 1541
. . . . . . 6712 6500
. . . . . . 6719 7389
. . . . . . 6722 8740
;
The lines that do not have the variables cname, id have data that belong to the previous cname and id. I would like to add the data that is in the previous line.
What I want:
data want;
input cname $ id $ assets rev sales emp code1 $ code2 $;
datalines;
Company1 CN9360002629 4196808624000 2606362179000 2606362179000 908500 4911 .
Company1 CN9360002629 4196808624000 2606362179000 2606362179000 908500 4931 .
Company2 CN9375780665 101027000 1477554657000 1477554657000 80 7372 .
Company3 AU092210127 . . . . 5172 4953
Company4 BE0840549639 130972000000 64776000000 41034000000 34 1521 1521
Company4 BE0840549639 130972000000 64776000000 41034000000 34 1522 1522
Company4 BE0840549639 130972000000 64776000000 41034000000 34 1531 1531
Company4 BE0840549639 130972000000 64776000000 41034000000 34 1541 1541
Company4 BE0840549639 130972000000 64776000000 41034000000 34 6712 6500
Company4 BE0840549639 130972000000 64776000000 41034000000 34 6719 7389
Company4 BE0840549639 130972000000 64776000000 41034000000 34 6722 8740
;
For given dataset and output (without explanation of requirements) it would be something like this:
data have;
input cname $ id $ assets rev sales emp code1 $ code2 $;
format _numeric_ 16.;
datalines;
Company1 CN9360002629 4196808624000 2606362179000 2606362179000 908500 4911 .
. . . . . . 4931 .
Company2 CN9375780665 101027000 1477554657000 1477554657000 80 7372 .
Company3 AU092210127 . . . . 5172 4953
Company4 BE0840549639 130972000000 64776000000 41034000000 34 1521 1521
. . . . . . 1522 1522
. . . . . . 1531 1531
. . . . . . 1541 1541
. . . . . . 6712 6500
. . . . . . 6719 7389
. . . . . . 6722 8740
;
proc print;run;
data temp;
set have;
length _cname $15;
retain _cname;
if cname ne '' then _cname=cname;
if cname eq '' then cname=_cname;
drop _cname;
run;
data want;
set temp;
by cname;
length _id _code1 _code2 $15 _assets _rev _sales _emp 8;
retain _id _code1 _code2 _assets _rev _sales _emp;
if not last.cname then do;
if id ne '' then _id=id;
if code1 ne '' then _code1=code1;
if code2 ne '' then _code2=code2;
if assets ne . then _assets=assets;
if rev ne . then _rev=rev;
if sales ne . then _sales=sales;
if emp ne . then _emp=emp;
end;
if not first.cname then do;
if id eq '' then id= _id;
if code1 eq '' then code1=_code1;
if code2 eq '' then code2=_code2;
if assets eq . then assets=_assets;
if rev eq . then rev=_rev;
if sales eq . then sales=_sales;
if emp eq . then emp=_emp;
end;
drop _:;
proc print; run;
Result;
First, CNAME is being carried forward for missing observations . Then data is being processed in BY-Groups (by CNAME) and missing observations populated using RETAIN statement. If there are more variables ARRAY would be a more efficient way in terms of coding.
RETAIN the variables. Check the line before reading the variables.
data have;
input dummy $ @1 @;
if dummy ne ' ' then input cname $ id $ assets rev sales emp code1 $ code2 $;
else input (6*dummy code1 code2) (:$);
drop dummy ;
retain cname -- emp ;
datalines;
Company1 CN9360002629 4196808624000 2606362179000 2606362179000 908500 4911 .
. . . . . . 4931 .
Company2 CN9375780665 101027000 1477554657000 1477554657000 80 7372 .
Company3 AU092210127 . . . . 5172 4953
Company4 BE0840549639 130972000000 64776000000 41034000000 34 1521 1521
. . . . . . 1522 1522
. . . . . . 1531 1531
. . . . . . 1541 1541
. . . . . . 6712 6500
. . . . . . 6719 7389
. . . . . . 6722 8740
;
If you unfortunately already have the messed up data as a dataset you could treat it as a last observations carried forward problem. But you really need a grouping variable for that.
data groups;
retain group;
set have;
group + (cname ne ' ') ;
run;
data want;
update groups(obs=0) groups;
by group;
output;
run;
Result
Obs group cname id assets rev sales emp code1 code2 1 1 Company1 CN936000 4.1968086E12 2.606362E12 2.606362E12 908500 4911 2 1 Company1 CN936000 4.1968086E12 2.606362E12 2.606362E12 908500 4931 3 2 Company2 CN937578 101027000 1.477555E12 1.477555E12 80 7372 4 3 Company3 AU092210 . . . . 5172 4953 5 4 Company4 BE084054 130972000000 64776000000 41034000000 34 1521 1521 6 4 Company4 BE084054 130972000000 64776000000 41034000000 34 1522 1522 7 4 Company4 BE084054 130972000000 64776000000 41034000000 34 1531 1531 8 4 Company4 BE084054 130972000000 64776000000 41034000000 34 1541 1541 9 4 Company4 BE084054 130972000000 64776000000 41034000000 34 6712 6500 10 4 Company4 BE084054 130972000000 64776000000 41034000000 34 6719 7389 11 4 Company4 BE084054 130972000000 64776000000 41034000000 34 6722 8740
Id you don't want the missing values of CODE1 and CODE2 filled in with the previous value then add this statement after the BY statement.
set groups(keep=code1 code2);
For given dataset and output (without explanation of requirements) it would be something like this:
data have;
input cname $ id $ assets rev sales emp code1 $ code2 $;
format _numeric_ 16.;
datalines;
Company1 CN9360002629 4196808624000 2606362179000 2606362179000 908500 4911 .
. . . . . . 4931 .
Company2 CN9375780665 101027000 1477554657000 1477554657000 80 7372 .
Company3 AU092210127 . . . . 5172 4953
Company4 BE0840549639 130972000000 64776000000 41034000000 34 1521 1521
. . . . . . 1522 1522
. . . . . . 1531 1531
. . . . . . 1541 1541
. . . . . . 6712 6500
. . . . . . 6719 7389
. . . . . . 6722 8740
;
proc print;run;
data temp;
set have;
length _cname $15;
retain _cname;
if cname ne '' then _cname=cname;
if cname eq '' then cname=_cname;
drop _cname;
run;
data want;
set temp;
by cname;
length _id _code1 _code2 $15 _assets _rev _sales _emp 8;
retain _id _code1 _code2 _assets _rev _sales _emp;
if not last.cname then do;
if id ne '' then _id=id;
if code1 ne '' then _code1=code1;
if code2 ne '' then _code2=code2;
if assets ne . then _assets=assets;
if rev ne . then _rev=rev;
if sales ne . then _sales=sales;
if emp ne . then _emp=emp;
end;
if not first.cname then do;
if id eq '' then id= _id;
if code1 eq '' then code1=_code1;
if code2 eq '' then code2=_code2;
if assets eq . then assets=_assets;
if rev eq . then rev=_rev;
if sales eq . then sales=_sales;
if emp eq . then emp=_emp;
end;
drop _:;
proc print; run;
Result;
First, CNAME is being carried forward for missing observations . Then data is being processed in BY-Groups (by CNAME) and missing observations populated using RETAIN statement. If there are more variables ARRAY would be a more efficient way in terms of coding.
@Satori wrote:
How can I do this for a dataset that already exists? Can I use this in a data step?
See updated answer.
If you have the data as text, like your posted example, then you can use a data step that checks the line and only reads the values when they exists and remembers (RETAINs) them.
Otherwise I showed how you can use the UPDATE statement to handle last observation carried forward. But that requires first calculating a grouping variable.
Otherwise you will need to make a series of NEW variables that you can RETAIN to remember the value from the earlier observations. Here is a pattern. Fill in the other variables fallowing the pattern.
data want ;
set have;
if not missing(cname) then do;
_cname=cname;
_id = id;
...
end;
else do;
cname=_cname;
id=_id;
....
end;
retain _cname _id ...;
drop _cname _id ...;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.