Re: Adding same value to variable within groups of rows

Helle · Posted 11-26-2010 08:28 AM

Hi,

I am importing a txt file into SAS and looking only for rows that contain the four types of text strings below ("date", "ini", "type", "case"). Four rows at a time are related to the same case number. The challenge is that the order of type and case is not always the same in the text file and so I would like to sort by case number to obtain the same order in all by-groups (and use proc transpose afterwards). I have extracted the case number into a variable of its own so that the data set has two variables: text and caseno.

text caseno
date: xxx .
ini: hli .
type: l .
case: 123 123
date: yyy .
ini: lbj .
case: 587 587
type: l .
date: zzz .
ini: jcg .
type: t .
case: 789 789

How can I replace the missing values in caseno with the correct case number? I have experimented with do-loops and the retain statement but somehow cannot figure out how to get it right.

Thanks,
Helle

sbb · Posted 11-26-2010 09:47 AM

Share what code you have tried that is not working for a most effective response.

While inputting your external data, yes, a RETAIN (after performing the proper INPUT logic) will capture / retain the "caseno" value through the input process for related follow-on data-records.

But if you must wait until you perform an INPUT for the "next caseno", then you will need to perform an INPUT and also consider testing _INFILE_ before assigning "caseno", so that you can perform an OUTPUT while still having the prior assigned "caseno" variable in your SAS PDV.

Encourage the OP to share what SAS code has been attempted, and do so within a SAS-generated log where you COPY/PASTE into a forum reply post here.

And, make use of this SAS statement in your DATA step for self-diagnosis to verify that whatever INPUT code you are using is working properly (where using "nn" allows you to insert multiple stmts in your DATA step code for uniqueness):

PUTLOG '>DIAG-nn' / _ALL_;

Scott Barry
SBBWorks, Inc.

SPR · Posted 11-26-2010 01:15 PM

Hello Helle,

I do not know if it is an optimal solution but it does the trick:
[pre]
data a;
infile datalines delimiter=':.';
input text $ caseno $;
datalines;
date: xxx .
ini: hli .
type: l .
case: 123 123
date: yyy .
ini: lbj .
case: 587 587
type: l .
date: zzz .
ini: jcg .
type: t .
case: 789 789
;
run;
/* Total number of observations */;
proc SQL;
select COUNT(*) into :n from a;
%let n=%trim(&n);
;quit;
%macro a;
%local i i1;
/* Saves obs by groups of 4 into separate datasets */;
%do i=4 %to &n %by 4;
%let i1=%EVAL(&i-3);
data _a&i;
set a;
if &i1 <= _n_ <= &i;
run;
/* Transposes */;
proc sort data=_a&i;
by text;
run;
proc transpose data=_a&i out=_a&i;
var caseno;
id text;
run;
/* Combines transposed into a result */;
%if &i=4 %then %do; data r; set _a&i; run; %end;
%else %do; data r; set r _a&i; run; %end;
%end;
%mend a;
%a;
[/pre]
Sincerely,
SPR

Cynthia_sas · Posted 11-27-2010 02:08 PM

Hi:
In addition to SPR's approach, there is an alternate approach using the RETAIN and the single trailing @ to hold the input line. This method assumes that there are ALWAYS, ALWAYS 4 rows of data for each case. It does not matter what order the data lines are in -- CASE could be first, INI could be first, DATE could be first. But, for this program to work, there must be 4 data lines for each case.

Basically, the program works because the first 5 characters on each data line determine how the rest of the data line should be read -- whether the line contains a DATE or a TYPE or an INI, etc, etc.

The whole use of the single trailing @ method of input, allows you to read some of a data line and then "hold" the line while the program decides how to read that line.

The program below uses the MOD function to create a variable that can be tested in order to conditionally do the output. In the test data, there are 12 datalines, representing 3 cases. If you divide the counter by 4, then with the MOD function, the remainder will be fuzzed to 0 on every 4th dataline. So the final CASEINFO dataset has only 3 observations. IF you want an observation for every case, that is a different program. IF there could EVER be only 3 datalines or 2 datalines for a CASE, that will present a problem for this program. If that is a data possibility, then this program will not work and you would have to code other logic to read the file.

The program is shown below this post. Also shown is the intermediate SHOWLOGIC output -- that shows how the data values are being retained on every INPUT statement execution and the final CASEINFO file -- which is the one that I think you want to keep and use. On some of your datalines, you showed a '.' -- like a missing value -- I did not bother to read those items, but you can easily change the program to read those if you need to. Also I changed the data to have a real date value instead of xxx, yyy and zzz for date -- since those are character strings and not actually a date.

cynthia
[pre]
*** The program and test data;
data caseinfo
showlogic;
length first $5.;
** retain the variables as they are read;
retain date inival type casechar casenum origord dtext itext ttext ctext;
** do not think you need dtext, itext, ttext, ctext;

infile datalines dlm=' ';

** read the input line and hold it in order to test what type;
** of data line is being read. Use the first 5 characters on the line;
** to test how to read the line.;
input @1 first $ @;

** Capture the original order into a variable;
origord = _n_;

** divide the original order by 4 when NUM=0, we have read the 4th;
** consecutive data line -- this logic assumes that the text file ;
** ALWAYS has 4 datalines for every case.;
num = mod(origord,4);

** upcase FIRST to control the comparison and not worry about whether;
** it is mixed case or lower case.;
first = upcase(first);

if first =: 'DATE:' then do;
** read DATE;
input @1 dtext $5. date : mmddyy10. ;
end;
else if first =: 'INI: ' then do;
** read INIVAL;
input @1 itext $5. inival $ ;
end;
else if first =: 'TYPE:' then do;
** read TYPE:;
input @1 ttext $5. type $ ;
end;
else if first =: 'CASE:' then do;
** read CASE as character val and numeric val;
** not sure why data line had 2 values for case number;
input @1 ctext $5. casechar $ casenum;
end;
output showlogic;
if num = 0 then do;
** if num=0, then we are on the 4th data row, so now;
** do an output of all the retained values.;
** this will result in 1 output for every 4 datalines;
output caseinfo;

** reset the retained variables;
date =.;
casenum =.;
casechar = ' ';
type=' ';
inival = ' ';
dtext = ' ';
itext = ' ';
ttext = ' ';
ctext = ' ';
end;
return;
datalines;
date: 11/27/2010 .
ini: hli .
type: l .
case: 123 123
date: 11/28/2010 .
ini: lbj .
case: 587 587
type: l .
date: 11/29/2010 .
ini: jcg .
type: t .
case: 789 789
;
run;

ods listing;
options nodate nonumber nocenter ls=120;
proc print data=caseinfo;
title 'Final Case file';
var origord num casechar casenum date inival type dtext itext ttext ctext;
format date date9.;
run;

proc print data=showlogic;
title 'This file shows logic of using NUM and RETAIN';
title2 'Note how observation is being built by each INPUT stmt';
format date date9.;
var origord num first dtext date itext inival ttext type ctext casechar casenum;
run;

title;
[/pre]

The SHOWLOGIC file of 12 observations:
[pre]
This file shows logic of using NUM and RETAIN
Note how observation is being built by each INPUT stmt

Obs origord num first dtext date itext inival ttext type ctext casechar casenum

1 1 1 DATE: date: 27NOV2010 .
2 2 2 INI: date: 27NOV2010 ini: hli .
3 3 3 TYPE: date: 27NOV2010 ini: hli type: l .
4 4 0 CASE: date: 27NOV2010 ini: hli type: l case: 123 123
5 5 1 DATE: date: 28NOV2010 .
6 6 2 INI: date: 28NOV2010 ini: lbj .
7 7 3 CASE: date: 28NOV2010 ini: lbj case: 587 587
8 8 0 TYPE: date: 28NOV2010 ini: lbj type: l case: 587 587
9 9 1 DATE: date: 29NOV2010 .
10 10 2 INI: date: 29NOV2010 ini: jcg .
11 11 3 TYPE: date: 29NOV2010 ini: jcg type: t .
12 12 0 CASE: date: 29NOV2010 ini: jcg type: t case: 789 789
[/pre]

The final CASEINFO file of 3 observations:
[pre]
Final Case file

Obs origord num casechar casenum date inival type dtext itext ttext ctext

1 4 0 123 123 27NOV2010 hli l date: ini: type: case:
2 8 0 587 587 28NOV2010 lbj l date: ini: type: case:
3 12 0 789 789 29NOV2010 jcg t date: ini: type: case:

[/pre]

Ksharp · Posted 11-29-2010 04:17 AM

What your txt file looks like exactly?
And what you want dataset to look like?
Your data is ambiguous.

Helle · Posted 11-29-2010 08:28 AM

Hi,

Thanks for your input. SPR: I understand the logic of your suggestion but unfortunately processing my more than 33,000 observations in that way made SAS come to a complete standstill. Cynthia: I modified your suggestion and made it work on the dataset that I had already created containing only the four observations per case that I needed. The original txt file had different numbers of rows per case so I had to specifically select strings matching the four categories that I was looking for.

Here is the code that I ended up with:
data showlogic
caseinfo (keep=dato inival type casechar);
retain dato inival type casechar origord dtext itext ttext ctext;
set cases;
origord = _n_;
num = mod(origord,4);
if text1 =: 'DATE:' then do;
dtext=text1;
dato=text2;
end;
else if text1 =: 'INIT:' then do;
itext=text1;
inival=text2 ;
end;
else if text1 =: 'LOAN:' then do;
ttext=text1;
type=text2;
end;
else if text1 =: 'CASE:' then do;
ctext=text1;
casechar=text2;
end;
output showlogic;
if num = 0 then do;
output caseinfo;

** reset the retained variables;
dato =.;
casechar = ' ';
type=' ';
inival = ' ';
dtext = ' ';
itext = ' ';
ttext = ' ';
ctext = ' ';
end;
return;
run;

Thanks again,

Helle

SPR · Posted 11-29-2010 09:38 AM

Hello Helle,

I do not know you system parameters of course, but it seems to me you underestimate SAS performance. From my experience 33000 rows is nothing to SAS.

Sincerely,
SPR

chang_y_chung_hotmail_com · Posted 12-07-2010 03:28 PM

...
> I am importing a txt file into SAS and looking only
> for rows that contain the four types of text strings
> below ("date", "ini", "type", "case"). Four rows at a
> time are related to the same case number. The
> challenge is that the order of type and case is not
> always the same in the text file and so I would like
> to sort by case number to obtain the same order in
> all by-groups (and use proc transpose afterwards).
...
If I understand correctly, then you want to read from a text file with only the rows that have all of these four keyword: value pairs. Only the problem is that the orders of type and case change. No problem. Use absolute column pointer controls. Hope this helps a bit.
[pre]
data cases;
length date ini type $8;
input @1 @("date:") date $
@1 @("ini:") ini $
@1 @("case:") case
@1 @("type:") type $;
/* all four should not be missing */
if not missing(date) & not missing(ini) &
not missing(case) & not missing(type) then
output;
cards;
date:xxx ini:hli type:l case:123
date:yyy ini:lbj case:587 type:l
date:zzz ini:jcg type:t case:789
;
run;

/* check */
proc print data=cases;
run;
/* on log
Obs date ini type case
1 xxx hli l 123
2 yyy lbj l 587
3 zzz jcg t 789
*/
[/pre]

Classroom Training Available!