Re: Reading in multiple row input data in single observation/row

amits · Posted 01-19-2011 12:45 AM

I've a raw data in multiple rows which I want to read in single row. Luckily we have an identifier related to each record. Here is the same data:
ACTION 1 DS PE 10043 2010 0 12/01/2011 Sale of Goods & Services
PARTY 1 Claim Corporation VISITOR GUIDE PTY LTD ~ N/A ~ VISITOR GUIDE PTY LTD C/- CHARLES & CO. "UPSTAIRS, SUITE 2, 285 LORD STREET" AUCKLAND 60000
PARTY 1 Defendant Corporation LOCAL PTY LTD ~ N/A ~ 1/28 BAY STREET HEADS 24800
PARTY 1 Defendant Individual STEBBINGS FINCH LEE 30 BURKE STREET LOWER 45210
ACTION 2 DS PE 10303 2010 0 12/01/2011 Sale of Goods & Services
PARTY 2 Claimant Corporation ROHA PTY LTD ~ N/A ~ 196-234 ALBANY RD NEW YORK 32000
PARTY 2 Defendant Individual ARCHER DAVID 7/169 WEST COAST HIGHWAY LONDON 60019

I want to have all PARTY details to be in same row as ACTION row(Second column is an identifier)

Any suggestions

Ksharp · Posted 01-19-2011 01:38 AM

Hi.
What do you want your output to look like. not clarify.

[pre]
data temp;
input part1 $ part2 part3 & $2000.;
datalines;
ACTION 1 DS PE 10043 2010 0 12/01/2011 Sale of Goods & Services
PARTY 1 Claim Corporation VISITOR GUIDE PTY LTD ~ N/A ~ VISITOR GUIDE PTY LTD C/- CHARLES & CO. "UPSTAIRS, SUITE 2, 285 LORD STREET" AUCKLAND 60000
PARTY 1 Defendant Corporation LOCAL PTY LTD ~ N/A ~ 1/28 BAY STREET HEADS 24800
PARTY 1 Defendant Individual STEBBINGS FINCH LEE 30 BURKE STREET LOWER 45210
ACTION 2 DS PE 10303 2010 0 12/01/2011 Sale of Goods & Services
PARTY 2 Claimant Corporation ROHA PTY LTD ~ N/A ~ 196-234 ALBANY RD NEW YORK 32000
PARTY 2 Defendant Individual ARCHER DAVID 7/169 WEST COAST HIGHWAY LONDON 60019
run;

data result(where =(_part1 is not missing));
set temp end=last;
length whole $ 2000;
retain whole _part1 _part2;
if part2 ne lag(part2) then do; output; call missing(whole); end;
if upcase(part1) eq 'ACTION' then do; _part1=part1; _part2=part2; end;
whole=catx(' ',whole,part3);
if last then output;
keep _part1 _part2 whole;
run;

[/pre]

Ksharp

amits · Posted 01-19-2011 05:38 PM

Hi KSharp,

I want my output to be like this:

Row1 - ACTION 1 DS PE 10043 2010 0 12/01/2011 Sale of Goods & Services PARTY 1 Claim Corporation VISITOR GUIDE PTY LTD ~ N/A ~ VISITOR GUIDE PTY LTD C/- CHARLES & CO. "UPSTAIRS, SUITE 2, 285 LORD STREET" AUCKLAND 60000 PARTY 1 Defendant Corporation LOCAL PTY LTD ~ N/A ~ 1/28 BAY STREET HEADS 24800 PARTY 1 Defendant Individual STEBBINGS FINCH LEE 30 BURKE STREET LOWER 45210

Row2 - ACTION 2 DS PE 10303 2010 0 12/01/2011 Sale of Goods & Services PARTY 2 Claimant Corporation ROHA PTY LTD ~ N/A ~ 196-234 ALBANY RD NEW YORK 32000 PARTY 2 Defendant Individual ARCHER DAVID 7/169 WEST COAST HIGHWAY LONDON 60019

Ksharp · Posted 01-20-2011 03:40 AM

OK.
That would be more easy.

[pre]
data temp;
infile datalines length=len ;
input whole $varying500. len;
datalines;
ACTION 1 DS PE 10043 2010 0 12/01/2011 Sale of Goods & Services
PARTY 1 Claim Corporation VISITOR GUIDE PTY LTD ~ N/A ~ VISITOR GUIDE PTY LTD C/- CHARLES & CO. "UPSTAIRS, SUITE 2, 285 LORD STREET" AUCKLAND 60000
PARTY 1 Defendant Corporation LOCAL PTY LTD ~ N/A ~ 1/28 BAY STREET HEADS 24800
PARTY 1 Defendant Individual STEBBINGS FINCH LEE 30 BURKE STREET LOWER 45210
ACTION 2 DS PE 10303 2010 0 12/01/2011 Sale of Goods & Services
PARTY 2 Claimant Corporation ROHA PTY LTD ~ N/A ~ 196-234 ALBANY RD NEW YORK 32000
PARTY 2 Defendant Individual ARCHER DAVID 7/169 WEST COAST HIGHWAY LONDON 60019
;
run;
data op;
set temp end=last;
length all $ 2000;
retain pre_id all;
if scan(whole,2,' ') ne pre_id and _n_ ge 2 then do;
output;
call missing(all);
end;
all=catx(' ',all,whole); pre_id = scan(whole,2,' ');
if last then output;
keep all;
run;
[/pre]

Ksharp

Cynthia_sas · Posted 01-20-2011 11:53 AM

Hi:
I understand what you're doing, but creating ID as a separate variable and then using FIRST.ID and LAST.ID seems to me to simplify the program. There's only 1 OUTPUT statement, no need for checking for LAST= variable. You can always drop the ID variable if it is not wanted, but I can see a place where being able to select rows by ID variable might come in handy.

cynthia
** alternate approach using FIRST. and LAST. processing;
[pre]
** CREATE ID and ORIGORD variables for sorting and subsequent processing;
data temp;
infile datalines length=len ;
input whole $varying500. len;
id = input(scan(whole,2,' '),best8.);
origord = _n_;
return;
datalines4;
ACTION 1 DS PE 10043 2010 0 12/01/2011 Sale of Goods & Services
PARTY 1 Claim Corporation VISITOR GUIDE PTY LTD ~ N/A ~ VISITOR GUIDE PTY LTD C/- CHARLES & CO. "UPSTAIRS, SUITE 2, 285 LORD STREET" AUCKLAND 60000
PARTY 1 Defendant Corporation LOCAL PTY LTD ~ N/A ~ 1/28 BAY STREET HEADS 24800
PARTY 1 Defendant Individual STEBBINGS FINCH LEE 30 BURKE STREET LOWER 45210
ACTION 2 DS PE 10303 2010 0 12/01/2011 Sale of Goods & Services
PARTY 2 Claimant Corporation ROHA PTY LTD ~ N/A ~ 196-234 ALBANY RD NEW YORK 32000
PARTY 2 Defendant Individual ARCHER DAVID 7/169 WEST COAST HIGHWAY LONDON 60019
;;;;
run;

proc sort data=temp;
by id origord;
run;

data op;
set temp; by id;
length all $ 2000;
retain all;
if first.id then do;
** set/reset all to blank;
all = ' ';
end;

** always concatenate the new value of WHOLE to ALL;
all = catx(' ',all,whole);

** for every LAST.ID, do an explicit OUTPUT;
if last.id then do;
output;
end;

** keep ID and ALL -- could only keep ALL if you wanted;
keep id all;
run;

ods listing close;

ods html file='c:\temp\concat_by_id.html' style=sasweb;
proc print data=op;
var id all;
run;
ods html close;
[/pre]

Ksharp · Posted 01-20-2011 11:31 PM

Hi.Cynthia.
I know your code is more simpler, but maybe not more efficient.
Think about it, when you have a very large dataset such as 200G, your code need to sort the dataset firstly,which will waste lots lots of time.
My code is to scan dataset once,as long as the data look just like that.

Regards
Ksharp

Cynthia_sas · Posted 01-21-2011 10:56 AM

In my experience, the qualification you made "as long as the data look just like that" is a rare condition when receiving raw data from people.

If the data are guaranteed to be already in order by ID, then there is no need to sort -- you could remove the PROC SORT step from my code and just add the NOTSORTED option to the DATA step BY statement. That would allow you to use FIRST.ID and LAST.ID without needing to SORT. But, depending on other people to send you data that is ALWAYS in order is not a best practice, in my mind, for an on-going, production type system. So generally, I put a sort into my code because I consider it to be a best practice, because then I am not relying on anyone else. After a sort, I know that the data is in order. And, unless we were talking about millions of records on a somehow constrained system (which we don't know), the sort isn't, by itself, going to cause the program to be less efficient.

The OP did not say whether this was a one-time conversion of the multi-line input data to a SAS dataset or whether this was an on-going (daily, weekly, monthly) add of records from one system to another. It is premature, in my opinion, to talk about efficiency until you know the whole picture of the process and all the parameters involved in the requirements, including the sizes of machines involved, the sizes of the various files and the level of data integrity that is required for the tracking of the SAS dataset.

For example, I used to work for lawyers and I had to be ready to go to court and explain my code to non-technical people, if they had questions about how the data files from the "other side" were treated in order for our side's analysis to be done. In that case, efficiency was not the primary requirement. We frequently would run 1 datastep program for each change we wanted to make, with a PROC PRINT before and a PROC PRINT after so we could point to the 1 thing and only the 1 thing that was changed. It was highly inefficient programming, but also highly understandable by lawyers, judges and a jury.

An alternative solution is only that -- an alternative. It is up to the OP to test/benchmark and figure out whether there's even a big enough efficiency gain to offset using simpler code, since programmer maintenance time is also one factor in any "efficiency" equation.

cynthia

Ksharp · Posted 01-23-2011 09:24 PM

Hi.
Cynthia.
There might be another problem.After sorted data, the order of observations would be changed.so then after concatenate rows ,the result maybe is not OP wanted.It is the reason why I did not use proc sort in my code.
In the end,about what you said, I admitted it will avoid to make some mistake that processing data before sort the origin data.

Regards
Ksharp

Cynthia_sas · Posted 01-24-2011 12:34 AM

Yes, that's why I created a variable called ORIGORD as my program read in the data:
[pre]
origord = _n_;
[/pre]

Then I can always ensure that the order of the concatenated rows will conform to the original order of the datalines, within each ID.

cynthia

Patrick · Posted 01-19-2011 01:46 AM

Hi

From a point of data organisation I believe it would be better to store the data in at least 2 tables - one for Parties, one for Actions (possibly even in a 3rd intersection table PartyActions).

Organising the data this way will later on make it quite easy to create whatever datamart you need.

Have a look at the example below. Using Proc Transpose even creates in the end the dataset you're after.

data actions (keep=id ActionString)
parties (keep=id PartyString)
;

infile datalines truncover dlm=' ';
input type:$8. id:8. @;

if type='ACTION' then
do;
input ActionString $200.;
output actions;
end;
else
if type='PARTY' then
do;
input PartyString $200.;
output parties;
end;

datalines4;
ACTION 1 DS PE 10043 2010 0 12/01/2011 Sale of Goods & Services
PARTY 1 Claim Corporation VISITOR GUIDE PTY LTD ~ N/A ~ VISITOR GUIDE PTY LTD C/- CHARLES & CO. "UPSTAIRS, SUITE 2, 285 LORD STREET" AUCKLAND 60000
PARTY 1 Defendant Corporation LOCAL PTY LTD ~ N/A ~ 1/28 BAY STREET HEADS 24800
PARTY 1 Defendant Individual STEBBINGS FINCH LEE 30 BURKE STREET LOWER 45210
ACTION 2 DS PE 10303 2010 0 12/01/2011 Sale of Goods & Services
PARTY 2 Claimant Corporation ROHA PTY LTD ~ N/A ~ 196-234 ALBANY RD NEW YORK 32000
PARTY 2 Defendant Individual ARCHER DAVID 7/169 WEST COAST HIGHWAY LONDON 60019
;;;;
run;

proc sql;
create table ActionsAndParties as
select a.id, a.ActionString, p.PartyString
from actions a ,parties p
where a.id=p.id
order by id
;
quit;

proc transpose data=ActionsAndParties out=ActionsAndParties_OneRow(drop=_name_) prefix=PartyString;
by id ActionString;
var PartyString;
run;

proc print data=ActionsAndParties_OneRow;
run;

HTH
Patrick

amits · Posted 01-20-2011 12:23 AM

Thanks Patrick...it solves my purpose

dhana · Posted 01-24-2011 04:40 AM

Hi,

One more solution

data result(keep=newvar);
set temp;
by part2;
length newvar $ 2000;
retain newvar;
if first.part2 then do;
call missing(newvar);
end;
put part3;
newvar=catx(' ',newvar,part3);
if last.part2 then output;
run;

Thanks

Dhanasekaran R
Allianz Cornhill Information Services
Trivandrum, India

Registration is open

SAS Training: Just a Click Away