Hi,
I've got a dataset that contains the name of the company, ipc code (a classification code of a patent), and the year it was obtained, which looks likes this (forgive me it's not written in SAS code):
pnyear | ipc_new | assignee_new |
1997 | A61D | Apple Inc |
1997 | A61F | Apple Inc |
1997 | A32D | HUAWEI |
1998 | C37F | Apple Inc |
1998 | D24A | HUAWEI |
now, what I wanna do is two things:
First, merging multiple records to a single record which were obtained in the same year, by the same company. (each IPC code separated with semicolon)
Second, I want IPC code to be accumulated per a company, which means the field IPC code should have all IPC codes obtained in previous years.
thus, desired output would be like this:
pnyear | ipc_new | assignee_new |
1997 | A61D;A61F | Apple Inc |
1997 | A32D | HUAWEI |
1998 | A61D;A61F;C37F | Apple Inc |
1998 | A32D;D24A | HUAWEI |
dataset itself here, looks pretty simple, but in my realworld it has over 4million records and the size of CSV file is about 300 mb big.
Thanks always.
data have;
input pnyear ipc_new $ assignee_new & $ 20.;
cards;
1997 A61D Apple Inc
1997 A61F Apple Inc
1997 A32D HUAWEI
1998 C37F Apple Inc
1998 D24A HUAWEI
;
proc sort data=have out=_have;
by assignee_new pnyear;
run;
data want;
set _have;
by assignee_new pnyear;
length want $50;
retain want;
if first.assignee_new then want=ipc_new;
else want=catx(';',want,ipc_new);
if last.pnyear;
run;
Thnx novinosrin, you're always outstanding. but as RW9 and andreas wrote in his reply, the length of the variable "want" would increase, especially when there are lots of records. and is there a way to concatenate only DISTINCT ipc codes?
Its as simple as a retain (though bear in mind a variable has a maximum length. However, I would advise against doing such a thing. First you will hit variable length maximums if you have lots of data. Second, it is never a good idea to put multiple data items in one variable, it just makes coding a lot harder for no benefit. Only concatenate data if its needed for a report, otherwise keep data separate. Here is an example:
data want; set have; by pnyear assignee_new; length ipc_all $2000; retain ipc_all; if first.assignee_new then ipc_all=ipc_new; else ipc_all=catx(";",ipc_all,ipc_new); if last.assignee_new then output; run;
That will concatenate all in the by groups. If you need distinct, then use an index() in the if before concatenating.
Now the same thing, with items separate - which is far better to work with:
proc transpose data=have out=want; by pnyear assignee_new; var ipc_new; run;
So much simpler.
So how are you planning to "report" a 441*5 character string, bearing in mind that it wont fit on a page? I would suggest working how the output will work first, then looking at the data as it may not be sensible to report like that.
@jimmychoi wrote:
dataset itself here, looks pretty simple, but in my realworld it has over 4million records and the size of CSV file is about 300 mb big.Thanks always.
Do you know how many ipc_new values are in the data for each year and assignee? You will need to know this to set the length of the combined ipc_new-values variable.
Try
proc sql;
select max(num_elements) as max_elements from (
select count(ipc_new) as num_elements
from have
group by pnyear, assignee_new
)
;
quit;
Multiply the values displayed by 5 if ipc_new is always four chars long (+1 for the delimiter).
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.