BookmarkSubscribeRSS Feed
jimmychoi
Obsidian | Level 7

Hi,

 

I've got a dataset that contains the name of the company,  ipc code (a classification code of a patent), and the year it was obtained, which looks likes this (forgive me it's not written in SAS code):

pnyearipc_newassignee_new
1997A61DApple Inc
1997A61FApple Inc
1997A32DHUAWEI
1998C37FApple Inc
1998D24AHUAWEI

 

now, what I wanna do is two things:

 First, merging multiple records to a single record which were obtained in the same year, by the same company. (each IPC code separated with semicolon)

Second, I want IPC code to be accumulated per a company, which means the field IPC code should have all IPC codes obtained  in previous years.

thus, desired output would be like this:

pnyearipc_newassignee_new
1997A61D;A61FApple Inc
1997A32DHUAWEI
1998A61D;A61F;C37FApple Inc
1998A32D;D24AHUAWEI

 

dataset itself here, looks pretty simple, but in my realworld it has over 4million records and the size of CSV file is about 300 mb big.

Thanks always.

8 REPLIES 8
novinosrin
Tourmaline | Level 20

data have;
input pnyear	ipc_new	$ assignee_new & $ 20.;
cards;
1997	A61D	Apple Inc
1997	A61F	Apple Inc
1997	A32D	HUAWEI
1998	C37F	Apple Inc
1998	D24A	HUAWEI
;

proc sort data=have out=_have;
by assignee_new pnyear;
run;

data want;
set _have;
by assignee_new pnyear;
length want $50;
retain want;
if first.assignee_new then want=ipc_new;
else want=catx(';',want,ipc_new);
if last.pnyear;
run;
jimmychoi
Obsidian | Level 7

Thnx novinosrin, you're always outstanding. but as RW9 and andreas wrote in his reply, the length of the variable "want" would increase, especially when there are lots of records. and is there a way to concatenate only DISTINCT ipc codes?

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Its as simple as a retain (though bear in mind a variable has a maximum length.  However, I would advise against doing such a thing.  First you will hit variable length maximums if you have lots of data.  Second, it is never a good idea to put multiple data items in one variable, it just makes coding a lot harder for no benefit.  Only concatenate data if its needed for a report, otherwise keep data separate.  Here is an example:

data want;
  set have;
  by pnyear assignee_new;
  length ipc_all $2000;
  retain ipc_all;
  if first.assignee_new then ipc_all=ipc_new;
  else ipc_all=catx(";",ipc_all,ipc_new);
  if last.assignee_new then output;
run;

That will concatenate all in the by groups.  If you need distinct, then use an index() in the if before concatenating.

 

Now the same thing, with items separate - which is far better to work with:

proc transpose data=have out=want;
  by pnyear assignee_new;
  var ipc_new;
run;

So much simpler.

jimmychoi
Obsidian | Level 7
Hi, RW9 thanks for advising, honestly I haven't thought about the length of concatenated data would hit the maximum. and I'm not gonna manage the entire data that way. Like you said, it is meant to be used in a report.

the part that you talked about using index for retrieving distinct observations, which I'm needing right now, I've googled index and it says "Searches a character expression for a string of characters, and returns the position of the string's first character for the first occurrence of the string" how does it have something to do with it? can you please elaborate more?
RW9
Diamond | Level 26 RW9
Diamond | Level 26

So how are you planning to "report" a 441*5 character string, bearing in mind that it wont fit on a page?  I would suggest working how the output will work first, then looking at the data as it may not be sensible to report like that.

andreas_lds
Jade | Level 19

@jimmychoi wrote:
dataset itself here, looks pretty simple, but in my realworld it has over 4million records and the size of CSV file is about 300 mb big.

Thanks always.


 

Do you know how many ipc_new values are in the data for each year and assignee? You will need to know this to set the length of the combined ipc_new-values variable.

 

Try

proc sql;
   select max(num_elements) as max_elements from (
      select count(ipc_new) as num_elements
         from have
            group by pnyear, assignee_new
      )
   ;
quit;

Multiply the values displayed by 5 if ipc_new is always four chars long (+1 for the delimiter).

jimmychoi
Obsidian | Level 7
it gives me 441 for max_elements, does that mean the length of the variable must be, at least 5 times bigger? (ipc codes are always 4 chars long) then I should assign 2500 for the length of the variable. Thanks!
andreas_lds
Jade | Level 19
Exactly, with this input-file you need 2500 chars,next time maybe more or less.
Writing from.my mobile, so the following code will look ugly, is untested:



proc sql noprint;
select max(num_elements) as max_elements
Into :max_elements trimmed
from (
select count(ipc_new) as num_elements
from have
group by pnyear, assignee_new
)
;
quit;

Then in the data-step:
Length var $ %eval(&max_elements * 5);

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 725 views
  • 1 like
  • 4 in conversation