Solved: Re: Merge of two datasets many many records

michtka · Posted 05-11-2013 01:25 PM

Dear all,

anyone can help with this, the idea is create a variable label, with the SOC variable nested to the PTERM variable, and sorting outSOC and PTERM by descending frequency:, the idea is this:

The idea is a variable LABEL taking account:

SOC (Family) (High frequency) GASTRO 41

pterm (subfamily) (High frequency) constipation 16

nausea 14

abdominal disconfort 12

METABOLISM

.

.pterm(low frequency)

SOC (Low Frequency)

pterm (subfamily) (High frequency)

I think need to be something like merge many many, i try with join FULL but i ma missing something here.

Thanks in advance.

.

.pterm(low frequency)

data have1;

length soc pterm $20. npterm nsoc 8.;

input soc $ pterm $ npterm nsoc;

cards;

cardiacdisorder acute 6 21

cardiacdisorder angina 7 21

cardiacdisorder coronary 8 21

gastro constipation 16 16

investigations headache 3 9

investigations backpain 4 9

investigations leftpain 2 9

eye visionblurred 1 1

;

run;

proc sort data=have1;by descending nsoc descending npterm;run;

data want1;

set have1;

by descending nsoc descending npterm;

if first.nsoc then do;

label=upcase(soc);

output;

end;

if first.npterm then do;

label=pterm;

output;

end;

run;

data have2;

length soc pterm $20. npterm2 nsoc2 8.;

input soc $ pterm $ npterm2 nsoc2;

cards;

gastro abdominalpain 15 41

gastro abdominaldisconfort 12 41

gastro nausea 14 41

eye eyepain 5 9

eye ocularpathy 4 9

Metabolism dehydratation 20 39

metabolism decreasedapetite 19 39

cardiacdisorder miocardio 6 8

investigations neckpain 1 1

run;

proc sort data=have2;by descending nsoc2 descending npterm2;run;

data want2;

set have2;

by descending nsoc2 descending npterm2;

if first.nsoc2 then do;

label=upcase(soc);

output;

end;

if first.npterm2 then do;

label=pterm;

output;

end;

run;

proc sql noprint;

create table wanted as

select a.*,b.nsoc2,b.npterm2

from want1 as a full join want2 as b

on a.label=b.label

order by nsoc;

quit;

Tom · Posted 05-12-2013 09:16 AM

I think you might still have issues with combining this data. What do you want to do with preferred terms that are in both source files?

Let's assume that you never have that case, or if you do that you want to keep both records.

So combine the files FIRST, then find the maximum NSOC so that you can re-order it and apply your rules for generating the "label" variable.

Something like this:

proc sql noprint ;

create table report1 as

select

max(nsoc) as nsoc

,npterm

,soc

,pterm

from (select * from have1 union select * from have2)

group by soc

order by 1 desc, 2 desc

;

quit;

data want;

set report1 ;

by descending nsoc ;

if first.nsoc then do;

label=soc; count=nsoc; output;

end;

label=pterm; count=npterm; output;

run;

proc print;

var label count;

run;

View solution in original post

Tom · Posted 05-11-2013 03:22 PM

Not sure I understand the question. But why are you processing by the COUNTS?

data want1;
set have1;
by descending nsoc descending npterm;
if first.nsoc then do;
label=upcase(soc);
output;
end;
if first.npterm then do;
label=pterm;
output;
end;
run;

Do you really want to combine terms just because they had the same number of occurrences?

michtka · Posted 05-11-2013 06:45 PM

Hi Tom, because I need to, Alphabetically is the easy one, but I need to do it in the descending frequency in SOC and PTERM...

I need the sorting out in descending frequency for SOC and PTERM, to finally after merging the two datasets (have1, and have2), to get something like that:

GASTRO 41

constipation 16

abdominalpain 15 *from dataset have2

nausea 14 *from dataset have2

abdominaldisconfort 12 *from dataset have2

METABOLISM 39 *from dataset have 2

dehydratation 20 *from dataset have2

decreasedapetite 19 *from dataset have2

CARDIACDISORDER 21

coronary 8

angina 7

acute 6

miocardio 6 * from dataset have2

Cheers, and thank you

Tom · Posted 05-12-2013 09:16 AM

I think you might still have issues with combining this data. What do you want to do with preferred terms that are in both source files?

Let's assume that you never have that case, or if you do that you want to keep both records.

So combine the files FIRST, then find the maximum NSOC so that you can re-order it and apply your rules for generating the "label" variable.

Something like this:

proc sql noprint ;

create table report1 as

select

max(nsoc) as nsoc

,npterm

,soc

,pterm

from (select * from have1 union select * from have2)

group by soc

order by 1 desc, 2 desc

;

quit;

data want;

set report1 ;

by descending nsoc ;

if first.nsoc then do;

label=soc; count=nsoc; output;

end;

label=pterm; count=npterm; output;

run;

proc print;

var label count;

run;

michtka · Posted 05-12-2013 04:24 PM

Thank you for this Tom, but...like you say in the email, What do you want to do with preferred terms that are in both source files?

This is the problem I really I am interested:

If we twist a bit have1 (soc,pterm,npterm,nsoc), and have2 (soc,pterm,npterm2,nsoc2), where we have GASTRO(constipation) and INVESTIGATIONS(lefpain) in the two datasets, i am interesting to a final dataset with (label,nsoc,npterm,nsoc2,npterm2).

Culd you help me with this?

data have1;

length soc pterm $20. npterm nsoc 8.;

input soc $ pterm $ npterm nsoc;

cards;

cardiacdisorder acute 6 21

cardiacdisorder angina 7 21

cardiacdisorder coronary 8 21

gastro constipation 16 16

investigations headache 3 9

investigations backpain 4 9

investigations leftpain 2 9

eye visionblurred 1 6

eye nausea 5 6

;

run;

data have2;

length soc pterm $20. npterm2 nsoc2 8.;

input soc $ pterm $ npterm2 nsoc2;

cards;

gastro abdominalpain 15 41

gastro abdominaldisconfort 12 41

gastro nausea 14 41

gastro constipation 18 18

eye eyepain 5 9

eye ocularpathy 4 9

metabolism dehydratation 20 39

metabolism decreasedapetite 19 39

cardiacdisorder miocardio 6 6

investigations neckpain 1 4

investigations leftpain 3 4

;

run;

Finally, this is the kind of final dataset I want to get:

I want to keep label, np1 and np2:

label np1 np2

GASTRO 16 59

constipations 1 6 18

abdominalpain 15

nausea 14

abdominaldisconfort 12

INVESTIGATIONS 9 4

backpain 4

headache 3

leftpain 2 3

neckpain 1

Thank you.

michtka · Posted 05-12-2013 04:25 PM

sorry this is wrong: I am interesting to a final dataset with (label,nsoc,npterm,nsoc2,npterm2)...I am interesting in the above dataset, label, np1 and np2

michtka · Posted 05-12-2013 06:40 PM

sorry Tom...it was an easy twist...Thank you for all your help.

proc sql noprint ;

create table report1 as

select

max(nsoc) as nsoc

,max(nsoc2) as nsoc2

,npterm

,npterm2

,soc

,pterm

from temp

group by soc,pterm

order by 1 desc, 2 desc

;

quit;

data want (keep=label count1 count2);

set report1 ;

by descending nsoc ;

if first.nsoc then do;

label=soc; count1=nsoc;count2=nsoc2; output;

end;

label=pterm; count1=npterm;count2=npterm2; output;

run;

proc print data=want;

var label count1 count2;

run;

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away