## Collapse one dataset based on values in another

Solved
Frequent Contributor
Posts: 101

# Collapse one dataset based on values in another

[ Edited ]

I have a feeling that there is a relatively straightforward way to solve this problem that I am just not thinking of. Naturally, my real datasets have thousands of rows and hundreds of columns, but here is a minimal example to show my problem:

Let's say I have a dataset structured as follows:

DATA features;
input unique_id \$ f1 f2 f3 f4 f5 f6 f7 f8 f9 f10;
datalines;
R1C1 0 0 0 0 0 0 0 0 1 1
R1C2 0 0 0 0 0 1 0 0 1 0
R2C4 0 0 1 1 0 0 0 0 0 1
R5C8 0 0 1 1 0 0 0 0 0 0
R7C2 0 0 1 0 1 0 0 0 0 0
R9C1 0 0 0 0 0 0 0 0 1 1
;
run;

Where unique_id is, well, an arbitrary identifier that uniquely distinguishes each observation. Each observation is characterized by a number of binary features. I have a different dataset that defines clusters of observations over which I want to collapse. Now, currently this other dataset is structured as follows:

DATA clusters;
input unique_id \$ m1 \$  m2 \$ ;
datalines;
R1C1 R1C2 R9C1
R1C2 R9C1 .
R2C4 R5C8 R7C2
R5C8 R7C2 .
;
run;

So there is some redundancy in the way the dataset is structured; you see that R1C1 has been "matched" with R1C2 and R9C1, and subsequently R1C2 has been "matched" with R9C1, and so on. Essentially, what this is telling me is that there is one cluster (call it Cluster1) composed of the unique IDs R1C1, R1C2, and R9C1, and another (call it Cluster2) composed of the unique IDs R2C4, R5C8, and R7C2. Note that one thing I CANNOT do is simply remove rows with missing values to remove the redunancy between rows 1 and 2 or 3 and 4, here; in my real dataset there will be rows with missing values that are unique clusters of observations.

What I want to do is "collapse" the features dataset based on the groupings in the clusters dataset, such that I get a dataset that looks like this:

DATA collapse;
input cluster \$ f1 f2 f3 f4 f5 f6 f7 f8 f9 f10;
datalines;
Cluster1 0 0 0 0 0 0.333 0 0 1 1
Cluster2 0 0 1 0.667 0.333 0 0 0 0 0.333
;
run;

So it is structured similarly to the features dataset, but collapsed so that each row is a unique cluster, and the values for the features correspond to a measure of the frequency with which that feature appeared in that cluster. For now, I am just using the mean within that cluster for that feature (e.g. so you see that feature 3 had a value of 1 for every unique_id in cluster 2, so the value for cluster 2 is 1; while feature 4 had a value of 1 for two of those unique_ids and a value of 0 for the other, so the value for clusters 2 is 0.667, etc.).

What's an efficient way to go about performing this sort of data manipulation? The only idea I could think of seems to me to be super inefficient (and I'm not 100% sure it will work): I could loop through each row of the clusters dataset and create a separate macro variable for each row that contains the values of the columns in that row, then do another loop that would eliminate macro variables that overlap and share the same values, then somehow use these macros in a series of PROC SQL or DATA step WHERE statements. But this seems like a lot of extra steps to get there.

The example I presented is pretty simple, but my real clusters dataset is a little more complex, in that the number of unique IDs is NOT the same for each cluster. That is, some clusters have as many as 100 unique ids associated with them while others have as few as 5. So there are a varying number of missing values for different rows, with the same redunancy issue I already mentioned.

Accepted Solutions
Solution
‎10-16-2015 10:34 AM
Posts: 5,543

## Re: Collapse one dataset based on values in another

[ Edited ]

The only problematic task here is finding the clusters. If you have SAS/OR, it can be done with proc optnet. Otherwise, you can resort to the SubGraphs macro available here. Here's how for your problem

DATA features;
input unique_id \$ f1 f2 f3 f4 f5 f6 f7 f8 f9 f10;
datalines;
R1C1 0 0 0 0 0 0 0 0 1 1
R1C2 0 0 0 0 0 1 0 0 1 0
R2C4 0 0 1 1 0 0 0 0 0 1
R5C8 0 0 1 1 0 0 0 0 0 0
R7C2 0 0 1 0 1 0 0 0 0 0
R9C1 0 0 0 0 0 0 0 0 1 1
;

proc transpose data=features out=featureList;
by unique_id;
var f:;
run;

DATA clusters;
input unique_id \$ m1 \$  m2 \$ ;
datalines;
R1C1 R1C2 R9C1
R1C2 R9C1 .
R2C4 R5C8 R7C2
R5C8 R7C2 .
;

data arcs;
set clusters;
from = unique_id; to = m1; output;
if not missing(m2) then do;
from = unique_id; to = m2; output;
end;
keep from to;
run;

concomp;
run;

/* Or, if you don't have accerss to SAS/OR */
/*
%include "&sasforum.\subgraphs.sas";

%SubGraphs(arcs,from=from,to=to,out=clust);
*/

proc sql;
create table collapseList as
select b.clust, a._NAME_ as var, mean(a.COL1) as value
from featureList as a inner join clust as b
on a.unique_id=b.node
group by b.clust, a._NAME_;
quit;

proc transpose data=collapseList out=collapse(drop = _: );
by clust;
var value;
id var;
run;
PG

All Replies
Super User
Posts: 5,888

## Re: Collapse one dataset based on values in another

If you are forced to use macro programming for matching and calculation purposes between two data sets, it's signal that your data structure is not query friendly.

I'm sure there's a reason for you to have a lot of binary variables (data mining type processing?).

But for data manipulation programming it's not very convenient.

First, create the cluster variable in your second data set. You could use this by a simple data step: clusterid = Cats('Cluster',_n_);

Second, I would transpose your binary data to:

unique_id fvar fval

R1C1 f1 0

R1C1 f2 0

..

R1C2 f1 0

..

Then, you could quite easily join the two data sets using SQL, and use the mean() aggregate function on a single column fval.

If you like, you simply transpose the result back to your "wide" analytic base table.

Data never sleeps
Frequent Contributor
Posts: 101

## Re: Collapse one dataset based on values in another

[ Edited ]

Thank you for your suggestion. But I am a bit confused as to how this would help.

First of all, this doesn't address the redundancy in the clusters dataset. This would give every row of that dataset a unique cluster ID, but as I explained, rows 1 and 2 aren't separate, they are defining the same cluster. Same with rows 3 and 4. That's an important part of my problem, and not one that can easily be ignored. I would need some way of eliminating this redundancy.

Second, how does the transposed version of the features dataset help? Assuming I understand you correctly, you want me to use SQL to create a dataset that looks like:

unique_id fvar fval cluster id

R1C1 f1 0 Cluster1

R1C1 f2 0 Cluster1

..

R1C2 f10 0 Cluster1

R2C4 f1 0 Cluster2

...

And then use the mean function on the column "fval". But with this long version of the data I would need nested loops in order to calculate the means I want, because I need to loop through clusters AND through fvars. Whereas with the wide version of the data I could simply calculate the means of all the fvars simultaneously with an array.

So you will have to explain to me exactly why you think this transposition actually helps? I only see it as making the calculations MORE difficult, and it still doesn't address the primary issue, which is the actual matching of the clusters itself, which you don't even address. Why is the data better in long form than in wide, here, for the purposes of matching the two datasets and calculating the means?

Super User
Posts: 6,785

## Re: Collapse one dataset based on values in another

One possibility might be to create a multi-level format based on the CLUSTERS data set.  Each UNIQUE_ID would translate into every one of the clusters that it belongs to.  Then apply that format when summarizing.

Without seeing all the analyses that you might want to conduct, however, I would opt for a different approach.  You'll need storage space to hold multiple observations for each UNIQUE_ID, one for each cluster that it belongs to.  This would be one path:

data all_combos;

set clusters;

array match {2} m1 m2;

cluster = "Cluster" || put(_n_, z3.);

output;

do _n_=1 to dim(match);

unique_id = match{_n_};

if unique_id not in ('.', ' ') then output;

end;

keep cluster unique_id;

run;

From that point, it's just a sort and merge:

proc sort data=all_combos;

by unique_id;

run;

proc sort data=features;

by unique_id;

run;

data want;

merge features all_combos;

by unique_id;

run;

It takes more space.  It uses some processing time to get there.  But it should be easy to use for any type of analysis going forward.

Good luck.

Frequent Contributor
Posts: 101

## Re: Collapse one dataset based on values in another

Thank you the suggestion! This does not really resolve the redundancy issue, though, without making some modifications. Running your code for "all_combos", I get the following data&colon;

unique_id  cluster

R1C1  Cluster001

R1C2  Cluster001

R9C1  Cluster001

R1C2  Cluster002

R9C1  Cluster002

R2C4  Cluster003

R5C8  Cluster003

R7C2  Cluster003

R5C8  Cluster004

R7C2  Cluster004

I suppose I could then search through this to remove duplicate rows. If I add "nodupkey" to your PROC SORT, I get:

R1C1  Cluster001

R1C2  Cluster001

R2C4  Cluster003

R5C8  Cluster003

R7C2  Cluster003

R9C1  Cluster001

Which is more or less what I want. The cluster numbers would then be nonsensical, but they are arbitrary anyway so it shouldn't matter. I will try this approach on a more complicated example to see if it works.

Super User
Posts: 6,785

## Re: Collapse one dataset based on values in another

Rather than adding NODUPKEY, consider what your data will look like after you merge.

The unique identifier for each record will be the combination of CLUSTER and UNIQUE_ID.

The variables m1 and m2 can be dropped.

Only the final merged data set will be needed for analysis.  Both of the original data sets can be saved for historical purposes, but are likely to never be used again.

You can run a PROC MEANS using CLASS CLUSTER.

The only issue you will need to take care with is counting the number of records.  The number of records will have increased to include one record for each CLUSTER/UNIQUE_ID combination.

Frequent Contributor
Posts: 101

## Re: Collapse one dataset based on values in another

I'm a little confused as to what you mean, to be honest.

Yes, each record will be uniquely identified by a combination of CLUSTER and UNIQUE_ID. But, as I said, this results in redundant observations. Which is exactly the problem.

So, if I run

PROC MEANS data=want;
class cluster;
run;

Using the dataset "want" as created by your code, I get the results from 4 clusters, where two of them (in this case Cluster002 and Cluster004) are utterly meaningless because they represent artifacts of the data structure.

The "only issue" is not the number of records, it is also the number of clusters.

Super User
Posts: 6,785

## Re: Collapse one dataset based on values in another

Sorry, my bad.  I appreciate the problem now, and will have to give it some more thought.  (I can't help thinking that this sort of problem has been addressed in the past year, as a set of parent-child relationships where you need to group the entire family.)

Frequent Contributor
Posts: 101

## Re: Collapse one dataset based on values in another

No worries. It's a deceptively hard problem to solve. Anyway, you're code definitely gets me CLOSER to where I need to be. Using the NODUPKEY on that PROC SORT seems to work pretty well, but I still need to test it in more complicated scenarios. Anyway, thanks for your help so far!

Solution
‎10-16-2015 10:34 AM
Posts: 5,543

## Re: Collapse one dataset based on values in another

[ Edited ]

The only problematic task here is finding the clusters. If you have SAS/OR, it can be done with proc optnet. Otherwise, you can resort to the SubGraphs macro available here. Here's how for your problem

DATA features;
input unique_id \$ f1 f2 f3 f4 f5 f6 f7 f8 f9 f10;
datalines;
R1C1 0 0 0 0 0 0 0 0 1 1
R1C2 0 0 0 0 0 1 0 0 1 0
R2C4 0 0 1 1 0 0 0 0 0 1
R5C8 0 0 1 1 0 0 0 0 0 0
R7C2 0 0 1 0 1 0 0 0 0 0
R9C1 0 0 0 0 0 0 0 0 1 1
;

proc transpose data=features out=featureList;
by unique_id;
var f:;
run;

DATA clusters;
input unique_id \$ m1 \$  m2 \$ ;
datalines;
R1C1 R1C2 R9C1
R1C2 R9C1 .
R2C4 R5C8 R7C2
R5C8 R7C2 .
;

data arcs;
set clusters;
from = unique_id; to = m1; output;
if not missing(m2) then do;
from = unique_id; to = m2; output;
end;
keep from to;
run;

concomp;
run;

/* Or, if you don't have accerss to SAS/OR */
/*
%include "&sasforum.\subgraphs.sas";

%SubGraphs(arcs,from=from,to=to,out=clust);
*/

proc sql;
create table collapseList as
select b.clust, a._NAME_ as var, mean(a.COL1) as value
from featureList as a inner join clust as b
on a.unique_id=b.node
group by b.clust, a._NAME_;
quit;

proc transpose data=collapseList out=collapse(drop = _: );
by clust;
var value;
id var;
run;
PG
Frequent Contributor
Posts: 101

## Re: Collapse one dataset based on values in another

[ Edited ]

I never even knew that PROC OPTNET even existed! Thanks for letting me know. Ahough trying it out on my real data revealed a new set of problems that I hadn't anticipated, I am accepting this as the answer because it did satisfy the question in my OP.

The problem that was revealed when I ran this code on my real data, after verifying it does wha tI need, is that it grouped my 1000+ clusters into only 2 connected components. After looking through my data, I realized that not every cluster is perfectly unique; there are some observations that end up being grouped into multiple clusters. There are not that many of them, but there are enough that it ends up forging connections between what should otherwise be completely independent clusters.

To illustrate this problem, here is the same code with a modified clusters dataset:

DATA clusters;
input unique_id \$ m1 \$  m2 \$ ;
datalines;
R1C1 R1C2 R9C1
R1C2 R9C1 .
R2C4 R5C8 R7C2
R5C8 R7C2 .
R9C1 R2C4 .
;

DATA arcs;
set clusters;
from = unique_id; to = m1; output;
if not missing(m2) then
do;
from = unique_id; to = m2; output;
end;
keep from to;
run;

concomp;
run;

From a theroetical point of view (these observations actually represent patterns of activity across an array of electrodes), there is no reason for each cluster to be perfectly unqiue, but grouping together clusters based on sharing a single observation isnt' warranted. That is, PROC OPTNET is actually working TOO well, because I hadn't foreseen this being an issue in my data!

Is there some option within PROC OPTNET that would allow some sort of thresholding when it determines these connections? For example, if a cluster contains 100 observations, and shares only one of those observations with another cluster that contains 50 other (unshared) observations, I would not want those to be grouped together.

In any case, I am going to play around with PROC OPTNET's various options to see what I can do. It may be that I need to do some changes further upstream in this analysis to flush out some of these spurious connections.

Posts: 5,543