BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lydiawawa
Lapis Lazuli | Level 10

I have two very large datasets with the exact same variables name and row counts, and I'm doing comparisons between the variables. I'm hoping to generate frequency counts in output format such as below:

 

Variable

0_data1

1_data1

0_data2

1_data2

etc

ang1

45

3

45

2

 

ang2

43

23

42

23

 

ang3

44

43

44

43

 

ang4

23

343

23

321

 

etc

 

 

 

 

 

 

All variables are in character format with level '0' or '1', and I need to compare counts among levels. There are over 100 variables on each dataset , each dataset has over 10million rows.

 

To arrive at the output, I thought about outputting categorical variable frequency then merge outputs from both datasets by variable name. 

Such that:

 

Original , there is no ID/index variable

ang1

ang2

ang3

ang4

Etc.

0

1

0

0

0

1

1

1

1

1

0

1

0

1

0

0

0

1

1

1

etc

 

 

 

 

 

To

Variable

0_data1

1_data1

ang1

45

3

ang2

43

23

ang3

44

43

ang4

23

343

etc

 

 

 

How to arrive at the above output for data1?

Appreciate for any inputs. 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

If it's 0/1 data, and you're only comparing counts that could be equivalent of comparing the mean and N of those variables, or sum. 

So you can run each data set through proc means summarizing the Mean and N and then compare them. If you want the 0/1 values then it's bit more work but doable. 

ods select none;
proc means data=have1 N SUM MEAN;
var _numeric_;
ods output want = _summary1;
run;
ods select all;

Run something like that for each data set and then combine the results as follows:

 

data want;
length ds dsname $50.;
set _summary1 _summary2 indsname=ds;
dsname = ds;
run;

Then you can use proc transpose if you really want the long format. Otherwise you transpose and then merge so really it's the same operations, different order. 

 

If your data is character this will not work. Instead you'll need to use PROC FREQ to replace the PROC MEANS. 

 

Rough idea is here https://gist.github.com/statgeek/e0903d269d4a71316a4e

 

Refer to all variables using _all_, _numeric_ or _character_.

 

Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html

 


@lydiawawa wrote:

I have two very large datasets with the exact same variables name and row counts, and I'm doing comparisons between the variables. I'm hoping to generate frequency counts in output format such as below:

 

Variable

0_data1

1_data1

0_data2

1_data2

etc

ang1

45

3

45

2

 

ang2

43

23

42

23

 

ang3

44

43

44

43

 

ang4

23

343

23

321

 

etc

 

 

 

 

 

 

All variables are in character format with level '0' or '1', and I need to compare counts among levels. There are over 100 variables on each dataset , each dataset has over 10million rows.

 

To arrive at the output, I thought about outputting categorical variable frequency then merge outputs from both datasets by variable name. 

Such that:

 

Original , there is no ID/index variable

ang1

ang2

ang3

ang4

Etc.

0

1

0

0

0

1

1

1

1

1

0

1

0

1

0

0

0

1

1

1

etc

 

 

 

 

 

To

Variable

0_data1

1_data1

ang1

45

3

ang2

43

23

ang3

44

43

ang4

23

343

etc

 

 

 

How to arrive at the above output for data1?

Appreciate for any inputs. 


 

View solution in original post

2 REPLIES 2
Reeza
Super User

If it's 0/1 data, and you're only comparing counts that could be equivalent of comparing the mean and N of those variables, or sum. 

So you can run each data set through proc means summarizing the Mean and N and then compare them. If you want the 0/1 values then it's bit more work but doable. 

ods select none;
proc means data=have1 N SUM MEAN;
var _numeric_;
ods output want = _summary1;
run;
ods select all;

Run something like that for each data set and then combine the results as follows:

 

data want;
length ds dsname $50.;
set _summary1 _summary2 indsname=ds;
dsname = ds;
run;

Then you can use proc transpose if you really want the long format. Otherwise you transpose and then merge so really it's the same operations, different order. 

 

If your data is character this will not work. Instead you'll need to use PROC FREQ to replace the PROC MEANS. 

 

Rough idea is here https://gist.github.com/statgeek/e0903d269d4a71316a4e

 

Refer to all variables using _all_, _numeric_ or _character_.

 

Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html

 


@lydiawawa wrote:

I have two very large datasets with the exact same variables name and row counts, and I'm doing comparisons between the variables. I'm hoping to generate frequency counts in output format such as below:

 

Variable

0_data1

1_data1

0_data2

1_data2

etc

ang1

45

3

45

2

 

ang2

43

23

42

23

 

ang3

44

43

44

43

 

ang4

23

343

23

321

 

etc

 

 

 

 

 

 

All variables are in character format with level '0' or '1', and I need to compare counts among levels. There are over 100 variables on each dataset , each dataset has over 10million rows.

 

To arrive at the output, I thought about outputting categorical variable frequency then merge outputs from both datasets by variable name. 

Such that:

 

Original , there is no ID/index variable

ang1

ang2

ang3

ang4

Etc.

0

1

0

0

0

1

1

1

1

1

0

1

0

1

0

0

0

1

1

1

etc

 

 

 

 

 

To

Variable

0_data1

1_data1

ang1

45

3

ang2

43

23

ang3

44

43

ang4

23

343

etc

 

 

 

How to arrive at the above output for data1?

Appreciate for any inputs. 


 

lydiawawa
Lapis Lazuli | Level 10

The github example worked very well. Thanks!

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 804 views
  • 3 likes
  • 2 in conversation