BookmarkSubscribeRSS Feed
Emma8
Quartz | Level 8

I have a dataset with about 1000 variables. There are many variables that are coded the same but names are different, see an example, those variables have the same information: nHits and fewfe, also those variables have the same values except League contain some missing data: League  dfgwe_23.

How can I find out those variables? So, I wanted to get the list of variables (see below attached)  with the same values. In this example fake data, the following variables have the same values except some missing (should ignore missing). Thank you.

 

Want the following list:

CrAtBat = ab
CrBB =fkae
CrHits =vrgsd
CrHome =dfwef
CrRbi =vewef
CrRuns =fff
Div =wsfde_d
Division = vdfgew
League =dfgwe_23
Position = fgerg
Salary =vregf
Team =fre
YrMajor = rewa
logSalary = fwef
nAssts =cefwe
nAtBat =efqwef
nBB =qfqewfq
nError =qfefe
nHits =fewfe

24 REPLIES 24
ketpt42
Quartz | Level 8
Did you attach the correct file? It doesn't match the description you gave.
Emma8
Quartz | Level 8
Sorry, just attached a correct file.
andreas_lds
Jade | Level 19

What do you mean by "coded the same" exactly?

i am not that fond of writing a data step to read a file with 1k variables, please share the code you have used to read the file.

Emma8
Quartz | Level 8
Sorry, I updated the file again
Kurt_Bremser
Super User

To assist you in finding possible duplicate columns, you can do this:

proc import
  out=baseball
  datafile="/folders/myfolders/baseball2.csv"
  dbms=csv
  replace
;
run;

proc transpose data=baseball out=long;
by name;
var _numeric_;
run;

proc sort data=long;
by name col1;
run;
Emma8
Quartz | Level 8
There are no duplicates
Kurt_Bremser
Super User

That looks like a bad joke. There are several columns with identical names, there's no logical order, the header line can't be used for valid SAS names, .....

If it had been something like

player_name position_1986 salary_1986 at_bats_1986 position_1987 salary_1987 at_bats_1987

and so on, then you could rather easily transpose it into something useful, but this?

"Player's Name","Team at the End of 1986","Times at Bat in 1986","Hits in 1986","Home Runs in 1986","Runs in 1986","RBIs in 1986","Walks in 1986","Years in the Major Leagues","Career Times at Bat","Career Hits","Career Home Runs","Career Runs","Career RBIs","Career Walks","League at the End of 1986","Division at the End of 1986","Position(s) in 1986","Put Outs in 1986","Assists in 1986","Errors in 1986","1987 Salary in $ Thousands","League and Division","Log Salary","Team at the End of 1986","Times at Bat in 1986","Hits in 1986","Walks in 1986","Years in the Major Leagues","Career Times at Bat","Career Hits","Career Home Runs","Career Runs","Career RBIs","Career Walks","League at the End of 1986","Division at the End of 1986","Position(s) in 1986","Assists in 1986","Errors in 1986","1987 Salary in $ Thousands","League and Division","Log Salary"
Shmuel
Garnet | Level 18

1) Are those duplicates all only numeric type or also character type?

2) Can it be that for some observations their values are the same but in other observation,

     those same variables have different value?

3) Assume you found the duplicate values, what do you want to do with that information?

    What kind of output you want?

 

Mainly, theoretically, you can define an array, sort it and then compare values easily.

Emma8
Quartz | Level 8
Thank you.
Numeric variables should compare with numeric and character variables should compare with character—-that is the goal =finding variables with the same values.
I do not want any output data. I want just the list of variables with the same values.
For example, if variable1’s values are equal with variable2’s values, then I want
Variable1 and variable2, etc
Shmuel
Garnet | Level 18

You have not answer the second question:

2) Can it be that for some observations their values are the same but in other observation,

     those same variables have different value? (neglecting missing values).

 

See solution in next link - a very similar question:

https://communities.sas.com/t5/SAS-Programming/Values-the-same/m-p/663839/highlight/false#M198244 

Emma8
Quartz | Level 8
Let me go back.
I have 2 datasets.
Dataset 1 has 20000 observations and 300 variables.

Dataset 2 has 10000 observations and 500 variables.


I want to compare those datasets and list out all variables with the same values (one may have more ID (names as suggests the dataset observations) and missing values. I should ignore those additional observations or missing, just compare existing ids variables in both datasets
Shmuel
Garnet | Level 18

What do you mean by "one may have more ID" - is ID unique at one or both datasets?

Suppose you have unique ID in table1 bot more than one observations of same id in table 2 - what do you want to compare? and how to treat the many to one matching ?

 

I suggest post a sample of each table with upto 10 observations and choose a sample of variables you noticed as same vale and few of different value to serve as test date, then post the output you expect regarding those samples.

Emma8
Quartz | Level 8
In one dataset, there are variables before “=“ and in another dataset, there are variables after “=“ sign:
CrAtBat = ab
CrBB =fkae
CrHits =vrgsd
CrHome =dfwef
CrRbi =vewef
CrRuns =fff
Div =wsfde_d
Division = vdfgew
League =dfgwe_23
Position = fgerg
Salary =vregf
Team =fre
YrMajor = rewa
logSalary = fwef
nAssts =cefwe
nAtBat =efqwef
nBB =qfqewfq
nError =qfefe
nHits =fewfe

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 24 replies
  • 3310 views
  • 0 likes
  • 5 in conversation