Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

Matching on across mulitple columns

Accepted Solution Solved
Reply
New Contributor
Posts: 3
Accepted Solution

Matching on across mulitple columns

I have a data set that looks like the following.

IDTVG_MSO1TVG_MSO2TVG_MSO3NTIA_MSO1NTIA_MSO2NTIA_MSO3NTIA_MSO4
391517108002054Comcast Cable CommunicationsLamont Digital Systems, Inc.Time Warner CableAT&TTime Warner CableNULLNULL
191130017006002ImOn Communications, LLCMediacom, L. L. C.NULLCenturyLinkDynamic BroadbandImOn Communications, LLCMediacom, L. L. C.
320030016123003CenturyLinkCommercial Satellite SystemCox Cable Communications IncCenturyTel TeleVideo, Inc.Cox Cable Communications IncNULLNULL
360450613002014Time Warner CableNULLNULLTime Warner CableVerizon Service CorpWesTel SystemsNULL
220550020021029NULLNULLNULLAT&TCox Cable Communications IncNULLNULL
480291203002025CS Wireless Systems, Inc.Grande ClearSource Comm.Time Warner CableAT&TGrande ClearSource Comm.Time Warner CableNULL
132551612001003Comcast Cable CommunicationsDirecPathNULLAT&TComcast Cable CommunicationsCovad CommunicationsNULL
260150108002035Wide Open WestNULLNULLBarry County Telephone CompanyNULLNULLNULL
210490202023015Time Warner CableNULLNULLAT&TTime Warner CableNULLNULL
160219702005039NULLNULLNULLNULLNULLNULLNULL

My problem is that that I need find matches where TVG_MSO1 is equal to NTIA_MSO1 or NTIA_MSO1 or NTIA_MSO2 or NTIA_MSO3 or NTIA_MSO4. But the macth could come from any combination of the TVG_ and NTIA_ columns. So for example, in the first row TVG_MSO3 equals NTIA_MSO2, so I would like to have a final column that would say the name of the match "Time Warner Cable". If there is no match, that is fine. Similarly in row two the match is TVG_MSO2 and NTIA_MSO4 and the output column would be "Mediacom, L. L. C.".

If there is any one that could help me out with a solution, I would be forever grateful.

Your friend in SAS,

John Wynne


Accepted Solutions
Solution
‎10-06-2012 02:57 AM
Super Contributor
Posts: 644

Re: Matching on across mulitple columns

As this query is still marked 'Not answered' I thought you might be looking for code.  Here is my suggestion.  I've assumed the word NULL in your table is a token for a null value, not the literal value; however if that is not the case you can substitute  <>  'NULL'  for  IS  NOT NULL.  This code is not tested.  The results may not be in the same ID order as the input data.

*   Assuming ID is unique ;

Proc Transpose

        Data    =   Communications

        Prefix  =   TVG_Values_

        Name    =   Original_TVG_Column

        Out     =   TVG_Data

        ;

    By  ID  Notsorted ;

    Var TVG_MSO: ;

    *   Selects all columns beginning with TVG_MSO ;

Run ;

Proc Transpose

        Data    =   Communications

        Prefix  =   NTIA_Values_

        Name    =   Original_NTIA_Column

        Out     =   NTIA_Data

        ;

    By  ID  Notsorted ;

    Var NTIA_MSO: ;

    *   Selects all columns beginning with NTIA_MSO ;

Run ;

Proc SQL ;

    Create Table   Comm_Matches    As

        Select 

            Distinct

                TVG.ID

            ,   TVG.TVG_Values_1    As  Matching

            ,   TVG.Original_TVG_Column

            ,   NTI.Original_NTIA_Column

        From    TVG_Data    TVG

            ,   NTIA_Data   NTI

        Where   NTI.ID              =   TVG.ID

            And NTI.NTIA_Values_1   =   TVG.TVG_Values_1

            And NTI.NTIA_Values_1   IS  NOT NULL

            And TVG.TVG_Values_1    IS  NOT NULL

            ;

Quit ;

View solution in original post


All Replies
Super User
Posts: 17,863

Re: Matching on across mulitple columns

Can you have multiple matches? And would you care which one matches?

Assuming you can't have multiple matches.

data want;

     set have;

     array tvg(*) tvg_mso1-tvg_mso3;

     array ntia(*) ntia_mso1-ntia_mso4;

do i=1 to dim(tvg);

do j=1 to dim(ntia);

if whichc(tvg(i), of ntia(*))>0 then match=tvg(i);

if match ne '' then leave; *I can't remember if its leave or break to end the loop;

end;

end;

run;

Super Contributor
Posts: 644

Re: Matching on across mulitple columns

I would suggest the following (assuming your data is already sorted) :

  • Use proc transpose to create 2 datasets each using ID as the by variable, one containing all the TVG data in a single column and the other containing the NTIA data.  The automatic _NAME_ column will tell you which column an individual record came from
  • Use proc sql to join the two tables
    • Rename the _NAME_ columns to be distinct if you need this information
    • Include where conditions to exclude nulls from the joins
    • Order by TVG_data then NTIA_data if you expect multiple matches per ID
    • Include the original table in the join if you need to output rows in the original format
    • In the case of multiple joins a following data step could be used to identify the first match in each row of the original table.

I do not at the moment have access to SAS to test and debug but I could sketch out some code if that would help.

Super Contributor
Posts: 644

Re: Matching on across mulitple columns

Forgot to mention: if this is a very large table you can get by with one pass of proc transpose and join the resulting table to itself.  The SQL gets a bit more messy, though you could use views to create two virtual tables and proceed as outlined in my first reply.

New Contributor
Posts: 3

Re: Matching on across mulitple columns

Thanks for your help

Solution
‎10-06-2012 02:57 AM
Super Contributor
Posts: 644

Re: Matching on across mulitple columns

As this query is still marked 'Not answered' I thought you might be looking for code.  Here is my suggestion.  I've assumed the word NULL in your table is a token for a null value, not the literal value; however if that is not the case you can substitute  <>  'NULL'  for  IS  NOT NULL.  This code is not tested.  The results may not be in the same ID order as the input data.

*   Assuming ID is unique ;

Proc Transpose

        Data    =   Communications

        Prefix  =   TVG_Values_

        Name    =   Original_TVG_Column

        Out     =   TVG_Data

        ;

    By  ID  Notsorted ;

    Var TVG_MSO: ;

    *   Selects all columns beginning with TVG_MSO ;

Run ;

Proc Transpose

        Data    =   Communications

        Prefix  =   NTIA_Values_

        Name    =   Original_NTIA_Column

        Out     =   NTIA_Data

        ;

    By  ID  Notsorted ;

    Var NTIA_MSO: ;

    *   Selects all columns beginning with NTIA_MSO ;

Run ;

Proc SQL ;

    Create Table   Comm_Matches    As

        Select 

            Distinct

                TVG.ID

            ,   TVG.TVG_Values_1    As  Matching

            ,   TVG.Original_TVG_Column

            ,   NTI.Original_NTIA_Column

        From    TVG_Data    TVG

            ,   NTIA_Data   NTI

        Where   NTI.ID              =   TVG.ID

            And NTI.NTIA_Values_1   =   TVG.TVG_Values_1

            And NTI.NTIA_Values_1   IS  NOT NULL

            And TVG.TVG_Values_1    IS  NOT NULL

            ;

Quit ;

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 366 views
  • 9 likes
  • 3 in conversation