Write and run SAS programs in your web browser

dataset integration

Reply
Contributor
Posts: 36

dataset integration

[ Edited ]

Hi,

I was wondering if someone can please help me?

 

I have three datasets, all of them have one common column called file name.

First dataset contains 2000 rows all unique rows (One row per filename). 

Second one has multiple rows per file name, some filenames have 7 rows and some more than 7. 

Similar for third dataset.

 

I am looking to integrate second and third datasets in a way that both of these datasets have same one row per filename.

 

Can anyone suggest how should I proceed for that?

So, that I can merge both of these datasets into First dataset as features. 

 

Sample table

 

FilenameCategoryTypeCol1Col2Col3Col4Co5FlgReadWriteExecuteAlign
11LOAD0x00000000010000000100001515c1515cR E1010x10000
11GNU_STACK0x00000000000000000000000000000000RW1100x4
11LOAD0x01515c0003515c0003515c0051806874RW1100x10000
21LOAD0x00000008048000080480000d0a00d0a0R E1010x1000
21GNU_STACK0x00000000000000000000000000000000RW1100x4
21LOAD0x00d0a0080560a0080560a00031c06800RW1100x1000
32LOAD0x00000000010000000100001123c1123cR E1010x10000
32GNU_STACK0x00000000000000000000000000000000RW1100x4
32LOAD0x01123c0003123c0003123c004f406644RW1100x10000

 

Regards

 

Grand Advisor
Posts: 10,043

Re: dataset integration

You should provide some data, enough  to demonstrate what you need, for each of the tables and what the final result for the example data should look like. From your description possibly 2 rows for the first table mentioned and cases for the other data sets such that one of the filenames from the first data is only used once  and the other 3 times. You don't need to provide ALL of the variables but enough to demonstrate what you want done with other common variables if they exist. For instance if the variable XYZ exists in sets 2 and 3 which version do you want in the final result when they differ?

 

It really helps to provide data in the form of a datastep. Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.

 

It also helps to show code or describe what you have attempted so far.

Contributor
Posts: 36

Re: dataset integration

Hi Ballard,

Thanks for your reply, I am attaching sample XLS sheet for your reference of the datasets. Appologies for not creating dataset they way you asked, I am new to SAS so need study a little bit more. 

 

Sheet 1 contains main data in which I need to integrate sheet 2 and sheet 3. Its like 1 to many relationship. As sheet2 and 3 contain multiple rows of the same files present in sheet1. 

 

Sheet 2 and 3 contains fields highlighted as red, about them I am not sure how to integrate. One way I thought was to integrate like a count of unique addresses per file for example for file 1 , in sd_off the count will be 2 , sd_virtad count=3 etc. 

 

In sheet2, sd_type becomes the feature and represented as a yes/no value in integrated sheet. 

In sheet3, td_sname and td_stype are two important columns and have unique combinations of values as for every td_sname there is a td_stype, so, every unique combination becomes a column and represented as a yes/no value for each file name its present as 1 and 0 for rest.

 

and red fields are the counters and flags should be the same.

 

Hope it clarifies more about my problem. Main thing is sheet 2 and 3 should be integrated in such a way that every file has one row representation only. Dont bother how many fields.

 

 

Regards

 

 

Super User
Super User
Posts: 6,133

Re: dataset integration

If you want to reduce multiple rows to one there three basic methods.

1) You could perform summary functions over the values.  So a numeric column could be reduced to the sum of the values in the column.  Or perhaps you want to take the min or max value, that could work for both numeric and character variables.

2) You could TRANSPOSE the rows into columns. So if there are three observations be group your could convert that into three columns for one observation.

3) You could pick the best observation and only use the data from that observation. So perhaps the first one or the last one or the one with largest value for some column.

 

Which of these methods do you want to do to transform your second and third datasets?

Contributor
Posts: 36

Re: dataset integration

Hi Tom,
I am looking to transpose rows into column but the columns containing hex values may be treated as sum of unqiue values per filename. And columns like td_sname or type can be treated as a flag in which 1 represents particular value present . So transpose will let mutiple columns to be present which is my case.

Each file in ds2 and 3 contains atleast 3 or more rows so multiple features in a single row after transpose. Sample is in sheet 4.
Contributor
Posts: 36

Re: dataset integration

Hi Tom,
I am looking to transpose rows into column but the columns containing hex values may be treated as sum of unqiue values per filename. And columns like td_sname or type can be treated as a flag in which 1 represents particular value present . So transpose will let mutiple columns to be present which is my case.

Each file in ds2 and 3 contains atleast 3 or more rows so multiple features in a single row after transpose. Sample is in sheet4. Only thing i dont want is to skip any row unless same value gets repeated.
Super User
Super User
Posts: 6,133

Re: dataset integration

You will need to explain the data and the collapsing logic more clearly. Please include concrete (simplified) examples.  You could for show an example of 3 rows of data for the input and what you expect the single row of data in the output to look like. Include the example in the text of your message and NOT as an attachment. And if you must use an attachment then attach a text file and not a proprietory format like XLSX or DOCX.

 

I have no idea what you mean by hex values. It sounds like you want to make a list of distinct values?  Do you want those values concatenated into a single longer character variable?  Does the order of the original observations matter, or could you sort the data to eliminate duplicate values?

 

If by flags you mean boolean 0/1 values then usually you will want to either take the MAX() over the observations which is effectively an OR across the observations or SUM() which is effectively a count of the number of true cases.

 

Contributor
Posts: 36

Re: dataset integration

Hi Tom,
Reason why I uploaded XLS file because in my orginal post I copied few rows of data which was not clear for Ballard. If you notice in sheet number two and three some columns contain long numeric values , these values are actually hexa decimal values. for these values I am looking to get a sum of unique values per file number. For columns like sd_type, td_sname, td_stype , I am looking to transpose them into columns. These columns will contain either 0 or 1 to represent whether this column was present in this file. I called it a flag (I may be wrong on this). I do not want concatenation. As I explained earlier, my main dataset contains only one row per file and these second and third datasets contain more than one rows. I need to merge them into main datasets in such a way that I just copy columns on the right hand side of the last column of main dataset. Because they contain unique file numbers.
Super User
Super User
Posts: 6,133

Re: dataset integration

[ Edited ]

The way you can create a clear example of what you want is to copy and paste your data as TEXT ONLY. No tabs. No HTML table. Make sure to use the Insert code or Insert SAS code icons in the rich text editor on this site so that it will preserve the spaces and be displayed using non-proportional font. If you want to edit the text in the code boxes remember to put the cursor inside the box and click the icon again so that you can edit in the pop-up window.  If you try to edit without doing that it messes up the formatting of the line breaks.

 

Do not try to post an example with more than one of two variables and one or two groups (filenames?). Once you know how to do the transformation for variable, say SD_TYPE, then you can do it for other variables that you want to process in the same way yourself.

 

The basic structure of your program is going to be.

  1. transform SHEET2 into NEW_SHEET2 that is one row per filename.
  2. transform SHEET3 into NEW_SHEET3 that is one row per filename.
  3. Then you just need to merge the three datasaets.
data want ;
   merge sheet1 new_sheet2 new_sheet3 ;
   by filename;
run;

 

Ask a Question
Discussion stats
  • 8 replies
  • 177 views
  • 1 like
  • 3 in conversation