Re: dataset integration

geniusgenie · Posted 06-02-2017 06:24 PM

Hi,

I was wondering if someone can please help me?

I have three datasets, all of them have one common column called file name.

First dataset contains 2000 rows all unique rows (One row per filename).

Second one has multiple rows per file name, some filenames have 7 rows and some more than 7.

Similar for third dataset.

I am looking to integrate second and third datasets in a way that both of these datasets have same one row per filename.

Can anyone suggest how should I proceed for that?

So, that I can merge both of these datasets into First dataset as features.

Sample table

Filename	Category	Type	Col1	Col2	Col3	Col4	Co5	Flg	Read	Write	Execute	Align
1	1	LOAD	0x000000	00010000	00010000	1515c	1515c	R E	1	0	1	0x10000
1	1	GNU_STACK	0x000000	00000000	00000000	00000	00000	RW	1	1	0	0x4
1	1	LOAD	0x01515c	0003515c	0003515c	00518	06874	RW	1	1	0	0x10000
2	1	LOAD	0x000000	08048000	08048000	0d0a0	0d0a0	R E	1	0	1	0x1000
2	1	GNU_STACK	0x000000	00000000	00000000	00000	00000	RW	1	1	0	0x4
2	1	LOAD	0x00d0a0	080560a0	080560a0	0031c	06800	RW	1	1	0	0x1000
3	2	LOAD	0x000000	00010000	00010000	1123c	1123c	R E	1	0	1	0x10000
3	2	GNU_STACK	0x000000	00000000	00000000	00000	00000	RW	1	1	0	0x4
3	2	LOAD	0x01123c	0003123c	0003123c	004f4	06644	RW	1	1	0	0x10000

Regards

ballardw · Posted 06-02-2017 06:34 PM

You should provide some data, enough to demonstrate what you need, for each of the tables and what the final result for the example data should look like. From your description possibly 2 rows for the first table mentioned and cases for the other data sets such that one of the filenames from the first data is only used once and the other 3 times. You don't need to provide ALL of the variables but enough to demonstrate what you want done with other common variables if they exist. For instance if the variable XYZ exists in sets 2 and 3 which version do you want in the final result when they differ?

It really helps to provide data in the form of a datastep. Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.

It also helps to show code or describe what you have attempted so far.

geniusgenie · Posted 06-03-2017 03:28 AM

Hi Ballard,

Thanks for your reply, I am attaching sample XLS sheet for your reference of the datasets. Appologies for not creating dataset they way you asked, I am new to SAS so need study a little bit more.

Sheet 1 contains main data in which I need to integrate sheet 2 and sheet 3. Its like 1 to many relationship. As sheet2 and 3 contain multiple rows of the same files present in sheet1.

Sheet 2 and 3 contains fields highlighted as red, about them I am not sure how to integrate. One way I thought was to integrate like a count of unique addresses per file for example for file 1 , in sd_off the count will be 2 , sd_virtad count=3 etc.

In sheet2, sd_type becomes the feature and represented as a yes/no value in integrated sheet.

In sheet3, td_sname and td_stype are two important columns and have unique combinations of values as for every td_sname there is a td_stype, so, every unique combination becomes a column and represented as a yes/no value for each file name its present as 1 and 0 for rest.

and red fields are the counters and flags should be the same.

Hope it clarifies more about my problem. Main thing is sheet 2 and 3 should be integrated in such a way that every file has one row representation only. Dont bother how many fields.

Regards

Tom · Posted 06-03-2017 05:32 PM

If you want to reduce multiple rows to one there three basic methods.

1) You could perform summary functions over the values. So a numeric column could be reduced to the sum of the values in the column. Or perhaps you want to take the min or max value, that could work for both numeric and character variables.

2) You could TRANSPOSE the rows into columns. So if there are three observations be group your could convert that into three columns for one observation.

3) You could pick the best observation and only use the data from that observation. So perhaps the first one or the last one or the one with largest value for some column.

Which of these methods do you want to do to transform your second and third datasets?

geniusgenie · Posted 06-03-2017 05:48 PM

Hi Tom,
I am looking to transpose rows into column but the columns containing hex values may be treated as sum of unqiue values per filename. And columns like td_sname or type can be treated as a flag in which 1 represents particular value present . So transpose will let mutiple columns to be present which is my case.

Each file in ds2 and 3 contains atleast 3 or more rows so multiple features in a single row after transpose. Sample is in sheet 4.

geniusgenie · Posted 06-03-2017 05:52 PM

Hi Tom,
I am looking to transpose rows into column but the columns containing hex values may be treated as sum of unqiue values per filename. And columns like td_sname or type can be treated as a flag in which 1 represents particular value present . So transpose will let mutiple columns to be present which is my case.

Each file in ds2 and 3 contains atleast 3 or more rows so multiple features in a single row after transpose. Sample is in sheet4. Only thing i dont want is to skip any row unless same value gets repeated.

Tom · Posted 06-03-2017 11:39 PM

You will need to explain the data and the collapsing logic more clearly. Please include concrete (simplified) examples. You could for show an example of 3 rows of data for the input and what you expect the single row of data in the output to look like. Include the example in the text of your message and NOT as an attachment. And if you must use an attachment then attach a text file and not a proprietory format like XLSX or DOCX.

I have no idea what you mean by hex values. It sounds like you want to make a list of distinct values? Do you want those values concatenated into a single longer character variable? Does the order of the original observations matter, or could you sort the data to eliminate duplicate values?

If by flags you mean boolean 0/1 values then usually you will want to either take the MAX() over the observations which is effectively an OR across the observations or SUM() which is effectively a count of the number of true cases.

geniusgenie · Posted 06-04-2017 01:00 AM

Hi Tom,
Reason why I uploaded XLS file because in my orginal post I copied few rows of data which was not clear for Ballard. If you notice in sheet number two and three some columns contain long numeric values , these values are actually hexa decimal values. for these values I am looking to get a sum of unique values per file number. For columns like sd_type, td_sname, td_stype , I am looking to transpose them into columns. These columns will contain either 0 or 1 to represent whether this column was present in this file. I called it a flag (I may be wrong on this). I do not want concatenation. As I explained earlier, my main dataset contains only one row per file and these second and third datasets contain more than one rows. I need to merge them into main datasets in such a way that I just copy columns on the right hand side of the last column of main dataset. Because they contain unique file numbers.

Tom · Posted 06-04-2017 02:02 AM

The way you can create a clear example of what you want is to copy and paste your data as TEXT ONLY. No tabs. No HTML table. Make sure to use the Insert code or Insert SAS code icons in the rich text editor on this site so that it will preserve the spaces and be displayed using non-proportional font. If you want to edit the text in the code boxes remember to put the cursor inside the box and click the icon again so that you can edit in the pop-up window. If you try to edit without doing that it messes up the formatting of the line breaks.

Do not try to post an example with more than one of two variables and one or two groups (filenames?). Once you know how to do the transformation for variable, say SD_TYPE, then you can do it for other variables that you want to process in the same way yourself.

The basic structure of your program is going to be.

transform SHEET2 into NEW_SHEET2 that is one row per filename.
transform SHEET3 into NEW_SHEET3 that is one row per filename.
Then you just need to merge the three datasaets.

data want ;
   merge sheet1 new_sheet2 new_sheet3 ;
   by filename;
run;

Registration is open

SAS Training: Just a Click Away