Antonio Aguilar Posted May 27, 2019 Posted May 27, 2019 I have a set of data that i need to take of a sample of 10% for a control group with the same characteristics of the main based on a range of days 0 to 10, 10 to 20, etc.
Gaia Paolini Posted May 30, 2019 Posted May 30, 2019 Can you upload sample data or at least explain the data structure and how the sample should be taken (a range of days 0 to 10 does not guarantee 10% so I am not sure I understand). Where should this sampled data table be stored A new data table with 10% of the rows, or a marker column on the existing data table specifying which rows should be included or excluded Gaia
Kirsten Smith (she/her) Posted May 30, 2019 Posted May 30, 2019 I would suggest looking at RandBetween() and RowId() to come up with a random sample, then you can use that to decide which RowIds to include.
Antonio Aguilar Posted May 30, 2019 Author Posted May 30, 2019 It is a set of data of 200K records with phone number and days of silent (no calls). I created a column with the range of silent days and need a sample of 10% of them. The idea is to flag those 10%.
Gaia Paolini Posted May 31, 2019 Posted May 31, 2019 So I understand you want to generate a stratified sample, i.e. a sample that preserves population proportions of your groups, your groups being the silent days ranges. It would be easy with a TERR data function but you did not say whether this is an option. I can suggest a workaround, which should be good enough since you have a lot of rows, unless some of your groups are incredibly rare (fewer than 10 instances - but then no stratified sampling would do). Create a calculated column with the following expression (call it e.g. [rand1]): RandBetween(1,Count(RowId()) OVER ([silentdays]),123) / Count(RowId()) OVER ([silentdays]) Where [silentdays] is the column containing your groups. 123 is just a random seed. Count(RowId()) OVER ([silentdays]) is the total number of rows for each group. For each group, we extract a random number between 1 and this count, and we divide it by the count, so this number is bound between 0 and 1 and should uniformly sample each group. Then, create a boolean column to separate out what is included in the sample If([rand1]
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now