Jump to content

Recommended Posts

Posted

Can you upload sample data or at least explain the data structure and how the sample should be taken (a range of days 0 to 10 does not guarantee 10% so I am not sure I understand).

 

Where should this sampled data table be stored A new data table with 10% of the rows, or a marker column on the existing data table specifying which rows should be included or excluded

 

Gaia

Posted
It is a set of data of 200K records with phone number and days of silent (no calls). I created a column with the range of silent days and need a sample of 10% of them. The idea is to flag those 10%.
Posted

So I understand you want to generate a stratified sample, i.e. a sample that preserves population proportions of your groups, your groups being the silent days ranges.

It would be easy with a TERR data function but you did not say whether this is an option.

I can suggest a workaround, which should be good enough since you have a lot of rows, unless some of your groups are incredibly rare (fewer than 10 instances - but then no stratified sampling would do).

 

Create a calculated column with the following expression (call it e.g. [rand1]):

RandBetween(1,Count(RowId()) OVER ([silentdays]),123) / Count(RowId()) OVER ([silentdays])

Where [silentdays] is the column containing your groups. 123 is just a random seed.

Count(RowId()) OVER ([silentdays]) is the total number of rows for each group. For each group, we extract a random number between 1 and this count, and we divide it by the count, so this number is bound between 0 and 1 and should uniformly sample each group.

 

Then, create a boolean column to separate out what is included in the sample

If([rand1]

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
×
×
  • Create New...