Jump to content

Generate extra rows for each month between dates in two columns

Go to solution Solved by Andre Kooy,

Recommended Posts

Hi, I  have a data set that contains various columns, including a START and END date.

I want to expand each date range with a row per month in the date period, so for example a set that starts looking like this:

Start End Person Project
01-Jan-24 01-Apr-24 name1 project1
01-Feb-24 01-Apr-24 name1 project2
01-Jan-24 01-Mar-24 name2 project3
01-Feb-24 01-Apr-24 name2 project4



Would be expanded to more rows like this where there is a row for the 1st of the Month for every month a task is ongoing (in my test data the tasks all end on the 1st of the month but this isn't always the case).  In this example the first 3 rows are from the 1st input row, then the next 2 are from the second input row and so on:

Start End Person Project
01-Jan-24 01-Apr-24 name1 project1
01-Feb-24 01-Apr-24 name1 project1
01-Mar-24 01-Apr-24 name1 project1
01-Feb-24 01-Apr-24 name1 project2
01-Mar-24 01-Apr-24 name1 project2
01-Jan-24 01-Mar-24 name2 project3
01-Feb-24 01-Mar-24 name2 project3
01-Feb-24 01-Apr-24 name2 project4
01-Mar-24 01-Apr-24 name2 project4


It's similar to this example I've found in some old documentation https://support.tibco.com/s/article/How-to-create-a-continuous-date-range-given-start-and-end-dates-in-TIBCO-Spotfire-with-Data-Functions but I've not used R within Spotfire (or much at all) so am struggling to implement this.
Is anyone able to point me in the right direction?

Link to comment
Share on other sites

As a possibly clearer (or possibly not) indication of what I'm trying to do, the Sheet1 on left is the format of my original data set, the Sheet2 on right is an example I have manually prepared, but would like to automate on a large set.  Also an example of the kind of output my users are looking for, tracking various outputs over time from a set of data loaded with start/end dates.

Link to comment
Share on other sites

Posted (edited)

Yes, I'm looking at doing this in the database using SQL, but the person responsible for the database in question is currently reluctant to deploy the view that I had prepared.  The article you linked too may be useful, that's not a function I'm aware of so will investigate.  I don't want to spread the values out across the rows, but want to repeat them.

I'd hoped to be able to adapt the examples in the article, but these seem to sum the values whereas I need a defined start and end date.

Edited by Paul Hickford 2
Link to comment
Share on other sites

Hmm, if I try and run the SQL suggested in Spotfire I just get a wall of errors, but running the same query directly in database either seems to run forever in a test set limited to a single user, or gets an ORA-01436 CONNECT BY loop in user data if I remove the dbms_random.string line.

I'm on the verge of giving up and just accepting that Spotfire can't do this even though it feels like it should be trivial, and it seems to be a simple request in a lot of other software.  I can make the transformation quite easily in Knime but was hoping to remove the need for any manual data manipulations.  

Link to comment
Share on other sites

You can do it with a data function, as you are generating a table that has more rows than the initial.
Here I have done it with Python.
You need to have the Python packages pandas and dateutils available. Probably dateutils needs installing with Menu > Tools > Python tools > Package management.

import pandas as pd
from dateutil.relativedelta import relativedelta

# Expand as many rows as difference in months
def months_delta(a,b):
    r = relativedelta(b, a)
    return (r.years * 12) + r.months

#Add a running month to the start date
def running_month(a,m):
    aa = a+relativedelta(months=m-1)
    return aa.strftime("%d-%b-%Y")

df['TASK_START'] = pd.to_datetime(df['TASK_START']) 
df['TASK_END'] = pd.to_datetime(df['TASK_END']) 

df['__DELTA']= df.apply(lambda x: months_delta(x['TASK_START'], x['TASK_END']), axis=1)

df2['__MONTH_INDEX'] = df2.groupby(['USER','PROJECT']).cumcount()+1

df2['TASK_START']= df2.apply(lambda x: running_month(x['TASK_START'], x['__MONTH_INDEX']), axis=1)
df2['TASK_END'] = df2['TASK_END'].dt.strftime("%d-%b-%Y") 



Link to comment
Share on other sites

Hi Paul,

It can also be done just in Spotfire.

You have your original file (data table 1). Add data table 2 that has all months of the year and a dummy column with value "X".

Now create data table 3 that does a pivot on data table 1, that has only the project names as identifiers (so just 1 column). Then add a transformation for a calculated column called Dummy with value "X" and add it to data table 3. This gives you a table with all months for all projects.

Then create data table 4 with the following steps:

1) load data table 1

2) join data table 3, joining on project name. 

3) add the following calculated column If([Month]>[End],"Exclude","Include")

Now filter to the included rows only.

I added the dxp I created as an example

Extra rows.dxp

Link to comment
Share on other sites

  • 2 weeks later...

Many thanks all for the suggestions.  I had a look at the two approaches of Python and using the separate tables in the Data Canvas.  I got the Python version working, but I the Data Canvas approach seems to have much better performance;  I set the various dummy columns and filters to be used in the necessary transformations and then removed so the end users won't need to deal with any extra columns.  It will break in a hundred years or so when my list of months runs out, but will deal with that later.

  • Like 1
Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
  • Create New...