  • Statistica Performance with SSD vs HDD Drives


    Statistica is optimized for analyzing data in memory and has custom caching for handling large datasets, in addition to the caching done by the operating system. Statistica can use all of the free disk space for cache files to complete an analytic project. Therefore, we recommend installing Statistica on its own drive and moving the system TEMP folder for cached files to that same drive. The operating system and other applications should be kept on separate drives. 
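A quick way to check which folder and drive temporary files currently land on, and how much free space that drive has, is sketched below. It is only an illustration: `temp_drive_report` is a hypothetical helper name, and the script reports the temp folder that Python resolves from the environment (`TMPDIR`, `TEMP`, `TMP`), which is assumed, not confirmed, to match the folder Statistica uses for its cache files.

```python
import os
import shutil
import tempfile


def temp_drive_report(path=None):
    """Return the temp folder in use and the space on its drive, in GB.

    Hypothetical helper for illustration; not part of Statistica.
    """
    temp_dir = path or tempfile.gettempdir()
    usage = shutil.disk_usage(temp_dir)
    # splitdrive() returns e.g. "D:" on Windows and "" on POSIX systems.
    drive = os.path.splitdrive(os.path.abspath(temp_dir))[0] or "/"
    return {
        "temp_dir": temp_dir,
        "drive": drive,
        "free_gb": usage.free / 1024**3,
        "total_gb": usage.total / 1024**3,
    }


report = temp_drive_report()
print(f"TEMP folder: {report['temp_dir']} (drive {report['drive']})")
print(f"Free space:  {report['free_gb']:.1f} GB of {report['total_gb']:.1f} GB")
```

If the reported drive is the same one that holds the operating system, the TEMP folder has not been moved as recommended above.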

    With Statistica and the TEMP folder isolated on one drive, we compared performance on SSD and HDD drives. We used a Samsung SSD drive and Fusion-io ioDrive HDD drive on a 2.20GHz AMD quad-core processor running a 64-bit Windows operating system with 8GB of RAM.

    When working with large data sources, the size of the dataset relative to the available physical memory has a significant impact on performance. There are two common performance bottlenecks: data access, and writing and reading cached data. Caching to disk happens for several reasons. Sometimes the entire dataset will not fit in memory. In other cases the algorithm requires multiple passes through the data to complete the analysis, so the dataset plus the intermediate results will not fit in memory. 

    Because of this, we ran three different tests:

    1. Open a data file (a Statistica spreadsheet) that fits in memory
    2. Open a data file that does not fit in memory and execute descriptive statistics algorithms
    3. Run data management algorithms that need multiple passes through the dataset and do not fit in memory

    In all cases, files stored on HDD loaded quickly, even when opening the file for the first time. 

    Files stored on SSD disk had different results depending on their size compared to the amount of physical memory used by the operating system for file caching. For files that were smaller than the available cache, the SSD files took longer to load on the first run, but on subsequent runs the times were similar to HDD, indicating that files were cached by the operating system. 

    Test One - Fits In-Memory

    The times listed below are for a 2GB random-filled Statistica Spreadsheet, consisting of 9,000,000 cases by 30 variables, and performing a subset operation to select about 50 percent of the cases into a new spreadsheet.

    File opened on SSD:

    • 1st pass: 72.4 seconds
    • 2nd pass: 27.5 seconds

     File opened on HDD:

    • 1st pass: 27.2 seconds
    • 2nd pass: 27.4 seconds

    Once the file was cached in RAM, the second use of the dataset within the same application instance on SSD performed about the same as on HDD.
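The first-pass versus second-pass effect can be observed outside Statistica by timing two consecutive reads of the same file. The sketch below is a generic illustration, not the procedure used for the timings above; `timed_read` and `compare_passes` are hypothetical names. The demo file is small and freshly written, so it is probably already in the operating system cache and both passes will be fast. To see a cold first pass, point `compare_passes` at a large existing file that has not been read since boot.

```python
import os
import tempfile
import time


def timed_read(path, chunk_size=1024 * 1024):
    """Read the whole file in chunks; return (bytes_read, elapsed_seconds)."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def compare_passes(path, passes=2):
    """Read the same file several times in a row and collect the timings."""
    return [timed_read(path) for _ in range(passes)]


# Demo on an 8 MB scratch file filled with random bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(8 * 1024 * 1024))
    demo_path = tmp.name

try:
    results = compare_passes(demo_path)
    for i, (n_bytes, seconds) in enumerate(results, start=1):
        print(f"pass {i}: {n_bytes} bytes in {seconds:.4f} s")
finally:
    os.remove(demo_path)
```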

    Test Two - Does Not Fit In-Memory

    We used a file larger than the operating system (OS) cache, so the OS could not cache the entire file and multiple cached files were needed. The difference in processing speed was large. 

    This test case opened a 47GB random-filled Statistica Spreadsheet of 200,000,000 cases by 30 variables. Descriptive statistics (N, mean, standard deviation, minimum, and maximum) were then calculated on all variables, with execution parallelized across four CPUs. CPU utilization was monitored; low utilization means more time was spent waiting on disk access.

    Descriptive statistics on SSD: 

    • 432 seconds, overall CPU utilization around 32%

    Descriptive statistics on HDD: 

    • 87 seconds, overall CPU utilization around 90%

    Putting the file on HDD made this test about five times faster (87 seconds versus 432 seconds), and the higher CPU utilization confirms that the process became less I/O-bound and more CPU-bound. Note that performance is expected to improve for any operation where disk access time is large compared to calculation time, but the improvement will be smaller for calculation-intensive operations that are more CPU-bound.

    Test Three - Does Not Fit In-Memory

    For this test, we scripted several data management operations on a 9,000,000 case by 30 variable file; the script created several files in the TEMP directory. These algorithms make multiple passes through the data, which required disk caching.

     Data management script on SSD:

    •  312 seconds

      Data management script on HDD:

    •  101 seconds

    The results show that the script ran about three times faster on HDD.
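The ratios quoted for the three tests follow directly from the reported timings. As a small worked check, the snippet below recomputes them. The numbers are copied from the results above and the drive labels are kept as the article gives them; the `speedup` helper and the dictionary layout are only illustrative.

```python
# Reported timings in seconds: (time on SSD, time on HDD), labeled as in the article.
timings = {
    "Test One, 1st pass": (72.4, 27.2),
    "Test One, 2nd pass": (27.5, 27.4),
    "Test Two": (432.0, 87.0),
    "Test Three": (312.0, 101.0),
}


def speedup(slower_seconds, faster_seconds):
    """Return how many times faster the second timing is than the first."""
    return slower_seconds / faster_seconds


ratios = {name: speedup(ssd, hdd) for name, (ssd, hdd) in timings.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Test Two comes out just under 5x and Test Three just over 3x, which matches the "five times" and "three times" figures in the text. The second pass of Test One is essentially 1x, consistent with the file being served from the operating system cache.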

    Feedback (1)

    This article makes the erroneous claim that HDDs are faster than SSDs, which is not correct (from our experience). Fusion IO is not an HDD but rather a flash-based storage medium (like SSDs), as far as I understand from other documents about it.

     

     

    Flagged by : nverstegen URL : https://community.spotfire.com/users/nverstegen


    Don Johnson 9:10am Dec. 14, 2020

