  • Spotfire® Streaming Python Operator Notes


    A Little History and Product Terminology

    Spotfire® Streaming 7.7.3 and 10.2.1 introduced Python operators to Streaming.

    The branding Spotfire® Streaming was introduced as of the 10.4.0 releases. Spotfire® Streaming includes StreamBase, and for purposes of this page, the term StreamBase is used. However, the notes here apply to Spotfire® Streaming, TIBCO Streaming, TIBCO StreamBase, TIBCO LiveDatamart, TIBCO Spotfire Data Streams, and TIBCO Data Streams as well -- it's all the same underlying technology with different product names, license scopes, and expected ranges of application.

    Disclaimer

    Some of the information on this page was contributed by Spotfire® employees with knowledge of the Spotfire® Streaming internals and implementations. This information is provided here for clarity and convenience only, and is not part of any Spotfire® product release. It may be inaccurate, and even if it is accurate, it may change without notice as new versions of the product are released. Some of the information on this page has already been incorporated into released product artifacts and, with any luck, will be removed from here once confirmed to be incorporated into the releases, in order to avoid potential inconsistencies.

    Here are some notes on using the Python operators that, at the time of the writing of these notes, weren't entirely explicit from the documentation and samples:

    Establishing the Python Ecosystem

    The purpose of the Python operators is to integrate with the user's Python applications and installations, not to provide a Python ecosystem as part of the StreamBase product.

    With that principle in mind, there is no Python included with Spotfire® Streaming. In order to use the Python operators, the user has to install Python on the same machine on which StreamBase is installed.

    Along the same lines, any Python packages or other dependencies used by the Python scripts executed by the Python operator must also be installed on the machine on which StreamBase and Python are installed. Deployment of these dependencies is not within the scope of the StreamBase product feature set.

    Setting up for the StreamBase Python Samples

    By default, the sample applications are configured to expect to find the python executable at a specific location. On Windows, for example, this is C:\Python\python.exe. If your Python install is in a different place, change (for StreamBase 10.5.0 and later) /sample_python/src/main/configurations/sbengine.conf and /sample_python/src/main/configurations/Python.conf to reflect the location of your Python installation. Prior to 10.5.0, these locations were a little different: /sample_python/src/main/configurations/streambase.conf and /sample_python/src/main/resources/adapter-configurations.xml.

    The Python operators support a number of Python distributions/implementations. However, if you want to explore learning how to use the Python operators by using the included Python Operator Samples fully, use the Python.org (or Anaconda) Python distribution and install Python 3.4 or later in order to use all the samples with ease. This is Python, so make sure you pick either the 32-bit or 64-bit architecture for Python and everything it uses, and be consistent with whichever you choose.

    The Local Instance Sample requires Python 3.x. It doesn't work with Python 2.7, at least as of StreamBase 10.2.1.

    To run the TensorFlow sample, the TensorFlow and OpenCV-Python packages must be installed.

    • The procedure to install packages will vary with your Python distribution and version. There are some steps documented in the README for the Sample.
    • Here are some additional notes on using Python.org 3.6 on Microsoft Windows.
      • Make sure both the Python install directory and its scripts subdirectory are on your PATH environment variable
        • For example, on Windows, PATH=C:\Python36;C:\Python36\scripts;%PATH%
      • To install these, go to your command line and install the packages; pip is usually used. (A quick import check to verify the installation appears after this list.)
        • Install the packages with pip or pip3:
          • pip install opencv-python
          • pip install tensorflow
      • On Windows at the time of this writing (2018), the current version of the Python TensorFlow library requires msvcp140.dll. This is in the Visual C++ Redistributable for Visual Studio 2015, which needs to be downloaded and installed.
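
    If the installs succeeded, a quick check like the following (a sketch; run it with the same Python executable the operator is configured to use) should print both package versions:

    # cv2 is the import name provided by the opencv-python package
    import cv2
    import tensorflow as tf

    print("OpenCV:", cv2.__version__)
    print("TensorFlow:", tf.__version__)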

    How the Python Operator Executes Python Scripts

    Each Python instance created by the operator is a sub-process of the StreamBase engine that creates the instance. The top-level script in each instance, which we may call the worker, runs for the entire lifetime of the instance and communicates with the Python operator instance over a local socket connection, across which the script to be executed, as well as the input and output variable names and values, are serialized and deserialized.

    The aforementioned worker script, which happens to be called worker.py at least as of Streaming 10.6.1, makes good reading. While it is totally unsupported to rely on its details from release to release, it's right there in your local Maven repository (after you've installed StreamBase and its Maven dependencies for the Python operators, of course) at <your local repository>\repository\com\tibco\ep\sb\adapter\python\10.6.1\python-10.6.1.jar\com\streambase\sb\adapter\python\resources\worker.py, and you could do worse than have a look to understand what's going on at runtime.

    One of the more salient things to understand from the worker is that the Python script executed by the Python operator is invoked using Python's built-in exec() function. More specifically, the contents of the inputVars tuple are converted using the documented data type conversion rules to a Python dictionary, and that dictionary is passed to exec() as the global dictionary. Even more specifically, if we imagine that this global dictionary is named vars and the script is contained in a variable named script, that invocation is exec(script, vars). Therefore, both the global and local variables available to the script come from vars. The values of the outputVars tuple from the operator are taken from vars after the script is executed, as modified by the script's execution.
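
    As a minimal sketch of that mechanism (this is ordinary Python, not the actual worker.py; the names script and vars are illustrative):

    # stands in for the operator's script text
    script = "result = x + y"

    # stands in for the inputVars tuple converted to a Python dictionary
    vars = {"x": 2, "y": 3}

    # both the global and local namespaces for the script come from vars
    exec(script, vars)

    # outputVars values are read back from the same dictionary afterwards
    print(vars["result"])   # prints 5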

    There are multiple ways to specify the script to the Python operator, but ultimately they are all executed in the same way: as a string passed to exec(). It doesn't matter whether the script is presented to the operator as a text string or as a file. If a file, the content of the file is read by the operator (that is, by Java), converted to a java.lang.String, and then serialized to the Python instance, where the string is passed to exec() as above.

    The Python instances spawned by the operator don't necessarily have any special access to the resources on the StreamBase engine's runtime classpath. Once control is passed to the Python instance, Python rules apply for finding referenced files and modules. The current working directory for the Python instance is either the current working directory of the StreamBase engine, or whatever current working directory is configured for that instance in the HOCON configuration file for the adapter. The Python interpreter is invoked as "python -", meaning that it gets its input from stdin (which is a pipe from the StreamBase engine process), but also, at least as of Python 3.7, that the interpreter's current working directory is added to the start of the instance's sys.path. This behavior can be quite handy, but it is also worth keeping in mind to avoid surprises as to what sys.path searches at runtime might uncover.
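
    A quick way to see what a given instance actually resolves is to run a tiny diagnostic like this as the operator's script (a sketch; nothing here is operator-specific):

    import os
    import sys

    print("cwd:", os.getcwd())
    print("sys.path:", sys.path[:3])   # when invoked as "python -", the entry representing the cwd comes first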

    Debugging and Testing Python Scripts Executed by the Python Operator

    Python sub-process hangs. It can be difficult to see what's going on when running Python scripts in Python sub-processes under StreamBase control. The processes often just appear to hang in the face of certain types of uncaught exceptions (especially syntax errors, but also some, though not all, runtime errors). This behavior is not yet reproducible with a simple script, so it's hard to give an example here, but you'll know it when you see it. When the Python process is stuck, it's generally time to stop the node. However, the StreamBase engine shutdown initiated by the stop node action will often also get stuck trying to clean up the Python process, after which the node declares itself to be in a corrupted state and has to be forcefully stopped, then removed and re-installed.

    Therefore, in light of the process behavior noted above, it quickly becomes a useful practice to thoroughly test and debug one's Python scripts outside the context of the StreamBase Python operator, simulating inputVars by passing a dictionary of the desired inputVars values to the script as its globals dict and then observing the globals on output, perhaps under the control of the Python unittest framework.
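
    For example, a harness along these lines (a sketch; my_script.py and the variable names are hypothetical) exercises a script the same way the operator does:

    import unittest

    class PythonOperatorScriptTest(unittest.TestCase):
        def test_same_input_same_output(self):
            with open("my_script.py") as f:       # hypothetical script under test
                script = f.read()
            g = {"x": 2, "y": 3}                  # simulated inputVars
            exec(script, g)
            self.assertEqual(g.get("result"), 5)  # simulated outputVars

    if __name__ == "__main__":
        unittest.main()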

    Remember that the Python sub-processes exist, more or less, for the lifetime of the StreamBase engine that creates them. The Python script executed by the operator, therefore, runs in an interpreter process whose state persists between script executions in the same process. That is, if you want the process to always do the same thing when presented with the same input, be very careful how your script execution side-effects the state of the process. Idempotence testing, therefore, should be part of your script unit-testing practice.
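
    To make the hazard concrete, here is a sketch of a script that is not idempotent precisely because interpreter state persists between executions in the same instance:

    # run twice by the operator in the same instance, this produces different
    # output for identical input, because 'counter' survives between executions
    if "counter" not in globals():
        counter = 0
    counter += 1
    result = counter   # 1 on the first execution, 2 on the next, and so on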

    The Python operator does very little logging regarding the processing of input tuples, so raising the log level to see what's going in and out of the operator instances doesn't yield much information. It does have nice logging around the AMS/ADS functionality, however.

    Using a Python Debugger with the Python Operator (rpdb)

    The sections above lean on the tried-and-true debugging style of print() invocations to stdout/stderr. But is it possible to use a Python debugger to debug a Python script invoked from the Python Operator?

    Why, certainly! Here is an example: Using the rpdb Python Debugger with the StreamBase EventFlow Python Operator

    Specifying the python -u command line argument

    It's useful to configure Python instances to add -u as an argument to python, if perhaps only during development work. This argument makes the stdout and stderr streams from the Python instances unbuffered. That is, you won't have to wait for the buffers to flush in order to see the output from these important streams.

    That said, while there's a nice example of how to set -u in the StreamBase 7.7 sample project using the now-outdated adapter-configurations.xml format, there's as yet no published example of how to do this with the relatively new StreamBase 10.5.0+ HOCON com.tibco.ep.streambase.configuration.adapter configuration file type. Below is an example that has been tested empirically with StreamBase 10.6.1 on Windows:

    name = "Python.conf"
    type = "com.tibco.ep.streambase.configuration.adapter"
    version = "1.0.0"
    configuration = {
      AdapterGroup = {
        adapters = {
          python = {
            sections = [ 
              {
                name = "python"
                settings = {
                  executable = "C:/python374/python.exe"
                  instance = "pythonic"
                  useTempFile = "false"
                }
                sections = [
                  {
                    name = "arguments"
                    settings = {
                      val = "-u"
                    }
                  } 
                ]
              }
            ]
          }
        }
      }
    }

    Only one arguments setting is shown here because, as also stated below, only the last-occurring such setting seems to be applied at runtime.

    captureOutput is Broken on Windows

    At least as of StreamBase 10.6.1, the captureOutput adapter setting doesn't appear to do anything at all on Windows. This behavior may be a bug. It is rumored to work on Macs. The intended functionality appears to be that the contents of the stdout and stderr streams from the Python instances end up as StreamBase engine INFO log messages when captureOutput is true. Otherwise, I suppose, they go wherever the StreamBase engine's stdout and stderr end up, which is not always easy to determine.

    Capturing stdout and stderr to files

    In order to see what is happening in the Python instance, consider simply opening a file in the current working directory (or elsewhere) in the Python script itself and passing the resulting file object to print() via its file= argument.
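
    For instance (a sketch; the file name is arbitrary, and the working directory rules described earlier determine where a relative path lands):

    # line-buffered append, so output appears without waiting for a flush
    mylogfile = open("python_operator.log", "a", buffering=1)
    print("processing input x =", globals().get("x"), file=mylogfile)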

    A note/rant about data type mapping between StreamBase and Python

    Python is very much an interpreted-style language with rather dynamic, often polymorphic, typing. User-defined types abound. StreamBase EventFlow is almost entirely a very, very strictly typed compiled language. Making new types in StreamBase? Sorry, the most recent new data type was added in 2010, and there's no way for users to make their own. These are two very different viewpoints in language design. You might imagine, then, that passing data between Python and StreamBase can be challenging. And you would be correct in so imagining.

    The Python operator has a nice set of natural type conversions it implements automatically. The challenge comes when, for example, Python has a list where each element may be a value of any type. In StreamBase lists, all the elements of the list must be the same type. Indeed, in StreamBase there isn't really a data type that's just list. The complete data type, rendered in English, would be something like list of ints or list of tuples of named schema MySchema. Ruthlessly enforced. No mercy whatsoever. The Python operator does not assist us with these fairly common integration scenarios.

    As an implementation approach, have a look at the StreamBase TERR operator. R is another polymorphic interpreted language. The interface to the TERR operator has evolved conventions for translating R types to and from StreamBase types by providing a set of lists of various StreamBase types onto which incoming and outgoing R variable values may be hung and from which they may be grabbed. This is perhaps a topic that is due a longer treatment elsewhere, so let this note primarily serve to raise the issue. Also, some advice: do not try overly hard to loosen StreamBase's strict data typing conventions. The reality is that EventFlow is going to channel your types into a pretty narrow type system. (If you have to have what amounts to user-defined types, try JSON, BSON, or XML as a way to carry your types on top of the StreamBase type system, using string and blob fields to carry the payloads.)
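
    As a sketch of that last suggestion (the variable names are illustrative), heterogeneous Python data can ride through a StreamBase string field as JSON:

    import json

    # a heterogeneous list that has no corresponding StreamBase list type
    mixed = [1, "two", 3.0, {"four": 4}]

    # assign to a string outputVars field on the way out of Python
    payload = json.dumps(mixed)

    # ...and decode from a string inputVars field on the way back in
    roundtrip = json.loads(payload)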

    Saving call tracing to a file

    The Python standard trace module only outputs to stdout, which can be hard to capture, depending on the platform environment. One quick way to do Python call tracing to a file (and also to "log" the globals in and out of each function invocation) is to leverage kindall's answer to this StackOverflow post: https://stackoverflow.com/questions/8315389/how-do-i-print-functions-as-they-are-called. With grateful attribution and the required link to the license, and hereby indicating that changes were made to the licensed material, here is the modified code snippet:

    import sys

    # set inputVars.myTrace to true or false accordingly in StreamBase

    # quick trace-to-file utility
    if 'myTrace' in globals() and myTrace:
        tracef = open('tracefile.txt', 'w', 1)   # truncate any earlier trace file
        tracef.close()
        tracef = open('tracefile.txt', 'a', 1)   # reopen line-buffered, for append

    def tracefunc(frame, event, arg, indent=[0]):
        if event == "call":
            indent[0] += 2
            print("-" * indent[0] + "> call function", frame.f_code.co_name, file=tracef)
        elif event == "return":
            print("<" + "-" * indent[0], "exit function", frame.f_code.co_name, file=tracef)
            indent[0] -= 2
        return tracefunc

    if 'myTrace' in globals() and myTrace:
        sys.setprofile(tracefunc)
        print('globals in: ' + str(globals()), file=tracef)

    # now execute your real script code
    . . . .

    # then
    if 'myTrace' in globals() and myTrace:
        sys.setprofile(None)   # stop tracing before the final print
        print('globals out: ' + str(globals()), file=tracef)

    Python Operator Documentation Clarifications

    As of StreamBase 10.6.1, there are some parts of the Using the Python Operator documentation page that seem inconsistent with the current implementations:

    • In the Configuring a Global Python Instance section, it says, in the first paragraph "The global Python instance environment is configured in the module's configuration file, src/main/resources/adapter-configurations.xml." The reference here to adapter-configurations.xml is outdated. Global Python instances are now configured in a HOCON Configuration file of type com.tibco.ep.streambase.configuration.adapter as stated further down in the same section.

    • In the Configuring a Global Python Instance section, the envVariables property states that "You can use more than one <setting> line." Updated, that sentence would read "The settings object may have more than one envVariables setting."

    • In the Configuring a Global Python Instance section, the arguments property is said to be accepted as either a setting or a section in the configuration file. Empirically, if the property appears in the settings object of the python section of the configuration, it has no effect. If instead there is an arguments section list at the same level as the settings object, then the last val object in the section is effective, and all earlier val objects are ignored. In other words, it is possible to pass exactly one argument to python this way, but no more.

    • In the Configuring a Global Python Instance section, the arguments property documentation states: "The usual use for this property is to pass -u, which forces Python to use unbuffered stdin, stdout, and stderr streams." The behavior of the python -u option is version-specific (and perhaps Python-implementation-specific as well). For example, for Python.org CPython, beginning in Python 3.3, only stdout and stderr are unbuffered by this option.

    Using Anaconda Python with the TIBCO Streaming Python Operators

    The Anaconda distribution has wonderful support for using the idea of environments to accommodate switching among multiple installations of different versions of Python on the same machine. Streaming Product Support has stated that it is supported to use Anaconda Python with TIBCO Streaming. That said, there are some things to note about integrating the two.

    There is no particular magic support in TIBCO Streaming for Anaconda's resulting Python path manipulation. Anaconda Python requires a number of directories to be on the runtime PATH of the python process. This is managed for the user when using the Anaconda Prompt (CMD) and the conda env utility, but outside of that controlled environment, it is up to the Streaming application developer/deployer to ensure that the environment of the user under which the Streaming application is running, and/or of the Streaming engine process itself, has the right directories on its PATH for the Python version and installation you want to use. The details of how this works will differ among StreamBase Studio, the StreamBase Command Prompt, and a Windows Service, and will likely also differ depending on whether Anaconda was installed for one user or for all users on the Windows machine. Make sure to add figuring this out to the list of application development and deployment issues to deal with in any project that uses Streaming and Anaconda together.

    Here's a concrete example. Suppose Anaconda has been installed in C:\ProgramData\Anaconda3. The Anaconda directories needed to support the base environment Python installation then look something like:

    C:\ProgramData\Anaconda3;C:\ProgramData\Anaconda3\Library\mingw-w64\bin;C:\ProgramData\Anaconda3\Library\bin;C:\ProgramData\Anaconda3\Scripts;C:\ProgramData\Anaconda3\bin

    For a user-created conda environment for some other version of Python, the list may look something like:

    C:\Users\me\.conda\envs\python374;C:\Users\me\.conda\envs\python374\Library\mingw-w64\bin;C:\Users\me\.conda\envs\python374\Library\bin;C:\Users\me\.conda\envs\python374\Scripts;C:\Users\me\.conda\envs\python374\bin
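
    One way to confirm what a running instance actually sees is a tiny diagnostic script executed by the operator (a sketch; nothing here is Anaconda- or Streaming-specific):

    import os
    import sys

    print("executable:", sys.executable)
    print("PATH:", os.environ.get("PATH", ""))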

