Configuring the General tab

The General tab in the Properties dialog of all PolyAnalyst nodes looks almost the same. Below you can see a screenshot of the CSV Source node General tab.

The tab contains a number of options, which vary slightly depending on the type of the node. All options on the General tab are optional.

In general, the following options are available:

The Name option (the node name)

The Name text field is used to enter the name of the node. We recommend giving the node a name that explains its purpose; this allows both the node creator and other project users to quickly understand the function of the node later. For example, the Internet Source node can be named after the website whose content is extracted. By default, the node uses its node palette name.

Users can rename the node later.

Note that the length of the node name is limited: you can use a maximum of 1000 symbols; otherwise, the node name configuration cannot be saved.

The Auto memory option

This option is always checked by default. Some nodes do not display the option because their memory cannot be managed directly. The option can be used to tune node performance. Some nodes have memory defaults in their default configuration (Administrative Tool > Server settings > Default user settings Default project settings > Node defaults). Other nodes use the value of the Maximum memory used per node setting in the default project settings.

The Auto memory option controls how PolyAnalyst Server allocates virtual memory (RAM) for carrying out the node’s processing instructions when the node is executing. If the option is checked, PolyAnalyst dynamically assesses the memory needs of the node during processing and allocates and releases memory as necessary. If the option is unchecked, PolyAnalyst uses the memory setting from the project properties, which can be configured for each project through the Settings menu. This is a static (constant) setting, global to all nodes in the project, whereas the Auto memory option is configured per node.

If the full memory available to PolyAnalyst Server is not being used, the option has effectively no impact on processing performance. It only plays a role when all available memory is in use and operations start competing for memory.

If you are unaware of how virtual memory works, or are not experiencing any performance issues, you can safely ignore this option in all nodes.

The Inherit highlighting option

This option is always present on the General tab and is always checked by default. It allows dataset nodes to inherit text highlighting from parent nodes.

For example, a user can configure and execute the Score Taxonomy node, which will highlight text fragments, and connect it to another node which represents a dataset, for example, Filter Rows. The latter will inherit the highlighting from the parent node.

Inherited highlighting helps users understand, at later stages of their analysis, that some text appears in a certain dataset only because it was found to contain certain character sequences earlier.

Although the effect of this option is directly visible only in text analysis nodes, it is checked by default in all other nodes as well. If you uncheck it in one of the nodes in a node chain, all highlighting will be permanently lost.

The Use soft time out option

This option is available for most data source nodes and for individual nodes in other branches of the node palette, for example, the Case-Based Reasoning data analysis node. Such nodes work iteratively, processing data in cycles.

If you check this option, the node limits its own execution time. Even if there is a lot of data left to process, the node gradually winds down and finishes. By enabling the option, users accept that they may have to settle for imperfect results in exchange for a shorter processing time.
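The idea of a soft time-out can be illustrated with a minimal Python sketch. It is not PolyAnalyst code: the function name `run_with_soft_timeout`, the `budget_seconds` parameter and the rule of checking the clock between iterations are all assumptions made for illustration. It only shows how an iterative process can stop early and return partial results.

```python
import time


def run_with_soft_timeout(items, process, budget_seconds, clock=time.monotonic):
    """Process items one iteration at a time until done or the budget is spent.

    Returns (results, completed). completed is False when the loop stopped
    early, in which case results holds only the items processed so far.
    All names here are hypothetical and do not come from PolyAnalyst.
    """
    deadline = clock() + budget_seconds
    results = []
    for item in items:
        # The budget is checked between iterations, never in the middle of one,
        # so the work already done is always left in a usable state.
        if clock() >= deadline:
            return results, False
        results.append(process(item))
    return results, True


# With a generous budget, every item is processed.
squares, completed = run_with_soft_timeout([1, 2, 3], lambda x: x * x, 60.0)
```

The caller receives a flag alongside the results, which mirrors the trade-off described above: a shorter run in exchange for possibly incomplete output.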

The Staged option

This option is not available in all PolyAnalyst nodes.

Staging a node provides a way to speed up the loading of the view of a node’s results. By default, most nodes are not staged, because staging may involve storing a large amount of redundant data.

By default, PolyAnalyst optimizes how nodes store output data.

For example, consider a flowchart containing a data source node connected to a Filter Rows node. The documentation describes the Filter Rows node as creating a new subset of data that is essentially a copy of some of the records output from the data source node. However, this is not what actually happens. PolyAnalyst instead stores record pointers, which are basically record numbers (e.g. row 1, row 3, row 4), in the output of the Filter Rows node. When you view the output of the Filter Rows node, PolyAnalyst dynamically creates a full view of the records from the record pointers.

Storing only record IDs in subsequent nodes, as in the above example, is a great way to save disk space. For a given record, storing a single value instead of all of the values of the record is an obvious improvement. However, the tradeoff is that the view of the record must be recreated whenever you view the data, which is slower than when the record itself is immediately available.
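The difference between storing record pointers and storing full records can be sketched in Python as follows. This is an illustration only and says nothing about how PolyAnalyst is implemented internally: the classes `PointerView` and `StagedView`, the sample rows and the `high_sales` filter are all hypothetical.

```python
source_rows = [
    {"id": 1, "city": "Oslo", "sales": 120},
    {"id": 2, "city": "Rome", "sales": 80},
    {"id": 3, "city": "Lima", "sales": 150},
    {"id": 4, "city": "Kyiv", "sales": 200},
]


class PointerView:
    """Keeps only row positions; rows are rebuilt from the parent on every read."""

    def __init__(self, parent_rows, predicate):
        self.parent_rows = parent_rows
        self.pointers = [i for i, row in enumerate(parent_rows) if predicate(row)]

    def rows(self):
        # Every view pays the cost of following the pointers back to the parent.
        return [self.parent_rows[i] for i in self.pointers]


class StagedView:
    """Keeps full copies of the selected rows; reads need no parent lookup."""

    def __init__(self, parent_rows, predicate):
        self.stored = [dict(row) for row in parent_rows if predicate(row)]

    def rows(self):
        return self.stored


def high_sales(row):
    return row["sales"] > 100


unstaged = PointerView(source_rows, high_sales)  # stores positions 0, 2, 3
staged = StagedView(source_rows, high_sales)     # stores three full records
```

Both objects return the same records. The pointer view stores three integers (zero-based positions corresponding to rows 1, 3 and 4) and rebuilds the records on demand, while the staged view stores three complete records and returns them directly.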

If you find that you are working with what you consider to be a small amount of data, or have plenty of storage space, and want to improve the speed at which node results are loaded into view, you can toggle the Staged option for various nodes.

To stage a node, check its Staged checkbox on the General tab of its properties, and then re-execute the node. From this point onward, the node is staged. You can return later and uncheck the box and re-execute the node if you are concerned about disk space.

The Staged option is not available for data source nodes because it is not relevant in an import context. Imported data is always copied. PolyAnalyst only performs the "pointer optimization" when working with non-data source nodes.

Access to datasets can slow down as the node sequence grows longer, that is, as the distance increases between the node requesting a column and the node where the column is defined (one of the data source nodes, Derive, Aggregate, Generic Dataset, etc.). The degree of the slow-down also depends on which nodes lie between the source of the data and the node that consumes it.

For example, Filter Columns, Derive and Score nodes do not slow down the process of retrieving data from a data source, while Filter Rows, Sample, Join and some other nodes do.

If you stage a node, it becomes a data source itself, and the node sequence from the source node to the end user node is shortened. Consequently, we recommend that you stage some nodes at certain key points on the flowchart, for example, after the data has been transformed and fully prepared for further analysis, before the analysis itself starts. We also recommend staging nodes during data preparation, for example, every 4th or 8th node, depending on how much disk space you are ready to sacrifice to boost the program’s performance.
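The effect of staging on the length of the node sequence can be sketched in Python. The model is deliberately simplified and hypothetical: the `Node` class and the `hops_to_stored_data` function are invented for illustration, every intermediate node is counted as one hop (even though, as noted above, some node types do not slow down data retrieval), and each node has a single parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    staged: bool = False


def hops_to_stored_data(node):
    """Count the nodes traversed before reaching a node that holds stored data."""
    hops = 0
    current = node
    # A node without a parent is a data source; a staged node acts like one.
    while current.parent is not None and not current.staged:
        current = current.parent
        hops += 1
    return hops


source = Node("CSV Source")
filter_rows = Node("Filter Rows", parent=source)
sample = Node("Sample", parent=filter_rows)
derive = Node("Derive", parent=sample)
analysis = Node("Case-Based Reasoning", parent=derive)

hops_before = hops_to_stored_data(analysis)  # walks all the way to CSV Source
derive.staged = True                         # stage the last preparation node
hops_after = hops_to_stored_data(analysis)   # stops at the staged Derive node
```

In this toy chain, staging the last data preparation node reduces the path from four hops to one, which corresponds to the recommendation to stage a node once the data is fully prepared and before the analysis starts.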

If the Staged option is checked, only the results of the node are staged, while everything else (indices, highlighting and other internal information) is irrevocably lost. If a text analysis node is used after a staged node, all the required text indices created before the staged node will be lost, and the system will create them again, as if they had never existed. This has two consequences. First, re-indexing large datasets requires more processing time. Second, it can lead to unexpected results.

For example, if a Keyword Extraction node precedes a staged node, and you intend to use a taxonomy with the keyword function right after the staged node, the system will display an error message, because the results of text indexing in Keyword Extraction will be lost. After a staged node, only basic indexing with default settings is performed. If this does not meet the user’s expectations, they will have to configure and execute all the required text analysis nodes after the staged node.

The Description option (the node description)

The Description text field can be used to enter any custom description of the node. For example, users can briefly describe what has already been done and what remains to be done, comment on the node configuration or on unexpected results or behavior, and so on.