PDL language

The Pattern Definition Language (PDL) is PolyAnalyst’s proprietary query language for retrieving information from unstructured text. PDL is a powerful tool for analyzing textual data that has no formal structure, such as news articles, blog posts, customer feedback, research papers, reports, and social media. PDL queries can retrieve many kinds of information: names of companies, vehicle models, contact information, product defects, names of drugs and chemicals, research article topics, share rates, market dynamics information, customer problems, and so on.

For instance, a query that searches for words in title case followed by the words "Ltd", "Co" or "Inc" retrieves company names ("Samsung Electronics Co.", "Apple Inc.", "Argus Solutions, Ltd", etc.). A query that matches mentions of ministries would look for the word "ministry" in title case, followed by the preposition "of" and a sequence of words in title case ("Ministry of Trade", "Ministry of Foreign Economic Relations", "Ministry of Internal Affairs", etc.).

PDL includes a wide range of features, from simple word matching to ontology-based search, and makes it possible to capture the different ways in which information of interest may be expressed.

For instance, PDL allows the user to search for words from a particular dictionary, for synonyms (such as "sales grow" and "sales increase"), for parts of speech (such as nouns or verbs), or to specify morphological features of words (such as the word "park" as a verb but not as a noun).

PDL supports advanced proximity search to match arguments in text within a specified distance. Proximity queries can specify the required distance, search within one or several sentences, set constraints on the word order and surrounding context, and indicate terms that must or must not occur.

Using PDL's syntax-based features, it is possible to query syntax trees and search for concepts connected by a syntactic relationship.

PDL also provides ontology-based functions to search for semantically similar or related terms from associated ontologies.

In order to retrieve complex information, search queries can be combined or nested inside each other.

PDL Syntax

A PDL query is a sequence of PDL functions, operators and strings.

Functions

The function name is followed by parentheses containing a comma-separated list of arguments. If a function takes no arguments, the parentheses are left empty. Function names are case-insensitive, but for readability it is recommended to use one style consistently, such as all lowercase or all uppercase.

Syntax

function_name([argument_1 [, argument_2 …​]])

Example

phrase(company, project) matches the word "company" followed by the word "project";

stem(advertise) matches all forms of the word "advertise" ("advertise", "advertised", "advertising", etc.);

number() matches all numbers.

Function parameters

Many PDL functions support optional parameters to change the function’s default behavior. Parameters are usually passed as the function’s first argument.

Example

The phrase() function matches arguments that follow each other within a sentence.

The first optional argument is used to set a maximum allowed distance between the arguments; by default this value is set to 1.

So, the query phrase(increase, sales) matches "company intends to increase sales", but not "company intends to increase annual sales", because in the latter example the distance between the arguments is greater than 1. The query phrase(2, increase, sales) matches both examples.
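The distance rule described above can be modeled in a few lines of Python. This is only a sketch of the semantics as stated in the text, not PolyAnalyst's implementation: the names tokenize and phrase_matches are hypothetical, the tokenizer is deliberately naive, and sentence boundaries are ignored.

```python
import re

def tokenize(text):
    # Assumption: lowercase word tokens only; punctuation is dropped.
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_matches(text, first, second, max_distance=1):
    """Toy model of phrase(): True if `second` follows `first`
    within `max_distance` token positions (1 = the very next token)."""
    tokens = tokenize(text)
    for i, token in enumerate(tokens):
        if token != first:
            continue
        # Look at the next `max_distance` tokens after `first`.
        window = tokens[i + 1 : i + 1 + max_distance]
        if second in window:
            return True
    return False

short_text = "company intends to increase sales"
long_text = "company intends to increase annual sales"

default_short = phrase_matches(short_text, "increase", "sales")
default_long = phrase_matches(long_text, "increase", "sales")
wider_long = phrase_matches(long_text, "increase", "sales", max_distance=2)
```

Under these assumptions the default distance accepts only the first text, while a distance of 2 accepts both, mirroring phrase(increase, sales) and phrase(2, increase, sales).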

Many PDL functions also support named parameters.

Syntax

parameter_name:=parameter_value

Example

By default, the near() function matches its arguments in text within a specified distance, regardless of their order.

The function supports several parameters that can be used to change its default behavior.

For instance, the named parameter max_gap defines the maximum number of tokens allowed between the function arguments; by default its value is 0 (no other tokens may occur between the arguments).

Thus, the queries near(Germany, Austria) and near(Germany, Austria, max_gap:=0) are equivalent. Both match "Germany, Austria and Switzerland" and "Austria, Germany and Switzerland", but not "Austria and Germany", because in that text there is a token between the function arguments. The query near(Germany, Austria, max_gap:=1) matches all three examples.

In the same manner, the named parameter allow_punct controls whether punctuation marks and spaces are allowed within the argument sequence; by default its value is "yes" (punctuation marks are allowed).

The query near(company, say, allow_punct:=no) matches "The company said it does not intend to seek another buyer", but not "'That’s how you understand the whole company', said John Smith, President of the bank".
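The interplay of max_gap and allow_punct can be illustrated with a small Python model. It is a sketch under stated assumptions, not the actual near() implementation: the function names are hypothetical, words are compared exactly (no stemming, so "say" would not match "said" here), punctuation marks are assumed not to count toward max_gap, and allow_punct is modeled as a boolean.

```python
import re

def tokenize(text):
    # Assumption: a token is either a run of word characters
    # or a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

def is_punct(token):
    return re.fullmatch(r"[^\w\s]", token) is not None

def near_matches(text, a, b, max_gap=0, allow_punct=True):
    """Toy model of near(): True if `a` and `b` occur in either order
    with at most `max_gap` word tokens between them."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    for i in pos_a:
        for j in pos_b:
            if i == j:
                continue
            lo, hi = sorted((i, j))  # order of arguments is irrelevant
            between = tokens[lo + 1 : hi]
            words = [t for t in between if not is_punct(t)]
            has_punct = len(words) != len(between)
            if has_punct and not allow_punct:
                continue  # punctuation between the arguments is forbidden
            if len(words) <= max_gap:
                return True
    return False

listed = near_matches("Germany, Austria and Switzerland", "Germany", "Austria")
joined = near_matches("Austria and Germany", "Germany", "Austria")
joined_gap1 = near_matches("Austria and Germany", "Germany", "Austria", max_gap=1)
strict = near_matches("Germany, Austria and Switzerland", "Germany", "Austria",
                      allow_punct=False)
```

In this model the comma in "Germany, Austria" is tolerated by default but blocks the match once allow_punct is switched off, while the word "and" is only tolerated when max_gap is raised to 1.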

For the full list of named parameters for all functions, please see the chapter "PDL functions reference".

The order of named parameters is not significant, but for readability it is recommended to place them after the function arguments.

Example

near(Germany, Austria, max_gap:=0, allow_punct:=no)

=

near(Germany, allow_punct:=no, Austria, max_gap:=0)

=

near(max_gap:=0, allow_punct:=no, Germany, Austria)

Recommended option:

near(Germany, Austria, max_gap:=0, allow_punct:=no)

Operators

PDL operators either apply to a single expression (unary operators) or connect two expressions (binary operators). Like function names, operator names are case-insensitive.

Syntax

binary operator: argument_1 operator argument_2

unary operator: operator argument_1

Example

"car" or "vehicle" matches documents that contain at least one of the words "car", "vehicle" or both.

not "vehicle" matches documents that do not contain the word "vehicle".

The following table lists all types of operators in PDL.

Operator   Name               Type
not        not                unary
and        and                binary
or         or                 binary
xor        xor                binary
&          set intersection   binary
/          set difference     binary

For more information about operators, please see the chapter "Operators".
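As a rough illustration of how the logical operators select documents, the following Python sketch treats each expression as a yes/no test on a document. It is not how PolyAnalyst evaluates queries: the helper names are hypothetical, word matching is a naive whitespace split, xor is assumed to have its usual "exactly one of the two" meaning, and the & and / operators are left out because their behavior is not described in this section.

```python
def contains(word):
    # Predicate: does the document contain `word` as a whole token?
    return lambda doc: word in doc.lower().split()

def op_not(p):
    return lambda doc: not p(doc)

def op_and(p, q):
    return lambda doc: p(doc) and q(doc)

def op_or(p, q):
    return lambda doc: p(doc) or q(doc)

def op_xor(p, q):
    # Assumption: true when exactly one side matches.
    return lambda doc: p(doc) != q(doc)

def search(docs, predicate):
    # Return the documents for which the predicate holds.
    return [d for d in docs if predicate(d)]

docs = ["the car stopped", "a vehicle and a car", "the bus left"]
car, vehicle = contains("car"), contains("vehicle")

either = search(docs, op_or(car, vehicle))
no_vehicle = search(docs, op_not(vehicle))
both = search(docs, op_and(car, vehicle))
exactly_one = search(docs, op_xor(car, vehicle))
```

Here `either` corresponds to "car" or "vehicle" and `no_vehicle` to not "vehicle" from the examples above.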

Functions and operators can be nested, so that the result of one function (or operator) becomes an argument of the parent function (or operator). A PDL query is therefore usually a combination of functions, operators and strings.

Example

sentence(near(income or revenue, company, max_gap:=2), phrase(3, number(), dollar or "$"))

For more information about complex PDL queries, please see the chapter "Walkthrough Example".