Walkthrough Example

To better understand how to write a PDL query, consider a simple example. Suppose a user needs to find street names in a document. In this case, the user can write the following expression: "phrase(case(title), street or avenue)".

The query consists of three parts:

1) The case(title) subquery finds all words in title case.

2) The argument "street or avenue" matches all forms of the words "street" and "avenue".

3) The phrase() function includes the subqueries explained above and matches a sequence of arguments that follow each other (in this case, a word in title case followed by "street"/"avenue").
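The three parts above can be illustrated with a rough Python analogue. This is only a sketch of the matching idea, not how PDL is implemented: the function name, the STREET_WORDS set (a small stand-in for "all forms" of the two words) and the regex tokenizer are assumptions made for illustration.

```python
import re

# Assumed stand-in for "all forms" of the words "street" and "avenue".
STREET_WORDS = {"street", "streets", "avenue", "avenues"}

def find_simple_street_names(text):
    """Rough analogue of phrase(case(title), street or avenue)."""
    tokens = re.findall(r"[A-Za-z]+", text)
    matches = []
    # Check every pair of adjacent words: a title-case word
    # immediately followed by a street word.
    for first, second in zip(tokens, tokens[1:]):
        if first.istitle() and second.lower() in STREET_WORDS:
            matches.append(f"{first} {second}")
    return matches

print(find_simple_street_names(
    "The office moved from Main Street to East First Avenue."))
# ['Main Street', 'First Avenue']
```

In this sketch only one title-case word is taken before the street word, so "East First Avenue" is reduced to "First Avenue".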

Please note that this query does not extract complex street names containing more than one word, e.g., “East First Avenue”, “Von Karman Avenue”.

pdl intro 1

To include these results, we can use the repeat() function to indicate that a word in title case may occur more than once in the sentence.
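The effect of repeating the title-case condition can be sketched in Python as follows. This is a hypothetical analogue, assuming that repeat() without limits accepts any run of consecutive title-case words; the names used here are illustrative and are not part of PDL.

```python
import re

# Assumed stand-in for "all forms" of the words "street" and "avenue".
STREET_WORDS = {"street", "streets", "avenue", "avenues"}

def find_street_names(text):
    """Rough analogue of a run of title-case words followed by a street word."""
    tokens = re.findall(r"[A-Za-z]+", text)
    matches = []
    run = []  # current run of consecutive title-case words
    for token in tokens:
        if token.lower() in STREET_WORDS:
            if run:
                matches.append(" ".join(run + [token]))
            run = []
        elif token.istitle():
            run.append(token)
        else:
            run = []
    return matches

print(find_street_names(
    "She walked along East First Avenue and read a Wall Street Analyst "
    "report about New York City streets."))
# ['East First Avenue', 'Wall Street', 'New York City streets']
```

The sketch now captures multi-word names, but it also returns phrases that are not street names.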

pdl intro 2

Scrolling through the results, it is evident that the query extracts some cases which are not street names, e.g., "Wall Street Analyst", “New York City streets”.

There are several ways to refine the query so that it excludes the incorrect results:

1) Indicating that the word “street” cannot be used in plural form.

This can be done by using the lemma() function (for more information, please see the section "Morphology-based Search"):

phrase(repeat(case(title)), lemma(singular, street, avenue))
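The difference between matching all word forms and matching only the singular form can be sketched in Python. The two keyword sets below are assumptions standing in for PDL's morphological analysis, and the function name is hypothetical.

```python
import re

ALL_FORMS = {"street", "streets", "avenue", "avenues"}  # assumed "all forms"
SINGULAR_ONLY = {"street", "avenue"}                    # assumed effect of lemma(singular, ...)

def find_street_names(text, keywords):
    """Match a run of title-case words followed by one of the keywords."""
    tokens = re.findall(r"[A-Za-z]+", text)
    matches = []
    run = []  # current run of consecutive title-case words
    for token in tokens:
        if token.lower() in keywords:
            if run:
                matches.append(" ".join(run + [token]))
            run = []
        elif token.istitle():
            run.append(token)
        else:
            run = []
    return matches

text = "Von Karman Avenue is quieter than most New York City streets."
print(find_street_names(text, ALL_FORMS))
# ['Von Karman Avenue', 'New York City streets']
print(find_street_names(text, SINGULAR_ONLY))
# ['Von Karman Avenue']
```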

2) Indicating that the words "street" and "avenue" cannot be preceded by a city or country name or by the word "Wall" (to avoid matching the name of a famous newspaper).

City and country names can be found in Geoadministrative dictionaries. If users want to find words from a dictionary, the dictword() function can be used (for more information, please see the section "Using dictionaries").

The query dictword(geoadministrative, "category = city|country", "Class = unique") matches cities and countries mentioned in the document, if they are present in the Geoadministrative dictionary and are unique.

To exclude arguments from the query results, the except() function can be used (for more information, please see the section "Excluding search results").

except(dictword(geoadministrative, "category = city|country"), wall) excludes all city and country names and the word "wall" from the results.
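The exclusion step can be sketched in Python as a filter on the words allowed before "street" or "avenue". The GEO_NAMES set is a tiny hypothetical stand-in for the Geoadministrative dictionary and handles single-word names only; the real dictionary lookup is performed by PDL and is not reproduced here.

```python
import re

STREET_WORDS = {"street", "avenue"}
GEO_NAMES = {"london", "paris", "france"}  # hypothetical dictionary entries
EXCLUDED = GEO_NAMES | {"wall"}            # words that must not precede a street word

def find_street_names(text):
    """Title-case words, minus excluded ones, followed by a street word."""
    tokens = re.findall(r"[A-Za-z]+", text)
    matches = []
    run = []  # current run of allowed title-case words
    for token in tokens:
        word = token.lower()
        if word in STREET_WORDS:
            if run:
                matches.append(" ".join(run + [token]))
            run = []
        elif token.istitle() and word not in EXCLUDED:
            run.append(token)
        else:
            run = []  # an excluded or lower-case word breaks the run
    return matches

print(find_street_names(
    "Tourists compared Wall Street, London Street and Baker Street."))
# ['Baker Street']
```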

3) Users may also use the not operator to exclude the words "journal", "bank" and "inc" from the query, in order to avoid false results such as “Wall Street Journal”, “State Street Bank”, "Web Street, Inc.".

phrase(0, repeat(1,2, case(title)), (street or avenue), not (journal or bank or inc))
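A Python sketch of this restriction checks the word that follows "street" or "avenue". It assumes that repeat(1,2, ...) means one or two repetitions, modelled here by the hypothetical max_words parameter; the set and function names are also illustrative.

```python
import re

STREET_WORDS = {"street", "avenue"}
BLOCKED_NEXT = {"journal", "bank", "inc"}  # words that must not follow a street word

def find_street_names(text, max_words=2):
    """Up to max_words title-case words, a street word, no blocked word after it."""
    tokens = re.findall(r"[A-Za-z]+", text)
    matches = []
    run = []  # current run of consecutive title-case words
    for i, token in enumerate(tokens):
        if token.lower() in STREET_WORDS:
            next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
            if run and next_word not in BLOCKED_NEXT:
                # Keep at most max_words words before the street word (assumption).
                matches.append(" ".join(run[-max_words:] + [token]))
            run = []
        elif token.istitle():
            run.append(token)
        else:
            run = []
    return matches

print(find_street_names(
    "The Wall Street Journal, State Street Bank and Web Street, Inc. "
    "are not on Baker Street."))
# ['Baker Street']
```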

Our final query finds one or more words in title case, other than "wall" and city and country names, followed by the word "street" or "avenue". The words "street" and "avenue", in turn, cannot be followed by "journal", "bank" or "inc":

phrase(repeat(1,2, case(title, except(dictword(geoadministrative, "category = city|country", "Class=unique"), Wall))), lemma(singular, street, avenue), not (journal or bank or inc))
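The combined conditions can be approximated in a single Python sketch. As before, this is an illustrative analogue under stated assumptions: GEO_NAMES is a hypothetical substitute for the Geoadministrative dictionary, only singular keywords are listed to mimic lemma(singular, ...), and max_words models the assumed meaning of repeat(1,2, ...).

```python
import re

STREET_WORDS = {"street", "avenue"}        # singular forms only
GEO_NAMES = {"london", "paris", "france"}  # hypothetical dictionary entries
EXCLUDED_BEFORE = GEO_NAMES | {"wall"}     # may not precede a street word
BLOCKED_AFTER = {"journal", "bank", "inc"} # may not follow a street word

def find_street_names(text, max_words=2):
    """Rough analogue of the final query."""
    tokens = re.findall(r"[A-Za-z]+", text)
    matches = []
    run = []  # current run of allowed title-case words
    for i, token in enumerate(tokens):
        word = token.lower()
        if word in STREET_WORDS:
            next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
            if run and next_word not in BLOCKED_AFTER:
                matches.append(" ".join(run[-max_words:] + [token]))
            run = []
        elif token.istitle() and word not in EXCLUDED_BEFORE:
            run.append(token)
        else:
            run = []
    return matches

print(find_street_names(
    "From Von Karman Avenue we passed State Street Bank, Wall Street "
    "and several London streets."))
# ['Von Karman Avenue']
```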

pdl intro 3