Search within Document Parts

The function docpart() is used to search within specific document parts (headings, tables, lists, contents).

Syntax

docpart(section_type, [term_1, [term_2, …​]])

The first parameter, section_type, specifies the target document part; the remaining parameters are optional search terms. The supported sections and their attributes are listed below:

| Section | Comments | Supported attribute | Attribute comments |
|---|---|---|---|
| table_of_contents/contents/toc | table of contents | | |
| heading | heading | level | Heading level in the range [1, 6]: 1 is the most important heading, 6 is the least important heading. |
| list | list | number | List number within the document. |
| | | type | List type (ordered, unordered, bulleted). |
| | | item | Item number within the list. |
| | | level | List level in the range [1, 6]: 1 is the most important level, 6 is the least important level. |
| list_item | list item | number | Item number within the list. |
| table | table | name | Table name. |
| | | number | Table number within the document. |
| | | column/col | Name/number of table’s column. |
| | | col_number | Number of table’s column. |
| | | row | Name/number of table’s row. |
| | | row_number | Number of table’s row. |
| row | table row | name | Row name. |
| | | number | Row number. |
| column/col | table column | name | Column name. |
| | | number | Column number. |
| cell | table cell | | |
| section | section | name | Section name; can be a PDL query. |
| | | whole (yes/no) | If set to "yes", the name parameter refers to the entire section name (set to "no" by default). |
| | | level | Specifies the section’s level, corresponds to a heading. |
| | | field (body/heading/any) | Search within a section’s body/heading/both body and heading. Set to "any" by default. |
| mail | email section | sender | Email’s sender. |
| | | recipient | Email’s recipient. |
| | | copy | Recipient in copy. |
| | | subject | Email’s subject. |
| | | opening | Email’s opening. |
| | | closing | Email’s closing. |
| | | signature | Email’s signature. |
| | | body | Email’s body. |
| | | date_time | Email’s date and time. |
| | | forwarded:yes/no | Defines whether email is a forwarded message. |
| page | page/page range | number | Sets the page number or page range if two parameters are specified. |
| hyperlink | internet hyperlink | | |

The function also takes the optional parameter ocr, which finds documents containing words that the PolyAnalyst OCR module recognized with a high confidence score, and the named parameter confidence, which sets the confidence range of OCR recognition.

Note

  1. To search within several sections at once, separate the section names with the "|" symbol.

  2. If the attributes are omitted, the function matches all sections of the specified type.

  3. Numeric attributes accept the relational operators ">", "<", ">=", "<=", "!=", e.g. docpart(table, col:>1, col:<3, row:>1)

  4. The docpart() function matches the intersection of the query with table sections or pages set by the number argument. Therefore, a query match may lie only partially within the specified table sections or pages.

  5. The optional named attribute number of the page section can take a negative value. In this case, pages are counted from the last page in the document, i.e. number:="-1" limits the query to the last page, number:>="-2" limits the query to the last two pages.

  6. The hyperlink section finds hyperlinks only in HTML pages. To use it, connect the node to an already executed parent Internet source node.

  7. Supported file formats for each section are listed in the table below.

| Section | Supported File Formats |
|---|---|
| contents | docx, odt |
| heading | docx, html, odt, pptx, ppt, rtf |
| list | docx, html, odt, pptx |
| table | docx, doc, html, odt, pptx, ppt, pdf, rtf |
| section | docx, html, odt, pptx, ppt, rtf |
| page | docx, pdf |
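
The page-number rules in notes 3 and 5 above can be sketched in Python. This is only an illustration of the described semantics, not PolyAnalyst's implementation: the names resolve_page and matching_pages are hypothetical, and combining several constraints with AND is an assumption based on queries such as number:>=2, number:<=3.

```python
import operator

# Relational operators from note 3, plus "=" for the ":=" form.
OPS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def resolve_page(number, page_count):
    """Map a possibly negative page number to a 1-based absolute page.

    Negative values count from the end: -1 is the last page.
    """
    return page_count + number + 1 if number < 0 else number


def matching_pages(constraints, page_count):
    """Return the 1-based pages that satisfy every (operator, number) pair.

    Assumption: multiple constraints on `number` are combined with AND.
    """
    pages = []
    for page in range(1, page_count + 1):
        if all(OPS[op](page, resolve_page(n, page_count)) for op, n in constraints):
            pages.append(page)
    return pages
```

Under these assumptions, for a 10-page document matching_pages([("=", -1)], 10) selects only page 10, and matching_pages([(">=", -2)], 10) selects pages 9 and 10, mirroring number:="-1" and number:>="-2".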

Example

docpart(contents, international organization) matches all occurrences of the words "international" or "organization" in the table of contents.

docpart(heading, consumption) matches all occurrences of the word "consumption" in the document’s headings.

docpart(table, consumption) matches all occurrences of the word "consumption" in tables.

docpart(heading|table, consumption) matches all occurrences of the word "consumption" in headings and tables.

docpart(heading, level:=1) matches the highest headings of the document.

docpart(table_name) matches names of the tables.

docpart(list) matches all lists of the document.

docpart(list, type:=unordered) matches all unordered lists of the document.

docpart(list, type:=ordered) matches all ordered lists of the document.

docpart(list, type:=bulleted) matches all bulleted lists of the document.

docpart(list, number:=3, item:=2) matches the second item of the third list of the document.

docpart(list_item, number:=5) matches the fifth item of each document list.

docpart(section, History, field:=body) matches the word "History" within section bodies, but not within section headings.

docpart(section, convention, name:=article) matches the word "convention" within sections containing the word "article" in their headings.

docpart(section, name:=phrase(article, number()), whole:=yes) matches sections with names like "article 1", "article 2", but not, for example, "new article 1".

docpart(section, programme, level:=2) is equivalent to docpart(heading, programme, level:=2); both match the word "programme" within second-level section headings.

docpart(mail, field:=subject) matches "Subject: RE: West Position" in the email.

docpart(mail, forwarded:=yes) matches forwarded messages in the dataset.

docpart(mail, field:=copy) matches recipients in copy, for example, cc: "Debbie Nowak (E-mail)" <dnowak@enron.com>.

docpart(page) matches all pages of a document.

docpart(page, number:=1) matches and highlights the first page of a document.

docpart(page, <query>) matches positions of a <query> within a page.

docpart(page, contract, rent, number:=1) matches any of the arguments "contract" or "rent" on the first page.

docpart(page, contract, number:=1) matches the word "contract" on the first page.

docpart(page, contract, number:="-1") matches the word "contract" on the last page.

docpart(page, signature, number:!=1) matches the word "signature" on any page except the first.

docpart(page, sum, number:>=2, number:<=3) matches the word "sum" on the second and third pages.

docpart(hyperlink) matches all hyperlinks in a document.

docpart(hyperlink, "weather") matches the hyperlinks containing the word "weather".
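
The example queries above all follow the same pattern: a section list, optional terms, then named attributes. As a minimal sketch, assuming queries are assembled as plain strings before being passed to PolyAnalyst, a small Python helper could generate them. The function docpart_query is hypothetical and not part of PDL or any PolyAnalyst API; it only reproduces the textual syntax shown in the examples.

```python
def docpart_query(sections, *terms, **attributes):
    """Build a docpart() query string (hypothetical helper, string output only).

    sections:   one section name or a list of names (joined with "|").
    terms:      search terms, inserted unchanged.
    attributes: name -> value, rendered as name:=value, or
                name -> list of (operator, value) pairs, rendered as name:<op>value.
    """
    if isinstance(sections, str):
        sections = [sections]
    args = ["|".join(sections), *terms]
    for name, spec in attributes.items():
        # A bare value is shorthand for a single equality condition.
        conditions = spec if isinstance(spec, list) else [("=", spec)]
        for op, value in conditions:
            args.append(f"{name}:{op}{value}")
    return f"docpart({', '.join(args)})"
```

For instance, docpart_query(["heading", "table"], "consumption") yields docpart(heading|table, consumption), and docpart_query("page", "sum", number=[(">=", 2), ("<=", 3)]) yields docpart(page, sum, number:>=2, number:<=3). The helper does no validation of section or attribute names.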

Task example: Find the phrase "table of contents" in the table of contents

Users can write a query docpart(contents, "table of contents") that matches all occurrences of the phrase "table of contents" in the table of contents.
