docpart

Purpose

Finds text records in which the search terms occur within the document section specified by the function's first parameter. If no terms are given, the function matches all sections of the specified type.

Syntax

docpart(section_type, [term1, [term2, …​]])

Arguments

The following list of section_type values and their optional named arguments is exhaustive.

  • table_of_contents/contents/toc — table of contents

  • heading — heading

    1. level: heading level in the range [1, 6], where 1 is the most important heading and 6 is the least important;

  • list — list

    1. number: ordinal number of the list in the document;

    2. type: type of list: ordered or unordered/bulleted;

    3. item: number of the list item;

    4. level: list level in the range [1, 6], where 1 is the most important list and 6 is the least important;

  • list_item — list item

    1. number: number of the item within its list;

  • table — table

    1. name: name of the table;

    2. number: ordinal number of the table in the document;

    3. column/col: name or number of the table column;

    4. col_number: number of the table column;

    5. row: name or number of the table row;

    6. row_number: number of the table row;

  • row — table row

    1. name: name of the row;

    2. number: number of the table row;

  • column/col — table column

    1. name: name of the column;

    2. number: number of the table column;

  • cell — table cell

  • table_name — table name

  • section — section

    1. name: name of the section;

    2. whole (yes/no): if set to "yes", the name parameter must match the entire section name (set to "no" by default);

    3. level: section level, corresponding to the heading level;

    4. field: restricts the search to the section's body, its heading, or both. Set to "any" by default;

  • mail — email section

    1. sender: sender of the email;

    2. recipient: recipient of the email;

    3. copy: recipient in copy;

    4. subject: subject of the email;

    5. opening: opening of the email;

    6. closing: closing of the email;

    7. signature: signature of the email;

    8. body: body of the email;

    9. date_time: date and time of the email;

    10. forwarded (yes/no): specifies whether the email is a forwarded message.

  • page — specific page/page range

    1. number: sets the page number, or a page range if two number parameters are specified.

  • hyperlink — internet hyperlinks.

The function also takes the following optional parameters:

  • ocr finds documents containing words that the PolyAnalyst OCR module recognized with a high recognition confidence score.

    1. confidence sets the confidence range of OCR recognition.

  • rotated/unrotated searches for rotated or unrotated text (set to "unrotated" by default).

    1. degree sets the degree of rotation (e.g., 15, 16, 30.5, 45, 90);

    2. type (horizontal/vertical) sets the type of rotation: horizontal or vertical;

    3. scope (token/sentence/paragraph/text) specifies whether to output results by tokens, sentences, paragraphs, or the entire text (scope:=text by default). This parameter works only if there are no nested arguments.

Note

  1. The page parameter supports the docx and pdf formats.

  2. To search within several sections, list them separated by the "|" symbol.

  3. If the attributes are omitted, the function matches all sections of the specified type.

  4. The relational operators ">", "<", ">=", "<=", "!=" can be used with numerical parameters, e.g. docpart(table, col:>1, col:<3, row:>1).

  5. The docpart function matches the intersection of the query with the table sections or pages set by the number argument. Therefore, a match may lie only partially within the specified table sections or on the specified pages.

  6. The optional named attribute number of the page parameter can take a negative value. In this case, pages are counted from the last page in the document, e.g. number:="-1" limits the query to the last page, and number:>="-2" limits the query to the last two pages.

  7. The hyperlink parameter finds hyperlinks only in HTML pages. To use it, connect the node to an already executed parent Internet source node.

  8. Search for rotated text is possible in .docx documents and in documents recognized by OCR. Document rotation is considered in two directions: clockwise (positive value) and counterclockwise (negative value). The rotation value is set in the range [-180; 180] degrees.
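
The page selection rules in notes 4 and 6 can be illustrated outside the query language. The following Python sketch is not part of PolyAnalyst; the names resolve and select_pages are hypothetical, and the mapping of negative numbers (-1 is the last page, -2 the one before it) is an assumption drawn from the description in note 6.

```python
import operator

# Relational operators listed in note 4, plus "=" for an exact match.
OPS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def resolve(number, total_pages):
    """Map a possibly negative page number to a 1-based page index.

    Assumption: -1 is the last page, -2 the one before it, and so on.
    """
    return total_pages + 1 + number if number < 0 else number


def select_pages(total_pages, constraints):
    """Return the 1-based pages that satisfy every (op, number) constraint."""
    pages = []
    for page in range(1, total_pages + 1):
        if all(OPS[op](page, resolve(n, total_pages)) for op, n in constraints):
            pages.append(page)
    return pages
```

Under these assumptions, for a ten-page document select_pages(10, [(">=", -2)]) yields pages 9 and 10, which mirrors number:>="-2", and an empty constraint list selects every page, which mirrors docpart(page).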

Returned Value

Text records.

Examples

docpart(contents, international organization) matches all occurrences of the words "international" or "organization" in the table of contents.

docpart(heading, consumption) matches all occurrences of the word "consumption" in the document’s headings.

docpart(table, consumption) matches all occurrences of the word "consumption" in tables.

docpart(heading|table, consumption) matches all occurrences of the word "consumption" in headings and tables.

docpart(heading, level:=1) matches the top-level headings of the document.

docpart(table_name) matches the names of tables.

docpart(list) matches all lists of the document.

docpart(list, type:=unordered) matches all unordered lists of the document.

docpart(list, type:=ordered) matches all ordered lists of the document.

docpart(list, type:=bulleted) matches all bulleted lists of the document.

docpart(list, number:=3, item:=2) matches the second item of the third list of the document.

docpart(list_item, number:=5) matches the fifth item of every list in the document.

docpart(section, History, field:=body) matches the word "History" within section bodies, but not within headings.

docpart(section, convention, name:=article) matches the word "convention" within sections containing the word "article" in their headings.

docpart(section, name:=phrase(article, number()), whole:=yes) matches sections with names like "article 1", "article 2", but not, for example, "new article 1".

docpart(section, programme, level:=2) = docpart(heading, programme, level:=2) matches the word "programme" within second-level section headings.

docpart(mail, field:=subject) matches the subject line of an email, for example, "Subject: RE: West Position".

docpart(mail, forwarded:=yes) matches forwarded messages in the dataset.

docpart(mail, field:=copy) matches recipients in copy, for example, cc: "Debbie Nowak (E-mail)" <dnowak@enron.com>.

docpart(ocr, confidence:>80) matches words whose OCR recognition confidence is greater than 80.

docpart(page) matches all pages of a document.

docpart(page, number:=1) matches and highlights the first page of a document.

docpart(page, <query>) matches positions of a <query> within a page.

docpart(page, contract, rent, number:=1) matches any of the arguments "contract" or "rent" on the first page.

docpart(page, contract, number:=1) matches the word "contract" on the first page.

docpart(page, contract, number:="-1") matches the word "contract" on the last page.

docpart(page, signature, number:!=1) matches the word "signature" on any page except the first.

docpart(page, sum, number:>=2, number:<=3) matches the word "sum" on the second and third pages.

docpart(hyperlink) matches all hyperlinks in a document.

docpart(hyperlink, "weather") matches the hyperlinks containing the word "weather".

docpart(rotated) matches rotated text.

docpart(unrotated) = docpart(rotated, degree:=0) matches unrotated text.

docpart(rotated, degree:="-90") matches text rotated 90 degrees counterclockwise.

docpart(rotated, type:=horizontal) = orn(docpart(unrotated), docpart(rotated, degree:=180), docpart(rotated, degree:="-180")) matches horizontally oriented text, that is, unrotated text or text rotated 180 degrees in either direction.

docpart(rotated, type:=vertical) = orn(docpart(rotated, degree:=90), docpart(rotated, degree:="-90")) matches text rotated 90 degrees clockwise or counterclockwise.

docpart(rotated, noun()) matches all rotated nouns.

docpart(rotated, type:=vertical, noun()) matches all nouns rotated 90 degrees clockwise or counterclockwise.
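
The equivalences given for type:=horizontal and type:=vertical can be summarized as a small classification rule. The Python sketch below is an illustration only, not PolyAnalyst code; the function name rotation_type is hypothetical, and returning None for angles such as 45 is an assumption, since the examples define the two types only for 0, 180, and 90 degrees in either direction.

```python
def rotation_type(degree):
    """Classify a rotation angle as 'horizontal', 'vertical', or None.

    Assumption: only the angles named in the documented equivalences are
    classified; any other angle in range (e.g. 45) returns None.
    """
    if not -180 <= degree <= 180:
        raise ValueError("degree must be in the range [-180, 180]")
    if degree in (0, 180, -180):
        return "horizontal"
    if degree in (90, -90):
        return "vertical"
    return None
```

Positive angles denote clockwise rotation and negative angles counterclockwise rotation, so rotation_type(-90) and rotation_type(90) both fall under the vertical type.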