Spanner Algebra

The spanner module of Text Extensions for Pandas provides span-specific operations for Pandas DataFrames, based on the Document Spanners formalism, also known as spanner algebra.

Spanner algebra is an extension of relational algebra with additional operations to cover NLP applications. See the paper [“Document Spanners: A Formal Approach to Information Extraction”](https://researcher.watson.ibm.com/researcher/files/us-fagin/jacm15.pdf) by Fagin et al. for more information.

text_extensions_for_pandas.spanner.adjacent_join(first_series: pandas.core.series.Series, second_series: pandas.core.series.Series, first_name: str = 'first', second_name: str = 'second', min_gap: int = 0, max_gap: int = 0)[source]

Compute the join of two series of spans, where a pair of spans is considered to match if they are adjacent to each other in the text.

Parameters
  • first_series – Spans that appear earlier in the text. dtype must be TokenSpanDtype.

  • second_series – Spans that appear later in the text. dtype must be TokenSpanDtype.

  • first_name – Name to give the column in the returned DataFrame that is derived from first_series.

  • second_name – Column name for spans from second_series in the returned DataFrame.

  • min_gap – Minimum number of tokens allowed between matching pairs of spans, inclusive.

  • max_gap – Maximum number of tokens allowed between matching pairs of spans, inclusive.

Returns

A new pd.DataFrame containing all pairs of spans that match the join predicate. Columns of the DataFrame will be named according to the first_name and second_name arguments.
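
A minimal sketch of a call, with token offsets hand-built for brevity (in practice they would come from a tokenizer such as the one in text_extensions_for_pandas.io.spacy):

    import pandas as pd
    import text_extensions_for_pandas as tp
    from text_extensions_for_pandas import spanner

    text = "red fish blue fish"
    # Token offsets hand-built for this example.
    tokens = tp.SpanArray(text, [0, 4, 9, 14], [3, 8, 13, 18])

    colors = pd.Series(tp.TokenSpanArray(tokens, [0, 2], [1, 3]))  # "red", "blue"
    fish = pd.Series(tp.TokenSpanArray(tokens, [1, 3], [2, 4]))    # "fish", "fish"

    # Pairs where a color is immediately followed by "fish" (gap of 0 tokens).
    pairs = spanner.adjacent_join(colors, fish,
                                  first_name="color", second_name="fish")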

text_extensions_for_pandas.spanner.consolidate(df: pandas.core.frame.DataFrame, on: str, how: str = 'left_to_right') → pandas.core.frame.DataFrame[source]

Eliminate overlap among the spans in one column of a pd.DataFrame.

Parameters
  • df – DataFrame containing spans and other attributes

  • on – Name of column in df on which to perform consolidation

  • how – Policy for deciding which spans are considered to overlap and which member of an overlapping pair remains after consolidation. Available policies:

    • left_to_right: Walk through the spans from left to right, keeping the longest non-overlapping match at each position encountered.

Returns

The rows of df that remain after applying the specified policy to the spans in the column specified by on.
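
For example, the following sketch (span offsets hand-built for illustration) keeps only the longest of three overlapping matches:

    import pandas as pd
    import text_extensions_for_pandas as tp
    from text_extensions_for_pandas import spanner

    text = "President Barack Obama"
    # Three overlapping candidate matches over the same text.
    df = pd.DataFrame({"match": tp.SpanArray(text, [0, 0, 10], [9, 22, 22])})

    # left_to_right keeps "President Barack Obama" and drops the spans it overlaps.
    consolidated = spanner.consolidate(df, on="match")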

text_extensions_for_pandas.spanner.contain_join(first_series: pandas.core.series.Series, second_series: pandas.core.series.Series, first_name: str = 'first', second_name: str = 'second')[source]

Compute the join of two series of spans, where a pair of spans is considered to match if the second span is contained within the first.

Parameters
  • first_series – First set of spans to join, wrapped in a pd.Series

  • second_series – Second set of spans to join. For the join predicate to be satisfied, these spans must be contained within spans from the first set.

  • first_name – Name to give the column in the returned DataFrame that is derived from first_series.

  • second_name – Column name for spans from second_series in the returned DataFrame.

Returns

A new pd.DataFrame containing all pairs of spans that match the join predicate. Columns of the DataFrame will be named according to the first_name and second_name arguments.
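
A minimal sketch, with span offsets hand-built for illustration:

    import pandas as pd
    import text_extensions_for_pandas as tp
    from text_extensions_for_pandas import spanner

    text = "Alice flew to Paris. Bob stayed home."
    sentences = pd.Series(tp.SpanArray(text, [0, 21], [20, 37]))
    persons = pd.Series(tp.SpanArray(text, [0, 21], [5, 24]))  # "Alice", "Bob"

    # One row per (sentence, person) pair where the person falls inside the sentence.
    pairs = spanner.contain_join(sentences, persons,
                                 first_name="sentence", second_name="person")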

text_extensions_for_pandas.spanner.create_dict(entries: Iterable[str], tokenizer: spacy.tokenizer.Tokenizer = None) → pandas.core.frame.DataFrame[source]

Create a dictionary from a list of entries, where each entry is expressed as a single string.

Tokenizes and normalizes the dictionary entries.

Parameters
  • entries – Iterable of strings, one string per dictionary entry.

  • tokenizer – Preconfigured tokenizer object for tokenizing dictionary entries. Must always tokenize the same way as the tokenizer used on the target text! If None, this method will use the tokenizer returned by text_extensions_for_pandas.io.spacy.simple_tokenizer().

Returns

pd.DataFrame with the normalized, tokenized dictionary entries.
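
For example (the default tokenizer path requires SpaCy to be installed):

    from text_extensions_for_pandas import spanner

    # Tokenizes and normalizes both entries; with tokenizer=None the tokenizer
    # from text_extensions_for_pandas.io.spacy.simple_tokenizer() is used.
    company_dict = spanner.create_dict(["International Business Machines", "IBM"])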

text_extensions_for_pandas.spanner.extract_dict(tokens: Union[text_extensions_for_pandas.array.span.SpanArray, pandas.core.series.Series], dictionary: pandas.core.frame.DataFrame, output_col_name: str = 'match')[source]

Identify all matches of a dictionary on a sequence of tokens.

Parameters
  • tokens – SpanArray of token information, optionally wrapped in a pd.Series. These tokens must come from the same tokenizer that tokenized the entries of dictionary. To tokenize with SpaCy, use text_extensions_for_pandas.io.spacy.make_tokens().

  • dictionary – The dictionary to match, encoded as a pd.DataFrame in the format returned by load_dict()

  • output_col_name – (optional) name of column of matching spans in the returned DataFrame

Returns

A single-column DataFrame of token ID spans of dictionary matches.
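
A minimal end-to-end sketch, assuming make_tokens() takes the target text and a tokenizer as its two arguments:

    import text_extensions_for_pandas as tp
    from text_extensions_for_pandas import spanner

    tokenizer = tp.io.spacy.simple_tokenizer()
    tokens = tp.io.spacy.make_tokens("Alice works at IBM.", tokenizer)
    company_dict = spanner.create_dict(["IBM"], tokenizer)

    # Single-column DataFrame; the "match" column holds spans of dictionary hits.
    matches = spanner.extract_dict(tokens, company_dict)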

text_extensions_for_pandas.spanner.extract_regex(doc_text: str, compiled_regex: re.Pattern)[source]

Identify all non-overlapping matches of a regular expression, as returned by re.Pattern.finditer(), and return those locations as an array of spans.

Parameters
  • doc_text – Text of the document; will be the target text of the returned spans.

  • compiled_regex – Regular expression to evaluate, compiled with either the re or the regex package.

Returns

SpanArray containing a span for each match of the regex.
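
For example:

    import re
    from text_extensions_for_pandas import spanner

    doc_text = "Call 555-1234 or 555-9876."
    # SpanArray with one span per phone number.
    phone_spans = spanner.extract_regex(doc_text, re.compile(r"\d{3}-\d{4}"))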

text_extensions_for_pandas.spanner.extract_regex_tok(tokens: Union[text_extensions_for_pandas.array.span.SpanArray, pandas.core.series.Series], compiled_regex: regex.regex.compile, min_len=1, max_len=1, output_col_name: str = 'match')[source]

Identify all (possibly overlapping) matches of a regular expression that start and end on token boundaries.

Parameters
  • tokens – SpanArray of token information, optionally wrapped in a pd.Series.

  • compiled_regex – Regular expression to evaluate.

  • min_len – Minimum match length in tokens

  • max_len – Maximum match length (inclusive) in tokens

  • output_col_name – (optional) name of column of matching spans in the returned DataFrame

Returns

A single-column DataFrame containing a span for each match of the regex.
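
A minimal sketch, with token offsets hand-built for brevity:

    import regex
    import text_extensions_for_pandas as tp
    from text_extensions_for_pandas import spanner

    text = "New York City"
    tokens = tp.SpanArray(text, [0, 4, 9], [3, 8, 13])

    # Capitalized words, one or two tokens long, starting and ending on token
    # boundaries; overlapping matches such as "New York" and "York City" both appear.
    matches = spanner.extract_regex_tok(
        tokens, regex.compile(r"[A-Z][a-z]+( [A-Z][a-z]+)?"),
        min_len=1, max_len=2)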

text_extensions_for_pandas.spanner.extract_split(doc_text: str, split_points: Union[Sequence[int], numpy.ndarray, text_extensions_for_pandas.array.span.SpanArray]) → text_extensions_for_pandas.array.span.SpanArray[source]

Split a document into spans along a specified set of split points.

Parameters
  • doc_text – Text of the document; will be the target text of the returned spans.

  • split_points – A series of offsets into doc_text, expressed as either:

    • A sequence of integers, as a list or 1-d Numpy array: split at those locations and return a set of splits that covers every character in the document.

    • A sequence of spans: split around the indicated locations, but discard the parts of the document that are within a split point.

Returns

SpanArray that splits the document in the specified way.
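
For example:

    import text_extensions_for_pandas as tp
    from text_extensions_for_pandas import spanner

    text = "one,two,three"

    # Integer offsets: the resulting splits cover every character, commas included.
    parts = spanner.extract_split(text, [3, 7])

    # Span split points: the commas themselves are discarded.
    commas = tp.SpanArray(text, [3, 7], [4, 8])
    fields = spanner.extract_split(text, commas)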

text_extensions_for_pandas.spanner.lemmatize(spans: Union[pandas.core.series.Series, text_extensions_for_pandas.array.span.SpanArray, Iterable[text_extensions_for_pandas.array.span.Span]], token_features: pandas.core.frame.DataFrame, lemma_col_name: str = 'lemma', token_span_col_name: str = 'span') → List[str][source]

Convert spans to their normal form using lemma information in a token features table.

Parameters
  • spans – Spans to be normalized. Each may represent zero or more tokens.

  • token_features – DataFrame of token metadata. Index must be aligned with the token indices in spans.

  • lemma_col_name – Optional custom name for the DataFrame column containing the lemmatized form of each token.

  • token_span_col_name – Optional custom name for the DataFrame column containing the span of each token.

Returns

A list containing a normalized version of each span in spans, with tokens separated by a single space character.
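
A minimal sketch, with token offsets and lemmas hand-built in place of real tokenizer output:

    import pandas as pd
    import text_extensions_for_pandas as tp
    from text_extensions_for_pandas import spanner

    text = "The cats were running"
    tokens = tp.SpanArray(text, [0, 4, 9, 14], [3, 8, 13, 21])
    token_features = pd.DataFrame({
        "span": tokens,
        "lemma": ["the", "cat", "be", "run"],
    })

    spans = pd.Series(tp.TokenSpanArray(tokens, [1], [4]))  # "cats were running"
    spanner.lemmatize(spans, token_features)  # expected: ["cat be run"]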

text_extensions_for_pandas.spanner.load_dict(file_name: str, tokenizer: spacy.tokenizer.Tokenizer = None)[source]

Load a SystemT-format dictionary file. File format is one entry per line.

Tokenizes and normalizes the dictionary entries.

Parameters
  • file_name – Path to dictionary file

  • tokenizer – Preconfigured tokenizer object for tokenizing dictionary entries. Must be the same configuration as the tokenizer used on the target text! If None, this method will use SpaCy’s default English tokenizer.

Returns

pd.DataFrame with the normalized, tokenized dictionary entries.
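
For example ("first_names.dict" is a hypothetical file with one entry per line):

    from text_extensions_for_pandas import spanner

    names_dict = spanner.load_dict("first_names.dict")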

text_extensions_for_pandas.spanner.overlap_join(first_series: pandas.core.series.Series, second_series: pandas.core.series.Series, first_name: str = 'first', second_name: str = 'second')[source]

Compute the join of two series of spans, where a pair of spans is considered to match if they overlap.

Parameters
  • first_series – First set of spans to join, wrapped in a pd.Series

  • second_series – Second set of spans to join.

  • first_name – Name to give the column in the returned DataFrame that is derived from first_series.

  • second_name – Column name for spans from second_series in the returned DataFrame.

Returns

A new pd.DataFrame containing all pairs of spans that match the join predicate. Columns of the DataFrame will be named according to the first_name and second_name arguments.
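
A minimal sketch, with span offsets hand-built for illustration:

    import pandas as pd
    import text_extensions_for_pandas as tp
    from text_extensions_for_pandas import spanner

    text = "New York City Hall"
    places = pd.Series(tp.SpanArray(text, [0], [13]))     # "New York City"
    buildings = pd.Series(tp.SpanArray(text, [9], [18]))  # "City Hall"

    # The spans overlap on "City", so the pair appears in the result.
    pairs = spanner.overlap_join(places, buildings,
                                 first_name="place", second_name="building")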

text_extensions_for_pandas.spanner.unpack_semijoin(target_region: text_extensions_for_pandas.array.span.Span, model_results: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame[source]

Unpack the results of evaluating an extraction model, such as dependency parsing or named entity recognition, using a semijoin strategy to reduce the amount of text over which the model is applied.

To use unpack_semijoin(), first identify the regions of the text over which you wish to run the model. Then run the model over the text of those regions to produce spans whose begin and end offsets are relative to the text of each distinct target region. Finally, pass each target region and its model results to this function to produce result spans whose begin and end offsets are relative to the original document text.

Parameters
  • target_region – Span indicating a section of the original document text over which the model was applied.

  • model_results – Results from running your model over target_region, as a pd.DataFrame.

Returns

A pd.DataFrame with the same schema as model_results, but with all spans converted from spans over the target text of target_region to spans over the original document text.
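
A minimal sketch, using a hand-built DataFrame of spans to stand in for real model output:

    import pandas as pd
    import text_extensions_for_pandas as tp
    from text_extensions_for_pandas import spanner

    doc_text = "Boilerplate header. Alice flew to Paris."
    # The region of the document over which the (pretend) model was run.
    region = tp.Span(doc_text, 20, 40)

    # Stand-in model output: spans relative to the region's covered text.
    model_results = pd.DataFrame({
        "entity": tp.SpanArray(region.covered_text, [0, 14], [5, 19]),
    })

    # Same schema, but the spans now reference doc_text.
    doc_results = spanner.unpack_semijoin(region, model_results)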