Spanner Algebra

The spanner module of Text Extensions for Pandas provides span-specific operations for Pandas DataFrames, based on the Document Spanners formalism, also known as spanner algebra.

Spanner algebra is an extension of relational algebra with additional operations to cover NLP applications. See the paper [“Document Spanners: A Formal Approach to Information Extraction”](https://researcher.watson.ibm.com/researcher/files/us-fagin/jacm15.pdf) by Fagin et al. for more information.

text_extensions_for_pandas.spanner.adjacent_join(first_series: pandas.core.series.Series, second_series: pandas.core.series.Series, first_name: str = 'first', second_name: str = 'second', min_gap: int = 0, max_gap: int = 0)

Compute the join of two series of spans, where a pair of spans is considered to match if they are adjacent to each other in the text.

Parameters
  • first_series – Spans that appear earlier. dtype must be TokenSpanDtype.

  • second_series – Spans that come after. dtype must be TokenSpanDtype.

  • first_name – Column name for spans from first_series in the returned DataFrame.

  • second_name – Column name for spans from second_series in the returned DataFrame.

  • min_gap – Minimum number of spans allowed between matching pairs of spans, inclusive.

  • max_gap – Maximum number of spans allowed between matching pairs of spans, inclusive.

Returns

a new pd.DataFrame containing all pairs of spans that match the join predicate. Columns of the DataFrame will be named according to the first_name and second_name arguments.
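
The adjacency predicate can be illustrated with a small standalone sketch. It is not the library's implementation: spans are modeled as hypothetical (begin_token, end_token) tuples with an exclusive end, the helper name adjacent_pairs is made up for illustration, and the gap is assumed to be counted in tokens between the end of the first span and the start of the second.

```python
from typing import List, Tuple

# Hypothetical stand-in for a token span: (begin_token, end_token), end exclusive.
TokenSpan = Tuple[int, int]


def adjacent_pairs(
    first: List[TokenSpan],
    second: List[TokenSpan],
    min_gap: int = 0,
    max_gap: int = 0,
) -> List[Tuple[TokenSpan, TokenSpan]]:
    """Return (first, second) pairs whose gap lies in [min_gap, max_gap]."""
    pairs = []
    for f in first:
        for s in second:
            # Assumed definition: tokens between the end of f and the start of s.
            gap = s[0] - f[1]
            if min_gap <= gap <= max_gap:
                pairs.append((f, s))
    return pairs


first_spans = [(0, 2), (5, 6)]
second_spans = [(2, 3), (3, 4), (7, 9)]

# With the default gaps of 0, only directly adjacent spans match.
default_pairs = adjacent_pairs(first_spans, second_spans)
# Allowing one token in between admits two more pairs.
wider_pairs = adjacent_pairs(first_spans, second_spans, max_gap=1)
```

The nested loop keeps the predicate easy to read; a real join over large series would need a more efficient strategy.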

text_extensions_for_pandas.spanner.consolidate(df: pandas.core.frame.DataFrame, on: str, how: str = 'left_to_right') → pandas.core.frame.DataFrame

Eliminate overlap among the spans in one column of a pd.DataFrame.

Parameters
  • df – DataFrame containing spans and other attributes

  • on – Name of column in df on which to perform consolidation

  • how – What policy to use to decide what spans are considered to overlap and which of an overlapping pair will remain after consolidation. Available policies:

      • left_to_right – Walk through the spans from left to right, keeping the longest non-overlapping match at each position encountered.

Returns

the rows of df that remain after applying the specified policy to the spans in the column specified by on.
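
One plausible reading of the left_to_right policy is sketched below on plain (begin, end) tuples with an exclusive end. The function name and the tie-breaking details (sort by begin offset, then prefer the longest span, then drop anything that overlaps a span already kept) are assumptions made for illustration, not the library's code.

```python
from typing import List, Tuple

# Hypothetical stand-in for a span: (begin, end), end exclusive.
Span = Tuple[int, int]


def consolidate_left_to_right(spans: List[Span]) -> List[Span]:
    """Keep the longest span at each start position, skipping overlaps."""
    # Order by begin offset; among spans with the same begin, longest first.
    ordered = sorted(spans, key=lambda s: (s[0], -(s[1] - s[0])))
    kept: List[Span] = []
    last_end = None
    for begin, end in ordered:
        # A span survives only if it starts at or after the end of the
        # most recently kept span.
        if last_end is None or begin >= last_end:
            kept.append((begin, end))
            last_end = end
    return kept


spans = [(0, 5), (0, 3), (4, 8), (5, 7), (9, 10)]
result = consolidate_left_to_right(spans)
```

In this example (0, 3) and (4, 8) are dropped because they overlap (0, 5), while (5, 7) survives because it starts exactly where the kept span ends.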

text_extensions_for_pandas.spanner.contain_join(first_series: pandas.core.series.Series, second_series: pandas.core.series.Series, first_name: str = 'first', second_name: str = 'second')

Compute the join of two series of spans, where a pair of spans is considered to match if the second span is contained within the first.

Parameters
  • first_series – First set of spans to join, wrapped in a pd.Series

  • second_series – Second set of spans to join. These are the ones that are contained within the first set where the join predicate is satisfied.

  • first_name – Column name for spans from first_series in the returned DataFrame.

  • second_name – Column name for spans from second_series in the returned DataFrame.

Returns

a new pd.DataFrame containing all pairs of spans that match the join predicate. Columns of the DataFrame will be named according to the first_name and second_name arguments.
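
The containment predicate can be sketched on plain (begin, end) tuples with an exclusive end. The helper names are hypothetical, and treating two identical spans as a match (non-strict containment) is an assumption of this sketch rather than something the description above states.

```python
from typing import List, Tuple

# Hypothetical stand-in for a span: (begin, end), end exclusive.
Span = Tuple[int, int]


def contains(outer: Span, inner: Span) -> bool:
    """True if inner lies entirely within outer (bounds may coincide)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def contain_pairs(first: List[Span], second: List[Span]) -> List[Tuple[Span, Span]]:
    """Return (first, second) pairs where the second span is inside the first."""
    return [(f, s) for f in first for s in second if contains(f, s)]


sentences = [(0, 10), (12, 20)]
entities = [(2, 4), (8, 12), (12, 20)]
contained = contain_pairs(sentences, entities)
```

Here (8, 12) matches nothing because it crosses the boundary of the first span.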

text_extensions_for_pandas.spanner.extract_dict(tokens: Union[text_extensions_for_pandas.array.span.SpanArray, pandas.core.series.Series], dictionary: pandas.core.frame.DataFrame, output_col_name: str = 'match')

Identify all matches of a dictionary on a sequence of tokens.

Parameters
  • tokens – SpanArray of token information, optionally wrapped in a pd.Series.

  • dictionary – The dictionary to match, encoded as a pd.DataFrame in the format returned by load_dict()

  • output_col_name – (optional) name of column of matching spans in the returned DataFrame

Returns

a single-column DataFrame of token ID spans of dictionary matches
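
The idea of token-level dictionary matching can be sketched without the library. In this simplified illustration the dictionary is a plain list of token tuples instead of the DataFrame format produced by load_dict(), matching is assumed to be case-insensitive, and matches are reported as hypothetical (begin_token, end_token) pairs with an exclusive end.

```python
from typing import List, Sequence, Tuple


def dictionary_matches(
    tokens: Sequence[str],
    entries: Sequence[Tuple[str, ...]],
) -> List[Tuple[int, int]]:
    """Return token ranges whose tokens equal some dictionary entry."""
    # Assumption: entries and tokens are compared after lowercasing.
    entry_set = {tuple(t.lower() for t in e) for e in entries}
    max_len = max((len(e) for e in entry_set), default=0)
    lowered = [t.lower() for t in tokens]
    matches = []
    for i in range(len(lowered)):
        for length in range(1, max_len + 1):
            j = i + length
            if j > len(lowered):
                break  # window would run past the last token
            if tuple(lowered[i:j]) in entry_set:
                matches.append((i, j))
    return matches


doc_tokens = ["I", "live", "in", "New", "York", "City"]
dict_entries = [("new", "york"), ("new", "york", "city"), ("boston",)]
found = dictionary_matches(doc_tokens, dict_entries)
```

Both "New York" and "New York City" are reported, so a later consolidation step would be needed if only one of the overlapping matches should remain.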

text_extensions_for_pandas.spanner.extract_regex_tok(tokens: Union[text_extensions_for_pandas.array.span.SpanArray, pandas.core.series.Series], compiled_regex: regex.regex.compile, min_len=1, max_len=1, output_col_name: str = 'match')

Identify all (possibly overlapping) matches of a regular expression that start and end on token boundaries.

Parameters
  • tokens – SpanArray of token information, optionally wrapped in a pd.Series.

  • compiled_regex – Regular expression to evaluate.

  • min_len – Minimum match length in tokens

  • max_len – Maximum match length (inclusive) in tokens

  • output_col_name – (optional) name of column of matching spans in the returned DataFrame

Returns

A single-column DataFrame containing a span for each match of the regex.
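
A rough sketch of regex matching constrained to token boundaries follows. It uses the standard library re module rather than the regex package named in the signature, models tokens as hypothetical (begin_char, end_char) offsets, and tests every window of min_len to max_len consecutive tokens with a full match. The helper name is made up for illustration.

```python
import re
from typing import List, Pattern, Sequence, Tuple


def regex_token_matches(
    text: str,
    token_offsets: Sequence[Tuple[int, int]],
    pattern: Pattern[str],
    min_len: int = 1,
    max_len: int = 1,
) -> List[Tuple[int, int]]:
    """Return (begin_token, end_token) ranges whose text fully matches."""
    matches = []
    n = len(token_offsets)
    for i in range(n):
        for length in range(min_len, max_len + 1):
            j = i + length
            if j > n:
                break  # window would run past the last token
            begin_char = token_offsets[i][0]
            end_char = token_offsets[j - 1][1]
            # fullmatch restricted to the character range of the window
            if pattern.fullmatch(text, begin_char, end_char):
                matches.append((i, j))
    return matches


text = "Call 555 1234 now"
# Character offsets of the tokens "Call", "555", "1234", "now".
offsets = [(0, 4), (5, 8), (9, 13), (14, 17)]
digits = re.compile(r"\d+( \d+)?")
hits = regex_token_matches(text, offsets, digits, min_len=1, max_len=2)
```

The result contains overlapping windows ("555", "555 1234", and "1234"), which mirrors the "possibly overlapping" wording above.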

text_extensions_for_pandas.spanner.lemmatize(spans: Union[pandas.core.series.Series, text_extensions_for_pandas.array.span.SpanArray, Iterable[text_extensions_for_pandas.array.span.Span]], token_features: pandas.core.frame.DataFrame, lemma_col_name: str = 'lemma', token_span_col_name: str = 'span') → List[str]

Convert spans to their normal form using lemma information in a token features table.

Parameters
  • spans – Spans to be normalized. Each may represent zero or more tokens.

  • token_features – DataFrame of token metadata. Index must be aligned with the token indices in spans.

  • lemma_col_name – Optional custom name for the DataFrame column containing the lemmatized form of each token.

  • token_span_col_name – Optional custom name for the DataFrame column containing the span of each token.

Returns

A list containing normalized versions of the tokens in spans, with the tokens of each span separated by a single space character.
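
The normalization step can be pictured with a minimal sketch. It replaces the token features DataFrame with a plain list of lemmas indexed by token position and models spans as hypothetical (begin_token, end_token) pairs with an exclusive end; both simplifications are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple


def lemmatize_spans(
    spans: Sequence[Tuple[int, int]],
    lemmas: Sequence[str],
) -> List[str]:
    """Join the lemmas of each span's tokens with single spaces."""
    return [" ".join(lemmas[begin:end]) for begin, end in spans]


# Lemmas for the tokens of "The dogs were running quickly".
token_lemmas = ["the", "dog", "be", "run", "quickly"]
# One single-token span, one two-token span, and one empty span.
example_spans = [(1, 2), (2, 4), (0, 0)]
normalized = lemmatize_spans(example_spans, token_lemmas)
```

A span that covers zero tokens yields an empty string in this sketch.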

text_extensions_for_pandas.spanner.overlap_join(first_series: pandas.core.series.Series, second_series: pandas.core.series.Series, first_name: str = 'first', second_name: str = 'second')

Compute the join of two series of spans, where a pair of spans is considered to match if they overlap.

Parameters
  • first_series – First set of spans to join, wrapped in a pd.Series

  • second_series – Second set of spans to join.

  • first_name – Column name for spans from first_series in the returned DataFrame.

  • second_name – Column name for spans from second_series in the returned DataFrame.

Returns

a new pd.DataFrame containing all pairs of spans that match the join predicate. Columns of the DataFrame will be named according to the first_name and second_name arguments.
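
The overlap predicate can be sketched on plain (begin, end) tuples with an exclusive end. The helper names are hypothetical, and the choice that spans which merely touch at a boundary do not overlap is an assumption of this sketch.

```python
from typing import List, Tuple

# Hypothetical stand-in for a span: (begin, end), end exclusive.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def overlap_pairs(first: List[Span], second: List[Span]) -> List[Tuple[Span, Span]]:
    """Return every (first, second) pair of overlapping spans."""
    return [(f, s) for f in first for s in second if overlaps(f, s)]


left = [(0, 5), (10, 15)]
right = [(4, 6), (5, 10), (14, 20)]
overlapping = overlap_pairs(left, right)
```

The span (5, 10) touches both spans on the left without sharing a position with either, so it appears in no pair.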