Aggregating functions for DOCS

Aggregating file to base DOCS file

Documents.aggregate(filename: str, *, left_on: None | str | callable = None, right_on: None | str | callable = None, on: str | None = None, subset: None | str | List[str] = None, drop: None | str | List[str] = None, rename: dict | None = None, merge_policy: Literal['merge', 'keep', 'erase', 'replace', 'fill_na'] | None = 'merge', preprocessing: None | Callable | Operation = None, postprocessing: None | Callable | Operation = None, read_method: Callable | None = None, kw_read: dict | None = {})

Aggregation of a table from an Excel/CSV file.

The left_on and right_on arguments are column names used to perform a join: left_on is a column in the central file effectif.xlsx, and right_on is a column from the file to aggregate. For more complex joins, you can use the guv.helpers.id_slug() and guv.helpers.concat() functions (see Examples). If left_on and right_on have the same value, you can simply specify on.

subset is a list of columns to keep if you don’t want to aggregate all columns. drop is a list of columns to remove. rename is a dictionary of columns to rename. read_method is a callable used to read the file and is called with kw_read. preprocessing and postprocessing are callables that take a DataFrame as input and return a processed one — they are applied before and after the aggregation, respectively. merge_policy indicates how to handle merging columns with the same name in effectif.xlsx and the aggregated file.

Parameters:
  • filename (str) – Path to the CSV/Excel file to aggregate.

  • left_on (str) – Column name in the effectif.xlsx file used for the join. You can also use functions like guv.helpers.id_slug() and guv.helpers.concat() for multi-column joins.

  • right_on (str) – Column name in the file to aggregate used for the join. Functions like guv.helpers.id_slug() and guv.helpers.concat() are also supported.

  • on (str) – Shortcut when left_on and right_on are the same.

  • subset (list, optional) – List of columns to include. By default, all columns are incorporated.

  • drop (list, optional) – List of columns to exclude from aggregation.

  • rename (dict, optional) – Dictionary to rename columns after incorporation.

  • read_method (callable, optional) – Function used to load the file. Pandas functions like pd.read_csv and pd.read_excel are automatically selected based on file extension (“.csv”, “.xlsx”).

  • kw_read (dict, optional) –

    Keyword arguments passed to read_method. For example, for a “.csv” file:

    kw_read={"header": None, "names": ["Email", "PW group"]}
    kw_read={"na_values": "-"}
    

  • preprocessing (callable, optional) – Pre-processing function applied to the DataFrame before incorporation.

  • postprocessing (callable, optional) – Post-processing function applied to the DataFrame after incorporation.

  • merge_policy (str, optional) –

    Strategy for merging columns with the same name:

    • merge: Merge only if columns complement each other (NA values).

    • replace: Use all non-NA values from the file to aggregate.

    • fill_na: Only replace NA values in effectif.xlsx.

    • keep: Keep the original column from effectif.xlsx without changes.

    • erase: Overwrite the column with values from the file to aggregate.

Examples

  • Aggregating columns from a CSV file, matching the email column in the CSV with Email address in the central file:

    DOCS.aggregate(
        "documents/notes.csv",
        left_on="Email address",
        right_on="email"
    )
    
  • Aggregating only the Note column:

    DOCS.aggregate(
        "documents/notes.csv",
        left_on="Email address",
        right_on="email"
        subset="Note"
    )
    
  • Aggregating and renaming the Note column to Note_médian:

    DOCS.aggregate(
        "documents/notes.csv",
        left_on="Email address",
        right_on="email"
        subset="Note",
        rename={"Note": "Note_médian"}
    )
    
  • Aggregating using a CSV without a header:

    DOCS.aggregate(
        "documents/notes.csv",
        on="Email",
        kw_read={"header": None, "names": ["Email", "Grade"]},
    )
    
  • Aggregating based on Name and Last name using a slug ID to allow for flexible matching (ignores accents, cases, hyphens, etc.):

    from guv.helpers import id_slug
    DOCS.aggregate(
        "documents/notes.csv",
        left_on=id_slug("Name", "Last name"),
        right_on=id_slug("Name", "Last name")
    )
    
  • Aggregating when the aggregated file has a single column with both names, while the main file separates them:

    from guv.helpers import concat
    DOCS.aggregate(
        "documents/notes.csv",
        left_on=concat("Name", "Last name"),
        right_on="Full_name"
    )
    
Documents.aggregate_jury(filename: str)

Aggregates the result of a jury from the task XlsGradeBookJury.

The equivalent using aggregate() is written as:

DOCS.aggregate(
    "generated/jury_gradebook.xlsx",
    on="Email",
    subset=["Aggregated grade", "ECTS grade"]
)
Parameters:

filename (str) – The path to the file to aggregate.

Examples

DOCS.aggregate_jury("generated/jury_gradebook.xlsx")
Documents.aggregate_moodle_grades(filename: str, rename: dict | None = None)

Aggregates grade sheets exported from Moodle.

The grade sheet must be an export of one or more grades from the Moodle gradebook in the form of an Excel or CSV file.

Unused columns will be removed. Column renaming can be performed by specifying rename.

Parameters:
  • filename (str) – The path to the file to aggregate.

  • rename (dict, optional) – Dictionary to rename columns after incorporation.

Examples

DOCS.aggregate_moodle_grades("documents/SY02 Notes.xlsx")
Documents.aggregate_moodle_groups(filename: str, colname: str, backup: bool | None = False)

Aggregates group data from the “Group Choice” activity.

Since the group column is always named “Groupe”, the colname argument allows you to specify a different name. If the column already exists in the central file, it can be backed up using the backup option.

Parameters:
  • filename (str) – The path to the file to aggregate.

  • colname (str) – The column name to store group information in.

  • backup (bool) – Save the column before making any changes

Examples

DOCS.aggregate_moodle_groups("documents/Paper study groups.xlsx", "Paper")
Documents.aggregate_org(filename: str, colname: str, on: str | None = None, postprocessing: None | Callable | Operation = None)

Aggregation of a file in Org format.

The document to aggregate is in Org format. The headings serve as keys for aggregation, and the content under those headings is aggregated.

Parameters:
  • filename (str) – Path to the Org file to aggregate.

  • colname (str) – Name of the column in which to store the information from the Org file.

  • on (str, optional) – Column from the effectif.xlsx file used as the key to aggregate with the headings from the Org file. By default, the headings must contain the students’ first and last names.

  • postprocessing (callable, optional) – Post-processing to apply to the DataFrame after integrating the Org file.

Examples

  • Aggregating a file where the headings are student names:

    The Org file:

    * Bob Morane
      Frequently absent
    * Untel
      See excuse email
    

    The aggregation command:

    DOCS.aggregate_org("documents/infos.org", colname="Information")
    
  • Aggregating a file where the headings match elements from an existing column. For example, you can aggregate project grades from an Org file, grouped by project team. Specify the column containing the project groups, e.g., “Project1_group”.

    The Org file:

    * Project1_Group1
      A
    * Project1_Group2
      B
    

    The aggregation command:

    DOCS.aggregate_org("documents/infos.org", colname="Project 1 Grade", on="Project1_group")
    
Documents.aggregate_self(*columns)

Aggregation of the central file effectif.xlsx itself.

It is sometimes more convenient to manually add or modify columns directly in the central file effectif.xlsx rather than programmatically using DOCS.aggregate(...), for example. Since guv cannot reliably detect which columns were manually added or modified, you must specify them explicitly using the *columns argument.

Parameters:

*columns (any number of str) – The manually added or modified columns to retain when updating the central file.

Directly modify base DOCS file

Documents.add(filename, func=None)

Adds a file to the central file

Simply adds the file if the central file does not yet exist. Otherwise, uses the function func to perform the aggregation. The function func takes as arguments the existing DataFrame and the path to the file, and returns the updated DataFrame.

See specialized functions for incorporating standard documents:

Parameters:
  • filename (str) – The path to the file to aggregate.

  • func (callable, optional) – A function with signature DataFrame, filename: str -> DataFrame that performs the aggregation.

Examples

  • Adding data when there is nothing yet:

    DOCS.add("documents/base_listing.xlsx)
    
  • Aggregating using a function:

    def function_that_incorporates(df, file_path):
        # Incorporate the file at `file_path` into the DataFrame `df`
        # and return the updated DataFrame.
    
    DOCS.add("documents/notes.csv", func=function_that_incorporates)
    
Documents.apply_cell(name_or_email: str, colname: str, value, msg: str | None = None)

Replaces the value of a cell.

name_or_email is the student’s full name or email address, and colname is the name of the column where the change should be made. The new value is provided via value.

Parameters:
  • name_or_email (str) – The student’s full name or email address.

  • colname (str) – The name of the column where the modification should be made.

  • value – The value to assign.

  • msg (str, optional) – A message describing the operation.

Examples

DOCS.apply_cell("Mark Watney", "DIY Grade", 20)
Documents.apply_column(colname: str, func: Callable, msg: str | None = None)

Modifies an existing column using a function.

colname is the name of an existing column, and func is a function that takes an element from the column as input and returns a modified value.

A msg can be provided to describe the operation; it will be displayed when the aggregation is performed. Otherwise, a generic message will be shown.

Parameters:
  • colname (str) – Name of the column where replacements will be made.

  • func (callable) – Function that takes an element and returns the modified element.

  • msg (str, optional) – A message describing the operation.

Examples

DOCS.apply("note", lambda e: float(str(e).replace(",", ".")))
Documents.apply_df(func: Callable, msg: str | None = None)

Modifies the central file using a function.

func is a function that takes a DataFrame representing the central file and returns the modified DataFrame.

A msg can be provided to describe what the function does. It will be displayed when the aggregation is performed. Otherwise, a generic message will be shown.

Parameters:
  • func (callable) – Function that takes a DataFrame and returns a modified DataFrame.

  • msg (str, optional) – A message describing the operation.

Examples

  • Add a student missing from the official listing:

    import pandas as pd
    
    df_one = (
        pd.DataFrame(
            {
                "Last name": ["NICHOLS"],
                "Name": ["Juliette"],
                "Email": ["juliette.nichols@silo18.fr"],
            }
        ),
    )
    
    DOCS.apply_df(lambda df: pd.concat((df, df_one)))
    
  • Remove duplicate students:

    DOCS.apply_df(
        lambda df: df.loc[~df["Email"].duplicated(), :],
        msg="Remove duplicate students"
    )
    
Documents.compute_new_column(*cols: str, func: Callable, colname: str, msg: str | None = None)

Creating a column from other columns

The columns required for the calculation are specified in cols. In case you want to change the column used for the calculation without modifying the func function, you can provide a tuple ("col", "other_col") where col is the name used in func and other_col is the actual column used.

The func function that computes the new column receives a Pandas Series of all the values from the specified columns.

Parameters:
  • *cols (list of str) – List of columns provided to the func function

  • func (callable) – Function that takes a dictionary of “column names/values” as input and returns a computed value

  • colname (str) – Name of the column to create

  • msg (str, optional) – A message describing the operation

Examples

  • Weighted average of two grades:

    def average(grades):
        return .4 * grades["Note_médian"] + .6 * grades["Note_final"]
    
    DOCS.compute_new_column("Note_médian", "Note_final", func=average, colname="Note_moyenne")
    
  • Average ignoring undefined values:

    def average(grades):
        return grades.mean()
    
    DOCS.compute_new_column("note1", "note2", "note3", func=average, colname="Note_moyenne")
    
  • Recalculation with a modified grade without redefining the average function:

    DOCS.compute_new_column(
        ("note1", "note1_fix"), "note2", "note3", func=average, colname="Note_moyenne (fix)"
    )
    
Documents.fillna_column(colname: str, *, na_value: str | None = None, group_column: str | None = None)

Replaces undefined values in the colname column

Only one of the options na_value and group_column should be specified. If na_value is specified, values are unconditionally replaced with the provided value. If group_column is specified, values are filled by grouping on group_column and using the only valid value in that column per group.

Parameters:
  • colname (str) – Name of the column where NA values will be replaced

  • na_value (str, optional) – Value to replace undefined values with

  • group_column (str, optional) – Name of the column used for grouping

Examples

  • Replace undefined entries in the note column with “ABS”:

    DOCS.fillna_column("note", na_value="ABS")
    
  • Replace undefined entries within each group identified by the groupe_projet column with the only defined value in that group:

    DOCS.fillna_column("note_projet", group_column="groupe_projet")
    
Documents.flag(filename_or_string: str, *, colname: str, flags: List[str] | None = ['Oui', ''])

Flag a list of students in a new column

The document to aggregate is a list of student names displayed line by line.

Parameters:
  • filename_or_string (str) – Path to the file to aggregate, or the file content as a string.

  • colname (str) – Name of the column in which to place the flag.

  • flags (str, optional) – The two flag values used, default is “Yes” and empty.

Examples

Aggregating a file with student names as header:

The file “tiers_temps.txt”:

# Comments
Bob Morane

# Robust to circular permutation and case
aRcTor BoB

The aggregation instruction:

DOCS.flag("documents/tiers_temps.txt", colname="Tiers-temps")
Documents.replace_column(colname: str, rep_dict: dict, new_colname: str | None = None, backup: bool | None = False, msg: str | None = None)

Replacements in a column

Replaces values specified in rep_dict in the column colname.

If the backup argument is provided, the column is saved before any modification (with a _orig suffix). If the new_colname argument is provided, the column is copied to a new column named new_colname and changes are made on this new column.

A message msg can be specified to describe what the function does; it will be displayed when the aggregation is performed. Otherwise, a generic message will be shown.

Parameters:
  • colname (str) – Name of the column in which to perform replacements

  • rep_dict (dict) – Dictionary of replacements

  • new_colname (str) – Name of the new column

  • backup (bool) – Save the column before making any changes

  • msg (str, optional) – A message describing the operation

Examples

DOCS.replace_column("group", {"TD 1": "TD1", "TD 2": "TD2"})
ECTS_TO_NUM = {
    "A": 5,
    "B": 4,
    "C": 3,
    "D": 2,
    "E": 1,
    "F": 0
}
DOCS.replace_column("Note_TP", ECTS_TO_NUM, backup=True)
Documents.replace_regex(colname: str, *reps, new_colname: str | None = None, backup: bool | None = False, msg: str | None = None)

Regex replacements in a column

Replaces, in the colname column, all occurrences matching the regular expressions specified in reps.

If the backup argument is provided, the column is saved before any modification (with a _orig suffix). If the new_colname argument is provided, the column is copied to a new column named new_colname and the modifications are applied to that new column.

A msg message can be specified to describe what the function does; it will be displayed when the aggregation is performed. Otherwise, a generic message will be shown.

Parameters:
  • colname (str) – Name of the column in which to perform replacements

  • *reps (any number of tuple) – Pairs of regex / replacement

  • new_colname (str) – Name of the new column

  • backup (bool) – Save the column before making any changes

  • msg (str, optional) – A message describing the operation

Examples

DOCS.replace_regex("group", (r"group([0-9])", r"G\1"), (r"g([0-9])", r"G\1"))
Documents.switch(filename_or_string: str, *, colname: str, backup: bool = False, new_colname: str | None = None)

Performs value swaps in a column

The colname argument specifies the column in which to perform the swaps. If the backup argument is provided, the column is saved before any modification (with a _orig suffix). If the new_colname argument is provided, the column is copied to a new column named new_colname and the modifications are applied to this new column.

Parameters:
  • filename_or_string (str) – Path to the file to aggregate, or the content of the file as a string.

  • colname (str) – Name of the column where the changes will be applied

  • backup (bool) – Save the column before making any changes

  • new_colname (str) – Name of the new column

Examples

DOCS.switch("fichier_échange_TP", colname="TP")
DOCS.switch("Dupont --- Dupond", colname="TP")