ashpool package¶
Submodules¶
ashpool.ashpool module¶
ashpool.dummy module¶
Module contents¶
-
ashpool.attach_temp_id(dframe, field_list=None, id_label=u'tempid', append_uuid=False, prefix=u'')¶ Attach a column with an ID created from field_list and optionally append a UUID.
- Arguments:
- dframe {pandas.DataFrame} – Input dataframe.
- Keyword Arguments:
field_list {list} – List of columns to use to build tempid. (default: {None})
append_uuid {bool} – True appends uuid. (default: {False})
- Returns:
- pandas.DataFrame – Dataframe with a tempid column built from field_list.
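The idea of building an ID from several columns can be sketched as follows. This is a minimal illustration, not ashpool's implementation: the function name attach_temp_id_sketch, the "_" separator, and the fallback to all columns when field_list is None are assumptions.

```python
import uuid

import pandas as pd


def attach_temp_id_sketch(dframe, field_list=None, id_label="tempid",
                          append_uuid=False, prefix=""):
    """Hypothetical stand-in: join the chosen columns into one string ID."""
    # Assumption: with no field_list, every column contributes to the ID.
    fields = list(dframe.columns) if field_list is None else list(field_list)
    out = dframe.copy()
    # Cast each value to str and join row-wise with "_" (assumed separator).
    ids = [prefix + "_".join(map(str, row))
           for row in out[fields].itertuples(index=False)]
    if append_uuid:
        # A random UUID suffix makes every ID distinct, even for duplicate rows.
        ids = ["{}_{}".format(value, uuid.uuid4().hex) for value in ids]
    out[id_label] = ids
    return out
```

The input dataframe is copied, so the caller's dataframe is left unchanged.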
-
ashpool.attach_unique_id(dframe, threshold=1.0)¶ Return a new dataframe based on input dframe with unique fields attached.
- Arguments:
dframe {pandas.DataFrame} – Source dataframe
threshold {float} – Specify how unique 0.0 to 1.0 (most unique)
- Returns:
- pandas.DataFrame – Dataframe with a new id (u_id), which meets the threshold for uniqueness. u_id is based on a combination of non-numeric series, trying to meet the uniqueness threshold with the fewest number of series.
-
ashpool.best_id_pair(dframe_l, dframe_r, threshold=0.5)¶ Return df showing which IDs are best for matching two dfs.
- Arguments:
dframe_l {pandas.DataFrame} – Left dataframe.
dframe_r {pandas.DataFrame} – Right dataframe.
- Keyword Arguments:
- threshold {float} – Value between 0 and 1 that represents minimum coveredness (default: {0.5})
- Returns:
- pandas.DataFrame – Dataframe showing best IDs to use to align source dataframes.
-
ashpool.check_coveredness(dframe_l, dframe_r)¶ Returns ratings of coveredness for columns in dframe_l
- Arguments:
dframe_l {pandas.DataFrame} – Source dataframe.
dframe_r {pandas.DataFrame} – Target dataframe.
- Returns:
- pandas.DataFrame – Dataframe showing statistics regarding each column's coveredness.
-
ashpool.completeness(srs)¶ Return completeness score for series - i.e., the percentage of non-null values in a series.
- Arguments:
- srs {pandas.Series}
- Returns:
- float
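A minimal sketch of the completeness idea, assuming the score is expressed as a fraction between 0 and 1 and that an empty series scores 0.0. The name completeness_sketch is hypothetical and this is not necessarily how ashpool computes it.

```python
import pandas as pd


def completeness_sketch(srs):
    """Hypothetical stand-in: share of non-null values in a series."""
    if len(srs) == 0:
        # Assumption: an empty series is treated as 0.0 rather than an error.
        return 0.0
    return float(srs.notna().sum()) / len(srs)
```

For example, a series of four values with two nulls scores 0.5 under this definition.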
-
ashpool.coveredness(srs_l, srs_r)¶ Returns percentage of srs_l members that can be found in srs_r
- Arguments:
srs_l {pandas.Series} – Source series.
srs_r {pandas.Series} – Target series.
- Returns:
- float – Percentage of srs_l members that can be found in srs_r
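Coveredness can be illustrated with pandas' isin. The sketch below is an assumption about the calculation: it counts non-null rows of srs_l (not distinct values) and returns a fraction between 0 and 1. The name coveredness_sketch is hypothetical.

```python
import pandas as pd


def coveredness_sketch(srs_l, srs_r):
    """Hypothetical stand-in: share of srs_l values that also occur in srs_r."""
    left = srs_l.dropna()
    if left.empty:
        # Assumption: nothing to cover means a score of 0.0.
        return 0.0
    # isin marks each left value found among the right values.
    return float(left.isin(srs_r.dropna()).mean())
```

Note that the measure is not symmetric: swapping the two series generally gives a different score.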
-
ashpool.cum_uniq(dframe, flds=None)¶ Return list of incremental uniqueness as tempid is created based on flds.
- Arguments:
- dframe {pandas.DataFrame} – Source dataframe.
- Keyword Arguments:
- flds {list} – List of column names to be used to create the tempid. (default: {None})
- Returns:
- list – List of floats representing incremental addition to uniqueness as more columns are used to create a tempid.
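One way to picture how uniqueness grows as columns are added is sketched below. It is an assumption-laden illustration: cum_uniq_sketch is a hypothetical name, and it reports the cumulative uniqueness after each added column, whereas the library may report the increments instead.

```python
import pandas as pd


def cum_uniq_sketch(dframe, flds=None):
    """Hypothetical stand-in: uniqueness of the ID after each added column."""
    fields = list(dframe.columns) if flds is None else list(flds)
    if len(dframe) == 0:
        return []
    scores = []
    for i in range(1, len(fields) + 1):
        # Distinct combinations of the first i columns, as a share of all rows.
        n_unique = len(dframe[fields[:i]].drop_duplicates())
        scores.append(n_unique / len(dframe))
    return scores
```

Per-column increments can be derived from the result by differencing consecutive entries.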
-
ashpool.depiction(srs)¶ Returns description (depiction) of series.
- Arguments:
- srs {pandas.Series}
- Returns:
- pandas.DataFrame
-
ashpool.differ(dframe_l, dframe_r, left_on, right_on, fields_l=None, fields_r=None, show_diff=False, show_ratio=False, show_data=True, tol_pct=0.0, tol_abs=0.0, depict=False, **kwargs)¶ Returns dataframe showing comparison between fields_l and fields_r. Dataframes are first aligned using left_on and right_on.
- Arguments:
dframe_l {pandas.DataFrame} – Left dataframe
dframe_r {pandas.DataFrame} – right dataframe
left_on {list} – list of series names
right_on {list} – list of series names
- Keyword Arguments:
fields_l {list} – List of series names to compare (default: {None})
fields_r {list} – List of series to compare (default: {None})
show_diff {bool} – If true return difference between comparison series (default: {False})
show_ratio {bool} – If true return ratio between comparison series (default: {False})
show_data {bool} – If true return data series in returned results (default: {True})
tol_pct {float} – Tolerance in percentage terms when considering matches in numerical data (default: {0.0})
tol_abs {float} – Tolerance in units when considering matches in numerical data (default: {0.0})
depict {bool} – If true return stats regarding differ results per comparison pair (depiction) (default: {False})
- Returns:
- pandas.DataFrame – Dataframe showing comparison between fields_l and fields_r.
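The align-then-compare flow described above can be sketched with pandas.merge. This is a simplified illustration for a single pair of fields, not the library's code: the name align_and_diff_sketch, the outer join, and the rule that a pair matches when the absolute difference is within tol_abs are all assumptions.

```python
import pandas as pd


def align_and_diff_sketch(dframe_l, dframe_r, left_on, right_on,
                          field_l, field_r, tol_abs=0.0):
    """Hypothetical stand-in: align on key columns, then compare one pair."""
    # Assumption: an outer join, so unmatched rows from either side are kept.
    # indicator=True adds a _merge column (left_only, right_only, both).
    aligned = dframe_l.merge(dframe_r, left_on=left_on, right_on=right_on,
                             how="outer", indicator=True)
    # Assumes field_l and field_r have different names, so no suffixes apply.
    aligned["diff"] = aligned[field_l] - aligned[field_r]
    # Rows missing on one side have a NaN diff and therefore never match.
    aligned["match"] = aligned["diff"].abs() <= tol_abs
    return aligned
```

Keeping the _merge indicator makes it easy to separate rows that failed to align from rows that aligned but differ in value.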
-
ashpool.get_combos(lst)¶ Returns list of combinations of members of list.
- Arguments:
- lst {list} – List of strings
- Returns:
- list – List of combinations
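A sketch of generating combinations with the standard library's itertools.combinations. Whether ashpool returns combinations of every length, and in which order, is an assumption here; get_combos_sketch is a hypothetical name.

```python
from itertools import combinations


def get_combos_sketch(lst):
    """Hypothetical stand-in: all non-empty combinations, shortest first."""
    return [list(combo)
            for size in range(1, len(lst) + 1)
            for combo in combinations(lst, size)]
```

Ordering the results from shortest to longest suits a search that wants the fewest columns meeting a uniqueness threshold.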
-
ashpool.get_dtypes(dframe)¶ Return dtypes and kinds by column name (fld).
-
ashpool.get_most_coveredness(srs_l, dframe_r, top_limit=3)¶ Returns columns that most cover source series
- Arguments:
srs_l {pandas.Series} – Input series.
dframe_r {pandas.DataFrame} – Target dataframe to search.
- Keyword Arguments:
- top_limit {int} – Maximum number of column names (default: {3})
- Returns:
- list – List of columns from dframe_r that most cover srs_l.
-
ashpool.get_sorted_fields(dframe)¶ Returns list of fields sorted by most_complete, most_unique, and non_object
- Arguments:
- dframe {pandas.DataFrame} – Input dataframe.
- Returns:
- dict – Dictionary of lists with fields sorted by completeness and uniqueness. Also includes a list of non_object fields, which are not ranked for completeness or uniqueness.
-
ashpool.get_unique_fields(dframe, candidate_flds, threshold=1.0, max_member_length=30, show_all=False)¶ Return list of fields that combine to create an ID that has uniqueness >= threshold.
- Arguments:
dframe {pandas.DataFrame} – Input dataframe.
candidate_flds {list} – List of column names.
- Keyword Arguments:
threshold {float} – Uniqueness threshold where 1.0 is perfectly unique (default: {1.0})
max_member_length {int} – Used to filter out columns which have members that are too lengthy. (default: {30})
show_all {bool} – Show all results even if uniqueness does not meet threshold. (default: {False})
- Returns:
- list – List of column names that combine to create a unique ID that meets the uniqueness threshold. Stops looking after finding the first list that meets threshold.
-
ashpool.has_name_match(srs_l, dframe_r)¶ Returns True if srs_l name found in dframe_r
- Arguments:
srs_l {pandas.Series} – Source series.
dframe_r {pandas.DataFrame} – Dataframe to search.
- Returns:
- bool – True if srs_l.name is found in dframe_r.columns.
-
ashpool.jaccard_similarity(srs_l, srs_r)¶ Returns the Jaccard similarity between two lists.
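For reference, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below shows that definition in plain Python. The name jaccard_sketch and the choice to return 0.0 for two empty inputs are assumptions, not documented library behavior.

```python
def jaccard_sketch(left, right):
    """Hypothetical stand-in: |A intersect B| / |A union B| on distinct members."""
    set_l, set_r = set(left), set(right)
    union = set_l | set_r
    if not union:
        # Assumption: two empty inputs score 0.0 instead of dividing by zero.
        return 0.0
    return len(set_l & set_r) / len(union)
```

Because the inputs are reduced to sets, duplicates and ordering do not affect the score.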
-
ashpool.leven_dist(x, y)¶ Returns the Levenshtein distance between two strings, or a null if either argument is not a valid string.
- Arguments:
x {str} – First string.
y {str} – Second string.
- Returns:
- long – Levenshtein distance
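Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A dependency-free sketch of the standard dynamic-programming approach follows. The name leven_dist_sketch is hypothetical, returning None for non-string input is an assumption about what "null" means here, and ashpool may rely on a different implementation.

```python
def leven_dist_sketch(x, y):
    """Hypothetical stand-in: edit distance, or None for non-string input."""
    if not isinstance(x, str) or not isinstance(y, str):
        return None
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, char_x in enumerate(x, start=1):
        curr = [i]
        for j, char_y in enumerate(y, start=1):
            cost = 0 if char_x == char_y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

Keeping only two rows of the table keeps memory proportional to the length of y.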
-
ashpool.longest_element(srs)¶ Return the maximum len() of any element in the series.
- Arguments:
- srs {pandas.Series}
- Returns:
- float
-
ashpool.make_good_label(x_value)¶ Return something that is a better label.
- Arguments:
- x_value {string} – or something that can be converted to a string
-
ashpool.mash(dframe, flds=None, keep_zeros=False)¶ Returns a dataframe keeping only the rows that are non-null and non-zero on flds.
- Arguments:
- dframe {pandas.DataFrame} – Input dataframe
- Keyword Arguments:
flds {list} – List of column names (default: {None})
keep_zeros {bool} – True will keep zeros. (default: {False})
- Returns:
- pandas.DataFrame – Dataframe with rows removed if null or zero on column[flds].
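The row filtering described here can be sketched with dropna and a boolean mask. Several details are assumptions: the name mash_sketch is hypothetical, a row is dropped if any of the listed columns is null or zero, and all columns are used when flds is None.

```python
import pandas as pd


def mash_sketch(dframe, flds=None, keep_zeros=False):
    """Hypothetical stand-in: drop rows that are null (or zero) on flds."""
    fields = list(dframe.columns) if flds is None else list(flds)
    # Assumption: a null in any listed column removes the row.
    out = dframe.dropna(subset=fields)
    if not keep_zeros:
        # Keep rows where every listed column differs from zero.
        out = out[(out[fields] != 0).all(axis=1)]
    return out
```

With keep_zeros=True only the null filter applies, so rows holding a legitimate zero survive.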
-
ashpool.oneness(srs_l, srs_r)¶ TODO
-
ashpool.rate_series(dframe)¶ Return ratings of fields for completeness and uniqueness.
- Arguments:
- dframe {pandas.DataFrame} – Input dataframe.
- Returns:
- pandas.DataFrame – Dataframe with statistics regarding the quality of columns as identifiers.
-
ashpool.reconcile(dframe_l, dframe_r, fields_l=None, fields_r=None, show_diff=True, show_ratio=False, show_data=True, tol_pct=0.0, tol_abs=0.0, depict=False, breaks_only=False)¶ Aligns and compares two dataframes
- Arguments:
dframe_l {pandas.DataFrame} – left dataframe
dframe_r {pandas.DataFrame} – right dataframe
- Keyword Arguments:
fields_l {list} – list of column names to compare from dframe_l (default: {None})
fields_r {list} – list of column names to compare from dframe_r (default: {None})
show_diff {bool} – whether or not to include a calculation of the difference in results (default: {True})
breaks_only {bool} – return only those rows that are not matched (default: {False})
- Returns:
- dataframe – shows results of the comparison
-
ashpool.suggest_id_pairs(dframe_l, dframe_r, threshold=0.5, incl_all_dtypes=False, incl_all_pairs=False)¶ Suggest matching series from two dfs.
- Arguments:
dframe_l {pandas.DataFrame} – Left dataframe.
dframe_r {pandas.DataFrame} – Right dataframe.
- Keyword Arguments:
threshold {float} – Value between 0 and 1 that represents minimum coveredness. (default: {0.5})
incl_all_dtypes {bool} – Try to use all dtypes (not just object) if True. (default: {False})
incl_all_pairs {bool} – Show all pairs regardless of threshold. (default: {False})
- Returns:
- pandas.DataFrame – Statistics regarding which pairs of columns to use as IDs and their score (id_scr).
-
ashpool.uniqueness(srs)¶ Return uniqueness score for series - i.e., the percentage of unique values in a series (excl. nulls).
- Arguments:
- srs {pandas.Series}
- Returns:
- float
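A minimal sketch of the uniqueness score, assuming it is the number of distinct non-null values divided by the number of non-null values, expressed as a fraction between 0 and 1. The name uniqueness_sketch is hypothetical and the library's exact formula may differ.

```python
import pandas as pd


def uniqueness_sketch(srs):
    """Hypothetical stand-in: distinct non-null values per non-null value."""
    non_null = srs.dropna()
    if non_null.empty:
        # Assumption: a series with no usable values scores 0.0.
        return 0.0
    return non_null.nunique() / len(non_null)
```

A score of 1.0 means every non-null value is distinct, which is what makes a column a candidate identifier.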