common module
The common module contains common functions and classes used by the other modules.
dataframe_to_dict(df, orientation='columns')
Convert a DataFrame to a dictionary in various orientations.
Parameters:
- df: The DataFrame to convert.
- orientation: The dictionary orientation. Options:
  - "columns": {column -> {index -> value}}
  - "index": {index -> {column -> value}}
  - "records": [{column -> value}, ... , {column -> value}]
  - "list": {column -> [values]}
  - "split": Dict with keys: index, columns, and data
Returns:
- Dictionary representation of the DataFrame.
Source code in intellikit/common.py
def dataframe_to_dict(df, orientation="columns"):
    """
    Convert a DataFrame to a dictionary in various orientations.

    Parameters:
    - df: The DataFrame to convert.
    - orientation: The dictionary orientation.
      Options:
      - "columns": {column -> {index -> value}}
      - "index": {index -> {column -> value}}
      - "records": [{column -> value}, ... , {column -> value}]
      - "list": {column -> [values]}
      - "split": Dict with keys: index, columns, and data

    Returns:
    - Dictionary representation of the DataFrame.
    """
    # Orientations accepted by this function
    valid_orientations = {"columns", "index", "records", "list", "split"}
    # Check if the given orientation is valid
    if orientation not in valid_orientations:
        raise ValueError(f"Invalid orientation. Choose from {', '.join(sorted(valid_orientations))}")
    # pandas calls the column-oriented layout "dict" in DataFrame.to_dict()
    orient = "dict" if orientation == "columns" else orientation
    # Convert DataFrame to dictionary
    return df.to_dict(orient=orient)
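To make the orientation options concrete, the following sketch calls pandas' `DataFrame.to_dict` directly on a small, made-up DataFrame. The data and variable names are illustrative assumptions, and the layout this page calls "columns" is requested from pandas as `orient="dict"`.

```python
import pandas as pd

# Illustrative data only; not part of intellikit.
df = pd.DataFrame(
    {"age": [30, 41], "city": ["Oslo", "Lima"]},
    index=["a", "b"],
)

# {column -> {index -> value}}; pandas names this orientation "dict"
as_columns = df.to_dict(orient="dict")
# {index -> {column -> value}}
as_index = df.to_dict(orient="index")
# [{column -> value}, ...]
as_records = df.to_dict(orient="records")
# {column -> [values]}
as_list = df.to_dict(orient="list")
# {"index": [...], "columns": [...], "data": [...]}
as_split = df.to_dict(orient="split")

print(as_records)
```

The "records" layout is usually the most convenient when each row should be handled as a separate case, while "list" keeps each feature's values together.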
linearRetriever(df, query, similarity_functions, feature_weights, top_n=1)
The linear retriever performs a k-NN search by sequentially computing the similarity between the query and every case in the casebase.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | pandas.DataFrame | DataFrame containing features for case characterization | required |
similarity_functions | dict | A dictionary containing similarity functions for each feature in the casebase. | required |
query | pandas.DataFrame | DataFrame containing the query case | required |
feature_weights | dict | A dictionary of weights assigned to each feature | required |
top_n | int | Number of top similar cases to return. | 1 |
Returns:
Type | Description |
---|---|
pandas.DataFrame | The top_n most similar cases from the casebase. |
Source code in intellikit/common.py
def linearRetriever(df, query, similarity_functions, feature_weights, top_n=1):
    """
    The linear retriever performs a k-NN search by sequentially computing the similarity between the query and every case in the casebase.

    Args:
        df (pandas.DataFrame): DataFrame containing features for case characterization
        similarity_functions (dict): A dictionary containing similarity functions for each feature in the casebase.
        query (pandas.DataFrame): DataFrame containing the query case
        feature_weights (dict): A dictionary of weights assigned to each feature
        top_n (int): Number of top similar cases to return.

    Returns:
        pandas.DataFrame: The top_n most similar cases from the casebase.
    """
    # Create a DataFrame to store similarities
    similarities = pd.DataFrame(index=df.index)
    # Iterate over the columns in the DataFrame
    for feature in df.columns:
        # Check if the feature is in the similarity_functions dictionary
        if feature in similarity_functions:
            # Retrieve the similarity function for the feature
            similarity_function = similarity_functions[feature]
            # Look up the feature weight as a float
            feature_weight = float(feature_weights.get(feature, 1.0))  # Default to 1.0 if not provided
            # Apply the similarity function and scale the result by the feature weight
            similarities[feature] = similarity_function(df[[feature]].copy(), query[[feature]], feature) * feature_weight
        else:
            # If the feature is not found in the similarity_functions dictionary, set the similarity to 0
            similarities[feature] = 0.0
    # Calculate total similarity as the sum of weighted similarities
    similarities['total_similarity'] = similarities.sum(axis=1)
    # Select top N cases with the highest total similarity
    top_n_indices = similarities['total_similarity'].nlargest(top_n).index
    top_n_cases = df.loc[top_n_indices]
    return top_n_cases
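The source shows that each similarity function is called as `similarity_function(cases, query, feature)` and that its result is multiplied by the feature weight, but it does not show what such a function looks like. The sketch below is one possible reading: `numeric_similarity` and `exact_match` are hypothetical helpers, the casebase is invented, and the loop repeats the retriever's steps inline so the example runs without intellikit installed.

```python
import pandas as pd


def numeric_similarity(cases, query, feature):
    """Hypothetical helper: 1 minus the range-normalised absolute difference."""
    column = cases[feature].astype(float)
    target = float(query[feature].iloc[0])
    value_range = column.max() - column.min()
    if value_range == 0:
        # All cases share one value, so treat them as equally similar.
        return pd.Series(1.0, index=cases.index)
    return 1.0 - (column - target).abs() / value_range


def exact_match(cases, query, feature):
    """Hypothetical helper: 1.0 where the value equals the query, else 0.0."""
    return (cases[feature] == query[feature].iloc[0]).astype(float)


# Invented casebase and query, for illustration only.
casebase = pd.DataFrame(
    {"price": [100, 150, 300], "colour": ["red", "blue", "red"]},
    index=["c1", "c2", "c3"],
)
query = pd.DataFrame({"price": [140], "colour": ["red"]})
similarity_functions = {"price": numeric_similarity, "colour": exact_match}
feature_weights = {"price": 0.7, "colour": 0.3}

# Same steps as the retriever: weight each feature's similarity, sum, rank.
similarities = pd.DataFrame(index=casebase.index)
for feature, function in similarity_functions.items():
    weight = float(feature_weights.get(feature, 1.0))
    similarities[feature] = function(casebase[[feature]], query[[feature]], feature) * weight
similarities["total_similarity"] = similarities.sum(axis=1)
best = casebase.loc[similarities["total_similarity"].nlargest(1).index]

print(best)
```

With these weights, case `c1` ranks first because it matches the query colour and is close in price, even though `c2` is slightly closer in price alone.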
macfacRetriever(df, query, mac_features, fac_features, similarity_functions, feature_weights, top_n_mac=2, top_n_fac=1)
The MAC/FAC retriever performs a two-stage retrieval: the first phase (MAC) uses a lightweight similarity measure to filter out irrelevant cases, and the second phase (FAC) evaluates the final similarity on the remaining cases.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | pandas.DataFrame | DataFrame containing features for case characterization | required |
similarity_functions | dict | A dictionary containing similarity functions for each feature in the casebase. | required |
query | pandas.DataFrame | DataFrame containing the query case | required |
feature_weights | dict | A dictionary of weights assigned to each feature | required |
mac_features | list | A list of features to use for the MAC phase (mac_features = ['feature4', 'feature5']) | required |
fac_features | list | A list of features to use for the FAC phase | required |
top_n_mac | int | Number of top similar cases to return during the MAC phase. | 2 |
top_n_fac | int | Number of top similar cases to return during the FAC phase | 1 |
Returns:
Type | Description |
---|---|
pandas.DataFrame | The top_n_fac most similar cases from the casebase, as ranked in the FAC phase. |
Source code in intellikit/common.py
def macfacRetriever(df, query, mac_features, fac_features, similarity_functions, feature_weights, top_n_mac=2, top_n_fac=1):
    """
    The MAC/FAC retriever performs a two-stage retrieval: the first phase (MAC) uses a lightweight similarity measure to filter out irrelevant cases, and the second phase (FAC) evaluates the final similarity on the remaining cases.

    Args:
        df (pandas.DataFrame): DataFrame containing features for case characterization
        similarity_functions (dict): A dictionary containing similarity functions for each feature in the casebase.
        query (pandas.DataFrame): DataFrame containing the query case
        feature_weights (dict): A dictionary of weights assigned to each feature
        mac_features (list): A list of features to use for the MAC phase (mac_features = ['feature4', 'feature5'])
        fac_features (list): A list of features to use for the FAC phase
        top_n_mac (int): Number of top similar cases to return during the MAC phase.
        top_n_fac (int): Number of top similar cases to return during the FAC phase

    Returns:
        pandas.DataFrame: The top_n_fac most similar cases from the casebase, as ranked in the FAC phase.
    """
    # MAC phase: shortlist the top_n_mac cases using only the MAC features
    filtered_df = mac_stage(df, query, mac_features, similarity_functions, top_n_mac)
    # FAC phase: rank the shortlist with the weighted FAC features
    top_similar_cases = fac_stage(filtered_df, query, fac_features, similarity_functions, feature_weights, top_n_fac)
    return top_similar_cases
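The bodies of `mac_stage` and `fac_stage` are not shown on this page, so the sketch below only illustrates the general two-stage idea under stated assumptions: `absolute_closeness` and `rank` are hypothetical helpers, the data is invented, and the real stages may weight or filter differently.

```python
import pandas as pd


def absolute_closeness(cases, query, feature):
    """Hypothetical similarity: 1 / (1 + |case value - query value|)."""
    return 1.0 / (1.0 + (cases[feature] - query[feature].iloc[0]).abs())


def rank(cases, query, features, weights, top_n):
    """Hypothetical helper: top_n cases by weighted similarity over `features`."""
    total = pd.Series(0.0, index=cases.index)
    for feature in features:
        total += weights.get(feature, 1.0) * absolute_closeness(cases, query, feature)
    return cases.loc[total.nlargest(top_n).index]


# Invented casebase and query, for illustration only.
casebase = pd.DataFrame(
    {
        "year": [2010, 2018, 2019, 2020],
        "mileage": [90, 40, 75, 42],
        "power": [100, 150, 110, 148],
    },
    index=["c1", "c2", "c3", "c4"],
)
query = pd.DataFrame({"year": [2019], "mileage": [41], "power": [150]})

# MAC phase: a cheap, single-feature pass that discards clearly irrelevant cases.
candidates = rank(casebase, query, ["year"], {}, top_n=3)

# FAC phase: the full weighted similarity, evaluated only on the shortlist.
best = rank(candidates, query, ["mileage", "power"], {"mileage": 0.5, "power": 0.5}, top_n=1)

print(best)
```

The design trade-off is that `top_n_mac` must be large enough for the truly best case to survive the cheap filter; a case dropped in the MAC phase can never be recovered in the FAC phase.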
parallelRetriever(df, query, similarity_functions, feature_weights, top_n=1)
The parallel linear retriever performs a k-NN search by computing the similarities between the query and each case using all available CPU cores.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | pandas.DataFrame | DataFrame containing features for case characterization | required |
similarity_functions | dict | A dictionary containing similarity functions for each feature in the casebase. | required |
query | pandas.DataFrame | DataFrame containing the query case | required |
feature_weights | dict | A dictionary of weights assigned to each feature | required |
top_n | int | Number of top similar cases to return. | 1 |
Returns:
Type | Description |
---|---|
pandas.DataFrame | The top_n most similar cases from the casebase. |
Source code in intellikit/common.py
def parallelRetriever(df, query, similarity_functions, feature_weights, top_n=1):
    """
    The parallel linear retriever performs a k-NN search by computing the similarities between the query and each case using all available CPU cores.

    Args:
        df (pandas.DataFrame): DataFrame containing features for case characterization
        similarity_functions (dict): A dictionary containing similarity functions for each feature in the casebase.
        query (pandas.DataFrame): DataFrame containing the query case
        feature_weights (dict): A dictionary of weights assigned to each feature
        top_n (int): Number of top similar cases to return.

    Returns:
        pandas.DataFrame: The top_n most similar cases from the casebase.
    """
    with Pool(cpu_count()) as pool:
        # Use starmap to pass additional arguments to calculate_similarity (one task per feature)
        similarity_results = pool.starmap(calculate_similarity, [(feature, df, query, similarity_functions, feature_weights) for feature in df.columns])
    # Combine the per-feature results into one DataFrame and sum them per case
    similarities = pd.concat(similarity_results, axis=1)
    similarities['total_similarity'] = similarities.sum(axis=1)
    # Select top N cases with the highest total similarity
    top_n_indices = similarities['total_similarity'].nlargest(top_n).index
    top_n_cases = df.loc[top_n_indices]
    return top_n_cases
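The fan-out pattern above (one `starmap` task per feature, then `pd.concat` and a row-wise sum) can be tried in isolation. This sketch deliberately swaps the process pool for `multiprocessing.pool.ThreadPool`, which exposes the same `starmap` method but avoids the pickling and `__main__`-guard requirements of processes. `feature_similarity` is a hypothetical stand-in for `calculate_similarity`, whose body is not shown here, and the data is invented.

```python
from multiprocessing.pool import ThreadPool

import pandas as pd


def feature_similarity(feature, cases, query, weights):
    """Hypothetical worker: weighted closeness for one feature, as a named Series."""
    distance = (cases[feature] - query[feature].iloc[0]).abs()
    return (weights.get(feature, 1.0) / (1.0 + distance)).rename(feature)


# Invented casebase and query, for illustration only.
casebase = pd.DataFrame(
    {"height": [170, 182, 165], "weight": [70, 80, 55]},
    index=["a", "b", "c"],
)
query = pd.DataFrame({"height": [180], "weight": [78]})
weights = {"height": 0.5, "weight": 0.5}

# One task per feature; starmap unpacks each tuple into the worker's arguments.
tasks = [(feature, casebase, query, weights) for feature in casebase.columns]
with ThreadPool(2) as pool:  # thread pool used here instead of a process pool
    per_feature = pool.starmap(feature_similarity, tasks)

# Combine the per-feature Series column-wise, then rank by the row sums.
similarities = pd.concat(per_feature, axis=1)
similarities["total_similarity"] = similarities.sum(axis=1)
best = casebase.loc[similarities["total_similarity"].nlargest(1).index]

print(best)
```

Because the work is split per feature rather than per case, the achievable speed-up is bounded by the number of features, and process start-up plus data transfer can outweigh the gain on small casebases.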
retrieve_topk(cases, similarity_data, sim_column, k)
Ranks the cases in a DataFrame by total similarity and returns the top k results.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cases | pandas.DataFrame | DataFrame containing the case features | required |
similarity_data | pandas.DataFrame | DataFrame containing similarity scores for each feature and a 'total_similarity' column. | required |
sim_column | str | Name of the total similarity or weighted similarity column | required |
k | int | Number of top cases to return. | required |
Returns:
Type | Description |
---|---|
pandas.DataFrame | The top k cases from the casebase, together with their similarity column. |
Source code in intellikit/common.py
def retrieve_topk(cases, similarity_data, sim_column, k):
    """
    Ranks the cases in a DataFrame by total similarity and returns the top k results.

    Args:
        cases (pandas.DataFrame): DataFrame containing the case features
        similarity_data (pandas.DataFrame): DataFrame containing similarity scores for each feature and a 'total_similarity' column.
        sim_column (str): Name of the total similarity or weighted similarity column
        k (int): Number of top cases to return.

    Returns:
        pandas.DataFrame: The top k cases from the casebase, together with their similarity column.
    """
    # Combine the cases and their similarity scores column-wise
    merged = pd.concat([cases, similarity_data], axis=1)
    # Sort DataFrame by the similarity column in descending order
    sorting_df = merged.sort_values(by=sim_column, ascending=False)
    # Get the column to keep from the second DataFrame (assuming there's only one)
    column_to_keep = similarity_data.filter(like=sim_column).columns[0]  # First column whose name contains sim_column
    # Keep only the case columns plus the similarity column
    desired_columns = [col for col in merged.columns if col in cases.columns or col == column_to_keep]
    result_df = sorting_df[desired_columns]
    # Select top k cases
    top_k_cases = result_df.head(k)
    return top_k_cases
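The ranking step reduces to three pandas operations: concatenate, sort, and take the head. The sketch below reproduces them on invented data; the column names `price_sim` and `total_similarity` are assumptions chosen for illustration.

```python
import pandas as pd

# Invented cases and similarity scores, sharing the same default index.
cases = pd.DataFrame({"make": ["A", "B", "C"], "price": [10, 20, 30]})
similarity_data = pd.DataFrame(
    {"price_sim": [0.1, 0.8, 0.4], "total_similarity": [0.2, 0.9, 0.5]}
)

# Attach only the ranking column to the cases, aligned on the index.
merged = pd.concat([cases, similarity_data[["total_similarity"]]], axis=1)

# Highest similarity first, then keep the two best cases.
top2 = merged.sort_values(by="total_similarity", ascending=False).head(2)

print(top2)
```

Note that `pd.concat(..., axis=1)` aligns rows by index label, so the two DataFrames need matching indexes for the scores to land on the right cases.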
stream_text(text, delay=0.02)
Prints text one character at a time with a slight delay, similar to ChatGPT's streaming output.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The text to print with a slight delay, for example a paragraph or a document | required |
delay | float | Delay in seconds after each character. Defaults to 0.02. | 0.02 |
Source code in intellikit/common.py
def stream_text(text, delay=0.02):
    """
    Prints text one character at a time with a slight delay, similar to ChatGPT's streaming output.

    Args:
        text (str): The text to print with a slight delay, for example a paragraph or a document
        delay (float, optional): Delay in seconds after each character. Defaults to 0.02.
    """
    for char in text:
        # Flush after every character so it appears immediately
        print(char, end="", flush=True)
        time.sleep(delay)
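A variant that writes to an arbitrary stream can be handy when the streamed output needs to be captured, for example in a test. The `stream_to` function and its `out` parameter below are hypothetical and not part of intellikit; the loop itself mirrors the function above.

```python
import io
import sys
import time


def stream_to(text, delay=0.02, out=sys.stdout):
    """Hypothetical variant of stream_text that writes to any text stream."""
    for char in text:
        out.write(char)
        out.flush()  # make each character visible immediately
        time.sleep(delay)


# Capture the streamed text in memory; delay=0 keeps the example fast.
buffer = io.StringIO()
stream_to("Retrieved 3 cases.", delay=0, out=buffer)
captured = buffer.getvalue()
```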