Evaluation Command
Basic Evaluation
Force Re-evaluation
Results
Results are saved in experiments/BrowseComp/<Agent>/<Timestamp>/tasks_eval_result/.
BrowseComp uses a Grader for evaluation. Each result contains the following fields:
predicted_label: 1 = success, 0 = failure
grader_response: Evaluation details
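
To turn the per-task labels into an overall success rate, a minimal sketch such as the following may help. It assumes each result is stored as a separate JSON file holding one object with a predicted_label field. The on-disk format is not stated here, so treat the file pattern and the helper name summarize_results as hypothetical and adjust them to what the directory actually contains.

```python
import json
from pathlib import Path


def summarize_results(results_dir):
    """Count successes among result files in results_dir.

    Assumption (not confirmed by the source): one JSON object per
    *.json file, each carrying a `predicted_label` field.
    """
    total = 0
    successes = 0
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            record = json.load(f)
        if not isinstance(record, dict) or "predicted_label" not in record:
            continue  # skip files that do not look like a result record
        total += 1
        # predicted_label: 1 = success, 0 = failure
        if int(record["predicted_label"]) == 1:
            successes += 1
    accuracy = successes / total if total else 0.0
    return {"total": total, "successes": successes, "accuracy": accuracy}
```

Call summarize_results with the tasks_eval_result directory of a given run, substituting the real agent and timestamp for the placeholders in the path above.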