This project addresses Task 3 of the Diagnostic Questions competition, one of the NeurIPS 2020 competitions.
- term_project_task3.ipynb : Mid-term project
- term_project_final.ipynb : Final project
The correct-answer rate per question and its complement (the wrong-answer rate)
import pandas as pd
result_df = pd.DataFrame()
result_df['Correct_Rate'] = data.groupby('QuestionId')['IsCorrect'].mean()
result_df['Correct_Rate_Inverse'] = 1 - result_df['Correct_Rate']
A score based on the skills/concepts associated with each problem
- The fewer skills/concepts associated with a problem, the higher the score
- (data/metadata/student_metadata_task_3_4.csv)
import ast
question_data = question_metadata.sort_values('QuestionId')
question_data['SubjectList'] = question_data['SubjectId'].apply(ast.literal_eval)
question_data['SubjectLen'] = question_data['SubjectList'].apply(len)
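As a minimal illustration of the parsing step above, assuming `SubjectId` is stored as a list written out as a string (the IDs and the `q0`/`q1` keys below are made up), `ast.literal_eval` turns each string into a Python list whose length is the number of associated subjects:

```python
import ast

# Made-up SubjectId strings, in the list-as-string form assumed above.
subject_ids = {"q0": "[3, 32, 42]", "q1": "[3, 71]"}

# Parse each string into a real list and count its entries.
subject_len = {q: len(ast.literal_eval(s)) for q, s in subject_ids.items()}
print(subject_len)  # {'q0': 3, 'q1': 2}
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for strings read from a CSV file.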
The entropy of the normalized value counts of the AnswerValue column, grouped by QuestionId
from scipy.stats import multinomial
result_df['AnswerValue_Entropy'] = data.groupby('QuestionId')['AnswerValue'].agg(lambda x: multinomial.entropy(1, x.value_counts(normalize=True)).mean())
The entropy of the normalized value counts of the IsCorrect column, grouped by QuestionId
result_df['IsCorrect_Entropy'] = data.groupby('QuestionId')['IsCorrect'].agg(lambda x: multinomial.entropy(1, x.value_counts(normalize=True)).mean())
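With a single trial (`n=1`), the multinomial entropy used above reduces to the Shannon entropy of the answer distribution, -Σ p ln p, measured in nats. The sketch below computes the same quantity with the standard library only; the `shannon_entropy` helper and the answer lists are made up for illustration and are not part of the project code:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in nats) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Made-up answers for two questions.
spread = shannon_entropy(["A", "A", "B", "C"])     # answers spread over options
unanimous = shannon_entropy(["A", "A", "A", "A"])  # everyone picked the same option
print(spread > unanimous)  # True
```

A question whose answers are spread across several options gets a higher entropy than one where nearly every student picks the same option.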
Number of characters in the problem
- Use Python's EasyOCR module to recognize characters in an image
- (data/images, image2text.ipynb)
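import easyocr
reader = easyocr.Reader(['en'])
image2text = []
question_id = []
for idx in range(0, 948):
    # Each result entry is (bounding box, text, confidence)
    result = reader.readtext('git_dir/data/images/'+str(idx)+'.jpg')
    question_text = ''
    for detection in result:
        question_text += detection[1]
    # Store the recognized text together with its question ID
    image2text.append(question_text)
    question_id.append(idx)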
Evaluate all possible combinations to find the best parameter combination of the above features.
import itertools
# Sum the selected columns for every non-empty combination of columns
combinations = []
column_combinations = []
num_columns = len(result_df.columns)
for r in range(1, num_columns+1):
    for cols in itertools.combinations(result_df.columns, r):
        combination_sum = result_df[list(cols)].sum(axis=1)
        combinations.append(combination_sum)  # Save the row-wise sum for this combination
        column_combinations.append(list(cols))  # Save column name combinations
subset_df = pd.concat(combinations, axis=1)
subset_df.columns = ['Combination_{}'.format(i) for i in range(len(combinations))]
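The enumeration above grows quickly: n features yield 2^n - 1 non-empty combinations, so 6 features give 63 candidates. The following standard-library sketch shows the same idea on a toy table of three features; the values and the `all_combination_sums` helper are made up for illustration:

```python
import itertools

# Toy feature table: one list entry per question, values are made up.
features = {
    "Correct_Rate_Inverse": [0.2, 0.7],
    "AnswerValue_Entropy": [0.5, 1.1],
    "IsCorrect_Entropy": [0.4, 0.6],
}

def all_combination_sums(features):
    """Return {tuple_of_names: per-question sums} for every non-empty subset."""
    names = list(features)
    sums = {}
    for r in range(1, len(names) + 1):
        for cols in itertools.combinations(names, r):
            columns = [features[c] for c in cols]
            sums[cols] = [sum(vals) for vals in zip(*columns)]
    return sums

sums = all_combination_sums(features)
print(len(sums))  # 2**3 - 1 = 7
```

Keying the results by the tuple of column names keeps each score traceable to the features that produced it, which plays the role of `column_combinations` in the code above.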
Scale the features to improve performance
Reduce 6 features to 2 dimensions.
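The notes here do not say which scaling or reduction method was used, so the following is only a sketch under assumptions: it standardizes each feature to zero mean and unit variance, then projects the 6 features onto 2 principal components via an SVD-based PCA in NumPy. The random input matrix and the `standardize` and `pca_2d` helpers are hypothetical stand-ins for the real feature table and pipeline:

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (X - mean) / std

def pca_2d(X):
    """Project the rows of X onto its first two principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# Random stand-in for a (questions x 6 features) table.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))

X_scaled = standardize(X)
Z = pca_2d(X_scaled)
print(Z.shape)  # (20, 2)
```

Scaling first matters for PCA because otherwise a feature with a large numeric range, such as a character count, would dominate the principal components.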
Inferring interesting information from the results.