
Final dh #5

Open
wants to merge 65 commits into base: master from

Changes from 1 commit
Commits · 65 commits
f610ed7
Create README.md
sangki930 May 25, 2021
97f4701
add feature
tofulim May 27, 2021
d2521e1
models commit
sangki930 May 28, 2021
b0e15a1
model recommit
sangki930 May 28, 2021
4913de4
[dh] new feature commit
tofulim May 28, 2021
1357646
[dh] k-fold commit
tofulim Jun 2, 2021
4ab9a96
[dh] make time2global feature
tofulim Jun 3, 2021
312c880
commit test
sangki930 Jun 3, 2021
1160816
[sangki930] branch commit
sangki930 Jun 3, 2021
af9892a
[dh] commit to merge
tofulim Jun 4, 2021
e527d8c
Update README.md
sangki930 Jun 4, 2021
23bf831
Merge pull request #2 from bcaitech1/new_branch_01
sangki930 Jun 4, 2021
5a4307a
pplz
tofulim Jun 4, 2021
4831d39
[dh] test
tofulim Jun 4, 2021
0d933bd
[dh] oh yeah
tofulim Jun 4, 2021
7a99180
add master
PrimeOfMine Jun 4, 2021
0f34d9b
dh explane
tofulim Jun 6, 2021
b110420
Merge branch 'comb_main' into sangki930
tofulim Jun 6, 2021
cf7a7e5
Merge pull request #3 from bcaitech1/sangki930
tofulim Jun 6, 2021
85449dd
test_sangki
sangki930 Jun 6, 2021
8cbe13d
sangki commit
sangki930 Jun 6, 2021
401ef43
cm
tofulim Jun 6, 2021
c55eb22
Merge branch 'comb_main' of https://github.com/bcaitech1/p4-dkt-olleh…
tofulim Jun 6, 2021
4f73e48
[dh] cont fix..
tofulim Jun 7, 2021
c813636
cm
tofulim Jun 7, 2021
02ab294
[dh] continuous fix
tofulim Jun 7, 2021
eff3dea
cm
tofulim Jun 7, 2021
088158b
[dh] submit fix, lstmattn fix
tofulim Jun 7, 2021
db91d72
[dh] change setting and split model.py to each architecture
tofulim Jun 9, 2021
03b01a9
[dh] change setting
tofulim Jun 10, 2021
b091795
[dh] make new branch to merge feat
tofulim Jun 10, 2021
324463c
cm
tofulim Jun 10, 2021
975b8f6
[dh] add presicion,recall,f1 metric
tofulim Jun 11, 2021
e68165a
[dh] cont/cate mid check
tofulim Jun 11, 2021
0f8e947
[dh] mid check
tofulim Jun 11, 2021
19428b2
[dh] push for compare
tofulim Jun 12, 2021
40fb49c
[dh] apply on model
tofulim Jun 12, 2021
047c851
fixed untracked files
tofulim Jun 13, 2021
aaab502
[dh] model final fix
tofulim Jun 13, 2021
105a1d3
[dh] final model fix
tofulim Jun 13, 2021
edee0f0
[dh] lgbm change
tofulim Jun 14, 2021
c8a7246
[dh] lgbm change
tofulim Jun 14, 2021
f3d3de8
[dh] cm
tofulim Jun 14, 2021
e701e79
[dh] cm
tofulim Jun 14, 2021
9a1fe4b
edit for k-fold
PrimeOfMine Jun 14, 2021
731b63e
add comments
PrimeOfMine Jun 14, 2021
8c0194c
debugging
PrimeOfMine Jun 14, 2021
c5f44df
[dh] fix & pull
tofulim Jun 14, 2021
0be294a
Merge branch 'final_dh' of https://github.com/bcaitech1/p4-dkt-ollehd…
tofulim Jun 14, 2021
f610a72
[dh] use test file
tofulim Jun 15, 2021
8f39e38
[dh] final push
tofulim Jun 15, 2021
0488c36
[dh] push
tofulim Jun 15, 2021
d6c3d6e
[dh] ffffinal commit
tofulim Jun 15, 2021
552e5ce
Update README.md
tofulim Jun 20, 2021
5eccb79
Update README.md
tofulim Jun 20, 2021
cba42b1
Update README.md
tofulim Jul 20, 2021
12a85d2
Update README.md
tofulim Jul 20, 2021
0a26104
Update README.md
tofulim Jul 24, 2021
0b1e351
Update README.md
tofulim Jul 24, 2021
6bd2e44
Update README.md
tofulim Jul 24, 2021
1512d6d
Update README.md
tofulim Jul 25, 2021
c98b8ae
Update README.md
tofulim Jul 25, 2021
ca4a390
Create README.md
tofulim Jul 25, 2021
e9c5699
Update README.md
tofulim Jul 26, 2021
7b94b5f
Update README.md
tofulim Jul 27, 2021
[dh] push for compare
tofulim committed Jun 12, 2021

Verified: This commit was created on GitHub.com and signed with GitHub's verified signature. The key has expired.
commit 19428b236b6189e65cee83040f5fa80b9430c08a
12 changes: 6 additions & 6 deletions conf.yml
@@ -14,13 +14,13 @@ wandb :
- baseline

##main params
task_name: test
task_name: lstm_time_test_nokfold
seed: 42
device: cuda

data_dir: /opt/ml/input/data/train_dataset
file_name: train_time_fixed.csv
test_file_name: test_time_fixed.csv
file_name: train_time_finalfix.csv
test_file_name: test_time_finalfix.csv

asset_dir: asset/
model_dir: models/
@@ -30,10 +30,10 @@ max_seq_len: 128
num_workers: 1

##K-fold params
use_kfold : True #run k-fold using n folds
use_stratify : True
use_kfold : False #run k-fold using n folds
use_stratify : False
n_fold : 5
split_by_user : True #split the k-fold dataset by user
split_by_user : False #split the k-fold dataset by user

##model
hidden_dim : 256
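
For readers following the config change, the k-fold block above is what the trainer reads to decide how to split the data. The sketch below is a minimal illustration, not the repository's trainer: it assumes a pandas DataFrame with userID and answerCode columns, and the iter_folds helper plus the 90/10 user hold-out fallback are invented for the example.

```python
# Illustrative sketch only: how conf.yml's k-fold flags might select a splitter.
# Assumes a DataFrame with 'userID' and 'answerCode'; not the repo's actual code.
import pandas as pd
from sklearn.model_selection import KFold, StratifiedKFold, GroupKFold

def iter_folds(df: pd.DataFrame, use_kfold=False, use_stratify=False,
               split_by_user=False, n_fold=5, seed=42):
    if not use_kfold:
        # single split: hold out the last 10% of users as validation (assumption)
        users = df['userID'].unique()
        cut = int(len(users) * 0.9)
        train_users, valid_users = users[:cut], users[cut:]
        yield df[df['userID'].isin(train_users)], df[df['userID'].isin(valid_users)]
        return

    if split_by_user:
        # keep every interaction of a user inside one fold
        splitter = GroupKFold(n_splits=n_fold)
        splits = splitter.split(df, groups=df['userID'])
    elif use_stratify:
        splitter = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=seed)
        splits = splitter.split(df, y=df['answerCode'])
    else:
        splitter = KFold(n_splits=n_fold, shuffle=True, random_state=seed)
        splits = splitter.split(df)

    for train_idx, valid_idx in splits:
        yield df.iloc[train_idx], df.iloc[valid_idx]
```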
31 changes: 13 additions & 18 deletions dkt/dataloader.py
@@ -87,11 +87,14 @@ def __preprocessing(self, df, is_train = True):
le.classes_ = np.load(label_path)

df[col] = df[col].apply(lambda x: x if x in le.classes_ else 'unknown')

#assume all columns are categorical
df[col]= df[col].astype(str)
test = le.transform(df[col])
df[col] = test

#store each cate feat's name / number of unique values in conf as a dict
self.args.cate_feat_dict=dict(zip(cate_cols,[len(df[col].unique()) for col in cate_cols]))

print("number of users after preprocessing",len(df['userID'].unique()))
return df

def __feature_engineering(self, df):
@@ -108,7 +111,7 @@ def __feature_engineering(self, df):
print(df.columns)

print('check dataframe')
print(df.loc[:3])
print(df)

# drop_cols = ['_',"index","point","answer_min_count","answer_max_count","user_count",'sec_time'] # columns to drop
# for col in drop_cols:
@@ -127,10 +130,8 @@ def load_data_from_file(self, file_name, is_train=True):
csv_file_path = os.path.join(self.args.data_dir, file_name)
print(f'csv_file_path : {csv_file_path}')
df = pd.read_csv(csv_file_path)

print("number of users before load data",len(df['userID'].unique()))
if self.args.model=='lgbm':
#sort as below to respect each user's sequence
# df['distance']=np.load('/opt/ml/np_train_tag_distance_arr.npy') if is_train else np.load('/opt/ml/np_test_tag_distance_arr.npy')

df.sort_values(by=['userID','Timestamp'], inplace=True)
return df
@@ -153,21 +154,22 @@ def load_data_from_file(self, file_name, is_train=True):
ret.pop(len(self.args.cate_feats)-1)
#append it to the very end
ret.append('answerCode')
print("ret",ret)
print("moved answerCode to the end of the order",ret)
group = df[columns].groupby('userID').apply(
lambda r: tuple([r[i].values for i in ret])
)

len(f'group.values->{len(group.values)}')
print(group)
# print(f"num users {len(group)} num feats {len(group[0])} num solved problems {len(group[0][0])}")
# len(f'group.values->{len(group.values)}')
print("number of users after load data",len(df['userID'].unique()))
return group.values


def load_train_data(self, file_name):
# self.train_data = self.load_data_from_file(file_name)
self.train_data = self.load_data_from_file(file_name)

def load_test_data(self, file_name):
# self.test_data = self.load_data_from_file(file_name, is_train= False)
self.test_data = self.load_data_from_file(file_name,is_train=False)

class MyDKTDataset(torch.utils.data.Dataset):
@@ -199,9 +201,6 @@ def __getitem__(self,index):

# np.array -> torch.tensor type conversion
for i, col in enumerate(columns):
print(i,"th one is the problem")
print(len(col))
print(col.dtype)
columns[i] = torch.tensor(col)

return columns
@@ -239,7 +238,7 @@ def __getitem__(self,index):

# np.array -> torch.tensor type conversion
for i, col in enumerate(cate_cols):
cate_cols[i] = torch.tensor(col)
cate_cols[i] = torch.tensor(col.astype(int))

return cate_cols

@@ -327,11 +326,7 @@ def get_loaders(args, train, valid):
train_loader = torch.utils.data.DataLoader(trainset, num_workers=args.num_workers, shuffle=True,
batch_size=args.batch_size, pin_memory=pin_memory, collate_fn=collate)
if valid is not None:
# valset = DKTDataset(valid, args)
# valset = DevDKTDataset(valid,args)
# valset = TestDKTDataset(valid,args)
valset = MyDKTDataset(valid,args)
# print('inference gogo')
valid_loader = torch.utils.data.DataLoader(valset, num_workers=args.num_workers, shuffle=False,
batch_size=args.batch_size, pin_memory=pin_memory, collate_fn=collate)
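
As context for the dataloader changes above (the 'unknown' fallback, the per-user grouping with answerCode moved to the end), here is a compact sketch of that pattern. It is an assumption-laden illustration, not the repository's Preprocess class: the preprocess and group_by_user helpers and their signatures are invented, and only userID, Timestamp and answerCode are real column names.

```python
# Sketch of the preprocessing/grouping pattern (not the repo's exact code).
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def preprocess(df: pd.DataFrame, cate_cols, encoders=None, is_train=True):
    encoders = encoders or {}
    for col in cate_cols:
        df[col] = df[col].astype(str)
        if is_train:
            le = LabelEncoder()
            le.fit(df[col].tolist() + ['unknown'])   # reserve an 'unknown' class
            encoders[col] = le
        else:
            le = encoders[col]
            # map unseen categories to 'unknown' so transform cannot fail
            df[col] = df[col].where(df[col].isin(le.classes_), 'unknown')
        df[col] = le.transform(df[col])
    return df, encoders

def group_by_user(df: pd.DataFrame, feat_cols):
    # per-user sequences, with answerCode kept as the last feature
    ordered = [c for c in feat_cols if c != 'answerCode'] + ['answerCode']
    df = df.sort_values(['userID', 'Timestamp'])
    grouped = df.groupby('userID').apply(
        lambda r: tuple(r[c].values for c in ordered)
    )
    return grouped.values
```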

4 changes: 2 additions & 2 deletions dkt/metric.py
@@ -5,7 +5,7 @@ def get_metric(targets, preds):
auc = roc_auc_score(targets, preds)
acc = accuracy_score(targets, np.where(preds >= 0.5, 1, 0))
precision=precision_score(targets, np.where(preds >= 0.5, 1, 0))
recall=recall_score(label, np.where(preds >= 0.5, 1, 0))
f1=f1_score(label, np.where(preds >= 0.5, 1, 0))
recall=recall_score(targets, np.where(preds >= 0.5, 1, 0))
f1=f1_score(targets, np.where(preds >= 0.5, 1, 0))

return auc, acc ,precision,recall,f1
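
The metric.py hunk replaces the undefined label variable with targets in recall_score and f1_score. Below is a self-contained toy check of the corrected function; the sample arrays are made up, not project data.

```python
# Toy check of the corrected metric function (threshold predictions at 0.5).
import numpy as np
from sklearn.metrics import (roc_auc_score, accuracy_score,
                             precision_score, recall_score, f1_score)

def get_metric(targets, preds):
    hard = np.where(preds >= 0.5, 1, 0)
    auc = roc_auc_score(targets, preds)       # AUC uses the raw probabilities
    acc = accuracy_score(targets, hard)       # the rest use thresholded labels
    precision = precision_score(targets, hard)
    recall = recall_score(targets, hard)
    f1 = f1_score(targets, hard)
    return auc, acc, precision, recall, f1

targets = np.array([1, 0, 1, 1, 0])
preds = np.array([0.9, 0.4, 0.6, 0.3, 0.2])
print(get_metric(targets, preds))
```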
53 changes: 30 additions & 23 deletions dkt/models_architecture/lstm.py
@@ -28,16 +28,18 @@ def __init__(self, args):
#subtract one because of answerCode
cont_len=len(args.cont_feats)-1
# cate Embedding
self.cate_embedding_list = nn.ModuleList([nn.Linear(max_val+1, (self.hidden_dim//2)//cate_len) for max_val in list(args.cate_feat_dict.values())[1:]])
self.cate_embedding_list = nn.ModuleList([nn.Embedding(max_val+1, (self.hidden_dim//2)//cate_len) for max_val in list(args.cate_feat_dict.values())[1:]])
# cont Embedding
self.cont_embedding = nn.Linear(1, (self.hidden_dim//2)//cont_len)

# comb linear
self.cate_comb_proj = nn.Linear(((self.hidden_dim//2)//cate_len)*(cate_len+1), self.hidden_dim//2) #interaction을 나중에 더하므로 +1
self.cont_comb_proj = nn.Linear(((self.hidden_dim//2)//cont_len)*cont_len, self.hidden_dim//2)

# interaction currently consists of correct: correct(1, 2) + padding(0)
self.embedding_interaction = nn.Embedding(3, (self.hidden_dim//2)//cate_len)
# self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3)
# self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3)
# self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3)
#shape(batch,msl,feats)
#continuous


self.lstm = nn.LSTM(self.hidden_dim,
self.hidden_dim,
self.n_layers,
@@ -64,44 +66,49 @@ def init_hidden(self, batch_size):
return (h, c)

def forward(self, input):
# cate + cont + interaction + mask + gather_index= input
print('-'*80)
print("starting forward")
# cate + cont + interaction + mask + gather_index + correct= input
# print('-'*80)
# print("starting forward")
#-1 because userID is excluded
cate_feats=input[:len(self.args.cate_feats)-1]
print("number of cate_feats",len(cate_feats))
# print("number of cate_feats",len(cate_feats))

#-1 because answerCode is absent
cont_feats=input[len(self.args.cont_feats)-1:-3]
print("number of cont_feats",len(cont_feats))
interaction=input[-3]
mask=input[-2]
gather_index=input[-1]
cont_feats=input[len(self.args.cate_feats)-1:-4]
# print("number of cont_feats",len(cont_feats))
interaction=input[-4]
mask=input[-3]
gather_index=input[-2]

batch_size = interaction.size(0)

# cate Embedding
cate_feats_embed=[]
embed_interaction = self.embedding_interaction(interaction)
cate_feats_embed.append(embed_interaction)

# print(self.cate_embedding_list)
# print("cate shapes")
for i, cate_feat in enumerate(cate_feats):
cate_feats_embed.append(self.cate_embedding_list[i](cate_feat))


# unsqueeze cont feats shape
cont_feats=list(map(unsqueeze(-1),cont_feats))
# cont Embedding
# unsqueeze cont feats shape & embedding
cont_feats_embed=[]

for i, cont_feat in enumerate(cont_feats):
for cont_feat in cont_feats:
cont_feat=cont_feat.unsqueeze(-1)
cont_feats_embed.append(self.cont_embedding(cont_feat))


#concat cate, cont feats
embed_cate = torch.cat(cate_feats_embed, 2)
embed_cate=self.cate_comb_proj(embed_cate)

embed_cont = torch.cat(cont_feats_embed, 2)

embed_cont=self.cont_comb_proj(embed_cont)


X = torch.cat([embed_cate,embed_cont], 2)
print("shape after concatenating cate and cont : ", X.shape)
# print("shape after concatenating cate and cont : ", X.shape)

hidden = self.init_hidden(batch_size)
out, hidden = self.lstm(X, hidden)
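
The central change in lstm.py is swapping nn.Linear for nn.Embedding in the categorical branch: nn.Embedding consumes integer category indices directly, while nn.Linear would have required one-hot float inputs. Below is a stripped-down sketch of the embed-and-concat layout with made-up cardinalities and feature counts; it is not the repository's LSTM class, only an illustration of the shapes involved.

```python
# Stripped-down sketch of the cate/cont embedding layout (made-up sizes).
import torch
import torch.nn as nn

hidden_dim = 256
cate_cardinalities = [10, 1538, 913]        # example unique counts per cate feature
cate_len, cont_len = len(cate_cardinalities), 4
cate_dim = (hidden_dim // 2) // cate_len
cont_dim = (hidden_dim // 2) // cont_len

# nn.Embedding consumes integer indices; +1 leaves room for padding/unknown
cate_embeddings = nn.ModuleList(
    [nn.Embedding(card + 1, cate_dim) for card in cate_cardinalities]
)
interaction_embedding = nn.Embedding(3, cate_dim)   # padding(0) + correct(1, 2)
cont_embedding = nn.Linear(1, cont_dim)             # shared across cont features

cate_proj = nn.Linear(cate_dim * (cate_len + 1), hidden_dim // 2)  # +1 for interaction
cont_proj = nn.Linear(cont_dim * cont_len, hidden_dim // 2)

batch, seq_len = 2, 8
cate_feats = [torch.randint(0, card, (batch, seq_len)) for card in cate_cardinalities]
interaction = torch.randint(0, 3, (batch, seq_len))
cont_feats = [torch.rand(batch, seq_len) for _ in range(cont_len)]

embed_cate = torch.cat([interaction_embedding(interaction)]
                       + [emb(x) for emb, x in zip(cate_embeddings, cate_feats)], dim=2)
embed_cont = torch.cat([cont_embedding(x.unsqueeze(-1)) for x in cont_feats], dim=2)

X = torch.cat([cate_proj(embed_cate), cont_proj(embed_cont)], dim=2)
print(X.shape)   # torch.Size([2, 8, 256])
```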