#%%
import sqlite3
import pandas as pd
import numpy as np
conn = sqlite3.connect('data/dcard.sqlite')
c = conn.cursor()
#%%
r = c.execute("""
    SELECT token.token, pos.pos, t1.num FROM
        (SELECT token_id, pos_id, COUNT(*) AS num
         FROM oneGram GROUP BY token_id, pos_id) t1
    LEFT JOIN token
        ON t1.token_id=token.token_id
    LEFT JOIN pos
        ON t1.pos_id=pos.pos_id
    -- Sort in the outer query: row order from a subquery is not guaranteed after a join
    ORDER BY t1.num DESC;
""")
#token_type_count = [x for x in r]
# Frequency table of (token, PoS) pairs, ranked from 1
df = pd.DataFrame(r.fetchall(), columns=['token', 'pos', 'count'])
df.index = np.arange(1, len(df)+1)
#%%
import json
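# Read explicitly as UTF-8 so the result does not depend on the platform default encoding
with open('data/dcard.jsonl', encoding='utf-8') as f:
    corp = [json.loads(l) for l in f]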
#%%
tk_num = 0
male_text_count = 0
female_text_count = 0
text_num = len(corp)
for text in corp:
    if text['gender'] == 1:
        male_text_count += 1
    else:
        female_text_count += 1
    for sent in text['text']:
        tk_num += len(sent)
#%%
readme = f'''
# Dcard post data
This repo hosts post data retrieved from the Dcard API,
which were collected for the purpose of building a small corpus.
The posts come from the 100 most popular forums on Dcard.
Each post is at least 100 characters long.
The post data were segmented and PoS tagged using [`ckiplab/ckiptagger`](https://github.com/ckiplab/ckiptagger).
## Files
- `data/dcard.jsonl`: The segmented and tagged corpus. Each line is a JSON string representing a post.
- `data/rawdata.zip`: The raw data retrieved from <https://www.dcard.tw/_api/forums> and <https://www.dcard.tw/_api/posts>.
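
`data/dcard.jsonl` can be read with a few lines of Python. The sketch below is only an illustration and is not part of this repo: `load_corpus` and `corpus_stats` are hypothetical helper names, and it assumes that each post has a `gender` field (where `1` marks a male author) and a `text` field holding a list of sentences, each a list of tokens.

```python
import json
from collections import Counter


def load_corpus(path):
    # Each non-empty line of the JSONL file is one post encoded as JSON.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def corpus_stats(corpus):
    # Assumption: post["text"] is a list of sentences, each a list of tokens,
    # and post["gender"] == 1 marks a male author (any other value: female).
    genders = Counter("male" if post["gender"] == 1 else "female" for post in corpus)
    tokens = sum(len(sent) for post in corpus for sent in post["text"])
    return len(corpus), tokens, genders["male"], genders["female"]


# Usage: posts, tokens, male, female = corpus_stats(load_corpus("data/dcard.jsonl"))
```

The tuple returned by `corpus_stats` holds the number of posts, the number of tokens, and the counts of posts by male and female authors, in that order.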
## Concordancer
The quickest way to query KWIC concordances in this corpus is to run [this concordancer](https://kwic.yongfu.name) with [Docker](https://www.docker.com).
Download image:
```bash
docker pull liao961120/dcard
```
Run server:
```bash
docker run -it -p 127.0.0.1:1420:80 liao961120/dcard
```
When you see `Corpus Loaded` printed on the command line, you can visit <https://kwic.yongfu.name> to use the app.
The source code of the concordancer is hosted in [`liao961120/kwic`](https://github.com/liao961120/kwic) and [`liao961120/kwic-backend`](https://github.com/liao961120/kwic-backend). Read more about the concordancer in [this post](https://yongfu.name/2020/03/20/building-concordancer.html).
## Corpus Stats
- Number of tokens: {tk_num}
- Number of posts: {text_num}
- Female author: {female_text_count} ({round(100*female_text_count/text_num, 2)}%)
- Male author: {male_text_count} ({round(100*male_text_count/text_num, 2)}%)
#### Word List (Top 100 Most Frequent)
{df.iloc[:100,:].to_html(justify='center').replace('border="1"', '')}
'''.strip()
# Write as UTF-8 so the Chinese tokens in the word list are encoded consistently
with open("README.md", "w", encoding="utf-8") as f:
    f.write(readme)