Update README.md
mhmd-azeez authored Oct 8, 2017
1 parent 586755b · commit ed7c247
Showing 1 changed file with 6 additions and 6 deletions.
README.md: 12 changes (6 additions & 6 deletions)
````diff
@@ -2,11 +2,11 @@
 A small web crawler used to collect Kurdish text over the web
 
 It has these commands:
-- **Crawl:** used to crawl web pages and save them to a folder on disk.
-- **Extract:** used to extract Kurdish text from the pages collected by the previous command.
-- **Normalize:** used to convert the text collected by the previous command to standard Unicode text.
-- **Merge:** used to merge the text files produced by the previous commands.
-- **WordList:** used to make a word list from the text file produced by the previous command.
+- [X] **Crawl:** used to crawl web pages and save them to a folder on disk.
+- [ ] **Extract:** used to extract Kurdish text from the pages collected by the previous command.
+- [ ] **Normalize:** used to convert the text collected by the previous command to standard Unicode text.
+- [ ] **Merge:** used to merge the text files produced by the previous commands.
+- [ ] **WordList:** used to make a word list from the text file produced by the previous command.
 
 ## How to use
 
@@ -25,7 +25,7 @@
 https://en.wikipedia.org/wiki/Wikipedia:About
 ```
 - `output`: The folder to save the crawled pages. The crawler will also save a `$Stats.txt` file that contains the crawling stats.
 - `delay`: Number of milliseconds to wait between crawling two pages. Default value is `1000`
-- `pages`: Maximum number of pages to crawl per **each** seed page. Default value is `250`
+- `pages`: Maximum number of pages to crawl for **each** seed page. Default value is `250`
 
 #### Examples:
 ```
````
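For context, the five commands in the diff above form a pipeline over the crawled data. A minimal sketch of a full run follows; the executable name (`crawler`), the flag syntax, and the file and folder names are assumptions for illustration — only the command names and the `output`/`delay`/`pages` options (with their defaults of `1000` ms and `250` pages) come from the README itself, so check its Examples section for the exact invocation.

```
# Hypothetical end-to-end run; executable name, flag syntax, and paths are assumptions.
# 1. Crawl the pages listed in a seed file and save them to a folder on disk.
crawler crawl seeds.txt --output pages --delay 1000 --pages 250

# 2. Extract Kurdish text from the crawled pages.
crawler extract pages --output extracted

# 3. Normalize the extracted text to standard Unicode.
crawler normalize extracted --output normalized

# 4. Merge the normalized text files into a single file.
crawler merge normalized --output corpus.txt

# 5. Build a word list from the merged file.
crawler wordlist corpus.txt --output wordlist.txt
```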
