Merge branch 'v3.4.2' into 'master'
Closing version 3.4.2

See merge request saffron/saffron!174
Bianca Pereira committed Jul 15, 2020
2 parents aad5b96 + 4efde51 commit 4946d85
Showing 37 changed files with 473 additions and 116 deletions.
40 changes: 40 additions & 0 deletions .gitlab/issue_templates/Bug.md
@@ -0,0 +1,40 @@
<!---
Please read this!
Before opening a new issue, make sure to search for keywords in the issues
filtered by the "bug" label:
- https://gitlab.insight-centre.org/saffron/saffron/issues/issues?label_name=bug
and verify the issue you're about to submit isn't a duplicate.
If it is a duplicate, make sure it is exactly the same problem and then reopen the existing issue.
--->

### Summary

(Summarize the bug encountered concisely)
(State which branch you are using)

### Steps to reproduce

(How one can reproduce the issue - this is very important)

### What is the current *bug* behavior?

(What actually happens)

### What is the expected *correct* behavior?

(What you should see instead)

### Relevant logs and/or screenshots

(Paste any relevant logs - please use code blocks (```) to format console output,
logs, and code as it's tough to read otherwise.)

### Possible fixes

(If you can, link to the line of code that might be responsible for the problem)

/label ~bug
61 changes: 61 additions & 0 deletions .gitlab/issue_templates/Feature Request.md
@@ -0,0 +1,61 @@
<!-- The first four sections: "Problem to solve", "Intended users", "User experience goal", and "Proposal", are strongly recommended, while the rest of the sections can be filled out during the design stage of a new Saffron version. However, keep in mind that providing complete and relevant information early helps us validate the problem and start working on a solution. -->

### Problem to solve

<!-- What problem do we solve? Try to define the who/what/why of the opportunity as a user story. For example, "As a (who), I want (what), so I can (why/value)." -->

### Intended users

<!-- Who will use this feature? If known, include any of the following: types of users (e.g. Developer, Non-technical), personas (e.g. a person trying to achieve X). It's okay to write "Unknown" and fill this field in later.
-->

### User experience goal

<!-- What is the single user experience workflow this problem addresses?
For example, "The user should be able to use the UI/command line/API with Saffron <step> to <perform a specific task>"
-->


### Proposal

<!-- How are we going to solve the problem? Try to include the user journey! -->

### Further details

<!-- Include use cases, benefits, goals, or any other details that will help us understand the problem better. -->


### Support Documentation

<!--
Add any supporting documentation on the best approaches to implement this feature.
-->

### Availability & Testing

<!-- This section needs to be filled in during the design phase of a new Saffron version.
What risks does this change pose to current features in Saffron? (don't say 'none')
How might it affect the quality of the final product?
What additional test coverage or changes to tests will be needed?
Will it require cross-browser testing?
-->

### What does success look like, and how can we measure that?

<!--
Define both the success metrics and acceptance criteria. Note that success metrics indicate the desired business outcomes, while acceptance criteria indicate when the solution is working correctly. If there is no way to measure success, link to an issue that will implement a way to measure this. -->


### Is this a cross-module feature?

<!-- Communicate if this change will affect multiple Saffron modules. If so, which modules? -->


### Links / references

<!-- Any additional links or references you may want to add -->

/label ~Feature
40 changes: 40 additions & 0 deletions .gitlab/issue_templates/Refactoring.md
@@ -0,0 +1,40 @@
## Summary

<!--
Please briefly describe what part of the code base needs to be refactored.
-->

## Improvements

<!--
Explain the benefits of refactoring this code.
-->

## Risks

<!--
Please list features that can break because of this refactoring and how you intend to solve that.
-->

## Involved components

<!--
List files or directories that will be changed by the refactoring.
-->

## Optional: Intended side effects

<!--
If the refactoring involves changes apart from the main improvements (such as a better UI), list them here.
It may be a good idea to create separate issues and link them here.
-->


## Optional: Missing test coverage

<!--
If you are aware of tests that need to be written or adjusted apart from unit tests for the changed components,
please list them here.
-->

/label ~Improvement
14 changes: 8 additions & 6 deletions README.md
@@ -10,19 +10,21 @@ distinct analysis of text. These modules are as follows
 them for later components
 2. *Term Extraction*: Extracts keyphrases that are the terms of each single
 document in a collection
-3. *Author Consolidation*: Detects and removes name variations from the list
+3. *Concept Consolidation*: Detects and removes variations from the list
+of terms of each document
+4. *Author Consolidation*: Detects and removes name variations from the list
 of authors of each document
-4. *DBpedia Lookup*: Links terms extracted from a document to URLs on the
+5. *DBpedia Lookup*: Links terms extracted from a document to URLs on the
 Semantic Web
-5. *Document-term Analysis*: Analyses the terms of a document and finds the relative
-importance of these terms
-6. *Author-Term Analysis*: Associates authors with particular documents and
-identifies the importance of the document to each author
+6. *Author Connection*: Associates authors with terms from the documents and
+identifies the importance of the term to each author
 7. *Term Similarity*: Measures the relevance of each term to each other term
 8. *Author Similarity*: Measures the relevance of each author to each other
 author
 9. *Taxonomy Extraction*: Organizes the terms into a single hierarchical
 graph that allows for easy browsing of the corpus and deep insights.
+10. *RDF Extraction*: Creates a knowledge graph
+

 <img src="https://gitlab.insight-centre.org/saffron/saffron/raw/master/docs/Saffron%20Services.png" alt="Saffron Service Workflow" width="400"/>

2 changes: 1 addition & 1 deletion authors/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
<parent>
<groupId>org.insightcentre</groupId>
<artifactId>saffron</artifactId>
<version>3.4.1</version>
<version>3.4.2</version>
</parent>
<name>Saffron Author Processing</name>
<artifactId>saffron-authors</artifactId>
Expand Down
@@ -63,9 +63,11 @@ public boolean equals(Object obj) {

 }
 private final int top_n;
+private final int minDocs;

 public ConnectAuthorTerm(AuthorTermConfiguration config) {
 this.top_n = config.topN;
+this.minDocs = config.minDocs;
 }

 public Collection<AuthorTerm> connectResearchers(List<Term> terms, List<DocumentTerm> documentTerms,
@@ -76,7 +78,9 @@ public Collection<AuthorTerm> connectResearchers(List<Term> terms, List<Document
 public Collection<AuthorTerm> connectResearchers(List<Term> terms, List<DocumentTerm> documentTerms,
 Iterable<Document> documents, SaffronListener log) {

-Map<String, Document> docById = buildDocById(documents);
+DocById docByIdReturnValue = buildDocById(documents);
+Map<String, Document> docById = docByIdReturnValue.docById;
+Object2IntMap<Author> authorFreq = docByIdReturnValue.authorFreq;
 Map<Author, List<String>> author2Term = buildAuthor2Term(documentTerms, docById, log);
 //Map<Author, List<String>> author2Doc = buildAuthor2Doc(documentTerms, docById);
 Map<String, Term> termById = buildTermById(terms);
@@ -90,6 +94,8 @@ public Collection<AuthorTerm> connectResearchers(List<Term> terms, List<Document

 List<AuthorTerm> ats = new ArrayList<>();
 for(Map.Entry<Author, List<String>> e : author2Term.entrySet()) {
+if(authorFreq.getInt(e.getKey()) < minDocs)
+continue;
 TreeSet<AuthorTerm> topN = new TreeSet<>(new Comparator<AuthorTerm>() {

 @Override
@@ -120,6 +126,7 @@ public int compare(AuthorTerm arg0, AuthorTerm arg1) {
 at.setOccurrences(occurrences.getInt(atKey));
 at.setPaperCount(paper_count.getInt(atKey));
 at.setScore(at.getTfIrf() * at.getPaperCount());
+at.setResearcherScore((double)at.getPaperCount() * Math.log(1 + at.getMatches()));
 if(topN.size() < top_n) {
 topN.add(at);
 } else if(topN.size() >= top_n && at.getScore() > topN.first().getScore()) {
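
The `setResearcherScore` line added in this hunk multiplies an author's paper count by the natural logarithm of one plus the match count, so an author-term pair with zero matches scores zero and additional matches give diminishing returns. A minimal standalone sketch of that arithmetic follows. The class and method names (`ResearcherScoreSketch`, `researcherScore`) are hypothetical and not part of Saffron, and the sketch assumes both inputs are plain `int` values, which the diff does not confirm.

```java
class ResearcherScoreSketch {
    /**
     * Mirrors the expression in the diff: paperCount * ln(1 + matches).
     * Names and parameter types are illustrative assumptions.
     */
    static double researcherScore(int paperCount, int matches) {
        return (double) paperCount * Math.log(1 + matches);
    }

    public static void main(String[] args) {
        // Zero matches: ln(1) is 0, so the score is 0 regardless of paper count.
        System.out.println(researcherScore(3, 0));
        // More matches raise the score, but only logarithmically.
        System.out.println(researcherScore(3, 9));
        System.out.println(researcherScore(3, 99));
    }
}
```

Note that the top-N selection in the surrounding code still compares `getScore()`, so in this hunk the researcher score is recorded alongside the existing score rather than used for ranking.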
@@ -158,11 +165,28 @@ private Map<String, Term> buildTermById(List<Term> terms) {
 return termById;
 }

-private Map<String, Document> buildDocById(Iterable<Document> documents) {
+private class DocById {
+Map<String, Document> docById;
+Object2IntMap<Author> authorFreq;
+
+public DocById(Map<String, Document> docById, Object2IntMap<Author> authorFreq) {
+this.docById = docById;
+this.authorFreq = authorFreq;
+}
+
+
+}
+
+private DocById buildDocById(Iterable<Document> documents) {
 Map<String, Document> docById = new HashMap<>();
-for(Document document : documents)
+Object2IntMap<Author> authorFreq = new Object2IntOpenHashMap<>();
+for(Document document : documents) {
 docById.put(document.id, document);
-return docById;
+for(Author a : document.getAuthors()) {
+authorFreq.put(a, authorFreq.getInt(a) + 1);
+}
+}
+return new DocById(docById, authorFreq);
 }

 private void countOccurrence(Map<Author, List<String>> author2Term, Map<String, Term> terms, Object2IntMap<AT> occurrences,
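
Taken together, the hunks above count how many documents each author appears in and then skip any author whose count falls below `minDocs`. The sketch below reproduces that idea with standard-library collections only. It is an illustration under stated assumptions, not the Saffron implementation: authors are plain strings instead of `Author` objects, a `HashMap` stands in for fastutil's `Object2IntMap`, and the names `MinDocsFilterSketch`, `authorFrequency`, and `eligibleAuthors` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MinDocsFilterSketch {
    /** Counts author occurrences across documents; each inner list holds one document's authors. */
    static Map<String, Integer> authorFrequency(List<List<String>> authorsPerDocument) {
        Map<String, Integer> freq = new HashMap<>();
        for (List<String> authors : authorsPerDocument) {
            for (String author : authors) {
                // Same effect as authorFreq.put(a, authorFreq.getInt(a) + 1) in the diff.
                freq.merge(author, 1, Integer::sum);
            }
        }
        return freq;
    }

    /** Keeps authors with at least minDocs documents, mirroring the "< minDocs" skip in the diff. */
    static List<String> eligibleAuthors(Map<String, Integer> freq, int minDocs) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (e.getValue() < minDocs) {
                continue;
            }
            kept.add(e.getKey());
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("Ada", "Grace"),
                Arrays.asList("Ada"),
                Arrays.asList("Linus"));
        Map<String, Integer> freq = authorFrequency(docs);
        // Only Ada appears in two documents, so this prints [Ada].
        System.out.println(eligibleAuthors(freq, 2));
    }
}
```

With `minDocs` left at 1, every author who appears in at least one document passes the filter, so a threshold of 1 leaves the set of connected authors unchanged.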
@@ -8,6 +8,7 @@
 import joptsimple.OptionParser;
 import joptsimple.OptionSet;
 import org.insightcentre.nlp.saffron.config.AuthorTermConfiguration;
+import org.insightcentre.nlp.saffron.config.Configuration;
 import org.insightcentre.nlp.saffron.data.Corpus;
 import org.insightcentre.nlp.saffron.data.Term;
 import org.insightcentre.nlp.saffron.data.connections.AuthorTerm;
@@ -68,7 +69,7 @@ public static void main(String[] args) {

 ObjectMapper mapper = new ObjectMapper();

-AuthorTermConfiguration config = configurationFile == null ? new AuthorTermConfiguration() : mapper.readValue(configurationFile, AuthorTermConfiguration.class);
+AuthorTermConfiguration config = configurationFile == null ? new AuthorTermConfiguration() : mapper.readValue(configurationFile, Configuration.class).authorTerm;
 Corpus corpus = CorpusTools.readFile(corpusFile);
 List<DocumentTerm> docTerms = mapper.readValue(docTermFile, mapper.getTypeFactory().constructCollectionType(List.class, DocumentTerm.class));
 List<Term> terms = mapper.readValue(termFile, mapper.getTypeFactory().constructCollectionType(List.class, Term.class));
2 changes: 1 addition & 1 deletion benchmarks/pom.xml
@@ -4,7 +4,7 @@
 <parent>
 <groupId>org.insightcentre</groupId>
 <artifactId>saffron</artifactId>
-<version>3.4.1</version>
+<version>3.4.2</version>
 </parent>
 <groupId>org.insightcentre</groupId>
 <artifactId>benchmarks</artifactId>
2 changes: 1 addition & 1 deletion chuliu-edmonds/pom.xml
@@ -4,7 +4,7 @@
 <parent>
 <groupId>org.insightcentre</groupId>
 <artifactId>saffron</artifactId>
-<version>3.4.1</version>
+<version>3.4.2</version>
 </parent>
 <name>Chu-Liu Edmonds Algorithm</name>
 <artifactId>chuliu-edmonds</artifactId>
2 changes: 1 addition & 1 deletion concept/pom.xml
@@ -5,7 +5,7 @@
 <parent>
 <groupId>org.insightcentre</groupId>
 <artifactId>saffron</artifactId>
-<version>3.4.1</version>
+<version>3.4.2</version>
 </parent>

 <name>Saffron Concept Consolidation</name>
2 changes: 1 addition & 1 deletion core/pom.xml
@@ -4,7 +4,7 @@
 <parent>
 <groupId>org.insightcentre</groupId>
 <artifactId>saffron</artifactId>
-<version>3.4.1</version>
+<version>3.4.2</version>
 </parent>
 <name>Saffron Core</name>
 <artifactId>saffron-core</artifactId>
@@ -8,4 +8,6 @@
 public class AuthorTermConfiguration {
 /** The maximum number of total author-term pairs to extract */
 public int topN = 1000;
+/** Exclude authors who are not authors of the minimum number of documents */
+public int minDocs = 1;
 }
2 changes: 1 addition & 1 deletion crawler/pom.xml
@@ -4,7 +4,7 @@
 <parent>
 <groupId>org.insightcentre</groupId>
 <artifactId>saffron</artifactId>
-<version>3.4.1</version>
+<version>3.4.2</version>
 </parent>
 <artifactId>crawler</artifactId>
 <packaging>jar</packaging>
2 changes: 1 addition & 1 deletion documentindex/pom.xml
@@ -4,7 +4,7 @@
 <parent>
 <groupId>org.insightcentre</groupId>
 <artifactId>saffron</artifactId>
-<version>3.4.1</version>
+<version>3.4.2</version>
 <relativePath>..</relativePath>
 </parent>
 <artifactId>saffron-documentindex</artifactId>
2 changes: 1 addition & 1 deletion pom.xml
@@ -3,7 +3,7 @@
 <modelVersion>4.0.0</modelVersion>
 <groupId>org.insightcentre</groupId>
 <artifactId>saffron</artifactId>
-<version>3.4.1</version>
+<version>3.4.2</version>
 <packaging>pom</packaging>
 <name>Saffron</name>

2 changes: 1 addition & 1 deletion saffron.sh
@@ -77,7 +77,7 @@ fi
 echo "########################################"
 echo "## Step 6: Connect Authors ##"
 echo "########################################"
-$DIR/connect-authors -t $CORPUS -p $OUTPUT/terms.json -d $OUTPUT/doc-terms.json -o $OUTPUT/author-terms.json
+$DIR/connect-authors -c $CONFIG -t $CORPUS -p $OUTPUT/terms.json -d $OUTPUT/doc-terms.json -o $OUTPUT/author-terms.json

 echo "########################################"
 echo "## Step 7: Term Similarity ##"
2 changes: 1 addition & 1 deletion taxonomy/pom.xml
@@ -4,7 +4,7 @@
 <parent>
 <groupId>org.insightcentre</groupId>
 <artifactId>saffron</artifactId>
-<version>3.4.1</version>
+<version>3.4.2</version>
 </parent>
 <name>Saffron Taxonomy Extraction</name>
 <artifactId>saffron-taxonomy</artifactId>