
✨ [Feature request] - Add Seshat API example notebook showing prediction of complexity characteristics #17

Open
2 tasks
edwardchalstrey1 opened this issue Oct 21, 2024 · 4 comments
@edwardchalstrey1 (Collaborator)

Description of Improvement

Initial Hypotheses/ideas:

  • Predict something, use sklearn or another ML package.
  • Can we predict whether a polity is likely to have variable A, given the presence of variables B, C and D?

After reading this paper:

  • They fitted a predictive model based on the Complexity Characteristics (CCs) of most of the world regions (training set), then were able to use it to predict the CCs of North America (test set)
  • Their most useful principal component ("PC1"; unsure how it is calculated) shows a general increase across polities or regions over time
  • "The tight relationships between different CCs provide support for the idea that there are functional relationships between these characteristics that cause them to coevolve"
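As background for the points above: PC1 is simply the first principal component of the CC matrix, and a single dominant PC1 is exactly what you get when one shared factor drives several correlated variables. A minimal illustration with synthetic data (all values here are made up, not Seshat data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# One shared latent "complexity" factor driving five synthetic CCs plus noise.
latent = rng.normal(size=(100, 1))
loadings = rng.normal(size=(1, 5))
ccs = latent @ loadings + rng.normal(scale=0.3, size=(100, 5))

pca = PCA(n_components=2).fit(ccs)
pc1_scores = pca.transform(ccs)[:, 0]  # one PC1 score per (synthetic) polity
print(f"PC1 explains {pca.explained_variance_ratio_[0]:.0%} of the variance")
```

When the CCs genuinely coevolve, PC1 absorbs most of the variance, which is why the paper can summarise many CCs with one axis.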

Notebook idea: Rather than replicating the Principal Component analysis, which Matilda is doing, a simpler ML notebook could involve:

  1. loading the data for several CCs that the paper says are linked
  2. Training a model to predict one CC based on others
  3. Evaluating the performance of the model
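The three steps above could be sketched with scikit-learn roughly as follows. The column names and the synthetic data are placeholders, not the real Seshat schema; the real notebook would load linked CCs from the API instead:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# 1. Load data for several linked CCs (synthetic stand-in: one row per polity).
rng = np.random.default_rng(0)
n = 200
pop = rng.normal(5, 1, n)  # e.g. log10 population, made up for illustration
df = pd.DataFrame({
    "polity_population": pop,
    "hierarchy_levels": 0.8 * pop + rng.normal(0, 0.5, n),
    "government_vars": 0.6 * pop + rng.normal(0, 0.5, n),
})

# 2. Train a model to predict one CC from the others.
X = df[["hierarchy_levels", "government_vars"]]
y = df["polity_population"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# 3. Evaluate on held-out polities.
r2 = r2_score(y_test, model.predict(X_test))
print(f"R^2 on held-out polities: {r2:.2f}")
```

A regressor is used here for continuous CCs; the "does polity have variable A?" framing from the hypotheses would swap in a classifier and a metric such as accuracy or ROC AUC.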

Dependencies

No response

Technical Notes

  • Because of the way the database/API is set up, it is very hard to retrieve, for a single polity (or set of polities), the values of all of its variables at once
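One workaround, pending a better endpoint, is to download each variable separately and merge the results on polity ID. A sketch of the merge step, assuming each per-variable download reduces to a table with `polity_id` and `value` columns (an assumed shape, not the actual API schema):

```python
import pandas as pd

def merge_by_polity(tables: dict) -> pd.DataFrame:
    """Combine per-variable downloads into one polity x variable table.

    `tables` maps a variable name to a DataFrame with `polity_id` and
    `value` columns, one row per polity (assumed shape for illustration).
    """
    merged = None
    for name, df in tables.items():
        col = df.set_index("polity_id")["value"].rename(name)
        # Outer join keeps polities that only appear for some variables.
        merged = col.to_frame() if merged is None else merged.join(col, how="outer")
    return merged

# Toy stand-ins for two variable downloads:
tables = {
    "hierarchy_levels": pd.DataFrame({"polity_id": [1, 2], "value": [4, 6]}),
    "polity_territory": pd.DataFrame({"polity_id": [2, 3], "value": [10.0, 2.5]}),
}
wide = merge_by_polity(tables)
print(wide)
```

Polities missing a variable end up as NaN, which also makes the coverage gaps easy to see before modelling.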

Definition of Done

  • The feature has been developed on a feature branch.
  • A pull request has been created for the feature branch to be merged into the main branch.
@kallewesterling (Member)

> Because of the way the database/api is set up, it's very hard to get for a single polity, or set of polities, all the values of all the variables

I have experimented with creating an API end point to access all variables associated with a given polity. Perhaps this should be opened as a different issue on the side of the Seshat API Django app?

@edwardchalstrey1 (Collaborator, Author)

>> Because of the way the database/api is set up, it's very hard to get for a single polity, or set of polities, all the values of all the variables
>
> I have experimented with creating an API end point to access all variables associated with a given polity. Perhaps this should be opened as a different issue on the side of the Seshat API Django app?

@kallewesterling Ok good idea - it might be worth coordinating with @matildaperuzzo on that as she may be writing code which creates data in that format after downloading via the API - but if it can already be retrieved that way, this seems better

@kallewesterling (Member)

Yeah -- Django has ways of optimising queries for Postgres, so it's definitely worth us looking into it. @matildaperuzzo, would you be able to share your code somehow, do you think?

@matildaperuzzo

In my code, the first thing I do after downloading the data is group it by polity.

Being able to pull data by polity would have been useful when writing my script, and it could definitely help with debugging, or with projects that are only concerned with specific regions. For the template I spoke about in the workshop, the only change would have been to gather by polity and then sort by variable, rather than gather by variable and then sort by polity. And since I don't want to download all variables every time, but I do want to download all polities, the current approach is more effective.

My code is in the SeshatDatasetAnalysis repository, in the Template.py file. The snippet below shows the add_dataset function, which is called for every downloaded variable; it splits the dataset by polity and then adds each polity to the template in the row for that variable.

```python
def add_dataset(self, key, url):
    # Skip if the dataset is already in the dataframe
    if key in self.template.columns:
        print(f"Dataset {key} already in dataframe")
        return

    # Download the data and time the request
    tic = time.time()
    df = download_data(url)
    toc = time.time()
    print(f"Downloaded {key} dataset with {len(df)} rows in {toc - tic} seconds")
    if len(df) == 0:
        print(f"Empty dataset for {key}")
        return

    variable_name = df.name.unique()[0].lower()
    # Range variables are stored as <name>_from / <name>_to columns
    range_var = variable_name + "_from" in df.columns
    col_name = key.split('/')[-1]
    self.add_empty_col(col_name)
    polities = self.template.PolityID.unique()

    # Add each polity's rows for this variable to the template
    for pol in polities:
        pol_df = df.loc[df.polity_id == pol]
        if pol_df.empty:
            continue
        self.add_polity(pol_df, range_var, variable_name, col_name)

    self.perform_tests(df, variable_name, range_var, col_name)
    print(f"Added {key} dataset to template")
```
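The gather-by-variable versus gather-by-polity trade-off described above can be shown with a toy pandas example, assuming the downloads reduce to long-format rows with `polity_id`, `variable`, and `value` columns (illustrative names, not the real download schema):

```python
import pandas as pd

# Toy long-format download: one row per (polity, variable) observation.
df = pd.DataFrame({
    "polity_id": [1, 1, 2, 2, 2],
    "variable": ["hierarchy", "territory", "hierarchy", "territory", "population"],
    "value": [4, 1.5, 6, 10.0, 5.2],
})

# Gather by variable (the current approach): one table per variable.
by_variable = {name: group for name, group in df.groupby("variable")}

# Versus pivoting straight to a polity x variable table.
wide = df.pivot(index="polity_id", columns="variable", values="value")
print(wide)
```

Either way the same wide table comes out; the difference is only in which axis you iterate over while building it, which matters when you only want a subset of variables or a subset of polities.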
