ContentCoder

ContentCoder is a Python-based text analysis tool that enables users to process and analyze text using custom linguistic dictionaries. It is inspired by tools like LIWC (Linguistic Inquiry and Word Count) and provides robust methods for tokenization, text analysis, and frequency calculations.

Note: Approximately 98% of this README was generated by ChatGPT — it may not be entirely accurate, but at a quick glance, it looks pretty spot-on.

Features

Custom Dictionary-Based Analysis
Support for LIWC-style dictionaries (2007 & 2022 formats)
Efficient text tokenization
Wildcard and abbreviation handling
Punctuation and big word analysis
Dictionary export in multiple formats (JSON, CSV, Poster format, etc.)
High-performance wildcard matching with memory optimization

Installation

Ensure you have Python 3.9+ installed. ContentCoder is written in pure Python and requires no additional dependencies to install.

pip install contentcoder

Folder Structure

src/contentcoder/
├── __init__.py
├── ContentCoder.py
├── ContentCodingDictionary.py
├── happiestfuntokenizing.py
└── create_export_dir.py

Quick Start

1. Import the `ContentCoder` class

from contentcoder.ContentCoder import ContentCoder

2. Initialize the Analyzer

cc = ContentCoder(dicFilename='path/to/dictionary.dic', fileEncoding='utf-8-sig')
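For orientation, a LIWC 2007-style `.dic` file usually has a header of numbered categories between two `%` lines, followed by one term per line with the category numbers it belongs to. The sketch below parses that layout from a string. It is not ContentCoder's own loader: the `parse_dic_2007` helper and the sample dictionary are hypothetical, and multi-word terms are not handled.

```python
def parse_dic_2007(dic_text):
    """Parse LIWC 2007-style dictionary text into (categories, terms)."""
    categories = {}      # category number (str) -> category name
    terms = {}           # dictionary term -> list of category names
    percent_lines = 0    # how many "%" delimiter lines have been seen
    for raw_line in dic_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "%":
            percent_lines += 1
            continue
        parts = line.split()
        if percent_lines == 1:
            # Inside the header: "<number> <category name>"
            categories[parts[0]] = parts[1]
        else:
            # After the header: "<term> <number> [<number> ...]"
            terms[parts[0]] = [categories[num] for num in parts[1:]]
    return categories, terms


# Hypothetical sample dictionary, for illustration only.
SAMPLE_DIC = """%
1\tposemo
2\tnegemo
%
happy\t1
happi*\t1
sad\t2
bittersweet\t1\t2
"""

categories, terms = parse_dic_2007(SAMPLE_DIC)
print(categories)
print(terms)
```

In practice, pass the path of a real dictionary file to `ContentCoder(dicFilename=...)` as shown above and let the library do the parsing.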

3. Analyze a Text Sample

text = "An abrupt sound startled him. Off to the right he heard it, and his ears, expert in such matters, could not be mistaken. Again he heard the sound, and again. Somewhere, off in the blackness, someone had fired a gun three times."
results = cc.Analyze(text, relativeFreq=True, dropPunct=True, retainCaptures=False, returnTokens=True, wildcardMem=True)
print(results)

Example output (illustrative; the actual values depend on the input text and the loaded dictionary):

{
  "WC": 23,
  "Dic": 5.4,
  "BigWords": 6.0,
  "Numbers": 3.0,
  "AllPunct": 0.0,
  "Period": 3.0,
  "Comma": 0.0,
  "QMark": 0.0,
  "Exclam": 0.0,
  "Apostro": 0.0
}
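LIWC-style tools conventionally report category scores as a percentage of the total word count when relative frequencies are requested. The README does not spell out ContentCoder's formula, so the sketch below only assumes that convention; `to_relative` and the category names are hypothetical.

```python
def to_relative(raw_counts, word_count):
    """Convert raw category counts to percentages of the total word count."""
    if word_count == 0:
        # Avoid division by zero for empty texts.
        return {category: 0.0 for category in raw_counts}
    return {category: 100.0 * count / word_count
            for category, count in raw_counts.items()}


raw = {"posemo": 3, "negemo": 1}   # hypothetical raw counts
rel = to_relative(raw, word_count=40)
print(rel)
```

Under this assumption, a category matched 3 times in a 40-word text scores 7.5.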

Main Functions & Usage

1. `Analyze(text, **options)`

Analyzes a given text and returns a dictionary of results.

Parameters:

inputText (str): The text to analyze (the first positional argument).
relativeFreq (bool): If True, returns relative frequencies; otherwise, returns raw frequencies.
dropPunct (bool): If True, punctuation is removed before processing.
retainCaptures (bool): If True, captures and stores wildcard-matched words.
returnTokens (bool): If True, returns tokenized text.
wildcardMem (bool): If True, speeds up wildcard processing by storing past matches.
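The README does not describe how `wildcardMem` works internally. One plausible reading is simple memoization: once a token has been checked against the wildcard entries, the outcome is cached so that repeated tokens skip the scan. The sketch below illustrates that idea only; `match_categories` and the data structures are hypothetical and are not ContentCoder's implementation.

```python
def match_categories(token, exact_terms, wildcard_prefixes, memo):
    """Return the categories for a token, caching wildcard lookups in memo."""
    if token in exact_terms:
        return exact_terms[token]
    if token in memo:
        # Seen before: reuse the stored result instead of rescanning prefixes.
        return memo[token]
    matched = []
    for prefix, cats in wildcard_prefixes.items():
        if token.startswith(prefix):
            matched.extend(cats)
    memo[token] = matched
    return matched


exact_terms = {"sad": ["negemo"]}
wildcard_prefixes = {"happi": ["posemo"]}   # stands for the entry "happi*"
memo = {}

tokens = ["happiness", "sad", "happiness", "table"]
results = [match_categories(t, exact_terms, wildcard_prefixes, memo) for t in tokens]
print(results)
print(memo)
```

The trade-off is memory for speed: the cache grows with the number of distinct tokens that are not exact dictionary entries.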

Example Usage:

result = cc.Analyze("Hello world! This is a test sentence.", returnTokens=True)

2. `GetResultsHeader()`

Returns a list of all available output categories.

Example Usage:

print(cc.GetResultsHeader())

Expected output:

["WC", "Dic", "BigWords", "Numbers", "AllPunct", "Period", "Comma", "QMark", "Exclam", "Apostro"]

3. `GetResultsArray(resultsDICT, rounding=4)`

Formats the results of Analyze() into a CSV-friendly list.

Example Usage:

text = "The government plays an important role."
result = cc.Analyze(text)
csv_row = cc.GetResultsArray(result)
print(csv_row)

Expected output (illustrative values, one per category returned by GetResultsHeader(), in the same order):

[6, 4.3, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
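Because the array carries no labels, it is easy to misalign it with the header. A small sanity check before writing rows can help. In the sketch below, the two hard-coded lists stand in for the return values of `cc.GetResultsHeader()` and `cc.GetResultsArray(result)`; the numbers are placeholders, not real output.

```python
# Stand-ins for cc.GetResultsHeader() and cc.GetResultsArray(result).
header = ["WC", "Dic", "BigWords", "Numbers", "AllPunct",
          "Period", "Comma", "QMark", "Exclam", "Apostro"]
row = [6, 4.3, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # placeholder values

# Fail early if the two lists cannot be paired one-to-one.
if len(header) != len(row):
    raise ValueError(f"header has {len(header)} columns but row has {len(row)} values")

labelled = dict(zip(header, row))
print(labelled["WC"], labelled["BigWords"])
```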

4. `ExportCaptures(filename, fileEncoding='utf-8-sig', wildcardsOnly=False, fullset=True)`

Exports wildcard-captured words and their frequencies to a CSV file.

Example Usage:

cc.ExportCaptures("captured_words.csv")
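The README does not show the layout of the exported file. As a rough mental model only, a capture table can be thought of as a count of which concrete words matched each dictionary entry. The sketch below builds such a table in memory; the column names and sample data are hypothetical and may not match what `ExportCaptures` actually writes.

```python
import csv
import io
from collections import Counter

# Hypothetical (dictionary entry, matched word) pairs seen during analysis.
observed = [("happi*", "happiness"), ("happi*", "happily"), ("happi*", "happiness")]

captures = Counter(observed)

# Write the table to an in-memory buffer instead of a file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["entry", "captured_word", "count"])
for (entry, word), count in sorted(captures.items()):
    writer.writerow([entry, word, count])

csv_text = buffer.getvalue()
print(csv_text)
```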

5. `ExportDict2022Format(dicOutFilename, fileEncoding, **options)`

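Exports the loaded dictionary in LIWC-22 format. Note that this method is called on the dictionary object (cc.dict), not on the ContentCoder instance itself.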

Example Usage:

cc.dict.ExportDict2022Format("dictionary_2022.dicx")

6. `UpdateCategories(dicTerm, newCategories)`

Updates the categories associated with a dictionary term. This method is also called on the dictionary object (cc.dict).

Example Usage:

cc.dict.UpdateCategories(dicTerm="happiness", newCategories={"positive_emotion": 1.0, "joy": 0.5})

Example: Processing a Large CSV File with `tqdm`

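This script reads a large CSV file, analyzes the text in each row's "comment_text" column, and writes the results to Output.csv alongside the row's "id". It uses tqdm for the progress bar, which is a third-party package and must be installed separately.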

import csv
from tqdm import tqdm
from contentcoder.ContentCoder import ContentCoder

cc = ContentCoder(dicFilename='dictionary.dic', fileEncoding='utf-8-sig')

with open("Comments.csv", "r", encoding="utf-8-sig", newline="") as csvfile, \
     open("Output.csv", "w", encoding="utf-8-sig", newline="") as csvfile_out:

    reader = csv.DictReader(csvfile)
    writer = csv.writer(csvfile_out)
    writer.writerow(["id"] + cc.GetResultsHeader())

    for row in tqdm(reader, desc="Processing", unit=" comments"):
        row_id = row["id"]
        text = row["comment_text"]
        result = cc.Analyze(text)
        csv_row = cc.GetResultsArray(result)
        writer.writerow([row_id] + csv_row)

print("Finished!")
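To try out the CSV plumbing without a dictionary file or a large input, the same loop can be run against in-memory data. In this sketch, `analyze_stub` is a hypothetical stand-in for the `cc.Analyze` and `cc.GetResultsArray` pair and returns only a word count; swap the real calls back in once the column names are confirmed.

```python
import csv
import io


def analyze_stub(text):
    """Hypothetical stand-in for cc.Analyze + cc.GetResultsArray (word count only)."""
    return [len(text.split())]


# In-memory replacements for Comments.csv and Output.csv.
source = io.StringIO("id,comment_text\n1,Hello world\n2,This is a test\n", newline="")
target = io.StringIO(newline="")

reader = csv.DictReader(source)
writer = csv.writer(target, lineterminator="\n")
writer.writerow(["id", "WC"])

for row in reader:
    writer.writerow([row["id"]] + analyze_stub(row["comment_text"]))

output_text = target.getvalue()
print(output_text)
```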

Repository: ryanboyd/ContentCoder-Py
