jack-the-leaf.org

Action Items

Prep

Review next steps for SageMaker in documentation

Push to GitHub

Upload to S3

Tech

Run through the process with the data subset (~6000 points)

Load the model in DeepLens

Prep ALL the data (.lst, and upload to S3)

Make SageMaker model with full data set

Deploy model to DeepLens

Trees

Identify the species for some live samples

Test DeepLens+Model on real leaves

plant photo data set

|                       | idea 1 | idea 2 |
|-----------------------+--------+--------|
| initial               | 20GB | 44M |
| unzipped              | 22GB | 350M |
| images                | included | download |
| data point            | jpg + xml | a line in CSV (but need to download the corresponding image) |
| number of data points | 0.25 mil | 1.1+ mil |
| image size            | ~100K | ~500K |
| effort needed         | need code to parse the data points to produce .lst file and organize corresponding images | need code to take a subset of points, download and organize corresponding images and produce a .lst file |
| subset size           | 200 | |
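
For idea 2, the "take a subset of points" step might look roughly like the sketch below. It only picks rows from the CSV; downloading the images and writing the .lst file are not shown. The function name, the seed parameter, and the assumption that the CSV has a header row are mine, not something taken from the data set.

```python
import csv
import random


def sample_rows(csv_path, subset_size, seed=0):
    """Return a reproducible random subset of the rows in a CSV file.

    Assumes the CSV has a header row; each returned row is a dict
    keyed by whatever column names that header contains.
    """
    with open(csv_path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    # A seeded generator gives the same subset on every run.
    rng = random.Random(seed)
    # Never ask for more rows than the file has.
    return rng.sample(rows, min(subset_size, len(rows)))
```

Fixing the seed keeps the subset stable between runs, so images that were already downloaded do not need to be fetched again.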
# <Image>
# 	<FileName>248738.jpg</FileName>
# 	<Species>Justicia nyassana Lindau</Species>
# 	<Origin>eol</Origin>
# 	<Author>mark hyde, bart wursten and petra ballings, bart wursten, flora of zimbabwe</Author>
# 	<Content></Content>
# 	<Genus>Justicia</Genus>
# 	<Family>Acanthaceae</Family>
# 	<ObservationId>174381</ObservationId>
# 	<MediaId>248738</MediaId>
# 	<YearInCLEF>PlantCLEF2017</YearInCLEF>
# 	<LearnTag>Train</LearnTag>
# 	<ClassId>2365</ClassId>
# </Image>

# https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html

# A .lst file is a tab-separated file with three columns that contains
# a list of image files. The first column specifies the image index,
# the second column specifies the class label index for the image, and
# the third column specifies the relative path of the image file. The
# image index in the first column must be unique across all of the
# images.

from bs4 import BeautifulSoup
import os
import os.path

lst_filename = "all.lst"
lst_contents = []
index = 1

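def parseme(path):
    # Read one metadata XML file and return [image file name, class id].
    with open(path) as fh:
        x = BeautifulSoup(fh.read(), features="xml")
    fn = str(x.Image.FileName.contents[0]).strip()
    klass = str(x.Image.ClassId.contents[0]).strip()
    return [fn, klass]


for root, dirs, files in os.walk("."):
    for f in files:
        extension = os.path.splitext(f)[1]
        if extension == ".xml":
            # os.walk yields bare file names, so join with root before opening.
            fn, klass = parseme(os.path.join(root, f))
            # Assumes the image sits next to its XML file; the path stays
            # relative to the directory the walk started from.
            image_path = os.path.join(root, fn)
            lst_contents.append(str(index) + "\t" + klass + "\t" + image_path + "\n")
            index += 1

with open(lst_filename, "w") as f:
    f.writelines(lst_contents)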
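
One thing to check before training: as I read the SageMaker image classification page linked above, the class label index in the second column should be numbered successively starting from 0, while the script writes the raw ClassId (e.g. 2365). A minimal sketch of a remapping step, assuming the lines are already in the three-column tab-separated form produced above. The helper name remap_labels is hypothetical.

```python
def remap_labels(lines):
    """Rewrite .lst lines so labels are consecutive integers starting at 0.

    Returns the rewritten lines and the ClassId -> label mapping, which
    is needed later to turn predictions back into species.
    """
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    # Sort the distinct ClassId values so the mapping is reproducible.
    class_ids = sorted({row[1] for row in rows}, key=int)
    label_of = {class_id: i for i, class_id in enumerate(class_ids)}
    remapped = [
        "\t".join([row[0], str(label_of[row[1]]), row[2]]) + "\n"
        for row in rows
    ]
    return remapped, label_of
```

Usage would be something like `lst_contents, label_of = remap_labels(lst_contents)` just before writing the file, keeping `label_of` around to map the model output back to a ClassId.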