Plankton Portal Talk

86'ing the measurement

  • kirstenr by kirstenr

    I'm curious as to why you 86'ed the measurement on the new Plankton Portal. I would agree that it was not that accurate. It was very difficult to measure wavy creatures like larvaceans or shrimp when they were folded/bent, etc. However wasn't the purpose partially to get the direction of the plankters in the current? Is there still a way to do that without measuring? Or did the team decide (after looking at the preliminary data) that no patterns were emerging? I'm just curious. I noticed in my own classifications that different plankters were often pointed every which way. Thanks.

  • jo.irisson by jo.irisson translator, scientist

    We cannot get all the information we want from a single source. The strength of planktonportal is that it provides relatively fast and accurate identification, backed by several people (much better than what we would get in the lab, with just one taxonomist seeing all the images). We decided to play to that strength.

    Size measurements were approximate, and length × width is not the best estimate of size (using the width of copepods including the antennae largely overestimates their volume, for example). Orientation was also difficult to tell for several groups. Once we have the identifications and locations of organisms, we will run a "particle detector" on the images, which will automatically select all the light objects on a frame (the potential organisms) and very accurately measure some characteristics (among which: area, length, width, angle, etc.). Then we will filter those to keep only the measurements that correspond to an actual identified organism. This way we get the best of both worlds: accurate measurements of accurately identified organisms. That's more work for us but it can easily be automated (actually, we already ran a particle detector on the Med dataset --at least-- to pre-select frames with stuff on them).
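
    To make this concrete, here is a minimal sketch of what such a particle detector could look like. The library choice (scikit-image), the file name, and the size cutoff are illustrative assumptions, not the team's actual pipeline:

    ```python
    import numpy as np
    from skimage import filters, io, measure

    # Load one frame; organisms show up as light objects on a dark background.
    frame = io.imread("frame_0001.png", as_gray=True)  # hypothetical file name

    # Threshold the frame and label connected bright regions.
    binary = frame > filters.threshold_otsu(frame)
    labels = measure.label(binary)

    # Measure each candidate object, skipping tiny specks.
    for region in measure.regionprops(labels):
        if region.area < 50:  # arbitrary size cutoff, in pixels
            continue
        print(f"area={region.area}  "
              f"length={region.major_axis_length:.1f}  "
              f"width={region.minor_axis_length:.1f}  "
              f"angle={np.degrees(region.orientation):.1f}")

    # A real pipeline would then keep only the regions whose centroids match
    # the positions of organisms identified on planktonportal.
    ```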

    PS: regarding orientation, of course there is a 180° uncertainty when it is measured automatically. The algorithm can tell which is the longest direction of the organism and measure its angle relative to the vertical, but it cannot tell the "head" from the "tail". Yet it would be enough to detect strong patterns and come back to those interesting groups afterwards.
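
    Since such automatic orientations are only defined modulo 180°, the usual way to look for patterns is to treat them as axial data: double the angles, then apply ordinary circular statistics. A small, self-contained sketch with made-up numbers (one common approach, not necessarily what the team will use):

    ```python
    import numpy as np

    # Made-up orientation measurements, in degrees, defined only modulo 180°
    # because the detector cannot tell head from tail.
    angles = np.radians([10, 15, 170, 8, 175, 12])

    # Double the angles, then compute the mean resultant length R.
    doubled = 2 * angles
    R = np.hypot(np.cos(doubled).mean(), np.sin(doubled).mean())
    mean_axis = np.degrees(np.arctan2(np.sin(doubled).mean(),
                                      np.cos(doubled).mean())) / 2 % 180

    # R near 1 means a strong common orientation; R near 0 means random.
    print(f"R = {R:.2f}, mean axis = {mean_axis:.1f}°")
    ```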

    We debated this at length before the reboot of PP with the Med dataset. I may have forgotten other arguments. Jessica, Cédric, please chime in.

  • Quia by Quia

    Speaking of strengths and weaknesses, I'm curious how the data coming out of Plankton Portal compares to the algorithms developed for this competition: https://www.kaggle.com/c/datasciencebowl I think the number of images classified by PP is already larger than the data set that was given to train with (30,000 individually cropped organisms). I wonder how much more accurate they could be with more data...

    I am rather fascinated by the feedback loops possible with citizen science and machine learning: citizen science classifications feeding neural networks that classify larger datasets, which pass the subjects they can't classify back to the citizen science loop. I know the first part has been done before (Supernova Zoo comes to mind), but I don't think anyone's done the latter.
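
    In outline, the loop could look something like this toy simulation. Every name here (the stand-in classifier, the confidence cutoff, the volunteer step) is a hypothetical placeholder, not an existing Zooniverse API:

    ```python
    import random

    def model_classify(subject):
        """Stand-in classifier: returns a label and a confidence in [0, 1]."""
        return random.choice(["copepod", "larvacean"]), random.random()

    def volunteer_classify(subject):
        """Stand-in for sending a subject back to the citizen-science portal."""
        return "copepod"  # pretend the volunteers reached consensus

    CUTOFF = 0.9  # arbitrary confidence threshold
    new_training_data = []
    for subject in range(20):  # 20 dummy subjects
        label, confidence = model_classify(subject)
        if confidence < CUTOFF:                         # model is unsure...
            label = volunteer_classify(subject)         # ...ask the volunteers
            new_training_data.append((subject, label))  # ...and learn from them
    print(f"{len(new_training_data)} subjects routed back to humans")
    ```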

  • kirstenr by kirstenr in response to jo.irisson's comment.

    Jo, Thank you for such an in depth and informative response. I found your answer really interesting, and I think the work with the particle detector may yield some surprising results!

  • jo.irisson by jo.irisson translator, scientist

    @Kirsten, no problem, my pleasure

    @Quia That's exactly what we are going to do.

    The amount of data on PP is enormous but it actually is only a fraction of what we have. To give you orders of magnitude, 1 second of ISIIS filming the water yields between 100 and 120 potential planktonportal frames. We have hours and hours of ISIIS footage. In the Med, which I know best, we have precisely 93.4 hours, which would yield 16,809,250 frames for planktonportal! And ISIIS has been deployed in the California Current (which you know) but also in the Gulf of Mexico and in the Florida Current (the OSTRICH cruise which was mentioned on the blog).

    Of course we already reduce that number quite drastically by automatically removing frames with nothing on them, or with only small (unidentifiable here) stuff. For the Med data set, that process selected ~92,000 frames out of ~3,000,000 processed from a part of the cruise. That is about a 3% rate, but we were particularly stringent in our selection criteria. If we want planktonportal to be our only source of data, we need to provide you with exhaustive data, to minimise the number of organisms missed, and that would probably mean bumping this to 10% (but then you'll see many more frames with just one small piece of junk 😉 ). That's 1.6 million frames for the Med dataset alone! There's probably 3 to 5 times that between all the ISIIS cruises.
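
    The pre-selection step could be as simple as the sketch below: keep a frame only if it contains at least one bright object above a minimum size, with the size cutoff being the knob that moves the selection rate from ~3% towards ~10%. The library and thresholds here are illustrative assumptions, not the actual code:

    ```python
    from skimage import filters, io, measure

    MIN_AREA = 200  # pixels; lowering this keeps more frames (and more junk)

    def keep_frame(path):
        """Keep a frame only if it contains at least one bright object
        big enough to be identifiable."""
        frame = io.imread(path, as_gray=True)
        binary = frame > filters.threshold_otsu(frame)
        regions = measure.regionprops(measure.label(binary))
        return any(r.area >= MIN_AREA for r in regions)
    ```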

    So we'll use machine learning algorithms. I won't go into the details of those, but they all have two things in common: they need a large set of example images to be any good, and they are never perfect. So we need a very large sample of correctly identified images, both for the algorithm to learn from and to evaluate its performance (by running it blind on a set of data where we already know the answer). Images labelled on planktonportal are perfect for this because the identifications are likely to be much more accurate than what we get when we put a (biology-trained) student in front of a computer 24/7 (which is the usual process 😉 ), because they are confirmed by several people. So... thanks! 😉
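
    Those two roles of the labelled images (training, and blind evaluation on a held-out part where the answer is known) boil down to something like this sketch. scikit-learn and the random data are stand-ins; the real features would come from the particle detector and the real labels from planktonportal:

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Stand-in data: in reality X would be per-organism measurements
    # (area, length, width, ...) and y the planktonportal labels.
    rng = np.random.default_rng(0)
    X = rng.random((500, 3))
    y = rng.integers(0, 4, 500)  # four made-up categories

    # Hold out 30% of the labelled images to evaluate the algorithm "blind".
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    print("blind accuracy:", accuracy_score(y_test, model.predict(X_test)))
    ```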

    Finally, even the best algorithms are far from perfect. To give you an idea, with the algo we usually use at my lab, trained with an extensive dataset, we get 50 to 60% correct identifications on average. The best algorithms in the kaggle competition (which were extremely complex) ended with an 85% average correct recognition rate. You, collectively, are 100% accurate (OK, let's say 99% to be fair). I've never thought about giving the problematic images a spin here, at planktonportal, but that is a very interesting idea. I would need to look into it more, because you may very well end up with a lot of blurry, noisy, shapeless stuff to identify, but that is a very good suggestion.

    And now I'm thinking that I need to make that into a blog post!

  • yshish by yshish moderator, translator in response to jo.irisson's comment.

    Wow. This definitely deserves blog attention! 😃 There have been many users interested in such details (including me). Thanks.

    (Great to know we're not going to run out of data 😄)

  • cguigand by cguigand scientist, admin

    Yes! This would be a great blog post, J.O.! This is a great explanation of why PP exists and why we need all this hard work from all of you guys to improve classification methods. I just watched the new sci-fi movie "Ex Machina" and realized that A.I. is scary!!! 😃

    Regarding the earlier discussion on orientation: I was at first hopeful it would give us information that could be difficult to extract otherwise. The problem is that it makes the interaction with PP a bit more "complicated" for new users, and we would rather not scare them off after they have used PP for only a few minutes :( I know that you guys are top notch and not scared off by a few more complex tasks, but we need to harvest the data from the not-so-committed users as well.
    So, like J.O. mentioned, there are ways around it, so we do not lose much and you guys can classify even faster!!!
    Thank you all! You rock...
    Cedric

  • Quia by Quia

    @jo.irisson Thank you for all the details, it's fascinating to hear about all the work that goes in before we ever see the data, and plans for the future!

    Are we really that accurate? I know sometimes I'll go to discuss an image after classifying because I wasn't sure, and there'll be a post (most likely by yshish!) with a different species ID. I bet yshish is 99-100% accurate; I don't know about the rest of us. 😉

  • mkmcguir by mkmcguir

    I am an English teacher. I doubt that I am anywhere close to that accurate. However, once you can describe in words why things are what they are -- fins vs. legs, round vs. square, etc. -- I think you can get to about 65%. I also think @yshish is super accurate -- she is the one who told me to look for eyelash-like tentacles for #solmaris. 😃 I do love the explanations, also.

  • jo.irisson by jo.irisson translator, scientist

    No single person is 100% accurate (even the extraordinary @yshish !). Actually, there was an experiment in which taxonomists were provided with samples to identify, and the best they got to was 95% (those were difficult species-level identifications, but still, they were experts). However, the strength of plankton portal (and zooniverse in general, I think) is that several people need to agree before an image is actually considered classified. The threshold on PP now is 5. When five people say the exact same thing about an image, we can be pretty confident that it is right.
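
    Mechanically, the retirement rule described here amounts to something like the sketch below (illustrative only; the actual Zooniverse retirement logic has more rules than this):

    ```python
    from collections import Counter

    THRESHOLD = 5

    def consensus(votes):
        """Return the agreed label once THRESHOLD people say the same thing,
        or None if there is no such agreement yet."""
        if not votes:
            return None
        label, count = Counter(votes).most_common(1)[0]
        return label if count >= THRESHOLD else None

    print(consensus(["copepod"] * 5 + ["shrimp"]))  # copepod
    print(consensus(["copepod"] * 4 + ["shrimp"]))  # None
    ```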

  • DZM by DZM admin in response to jo.irisson's comment.

    However, the strength of plankton portal (and zooniverse in general, I think) is that several people need to agree before an image is actually considered classified.

    Can confirm, this is how Zooniverse works. πŸ˜ƒ

  • Quia by Quia

    Every project has its own 'finished' criteria; it's always interesting to hear the details of them.

    Five's a pretty small number! If you've figured this out already by looking at the California data set, I am curious about the distribution of classifications on finished subjects (5, 5-1, 5-2, 5-2-1, 5-4-1, etc.), and at what point you cut off images that are too uncertain. If you haven't figured it out yet, I can wait for the paper. 😉
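
    For what it's worth, tallying those patterns from raw votes is straightforward; here is a small sketch with made-up data (not the actual California results):

    ```python
    from collections import Counter

    # Made-up votes for three retired subjects.
    subjects = [
        ["copepod"] * 5,
        ["copepod"] * 5 + ["shrimp"],
        ["copepod"] * 5 + ["shrimp", "larvacean"],
    ]

    # Summarise each subject's votes as a sorted count pattern like "5-1".
    patterns = Counter(
        "-".join(str(n) for n in sorted(Counter(v).values(), reverse=True))
        for v in subjects
    )
    print(patterns)  # Counter({'5': 1, '5-1': 1, '5-1-1': 1})
    ```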

  • danita_maslan by danita_maslan

    This is a great improvement. I printed out the guide so I can refer to it while looking at the images. Maybe I'll get to the point where I won't have to double check. 😃

  • yshish by yshish moderator, translator in response to danita_maslan's comment.

    Hi,

    Glad to hear that you find it helpful. Actually, we're working on some little changes there, so check the online version as well so you don't miss them! 😉

    Thanks,

    Zuzi
