data mining the dharma overground and kenneth folk dharma forums; calling software developers, natural language processing, and statistics people

I can’t find it now, but a few months ago a post on Insane Brain Train reminded me of the meditation practice journals on the Dharma Overground and Kenneth Folk Dharma sites.

There are threads on those forums where people journal their meditation progress over hundreds of posts. All issues aside, it’s probably the largest repository of “hardcore” meditation practice data available. (If you disagree, let me know!)

Since the inception of this blog, I’ve of course been thinking about how to be as empirical and falsifiable-model-driven as possible, and this dataset represents one possible means to that end.

I expect almost all of the meditators on the forums are doing Mahasi-style noting, so generalizing beyond that would be a bit of a challenge. But I do agree with Daniel Ingram that there is a huge (though not complete) overlap in the stages that meditators go through across different meditation styles and traditions. So there is possibly, probably, some signal to be gleaned here.

Now, I can and may do some or all of this eventually (I did a bunch of these puzzle pieces, over and over again, during my PhD), but I could use your help, and it would be my pleasure if you beat me to it:

  • Scrape these forums into XML or a DB, and weed out non-practice-journal threads and non-practitioner posts from the threads that remain. Or even just download the forums into an archive that we can back up. Or recommend a scraper library to me/us. Or hack together some example code that will speed me/us up.
  • Clean stuff up (e.g. bag of words, among other things) using something like NLTK, possibly as part of the above bullet.
  • Grab some models from Natural Language Processing-land or make up some really simple stuff, pick an appropriate statistical test, correct for multiple statistical tests, and demonstrate statistically significant… stuff. (Suggestions welcome!)
  • For example, as a first pass, I have some expectation that meditators will use some words with greater or lesser frequency the farther along they are. It might be very noisy.
  • It may be necessary to iterate on higher-level observational codes to find signal, and that would be very labor-intensive.
  • If this moves along, I would put a lot more time into building up some more elaborate falsifiable hypotheses about what I think is going on and what we think we can actually get out of this dataset.
  • I also have my eye on the buzz-phrase “Intensive Longitudinal Methods,” but, again, I’m totally open to ideas for the best way to frame this investigation.
  • Donate money to fund me, or a developer, or a stats guy to do some of these pieces.
  • Affiliate me as a postdoc with your lab so I would be eligible for more grants. You wouldn’t even have to pay me if you basically just left me alone. 🙂 I might even be willing to do stuff on your critical path if you paid me. 🙂
  • Hunt down grants that I or someone else could apply for. For example, nothing at the Mind and Life Institute fits my current situation.
  • Help me write a grant or write me into a grant or let me write myself into your grant.
  • In any case, the overarching goals are to describe normative progress and to look for clues on how to safely accelerate that progress.
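
As a rough sketch of the word-frequency first pass mentioned in the list above, something like the following could be a starting point. It uses only the Python standard library rather than NLTK, and everything in it is an assumption for illustration, not a settled method: the function names, the crude regex tokenizer, the split of a journal into "early" and "late" posts, the two-proportion z-test, and Benjamini-Hochberg as the multiple-test correction. It also treats every word token as independent, which real forum posts certainly violate.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (a crude bag of words)."""
    return re.findall(r"[a-z']+", text.lower())


def two_proportion_p(k1, n1, k2, n2):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (k1 / n1 - k2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0  # largest rank whose p-value is under its threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected


def compare_word_use(early_posts, late_posts, min_count=5, alpha=0.05):
    """List words whose relative frequency differs between two sets of posts.

    Returns (word, early_count, late_count, p_value) tuples, sorted by word,
    for words that survive the Benjamini-Hochberg correction.
    """
    early = Counter(t for post in early_posts for t in tokenize(post))
    late = Counter(t for post in late_posts for t in tokenize(post))
    n1, n2 = sum(early.values()), sum(late.values())
    if n1 == 0 or n2 == 0:
        return []
    # Skip rare words: the normal approximation is poor for tiny counts.
    words = sorted(w for w in set(early) | set(late)
                   if early[w] + late[w] >= min_count)
    pvals = [two_proportion_p(early[w], n1, late[w], n2) for w in words]
    keep = benjamini_hochberg(pvals, alpha)
    return [(w, early[w], late[w], p)
            for w, p, k in zip(words, pvals, keep) if k]


if __name__ == "__main__":
    # Made-up toy posts, only to show the call; not real forum data.
    early = ["pain pain pain itching"] * 50
    late = ["calm calm calm itching"] * 50
    for word, c_early, c_late, p in compare_word_use(early, late):
        print(word, c_early, c_late, p)
```

With real journals, the hard part is upstream of this: deciding what counts as "early" versus "late" for a given practitioner, and keeping one prolific poster from dominating the counts.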

Anyway, this is just one of the things I’m toying with. But this data is there, and there might be usable signal. I don’t think I’m at the point where I’m ready to start collecting or collaborating on (a) designer dataset(s).

If you need a project, say, for school, please see if there’s something you could bite off in there. Depending on where I’m at with other stuff, I would, at minimum, be your cheerleader, if not a lot more.


