(Above illustration: some of Tolkien’s Elvish source material)

I found this notification a few days ago from the SFWA, and I’ll share it in its entirety here:

Dear SFWA Member,

Opening up the Atlantic article yesterday was a shock. So many of us scrolled down to search the Library Genesis data set for our own names. According to the article, Meta (Facebook) used millions of pirated works to train its AI. I found two of my works, and started searching for other SFWA members as well. That little blue box has been all over social media this morning. As the Atlantic notes, “millions of books and scientific papers are captured in the collection’s current iteration.”

Personally, I did not give permission for my work to be used. Did you? SFWA’s number one principle in regards to AI is that Creators must be compensated for the use of their work. If you were not compensated, what can you do? We recommend you follow Author Guild’s list of actions, including protecting your work. There are other actions that may fit your personal circumstances as well.

As an organization, SFWA will continue to fight for our principles. Writers must be paid, credited, and protected, following expected norms. We will follow up with more information as we investigate further and take next steps.

Thank you,
Kate Ristau
SFWA Board President

Atlantic Article: https://archive.ph/99Yum

Well, I’m human. I went to look at my own works.
Results of my search: https://libgen.is/fiction/?q=Karen+Myers (10 works). Try it for yourself.

No question, my literary work has all been sucked into the maw of this particular AI.

And you know what I think?

So what?

Unless LibGen or its AI-sourcing successors and equivalents can churn out my individual works in the same sequence of words and structure as the sources, and then want to offer them for sale as if they had directly republished them, why would I care? Shaking random words (metaphorically) out of an upside-down, chopped-up dictionary doesn’t help anyone look up definitions, and it doesn’t reconstruct the form of the original.

I am unmoved by the shrieks of the SFWA. In fact, I’m genuinely amused, for very silly reasons.

You see, for one of my series, I actually commissioned chunks of four constructed languages (“con-langs”) (like, say, Dothraki), as well as a few more language fragments, to “label” (represent) cultural artifacts that defined individual cultures. (I wrote about this in detail starting here, and continued in related posts.)

As an example, for a vaguely Mongolian fantasy culture, my characters don’t sleep in “yurts”: that word would tie them too closely to a real-world item and break the illusion of a created world other than our own. Instead I use an invented term for the analogous item, with a flavor of, say, pseudo-Arabic, to subtly convey to my readers the sense of cultural terms for a migratory tribe.

And now all of those invented conlang terms will join Dothraki and the languages of, say, Tolkien’s Lothlorien as “real” words in the AI source/training material stew, no doubt to show up sometimes in spelling corrections, explanations, and terms proffered by helpful AI editors.

I find that hilarious. I’d rather laugh at the unavoidable corruption of the source material feeding the AI systems than complain about what it does with the fragmented material. If we’re going to shrug at the inclusion of, say, Tolkien’s mathom as a source word for AI (from non-fiction articles as well as his actual work), then why shouldn’t my modest invented terms live on in this way, to the confusion of posterity?

What’s your outrage/amusement/indifference setting for news of this sort?

17 responses to “Kerfuffles about AI Training materials”

  1. The database has one of mine. I confess what’s irritating about it is that its inclusion suggests it’s on a pirate site somewhere.

  2. It’s got a bunch of mine. I do hope (cough, cough) that genetically engineered telepathy doesn’t start showing up in non-SF/F works . . .

    1. Good luck with that. I’ll raise you 10 made-up-words for your bio-scientific technology.

  3. The only one they have of mine is my first published work, which was on permafreebie status for around seven years, from late 2017 to late 2024. Can’t even complain about piracy. Rather tends to contribute to the impression that I am pointlessly dropping rose petals down the Grand Canyon, but I’m the only person to blame for that. 😀

  4. If it won’t endanger my legal rights, and won’t cause headaches with being in Kindle Select/KDP, then I’ll wait and see. If it could be considered allowing use of my IP, or “publishing through another outlet,” then I might lawyer up.

    I’ve got a few other, larger fish to fry at the moment.

  5. Well, after pursuing this, I know at least one author who will be amused to learn they are now a Novelist with two published novels (of the exact same name) under their belt.

    1. Those inconvenient replacement edition releases, no doubt. 🙂 Welcome to the world of real data: duds, squibs, trash, and all.

      1. B-b-but I thought AI was immune to the laws of computers???

        I guess Garbage In Garbage Out didn’t get the memo. 😎

  6. Geez, I’m kinda disappointed that the only thing of mine in that database is the Posleen story I wrote for an anthology that fell through, and was subsequently published on the freebie disc in the fourth Posleen book.

    But seriously, I’m still rather bemused by this whole kerfuffle about training data for AIs, which seems to have the underlying assumption that it needs special affirmative permission, rather than just the ordinary permission one gains by legally buying a digital copy of the book (assuming it’s still under copyright). All of us have learned our craft by copious reading of prior art, whether stuff that’s fallen into the public domain, stuff we bought copies of, stuff we borrowed from a library, or just stuff we read off the Internet.

    1. This tracks with my suspicion that the SF/F part of the dataset might have been formed by hoovering up freebies or books on sale – my only book found in the search was a permafreebie for several years.

      1. Free pirate sites, I suspect. I doubt they paid Amazon for 73 of mine, and while I occasionally go free with some of the early stuff, not these.

        1. Yikes, well there goes that theory!

    2. Yeah, I’m kinda at a loss here. If they are (re)publishing the books I could see an issue. Using a pirate site could be an issue, but how many of the books were on sale at some point for 99¢ or even free? I have several thousand eBooks and have barely spent $200 over the years. Do they need to buy special permission?

      1. That’s the argument that opponents of AI are using — that machine learning is so different from human learning that the ordinary permission we humans get by buying a book is not sufficient, and the operators of an AI need to obtain special affirmative permission for every item that is not in the public domain.

    3. It’s the piracy that’s an issue.

  7. Do I sense a bit of fear here that the majority of their members may be replaced in the marketplace by Grok? (An apt example here of a con-lang word becoming widespread in the culture at large.)

    The “best” SF that they showcase these days could easily be generated by a not overly bright (and somewhat hallucinatory) AI – and would likely sell better…

  8. The database has none of mine, and I’m sure that’s because I’m WAY off down the algorithm in the sub-sub-basement of the Amazon server. There is probably actual AI generated trash higher up the rankings than me.

    So I’m a little put out, if I’m honest. ~:D

    Still, it is fun to see the SFWA, the SCIENCE FICTION Writers of America, getting their knickers in a twist over modern technology.

    The truth of the matter is, if your work is for sale on Amazon in e-book form, you -know- they scraped it to train AI. And they sold it to other people to train AI too. Because of course they did. Tech bros restrained by morality? Don’t make me laugh.

    I’m not saying this is a good thing, but I am saying it is certainly a thing. SFWA is tilting at windmills, playing the outrage card (again) to get members.

    As for me, would I like to be paid every time they run my books through an AI model? Yes. SHOULD I be paid? Yes, I should. They should damn well buy a copy, at least.

    But will I ever be paid? No. No way. Not a chance in Hell. Never going to happen. Because this is the race for Artificial General Intelligence, my friends. They think The Future!!! is at stake. It’s the race to build Skynet for real.

    On the bright side, the people running this push are the Post Modernist “Humans are meat robots” guys. The ones who think ant hills are the ultimate and perfect social order. I don’t think they’re going to get very far down this road before they hit the wall.
