The Changing Nature of the Internet
Why Google isn't AltaVista.
My response to a back-and-forth with @Matt in last Friday’s open forum got too long, so I decided to pitch it as a front page post.
By way of background, I’ve been an internet user since before there was an internet. I’ve participated in dozens of social networking systems over that time, most long gone, and watched the legal battles that emerged almost from the beginning, as well as the legislation designed to head off some of those conflicts. I’m an engineer by trade and have developed firmware and software applications since 1980; nowadays those applications are networked or connected in one fashion or another.
I kicked off Friday morning’s conversation by observing:
I can’t believe how naive the Supreme Court’s opinion on algorithms is. “If your product does harm but you don’t deliberately program it to do harm and instead use machine learning, you can’t be held responsible.” What could go wrong?
That sparked a lively back-and-forth that I won’t rehash here, but @Matt posted this after I logged off for the night:
I don’t see what you call the “changing nature of the internet”. If you could assist me in seeing what you mean I’d appreciate it.
So first, some history. When Section 230 was being crafted, the Internet had existed for a while, but the World Wide Web was hardly a blip. The act passed in 1996, so it was being drafted and debated at least as far back as 1994, perhaps earlier. The Web was created in 1990 and opened to the public in 1991, and the first browser to make its way into the hands of the general public (as opposed to geeks like me) was Netscape, released at the end of 1994. So when I say the internet has fundamentally changed since then, I’m not talking about the specifications or the infrastructure. I’m talking about the fact that so much of what billions of people say, think, and do is online, along with companies, governments, and any other entity you can think of. More important still is the way all of that information is used to directly impact our lives.
Let’s compare AltaVista to Google. The former was one of the first web-crawling search engines and the first to gain widespread use. Simplifying, it let you enter search terms and returned the pages that matched. You could add operators like AND, NOT, and NEAR, but fundamentally it was a predictable machine. Today’s Google, on the other hand, does nothing so predictable. The algorithms it runs on were developed by machine learning, and there is no human-understandable reason why particular (non-sponsored) pages rank at the top. In the end, the algorithms predict that more people are likely to click on link A than link B, so A goes to the top. Even when things are deliberately excluded by human decision (“don’t link to graphic violence,” “don’t link to COVID misinformation”), the method is to train machine-learning systems to recognize what falls under those categories.
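The difference can be made concrete with a toy sketch. Everything below is invented for illustration — the documents, the query, and especially the “learned” weights, which stand in for what training on billions of logged clicks would produce:

```python
# Toy contrast: AltaVista-style boolean retrieval vs. a learned
# click-probability ranker. Documents and weights are invented.

docs = {
    "A": "supreme court section 230 algorithms ruling",
    "B": "court rules on internet liability law",
    "C": "recipe for sourdough bread",
}

def boolean_search(required, excluded=()):
    """AltaVista-style: return every doc containing all required
    terms (AND) and none of the excluded terms (NOT). Predictable."""
    hits = []
    for doc_id, text in docs.items():
        words = set(text.split())
        if all(t in words for t in required) and not any(t in words for t in excluded):
            hits.append(doc_id)
    return sorted(hits)

# Stand-ins for machine-learned weights; no human wrote these values
# and no human can explain any individual one.
learned_click_weight = {"court": 0.8, "algorithms": 1.7, "law": 0.4}

def learned_rank():
    """Google-style (grossly simplified): score each doc by predicted
    click appeal and sort, highest first."""
    def score(doc_id):
        return sum(learned_click_weight.get(w, 0.0)
                   for w in docs[doc_id].split())
    return sorted(docs, key=score, reverse=True)
```

The point of the sketch: `boolean_search` can be audited line by line, while `learned_rank` is only as explainable as its weight table — and a real one has billions of parameters.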
What does this mean in a practical sense? Take the machine-learning-based job-applicant screener Amazon built in 2014 and used for several years. They took exemplary employees, gathered their application materials (résumé, background check, test results, etc.), and used them to train a machine-learning system, hoping to eliminate bias by letting a neutral algorithm pick the best candidates. It turned out to be dramatically worse than human screeners when it came to bias. But the data contained nothing about age, sex, race, or religion, so how could it produce biased choices? Well, even a human could infer a lot from that information. Did the applicant attend a historically black college? Did they play on a college championship women’s soccer team? Machine learning takes that to a whole new level. It was essentially asked to return the applicants most like the existing star performers, and the algorithm found all kinds of ways to determine and rank “alikeness.”
Don’t think this can affect you? What categories of people are more likely to embezzle from a company? Who is more likely to miss days of work? Who is more likely to quit for a better job? If you have a close relative addicted to drugs, are you more likely to fall behind on your car payments? How might caring for a parent with dementia affect your work performance, financial health, or willingness to go on business trips if you got a promotion? Companies won’t ask “Are you a white male from a privileged background?” (because that is who is most likely to embezzle), or “Is your wife thinking about getting pregnant?”, or “How careful are you with your birth control?”, or “How old are your parents?” And they won’t task the algorithm with sussing that information out, even indirectly. But they don’t have to! This kind of information gets incorporated by secondary, tertiary, or even more indirect routes, and there is no smoking gun. Machine-learning algorithms aren’t logic trees, at least not in the way we’re used to. You can’t look at the final algorithm and deduce that it is screening for race or a relative’s drug use. It isn’t. It’s looking for patterns of words and data.
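Here is a minimal sketch of how a proxy sneaks in. The résumés, labels, and scoring method are all fabricated for illustration — this is not Amazon’s data or algorithm — but the mechanism is the real one: the training set contains no protected attribute, yet the most heavily penalized word the model learns is a proxy for one.

```python
# Toy proxy-bias demo: no protected attribute appears in the data,
# but the learned weights end up penalizing a proxy word anyway.
from collections import defaultdict

# (resume words, labeled a "star performer"?) -- the labels reflect a
# historically biased workforce, not merit. Entirely fabricated data.
training = [
    ("captain chess club stanford", 1),
    ("intern java backend stanford", 1),
    ("captain womens soccer mit", 0),
    ("womens coding club president", 0),
]

def train_word_weights(data):
    """Naive per-word score: fraction of star examples containing the
    word minus fraction of non-star examples containing it."""
    pos, neg = defaultdict(int), defaultdict(int)
    n_pos = sum(label for _, label in data)
    n_neg = len(data) - n_pos
    for text, label in data:
        for word in set(text.split()):
            (pos if label else neg)[word] += 1
    vocab = set(pos) | set(neg)
    return {w: pos[w] / n_pos - neg[w] / n_neg for w in vocab}

weights = train_word_weights(training)
# "womens" never co-occurs with a star label, so it gets the most
# negative weight -- a sex proxy nobody programmed in, and nothing in
# the final weight table names it as such.
```

Note that neutral words shared by both groups (“captain,” “club”) come out with weight zero; only the proxy is punished, and only because of who was labeled a star in the past.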
Primarily thanks to ChatGPT, all of this has finally reached the public’s awareness. But it has been here for decades, and it is growing exponentially. The newest Apple processors carry a remarkable number of cores: a dozen or more CPU cores, eight or more GPU cores (cores just for graphics), plus a 16-core Neural Engine optimized for machine learning.
Why so many? When you talk to Siri or type in a question, an Apple iPhone or Mac does most of the processing on the device itself, out of privacy concerns. Alexa sends everything up to the cloud for processing, where it can be correlated with everything else the internet (well, Amazon, Google, etc.) knows about you. Siri can only correlate a request with what your device specifically knows about you, so it has to work a lot harder just to produce poorer results.
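As a rough sketch of that tradeoff — and the signal names below are invented placeholders, not actual Apple or Amazon data categories — the difference is simply how much of you is in scope when the assistant interprets a request:

```python
# Toy sketch of on-device vs. cloud assistant context.
# Signal names are invented placeholders for illustration only.

DEVICE_SIGNALS = {"contacts", "calendar", "recent_apps"}   # stays on the phone
CLOUD_SIGNALS = DEVICE_SIGNALS | {
    "purchase_history", "search_history", "ad_clicks",     # vendor-wide view
}

def available_context(on_device):
    """Signals an assistant can correlate a request with: a Siri-style
    on-device model sees only local data, while an Alexa-style cloud
    model sees everything the vendor has aggregated about you."""
    return DEVICE_SIGNALS if on_device else CLOUD_SIGNALS
```

The on-device set is a strict subset of the cloud set, which is the whole story: better privacy, weaker correlations.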
We have no idea where all this information is being used, but I can tell you one place you might not expect. Ten years ago, at the largest health-informatics trade show, a few small but very well-appointed vendor booths showed up with scant, vague signage. Nonetheless they drew a steady flow of traffic. It took some digging, but I finally figured out what they were offering: machine learning and other techniques to help hospitals identify their most lucrative patients versus the ones they were most likely to lose money on. That’s just hospitals, and that was ten years ago. Imagine what your bank is doing, your credit card company, your employer, heck, everybody!
Venturing into more paranoid territory, I’ve been following two things in China that I suspect are driven more by machine learning than is publicly acknowledged. The first is the Chinese Social Credit System. Officially it penalizes people for things like having their dog off a leash or cheating on government job exams, but it is widely known to also be used to hound and harass people who speak out against the government. It would not surprise me at all if machine learning were being used to identify undesirables and deduct points from them. An even more insidious use may be occurring in Xinjiang, where Uyghurs have been put into re-education camps by the millions. One repeating theme from those affected is the arbitrariness: state agents suddenly show up and haul people away, seemingly without any warning signs. It may be nothing more than coerced informants or a deliberate policy of random intimidation, but I have to wonder whether it is a Chinese experiment in Precrime.
The Supreme Court was pleased with itself for dodging a bullet by declining to address Section 230. But in doing so, it appears to have handed a blanket tort defense to anyone who interposes an algorithm between themselves and the harm they do. I expect that won’t stand, but it is just another example of how clueless and out of touch this court is.