When a computer program is trying to pull meaning from text, it helps if it can tell how alike two words are. When the words look similar, this is easy: just by counting shared letters, a computer can get a good hunch that “medicine” and “medical” are in the same ballpark. But what about “apple” and “strawberry”? The average six-year-old knows these two words belong to the same category, but a program looking for overlapping letters will be stumped.
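For concreteness, here is one toy way a program might count shared letters: the Jaccard overlap of each word’s character set. This is my own illustration of the naive approach, not a standard metric.

```python
def letter_overlap(a: str, b: str) -> float:
    # Fraction of letters the two words share: |A ∩ B| / |A ∪ B|.
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb)

print(letter_overlap("medicine", "medical"))   # high (~0.62) -- looks related
print(letter_overlap("apple", "strawberry"))   # low  (0.2)   -- looks unrelated
```

The first pair scores high, the second low, even though apples and strawberries are obviously more alike than medicine and apples.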
To fix this problem, computer scientists came up with a clever way to represent a word as a list of numbers called a word embedding. The list of numbers, or vector, is generated by reading in a giant body of text (say, all of Wikipedia or the Apple Terms of Service Agreement) and looking at the context that appears around each word. Two words that show up in similar contexts will have similar vectors. So for our apple/strawberry example, our embeddings would reflect the fact that both words are likely to show up in a context like “I ate an ___” or “Why are we always out of ___ flavored seltzer?” The distance between the two vectors should be relatively small.
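To see what “similar vectors” means in practice, here is a sketch with made-up three-dimensional vectors (real embeddings like word2vec or GloVe have hundreds of dimensions learned from a corpus). Cosine similarity is the usual way to compare them: near 1.0 means the vectors point the same way, near 0 means unrelated.

```python
import numpy as np

# Hypothetical embeddings, invented for illustration only.
vectors = {
    "apple":      np.array([0.9, 0.1, 0.8]),
    "strawberry": np.array([0.8, 0.2, 0.9]),
    "medicine":   np.array([0.1, 0.9, 0.2]),
}

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine of the angle between the two vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(vectors["apple"], vectors["strawberry"]))  # high: similar contexts
print(cosine_similarity(vectors["apple"], vectors["medicine"]))    # low: different contexts
```

Unlike letter counting, this measure has no trouble putting “apple” and “strawberry” in the same ballpark.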
Word embeddings work by looking at the words or phrases that commonly appear near the word you are trying to encode. This works well in most cases but can have sneaky unintended consequences. A professor of mine gave the example of the embeddings for “doctor” and “nurse.” If you compared those words to the embeddings for “man” and “woman,” “doctor” would be closer to “man” and “nurse” would be closer to “woman.” This could become a problem if you are using word embeddings in a program to filter resumes or evaluate loan applications. By trying to select for one attribute, you may unintentionally select for others, like gender. In the worst case this is both unethical and illegal.
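You can check for this kind of association yourself. The sketch below assumes the gensim library and its bundled GloVe download; the exact numbers depend on which model and corpus you use, but the pattern my professor described tends to show up.

```python
# Compare occupation words against gendered words in a pretrained model.
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")  # pretrained GloVe vectors (downloads once)

for occupation in ("doctor", "nurse"):
    print(
        occupation,
        "vs man:",   round(model.similarity(occupation, "man"), 3),
        "vs woman:", round(model.similarity(occupation, "woman"), 3),
    )
# If the training text associates occupations with genders,
# those associations are baked into the similarity scores.
```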
HAL from 2001: A Space Odyssey is creepy because it reasons in a way that feels foreign, applying logic precisely and consistently to reach stark, cold-hearted conclusions that we squishy, thinking, feeling humans could never accept. But given the prevalence of machine learning systems built on large amounts of user-generated data, that fear should probably be secondary to a more immediate one -- that AI will look very much like us.