# encryption - Cryptography: hash of string correlates to hash of substrings of string

I'm processing human-written text documents and I do a dictionary based string matching to find specific strings in the document.

For security reasons, I can not input the document in unencrypted text format, but rather in a strong encrypted format. I can not allow developers working on the unit access the unencrypted input string, but they can access the matched strings.

To make it clearer:

Dictionary = {"Apple", "Apple pie", "World War II"}

Document1 = "apple is my favorite fruit." -> Should match "apple"

Document2 = "apple pie was invented during world war II" -> Should match "apple pie" and "world war II"

So the string matching is case-insensitive and only matches longest occurrence (I'm using Aho-Corasick).

The options I see are:

1. Find an encryption function F where F("ABCD") = F("A")+F("B")+F("C")+F("D") = F("AB")+F("CD").

2. Chunk the document by whitespace, hash both the chunks and the dictionary and then look for similarities. (complicated)

3. Make a separate unit responsible for encryption and string matching with obfuscated code. (most obvious way)

As I'm not good at cryptography, I might be missing something here. Can anyone see a better way of achieving this?

Firstly, any encryption function that satisfies your condition:

F("ABCD") = F("A")+F("B")+F("C")+F("D")

is inherently not strong encryption (assuming + here means concatenation). The problem is that this condition implies that F("A") is invariant, which means that it the encryption is equivalent to a simple substitution cipher, vulnerable to frequency analysis.

A bigger problem however is that any solution is going to be vulnerable to a dictionary attack. If you can determine that a word in the unknown document is a particular word in your limited dictionary, then you can also search for it in a complete dictionary - in this way, you can quickly discover the entire plaintext.

If I understand correctly, the goal is to prevent someone who has physical access to the machine and access to the processes running on it from being able to determine the contents of the document. I don't think that is possible if the "bad guy" is extremely dedicated. He will be able to extract key information necessary to decrypt the document from the process space. As a general rule, if the attacker has physical access, then there is not a lot that can be done.

If the program can match parts of text of a document to known text, then the attacker will be able to observe that and extract the information. Obfuscation of the code may make it harder, but if the information is valuable enough, then the attacker will just work harder.

It seems that it would be better if the server can be run in a secure fashion and limiting physical access as much as possible. There are, of course, still a lot of issues involved (code would need to be audited for malicious code for example since the developers are apparently not trusted) but that at least gets you to a position that has a chance of being defended.

Edit A couple thoughts about encryption in the context of what you are trying to do.  If you are using, for example, AES encryption in CBC (cipher block chaining) mode, then it is not possible to decrypt a single word from the document (assuming the document is encrypted as a whole). Each block of cipher text depends on the preceding block. Thus, it would be necessary to decrypt the entire document up to the point of interest.  In other words, you would have to decrypt the entire document to search it.

Another encryption possibility would be to use AES in CTR mode. CTR mode generates cipher stream (based on the key and some initialization vector) and XORs that against the plain text to produce the cipher text. In this mode, it is possible to decrypt a portion in the middle of the document without decrypting the previous section. But that is somewhat misleading and a bit of a semantics argument. Even though you don't have to decrypt the preceding section, it is still necessary to generate the cipher stream for the entire document up to the point of interest. And from an attacker's standpoint, that would be the same as decrypting the document since the attacker would have access to the encrypted text (presumably in the situation you describe) and the generated XOR stream, which would yield the plain text.

Your proposed solution #1 is a very very difficult problem - known to be solvable, but almost certainly not worth your while to solve.

The technique you would want for it is Homomorphic Encryption. It was first demonstrated in 2009 by Craig Gentry of IBM that arbitrary computation can be performed without revealing the plaintext.

The state-of-the-art is probably too inefficient for almost all applications - while exponential security can be obtained with "polynomial" computation (which is all the theorists really care about), the polynomial is enormous enough to be not valuable. This might change in the near future.

With that said, I don't see any reason why you can't:

hash each entry in the dictionary
(split each entry on whitespace, multiword entries are tuples of hashes)
split document on whitespace, hash each word
do the matching with the hashes

Essentially, you're matching arbitrary items, not inherently words. The client can produce the words-items map, and pass the items to the server. The server doesn't need to know anything about the items, just that an item from the dictionary appears in the text.