Tries - fluffy leaf

Definition:

A Trie is a static tree-like data structure in which each node has 26 children pointers representing the letters of the English alphabet. The strings or words can be retrieved by traversing down a branch path of the tree. When traversing down a path the count will indicate if the string is a word and count its popularity.

1
2
3
4
5
typedef struct trieNode
{
    int count; 
    struct trieNode *children[26];
} trieNode;

Advantages

-Unlike a BST, trie’s do not store strings or keys associated with that node, instead its position in the tree defines what the key is linked with.
-Storing strings in a BST is a O(klog(n)) where k is the length of the string and n is the total amount of strings we need to go through. Also need to take into consideration on having to expand our array.
-There are no collisions of different keys in a trie and no need to provide hash functions in a trie.
-Tries use arrays and pointers - hash tables use arrays and linked lists.
-Tries make it easy to search for a subset of elements, similar to BST, each time we traverse down a branch of a tree, we cut out the number of nodes we need to look at.
-If we insert a dictionary of words into a BST in alphabetical order it devolves into a linked list which in turn will affect runtimes.
-If using a BST it would take up extra memory since each node has two child pointers => 16 bytes

Disadvantages

A hash takes up a lot of memory and space.
Some tries can have long chains and prefixes that are not meaningful.

Basic Structure

To start off, the root node of a trie is empty with 26 links (0-25)
to other nodes - one for each possible alphabetical value and so on.

The shape and structure of a trie is always a set of linked nodes,
all connecting back to an empty root node.

The root has an array of 26 pointers that all point to null at first and as the trie grows,
those pointers start to get filled up with references to other nodes and so on. 

        (" ")
        / | \     <---- 26 links 
       A..K..Z    <---- Here, each node of the root are represented
                        as letters however when coding this up, to access 
                        the index, need CHAR_TO_INDEX(x)    ((int)x - (int)'a') 
                        
                        **For example** inserting the letter 'a' from "acquiesce" 
                        will convert it to the value 0. These characters(a,b,c..) 
                        live at those indexs(0,1,2..) since our root node will hold 
                        an array of indexes 0 through 25, and there are 26 possible 
                        slots for the 26 letters of the alphabet.
                        
                        See conversion/reference chart. 
    
An array representation would look like this -

+----------------------------------------------------------------------------------------------+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |... | 25 |
+----------------------------------------------------------------------------------------------+
                                           \ 
                                    +-------------------------------+
                                    | 0 |...| 14 | 15 | 16 |...| 25 | <--- malloced that shit
                                    +-------------------------------+
                                                 /
                                    +-------------------------------+
                                    | 0 |...| 17 | 18 | 19 |...| 25 |
                                    +-------------------------------+
This word spells out "kor"
The rest of the pointers are NULL.
                        
   

The size of a trie is directly connected to the size of the alphabet that is being represented.
Each node contains an array of references (as pictured above) and the characters are defined by its corresponding value, a.k.a index, in the array.
All the descendants of a node have a common prefix of the string associated wit that node.
As a trie grows in size, less work must be done to add a value, since the “intermediate nodes” or branches of the trie have already been built up.

Conversion Chart

+=======================================================================================================================+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
------------------------------------------------------------------------------------------------------------------------+
| A | B | C | D | E | F | G | H | I | J |  K |  L |  M |  N |  O |  P |  Q |  R |  S |  T |  U |  V |  W |  X |  Y |  Z |
+=======================================================================================================================+

Applications

Tries have many uses such as storing an autocomplete dictionary, can be used as a spell checker and store frequency of words and autocomplete senteces. For instance when you type into the Google search bar it will autocomplete words for you based on popularity or perform a text prediction operation that will enable the trie to fill in sentences for you based on what you recently type after a specific word. The three common applications for using a trie; insert, deletion and search.

Text Prediction

Implementing text prediction would look similar to this using this corpus.

I like dancing
I like food
She goes swimming
She likes swimming 

Here’s what the trie looks like after inserting the words from the corpus

I color coded each leaf that correlates to the the words subtree as shown in the next photo. This is somewhat the visual of what the word’s subtree looks like when implementing text prediction.

Insertion

When inserting into a trie you start at the root and check to see if the character corresponding to the index is not null, if it is that means that the word does not exist so you need to malloc for the node until after the last letter of the string.

Example:
    
Word: panic

    (   ) <---- empty root node, malloc space at index 15
            
    (   ) 
        \
         P      <---- continue the process
         |
         a
         |
         n
         |
         i
         |
         c
            

You repeat this process for insertion and if you want to insert another word that begins with ‘p’ you start at the root node, index[15] which correlates with the character p has space so you malloc from that node rather than the root.

Here’s the logic translated into code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
// key here is our string or word
trieNode *insert(trieNode *root, char *key)
{
    int i, len = strlen(key), index = 0;
    
    if (root == NULL)
        return createNode();
        
    trieNode *trace = NULL;
    trace = root;
    
    for (i = 0; i < ALPHABET_SIZE; i++)
    {
        index = tolower(key[i]);
        index = CHAR_TO_INT(key[i]);
        
        if (trace->children[index] == NULL)
            trace->children[index] = createNode();
        trace = trace->children[index];
    }
    
    trace->count++;
    return root;
        
}

NOTE!! CHAR_TO_INT is a defined function that I placed at the top of my program that simply looks like this -

#define CHAR_TO_INT(c) ((int)c - (int)'a')

   Takes the integer value of some char and subtracts it
   by the int value of the character 'a'. If we put 
   quotes around the c, 'c' it would test the character c, 
   not an arbitrary char value which is incorrect.

I use this in many of my programming assignments and have found it to tremendously helpful in scenarios when I am unable to utilize atoi. Special cases in which atoi would not work is calling it on a character rather than a string. For instance, if *str = “123” then atoi(str) would be 123 as an actual integer. However, atoi functionality is unable to individually call 1 as an integer or 2 as a separate integer.

Deletion

When deleting from the trie, first you must go back and check to see if the node exists from the root and if not there is nothing to delete. If the count is > 1 then decrement the value by one. If the node has children just set the count back to 0. Another thing to take into consideration when deleting from a Trie is to go back up the chain towards the root and deleting each node in a bottom up manner (recursion) until you hit a node with count > 0 or a node that has at least one child.

1
2
3
4
5
6
7
8
9
10
    int i = 0;
    if (root == NULL)
      return NULL;
      
    for (i = 0; i < ALPHABET_SIZE; i++)
      destroyTrie(root->children[i]);
    
    root->count = 0;
    free(root);
    return NULL;

The deletion code above is a basic implementation of deleting all the nodes in a Trie. NOTE! This code does not handle special cases.

The logic behind the code is as follows -

The function is recursive so you would want a base case to know when exactly to return.
By doing so you would want to return when the root passed in is NULL -
In order to destroy, you have to loop through each pointer from the root to see if any of the indexes are non null.
From there you continue to jump to next non-null node until all of the children pointersare all null and free back up from there.

Lets portray this -

Some arbitrary trie

        (   )
        /   \
    (   )   (   )
    /
(   )

NOTE!!!
    > Remember that each node has 26 pointers of its own. Keep this in your mind as we move 
    forward with this example. 



        ( * )               Let's use this "*" symbol as an indicator of where we are and which node we are on in the recursive call. 
        /   \               In the first recursive call we check to see if root == NULL. Which in this case, returns
    (   )   (   )           false, so we enter the forLoop.   
    /
(   )


Call#1 has i = 0. 
Make recursive call to enter the next node passing in the next next node, or in other words roots first child pointer 
    
    * - (   )               Let's use this "*" symbol as an indicator of where we are and which node we are on in the recursive call. 
        /   \               In the first recursive call we check to see if root == NULL. Which in this case, returns
    (   )   (   )           false, so we enter the forLoop.   
    /
(   )

This child is A or index '0' in call#2 so this call will return NULL.
This pattern will continue on, and i will increment in CALL#1 until the 
next call finds a nonnull child. 

Once the program hits the next malloced node, i will set back to 0 and restart the whole process. 
Are you starting to see the pattern here? 

If so, great! So how do we know when to start deleting? 

Ahh, so say you continue with the pattern as described above and you hit the very last node, eventually the looping
variable, i, will reach the limit and the loop will break. Once i = 25 and we are outside of the loop and
we are guaranteed that there are no nonnull pointers following the leaf (Hence why it's considered a leaf),
now we can start freeing!!

Once the first leaf has been freed then it will return back to the previous node. 

            (   )               Let's use this "*" symbol as an indicator of where we are and which node we are on in the recursive call. 
            /   \               In the first recursive call we check to see if root == NULL. Which in this case, returns
        (   )   (   )           false, so we enter the forLoop.   
        / \
      X    * <---- now remember, when we return we return back to whatever i was left off in the previous recursive call. 
                   Since it's null from and out this node will get freed once i hits the limit. 
                   
The process repeats itself and is simply a pattern to follow up on, using the logic as applied from above. 

Search

When searching for a string, you have to keep a couple of things in mind.
A. Know how to convert the character in the string to a certain index value. *this is when the CHAR_TO_INT formula comes in handy
B. Know when to exit the loop
C. Know when to continue looping (move forward in the path)

In the code below I had created a string called “s_holder” to handle cases in which the string you’re trying to search for has special cases.

For tackling B and C, you would want to move forward onto the next character in the string that you are trying to search for. Well, how do you access nodes generally? If you recall back in the deletion example, we would loop through every individual node, null or nonNUll, hitting all the letters in the English alphabet. But in this case we would want to look for a specific letter, using the basic format root->children[i], what part here specifies the next letter?

Therefore, supstitute i will an integer value that represents the next character and now you’re telling the program to go to a specific node :).

What happens if that character doesn’t exist? In such case, that implies the string by defualt does not exist within the Trie and will exit out immediately. cough cough return.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
  // iterative search implementation

  TrieNode *search(TrieNode *root, char *str)
  {
    if (root == NULL)
      return NULL;
    if (str == NULL)
      return NULL;
    
    TrieNode *fishy = root;
    int index = 0,i = 0;
    int len = strlen(str);
    char s_holder[1023 + 1] = {};
    strcpy(s_holder,str);

    while(i < len && s_holder[i] != '\0')
    {
      
      if (isalpha(s_holder[i]))
      {
        s_holder[i] = tolower(s_holder[i]);
        index = CHAR_TO_INT(s_holder[i]);
        if (fishy->children[index] == NULL)
          return NULL;
        fishy = fishy->children[index];
      }
      i++;
    }  

    return (fishy->count >= 1) ? fishy : NULL;
  }
  
  
  
 *********************************** 
 ** RECURSIVE VERSION FOR THE LOLS **
 ***********************************
  
  
  IMPORTANT! This is the helper function to the actual search function that would call this. 
  
  TrieNode *helper_searcher(TrieNode *fishy, char *str, int i)
  {
    if (fishy == NULL)
      return NULL;
    if (str == NULL)
      return NULL;
      
    int index = 0;
    int len = strlen(str);
    while(i < len && str[i] != '\0')
    {
      if (isalpha(str[i]))
      {
        index = CHAR_TO_INT(str[i]);
        if (fishy->children[index] == NULL)
          return NULL;
        fishy = fishy->children[index];
      }
      i++;
    }
      return (fishy->count >= 1) ? fishy : NULL;
  }

Time complexity

Worst case runtime for a trie is dependant on how many words the trie contains and the length of each word. In this case the time complexity would be O(m*n) with m being the longest word and n is the total number of words. In the case of insertion all of the cases are O(k) with k being the length of the word you are trying to insert. For deletion and search the runtime would be similar to insertion however their best case would be O(1) runtime in a scenario a user would want to look up or delete a word that does not exist within the trie.

TLDR;

Operations	Worst Case	Best Case	Average Case
Insertion	O(k)	O(k)	O(k)
Deletion	O(k)	O(1)	O(k)
Search	O(k)	O(1)	O(k)

*k represents the length of the string

Glossary and Sources

Search Miss: could not find value for key
Search Hit: able to find value for key
Lexicon: all the valid words in the dictionary
Corpus: body of text
Intermediate: coming inbetween time,place,order and so on. (middle)

information and references came from this resource.