
Grouping Words With Their Counts in Java

In this article, we will see how to split a String input (a sentence or a paragraph) into words and count the occurrences of each word.

Image by Author

The picture above shows our problem definition. Let's see how to solve it in Java.

String text = "We resolve to be brave. We resolve,,  to be good. We resolve to uphold the law according to our oath.";

This is the input paragraph.
  • String textLower = text.toLowerCase();
  • textLower = textLower.replaceAll("\\W", " ");
  • textLower = textLower.replaceAll("\\s+", " ");
  • String[] words = textLower.split("\\s+");
Here, we convert all letters to lowercase using toLowerCase(). Then we replace every character outside [a-zA-Z0-9_] with a space using replaceAll("\\W", " "). Alternatively, we can use replaceAll("[^a-zA-Z0-9]", " "), which also replaces underscores.

Then we collapse each run of whitespace into a single space using replaceAll("\\s+", " "). This takes care of the extra spaces left behind by consecutive punctuation marks or other non-word characters.

Now we split the string on whitespace using split("\\s+") and store the resulting words in an array called words.
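The normalization and splitting steps above can be tried in isolation. The following is a minimal sketch: the class name WordSplitSketch and the helper splitWords are assumptions made for illustration. It also adds a trim() call, which is not part of the steps above, so that input starting with punctuation does not produce an empty first element.

```java
import java.util.Arrays;

class WordSplitSketch {

    // Hypothetical helper: lowercases the text, replaces non-word characters
    // with spaces, and splits the result on runs of whitespace.
    static String[] splitWords(String text) {
        String textLower = text.toLowerCase();
        textLower = textLower.replaceAll("\\W", " ");
        // trim() is an addition to the steps in the article: without it,
        // leading punctuation would leave a leading space and split()
        // would return an empty string as the first element.
        textLower = textLower.trim();
        return textLower.split("\\s+");
    }

    public static void main(String[] args) {
        String[] words = splitWords("We resolve,,  to be good.");
        System.out.println(Arrays.toString(words));
        // prints [we, resolve, to, be, good]
    }
}
```

Because split("\\s+") already treats a run of whitespace as one separator, the separate replaceAll("\\s+", " ") step is left out of this sketch.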
  • Set<String> noDup = new LinkedHashSet<String>(Arrays.asList(words));
  • String [] noDupWords = new String[noDup.size()];
  • noDupWords = noDup.toArray(noDupWords);
Next, we copy the words array into a Set so that duplicate words are removed. A LinkedHashSet keeps the words in the order in which they first appear.

We then create a new String array whose size equals the number of distinct words and copy the values of the set into it.
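To see the effect of the LinkedHashSet, the sketch below removes duplicates from a small word array and prints the result. The class name DistinctWordsSketch and the helper distinct are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class DistinctWordsSketch {

    // Hypothetical helper: returns the distinct words in first-occurrence order.
    static String[] distinct(String[] words) {
        Set<String> noDup = new LinkedHashSet<>(Arrays.asList(words));
        return noDup.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] words = {"we", "resolve", "to", "be", "brave",
                          "we", "resolve", "to", "be", "good"};
        System.out.println(Arrays.toString(distinct(words)));
        // prints [we, resolve, to, be, brave, good]
    }
}
```

A plain HashSet would also remove the duplicates, but it makes no guarantee about iteration order, so the words could come out in a different order.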

String retText = "";

Then we initialize an empty String variable to which the output will be appended.


Here, we compare each word in the distinct-words array against every word in the original array and count how many times it occurs.

Then we append each word and its count to retText, the String variable initialized earlier.
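The counting step described above can be sketched as a nested loop. This is a minimal sketch and not necessarily the exact loop used in the original program: the class name WordCountLoopSketch, the helper countWords, the loop variable names, and the use of a StringBuilder in place of the retText String are all assumptions. The word,count line format follows the output shown in this article.

```java
class WordCountLoopSketch {

    // Hypothetical helper: for each distinct word, scans the full word array,
    // counts the matches, and appends "word,count" on its own line.
    static String countWords(String[] words, String[] noDupWords) {
        StringBuilder retText = new StringBuilder();
        for (String distinctWord : noDupWords) {
            int count = 0;
            for (String word : words) {
                if (distinctWord.equals(word)) {
                    count++;
                }
            }
            retText.append(distinctWord).append(',').append(count).append('\n');
        }
        return retText.toString();
    }

    public static void main(String[] args) {
        String[] words = {"we", "resolve", "to", "be", "brave",
                          "we", "resolve", "to", "be", "good"};
        String[] noDupWords = {"we", "resolve", "to", "be", "brave", "good"};
        System.out.print(countWords(words, noDupWords));
    }
}
```

The nested loop scans the whole word array once per distinct word, which is fine for a sentence or a paragraph. For much larger inputs, a single pass that keeps the counts in a Map would avoid the repeated scans.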

Put a print statement after this loop to see the output.

we,3
resolve,3
to,4
be,2
brave,1
good,1
uphold,1
the,1
law,1
according,1
our,1
oath,1

This is the output the program produces.

The full implementation.
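Putting the steps together, one possible version of the complete program is sketched below. The class name WordCountSketch and the method groupWithCounts are assumptions, and the counting loop is reconstructed from the description in this article rather than copied from an original listing.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class WordCountSketch {

    // Hypothetical method: returns one "word,count" line per distinct word,
    // in the order in which the words first appear in the text.
    static String groupWithCounts(String text) {
        // Normalize: lowercase, turn non-word characters into spaces,
        // collapse runs of whitespace, then split into words.
        String textLower = text.toLowerCase();
        textLower = textLower.replaceAll("\\W", " ");
        textLower = textLower.replaceAll("\\s+", " ");
        String[] words = textLower.split("\\s+");

        // Remove duplicates while keeping first-occurrence order.
        Set<String> noDup = new LinkedHashSet<String>(Arrays.asList(words));
        String[] noDupWords = new String[noDup.size()];
        noDupWords = noDup.toArray(noDupWords);

        // Count each distinct word against the full word array.
        String retText = "";
        for (String distinctWord : noDupWords) {
            int count = 0;
            for (String word : words) {
                if (distinctWord.equals(word)) {
                    count++;
                }
            }
            retText += distinctWord + "," + count + "\n";
        }
        return retText;
    }

    public static void main(String[] args) {
        String text = "We resolve to be brave. We resolve,,  to be good. "
                + "We resolve to uphold the law according to our oath.";
        System.out.print(groupWithCounts(text));
    }
}
```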



I hope this article helps. Share your thoughts too.
