Grouping Words With Their Counts in Java

In this article, we will see how to split a String input (a sentence or a paragraph) into words and count the occurrences of each word.

Image by Author

The picture above defines our problem. Let’s see how to solve it in Java.

String text = "We resolve to be brave. We resolve,, to be good. We resolve to uphold the law according to our oath.";

This is the input paragraph.

String textLower = text.toLowerCase();
textLower = textLower.replaceAll("\\W", " ");
textLower = textLower.replaceAll("\\s+", " ");
String[] words = textLower.split("\\s+");

Here, we convert all letters to lowercase using toLowerCase(). Then we replace every character outside [a-zA-Z0-9_] with a space using replaceAll("\\W", " "). We could also use replaceAll("[^a-zA-Z0-9]", " "), which additionally treats the underscore as a separator.

Then we collapse each run of whitespace into a single space using replaceAll("\\s+", " "). This takes care of extra spaces and of consecutive non-word characters, such as the double comma in the input.

Now, we split the string on whitespace using split("\\s+") and store the pieces in an array called words.
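The normalization steps above can be wrapped in one small helper. The sketch below is only an illustration: the class name TokenizeSketch and the method tokenize are hypothetical, and the trim() call and the empty check are additions that the snippet above does not have. They are there because split("\\s+") returns an empty first element when the cleaned text starts with a space, which happens if the input begins with punctuation.

```java
import java.util.Arrays;

class TokenizeSketch {
    // Hypothetical helper (not part of the article's code): lowercases the
    // text, turns every non-word character into a space, and splits the
    // result on runs of whitespace.
    static String[] tokenize(String text) {
        String cleaned = text.toLowerCase().replaceAll("\\W", " ").trim();
        // An empty or punctuation-only input contains no words at all.
        if (cleaned.isEmpty()) {
            return new String[0];
        }
        return cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        // prints [we, resolve, to, be, good]
        System.out.println(Arrays.toString(tokenize("We resolve,, to be good.")));
    }
}
```

Because split("\\s+") already treats a run of spaces as one separator, the separate replaceAll("\\s+", " ") step is not needed in this variant.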

Set<String> noDup = new LinkedHashSet<String>(Arrays.asList(words));
String [] noDupWords = new String[noDup.size()];
noDupWords = noDup.toArray(noDupWords);

Next, we copy the words array into a Set so that duplicate words are removed. A LinkedHashSet also keeps the words in the order in which they first appear.

We then create a new String array whose size is the number of distinct words and copy all values of the set into that array.
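To see the de-duplication step in isolation, here is a minimal sketch. The class DistinctWordsSketch and the method distinct are hypothetical names used only for illustration, and the sample array is made up from the first words of the input.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class DistinctWordsSketch {
    // Hypothetical helper: returns the distinct words in first-seen order.
    static String[] distinct(String[] words) {
        // LinkedHashSet drops duplicates but remembers insertion order.
        Set<String> seen = new LinkedHashSet<>(Arrays.asList(words));
        // Passing an empty array lets toArray allocate one of the right size.
        return seen.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] words = {"we", "resolve", "to", "be", "brave", "we", "resolve"};
        // prints [we, resolve, to, be, brave]
        System.out.println(Arrays.toString(distinct(words)));
    }
}
```

A plain HashSet would remove the duplicates as well, but it gives no guarantee about iteration order, so the final output could list the words in a different order.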

String retText = "";

Then we initialize an empty String variable to which the output will be appended.

for (int i = 0; i < noDupWords.length; i++) {
    int count = 0;
    for (int j = 0; j < words.length; j++) {
        if (noDupWords[i].equals(words[j])) {
            count = count + 1;
        }
    }
    retText = retText + noDupWords[i] + "," + count + "\n";
}


Here, we compare each word in the array of distinct words against every word in the original array and count how many times it occurs.

Then we append the word and its count to retText, the String variable initialized earlier.

Put a print statement after this loop to see the output.

we,3
resolve,3
to,4
be,2
brave,1
good,1
uphold,1
the,1
law,1
according,1
our,1
oath,1

This is the output the program produces.
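The nested loops compare every distinct word with every word of the input, so the work grows with the product of the two array lengths. As an alternative, here is a single-pass sketch that counts with a map instead. It is not the approach used in this article: the class WordCountSketch and its methods are hypothetical names, and it swaps the Set and the inner loop for a LinkedHashMap, with a StringBuilder in place of repeated String concatenation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class WordCountSketch {
    // Hypothetical helper: counts each word in one pass over the array.
    // LinkedHashMap keeps the words in first-occurrence order.
    static Map<String, Integer> countWords(String[] words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : words) {
            // Store 1 for a new word, otherwise add 1 to the current count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Formats the counts as "word,count" lines, like retText in the article.
    static String format(Map<String, Integer> counts) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            out.append(entry.getKey()).append(',').append(entry.getValue()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] words = {"we", "resolve", "to", "be", "brave", "we", "resolve", "to", "be", "good"};
        System.out.print(format(countWords(words)));
    }
}
```

With this version, the separate array of distinct words is no longer needed, because the keys of the map play that role.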

Here is the full implementation.

import java.util.*;

public class WordExtraction {
    public static void main(String[] args) {
        String text = "We resolve to be brave. We resolve,, to be good. We resolve to uphold the law according to our oath.";

        // Normalize: lowercase, turn non-word characters into spaces, collapse whitespace.
        String textLower = text.toLowerCase();
        textLower = textLower.replaceAll("\\W", " ");
        textLower = textLower.replaceAll("\\s+", " ");
        System.out.println(textLower);
        String[] words = textLower.split("\\s+");

        // Distinct words, kept in first-occurrence order by LinkedHashSet.
        Set<String> noDup = new LinkedHashSet<String>(Arrays.asList(words));
        String[] noDupWords = new String[noDup.size()];
        noDupWords = noDup.toArray(noDupWords);
        String retText = "";

        // Count how often each distinct word appears in the full word array.
        for (int i = 0; i < noDupWords.length; i++) {
            int count = 0;
            for (int j = 0; j < words.length; j++) {
                if (noDupWords[i].equals(words[j])) {
                    count = count + 1;
                }
            }
            retText = retText + noDupWords[i] + "," + count + "\n";
        }
        System.out.println(retText);
    }
}
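For readers on Java 8 or later, the same grouping can also be written with streams. This is a sketch under assumptions, not the method used in this article: the class name StreamWordCountSketch and the method countWords are hypothetical, it splits directly on runs of non-word characters with split("\\W+"), and the counts come back as Long because that is what Collectors.counting() produces.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class StreamWordCountSketch {
    // Hypothetical helper: groups equal words and counts each group.
    static Map<String, Long> countWords(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                // A leading separator would yield one empty token; drop it.
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(
                        Function.identity(),   // group by the word itself
                        LinkedHashMap::new,    // keep first-occurrence order
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        String text = "We resolve to be brave. We resolve,, to be good. "
                + "We resolve to uphold the law according to our oath.";
        countWords(text).forEach((word, count) -> System.out.println(word + "," + count));
    }
}
```

The three-argument form of groupingBy is used so that a LinkedHashMap can be supplied; the one-argument form makes no promise about the order of the resulting map.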


I hope this article helps. Share your thoughts too.
