当前位置: 动力学知识库 > 问答 > 编程问答 >

sorting - How to implement a bag-of-words in java

问题描述:

Basically, I need a double-connected Map, which can retrieve value from key and the inverse, I have checked this link, BUT also should be sorted according to the values AND should take multiple values for a single key (I cannot guarantee that there won't be an exact freq for different keys).

So is there any structure with that criteria out there?

Below is the specific problem that imposed this need (maybe I have something missing while I was implementing this but if you know an answer to the above question then you can probably skip it):

I want to implement a bag-of-words method for some features. The idea is to keep only the top k bins with the greatest frequency of occurrence.

To make it more specific let's say I have a codebook

double[10000][d] codebook and a set of features double[][] features. For each line in features, which represent a feature I check the distance from each line in codebook and assign it to the bin having this line a centroid.

Then I increment the index of this bin by 1 until all features have been used.

Then, I would like to keep only the top k bins as results.

The part I am a bit stuck is the one for keeping only the top-k bins. I use a BoundedPriorityQueue<Feature> collection to achieve but I am not sure if there is some simpler approach.

public static BoundedPriorityQueue<Feature> boWquantizerLargerK(double[][] codebook, double[][] features, int featureLength, int maxNumWords) {

HashMap<Integer, Integer> boWMap = new HashMap<Integer, Integer>();

BoundedPriorityQueue<Feature> nn = new BoundedPriorityQueue<Feature>(new Feature(), maxNumWords);

for(int i = 0; i < features.length; i++) {

double[] distCodebook = new double[codebook.length];

for(int j = 0; j < codebook.length; j++) {

double[] dist = new double[featureLength];

for(int k = 0; k < featureLength; k++)

dist[k] = (codebook[j][k] - features[i][k])*(codebook[j][k] - features[i][k]);

distCodebook[j] = MathUtils.sum(dist);

}

Integer index = MathUtils.indexOfMin(distCodebook) + 1;

Integer freq;

if((freq = boWMap.get(index)) == null) {

boWMap.put(index, 1);

nn.offer(new Feature(1, index));

}

else {

boWMap.put(index, ++freq);

nn.offer(new Feature(freq, index));

}

}

return nn;

}

The Feature class is a simple implementation of Comparator:

public class Feature implements Comparator<Feature> {

private Integer freq;

private Integer word;

public Feature() {}

public Feature(Integer freq, Integer word) {

this.freq = freq;

this.word = word;}

public int compare(Feature o1, Feature o2) {

if ((o1).getFrequency() > (o2).getFrequency())

return -1;

else if ((o1).getFrequency() < (o2).getFrequency())

return 1;

else

return 0;}

public double getFrequency() {

return freq;}

}

To summarize the problem, I have a collection which has members pairs of values, the first representing the bin and the second the frequency. This collection is updated until all features have been processed and at which point I just want to keep the bins with the greatest values.

I am using a HashMap<Integer, Integer> structure for the collection and a BoundedPriorityQueue<Feature> for the top k bins.

分享给朋友:
您可能感兴趣的文章:
随机阅读: