This Bugzilla instance is a read-only archive of historic NetBeans bug reports. To report a bug in NetBeans please follow the project's instructions for reporting issues.

Bug 66160 - [50cat] poor search capabilities
Summary: [50cat] poor search capabilities
Status: RESOLVED INVALID
Alias: None
Product: obsolete
Classification: Unclassified
Component: collabnet
Version: 5.x
Hardware: PC Windows XP
Importance: P3 blocker with 1 vote
Assignee: support
URL: http://www.netbeans.org/servlets/Read...
Keywords:
Depends on:
Blocks:
 
Reported: 2005-10-06 19:10 UTC by rrochat
Modified: 2009-11-08 02:35 UTC

See Also:
Issue Type: DEFECT
Exception Reporter:


Description rrochat 2005-10-06 19:10:38 UTC
When I search for "genfile.properties" from the "Entire netbeans.org domain"
with everything selected, all I get are the result of my recent posting to
[50cat] asking about it.  Jirka came up with a couple of nbusers messages and
http://projects.netbeans.org/buildsys/design.html#vcs-deps which had exactly the
information I wanted. 
Why weren't these listed in the search results?  

API Javadocs, other project documents, FAQs -- all of these seem to be
frequently omitted from the results.  With such poor luck from the NetBeans
search engine, you're not encouraged to use it unless all other options
have failed, whereas it should be the first place to look.
Comment 1 jcatchpoole 2005-10-21 18:37:27 UTC
My guess is this is a dup of issue 13172 - meta and non-ascii chars cause
searches to fail.  Note the age of the issue :-(  We've long had problems with
the built in search, and for a long time simply did not use it, and used Google
instead.  We recently (~6 months ago) started using it again as it was supposedly much
enhanced, and does now search multiple data sets (lists, IZ, html content ...
etc).  That is good, but clearly there are still problems.

Can you try your search again, without the ".", and check if you turn up the
extra results you missed first time around ?
Comment 2 rrochat 2005-10-21 20:43:43 UTC
Searched for "genfile" and got
"No matching search results found."
Comment 3 jcatchpoole 2005-10-24 10:36:16 UTC
I can reproduce.

Collab, the word "genfile" appears at this URL 
http://projects.netbeans.org/buildsys/design.html#vcs-deps, and yet going to
http://www.netbeans.org/servlets/Search?mode=advanced, selecting "entire nb
domain", and selecting all "artifact" options (or leaving them all unselected)
does not return that match.  Why ?

rrochat, you mention there were other msgs where this string appears - can you
include URLs here?
Comment 4 rrochat 2005-10-24 17:29:18 UTC
These were the ones Jirka cited that did not show up for me:
http://www.netbeans.org/servlets/ReadMsg?list=nbusers&msgId=973678 or
http://www.netbeans.org/servlets/ReadMsg?list=nbusers&msgId=833959

The first one is in the list I pulled up just now.  When I tried to check
the 2nd one, I got two "www.netbeans.org could not be found".
I know it's just temporary, but oh well.  I'll cross my fingers
that I can submit this.
Roxie
Comment 5 lpintilie 2006-03-22 12:51:26 UTC
The issue seems to be getting old: today, March 22nd 2006 I've searched the
"nbuser" mailing list archive and no matter what I've searched, there were no
results from the current year. This is in great contrast with SUN's touted
attitude of "promoting communities": a significant part of the mailing list
traffic is effectively kept closed, contrary to what one expects from an open
mailing list.

I understand that the infrastructure is managed by Collab, but as their customer
SUN has the means of getting this problem fixed. I hope it will be fixed soon...
Comment 6 lpintilie 2006-03-22 12:54:48 UTC
One way of addressing the search problem is to use a Google search appliance.
Comment 7 jcatchpoole 2006-03-22 16:15:15 UTC
Ping ... ?  This issue is over 5mths old and still no response from support. 
This particular case is a pretty minor issue but the bigger picture (of search
being quite badly broken) is not.

BTW lpintilie I could not reproduce what you saw - I was able to get plenty of
results back, eg even turning up ~30 or more msgs matching "lpintilie".  There
are many problems with search (note particularly the comments in this issue
about meta-chars), but it is not entirely broken.  What kind of terms did you
search for ?
Comment 8 Unknown 2006-03-22 18:43:32 UTC
Hi Jack
           Taking up the issue right away. I shall research the issue a bit
more and provide you a detailed update as soon as possible.

Thanks,
Karthik
Support Operations
Comment 9 Unknown 2006-03-22 20:05:45 UTC
Updating status whiteboard.

Regards,
Karthik
Support Operations
Comment 10 Unknown 2006-03-22 21:14:20 UTC
As much as we would like to see this search feature improved, it is still under 
consideration for future releases. All mailing-list search features (and other 
related issues) are going to be completely re-architected in the planned 
Discussion Services.

The (.) in the search keyword is obviously an issue (as was noted in other 
issues as well).

I will check internally if some "short-term" workarounds are possible to make 
the search feature work a little better, till such time the new Discussion 
Services are introduced.
Comment 11 Unknown 2006-03-22 22:58:44 UTC
Updating status whiteboard.

Regards,
Karthik
Support Operations

Comment 12 jcatchpoole 2006-03-23 11:07:22 UTC
"Short term" !?  The meta-char issue (issue 13172) was reported almost *5 years*
ago!  A working search tool is surely one of the most important things a web app
should have.
Comment 13 Unknown 2006-03-28 23:20:48 UTC
Hi rrochat
          Searching for keyword "genfiles.properties" returns the expected HTML
document
ie. http://projects.netbeans.org/buildsys/design.html now. Please note that
Lucene treats "genfiles.properties" as one word and so searching for "genfile"
or "genfiles" won't return expected results. Again, note its 

http://www.netbeans.org/servlets/Search?mode=advanced&resultsPerPage=40&query=%22genfiles.properties%22&scope=domain&artifact=apache+content&Button=Search


Regards,
Karthik
Support Operations
Comment 14 jcatchpoole 2006-03-29 15:55:20 UTC
1. Thanks, yes I see the results now;

2. "Please note that Lucene treats "genfiles.properties" as one word and so
searching for "genfile" or "genfiles" won't return expected results.".

I guess Lucene is the search utility ?  Anyway, that's a bug (treating
genfiles.properties as 1 word).  I verified that searching for "genfiles" does
not return the expected results.

Is this at least fixed in SC 3.5 ?
Comment 15 rrochat 2006-03-29 16:49:07 UTC
I still have some issues with this:
1) The default search doesn't generate the same results that your query does
when I go to the Community page and type in "genfiles.properties".  
Your link has different parameters.  Perhaps "mode" is the key (if so, why does
it have to be "advanced"?) or are you dealing with a version that's not live on
the website for me?

2) Even with your query, I can't get it to find the nbusers messages that Jirka
found.  If I select mailing list archives, it shows 5 (why not more if it's the
only thing selected?). If I click on "Browse all results," I see 0 results --
the query is apparently blank.

Roxie
 
Comment 16 jcatchpoole 2006-03-29 17:25:23 UTC
Roxie, I'm guessing your comments are addressed to Karthik ?  Some answers :

1. Going to community and doing the search will search all "artefacts" on nb.org
- that includes eg html, issues, mailing list msgs, etc.  The URL Karthik
includes here is the URL to show all (instead of just the first 5) html results.
 You will get the same results if you click "Browse all results from HTML
content" from the normal community search.

2. I can't verify this - I get 5 msgs (as you mention, I'd rather see all of
them since I selected only mailing list msgs), but if I click show all I do get
a paginated list of 62 msgs.  Strangely I did first get an entirely blank page,
but I hit reload and it worked.  Can you try the same again ?
Comment 17 rrochat 2006-03-29 18:02:03 UTC
Yes, thanks, Jack, my message was in response to Karthik's.
I had indeed selected just "HTML content" for my query and had tried hitting
reload.  I'd also tried it with and without the double quotes.

When I try it now, it does work.  
I have no explanation.  I just tried "genfile.properties" (without the "s") and
get different results than "genfiles.properties" but they're not the same as I
was seeing before.  Oh well, as long as it's working now, I guess it doesn't matter.

Are we indeed searching javadocs and API documents (category "HTML Content"?)?  
I can't get that to work.  These are critical.
i.e. Module API
http://www.netbeans.org/download/dev/javadoc/org-openide-modules/org/openide/modules/doc-files/api.html
has things like "OpenIDE-Module-Specification-Version" and
"OpenIDE-Module-Requires-Message" that never come up from a search (even when I
select all categories).
A simpler test case: "myWidgetsMode" is in the Windows API document:
http://www.netbeans.org/download/dev/javadoc/org-openide-windows/org/openide/windows/doc-files/api.html




Comment 18 jcatchpoole 2006-04-03 14:03:46 UTC
Sorry I should clarify my comment :

> I guess Lucene is the search utility ?  Anyway, that's a bug (treating
> genfiles.properties as 1 word).  I verified that searching for "genfiles" does
> not return the expected results.

Maybe it's a Lucene internal thing, but whether "genfiles.properties" is one word
or not doesn't matter; what matters is that searching for partial words should
match, eg searching for "genfile" (or "gen" or "file" etc) should match.
Comment 19 Unknown 2006-04-04 01:32:44 UTC
Hi Jack
            I will convey your update to my engineers and will work on it. I will
get back to you asap on this.

Regards,
Karthik
Support Operations
Comment 20 _ mihmax 2006-04-06 16:57:26 UTC
Why not use Google?
Comment 21 Unknown 2006-04-28 08:22:44 UTC
Hi Jack
            If users want to search using partial words, they have to do a
wildcard search. To learn more about Lucene wildcard search, please refer to:
http://lucene.apache.org/java/docs/queryparsersyntax.html
http://www.netbeans.org/scdocs/Search

In this case, if a user wants to search for documents containing
"genfiles.properties" using a partial word, say "genfile" or "gen", then the
query string has to use the "*" wildcard character, i.e. the search string
should be genfile* or gen*

http://www.netbeans.org/servlets/Search?mode=advanced&resultsPerPage=40&query=genfile*&scope=domain&artifact=apache+content&Button=Search
http://www.netbeans.org/servlets/Search?mode=advanced&resultsPerPage=40&query=gen*&scope=domain&artifact=apache+content&Button=Search

Hope this helps. Let me know if you need more info.



FYI pasting here snip from doc: http://www.netbeans.org/scdocs/Search


Search does "stemming". If you enter the search string 'dip', you will be
returned pages that contain the word 'dip' but also pages that contain 'dipping'
and 'dips' since 'dip' is the word-stem of 'dipping' and 'dips'. Search will not
return pages that contain 'diphthong'.


Regards,
Karthik
Support Operations
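
The whole-token versus wildcard behaviour described in this comment can be illustrated with a minimal Python sketch. It is not the search engine's actual code: the tiny index, the document names, and the search() helper are hypothetical, and Python's fnmatch stands in for the wildcard syntax (it also treats "?" and "[" as special characters). It only assumes, as stated earlier in this issue, that "genfiles.properties" is stored as a single token.

```python
from fnmatch import fnmatchcase

# Hypothetical index: each document is already split into whole tokens,
# with "genfiles.properties" kept as one token as described in this issue.
INDEX = {
    "design.html": ["see", "genfiles.properties", "for", "details"],
    "other.html": ["general", "build", "notes"],
}


def search(query):
    """Return the names of documents with a token matching the query.

    A query without wildcards must equal a whole token; "*" matches any
    run of characters, so "genfile*" matches "genfiles.properties".
    """
    pattern = query.lower()
    return sorted(
        name
        for name, tokens in INDEX.items()
        if any(fnmatchcase(token, pattern) for token in tokens)
    )


if __name__ == "__main__":
    for q in ("genfile", "genfile*", "gen*", "genfiles.properties"):
        print(q, "->", search(q))
```

Under these assumptions a bare "genfile" finds nothing, while "genfile*" finds the document containing "genfiles.properties", which is the behaviour reported above.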
Comment 22 Unknown 2006-05-03 04:17:38 UTC
Hi Jack
          Please let me know if you need any more help with this.

Thanks,
Karthik
Support Operations
Comment 23 rrochat 2006-05-03 05:22:30 UTC
1. If you're going to use special wild search rules, they need to be linked to
from the NetBeans "Search" page,
i.e. 
http://platform.netbeans.org/servlets/Search?scope=domain&resultsPerPage=40&query=&Button.x=27&Button.y=7

2. I'm more concerned with what's being searched, or not, as the case might be.
Why do javadocs and API documents not show up in search results?  I think these
are critical and somehow need to be weighted (if possible) so they show up first.
See my March 29 entry for examples.

Maybe google is the answer, but I'd hope you could do better with an intelligent
search engine that understands what's important to NetBeans users.

Comment 24 Unknown 2006-05-12 11:53:07 UTC
Hi 
   Thanks for the update given. I shall work on this and give a suitable update
in a couple of days.

Thanks,
Karthik
Support Operations
Comment 25 Unknown 2006-06-06 21:13:06 UTC
An update to this issue. CEE has robots inclusion starting Danube-S.

Looking at: http://www.netbeans.org/robots.txt you can see the reason why files
under the /download folder have not been indexed. Please note that even CEE's local 
indexers respect the robots.txt file.

Referring the files below notice that these are under 
location "upload/download" and no indexer plugin would look in to it. 

http://www.netbeans.org/download/dev/javadoc/org-openide-modules/org/openide/modules/doc-files/api.html
http://www.netbeans.org/download/dev/javadoc/org-openide-windows/org/openide/windows/doc-files/api.html

Is this "download/upload" area intended to be indexed?
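
How an indexer that respects robots.txt would skip such files can be sketched with Python's standard urllib.robotparser. The rules below are hypothetical and are NOT the actual contents of http://www.netbeans.org/robots.txt; the agent name "example-indexer" and the may_index() helper are likewise assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not the real netbeans.org file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /download/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_index(url, agent="example-indexer"):
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.netbeans.org/download/dev/javadoc/api.html",
        "http://www.netbeans.org/community/index.html",
    ):
        print(url, "->", "index" if may_index(url) else "skip")
```

With a Disallow rule like the one assumed here, everything under /download/ is skipped, which would explain Javadoc pages never appearing in results.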
Comment 26 jcatchpoole 2006-06-07 14:31:19 UTC
> Looking at: http://www.netbeans.org/robots.txt you can see the reason why 
> files under /download folder has not been indexed. 

Why ?  You just mentioned that SC only supports robots.txt as of Monday June 5th
for us (the date we upgraded to Danube-S).  So going forward sounds like they
won't be indexed (I'll update robots.txt to address that).  Why weren't they before ?

> Please note that even CEE's local indexers respect robots.txt file.

As of Danube-S, right ?  Also see issue 22183.

> Referring the files below notice that these are under 
> location "upload/download" and no indexer plugin would look in to it. 

Why not ?  Sorry Ani, I don't understand your response here.
Comment 27 jcatchpoole 2006-06-13 18:39:15 UTC
> Please note that even CEE's local indexers respect robots.txt file.

Actually that appears to not be true at all.  Try searching for "JAM", 2 of the
first 5 results displayed for HTML content are on testwww.
Comment 28 Unknown 2006-07-21 01:01:56 UTC
There were some limitations around the robots.txt file, prior to the current 
release. That being said we will try to verify this issue again.
Comment 29 Unknown 2006-09-11 11:13:40 UTC
Jack,

> Referring the files below notice that these are under 
> location "upload/download" and no indexer plugin would look in to it. 

The files are in "$data_dir/home/upload", which seems to be a data location 
specific to NB, and that's the reason why Ani said that the indexer would not 
look into it. 

>Actually that appears to not be true at all.  Try searching for "JAM", 2 of 
>the first 5 results displayed for HTML content are on testwww.

I just tried searching for "JAM" and I got results, but nothing is from 
the "testwww" project as you mentioned. Please let me know if I am missing 
something here. 

-Priya 

Comment 30 Unknown 2006-09-26 06:40:43 UTC
Jack?
Comment 31 Unknown 2006-10-06 06:47:28 UTC
Let me summarise the issue again and the status.

JAM keyword search issue:
------------------------
 As we have changed the robots settings now, nothing from 'testwww' will be 
indexed, so it will not be searchable either. 

Javadoc search not working:
---------------------------

It will not work, as those docs are in "$data_dir/home/upload", which seems to 
be a data location specific to NB, and that's the reason why the indexer would 
not look into it. Were there any instructions given to Collab on indexing 
Javadocs in the upload folder as well, since it is in the location specific to 
NB? Jack, I think we need to work with Shilpa on this if this is a priority 
for you.

"genfiles.properties" not searchable as "genfile":
--------------------------------------------------

As explained above, Lucene does not treat this as a separate string. It can be 
found using the wildcard, as Karthik mentioned above. The following 
explanation from the engineer should help. 

<Snip>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Using the specific text: BUGGEN.***MDT.20030327.01.A as an example I'll give a
brief breakdown of the tokenizing process.

1. "*" are not recognized as part of a token and are therefore treated as
separators so the text is broken into:
a. BUGGEN
b. MDT.20030327.01.A

2. The second token is recognized as a potential serial number and so it is not
broken down further.  Punctuation including {"_"|"-"|"/"|"."|","} sandwiched
between alphanumeric characters are flagged as serial numbers.  It is possible,
if the separator is a ".", that it could be treated as a hostname.  But
regardless of which it is, both are treated as a single token.

Now as to whether this behavior is a bug. It is very difficult to determine the
correct behavior for a generalized text search when the contents are not
strictly defined as words separated by spaces.  Google does very little 
guessing or trying to determine meaning.  If I enter "1912.09.15" into google 
I get about 25 answers, all of which have that exact string in the text.  If I 
enter the same into Yahoo, I get results which seem to indicate, Yahoo treated 
it as a date as it returns results which have 09-15-2006 and other 
variations.  It also returns results where it looks like "2006" might be taken 
in isolation from the "09" and "15" parts.  Yahoo returns many more results, 
about 6 million.

I would say lucene's behavior is closer to Google.  But it does try to apply
some standard grammar/punctuation rules to determine tokens, possibly more than
Google does.  But it would appear both do far less than what Yahoo does.

Now we can alter Lucene's behavior to treat "." as a token delimiter.  But it
will result in the variation of behavior as shown by the Google and Yahoo
results above.  A user entering the exact text "BUGGEN.***MDT.20030327.01.A" is
going to get all artifacts which contain "BUGGEN", but the "MDT.20030327.01.A"
will give a precise match.  If it was broken up all artifacts which contain
"MDT" or "01" as part of this field will also be returned.  "A" will be ignored
as a common word.

If we change the behavior I could see some users as being disappointed that 
they now cannot get the correct artifact by entering the serial number 
contained within it, whereas before they could.

I don't think one behavior or the other can be classified as a defect.  They 
are simply alternatives.
</snip>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
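
The two alternatives the engineer describes can be compared with a small Python sketch. This is a simplification written for illustration, not Lucene's actual tokenizer grammar: the regular expressions and function names are assumptions. The first tokenizer keeps punctuation from the set _ - / . , inside a token when it is sandwiched between alphanumeric characters; the second treats all punctuation as a delimiter.

```python
import re

# Assumed simplification: alphanumeric runs joined by a single "_", "-",
# "/", "." or "," stay together as one token (serial number or hostname).
SANDWICH_TOKEN = re.compile(r"[A-Za-z0-9]+(?:[_\-/.,][A-Za-z0-9]+)*")

# Alternative behaviour: every non-alphanumeric character is a delimiter.
PLAIN_TOKEN = re.compile(r"[A-Za-z0-9]+")


def tokenize_keep_serials(text):
    """Split text, keeping sandwiched punctuation inside a token."""
    return SANDWICH_TOKEN.findall(text)


def tokenize_split_all(text):
    """Split text on every non-alphanumeric character."""
    return PLAIN_TOKEN.findall(text)


if __name__ == "__main__":
    sample = "BUGGEN.***MDT.20030327.01.A"
    print(tokenize_keep_serials(sample))
    print(tokenize_split_all(sample))
    print(tokenize_keep_serials("genfiles.properties"))
```

Under the first rule the sample text yields the two tokens given in the explanation above, and "genfiles.properties" stays whole, so only an exact or wildcard query can reach it. Under the second rule "genfiles" becomes searchable on its own, at the cost of exact serial-number matches.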
Comment 32 Unknown 2006-10-17 09:46:55 UTC
Jack? Could you please review the above update. 
Comment 33 Unknown 2006-11-06 08:55:01 UTC
Jack, Any updates on this or can we close this issue out as the recent changes 
have fixed these search issues?
 
Comment 34 Unknown 2006-11-21 09:08:04 UTC
Closing this as fixed. Please reopen if you have any more questions. 

-Priya
Comment 35 Unknown 2007-04-03 15:28:14 UTC
closing..
Comment 36 Marian Mirilovic 2009-11-08 02:35:06 UTC
We recently moved out from Collabnet's infrastructure