Subversion Repositories wimsdev

Rev 6879 Rev 7690
Line 1... Line 1...
WIMS' search engine and als
===========================

WIMS' search engine works in two stages:

1) update of index files when server data is changed (module added...),
   typically once a day.
2) use of index files at each user's request to find some activities


Here are some details:

1) update of index files
===========================
A series of scripts creates a set of auxiliary files (generally
stored in ~/public_html/bases/sys/, see description further down) and
a list of "keywords" (stored in ~/public_html/bases/site/).

The scripts must be run in the order given here, as some files
created at earlier stages are used in subsequent stages. In general
the whole process is run by the script ~/bin/mkindex.

* Firstly a series of 3 perl scripts (mkdomain, mkwgrp, modindclass),
that ~/bin/mkindex calls via ~/public_html/bases/sys/mkindex.sh:

- the program ~/public_html/bases/sys/mkdomain.pl creates the lists
  of domains from the graph in domain/domain with its translations
  (domain/domain.$lang) and in json format (english) to be used for
  completion in modtool properties; it also creates the files
  domain/domaindic.xx to be used as a dictionary in modind and in the
  search engine

- the perl program ~/public_html/bases/sys/mkwgrp.pl reads the INDEX
  files of all the modules on the site and generates

  - keywords (in .json format) to be used for completion in the search
    engine
  - the files in wgrp

Line 53... Line 53...
  description, language, level, title (no ranking is done).

Be careful: to be used as a dictionary, a file must be sorted by the command
  bin/dicsort (for example for domaindic).

* Secondly the binary program "modind" (compiled from ~/src/Misc/modind.c) reads

  -- the INDEX files of all the modules on the site
  -- the auxiliary files in ~/public_html/bases/sys/ (see description
     below)

  and produces keyword lists stored in ~wims/public_html/bases/site:
  they contain the words (or word groups) coming from the variable
Line 77... Line 77...

  -- separately "modind" also reads the files in
  ~/public_html/bases/sys/sheet and does the same type of work.


2) use of index files
===========================
The script ~/public_html/modules/home/search.proc (called by the
"Search" form) reads the lists above, does the actual search in these
lists and displays the modules found. It also reads the files of
~/public_html/bases/sys/class and ~/public_html/bases/sys/sheets
Line 107... Line 107...
If any of the files described below is omitted, then the corresponding
feature in the corresponding language is disabled.

  In version < 4.05c, if there is no file words.$lang, the file
  suffix.$lang was not used (correction in Misc/translator.c to check
  in other situations).
  The group words were badly treated when the words were already in
  the title, properties, etc. because of
  some option unknown_type=unk_delete in modind.c but it has other consequences
  so it is not the situation.

, will be done by the script in the stable release if we are OK)
Line 127... Line 127...

Files
=====

words.xx : correct misprints in the search words
(used both by "mkindex" and "search.proc").

E.g. if the file words.en contains the line

==
analytical:analytic
==
Line 150... Line 150...
~/public_html/modules/tool/wcalc.en/dic )

=====================

suffix.xx : process common suffixes in the search words
(used both by "mkindex" and "search.proc").

E.g. if the file suffix.en contains the line

==
ertem:meter
Line 186... Line 186...
would return both the modules containing the word "affine" and the
modules containing the word "geometry").

The "wgrp" files are now generated from the modules' keywords by the
script ~/public_html/bases/sys/mkwgrp.pl: whenever a module contains
multi-word keywords, such keywords are added to the wgrp files.

E.g. tool/algebra/smallgroup.fr/INDEX contains the line

keywords=group, finite group, order, subgroup, conjugacy class, center, normal subgroup, subgroup lattice

so for each of the groups of words between two commas the
corresponding groups of words are created
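
A minimal Python sketch of the splitting described above, assuming a
"keywords=" line like the one in the example. The function name
multiword_keywords is hypothetical and this is not the code of mkwgrp.pl;
it only illustrates how the comma-separated groups that contain more than
one word could be picked out.

```python
def multiword_keywords(index_line):
    """Return the comma-separated keywords that contain more than one word.

    Hypothetical helper: the real mkwgrp.pl may normalize or filter differently.
    """
    name, _, value = index_line.partition("=")
    if name.strip() != "keywords":
        return []
    # Collapse runs of whitespace inside each comma-separated group.
    groups = [" ".join(group.split()) for group in value.split(",")]
    # Keep only the groups made of several words.
    return [group for group in groups if " " in group]


line = ("keywords=group, finite group, order, subgroup, conjugacy class, "
        "center, normal subgroup, subgroup lattice")
groups = multiword_keywords(line)
print(groups)
```

Single-word keywords such as "group" or "order" are left out, since only
multi-word keywords are said to go into the wgrp files.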
Line 209... Line 209...

=====================

domaindic.xx

built from the files domain/domain.xx; used to replace a domain name
  written in language xx by its English (technical) name.

=====================

indignore.xx : ignored words
(used by "mkindex")

All the words listed in the file are ignored by the search engine.

=====================

abuse.xx : swearwords to be ignored by the search engine
(used by ??)

=====================

andor.xx : conjunctions ("and", "or") to be ignored by the
search engine

The file andor.xx is mentioned in src/insmath.c (processing logic
statements in math formulas) but for the moment no module uses
this (to be used, one must have insmath_logic=yes which does not
Line 252... Line 252...

As this is an exercise module it is indexed in the lists A.$lang (All)
and X.$lang (eXercise).

This is a multilanguage module (main language "en", translation
language "it").

The index file contains the following (nonempty) lines

  title=Vector shoot
  description=click on a linear combination of 2D vectors.
Line 288... Line 288...
translated module (1002 resp 2004). --> HELP there is no A.cn file!!

The file A.en contains the following lines related to this module.

?2 or ?4 is the ranking
It is a weight -- see name of variable in modind.c --
giving more importance to the title words: 4 if the word appears
in the module title, 2 otherwise

2d:1003?2                           from description and description_it
algebra:1003?2                      from domain
bersaglio:1003?2                    from keywords_it
Line 301... Line 301...
combination:1003?2                  from description (_not_ from keywords)
combinazione:1003?2                 from description_it
combinazione lineare:1003?2         from keywords + wgrp.en
gang:1003?2                         from author
levelh4:1003?2                      from level=h4 (and so on)
levelh5:1003?2
levelh6:1003?2
levelu1:1003?2
levelu2:1003?2
linear:1003?2                       from description
linear algebra:1003?2               from keywords
linear combination:1003?2           from keywords
lineare:1003?2                      from description_it
shoot:1003?4                        from title
vector:1003?4                       from title + description
                                    (vectors --> vector because of
                                    directive "sr:r" in suffix.en)
vettore:1003?2                      from keywords_it
xiao:1003?2                         from author
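
The entries listed above have the shape word:serial?weight. A small Python
sketch of how such a line could be taken apart, assuming exactly one
serial number and one weight per line as in this listing (the real files
may differ); the function name parse_entry is hypothetical.

```python
def parse_entry(line):
    """Split 'word:serial?weight' into (word, serial, weight).

    Assumes the layout shown in the A.en listing; not taken from modind.c.
    """
    # The word may contain spaces ("linear algebra"), so split on the last colon.
    word, _, rest = line.strip().rpartition(":")
    serial, _, weight = rest.partition("?")
    return word, int(serial), int(weight)


sample = ["shoot:1003?4", "linear algebra:1003?2", "2d:1003?2"]
entries = [parse_entry(line) for line in sample]

# Words with weight 4 come from the module title.
title_words = [word for word, _, weight in entries if weight == 4]
print(entries)
print(title_words)
```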

The file A.it contains the following lines related to this module.
Line 341... Line 341...
linear algebra:1003?2
linear combination:1003?2
lineare:1003?2
shoot:1003?4
vector:1003?4
vectors:1003?2                          no counterpart in A.en because
                                        of the directive in suffix.en
vettore:1003?2
xiao:1003?2

NOTE: title_it is missing from the index: you cannot find the module
Line 354... Line 354...
The file A.$lang for languages different from the above also contains
lines related to this module.

E.g. A.nl

2d:
algebraisch:                    directive "algebra:algebraisch" in words.nl
bersaglio:
clicking:                       directive "click:clicking" in words.nl
combinaison:                    directive "combination:combinaison" in words.nl
combinazione:
combinazione lineare:
gang:
Line 430... Line 430...
  (so only they are in the completion list)
  - modind.c creates files A.$lang etc which are based on words of keywords,
  title, description. Not all of them are in the "completion list"
  but they can be typed and found by the search engine.


Technical things about modind.c (ER. just to avoid forgetting work in progress)
===============================

The tasks done are, in order:

- prep() : * replaces if possible the default language list (defined at top of file)
             by the list of languages installed on the server.
           * gets the list of all modules prepared by a previous script
           * opens files bases/site2/author|description|language|...
Line 450... Line 450...

- sprep(),sheets() : the same for sheets.



Extracting information from one module for a given language (function onemodule):

- writes author, description, language, etc. information in each corresponding file
  bases/site2/author|description|language|...

- normalizes data (removes uppercase, accents, apostrophes, plurals)
  according to dictionary domaindic, then maindic with suffix, to get normalized
  author, description, title, etc.
  This is done in the loop for(i=0;i<trcnt;i++){...}

- transforms the (normalized) title into words (changes commas to spaces)
  and for each word, appends it with weight 4 using function appenditem.
  The arguments are the word itself, the current language treated, the serial number of the module,
  the weight=4, and the module language.

- puts every piece of information other than the title (description, keywords, foreign titles, author...)
  in a buffer, transforms it into words and appends these as above except that weight=2.

- the 2 preceding points (treatment of title and other info) are repeated with the difference
  that the transformation into words is replaced by a translation:
  the commas are kept, but some usual words are deleted.
  BUG ? : Another difference is that part of "other information than title" is missing,
          for instance the foreign titles, require, author.
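
The weighting in the title and "other information" steps above can be
sketched in Python as follows. This is an illustration only, not a port of
modind.c: the name index_entries is hypothetical, the input is assumed to
be already normalized, ignored words are not filtered out, and it assumes
that a word found both in the title and elsewhere keeps the title weight
(as "vector:1003?4" does in the A.en example).

```python
TITLE_WEIGHT = 4   # weight of words taken from the title
OTHER_WEIGHT = 2   # weight of words taken from any other field


def index_entries(serial, title, others):
    """Build 'word:serial?weight' lines for one module (hypothetical helper)."""
    weights = {}
    # Title: commas become spaces, then every word gets weight 4.
    for word in title.replace(",", " ").split():
        weights[word] = TITLE_WEIGHT
    # Other fields: same splitting, weight 2 unless the word is already a title word.
    for text in others:
        for word in text.replace(",", " ").split():
            weights.setdefault(word, OTHER_WEIGHT)
    return [f"{word}:{serial}?{weights[word]}" for word in sorted(weights)]


lines = index_entries(1003, "vector shoot", ["linear combination, 2d vector"])
print("\n".join(lines))
```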

ER : I don't know why the process is repeated : should look at appenditem
to see where it is appended, maybe the second time is somewhere else.


===============================